by Philip Howard of Bloor Research
The figures bandied about when you discuss Big Data are as often expressed in exabytes or zettabytes as they are in petabytes. Big numbers. Terabytes rarely get a look in.
What is driving this growth in data? Three things: increasing Web-based activity, the ubiquity of mobile devices of various sorts, and the rapid adoption of sensor-based information, which spans everything from smart meters to RFID. However, just as it’s getting increasingly difficult to wade through an ever-widening sea of raw information, calls for real-time business analytics are on the rise.
So, how do you define Big Data? How do you store it? How do you manage and analyze it both efficiently and cost effectively?
These questions have commanded the attention of big industry players and start-ups alike, spawning new tools and new open source projects. Columnar databases, Hadoop and MapReduce, and the myriad NoSQL databases are some notable examples.
But is "Big Data," as a category, actually too broad? First of all, “Big” can mean different things for different organizations. For one company, big is 10 terabytes, while for another it may be 100 petabytes. Second, not all data is created equal. So beyond size, understanding the kind of data you are dealing with is equally important so that the right technology (or combination of technologies) can be matched to the analytics challenge.
Take telecommunications as an example. Such companies need to be able to perform network analysis based on the call detail records (CDRs) that they capture. On the other hand, those same organizations may want to conduct sentiment analysis to understand how they are regarded by their customers.
These two scenarios require very different approaches: CDRs consist of what is essentially structured data and are therefore well suited to relational databases, such as columnar databases, that excel at analytics. Conversely, data derived from social media is unstructured, and new approaches such as Hadoop and MapReduce are required to glean information from it.
The key distinction is between structured and unstructured data. Examples of the former include all types of logs (Web, network, and so forth), CDRs, sensor output, stock ticker data, online gaming data, etc. This data (sometimes called machine generated or interactional data) is structured in a similar way to transactional data and therefore it can be handled in the same way -- given the proviso that volumes are large and growing very rapidly.
So, in the interest of zeroing in on a specific data challenge and outlining some tangible and practical solutions, what follows are three best practices for analyzing machine generated data:
1) Look beyond hardware - In the face of the sort of data growth we are talking about, continuing to add more servers and more disk storage subsystems is simply not sustainable. At a certain point, the traditional, hardware-centric approach will result in a massive infrastructure footprint that’s extremely costly to scale, house, power and maintain.
Of course, you could host all this in the cloud but that’s only kicking the can down the road. More realistically, you need to look at more effective alternatives. Using a column-based approach is one such innovation that has come to the fore over the last few years.
As the name implies, databases in this category store data column-by-column rather than row-by-row. Since most analytic queries only involve a subset of the columns in a table, a columnar database has to retrieve much less data to answer a query than a row-based database, which must retrieve all the columns for each row. As a result, columnar databases, from vendors such as Infobright, Vertica (HP), Sybase (SAP) and ParAccel are becoming more and more common in data warehousing and analytics environments.
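The difference in data retrieved can be illustrated with a minimal Python sketch. It is a toy model, not how any of the vendors named above implement storage: the table, the field names and the idea of counting "values read" as a stand-in for I/O are all assumptions made for illustration.

```python
# Toy call detail records; the field names are illustrative assumptions.
rows = [
    {"caller": "A", "callee": "B", "duration_s": 120, "cell_id": 7},
    {"caller": "B", "callee": "C", "duration_s": 45, "cell_id": 7},
    {"caller": "A", "callee": "C", "duration_s": 300, "cell_id": 9},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def sum_row_store(records, column):
    """Row layout: every field of every row is read to reach one column."""
    total = 0
    values_read = 0
    for record in records:
        values_read += len(record)  # the whole row comes off disk
        total += record[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the one column the query needs is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "duration_s")
col_total, col_reads = sum_column_store(columns, "duration_s")
print(row_total, row_reads)  # 465 12
print(col_total, col_reads)  # 465 3
```

Both layouts return the same answer, but the row layout touches all four fields of each record while the column layout touches one. The gap widens as tables gain columns that a given query never uses.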
In addition, columnar databases provide data compression. This combination of reduced I/O and data compression has several benefits, including faster query response and the need for less storage hardware, which translates into lower costs. There are technologies that can achieve data compression rates of 3:1 or 4:1 all the way up to 10:1, 20:1, and even 30:1, depending on the type of data to be compressed.
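To see why the achievable ratio depends on the type of data, consider run-length encoding, one common technique for compressing a column. The sketch below is an assumption-laden illustration: the article does not say which compression schemes any particular product uses, and the ratio here counts stored values rather than bytes.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    decoded = []
    for value, length in runs:
        decoded.extend([value] * length)
    return decoded


def compression_ratio(values):
    """Original value count divided by stored count (a value plus a length per run)."""
    runs = rle_encode(values)
    return len(values) / (2 * len(runs))


# A low-cardinality, clustered column compresses very well...
status_column = ["OK"] * 90 + ["FAIL"] * 10
# ...while a column of unique values does not compress at all with this scheme.
id_column = list(range(100))

print(compression_ratio(status_column))  # 25.0
print(compression_ratio(id_column))      # 0.5
```

Storing values column by column tends to place similar values next to each other, which is what gives schemes like this something to work with. A column of unique identifiers actually grows under run-length encoding, which is why real systems choose an encoding per column.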
2) Don’t constrain what business users can do - Real investigative intelligence requires databases that support complex and dynamic queries, and fast, ad-hoc analytics.
Traditional databases require database administrators to create and maintain indexes, partition data, or create cubes or projections to achieve fast query performance -- all based on understanding what queries and reports users want to run. This need for the pre-tuning of specific queries goes against the very nature of investigative analysis, which is, by definition, not pre-defined.
Indeed, even the way that data is partitioned (or sharded) constrains query performance. If you store data by shop or region, say, then that will suit queries that are predicated on that basis but will militate against good performance for any other sorts of queries you might want to run. So a database that doesn’t require such constructs should, other things being equal, be more flexible and perform better.
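The effect of the partitioning choice can be sketched in a few lines of Python. The sales records, the `partition_by` and `query` helpers, and the use of "partitions scanned" as a proxy for query cost are all hypothetical, chosen only to mirror the shop-or-region example.

```python
from collections import defaultdict

# Hypothetical sales records, for illustration only.
sales = [
    {"region": "north", "product": "tea", "amount": 10},
    {"region": "north", "product": "coffee", "amount": 20},
    {"region": "south", "product": "tea", "amount": 5},
    {"region": "east", "product": "coffee", "amount": 8},
]


def partition_by(records, key):
    """Group records into partitions keyed on a single column."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def query(partitions, partition_key, column, value):
    """Sum amounts where column == value; return (total, partitions_scanned)."""
    if column == partition_key:
        # The filter matches the partitioning scheme: one partition suffices.
        candidates = [partitions.get(value, [])]
    else:
        # Any other filter has to look inside every partition.
        candidates = list(partitions.values())
    total = sum(
        record["amount"]
        for partition in candidates
        for record in partition
        if record[column] == value
    )
    return total, len(candidates)


partitions = partition_by(sales, "region")
by_region = query(partitions, "region", "region", "north")
by_product = query(partitions, "region", "product", "tea")
print(by_region)   # (30, 1)
print(by_product)  # (15, 3)
```

The query that filters on region reads a single partition, while the query that filters on product has to scan all three. The layout that suits one question works against the other.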
A corollary to not needing indexes, partitions and so on is that you don’t require the database administration that goes along with them. In addition to reducing costs, this will also directly enable the creation of what are sometimes called “breakthrough applications” by ISVs. The defining characteristic of such applications is that they provide actionable intelligence that leverages machine-generated and other data types, and can deliver results to any relevant user platform in whatever format is required.
Supporting such applications requires a “fire and forget” database that requires minimal or no administration. For example, InterSystems is focusing on this market with its Caché database, as are others. Another example is Infobright, whose partner JDSU has embedded Infobright’s analytic database into its service assurance applications to enable large network operators to immediately drill down into huge volumes of CDRs.
3) Understand your objective - There’s a reason that businesses use purpose-built tools for certain jobs. You don't want your business solutions to use a standard relational database for everything, just as you wouldn't use a screwdriver when you really need a power drill.
This is especially true for databases and analytic solutions, where there are good and justifiable reasons for using everything from traditional row-based relational databases, to purpose-built columnar stores, to memory-based systems and complex event processing, to emerging technologies such as Hadoop and both NoSQL and NewSQL (VoltDB, NuoDB, JustOne and so forth) databases.
In the case of machine-generated data, a purpose-built solution must encompass the other best practices mentioned above: an ability to handle Big Data volumes within a manageable hardware footprint, combined with query flexibility, high performance and reduced administrative requirements. There's no silver bullet, but understanding and then linking project objectives to the right architecture can mean the difference between a costly failure and an efficient success.
Dealing with Big Data is going to require a targeted, rather than a “one-size-fits-all”, approach: what IBM refers to as workload optimized systems (though in a slightly different context). What is clear is that a large percentage of the data that needs to be analyzed is machine generated. For this challenge, practitioners need to think carefully about their solution set and look at how to load data faster, store it more compactly, and reduce the cost, resources and time involved in analyzing and managing it.
Philip Howard is a research director at Bloor Research and focuses on data management. In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to IT-Director.com and IT-Analysis.com and was previously editor of both Application Development News and Operating System News on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and written a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.