Big Data – Dealing with the Data Tsunami

There is a lot of buzz these days about the challenge of “Big Data”.  I’ll be speaking on the subject at GigaOM’s Structure 2010, on the “DEALING WITH THE DATA TSUNAMI: THE BIG DATA” panel.  There are many dimensions to the challenges posed by “Big Data”, which I’ve presented here as five separate but related themes: speed of data arrival, data location, pace of change, balancing historical and continuous analysis, and brain over brawn processing.

Speed of data arrival

The first theme is speed.  When a lot of data arrive fast, it is easy to overlook that they arrive in raw form and need to be processed, or cooked, before they are of any real value.  The processing normally comprises cleaning, filtering, aggregating and validating.  Sometimes the data also need to be enhanced, normalized or de-normalized.  While there are a number of proprietary ETL tools out there that can help, most people prefer to perform these operations using SQL.  This approach has become known as ELT, because the data are Extracted, Loaded and then Transformed (as opposed to Transformed and then Loaded).  In the past, this has meant loading raw data into a data warehouse’s staging tables and then performing the transformations with SQL in batches until the data are fully cooked and ready to take part in the “main course” queries.
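
To make that concrete, here is a minimal sketch of one batch ELT step in plain SQL, assuming a hypothetical staging table (staging_raw_orders) and target table (clean_orders); the table and column names are illustrative only, not from any particular schema:

    -- Clean, validate and convert raw rows from a staging table into a
    -- cooked table ready for the "main course" queries.
    INSERT INTO clean_orders (order_id, customer_id, amount_usd, order_ts)
    SELECT order_id,
           customer_id,
           CAST(raw_amount AS DECIMAL(12,2)),      -- normalize the amount
           CAST(raw_order_time AS TIMESTAMP)       -- validate/convert the timestamp
    FROM   staging_raw_orders
    WHERE  order_id IS NOT NULL                    -- filter out incomplete rows
      AND  raw_amount IS NOT NULL;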

One of the strengths of the SQLstream approach is that, for the first time, you can use standards-based SQL to perform these ELT steps as Continuous ETL, rather than operating on the data only after first storing them.  We call this “analyze-before-store” approach “Query the Future”, because the scope of a continuous query runs from the moment it starts until the end of future time (in contrast with a historical query, whose scope runs from the moment it starts as far back in time as the data are stored).  SQLstream’s queries continuously process, clean, aggregate and enhance the data in a highly parallel, pipelined dataflow process.  The staging takes place in main memory, exploiting 64-bit architectures, multiple cores and multiple servers.  This provides a highly scalable, efficient and cost-effective solution to ETL, with the virtuous side-effect that the data warehouse can be kept continuously up-to-date: it is fed a stream of fully cooked data and its aggregate tables are updated in near real-time.  All of this is done without stealing valuable cycles from the data warehouse server.
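
Here is a rough sketch of what such a continuous ETL query might look like.  The SELECT STREAM keyword and the ROWTIME system column follow the streaming-SQL style SQLstream uses, though the exact syntax varies by release, and the stream and column names (raw_events, service_id, latency_ms) are hypothetical:

    -- Continuously clean and aggregate a raw event stream into one-minute
    -- summaries that can feed the warehouse's aggregate tables.
    SELECT STREAM
           FLOOR(e.ROWTIME TO MINUTE) AS minute_start,   -- tumbling one-minute window
           e.service_id,
           COUNT(*)          AS event_count,
           AVG(e.latency_ms) AS avg_latency_ms
    FROM   raw_events AS e
    WHERE  e.latency_ms IS NOT NULL                      -- filter out malformed rows
    GROUP BY FLOOR(e.ROWTIME TO MINUTE), e.service_id;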

Data location

The second theme is data location.  As with houses, location is very important when it comes to assessing the value (or usefulness) of data.  Location might be spatial or temporal.  If you wish to be alerted to a special price for gas at a specific gas station, the alert is clearly of greater value if you are currently in the immediate vicinity of that station.  This illustrates the value of location in both space and time.  In contrast, most data warehouses dumbly store all service data and records without regard to their value.

Clearly, the value of data in many cases greatly diminishes over time.  Many of the queries that a business might pose are better targeted at current data.  That is particularly true of targeted advertisements, but also of monitoring customer service levels, cloud computing infrastructure and the like.  The data are much more valuable when the business can act proactively to capitalize on them: fixing problems before they negatively impact customers, or making that promotion or sale before the customer buys the product or service from a competitor.  SQLstream’s continuous queries are all about focusing analytics where they have the most value, by specifying explicit windows of focus for the queries in terms of time, quantity or space.  While many rows can flow into and out of the window of focus for any given query, the window represents the immediate focus of attention.
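
Picking up the gas station example, here is a rough sketch of a continuous query with an explicit window of focus in time.  The stream and column names (gas_price_reports, station_id, price) are hypothetical, and the windowed aggregate follows standard SQL:2008 window syntax:

    -- For each incoming price report, consider only the last ten minutes
    -- of reports for the same station: the window of focus.
    SELECT STREAM
           p.station_id,
           p.price,
           MIN(p.price) OVER (
               PARTITION BY p.station_id
               ORDER BY p.ROWTIME
               RANGE INTERVAL '10' MINUTE PRECEDING
           ) AS lowest_price_last_10_min
    FROM   gas_price_reports AS p;

A spatial predicate on the reader’s current position would narrow the focus further in space as well as in time.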

Pace of change

The third theme is the pace of change of the data.  If you have a large quantity of data that is not changing very much, then historical queries and analysis will no doubt provide you with all of your answers.  However, if the data are changing constantly, if a lot of new data are arriving constantly, or if you have a focus on a specific window of time or space, then historical analysis has little value.  What you care about is the derivative of the change: the rate of change.  For example, are our sales accelerating or decelerating?  Is the rate of acceleration unusually high or low?  What about service outages and error rates?  Or customer complaints?  The SQLstream approach enables you to see what is changing rather than what is staying the same.  It is analogous to predator vision: a predator wants to see what is moving, and its visual system prioritizes that over what remains motionless.  SQLstream provides such dynamic vision.
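
As a rough sketch of how a rate-of-change question might be expressed, the continuous query below compares the error count in the last minute with the average per-minute rate over the last hour; a spike in the former relative to the latter signals acceleration.  The stream and column names are hypothetical, and the named WINDOW clauses follow SQL:2008 style:

    -- Flagging acceleration: errors in the last minute versus the
    -- average per-minute error rate over the last hour.
    SELECT STREAM
           e.ROWTIME,
           e.service_id,
           COUNT(*) OVER w_minute        AS errors_last_minute,
           (COUNT(*) OVER w_hour) / 60.0 AS avg_errors_per_minute_last_hour
    FROM   error_events AS e
    WINDOW w_minute AS (PARTITION BY e.service_id ORDER BY e.ROWTIME
                        RANGE INTERVAL '1' MINUTE PRECEDING),
           w_hour   AS (PARTITION BY e.service_id ORDER BY e.ROWTIME
                        RANGE INTERVAL '1' HOUR PRECEDING);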

Balancing historical and continuous analysis

The fourth theme is the need to complement data mining and the results of historical analysis with continuous analysis.  Data warehousing allows you to find patterns and predictors in past data and to back-test all of your hypotheses over extended periods of time.  The back-testing of such hypotheses often takes the form of SQL queries that search for patterns of change in the data over time and check whether, and how often, the predicted results occurred.  Once you have mined and captured such valuable predictors, it is straightforward to take the SQL you have written and tweak it to run continuously against live data in real time.  Using this approach, SQLstream allows you to leverage your data mining results to perform real-time predictive analytics, giving your business a real-time heads-up on key indicators such as buying signals, impending system failures, or which ad should be served based on a customer’s web behavior.
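
As an illustration of the tweak, here is a rough sketch of a back-tested predictor and a continuous counterpart that adds a 30-minute window of focus.  The table, stream and column names are hypothetical, and the “repeated visits to a pricing page” rule is invented purely for this example:

    -- Historical back-test against the warehouse:
    SELECT customer_id, COUNT(*) AS pricing_page_views
    FROM   page_views
    WHERE  page_url LIKE '%/pricing%'
    GROUP BY customer_id
    HAVING COUNT(*) >= 3;

    -- The same predictor tweaked to run continuously against the live stream:
    SELECT STREAM
           v.customer_id,
           COUNT(*) OVER (
               PARTITION BY v.customer_id
               ORDER BY v.ROWTIME
               RANGE INTERVAL '30' MINUTE PRECEDING
           ) AS pricing_views_last_30_min
    FROM   page_view_events AS v
    WHERE  v.page_url LIKE '%/pricing%';
    -- A view over this result that filters on pricing_views_last_30_min >= 3
    -- turns the back-tested predictor into a live buying-signal alert.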

Brain over brawn processing

My fifth and final theme is “smart declarative” versus “dumb brute force” when applied to data queries.  The latter is how I see Hadoop-based approaches.  You parallelize a problem to take advantage of a lot of available servers and related CPU cycles, but you do not rely on any intelligence in how you partition the problem.  In fact, not having to “think” is one of the primary appeals of the technique.  It is a brute force method of brawn over brain.  However, where the problem space is truly huge, or the time or financial budget is limited, there is always the attraction of the “brain over brawn” technique.  Declarative SQL processing draws upon the mathematical tractability of analyzing patterns and dependencies within the data, the use of keys and indexing, the rewriting of complex formulae into simpler ones, and the avoidance of recalculating intermediate results, in order to provide a faster, more efficient and smarter way of finding solutions.  Such declarative techniques can still take extensive advantage of parallelism and of inexpensive or readily available servers and CPU cycles, but they rely on smart analysis to optimize the calculations.  SQLstream, like all SQL-based data warehouses, draws heavily upon these mathematical properties of SQL and upon analysis of the data to do the smart thing when it comes to query processing.

Stream Computing of the kind embodied by SQLstream, however, has even greater potential to take advantage of parallelism than SQL data warehouses, because SQLstream’s Stream Computing has no transactional bottleneck and is purely declarative.  Input streams are not “side-effected” by the execution of streaming SQL statements; rather, new streams are created from the original ones (which are left untouched and can be presented concurrently to other SQLstream servers).  The execution paradigm is parallel dataflow, which lends itself not only to massively parallel execution but also to massively distributed execution.  I believe that as Hadoop becomes more widely understood and deployed, people will begin to see just how much better a job can be done by adding a little intelligence, and just how powerful declarative stream computing can be.