Structure Data, New York, and streaming Big Data for Hadoop
Last week SQLstream sponsored Structure Data in New York, a conference exploring “the technical and business opportunities spurred by the growth of big data”, and CEO Damian Black presented there.
It’s clear that Big Data has moved on considerably in a very short space of time, from the Silicon Valley 101 world of Java developers and Hadoop into the wider business mainstream (but still with Hadoop!).
Some themes emerging from the conference:
- The basic need to deliver high performance, massively scalable computing infrastructure as data volumes grow exponentially. It’s clear that the pain from structured and unstructured data is driving different approaches at different stages in the data management lifecycle: better visualizations, better cleansing and filtering, and a better understanding of which analytics tools are most applicable at each stage.
- The emergence of the SQL layer. It’s clear Hadoop has its strengths and is here to stay. It’s effectively ‘supercomputing lite’ and, given today’s data volumes, is just the tool for the job. However, a couple of trends are emerging. First, is it actually necessary to store all the data when much of it is obviously not of interest? Second, once the initial analysis of all the structured and unstructured data is done, there’s an emerging layer above Hadoop that looks very structured. Both of these functions, filtering data before it is stored and querying the analyzed results, look much more SQL-like.
- Real-time, low latency analytics. Hadoop is not, and does not claim to be, a low latency, real-time data management platform. There is a well-defined business need to analyze log file, sensor and network data in real time (with latency from sub-second to a few minutes), and also to stream the arriving data through to Hadoop for further analysis. Obviously this layer needs to be as scalable as the underlying Hadoop platform, if not more so.
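As a rough illustration of the last two themes, the sketch below filters arriving records before anything is stored, then passes the records that survive both to a live consumer and to an archive for later batch analysis. It is plain Python, not SQLstream or Hadoop code: the record fields, the `is_interesting` rule, and the list standing in for a Hadoop sink are all assumptions made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def is_interesting(record: Record) -> bool:
    # Hypothetical filter rule: keep only warnings and errors.
    return record["level"] in {"WARN", "ERROR"}


def filter_and_archive(
    records: Iterable[Record],
    keep: Callable[[Record], bool],
    archive: List[Record],
) -> Iterator[Record]:
    """Drop uninteresting records; archive and re-emit the rest."""
    for record in records:
        if keep(record):
            archive.append(record)  # stand-in for a write to Hadoop
            yield record            # still available for real-time analysis


# Made-up log records arriving on a stream.
log_stream = [
    {"level": "INFO", "msg": "heartbeat"},
    {"level": "ERROR", "msg": "disk failure"},
    {"level": "INFO", "msg": "heartbeat"},
    {"level": "WARN", "msg": "high latency"},
]

archive: List[Record] = []
live = list(filter_and_archive(log_stream, is_interesting, archive))
```

Because the function is a generator, each record is handled as it arrives rather than after the whole input has been collected, which is the property that matters for low latency.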
Damian’s presentation at Structure Data focused on relational streaming: massive-scale parallel data processing using SQL, generating real-time results from streaming input data. The talk described relational streaming as a standalone real-time management layer, and also SQLstream integrated with Hadoop as the streaming layer in the Big Data stack (you can also read the GigaOM report in the presentation here).
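To give a feel for what generating real-time results from streaming input means, here is a minimal sketch of a sliding-window average computed per sensor, emitting an updated result for every arriving row. It is written in Python rather than SQL and is not SQLstream's implementation or syntax; the `(timestamp, sensor_id, value)` row layout, the `sliding_average` name, and the assumption that rows arrive in timestamp order are all illustrative.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

# Assumed row layout: (timestamp_seconds, sensor_id, value)
Reading = Tuple[float, str, float]


def sliding_average(
    readings: Iterable[Reading], window_seconds: float
) -> Iterator[Reading]:
    """Emit, for each arriving row, the sensor's average over the last window.

    Assumes rows arrive in timestamp order and window_seconds > 0.
    """
    windows: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)
    for ts, sensor, value in readings:
        window = windows[sensor]
        window.append((ts, value))
        # Evict rows that have fallen out of the time window.
        while window[0][0] <= ts - window_seconds:
            window.popleft()
        average = sum(v for _, v in window) / len(window)
        yield ts, sensor, average


# Made-up sensor readings.
readings = [
    (0.0, "a", 10.0),
    (1.0, "a", 20.0),
    (2.0, "b", 5.0),
    (6.0, "a", 30.0),
]

results = list(sliding_average(readings, window_seconds=5.0))
```

Each output row is produced as soon as its input row arrives, so the result is a stream in its own right rather than a table computed after the fact. In a relational streaming system the same idea would be expressed as a windowed aggregate in a continuous SQL query.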