GlueCON, SQLstream and SQL as a streaming API for real-time Big Data
Glue Conference 2012 in Denver, CO, at the end of May was a great conference: well attended, with knowledgeable participants, and the only conference I know of that looks at gluing cloud and mobile applications together with a developer focus.
There was the usual wave of NoSQL, cloud storage, cloud platform and Hadoop presentations, as you’d expect, but also some interesting keynotes. Ray O’Brien, CTO for IT at NASA, talked about the evolution of Nebula and OpenStack at NASA, and James Governor from RedMonk talked about the evolution of historical analytics.
From our perspective, the strength of the show was in making physical rather than logical connections: partnerships, and potential customer interest both in building real-time Big Data applications and in how SQL has been repurposed as an API for streaming Big Data, moving it forward significantly from its roots as a static data management language.
Real-time, streaming Big Data
Our CEO, Damian Black, presented on real-time streaming Big Data, both as a real-time alternative to Hadoop and as a complement that adds real-time responses and streaming integration to existing Hadoop installations. One question we were asked several times was: why SQL? A good question. This isn’t a religious debate about the language by any means, and if we had opted to build a Big Data batch storage and analytics platform (like Hadoop), we would have gone a different route.
However, when it comes to processing streaming tuples in real-time, a standard SQL approach has two big advantages over all others. First, with the extension of the SQL WINDOW operator to process streaming data over fixed time windows, both structured and unstructured data can be processed painlessly, without having (no pun intended) to define a static schema and without the need for any coding whatsoever. In effect, SQLstream processes streams of arriving tuples over time windows and pushes the results out to other systems. This is similar in concept, at least, to Hadoop, although Hadoop is purely batch-based: it processes static files and pipelines sets of tuples through low-level MapReduce functions.
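To make the idea of processing arriving tuples over a time window concrete, here is a minimal Python sketch of a sliding-window average. It is only an illustration of the concept, not SQLstream code or its API: the Reading type, its ts and value fields, and the sliding_avg function are all hypothetical names invented for this example, and it assumes tuples arrive in timestamp order.

```python
from collections import deque
from typing import Iterable, Iterator, NamedTuple, Tuple


class Reading(NamedTuple):
    """One arriving tuple (hypothetical shape, for illustration only)."""
    ts: float     # event time in seconds
    value: float  # measured value


def sliding_avg(stream: Iterable[Reading], window_s: float) -> Iterator[Tuple[float, float]]:
    """For each arriving tuple, emit (ts, average of values in the last window_s seconds).

    Assumes window_s > 0 and that tuples arrive in non-decreasing ts order.
    """
    window = deque()  # tuples currently inside the time window
    total = 0.0       # running sum of values in the window
    for reading in stream:
        window.append(reading)
        total += reading.value
        # Evict tuples that have fallen out of the time window.
        while window[0].ts <= reading.ts - window_s:
            total -= window.popleft().value
        # Push a result downstream as soon as the tuple arrives.
        yield reading.ts, total / len(window)


if __name__ == "__main__":
    readings = [Reading(0.0, 10.0), Reading(5.0, 20.0), Reading(12.0, 30.0)]
    for ts, avg in sliding_avg(readings, window_s=10.0):
        print(ts, avg)
```

The point of the sketch is that a result is emitted for every arriving tuple rather than after a batch job completes, and that the window bookkeeping is exactly the kind of hand-written code a declarative windowed query would replace.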
The second benefit is equally important. Streaming SQL queries include standard operators such as GROUP BY and PARTITION BY. These provide the best clues possible to a query planner capable of automating the dynamic scaling of streaming pipelines over vast numbers of servers. This gives a reliable and controllable mechanism for Big Data scalability without the need for hardcoding server allocation hints.
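Why does a grouping key help a planner scale out? A rough Python sketch, under stated assumptions, shows the reasoning: if tuples are routed to workers by a hash of the grouping key, every key lands on exactly one worker, so each worker can aggregate independently and the per-worker results can simply be combined. This is not how SQLstream's planner is implemented; partition_by_key, group_sum and parallel_group_sum are hypothetical helpers, and the "workers" here are just in-process lists.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, float]  # (grouping key, amount); hypothetical tuple shape


def partition_by_key(events: Iterable[Event], n_workers: int) -> List[List[Event]]:
    """Route each tuple to a worker chosen by a stable hash of its grouping key."""
    shards: List[List[Event]] = [[] for _ in range(n_workers)]
    for key, amount in events:
        shard = zlib.crc32(key.encode("utf-8")) % n_workers
        shards[shard].append((key, amount))
    return shards


def group_sum(events: Iterable[Event]) -> Dict[str, float]:
    """Single-worker equivalent of SUM(amount) ... GROUP BY key."""
    totals: Dict[str, float] = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def parallel_group_sum(events: Iterable[Event], n_workers: int) -> Dict[str, float]:
    """Aggregate each shard independently, then combine the results.

    Because all tuples for a key share a shard, the per-shard results have
    disjoint keys and can be merged with a plain dictionary update.
    """
    merged: Dict[str, float] = {}
    for shard in partition_by_key(events, n_workers):
        merged.update(group_sum(shard))
    return merged


if __name__ == "__main__":
    events = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0)]
    print(parallel_group_sum(events, n_workers=3))
```

The grouping key is the only information the routing step needs, which is why it works as a scaling hint: nothing in the query has to name a server or a worker count.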
Real-time at GlueCON
The strength of the real-time track at GlueCON was encouraging. It was interesting, though, that the term ‘real-time’ is now about as overused as ‘Big Data’, and about as poorly understood. For SQLstream, it’s the streaming integration of any and all data sources with in-memory analytics, processing streams at millions of events per second. For some other vendors, it appears real-time drops off at significantly lower rates and numbers of connections!
Next stop, GigaOM Structure in San Francisco
Our next stop is GigaOM Structure in San Francisco, June 20-21 at the Moscone Center. Visit us there if you’re attending.