Back to the future for real-time streaming Big Data and relational streaming

We’ve been exhibiting at Structure 2012 in San Francisco, where our CEO Damian Black was speaking on dataflow architectures for massively scalable real-time Big Data computing. In fact, this was a milestone for us as Damian was on the very first Big Data panel at the first Structure event in 2008.

Dataflow is a technique for parallel computing that emerged from research in the 1970s. It is based on a graph execution model in which data flows along the arcs of a graph and is processed at the nodes. The idea was decades ahead of its time: hardware was expensive, and massively parallel, low-latency computing was not yet a mainstream requirement. With the emergence of Big Data volumes, real-time low-latency requirements, commodity hardware and low-cost storage, dataflow has found its place and time, and it is driving the architectures behind today's real-time Big Data solutions.

Structure 2012 - Dataflow comes of age

Click to view Structure 2012 presentation video

SQLstream adopted the principles of dataflow as the basis of the architecture for SQLstream s-Server. Our adapters turn any data source into a live stream of data tuples, which are combined, aggregated and analyzed by the s-Server platform. SQLstream has added one essential feature to dataflow: the use of SQL as the dataflow management language. SQL has long been the language of choice for relational database management systems, and in that context it has been getting bad press in light of new structures for Big Data storage and NoSQL queries. However, SQL is powerful and declarative (so applications can be built easily, quickly and cheaply), and it is a natural paradigm for processing streaming dataflows. The benefit is extremely low latency with the ability to process massive volumes of live data across an unlimited number of servers: exactly the requirements of real-time Big Data. In fact, this is the only architecture capable of processing real-time Big Data streams at this scale. With real-time requirements now in the range of 20 to 100 million events per second, power, scalability and low latency are key.
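
As an illustration of the idea (the stream and column names below are hypothetical, and exact syntax varies across streaming SQL dialects), a continuous query is simply a SELECT that never terminates: results flow out as soon as matching rows arrive on the input stream.

    -- Hypothetical example: 'Orders' is an assumed live stream of order tuples.
    -- The query runs continuously; each qualifying row is emitted as it arrives.
    SELECT STREAM ROWTIME, orderId, amount
    FROM Orders
    WHERE amount > 1000;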

Diagram 1: Dataflow architecture for real-time streaming Big Data computing

The SQLstream s-Server architecture is illustrated in Diagram 1. In this dataflow architecture, each node is a streaming SQL statement: a continuous SQL query that processes arriving data over a moving time window. Windows can range from one millisecond, for ultra-low-latency requirements, up to months or even years where comparison against long-term moving averages is needed (for example, Bollinger bands). Why is this important? Because it is the only approach that delivers low-latency, real-time results: information flows out of the system as soon as input data arrives, removing completely the high latency of batch-based approaches such as Hadoop MapReduce.
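
As a sketch of the moving-window idea (stream and column names are assumptions for illustration, and window syntax varies by dialect), a one-minute moving average can be written as a windowed aggregate over the stream:

    -- Hypothetical example: 'Trades' is an assumed live stream of trade tuples.
    -- Each output row carries the average price over the preceding minute,
    -- recomputed continuously as new rows arrive.
    SELECT STREAM ROWTIME, ticker, price,
        AVG(price) OVER (
            PARTITION BY ticker
            ORDER BY ROWTIME
            RANGE INTERVAL '1' MINUTE PRECEDING) AS avgPriceLastMinute
    FROM Trades;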

Mozilla Glow: Real-time download monitor with SQLstream and HBase

Damian presented a simple example of SQLstream and parallel dataflow in action. Mozilla's Glow application was a continuously updating download counter for the Firefox 4 browser at the time of its release. The application used SQLstream s-Server to collect live download statistics from all the download servers worldwide. Download records were processed and aggregated in real time and displayed on the Glow visualization map, showing exactly how many copies of the browser had been downloaded. SQLstream s-Server also provided a continuous ETL operation into Apache HBase, storing aggregated and filtered records for further in-depth analysis. Click here to watch the application in action.
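
The Glow pipeline itself is not reproduced here; the following is only an illustrative sketch, with hypothetical stream, table and pump names, of how a continuous aggregate-and-store step of this kind can be expressed: download events are counted per country per minute, and the totals are inserted into a stream that an adapter writes on to the external store.

    -- Hypothetical sketch: 'DownloadEvents' and 'DownloadTotalsByMinute' are
    -- assumed names; the latter would be bound to HBase via an adapter.
    CREATE PUMP DownloadAggregationPump AS
    INSERT INTO DownloadTotalsByMinute (minuteEnding, countryCode, downloads)
    SELECT STREAM FLOOR(ROWTIME TO MINUTE) AS minuteEnding,
        countryCode,
        COUNT(*) AS downloads
    FROM DownloadEvents
    GROUP BY FLOOR(ROWTIME TO MINUTE), countryCode;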

Finally, and in contrast to Structure, we also attended a Gartner session last week with Merv Adrian and Svetlana Sicular, which sought to bring some perspective to Big Data. It was a reality check on the current maturity of the Hadoop Big Data platforms and the effort required to deploy them. Wider adoption across industry will require significantly more mature products and applications, particularly around the OPEX costs of deployment, security concerns, and the ability to deliver business intelligence to all consumers in a large organization. The recommendation was to use an integrator such as Cloudera or Hortonworks. Mainstream organizations are looking at the Hadoop/Big Data approach, but many do not currently see either a use case or a reason for adoption. It was interesting to hear a view that didn't need to be buzzword-compliant, yet presented a positive and realistic take on wider adoption.