Making the Elephant Fly: Real-time operational intelligence for Hadoop using streaming SQL

SQLstream sponsored the recent IE Group Big Data Innovation Summit in San Francisco, where I also presented on streaming SQL for Hadoop and on extending Hadoop for real-time operational intelligence and streaming analytics. As Big Data technologies and Hadoop push further into mainstream enterprises, the need for real-time business operations has become an important parallel trend. Some have treated ‘real-time’ and ‘Hadoop’ as synonymous, and are then surprised when Hadoop turns out to be less real-time than they had hoped. It should not come as a surprise: Hadoop has many strengths, but it was never intended for low latency, real-time analytics over high velocity data.


Click to View Damian’s Presentation on Slideshare

Real-time Big Data and streaming analytics

Which raises the question: what do we mean by real-time? Many products have emerged that claim ‘real-time’ analytics over Hadoop. Yet Hadoop remains a batch processing framework, and it struggles to deliver low latency analytics against high velocity streaming data because of the same limitations that affect existing RDBMS-based data management platforms. These ‘real-time’ products may generate rapid results over the stored data, but they ignore the latency introduced by data collection and storage, and the resource load of repeatedly re-executing queries to process newly arriving data. The latency issue may not be apparent for slower data streams such as Twitter feeds, but at the data rates of machine data in telecommunications, industrial automation, M2M and large scale security intelligence, the problem rapidly becomes extreme.
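To make the resource-load point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the simplest possible case: a query that rescans the whole store every time a new batch lands, with no indexing or incremental view maintenance. The function names and batch sizes are ours, chosen purely for illustration.

```python
def polling_cost(batches):
    """Rows scanned when the full query is re-run over the store after each batch arrives."""
    stored = 0
    scanned = 0
    for batch in batches:
        stored += len(batch)   # the batch is collected and stored first
        scanned += stored      # then the query rescans everything stored so far
    return scanned


def incremental_cost(batches):
    """Rows touched when each record is processed exactly once, as it arrives."""
    return sum(len(batch) for batch in batches)


# Illustrative workload: 100 arrivals of 1,000 records each.
batches = [[0] * 1000 for _ in range(100)]
print(polling_cost(batches), incremental_cost(batches))  # 5050000 100000
```

Under these assumptions the re-query approach does roughly fifty times the work for the same 100,000 records, and the gap widens as the store grows. Real systems narrow it with partitioning and indexes, but the shape of the problem is the same.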

SQLstream’s core stream computing platform, s-Server, processes high velocity data as soon as it is generated, executing continuous SQL queries and streaming analytics directly over log files, sensor feeds and any other machine-generated data source. We measure real-time from the time of data creation, completely eliminating the latency introduced by collecting and storing data and by repeatedly updating results.
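The idea of a continuous query can be sketched in a few lines of Python. This is not s-Server code and not streaming SQL; it is a minimal illustration, under our own assumptions, of a query that never terminates and emits an updated result for every arriving record. The event layout (timestamp, key, value) and the function name are hypothetical.

```python
from collections import deque


def sliding_average(events, window_seconds):
    """Yield (timestamp, key, average) per event, averaged over a trailing time window per key.

    `events` is any iterable of (timestamp, key, value) tuples ordered by timestamp.
    """
    windows = {}  # key -> deque of (timestamp, value) still inside the window
    sums = {}     # key -> running sum of the values in that deque
    for ts, key, value in events:
        win = windows.setdefault(key, deque())
        win.append((ts, value))
        sums[key] = sums.get(key, 0.0) + value
        # Evict readings that have fallen out of the trailing window.
        while win[0][0] <= ts - window_seconds:
            _, old = win.popleft()
            sums[key] -= old
        yield ts, key, sums[key] / len(win)


# Illustrative sensor readings: (seconds, sensor id, value).
events = [(0, "s1", 10.0), (1, "s1", 20.0), (2, "s2", 5.0), (12, "s1", 30.0)]
results = list(sliding_average(events, window_seconds=10))
print(results)
```

Each result is available the moment its input record is seen, and the work per record depends on the window contents rather than on the size of any stored history.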

Drive real-time actions with streaming operational intelligence

We discussed in a previous blog how real-time operational intelligence eliminates the chasm between business operations and analytics. Operational intelligence is about more than the collection and analysis of log file and machine-generated data. One of the advantages of stream computing is the ease with which predictive analytics can be applied over multiple data streams. This makes it possible to alert on time- and space-based patterns of machine, user and consumer behavior that are predictors of some future event, such as a security breach, network failure or service fault.
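As an illustration of a time-based pattern, the Python sketch below raises an alert when one source produces a burst of failed logins inside a short window. A fixed threshold rule stands in here for a real predictive model, and the event layout, the `login_failed` label and the function name are assumptions made for the example.

```python
from collections import defaultdict, deque


def detect_bursts(events, threshold, window_seconds):
    """Yield (timestamp, source, count) when a source reaches `threshold` failures in the window.

    `events` is an iterable of (timestamp, source, kind) tuples ordered by timestamp.
    """
    recent = defaultdict(deque)  # source -> timestamps of recent failures
    for ts, source, kind in events:
        if kind != "login_failed":
            continue
        q = recent[source]
        q.append(ts)
        # Drop failures that are older than the window.
        while q[0] <= ts - window_seconds:
            q.popleft()
        if len(q) >= threshold:
            yield ts, source, len(q)


events = [
    (0, "10.0.0.1", "login_failed"),
    (5, "10.0.0.2", "login_ok"),
    (20, "10.0.0.1", "login_failed"),
    (30, "10.0.0.1", "login_failed"),
    (200, "10.0.0.1", "login_failed"),
]
alerts = list(detect_bursts(events, threshold=3, window_seconds=60))
print(alerts)  # [(30, '10.0.0.1', 3)]
```

The alert fires on the third failure at t=30, while the isolated failure at t=200 does not trigger one because the earlier failures have aged out of the window.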


True operational intelligence platforms need to go one step further. A real-time platform must do more than visualize results on a dashboard: it is essential to connect back to application and operational systems, and to drive automated updates. Security breaches can be avoided, network resilience mechanisms activated, and service faults corrected before SLA breaches occur and before customers are aware of the problem.

Real-time operational intelligence on Hadoop

So what does this mean for Hadoop? Streaming is not a new technology, but earlier approaches to streaming have focussed on single source problems, and have been deployed as standalone platforms for low velocity use cases. With SQLstream, standard SQL queries, albeit continuously executing ones, join, group, partition and analyze real-time machine data streams. There is a further difference: SQLstream’s s-Server streaming SQL platform can also be deployed as a streaming SQL query extension for Hadoop.

A number of streaming Hadoop scenarios are supported:

  • Stream persistence – Hadoop HBase as an active archive for streaming data and derived intelligence using the Flume API. SQLstream also performs continuous aggregation to support high velocity streams without data loss.
  • Stream replay – restream the complete history of persisted streams from HBase for ‘fast forwarding’ of time-based and spatial analytics. Various interfaces can be utilized, including Cloudera’s Impala.
  • Streaming data queries, joining streaming real-time data with historical streams and intelligence persisted in HBase.
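Two of these scenarios, continuous aggregation ahead of persistence and joining live data with persisted history, can be sketched together in plain Python. This is an illustration under our own assumptions rather than SQLstream or HBase code: the dictionary `baseline` is a stand-in for a lookup against the archive, and the function names, keys and integer-second timestamps are hypothetical.

```python
def tumbling_counts(events, width):
    """Collapse (timestamp, key) events into (window_start, key, count) rows.

    Rows for a window are emitted once an event from a later window arrives,
    so only one compact row per key and window would need to be persisted.
    """
    current = None
    counts = {}
    for ts, key in events:
        start = ts - ts % width
        if current is not None and start != current:
            for k in sorted(counts):
                yield current, k, counts[k]
            counts = {}
        current = start
        counts[key] = counts.get(key, 0) + 1
    if current is not None:  # flush the final, still-open window
        for k in sorted(counts):
            yield current, k, counts[k]


def join_with_history(rows, baseline):
    """Attach a historical baseline to each aggregated row and flag rows that exceed it."""
    for start, key, count in rows:
        expected = baseline.get(key)  # stand-in for a lookup against the archive
        yield {
            "window_start": start,
            "key": key,
            "count": count,
            "baseline": expected,
            "above_baseline": expected is not None and count > expected,
        }


events = [(1, "cell-a"), (2, "cell-a"), (3, "cell-b"),
          (61, "cell-a"), (62, "cell-a"), (63, "cell-a")]
baseline = {"cell-a": 2}  # hypothetical historical count per 60-second window
joined = list(join_with_history(tumbling_counts(events, width=60), baseline))
for row in joined:
    print(row)
```

Six raw events collapse to three rows, and only the second window for `cell-a` is flagged as above its historical baseline. Aggregating before writing is one way to keep the write rate manageable on a high velocity stream.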

Making the Elephant fly

Accelerating Hadoop to process live, high velocity unstructured data streams delivers the low latency, streaming operational intelligence demanded by today’s real-time businesses. Hadoop has been the driving force behind Big Data Analytics, but as the technology hits the mainstream, many industries are seeking to take a step further and eliminate latency from their business completely. With the SQL language emerging as the key enabler for the mainstream adoption of Hadoop, executing streaming SQL queries over Hadoop extends the platform out to the edge of the network, making it possible to query unstructured log file, sensor and network machine data sources on the fly and in real time.