The stream processing paradigm differs from the traditional storage-based data management paradigm with which we grew up. Stream processors are fast because they operate in memory (although that is not unusual these days), process data streams record by record as records arrive, apply continuous queries (which never end) over time-based or record-based windows, and, importantly, incrementally update the results of every executing query for each newly arriving record. Contrast this with the batch-based processing of Hadoop or an RDBMS data warehouse, where queries must be re-executed across the entire dataset in order to update results when new data arrive. This is one of the reasons that Hadoop is not a suitable platform for real-time systems.
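To make the contrast concrete, here is a minimal sketch (the class and function names are illustrative, not from any product) of an incrementally maintained aggregate versus a batch-style refresh: the streaming version touches only the newly arrived record, while the batch version re-scans the whole dataset every time the result is needed.

```python
# Hypothetical sketch: a continuous query maintaining a running average.
# A stream processor updates the result in O(1) per arriving record;
# a batch system re-executes the query over the entire dataset.

class RunningAverage:
    """Incrementally maintained aggregate, updated per record."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_record(self, value: float) -> float:
        # Incremental update: touch only the new record, not the history.
        self.count += 1
        self.total += value
        return self.total / self.count

def batch_average(dataset: list) -> float:
    # Batch-style refresh: re-executes over the entire dataset.
    return sum(dataset) / len(dataset)

stream = RunningAverage()
history = []
for v in [10.0, 20.0, 30.0]:
    history.append(v)
    incremental = stream.on_record(v)   # O(1) per record
    full_scan = batch_average(history)  # O(n) per refresh
    assert incremental == full_scan
```

Both paths produce identical results; the difference is the cost per new record, which is what makes the incremental model viable for real-time systems.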
Stream processors retain sufficient data to satisfy all the window-based queries active in the system. Windows can span anything up to several months of data, although such long windows would normally be used for maintaining analytics such as moving averages. Memory management is therefore key, which is why SQLstream offers sophisticated buffer management and scheduling. Stream processing platforms should also support integration with storage platforms, both for stream persistence and for joins between static tables and data streams (although there is more to stream-to-static table joins than is at first apparent).
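The memory-management point can be illustrated with a record-based sliding window (a hypothetical sketch, not any vendor's implementation): the processor retains only the records its active window still needs, so memory stays bounded no matter how long the stream runs.

```python
from collections import deque

# Hypothetical sketch of a record-based sliding window: only the last
# `window_size` records are retained; older records are evicted
# automatically, so memory use is bounded.

class MovingAverage:
    def __init__(self, window_size: int):
        self.window = deque(maxlen=window_size)  # evicts oldest on overflow

    def on_record(self, value: float) -> float:
        self.window.append(value)
        return sum(self.window) / len(self.window)

ma = MovingAverage(window_size=3)
for v in [1.0, 2.0, 3.0, 4.0]:
    ma.on_record(v)
# After the 4th record the window holds only [2.0, 3.0, 4.0].
```

A time-based window works the same way in principle, evicting by timestamp rather than by record count; either way, the retained state is what buffer management and scheduling must keep under control.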
It’s also important to separate today’s stream processing platforms (or stream computing, as Forrester now terms the market) from CEP platforms. CEP is the early first generation of event processing, focused on the financial services market. More importantly, the performance of many CEP platforms came in large part from compiling a proprietary SQL-like language. That is fine, but it does mean the platforms have to be taken down and recompiled in order to apply updates and new analytics, which is a limitation for operational scenarios. However, most CEP frameworks have now been acquired by larger organizations, apart from a few open source CEP vendors.
Another interesting concept raised by stream processing is the visualization of streaming analytics. BI visualization tools, even those claiming to be real-time, require data to be stored first in a table: an end-of-file marker is expected. This of course introduces response latency and table management issues, and repeatedly querying large datasets increases the system resources required. Therefore, unless the stream processing vendor provides specific streaming dashboarding tools, some form of integration using clever management of materialized views may be required.
One final perspective. Stream processing is almost always compared against storage-based data management platforms: fast data versus slow data. However, that’s not the whole story. Stream processing architectures are closer to ‘intelligent middleware’, with intelligent processing nodes, where the output streams are as important as the input streams. Stream processing is often associated with real-time and Big Data systems, but achieving real-time means integration with operational systems. That’s not to say that a stream processing engine can be compared directly with a middleware platform or a real-time BPM tool, but the association is perhaps closer in terms of IT architectures for operational systems than the association with data storage platforms.