There’s a lot of talk about streaming architectures lately, but somehow there’s always a catch: streaming is either small-batch, fast big-batch, or not-quite-there (the ingest lag and/or throughput lag is too high for the whole thing to be called real time). So — with apologies for the duh factor — here’s a short list of stuff you need to make it happen:
Keep it coming
If there’s no data streaming, there’s no stream processing; the first and main requirement for a real-time stream processing system is the ability to receive and process data in memory, without storing it before performing any operation on it.
Historically, streaming applications have been built in general-purpose languages such as Java or C++; unfortunately, they incur high maintenance costs and long development cycles.
Proprietary languages are also on the rise, and while some boast performance worth mentioning, most if not all face the same issues: integrating with existing systems is usually tedious, and acquisition, training, and maintenance costs are exorbitant.
In contrast, SQL has remained the enduring standard data processing language for over four decades, and it has recently gained streaming extensions that allow SQL systems to filter, merge, correlate, and aggregate streaming data in real time.
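To make that concrete, here is a minimal Python sketch of the kind of continuous, in-memory windowed aggregate that a streaming SQL `GROUP BY` over a time window expresses. The function name and generator-based shape are illustrative, not any particular product’s API:

```python
def tumbling_window_avg(events, window=2):
    """Average values per tumbling event-time window, emitting each
    window's result as soon as an event passes its boundary -- all
    in memory, with nothing stored ahead of the query."""
    window_start, total, count = None, 0.0, 0
    for ts, value in events:
        if window_start is None:
            window_start = (ts // window) * window
        # Close (and emit) any window the stream has moved past.
        while ts >= window_start + window:
            if count:
                yield (window_start, total / count)
                total, count = 0.0, 0
            window_start += window
        total += value
        count += 1
    if count:  # flush the final, still-open window
        yield (window_start, total / count)

# Timestamped readings, averaged per 2-second window.
readings = [(0, 10.0), (1, 20.0), (3, 30.0)]
print(list(tumbling_window_avg(readings)))  # → [(0, 15.0), (2, 30.0)]
```

Note that results flow out as the stream advances; nothing is written to disk and no query is re-run over stored data.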
Marry streaming and stored data
For all data processing applications, comparing the current situation with the past is a common procedure, regardless of how “current” is defined. When real time is a requirement, though, historical analysis and live data need to be integrated in the same application, instantaneously and continuously.
The capability to efficiently store, access, and modify state information and combine it with live streaming data is a unique feature of stream processors, and depends on factors such as processing power and the aforementioned uniform language.
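A toy illustration of that combination, with hypothetical names throughout: stored reference data is held in memory, each live event is enriched against it, and the stream also modifies state (a running position) as it flows:

```python
# Stored state: reference data loaded before the stream starts.
reference = {"AAPL": "Apple Inc.", "MSFT": "Microsoft Corp."}
# Mutable state maintained by the stream itself.
positions = {}

def process(trades):
    """Enrich each live (symbol, qty) event with stored reference data
    and update running positions in the same single pass."""
    enriched = []
    for symbol, qty in trades:
        # Combine the live event with stored data...
        name = reference.get(symbol, "unknown")
        # ...and modify state based on it.
        positions[symbol] = positions.get(symbol, 0) + qty
        enriched.append((symbol, name, positions[symbol]))
    return enriched

out = process([("AAPL", 100), ("AAPL", -40), ("MSFT", 25)])
print(out)
# → [('AAPL', 'Apple Inc.', 100), ('AAPL', 'Apple Inc.', 60),
#    ('MSFT', 'Microsoft Corp.', 25)]
```

In a real stream processor the lookup and the state update would be expressed in the same (ideally SQL-based) language as the streaming logic, rather than stitched together across systems.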
Fix the broken data
In traditional systems, data is always there before it is queried. In a streaming setup, data has not yet been stored; the system therefore needs to make the best of the data it has, which may be incomplete, late, or out of order.
Continuous comparison, the ability to time out computations, and keeping time windows open for stragglers are the keys to fixing and using broken data, maximizing the value of the streams.
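One way to sketch the “keep the window open, then time it out” idea in Python (the grace-period approach below mirrors what streaming systems usually call allowed lateness; names and defaults are made up for illustration):

```python
def windows_with_lateness(events, size=10, lateness=5):
    """Assign events to tumbling event-time windows, keep each window
    open until observed event time passes window_end + lateness (so
    late arrivals are still incorporated), then emit and close it."""
    open_windows = {}            # window_start -> collected values
    max_ts = float("-inf")       # high-water mark of event time seen
    for ts, value in events:
        start = (ts // size) * size
        if start + size + lateness <= max_ts:
            continue  # too late: this window already timed out
        open_windows.setdefault(start, []).append(value)
        max_ts = max(max_ts, ts)
        # Emit every window whose grace period has expired.
        for s in sorted(open_windows):
            if s + size + lateness <= max_ts:
                yield (s, open_windows.pop(s))
    for s in sorted(open_windows):  # flush what is still open
        yield (s, open_windows[s])

# The event at ts=4 arrives out of order but within the grace period,
# so it is repaired into its window instead of being dropped.
stream = [(1, "a"), (12, "b"), (4, "c"), (17, "d")]
print(list(windows_with_lateness(stream)))
# → [(0, ['a', 'c']), (10, ['b', 'd'])]
```

The timeout is what bounds the wait: without it, a window could stay open forever hoping for stragglers that never come.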
Keep it ready
Because mission-critical information needs to stay available and shielded from disruptions, stream processing technologies have to use solutions that keep applications up at all times, with a hot backup in place that eliminates the need for restart and recovery.
Stream processors need to scale up and out automatically over any number of machines and cores. To do that, they have to split an application across multiple machines and cores while allowing it to balance its load. Distributed architectures and support for multi-threaded operation are vital.
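In miniature, splitting a keyed stream across threads looks something like the sketch below: events are routed to a partition by hashing their key, so all events for one key land on the same worker and per-key state never needs cross-thread locking. This is a single-process stand-in for what a distributed stream processor does across machines; all names are hypothetical:

```python
import queue
import threading

def run_partitioned(events, num_workers=4):
    """Route (key, value) events to worker threads by key hash and
    count events per key inside each partition, lock-free because a
    given key is only ever touched by one worker."""
    queues = [queue.Queue() for _ in range(num_workers)]
    counts = [dict() for _ in range(num_workers)]  # per-partition state

    def worker(i):
        while True:
            item = queues[i].get()
            if item is None:      # poison pill: shut down cleanly
                return
            key, _value = item
            counts[i][key] = counts[i].get(key, 0) + 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for key, value in events:
        queues[hash(key) % num_workers].put((key, value))  # route by key
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    merged = {}  # merge per-partition state for inspection
    for c in counts:
        merged.update(c)
    return merged

print(run_partitioned([("a", 1), ("b", 2), ("a", 3)]))  # → {'a': 2, 'b': 1}
```

Which physical partition each key lands on varies run to run (string hashing is randomized in Python), but the merged result does not, which is exactly the property a rebalancing stream processor relies on.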
Do it all in real time
Even if all the above conditions are met, an application won’t fulfill its function unless it can keep up with the data, regardless of changes in volume, speed, and variety. Of these, volume is the most important factor, and stream processing has to — and is the only technology able to — process up to millions of messages per second with latencies in the sub-10-millisecond range.
But can the system sustain such performance? To deliver, a stream processor needs to minimize overhead and maximize useful work while continuously integrating input, processing, and storage.
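Claims like these are only meaningful if you measure them. A trivial harness for doing so over an in-memory message list (the function and metric names are made up; real benchmarks would also control for warm-up, GC, and I/O):

```python
import time

def measure(process, messages):
    """Measure throughput and worst-case per-message latency for a
    processing function applied to an in-memory message list."""
    latencies = []
    start = time.perf_counter()
    for msg in messages:
        t0 = time.perf_counter()
        process(msg)                     # the work under test
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "msgs_per_sec": len(messages) / elapsed,
        "max_latency_ms": max(latencies) * 1000,
    }

stats = measure(lambda m: m * 2, list(range(100_000)))
print(stats)
```

Tracking the maximum (or a high percentile) rather than the mean latency matters here: a stream that is fast on average but stalls on outliers is not real time.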
Otherwise, it simply ain’t streaming.