For the Love of Lambda Architectures (or how a full picture is better than a puzzle)
How it works, and how it doesn’t
In principle, most implementations follow the 3-layer rule: a batch layer for batch processing, a serving layer built on a key/value store, and a speed layer for stream processing and real-time analytics. The layers mostly work alongside each other, with periodic integration of results.
The batch layer: This is the archive used to hold all of the historical data ever collected, including processed results. This is usually a data lake system like Hadoop, or an OLAP data warehouse, and, as the name implies, it supports batch queries.
Fast-moving data is captured as it enters the system, then ingested by both the batch layer and the speed layer by way of message queues such as Kafka. Ingestion is done in parallel and requires no response to the data source or bus, which means this is a one-way pipeline: in Lambda, immutable data flows in only one direction, into the system. Results are aggregated and loaded into the serving layer, and the process then starts again on a predetermined query schedule and for each ad hoc computation.
Pros: The point of a one-way pipeline is to execute data lake queries or OLAP-type processing faster, and it does: the time required to consult column-stored data, for example, is reduced from a couple of seconds to about 100ms.
Cons: No matter how fast the one-way data pipeline is, important real-time analytics applications like user segmentation and scoring, fraud detection, detecting denial of service attacks, adaptable consumer policies and billing simply can’t happen because they require a two-way pipeline. Batch processing prevents Lambda from transacting or making per-event decisions.
Also, the recalculation method increases the amount of data stored and the processing power needed to maintain the batch views, and this approach can be expensive.
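The one-way, parallel ingest described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Event`, `LambdaIngest`, and `batch_view` are hypothetical names, and in-memory containers stand in for the message queue and the data lake.

```python
from collections import deque
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical immutable event; the fields are illustrative."""
    user: str
    amount: float


class LambdaIngest:
    """Fans every event out to both layers; nothing flows back to the source."""

    def __init__(self):
        self.master_log = []        # batch layer: append-only history
        self.speed_queue = deque()  # speed layer: fresh events awaiting analysis

    def ingest(self, event):
        # One-way pipeline: the caller receives no result for the event.
        self.master_log.append(event)
        self.speed_queue.append(event)


def batch_view(master_log):
    """Recalculate a per-user total from the full history (a batch query)."""
    totals = {}
    for event in master_log:
        totals[event.user] = totals.get(event.user, 0.0) + event.amount
    return totals
```

Because `ingest` returns nothing, the source never learns anything about the event it sent, which is why per-event decisions have no place to happen in this flow. Note also that `batch_view` rescans the whole log on every run, which is where the storage and processing cost comes from.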
The serving layer: results from batch-layer computations are cached here so they are immediately available to answer queries. They are updated every time the batch layer runs a new analysis.
Pros: this is the only point where batch and speed meet, in a sort of data escrow setting that allows for a rather passive integration of results.
Cons: the results here may only be as fresh as the batch layer is fast. In other words, data here is outdated from the moment it’s loaded.
The speed layer: after a parallel ingest, analytics is also done in parallel: the logs move to a data lake, where the batch metrics are recalculated, while the speed layer starts analyzing fast-moving data as it enters the system. This layer is a combination of queuing, streaming, and operational data stores, and it computes its own result combining fresh data with the most recent results from the serving layer.
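As a rough sketch of how the speed layer can combine fresh data with the most recent serving-layer results, consider the following Python. The class and function names (`ServingLayer`, `SpeedLayer`, `query`) are assumptions made for illustration, and simple per-key counters stand in for real analytics.

```python
class ServingLayer:
    """Caches the most recent batch view (hypothetical key/value store)."""

    def __init__(self):
        self.batch_view = {}

    def load(self, view):
        # Replaced wholesale each time the batch layer finishes a run.
        self.batch_view = dict(view)


class SpeedLayer:
    """Keeps increments for events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = {}

    def on_event(self, key, amount):
        self.realtime_view[key] = self.realtime_view.get(key, 0) + amount

    def reset(self):
        # Called once a new batch view has absorbed these events.
        self.realtime_view.clear()


def query(key, serving, speed):
    """Answer a query from the stale batch result plus the fresh increment."""
    return serving.batch_view.get(key, 0) + speed.realtime_view.get(key, 0)
```

For example, if the serving layer holds `{"alice": 100}` and the speed layer has since seen an event of 5 for `alice`, `query("alice", serving, speed)` returns 105. The answer is current, but it is still only a read-side merge; no decision is sent back to the source of the event.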
Pros: hundreds of new use cases pop up every week, most of them batched (sic!) around log ingestion and analytics: server logs, access logs, or even the popular practice of collecting data from Twitter streams. However, some fast data flows remain largely unnavigable: market data, IoT sensors, mobile devices, clickstreams and transactions.
Cons: The speed layer is permanently ahead of the batch layer, just as the batch layer is permanently ahead of the serving layer. Even though Lambda allows applications to take (very) recent data into account, it still only supports the same basic functions as batch analytics.
Plus, unreliable consistency makes it even harder to keep up the pace, since data needs to be checked before being fed into the batch layer; altering analytics on the fly and real-time decision making are, therefore, but a dream.
But many businesses have come close to a working model, and are—for now, at least—relatively content. Their concept is the same: one single architecture built to ingest, process, and compute analytics on both live and historical data. The components, however, differ from company to company based on their strategic prioritization of needs, allocation of resources, and legacy system requirements.
This means businesses end up with a puzzle-like ecosystem of software made of many disparate elements trying to pass messages from one to another. At this point, imagine an absolute cacophony: most tools require their own APIs instead of allowing universal SQL queries.
The more elements, the more nodes; the more nodes, the more fragile the ecosystem, and the more complicated it is to ensure seamlessness, immediacy, and affordability. Just like with a puzzle, the more pieces you have, the harder it is to construct a full picture.
Turning up the Lambda
The way it stands now, analytics at the speed and batch layer can be predefined or ad hoc; but should new analytics be desired in the Lambda, the application needs to rerun the entire data set, from the data lake or from the original log files, to recompute the new metrics. That means delays, and delays can translate into effective losses and missed opportunities.
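The cost of adding a metric can be illustrated with a minimal Python sketch. It assumes the history is a plain list of event dictionaries, and `recompute`, `count_per_user`, and `max_amount_per_user` are hypothetical names.

```python
def recompute(master_log, metric):
    """Replay the entire history to build the view for one metric."""
    view = {}
    for event in master_log:  # every stored event is read again
        metric(view, event)
    return view


def count_per_user(view, event):
    """An existing metric: number of events per user."""
    view[event["user"]] = view.get(event["user"], 0) + 1


def max_amount_per_user(view, event):
    """A newly requested metric: largest amount seen per user."""
    user = event["user"]
    view[user] = max(view.get(user, event["amount"]), event["amount"])
```

Calling `recompute(master_log, max_amount_per_user)` walks the full log even though `count_per_user` has already been computed over the same data, so the work grows with the size of the history for every metric added.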
That being said, the Lambda Architecture concept is extremely valuable. It allows companies to see the bigger picture of their data, by capturing the fresh analysis and the historical results in close-to-real-time—and that is vital when trying to stay ahead of changing business conditions.
With the new rise of complex event processing, streaming ingestion, and streaming analytics technologies, there is a real opportunity to simplify and improve on the Lambda while preserving its key virtues. Our own Guavus SQLstream fits perfectly across all of Lambda’s three layers, with a unified architecture that can reduce the number of moving pieces, power per-event decision making, and add ad hoc query capabilities to fast data implementations.
Guavus SQLstream supports streaming ingestion, real-time integration and analytics, and continuous load to downstream systems such as Hadoop. The inclusion of this one single component replaces the batch and serving layers with an extended speed layer capable of:
- Instant and limitless aggregation, filtering, and integration capabilities, so that data entering the system can be meshed with stored data and operated upon continuously and in real time, on a record-by-record basis.
Benefit: continuous integration of all data, in-motion and at-rest, structured and unstructured.
- Supporting an interactive development environment built on a two-way data processing model.
Benefits: interactive system ready to change on-the-fly; automated actions.
- Using only SQL to enhance performance and scale for millions of events per second per core.
Benefits: support for ad hoc queries; seamless integration; reduced implementation costs.
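To make the idea of record-by-record processing with a two-way model concrete, here is a generic Python sketch of a sliding-window aggregate that returns a decision for every event. It is not SQLstream code and does not reflect that product's API; `SlidingWindowSum`, the window length, and the threshold are all assumptions chosen for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowSum:
    """Per-key running sum over a time window, updated on each record."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = defaultdict(deque)  # key -> deque of (timestamp, amount)
        self.totals = defaultdict(float)  # key -> sum of amounts in the window

    def on_event(self, key, timestamp, amount):
        """Update the aggregate and return a decision for this event."""
        window = self.events[key]
        window.append((timestamp, amount))
        self.totals[key] += amount
        # Evict records that have fallen out of the window.
        while window and window[0][0] <= timestamp - self.window_seconds:
            _, expired = window.popleft()
            self.totals[key] -= expired
        # Two-way: the caller gets an answer while the event is in flight.
        return "block" if self.totals[key] > self.threshold else "allow"
```

With a 60-second window and a threshold of 100, amounts of 60 at t=0 and 50 at t=30 for the same key yield "allow" and then "block"; an amount of 20 at t=100 yields "allow" again because both earlier records have expired. Unlike a batch view, the aggregate is never rebuilt from the full history.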
Conclusion
Lambda Architecture is a powerful analytics framework that serves queries from both fast and big data. However, the model emerged from a need to execute OLAP-type processing faster, without even considering—or being ready for—the new class of applications that require real-time, per-event decision-making.
Using a streaming analytics platform for Lambda simplifies and enhances the speed layer by reducing the number of components needed, and places applications in front of the event stream, opening it up for capturing value from all data, all the time.