We participated in the “Architecting Big Data Systems for Speed” panel at the E2 Conference. It was a great event, and a great opportunity to discuss technology in a business context. The panel offered a range of perspectives, with other panelists from Translattice and Oracle’s NoSQL division. A number of interesting topics emerged, including the meaning of real-time in a Big Data world, and how enterprise architectures are evolving to deliver low latency as well as volume, velocity and variety.
What is real-time Big Data?
People often seem disappointed when Hadoop turns out not to be as real-time as they hoped, which is somewhat unfair. Hadoop was designed to deliver much faster results from petabytes of stored data. So if real-time means faster answers from unstructured stored data, great. And let’s face it, this is a significant improvement on the previous BI norm of 24-hour turnaround, if indeed you could tailor your RDBMS to store event data at all. However, if you consider real-time to mean low latency actions measured from the time of data creation, that’s a different story. All this data arrives over time, so there’s obviously a velocity factor, and for some applications, near real-time answers may be possible. For example, Twitter peaks at around 10,000 tweets per second, and a number of Big Data apps claim real-time Twitter analytics.
Big Data is valuable, but only for a short period of time
The world is moving on and the kinds of low latency requirements that are now emerging have significantly higher data rates. For example:
- Telematics and M2M apps can generate 5 to 20 million car, transport network and environmental records per second.
- Real-time Cybersecurity monitoring can be several million records per second.
- Telecommunications, where a 4G performance app can generate 10 million records per second, and IP monitoring can be many times that.
There are a couple of new requirements evident in these industry sectors:
- the need to collect and analyze many different sources and types of data
- the need to drive real-world actions
- the need for predictive and prescriptive analytics
- and most importantly, low latency actionable intelligence
How real-time can we get with different Big Data technologies?
RDBMS – a few minutes
Bizarrely, the traditional RDBMS has a decent claim to some limited real-time credentials, so long as the bars for velocity and latency are not set too high. The data collection, ETL and data loading process introduces significant delay, and scale out is limited by the difficulties of distributing processing across multiple servers. This means the cost of delivering real-time performance ramps up significantly as requirements for velocity increase and latency decreases. But when requirements are modest, latency of a few minutes may be possible.
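The batch pattern described above can be sketched as a periodic poll: events accumulate, land in a bulk load, and the report query re-executes on a schedule, so worst-case answer latency is roughly the load interval plus query time. A minimal illustration using an in-memory SQLite database (the table and column names are invented for the sketch):

```python
import sqlite3

# In-memory stand-in for a warehouse table; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, value REAL)")

def load_batch(rows):
    """ETL step: events accumulate, then land in one bulk load."""
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()

def run_report():
    """The whole query re-executes over stored data on every refresh."""
    return conn.execute("SELECT COUNT(*), AVG(value) FROM events").fetchone()

# Batch 1 loads, then the report runs; newer events remain invisible
# until the next load-and-query cycle.
load_batch([(1, 10.0), (2, 20.0)])
print(run_report())   # (2, 15.0)
load_batch([(3, 30.0)])
print(run_report())   # (3, 20.0) -- only after re-running the query
```

In a real deployment the load interval is minutes rather than one function call, which is where the "few minutes" latency floor comes from.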
Hadoop – a few hours
With Hadoop, the story is somewhat different. The latency introduced by ingest and execution of MapReduce jobs is significant, on the order of twenty minutes to several hours for real-world scenarios. The cost of scaling for velocity (and therefore greater volume) remains significant, but infrastructure costs scale better than for comparable RDBMS solutions. And of course, Hadoop suffers from the same problem as the RDBMS: queries must be re-executed in order to update answers when new data arrive.
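In MapReduce terms, every refreshed answer means re-running the full job over the stored dataset, new records included. A toy map/reduce count in plain Python (not the Hadoop API; the record fields are invented) illustrates the recompute-from-scratch model:

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, 1) pairs, as a Hadoop mapper would.
    for rec in records:
        yield rec["type"], 1

def reduce_phase(pairs):
    # Sum counts per key, as a reducer would.
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)

stored = [{"type": "gps"}, {"type": "gps"}, {"type": "engine"}]
print(reduce_phase(map_phase(stored)))   # {'gps': 2, 'engine': 1}

# One new record arrives: the entire job must run again over all data.
stored.append({"type": "gps"})
print(reduce_phase(map_phase(stored)))   # {'gps': 3, 'engine': 1}
```

On petabytes rather than four records, that full re-scan is what pushes refresh latency into minutes or hours.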
Streaming – a few milliseconds
Streaming platforms offer the lowest latency, on the order of milliseconds, by processing data streams in-memory over moving time windows, without the overhead of having to persist the data first. The major advantage is that answers are updated continually as soon as new data arrive, without having to re-execute the queries. SQLstream’s technology is also very resource efficient, and scales out well over multiple servers, making it by far the most cost effective solution where low latency answers are required from high velocity data.
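The moving-window idea can be sketched in a few lines: each arriving event updates the running answer incrementally, and events older than the window are evicted. This is a deliberate simplification of what a streaming SQL engine does, with invented names and a hypothetical 60-second window:

```python
from collections import deque

class SlidingWindowAvg:
    """Continuous average over the last `window` seconds of events."""
    def __init__(self, window):
        self.window = window
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def on_event(self, ts, value):
        # Incremental update: no re-scan of stored data, no re-query.
        self.events.append((ts, value))
        self.total += value
        # Evict events that have slid out of the time window.
        while self.events and self.events[0][0] < ts - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)   # answer updates per event

w = SlidingWindowAvg(window=60)
print(w.on_event(0, 10.0))    # 10.0
print(w.on_event(30, 20.0))   # 15.0
print(w.on_event(90, 30.0))   # 25.0 -- the ts=0 event has expired
```

Because each event costs a constant amount of work instead of a full re-query, latency stays in milliseconds regardless of how much history has flowed past.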
The analogy here is viewing a webpage on the internet. With an RDBMS or Hadoop-based solution, you’re continually hitting the refresh button to view updated information. With streaming, there’s no need for a refresh button: streaming queries run continuously, updating answers automatically when new data arrive, a bit like having your webpage update instantaneously as soon as anything of relevance to that page, anywhere on the web, changes.
Beyond Streaming Analytics – Driving Real-time Actions
SQLstream’s vision is to go much further than the generation of real-time analytics. In a real-time world it’s essential to be able to drive change automatically. For the examples discussed (telematics, telecoms, cybersecurity), what’s important is to drive action in real-time: flagging bank accounts under attack, pushing updates to drivers, changing mobile service quality on the fly, and deploying predictive analytics to detect a business exception in advance. For the web page analogy, this means not going to the internet at all: your bank account was never hacked in the first place, you don’t need a tow truck because your car warned of ice on the road based on data from other vehicles, and your streaming video never dropped because the QoS was corrected.
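The action-driving pattern described above boils down to attaching a handler to a continuous condition rather than to a report. A minimal sketch (the threshold, field names and alert action are all invented for illustration, not SQLstream’s API):

```python
def make_alerter(threshold, action):
    """Return a per-event handler that fires `action` on a breach."""
    def on_event(event):
        if event["failed_logins"] >= threshold:
            action(event)   # e.g. flag the account, throttle the source IP
    return on_event

alerts = []
handle = make_alerter(threshold=3, action=lambda e: alerts.append(e["account"]))

stream = [
    {"account": "a1", "failed_logins": 1},
    {"account": "a2", "failed_logins": 5},   # breach -> immediate action
]
for event in stream:
    handle(event)
print(alerts)   # ['a2']
```

The point is that the action fires as the offending event arrives, not after the next scheduled report run.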