In less than a month, we’ll be at Strata in San Jose again, and again there will probably be a lot of talk around streaming. Real-time data processing has been a hot topic for years now, and many have been blogging about it in a SQL context; see, for example, Srinath Perera’s blog on using SQL for real-time streaming analytics. Needless to say, we certainly agree with his sentiment on the need for SQL in streaming analytics. There are many reasons why: some technical (such as the power of the language), and some more business-oriented (such as the suitability of SQL platforms for reliable enterprise deployments, and their better performance and significantly lower overall cost when compared with open source and proprietary SQL-like platforms). But let’s expand a bit.
SQL was invented for (streaming) data analytics
SQL has long been the realm of structured queries over static data. However, SQL as a language never implied the need for a structured data store; in fact, it is even more compelling when deployed as continuous queries over data streams. Some of the reasons include:
· SQL is a declarative and expressive language that enables sophisticated real-time analytics to be expressed using simple queries.
· SQL is pervasive: most enterprise developers know and understand SQL, and for those who don’t, it’s easy to learn.
· SQL simplifies real-time analytics by encapsulating the underlying complexity of the data management operations.
· SQL queries can be optimized automatically over distributed systems, delivering significantly better performance (100X) than open source frameworks such as Storm. No need for skills-intensive hand-crafting of platform performance.
· SQL applications can be built in a fraction of the time required for low-level open source platforms and proprietary SQL-like platforms – a significant cost saving plus much faster time to implementation.
· SQL platforms can be updated on the fly without having to take them down and recompile – something that is essential for enterprise deployments and an area where CEP and open source platforms struggle.
· SQL supports user defined operations that can be written in Java and deployed in a SQL query – this covers the small percentage of operations that cannot be readily expressed in SQL for whatever reason.
· And finally, SQL is easy to auto-generate, which enables platforms such as StreamLab for GUI-driven analysis of streaming data and visualization of streaming analytics.
SQL and SQL-like are not created equal
I would argue, however, that these advantages hold only for true SQL platforms, and not for Java or proprietary SQL-like platforms, where the latter build SQL-style queries using Java (or another language’s) constructs, which rather defeats the purpose and negates the benefits. There’s a lot more to a streaming SQL platform than having a ‘SELECT’ construct!
SQL is a Standard
Implementing the SQL standard means adhering to the same SQL that could be deployed in Oracle or DB2, but of course as continuous queries and, for the most part, processing streams of unstructured data rather than structured stored data. SQL support means supporting the SQL data types, operators, statements and clauses. SELECT ... FROM (stream) ... WHERE ... is a simple example, but adding clauses such as MERGE, JOIN, UNION, OVER, ORDER BY, GROUP BY and PARTITION BY adds powerful real-time correlation and query capability on data streams, particularly when combined with the WINDOW operator for processing data streams over time windows (sliding windows with WINDOW, and tumbling windows with GROUP BY).
SQL versus proprietary constructs, automatic query optimization versus manual
In terms of the expressiveness of data management languages, there are indeed different ways to crack a nut, each with pros and cons. It is true that some operations may be expressed more elegantly in other languages, or preferably, as a User Defined Operator (UDX) as a SQL extension. Part of the discussion is the expressiveness of a language, and part is the underlying execution environment for query processing. The latter is a strength of the SQL language. SQL is mathematically tractable and designed for dynamic query planning and optimization. This enables query execution to be optimized automatically over all cores on a server and across all servers in the cluster, eliminating the need for manual and often poor optimization of queries.
This is why a streaming SQL application can execute over 100X faster than the equivalent Storm or Spark Streaming application for example. SQL optimization also supports dynamic optimization of a streaming data processing pipeline without the need to rebuild or redeploy the application, in contrast to Java, Clojure and C++ implementations. Dynamic updates mean applications can be updated on the fly for 24×7 operations. So there are other considerations over and above the language syntax and semantics.
Some streaming SQL examples
First, a widely recognized strength of SQL: streaming aggregation. There are two variants: streaming aggregation over tumbling windows, and windowed analytics over sliding windows. For example, if we wanted to analyze a weather data stream continuously and find the minimum, maximum and average temperatures recorded each minute, we could use the following continuous SQL aggregation query. The output is a stream of new records, one per minute, giving the minimum, maximum and average recordings.
SELECT STREAM
  FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE) AS FLOOR_MINUTE,
  MIN(TEMP) AS MIN_TEMP,
  MAX(TEMP) AS MAX_TEMP,
  AVG(TEMP) AS AVG_TEMP
FROM WEATHERSTREAM
GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE);
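To make the tumbling-window semantics concrete outside a streaming SQL engine, here is a minimal Python sketch of the same aggregation (the event data and function name are illustrative assumptions, not SQLstream code):

```python
from collections import defaultdict

def tumbling_minute_aggregates(events):
    """Group (epoch_seconds, temp) events into one-minute buckets and emit
    (minute_start, min, max, avg) per bucket -- the tumbling-window analogue
    of GROUP BY FLOOR(ROWTIME TO MINUTE)."""
    buckets = defaultdict(list)
    for ts, temp in events:
        buckets[ts // 60].append(temp)          # floor the timestamp to the minute
    return [(minute * 60, min(temps), max(temps), sum(temps) / len(temps))
            for minute, temps in sorted(buckets.items())]

events = [(0, 10.0), (30, 14.0), (59, 12.0),    # first minute
          (60, 20.0), (90, 22.0)]               # second minute
print(tumbling_minute_aggregates(events))
# → [(0, 10.0, 14.0, 12.0), (60, 20.0, 22.0, 21.0)]
```

Note that exactly one output record is produced per minute, regardless of how many input events arrived in that minute.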
However, if we wanted a continuous output stream, with the minimum, maximum and average values updated for each and every input data event over the preceding minute, we would execute a sliding window operation using a continuous SQL query as follows.
SELECT STREAM
  MIN(TEMP) OVER W1 AS WMIN_TEMP,
  MAX(TEMP) OVER W1 AS WMAX_TEMP,
  AVG(TEMP) OVER W1 AS WAVG_TEMP
FROM WEATHERSTREAM
WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING);
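The contrast with the tumbling window can again be sketched in Python: one output per input event, aggregated over the preceding minute (the data and function name are illustrative assumptions):

```python
from collections import deque

def sliding_minute_aggregates(events):
    """For each (epoch_seconds, temp) event, emit (ts, min, max, avg) over the
    preceding 60 seconds -- the analogue of OVER a sliding
    RANGE INTERVAL '1' MINUTE PRECEDING window."""
    window = deque()                            # (ts, temp) pairs still in the window
    out = []
    for ts, temp in events:
        window.append((ts, temp))
        while window[0][0] < ts - 60:           # expire rows older than one minute
            window.popleft()
        temps = [t for _, t in window]
        out.append((ts, min(temps), max(temps), sum(temps) / len(temps)))
    return out

print(sliding_minute_aggregates([(0, 10.0), (30, 14.0), (90, 12.0)]))
# → [(0, 10.0, 10.0, 10.0), (30, 10.0, 14.0, 12.0), (90, 12.0, 14.0, 13.0)]
```

Here the third output reflects only the events at 30 and 90 seconds, since the event at time 0 has slid out of the one-minute window.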
Expressiveness versus query performance
These are examples where the power and expressiveness of the SQL syntax are obvious. To balance this, consider an operation frequently held up as a case where proprietary languages may offer better syntax: ‘followed by’ operations. ‘Followed by’ operations are more prevalent in CEP and financial trading than in industrial data stream processing, but still represent a valid analytic. We’ve seen both approaches used extensively: (1) as a SQL query with a user-defined operator (UDX), and (2) as a standard SQL query. The former, I would say, is no better or worse than any syntax on any platform, and arguably enables the alert_on(A then B then C) type syntax to be captured more concisely. However, the execution performance of the query will improve significantly if the query planner and optimizer are allowed to do their work, for example, using a streaming pipeline with NTH_VALUE to detect a sequence of events over a one-minute time window:
CREATE OR REPLACE VIEW view1 AS
SELECT STREAM
  eventType AS currentEvent,
  NTH_VALUE(eventType, 2) FROM LAST OVER w AS previousEvent,
  NTH_VALUE(eventType, 3) FROM LAST OVER w AS pPreviousEvent
FROM eventStream
WINDOW w AS (PARTITION BY p RANGE INTERVAL '1' MINUTE PRECEDING);
CREATE OR REPLACE VIEW view2 AS
SELECT STREAM *
FROM view1
WHERE currentEvent = eventA AND previousEvent = eventB ...
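The effect of this pipeline can be sketched in Python: keep the last few event types per partition key and alert when they match the pattern in order (the stream shape and function name are illustrative assumptions; the time-window expiry of the SQL version is omitted for brevity):

```python
from collections import defaultdict, deque

def detect_sequence(events, pattern):
    """Emit (partition_key, ts) whenever the last len(pattern) event types in
    that partition match `pattern` in order -- a rough analogue of the
    NTH_VALUE ... FROM LAST pipeline above."""
    history = defaultdict(lambda: deque(maxlen=len(pattern)))
    alerts = []
    for key, ts, event_type in events:
        h = history[key]                        # per-partition recent history
        h.append(event_type)
        if list(h) == list(pattern):
            alerts.append((key, ts))
    return alerts

events = [("p1", 1, "A"), ("p1", 2, "B"), ("p2", 3, "A"), ("p1", 4, "C")]
print(detect_sequence(events, ["A", "B", "C"]))
# → [('p1', 4)]
```

Note that the intervening event on partition p2 does not disturb the A-then-B-then-C match on p1, mirroring the PARTITION BY clause.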
You could argue that an alert_on(A then B then C) construct is more concise, but SQL offers powerful and elegant constructs that are general, can be optimized dynamically for the best performance on the current underlying hardware configuration, and can be combined with other SQL operations such as sliding windows and GROUP BY. Even a simple COUNT operator adds an extra dimension to the query, enabling patterns to be detected based on both temporal order and volume. Perhaps there is an opportunity to lobby for this to be included as a set of standard higher-level operators in the SQL standard.
So that’s it. An overview of SQL for streaming data management, warts and all. If you’re looking for more information, contact us or visit our doc section here.