In our recent webinar with DataRobot, Edan Kabatchnik, Vice President of Product Management, Guavus Platform at Guavus, explains why SQL is the best option for high-performance streaming analytics.
01. Time series
The latest SQL standard (SQL:2011) supports time-series data processing with operators such as the WINDOW clause.
02. Scale-ups
SQL supports the optimization of continuous queries over a large set of cores without requiring the developer to write multi-threaded code or understand locks, concurrency primitives or semaphores.
03. Scale-outs
For scale-out, SQL for stream processing supports the optimization of distributed SQL queries over any number of servers with optimization for low latency and high throughput.
04. Continuous
Streaming SQL queries execute continuously, processing data as they arrive over row-based or time-based windows. A streaming SQL query never completes; results are emitted incrementally as data arrive, enabling millisecond response times.
A streaming query is a continuous, standing query that executes over streaming data. Streaming queries are similar to database queries in how they analyze data; they differ by operating continuously on data as they arrive and by updating results incrementally in real time.
This simple example illustrates a SQL query executing over a stream of weather sensor data. Unlike a traditional query over stored data, the query executes continuously, processing each arriving record immediately, and outputting a result as the condition is met.
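A minimal sketch of such a query in SQLstream-style streaming SQL, assuming a hypothetical input stream WeatherStream with CITY and TEMP columns (the stream and column names, and the 100-degree threshold, are illustrative):

```sql
-- Emit a record whenever a reading exceeds the threshold.
-- WeatherStream, CITY and TEMP are assumed, illustrative names;
-- ROWTIME is the timestamp attached to each arriving row.
SELECT STREAM ROWTIME, CITY, TEMP
FROM WeatherStream
WHERE TEMP > 100;
```

Because the WHERE clause is evaluated on each row as it arrives, no batching or polling is involved.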
The result is a record containing the time, location and temperature reading for every input reading matching the selection criteria. The latency from the time the data record was emitted by the sensor to the alert being issued can be as little as a few milliseconds, even where data arrive at millions of records per second. Queries such as this are useful for generating real-time alerts from input data streams.
There are two types of basic operations on streaming data – streaming aggregation (tumbling windows) and windowed analytics (sliding windows).
Streaming aggregation queries use aggregate functions (including AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP and VAR_SAMP) to output a record with aggregated values for each GROUP of input records.
If we now want to find the minimum and maximum temperature recorded anywhere each minute (globally, regardless of city), the minimum and maximum temperature can be calculated using the aggregate functions MIN and MAX. To indicate that we want this information on a per-minute basis, we use the FLOOR function to round each record’s (or row’s) rowtime down to the nearest minute.
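A sketch of this tumbling-window aggregation, again over the assumed WeatherStream:

```sql
-- Per-minute global aggregates. FLOOR rounds each row's ROWTIME
-- down to the minute boundary; the GROUP closes (and a result row
-- is emitted) once the minute has elapsed.
SELECT STREAM
    FLOOR(WeatherStream.ROWTIME TO MINUTE) AS MINUTE_START,
    MIN(TEMP) AS MIN_TEMP,
    MAX(TEMP) AS MAX_TEMP,
    AVG(TEMP) AS AVG_TEMP
FROM WeatherStream
GROUP BY FLOOR(WeatherStream.ROWTIME TO MINUTE);
```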
The result is a stream of new records, one per minute, specifying the maximum, minimum and average readings recorded in that minute. This type of query is particularly useful for generating periodic reports with minimal latency from the input data streams.
The second basic operation for stream processing is windowed analytics, where an output record (or row) is produced for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row-based.
Let’s say we want to find the minimum and maximum temperature recorded in the 60-second period prior to any given reading, globally, regardless of city. To do this, we define a one-minute time window using RANGE INTERVAL '1' MINUTE PRECEDING, and use it in the OVER clause for the analytic functions.
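A sketch of this sliding-window query over the assumed WeatherStream:

```sql
-- Sliding 60-second window: one output row is produced per input
-- row, with the window aggregates updated incrementally.
SELECT STREAM ROWTIME, CITY, TEMP,
    MIN(TEMP) OVER W AS MIN_TEMP_LAST_MINUTE,
    MAX(TEMP) OVER W AS MAX_TEMP_LAST_MINUTE,
    AVG(TEMP) OVER W AS AVG_TEMP_LAST_MINUTE
FROM WeatherStream
WINDOW W AS (RANGE INTERVAL '1' MINUTE PRECEDING);
```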
The query may look similar to the streaming aggregation query above, but the result is very different. In this case, an output record (or row) is generated for each and every newly arriving input record, specifying the updates to the minimum and maximum temperatures over the preceding 60 seconds, plus an incrementally updated average for the temperature over that period. Windowed analytics queries are the most common in stream processing solutions, and the ability to update answers incrementally is key to stream processing performance.
Let’s assume we want the same information but broken down by CITY. To do this, we add a PARTITION BY clause on CITY to the window specification. The result will be an output record for each and every new input record for the specific PARTITION, that is, for each CITY.
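The same sliding-window sketch, extended with a PARTITION BY clause (schema assumed as before):

```sql
-- Partitioned sliding window: each CITY value maintains its own
-- independent 60-second window, forming one logical output stream
-- per city.
SELECT STREAM ROWTIME, CITY, TEMP,
    MIN(TEMP) OVER W AS MIN_TEMP_LAST_MINUTE,
    MAX(TEMP) OVER W AS MAX_TEMP_LAST_MINUTE
FROM WeatherStream
WINDOW W AS (PARTITION BY CITY
             RANGE INTERVAL '1' MINUTE PRECEDING);
```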
In effect, the PARTITION BY operation can be used to multiplex multiple logical output streams from a single input stream, in this case one logical output stream for each CITY in North America. Operations against partitioned data streams apply to all partitions, enabling the developer to write a single code statement and apply it to a large number of data streams. Window partitioning also allows queries to execute in parallel across multiple cores in a server, multiple servers in a data center, or multiple data centers in a cloud.
The ability to correlate and report over data arriving from multiple sources simultaneously is an essential stream processing requirement. Inner, outer and cross JOINs are supported (as well as UNION and UNION ALL). The following example illustrates a JOIN of two streaming data sources, one for stock orders, and one for the resulting stock trades.
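One way to express such a time-bounded streaming join, assuming hypothetical Orders and Trades streams keyed by an ORDER_ID column:

```sql
-- Match each Order to a Trade that arrives within one minute of
-- the Order. Stream and column names are illustrative.
SELECT STREAM o.ROWTIME, o.ORDER_ID, o.TICKER, t.PRICE
FROM Orders AS o
JOIN Trades AS t
  ON o.ORDER_ID = t.ORDER_ID
 AND t.ROWTIME BETWEEN o.ROWTIME
                   AND o.ROWTIME + INTERVAL '1' MINUTE;
```

The time bound on t.ROWTIME is what makes the join feasible over unbounded streams: the engine only needs to retain one minute of state per side.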
This query outputs a stream of all Orders matched by a Trade within one minute of the Order being placed. The output stream is sorted by rowtime, in this case, the rowtime from the Orders stream. Let’s say we wanted the exceptions, that is, the unmatched Trades and Orders. This requires an OUTER JOIN as illustrated below.
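A sketch of the exception query using a FULL OUTER JOIN, with the same assumed Orders and Trades schema:

```sql
-- Emit Orders with no Trade within the one-minute window (Trade
-- columns NULL) and Trades with no matching Order (Order columns
-- NULL). Names are illustrative.
SELECT STREAM o.ROWTIME, o.ORDER_ID,
       t.ORDER_ID AS TRADE_ORDER_ID
FROM Orders AS o
FULL OUTER JOIN Trades AS t
  ON o.ORDER_ID = t.ORDER_ID
 AND t.ROWTIME BETWEEN o.ROWTIME
                   AND o.ROWTIME + INTERVAL '1' MINUTE;
```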
The output in this case is a stream of Orders where the Trade has not occurred within a one-minute time window, along with Trades that have no matching Order.
JOINs are essential for real-world stream processing applications, enabling time and geo-spatial patterns to be identified over different types and formats of data.
SQL offers a wide range of function libraries for manipulating streaming data: aggregation functions including AVG, COUNT, MAX, MIN and SUM; mathematical functions such as MOD, LN and POWER; a statistical analysis library including STDDEV_POP, STDDEV_SAMP, VAR_POP and VAR_SAMP; and a general library of functions such as ABS, CAST, CEILING, FLOOR, LOWER, UPPER, FIRST_VALUE and LAST_VALUE.
A full list of libraries and functions is available on the documentation pages at www.sqlstream.com/docs
The SQL UDX (User Defined Transform) mechanism enables streaming operations to be written in Java, and for third party libraries and even entire systems to be included in a streaming SQL query. For example, the following illustrates the syntax of including a custom stream processing function written in Java into a streaming SQL query.
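A sketch of what such an invocation can look like, where the UDX name (parseLogs) and the LogStream input are hypothetical placeholders for a developer-supplied Java transform:

```sql
-- Apply a custom Java UDX to a stream. TABLE(...) wraps the UDX
-- and CURSOR(...) feeds it the input rows; parseLogs and
-- LogStream are illustrative names, not built-ins.
SELECT STREAM *
FROM TABLE(parseLogs(CURSOR(SELECT STREAM * FROM LogStream)));
```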
New UDXes are straightforward to develop and deploy, with examples and developer documentation available at sqlstream.com/docs, and we offer a range of industry-specific operators, built-in libraries and third-party system support out of the box:
Guavus SQLstream’s mission is to make real-time stream analytics easy to use and own with a one-stop-shop solution that performs the best, has the widest footprint, never turns off, and can be developed and customized by data scientists and engineers alike.
©2021 Guavus, Inc. All Rights Reserved. SQLstream® is a registered trademark of Guavus, Inc.