Parallel scheduling of stream execution in SQLstream

SQL is a declarative language – a SQL query is a specification for the result, it’s neither a recipe nor a program to produce the results. A traditional relational database query returns a set of rows, the ResultSet. A streaming SQL query in SQLstream returns a stream of rows. That is, the ResultSet may never end. In a traditional relational database query, all the rows are fetched, and the ResultSet scans them. With a relational streaming query, the result rows do not exist as yet – as time goes by they come into existence as arriving data are processed.

However, just like any relational database, the SQLstream stream computing Server has two main components:

  1. The query engine or planner, calculates the most efficient plan to produce the requested results – this is the query plan.
  2. The data engine or kernel, executes the query plan to produce the results. The scheduler controls the execution process.

Streaming dataflow graphs

Executing a query means computing the results from the inputs. For a streaming query that means processing the rows in the input streams as they arrive. The execution is organized as a dataflow graph, that is, a mathematical directed graph of nodes and arrows, where the nodes represent elementary operations on data, and the edges into and out of a node represent the input and output data streams. In effect, an assembly line that produces results, where the nodes are the machines or stages on the line.

A query plan defines one of these data flow graphs. The executor runs the data through the graph: it is responsible for executing the nodes, where each node consumes rows arriving on input edges and produces rows on the output edges. Of course each output edge is often the input of a downstream node.

Multiple, connected query plans

In a traditional static database, each query plan is independent and transitory, and operates against persistent tables and indexes. In a relational streaming platform, the query plans last forever and are interconnected. Although streams are used just like tables in SQL, they are not persistent, in fact they have no contents at all, and can be as rendezvous points that accept input rows and pass them on to their output consumers.

Now if the data flow graph were a physical system — say, a collection of transparent plastic straws with colored water flowing through them — then all the processing would be happening simultaneously. However, for the software abstraction of the streaming straw pipelines, it’s not practical or necessary to run all the nodes at the same time. It is the scheduler that manages this network of interconnected query plans, and when and how to execute each node.

In a traditional, static database, the result of a query is a set of rows that are computable all at once. The executor can give good performance by running one node at a time, pushing batches of rows through the graph. A streaming database is different. When the inputs are streams of recent events, arriving in real time, it’s important to produce the outputs fast enough so that the result rows are timely.

The execution works by pushing outputs, not by pulling inputs, and it means executing several nodes at the same time, whenever possible. This requires a finer management of the execution objects and the ability to schedule parallel execution on the nodes.

Parallel scheduling of stream execution

In SQLstream, multiple, interconnected query plans are being executed at the same time. Together they constitute a large dataflow graph in which each node is a mini data processing machine that performs a simple operation on its input data, and passes its output to the next node.

The scheduler is responsible for managing the interconnected dataflow graph. It keeps track of the status of each node: at any moment, some are running, some are ready to run, some are waiting for more input, some are waiting for their output to be consumed. Each node is allowed to run for a quota of time, and where possible nodes are selected to execute in parallel as separate threads.

The SQLstream scheduler may not want to be fair: some branches in the graph (some streams) may be more important, some may need high throughput, some may need lower latency. The application designer decides, and the SQLstream scheduler delivers.

Next time …

This is the first is a series of blogs discussed both the principles and practical examples of parallel stream execution. The next blog in the series will look at some real world examples, and how parallel execution is essential to deliver both high throughput and low latency requirements.