Reading Streaming Data from Java – Part 1: The Three Rules of Streaming Data
One of the great advantages of SQLstream as an analytical platform is that it uses the most popular, standardized language for data analysis, SQL. SQLstream worked to make only the minimum number of extensions to SQL necessary to encompass the streaming data paradigm, so that most streaming SQL pipelines look almost indistinguishable from SQL for reading static relational data. This enables data analysts to leverage virtually all of their existing SQL skills in the streaming context.
Similarly, SQLstream felt it was important to make the streaming environment feel familiar and productive to application developers as well, so SQLstream supports the standard JDBC interface for using streams, again with just the minimum extensions necessary to encompass streams.
This post assumes a basic familiarity with JDBC and its main components: connections, statements, and result sets. First we’ll look at these in their usual tabular context, then see what it takes to extend the model to streaming data. All the data items come from the SALES example schema that comes with SQLstream.
Reading JDBC Data
In a database world, the pattern for reading data is quite standardized: Connect to the database, execute a query, and read each row that comes back until there are no more rows. For example,
Connection c = getConnection();
try {
    Statement s = c.createStatement();
    ResultSet rs = s.executeQuery("SELECT * FROM SALES.EMPS");
    while (rs.next()) {
        System.out.println(rs.getString("NAME") + " " + rs.getString("EMPID"));
    }
    rs.close();
    s.close();
    c.close();
} catch (SQLException se) {
    // handle or log the exception
}
This simple example loops through the entire EMPS table, printing the name and employee ID number for each row in the table, then finishes. The key to this model is that the result set is finite. Even if this were a very large table, the loop would eventually process every row. So you can handle reading the data from a table as a monolithic step in a sequence of procedures.
The Challenge of Streams
In a streaming data environment, however, you have to change a few of your basic assumptions. In particular, I have found three Rules of Streaming that dictate how to write client code:
- There always might be more data.
- You never know when the next row might arrive.
- The rate of the rows matters.
The same pattern shown earlier for reading finite tables will work for streams, as long as you don’t expect your application to do anything else. For a streaming ResultSet, the next() method only returns false when the stream closes. In many applications, that might never happen, or at least might not happen for weeks or months or longer, so clearly you cannot simply wait for each stream to end.
This is particularly critical in an application such as SQLstream Studio, where a developer needs to be able to edit the definitions of objects and at the same time be able to watch data flowing in existing streams. These streaming data views – known as Inspect windows – have to function in an event-driven, multitasking environment. There can be a virtually unlimited number of them active at any given time, along with editors, console views, and other dynamic content. So nothing should block the updates of other views. And for a little added complexity, Studio also needs to be able to handle non-streaming items as well, so preferably the same code should handle either tables or streams.
Studio also has to deal with a number of other requirements that are specific to the development environment, such as how to manage updating a human-readable window of maybe 20 or 30 rows at a time from a stream that might be flowing at many thousands of rows per second. For now, we will just focus on the tasks common to handling streams in any application environment.
Reading in the Background
As with any long-running task, the solution involves partitioning the work into threads. Because of Rule #1 (“There always might be more data.”), you should just assume that reading from a stream needs to happen in a background thread. There is essentially never a scenario in which you want to wait for an entire stream to be read before proceeding to the next task.
To some extent how you implement your background stream-handling tasks will depend on the environment you are working in. Variations in the data rate and the required responsiveness of the application might cause you to make some different choices.
One good rule of thumb is to do as little processing as possible in the stream-reading loop. Slow processing of incoming rows can result in “back-pressure” in the data pipeline. As a result, it’s best to read rows from the ResultSet as expeditiously as possible, handing off the data to other threads for processing.
Here, for example, is a simple thread to read rows from the SALES.BIDS stream:
class BidReader extends Thread
{
    @Override
    public void run()
    {
        try {
            Connection c = getConnection();
            Statement s = c.createStatement();
            ResultSet rs = s.executeQuery("SELECT STREAM * FROM SALES.BIDS");
            while (!interrupted() && rs.next()) {
                // read columns and put into work queue for processing thread
            }
            // close rs, s, and c
        } catch (SQLException se) {
            // handle or log the exception
        }
    }
}
Note that the loop doesn’t depend solely on the ResultSet’s next() method, but also tests whether the thread has been interrupted. You probably wouldn’t use this exact mechanism (I prefer to override Thread.interrupt() and set a boolean flag), but it shows you succinctly that you need to be aware of more things than just whether there is more data to read.
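Purely as an illustration of that preferred approach, building on the BidReader above (the stopRequested field name is just an example, not part of any SQLstream API), the override might look like this:

class BidReader extends Thread
{
    // Volatile so the reading loop sees a change made from another thread.
    private volatile boolean stopRequested = false;

    @Override
    public void interrupt()
    {
        stopRequested = true;   // remember that a stop was requested
        super.interrupt();      // still interrupt any interruptible blocking call
    }

    @Override
    public void run()
    {
        try {
            Connection c = getConnection();
            Statement s = c.createStatement();
            ResultSet rs = s.executeQuery("SELECT STREAM * FROM SALES.BIDS");
            while (!stopRequested && rs.next()) {
                // read columns and put into work queue for processing thread
            }
            // close rs, s, and c
        } catch (SQLException se) {
            // handle or log the exception
        }
    }
}

The controlling thread (a UI event handler, say) simply calls reader.interrupt() to ask the reader to stop; the flag is checked before each new row is read.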
Another thing to keep in mind is that ResultSet.next() blocks until either more data arrives or the stream closes. That’s yet another reason to have this happening in a background thread.
You’ll notice that the code snippet references a “processing thread.” That’s because of Rule #2 (“You never know when the next row might arrive.”). It’s generally good to decouple the reading code from the processing code. One easy model is to have the reader read each row, then add the row to a shared queue structure, where the processing thread can pick up the rows and process them asynchronously. This minimizes the chances of a slow processing step causing the reader to slow down and potentially push back up the stream. To the extent possible, you’d like your reader to be sitting and waiting for the next row when it arrives, rather than constantly trying to handle a backlog of incoming rows. If you have to have a backlog, you want it to be in your application, not in the stream server, because a backlog in the server will slow everyone down, including other applications trying to read from the same streams.
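As a minimal sketch of that handoff (the class names, the generic Object[] rows, and the queue capacity here are illustrative assumptions, not part of the SQLstream API), the reader can put each row into a shared BlockingQueue that a separate processing thread drains:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BidPipeline
{
    // Bounded queue shared by the reader and the processor. The capacity is an
    // arbitrary choice; keeping the queue bounded keeps any backlog inside the
    // application rather than backing up into the stream server.
    private final BlockingQueue<Object[]> workQueue =
        new ArrayBlockingQueue<Object[]>(10000);

    // Called from the reader's while (rs.next()) loop: copy the row's columns
    // into a plain array and hand it off as quickly as possible.
    void enqueueRow(ResultSet rs) throws SQLException, InterruptedException
    {
        int n = rs.getMetaData().getColumnCount();
        Object[] row = new Object[n];
        for (int i = 0; i < n; i++) {
            row[i] = rs.getObject(i + 1);   // JDBC columns are 1-based
        }
        workQueue.put(row);                 // blocks only if the queue is full
    }

    // The processing thread drains the queue asynchronously, so a slow
    // processing step does not stall the reader unless the queue fills up.
    class Processor extends Thread
    {
        @Override
        public void run()
        {
            try {
                while (!isInterrupted()) {
                    Object[] row = workQueue.take();   // waits for the next row
                    // process the row here
                }
            } catch (InterruptedException ie) {
                // asked to stop; exit the loop
            }
        }
    }
}

Using a bounded queue means that if processing falls far enough behind, the reader does eventually block, but the backlog stays in your application where you can see and manage it.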
Such decoupling doesn’t really make sense in a pure database application, since there will be a finite number of rows to process, so your goal is to minimize total processing time. If there are 1,000 rows, it doesn’t matter whether you read them all in one second, then spend an hour processing, or read them over the period of an hour, processing as you go. The total amount of data read and processed is the same.
In the streaming world, it is important to read the data as quickly as possible in order to maximize the overall throughput of the system. That’s Rule #3 (“The rate of the rows matters.”) coming into play. The data throughput of a given stream is gated by its slowest reader. Part of the contract your client needs to fulfill is to not bog down the system for other clients.