Real-Time OLAP

Streaming SQL and OLAP are two of the most interesting and powerful methodologies in data processing. OLAP is a well-established technique for analyzing large databases of historic data. Streaming SQL is a more recent innovation that applies the declarative power of the SQL language to the problem of managing data in motion.

OLAP is hampered by conventional ETL techniques: it is difficult to keep the data warehouse up to date, because batch-based ETL processes are only efficient when dealing with a few hours or days of data. On the other hand, OLAP engines excel at comparisons, for example:

  • Between time periods, such as comparing this quarter's data or analyses with those of the same quarter last year
  • Between comparable data sets, such as comparing this brand against that brand

The combination Streaming SQL and OLAP can solve some business problems that can't be solved any other way. Powered by a streaming SQL engine, an OLAP engine can include, in real-time, the most current data in its analysis, such as this hour's data compared to the average for this hour of the day over the last 6 months.

SQLstream and Mondrian - Real-time Continuous OLAP

The Mondrian open source, relational OLAP engine (ROLAP) processes its data from a relational database, caching query results in memory to ensure high performance on large data sets. It also uses predefined, concurrently maintained aggregate tables populated with summaries of the data. For rapidly changing data, Mondrian's cache and aggregate tables both require careful management.

SQLstream helps mondrian achieve the requisite management by providing a continuous, real-time ETL process with the following steps:

  • Acquire the real-time data and expose it as a common relational format.
  • Transform the data into an organization suitable for OLAP and data warehousing.
  • Load the data into the data warehouse, including aggregate tables.
  • Notify mondrian of changes to its cache.

Acquisition

Traditional ETL processes are limited to those relatively static sources, whereas SQLstream makes available data-in-flight in diverse formats: messages on message-oriented middleware, web service calls, TCP network packets, and so forth.SQLstream helps in acquiring the data by subscribing to sources of data-in-flight as well as by accessing the traditional data warehouse sources: databases, mainframes, and files extracted from other operational systems. It can monitor database tables to generate events for every new transaction, and tail a log file to read rows as they are appended to the log file.

One of SQLstream's core concepts is a stream, analogous to a table in a relational database but with major additional operational advantages. While a table contains a finite set of rows inserted at some past time and stored on disk, a stream contains an infinite sequence of rows that arrive whenever the producer decides to send them. (SQLstream in fact supports tables too, so that you can combine historical or reference data with event data.)

What streams and tables have in common is the fact that you can manipulate them using SQL queries. Not just the simple operations like filtering and routing, but operations that combine multiple rows, such as join and a ggregation. You can combine rows from the same stream (often demarcated by a time window of interest), from other streams, and from historical and reference data.

Transformation

Next, you need to prepare the data and convert it into a form suitable for large-scale analysis. With SQLstream, you can use industry standard SQL to perform a real-time, continuous ETL process. For example:

  • You can apply standard SQL operators to cleanse and convert the data fields.
  • You can calculate trends such as moving averages using SQLstream's windowed aggregation operations.
  • If your data warehouse schema contains slowly-changing dimensions, SQLstream can help the loading process by identifying transactions that represent a new member of a dimension. For example, when an order is received from an existing customer, SQLstream can find that customer's id, whereas if the customer is new, it can generate a new surrogate key value.
  • If your data warehouse schema contains aggregate tables, they need to be populated with records that represent multiple fact table records. It is often cheaper to compute these aggregate records in memory.

One should be aware that using many aggregate tables when data rates are extremely high becomes limited by the I/O capacity of the DBMS, making it impossible to keep the aggregate tables 100% up-to-date. You should reduce the number or granularity of the aggregate tables, and partition each aggregate table by time to ensure that only one block per time period is being actively written to. The active block of aggregate tables can then fit into the DBMS's buffer cache.

Loading

Loading the data warehouse is straightforward. SQLstream's Database Adapter makes DBMS tables appear as foreign streams, so that writing to these streams makes an insert, update or delete occur in the data warehouse.

As data is loaded into the data warehouse, it becomes inconsistent with the state of mondrian's cache. Mondrian's cache is necessary for performance if Mondrian has many concurrent users or if the data warehouse is so large that SQL queries take a long time. However, flushing the entire cache every time there is an update negates the value of the cache.

Notification

Fortunately, Mondrian has an API to let you notify mondrian of changes that affect its cache contents. You can tell Mondrian specifically which data changed. For example, you can say "there was just a sale of beer in Texas," and Mondrian will mark precisely these entries in the cache as invalid. That setting forces a re-read from the database when the next OLAP query requests those entries.

Once again, the problem can easily be solved using a foreign stream. The foreign stream should call Mondrian's cache control API for each row it receives. A SQLstream pump object ensures that every record written into the fact table is mirrored into the foreign stream, and therefore mondrian's cache is kept in sync with the DBMS.

Conclusion

In conclusion, there is a synergy between OLAP and streaming SQL techniques that allows new business problems to be solved and existing problems to be solved much more efficiently. SQLstream enables continuous ETL operations, and Mondrian, with its extensible Java architecture, is a natural complement.

Have questions about your current project? Click here to ask a SQLstream expert.