SQLstream’s Continuous ETL Connector for Google BigQuery

SQLstream has joined the Google Cloud Platform Partner Program as a Technology Partner with the release of our Continuous ETL connector for real-time Big Data integration with Google BigQuery. Continuous ETL solves the real-time performance issue for Big Data storage platforms. Although very capable of processing large datasets once the data has been stored, Big Data storage platforms are not designed to process real-time streaming data. SQLstream’s Continuous ETL connectors enable SQLstream to integrate and analyze vast volumes of live, real-time Big Data, and to update the Big Data storage platforms immediately as input data arrives. With existing ETL solutions, the data warehouse and Big Data storage platforms are updated in batch mode and are therefore always out of date. Continuous ETL updates continuously and in real-time.

Continuous ETL for Google BigQuery

Real-time, integrated Big Data solutions

Grupo IntechThe continuous ETL integration was carried out by Grupo Intech as an integral component of a SQLstream and Google BigQuery solution for detecting road network traffic congestion in real-time using GPS data.

GPS records are collected from several specialist data providers. Each GPS record contains the position, direction and speed of a vehicle. SQLstream collects GPS records from all the providers in real-time and turns the live GPS data feeds into real-time traffic flow information, and by applying a variety of congestion detection algorithms, is able to generate a map of the current traffic congestion positions, plus additional information as to the extent or severity of the problem.

SQLstream calculates real-time traffic analytics and displays the results in real-time on a map-based display. Roads are color-coded based on the average traffic speed relative to the posted speed limits for each road segment. Colored push-pins appear on the highways to provide further information on the potential problem, such as the average speed for every minute over the previous 15 minutes.

Continuous ETL and Google BigQuery

So how does Big Query help? GPS data is of variable quality, and also, as the data is available from many different providers, the network coverage for each provider can differ significantly.

Google Big Query is used to store the real-time traffic information produced by SQLstream, and to generate a confidence index on the coverage and quality, and therefore the reliability, of the incoming GPS records. Effectively, Google Big Query is helping Grupo Intech answer the question – do we have sufficient valid data for the real-time congestion information to be reliable.

SQLstream’s Connector for Google Big Query is used to deliver real-time analyzed data using a Continuous ETL operation. With continuous ETL, streaming data is continuously acquired, cleaned, conditioned, transformed and then periodically appended to the BigQuery table, for example, every minute or every 500,000 events. Therefore the data in Big Query is always up to date and accurate.

Here we see the Big Query table, which holds a minimum of 3 months of enhanced and interpolated data for ad-hoc data mining. The period can be extended if needed for improved accuracy.

Continuous ETL – more than real-time

SQLstream’s analysis of the road network is based on individual 10m road segments. Typically GPS records are not available for every vehicle for each and every road segment, so SQLstream interpolates missing road segment data in real-time, and continuously appends those results as well into the Google Big Query table – storing a complete real-time map of traffic flow for the entire road network at a level of granularity of every 10 meters. Therefore unlike traditional ETL tools that tend to offer only a basic mapping function, SQLstream delivers significantly more information into Google Big Query than is available in the source data.

How does it work

SQLstream allows the user to zoom in and center the geographic map to an area of interest. This action defines a bounding box and the coordinates of the bounding box are used in this first Big Table query to extract the number of 10 meter road segments within the bounding box. The action of defining the bounding box in SQlstream  launches a query similar to the following example:

select count(*) from road_elements
where
  "reLatitude" between 7.7109920000000001 
                   and 9.8525100000000005 AND
  "reLongitude" between -65.754776000000007 
                    and -62.958756000000001.

This query retunes the number of road segments within the bounding box (in this example, there are 191,866 road segments) which is then plugged into the query executed by BigQuery in order to calculate the percentage of road elements for each GPS data provider actually used for calculating the road network coverage within the bounding box.

select vendorID, float(integer(count(*) / 191866 * 1000)/10) as Confidence
from (select vendorID, reID from sample.july2nd
where 
 float(reLatitude) between 7.7109920000000001 
                       and 9.8525100000000005 AND
 float(reLongitude) between -65.754776000000007 
                        and -62.958756000000001
group by vendorID, reID)
group by vendorID order by Confidence desc

The query results are returned to SQLstream which displays the answer to the user. Big Query is ideal for these types of geospatial coverage metrics. There is a high volume of meaningful data stored in the Big Query table and the calculation of coverage is available immediately. By choosing a variety of bounding boxes, it is possible to obtain coverage metrics a country-wide basis, on a metropolitan area basis, and on for a much smaller problem area.

Conclusions

In conclusion, we have demonstrated how continuous ETL is the basic building block for integrating real-time and historical data. Continuous ETL enables offline data storage to be kept up to date directly from real-time applications. And unlike conventional ETL solutions that simply map data between systems, SQLstream’s Continuous ETL connector for Google Big Query delivers aggregated data, analytics as well as augmented and interpolated data values.