SQLstream is helping to predict earthquakes across the world in real time. The system has been developed by a consortium of universities and government agencies, with funding from NSF (National Science Foundation), to provide an infrastructure of networked tools for research in ocean science – constructing an internet-based system to collect and share data.
This is a large system with 16,000 land and sea-based sensors, each sending several channels of seismic event data at a rate of 40 data points per second. The application executes in an Amazon EC2 Cloud on a cluster of SQLstream servers, connected by an AMQP message bus. SQLstream’s AMQP adapter (built with RabbitMQ) enables the streaming SQL application to view the AMQP bus as a domain of input and output data streams. The initial prototype was upgraded to a full-scale system running on a cluster of servers without any changes to the streaming SQL application.
SQLstream’s contribution to the project, an application that processes seismographic data in real-time, demonstrates:
- An operational deployment of streaming SQL for scientific calculations in real-time
- Real-time distributed processing in an Amazon EC2 cloud using SQLstream and AMQP
- Rapid development and rollout of real-time data applications using standards-based streaming SQL.
The sensor network contains about 16,000 seismic sensors, organized into grids, covering large parts of the North American continent and the adjacent oceans (see illustration below). Each sensor measures the motion of the ground under it in three dimensions, and transmits its data as several digitized channels, typically sampled 40 times a second.
While the rate of each signal channel is modest (since seismic waves are low-frequency),
this adds up to a large amount of data to process in real time. Moreover, the rules for detecting a seismic event are heuristics that apply to a time interval of several minutes: so the application has to calculate some quantities from the raw data and to store these calculated values over a time window.
But to detect an earthquake reliably, it’s better to monitor all the sensors at once, looking for a disturbance in the signals that first appear in one place, and then appear in nearby places: a disturbance signal that propagates and changes shape in a way consistent with the physics of a seismic wave.
Real-time sensor network management in an EC2 Cloud
Monitoring tens of thousands of signal channels arriving at 40 sample points per second is a complicated problem, but it can be made simpler by breaking it into stages. We’re interested in earthquakes, which are infrequent, so SQLstream first reduces the amount of data by scanning each channel for patterns that suggest the beginning, the peak or the end of a quake: in other words reduce the dense signal to a sequence of interesting events. Then we can look for events detected on other channels that could be due to the same quake propagating in physical space and time.
In the first phase, we built a real-time seismic event detector in streaming SQL.
We translated a scientific algorithm into streaming SQL, and connected to the scientific sensor data infrastructure – using our AMQP adapter. This involved less than 100 lines of streaming SQL.
In the second phase, the prototype was scaled up to a full-size system, dealing with 16,000 sensor channels, running on multiple SQLstream server nodes created automatically in an Amazon Elastic Cloud. This expansion required no changes to the streaming SQL application developed for the initial prototype – simply running the same streaming SQL application inside an elastic container/manager.
Real-time Event Detection with Streaming SQL
The sensor data processing pipeline has five functional stages:
- Reading Messages – over the AMQP adapter. A configuration parameter specifies which “topics” SQLstream subscribes to, that is, which set of sensor channels it receives.
- Unpack Data – a user defined transform (UDX) unpacks data channel messages into individual data points.
- Signal Processing – extract higher order information from the raw data to identify seismic events given background noise and non-seismic ‘bumps’. An example function would be to calculate the ratios of multiple rolling averages of the signal value (x) over different time windows.
- Event Detection – The next stage applies heuristic rules to the processed data streams – the output is much sparser: a stream of significant events, each indicating a possible start/peak/stop of a seismic wave on a particular channel. If events of the correct type occur within a correct interval of each other (as shown in the Signal Plot Diagram), they are accepted as significant.
- Writing Messages – output (publish) significant events over the AMQP adapter.
Automatic EC2 scaling for Big Data sensor volume
The pipeline described has been proven to scale well to handle even greater data volume:
- Channels are processed independently, so scale as O(N).
- Additional channels are processed by adding more pipelines
- Each pipeline starts and ends with AMQP messages – the SQLstream / AMQP interoperability has been shown to scale well.
- To add a pipeline, we simply add another Elastic Compute server, running the same pipeline streaming SQL, but configured to subscribe to its own set of sensor channels.
This is the first part of a series of blogs describing the seimic monitoring solution. The next blogs will focus on the streaming SQL used in the application, and the SQLstream / AMQP architecture for Big Data scalability.