Mastering the Internet – Really Big Data Analytics

The recent public release of information on the activities of our Intelligence Agencies makes interesting reading for Big Data professionals. In particular, how these agencies have mastered the Internet, with vast data collection and analytics facilities for monitoring Internet and smartphone activity. A series of news articles described GCHQ’s (the UK’s smaller cousin of the NSA) role in the program, and the numbers are staggering. For example, one claim is that as far back as 2011, the UK alone was collecting 39 billion events and storing over 20 petabytes of data per day. The raw data is reportedly retained and analyzed over a three-day period, scanned against 40,000 keyword searches simultaneously, with the results being correlated and stored as metadata. The resulting ‘analytical’ metadata is retained for a period of one month.

Staggering numbers, which we can only assume have grown significantly over the past two years. But there is also another technological feat here – collecting that much data in the first place. There are billions of active Internet and smartphone users worldwide at any given time, making it a seemingly daunting prospect to ‘monitor the Internet’. However, much of the Internet traffic flowing into and out of any country passes through a relatively small number of very large pipes – usually submarine cables. And this is how the data are captured: by storing everything that flows across multiple 100Gbps trans-continental cables. Now, those really are some very Big Data firehoses from which to drink.

On one level, little has changed since the beginning of digital communication. Intelligence Agencies monitor and decode communications, searching for keywords that might indicate interesting activity by their respective enemies, terrorists and criminal operations. Pre-Internet, however, the focus was on intercepting radio, satellite and microwave communications, and involved endless processing of industrial-strength, bulk cypher streams. Streaming Big Data indeed, but not on the same scale as its post-Internet cousin. The difficult part was the interception, something that is much more straightforward in the Internet age. Nor are today’s Internet and phone data as difficult to decipher: much of the target data are plain text, or only lightly encrypted.

The complexity in the post-Internet age lies in volume, velocity and variety – a textbook use case for Big Data technology. Intelligence Agencies and supercomputing centers have had Hadoop-like Big Data technology for thirty years or more; today’s manifestation of Big Data with Hadoop is merely supercomputing ‘lite’ for the masses. Yet the scenarios described raise the question – why store the data at all? The processing described in the articles is nothing more than a short data processing pipeline with two stages: one with a three-day rolling window, one with a one-month rolling window. The entire article reads like a streaming data analytics problem, and perhaps this is the underlying technology layer, with stream persistence for additional cross-Agency correlation.
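To make the shape of that two-stage pipeline concrete, here is a minimal single-process sketch in Python: raw events are held in a three-day rolling window and scanned against a keyword set, with only compact match metadata kept for a month. The event structure, keyword list and in-memory queues are illustrative assumptions, not details from the reporting; a real deployment would sit on a distributed streaming platform.

```python
from collections import deque
from datetime import datetime, timedelta

# Illustrative sketch only: two rolling windows, as described in the articles.
RAW_WINDOW = timedelta(days=3)      # raw events retained ~3 days
META_WINDOW = timedelta(days=30)    # derived metadata retained ~1 month

KEYWORDS = {"example", "keyword"}   # stand-in for the ~40,000 search terms

raw_events = deque()                # (timestamp, payload)
metadata = deque()                  # (timestamp, matched_keyword, source)


def ingest(timestamp: datetime, source: str, payload: str) -> None:
    """Stage 1: retain the raw event and scan it against the keyword set."""
    raw_events.append((timestamp, payload))
    for word in payload.lower().split():
        if word in KEYWORDS:
            # Stage 2: keep only a compact record of the match, not the payload.
            metadata.append((timestamp, word, source))
    _expire(timestamp)


def _expire(now: datetime) -> None:
    """Drop raw events and metadata that have aged out of their windows."""
    while raw_events and now - raw_events[0][0] > RAW_WINDOW:
        raw_events.popleft()
    while metadata and now - metadata[0][0] > META_WINDOW:
        metadata.popleft()


if __name__ == "__main__":
    ingest(datetime.now(), "demo-source", "an example payload")
    print(len(raw_events), "raw events,", len(metadata), "metadata records")
```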

Streaming data management would also enable real-time analysis, rather than waiting three days for the results. Over the years, Intelligence Agencies have been accused of spending billions and yet still seem to find out about major security events in the same way as the rest of us – in the morning’s newspapers. So real-time must be a key requirement, and streaming data management is the only technology available that can deliver low-latency analytics at this scale. Massively scalable, distributed streaming SQL platforms scale by simply adding more servers. A data velocity of 39 billion events per day may sound like a lot, but it works out at roughly 450,000 events per second. Even allowing for the number of rules and some fairly complex analytics (geo-spatial, semantic analysis, keyword matching and so on), this is well within the capability of streaming data management on a small number of commodity servers.
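The back-of-envelope arithmetic is straightforward; the server count below is purely an assumption for illustration, not a figure from the reporting.

```python
# 39 billion events per day expressed as a per-second rate,
# then spread across a hypothetical small commodity cluster.
EVENTS_PER_DAY = 39_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events/second")            # ~451,389

SERVERS = 10                                                 # illustrative assumption
print(f"{events_per_second / SERVERS:,.0f} events/second per server")
```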

Streaming technology also has advantages in other areas for this type of high-velocity use case. Take data collection: by using remote streaming collection agents with filtering and basic analytics capability, the volume of backhauled data can be reduced significantly. And with so many rules in play, streaming data platforms allow new rules and analytics to be added on the fly, without having to take the platform down.
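As a rough illustration of both ideas – edge filtering and hot rule updates – here is a sketch of a remote collection agent that only backhauls records matching its current rule set, and that can accept new rules while running. The rule format and the backhaul step are assumptions made for the sake of the example.

```python
import re
import threading

# Illustrative sketch of an edge collection agent: filter locally, ship only
# matches, and allow rules to be added at runtime without a restart.
class CollectionAgent:
    def __init__(self) -> None:
        self._rules: list[re.Pattern] = []   # compiled filter rules
        self._lock = threading.Lock()        # rules may change while processing

    def add_rule(self, pattern: str) -> None:
        """Add a new filter rule on the fly, without stopping the agent."""
        compiled = re.compile(pattern, re.IGNORECASE)
        with self._lock:
            self._rules.append(compiled)

    def process(self, record: str) -> bool:
        """Backhaul the record only if it matches at least one rule."""
        with self._lock:
            rules = list(self._rules)
        if any(rule.search(record) for rule in rules):
            self._backhaul(record)
            return True
        return False                         # filtered out at the edge

    def _backhaul(self, record: str) -> None:
        # Placeholder for shipping the record to the central platform.
        print("backhauling:", record)


if __name__ == "__main__":
    agent = CollectionAgent()
    agent.add_rule(r"example keyword")       # hot-added rule
    agent.process("this contains an Example Keyword somewhere")
    agent.process("this record is dropped locally")
```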

And finally, side-stepping the issue of the ethical framework that allows our Intelligence Agencies to operate in this mode, what does this mean for the future of the Internet? You could argue that this first Internet Age is built on naivety – ‘The Naive Age’. Many social media users seem blissfully unaware that anyone other than their immediate friends might want to read a post. Even those perpetrating criminal acts seem to like to boast on social media. And those involved in terrorist acts would seem to believe that each tweet, email and call is such a small needle in such a very large haystack that discovery is unlikely.

This could be the wake-up call for the wider world of Internet users. Just as 100% pure water can be super-cooled below zero degrees until the first impurity, no matter how small, causes it to freeze, so perhaps the awareness of bulk monitoring will do the same for the Internet. This could spell the end of ‘The Age of Naivety’ and the beginning of ‘The Age of Enlightenment’ for the Internet: an age of privacy-aware users, and of social media platform vendors who must work within the constraints of national security. For the Intelligence Agencies, however, this may well be ‘back to the future’ – applying tried and tested cypher analytics to Internet-scale Big Data firehoses.