SQLstream and Mozilla Firefox 4, a look at the SQL behind the scene

Since Tuesday’s announcement that the Firefox Download Monitor is powered by SQLstream, we’ve received a number of questions about how it all fits together. SQLstream server executes SQL statements, just like standard SQL, except the SQLstream’s queries run continuously, analyzing input data in real-time as it arrives. Statements are presented via JDBC or user friendly tools which use JDBC internally. Statements are compiled/prepared, the planner/optimizer chooses an access plan, and a runtime engine executes the plan. SQLstream is compliant with SQL2011 with just a couple of extensions. One extension includes the keyword STREAM as part of a SELECT statement. The STREAM keyword indicates that the results are continuously streaming rather than a point in time TABLE.

Applications in SQLstream are constructed out of a set of SQL CREATE STREAM statements and SQL VIEWS against streams and other views. Those statements are assembled into a pipeline. When describing a pipeline we refer to statements on the source side as being upstream, and statements closer to the destination as being downstream.

In the middle of the pipeline, we define a stream named FirefoxDownloadStream_ which contains the results of the parsed and conditioned download events. The stream declaration is identical to a table definition with the exception of the type of the object being a STREAM rather than TABLE.

CREATE STREAM "FirefoxDownloadStream_" (
+++"download_type"++++++++++VARCHAR(15),
+++"utc_timestamp"++++++++++TIMESTAMP,
+++"product_name"+++++++++++VARCHAR(12),
+++"product_version"++++++++VARCHAR(12),
+++"product_major_version"++VARCHAR(12),
+++"product_os"+++++++++++++VARCHAR(10),
+++"locale_code"++++++++++++VARCHAR(5),
+++"country_code"+++++++++++VARCHAR(2),
+++"city_name"++++++++++++++VARCHAR(32),
+++"region_code"++++++++++++VARCHAR(2),
+++"longitude"++++++++++++++VARCHAR(8),
+++"latitude"+++++++++++++++VARCHAR(8)
);

The stream is populated with a SQL INSERT-SELECT statement. Again, standard SQL statements are used. The WHERE clause defines that the downloads include new first time downloads, complete upgrades of prior versions of Firefox, or partial upgrades of prior versions of Firefox.

INSERT INTO "FirefoxDownloadStream_"
++("download_type",
+++"utc_timestamp",
+++"product_name",
+++"product_version",
+++"product_major_version",
+++"product_os",
+++"locale_code",
+++"country_code",
+++"city_name",
+++"region_code",
+++"longitude",
+++"latitude"
++)
SELECT STREAM
+++"dlType"+++AS "download_type",
+++"dlTime"+++AS "utc_timestamp",
+++"product"++AS "product_name",
+++"version"++AS "product_version",
+++"GetMajorVersion"("version") AS "product_major_version",
+++"os"+++++++AS "product_os",
+++"lang",++++AS "locale_code",
+++"cc",++++++AS "country_code",
+++"city",++++AS "city_name",
+++"rg",++++++AS "region_code",
+++CAST("latitude" AS VARCHAR(10)) AS "latitude",
+++CAST("longitude" AS VARCHAR(10)) AS "longitude"
FROM "FirefoxCountryFilter"
WHERE (("dlType" IS NULL) OR ("dlType" = 'complete') OR ("dlType" = 'partial'));

The download events contain the time of each download. Mozilla has a number of download servers feeding the worldwide requests to download Firefox. Each of these servers feeds the results of the download requests to a common logfile which is “tailed” by SQLstream. As the time for each download differs due to each client’s network capacity, the download requests may be slightly out of order. In practice the biggest gap we’ve seen is 4 seconds. Since we’re measuring downloads over the long period of time, it was deemed sufficient to adjust the download time of late arrivals to match the most recent download time.
The following SQL statement does that adjustment.


CREATE OR REPLACE VIEW "FirefoxDownloadStream" AS
+++SELECT STREAM MAX("utc_timestamp") OVER(ROWS UNBOUNDED PRECEDING)
++++++++++AS ROWTIME,
++++++++++*
+++FROM "FirefoxDownloadStream_";

SQLstream associates a ROWTIME with each row in a STREAM. The ROWTIME is a monotonically increasing SQL timestamp. In the default case, the ROWTIME is the current time expressed in UTC. Most applications require time to be defined according to time associated with the data itself. Associating the ROWTIME of a row in a stream based on the data contents of the row, is done by the AS ROWTIME clause for an individual column. In the Mozilla pipeline, we set the ROWTIME to be the maximum of the values in the “utc_timestamp” column to be that rows ROWTIME.
The analytics portion of the pipeline is implemented with a standard SQL statement. For example, each 10 seconds the number of downloads for each product, version, … country, city, region is calculated.


CREATE OR REPLACE VIEW "FirefoxStreamForLocationCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"country_code",
++++++++++"region_code",
++++++++++"city_name",
++++++++++"latitude",
++++++++++"longitude",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"country_code",
++++++++++++"region_code",
++++++++++++"city_name",
++++++++++++"latitude",
++++++++++++"longitude";

There is a similar view declaration where similar calculations are done for each product. Most of the interest since Tuesday is of course related to Firefox 4.0 downloads. This second view allows Mozilla to drill down on downloads by platform as well as downloads for previous (and future) Firefox versions.


CREATE OR REPLACE VIEW "FirefoxStreamForProductCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"product_os",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"product_os";

Each of these views (FirefoxStreamForLocationCounters and FirefoxStreamForProductCounters) is based on the FirefoxDownloadStream. Each defined stream and view is a point where the application can access data either directly or via another VIEW or INSERT…SELECT.

One component of the solution is a piece of code we call the HBaseAgent. The agent uses the JDBC interface to SQLstream and issues a SELECT * FROM each of the described views containing the location and product counter 10-second download counts. The HBaseAgent maps each fetched row to the HBase schema as defined by Mozilla.

I write this blog about 24 hours after Firefox 4 launched. So far there are more than 8 million downloads of Firefox 4. It certainly has been an exciting day for Mozilla and I congratulate everyone who contributed. I’m happy that SQLstream has been able to contribute to their success.