Parsing W3C Data

The W3C option lets you parse logs generated by W3C-compliant applications. You describe file entries using data specifiers defined in the Apache mod_log_config documentation.

The W3C parser uses the W3C parser function, described in the topic W3C_LOG_PARSE in the Streaming SQL Reference Guide. You can call that function anywhere in your code. The W3C parser for the Extensible Common Data Framework instead parses W3C log data as it comes into s-Server, which may be desirable for performance or other reasons.

To use the Extensible Common Data Adapter with W3C files, set the parser option to W3C, then pass in groups of filters that map to columns. The W3C parser takes one additional option, FORMAT, whose value is a string of data specifiers as defined in the Apache mod_log_config documentation. Examples are provided below.

Column names cannot be dynamically assigned with W3C files. You need to declare them as part of the foreign stream or table.

Note: SQLstream handles Apache log format specifiers without alteration.

Options for W3C Parser

| Option | Definition |
|---|---|
| FORMAT | Format specification, such as '%h %l %u %t "%r" %>s %b'. See http://httpd.apache.org/docs/current/mod/mod_log_config.html |

Examples of Commonly Used Log Format Strings

| Format Name | W3C Name | Format Specifiers |
|---|---|---|
| COMMON | Common Log Format (CLF) | %h %l %u %t "%r" %>s %b |
| COMMON WITH VHOST | Common Log Format with Virtual Host | %v %h %l %u %t "%r" %>s %b |
| NCSA EXTENDED | NCSA extended/combined log format | %h %l %u %t "%r" %>s %b "%[Referrer]i" "%[User-agent]i" |
| REFERRER | Referrer log format | %[Referrer]i ---> %U |
| AGENT | Agent (Browser) log format | %[User-agent]i |
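To make the link between a format string and columns concrete, here is a minimal Python sketch that turns the COMMON format string into a regular expression and splits one log line into named fields. It is an illustration only, not s-Server's parser. The function names `format_to_regex` and `parse_line`, the column names, and the sample log line are all hypothetical. The sketch also assumes Apache's convention that %t is written inside square brackets; s-Server may treat %t differently (the sample foreign stream later in this topic writes [%t] explicitly).

```python
import re

# Format string from the COMMON row of the table above.
COMMON_FORMAT = '%h %l %u %t "%r" %>s %b'

# Hypothetical column names, one per specifier, in order of appearance.
COLUMNS = ["host", "ident", "userId", "reqTime", "request", "status", "bytes"]

# Matches one specifier such as %h, %>s, or %[Referrer]i.
SPECIFIER = re.compile(r"%(?:\[[^\]]+\])?>?[A-Za-z]")


def format_to_regex(fmt):
    """Build a regex with one capture group per specifier in fmt."""
    parts = []
    pos = 0
    for m in SPECIFIER.finditer(fmt):
        literal = fmt[pos:m.start()]
        parts.append(re.escape(literal))
        if m.group() == "%t":
            # Assumption: the timestamp is bracketed and may contain spaces.
            parts.append(r"\[([^\]]*)\]")
        elif literal.endswith('"'):
            # A quoted field runs up to the closing double quote.
            parts.append(r'([^"]*)')
        else:
            # Any other field is a run of non-whitespace characters.
            parts.append(r"(\S+)")
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("".join(parts))


def parse_line(line, fmt=COMMON_FORMAT, columns=COLUMNS):
    """Return a dict of column name to value, or None if the line does not match."""
    m = format_to_regex(fmt).fullmatch(line)
    if m is None:
        return None
    return dict(zip(columns, m.groups()))


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
row = parse_line(sample)
print(row)
```

Each specifier yields exactly one value, and values are paired with column names by position. This mirrors why the columns must be declared up front in the foreign stream or table.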

Sample Foreign Stream to Parse W3C Files

The following example will parse columns called "ip", "ident", "userId", "reqTime", "reqMethod", "reqLine", and "httpVer" from a file in /tmp.

Note: Information on file location, file name pattern and character encoding can also be set as server options.

 

CREATE OR REPLACE FOREIGN DATA WRAPPER MOZILLA_ECDA
   LIBRARY 'class com.sqlstream.aspen.namespace.common.CommonDataWrapper'
   LANGUAGE JAVA;

CREATE OR REPLACE SERVER "mozilla_server"
   TYPE 'FILE'
   FOREIGN DATA WRAPPER MOZILLA_ECDA;

CREATE OR REPLACE FOREIGN STREAM "mozilla"."BaseLogStream"
   ("ip" VARCHAR(15),
    "ident" VARCHAR(5),
    "userId" VARCHAR(5),
    "reqTime" VARCHAR(26),
    "reqMethod" VARCHAR(7),
    "reqLine" VARCHAR(256),
    "httpVer" VARCHAR(5)
   )
   SERVER "mozilla_server"
   OPTIONS (
            directory '/tmp',
            filename_pattern 'access_\d{4}(-\d\d){3}(\.\d+)?',
            encoding 'UTF-8',
            parser 'W3C',
            format '%h %l %u [%t] \"%r %r HTTP/%r\" %>s %b \"%r\" \"%r\"');
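The filename_pattern option is a regular expression, so it can help to check which file names it accepts before dropping files into the directory. The following Python sketch tests a few hypothetical file names against the same pattern. It assumes that the whole file name must match; whether s-Server applies a full match or a partial match is not stated here. The helper name `is_candidate` is illustrative.

```python
import re

# Pattern copied from the filename_pattern option above.
FILENAME_PATTERN = re.compile(r"access_\d{4}(-\d\d){3}(\.\d+)?")


def is_candidate(name):
    """Return True if the entire file name matches the pattern (assumed full match)."""
    return FILENAME_PATTERN.fullmatch(name) is not None


# Hypothetical file names, chosen to show what the pattern accepts and rejects.
names = [
    "access_2014-01-15-10",    # four digits, then three "-dd" groups
    "access_2014-01-15-10.1",  # same, with the optional numeric suffix
    "access_2014-01-15",       # only two "-dd" groups
    "access.log",              # no date portion at all
]

for name in names:
    print(name, is_candidate(name))
```

Under the full-match assumption, the pattern requires exactly three two-digit groups after the four-digit prefix, so a name with only a year, month, and day is rejected.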

 

Sample Properties Implementing ECD Agent to Parse W3C Files

To parse W3C files with the ECD Agent, configure the options above using the ECD Agent property file with properties similar to the following:

ROWTYPE=RECORDTYPE(VARCHAR(15) ip, VARCHAR(5) ident, VARCHAR(5) userId, VARCHAR(26) reqTime, VARCHAR(7) reqMethod, VARCHAR(256) reqLine, VARCHAR(5) httpVer)

DIRECTORY=/tmp

FILENAME_PATTERN=access_\d{4}(-\d\d){3}(\.\d+)?

CHARACTER_ENCODING=UTF-8

SKIP_HEADER=TRUE

SEPARATOR=u\000A

parser=W3C

format=%h %l %u [%t] \"%r %r HTTP/%r\" %>s %b \"%r\" \"%r\"