W3C_LOG_PARSE

<< Click to Display Table of Contents >>

Navigation:  Streaming SQL Reference Guide > Basic Building Blocks > Functions > Built-in Functions >

W3C_LOG_PARSE

Previous pageReturn to chapter overviewNext page

W3C_LOG_PARSE( <character-expression>, <format-string> )

<format-string> := '<predefined-format> | <custom-format>'

<predefined format> :=

        COMMON

      | COMMON WITH VHOST

      | NCSA EXTENDED

      | REFERER

      | AGENT

      | IIS

<custom-format> := [an Apache log format specifier]

 

W3C-predefined formats

Specifying the following W3C-predefined-format names summarizes using the format specifiers indicated, as in the following statement:

select stream W3C_LOG_PARSE(message, 'COMMON') r  from w3ccommon t;

Format Name

W3C Name

Format Specifiers

COMMON

Common Log Format (CLF)

%h %l %u %t "%r" %>s %b

COMMON WITH VHOST

Common Log Format with Virtual Host

%v %h %l %u %t "%r" %>s %b

NCSA EXTENDED

NCSA extended/combined log format

%h %l %u %t "%r" %>s %b "%[Referer]i" "%[User-agent]i"

REFERER

Referer log format

%[Referer]i ---> %U

AGENT

Agent (Browser) log format

%[User-agent]i

W3C format specifiers

The format specifiers are listed below. W3C_LOG_PARSE will automatically detect these specifiers and output records with one column for each specifier. The column's type is automatically chosen based on the possible outputs of the specifier. For example, %b represents the number of bytes sent in processing an HTTP request, so the column type will be numeric. For %B, however, zero bytes is represented by a dash - forcing the column type to be text. Note A explains what the "..." and "<" or ">" markings shown in the specifier table mean.

W3C format specifiers alphabetically by command

Format Specifier

Explanation

%

The percent sign (Apache 2.0.44 and later)

%...a

Remote IP-address

%...A

Local IP-address

%...B

Size of response in bytes, excluding HTTP headers.

%...b

Size of response in bytes, excluding HTTP headers, in CLF format, which means that when no bytes are sent, uses a '-' rather than a 0.

%...[Foobar]C

The contents of cookie Foobar in the request sent to the server.

%...D

The time taken to serve the request, in microseconds.

%...[FOOBAR]e

The contents of the environment variable FOOBAR

%...f

Filename

%...h

Remote host

%...H

The request protocol

%...[Foobar]i

The contents of Foobar: header line(s) in the request sent to the server.

%...l

Remote logname (from identd, if supplied)

%...m

The request method

%...[Foobar]n

The contents of note Foobar from another module.

%...[Foobar]o

The contents of Foobar: header line(s) in the reply.

%...p

The canonical port of the server serving the request

%...P

The process ID of the child that serviced the request.

%...[format]P

The process ID or thread id of the child that serviced the request. Valid formats are pid and tid. (Apache 2.0.46 and later)

%...q

The query string (prepended with a ? if a query string exists, otherwise an empty string)

%...r

First line of request

%...s

Status. For requests that got internally redirected, this is the status of the *original* request --- %...>s for the last.

%...t

Time, in common log format time format (standard english format)

%...[format]t

The time, in the form given by format, which should be in strftime(3) format. (potentially localized)

%...T

The time taken to serve the request, in seconds.

%...u

Remote user (from auth; may be bogus if return status (%s) is 401)

%...U

The URL path requested, not including any query string.

%...v

The canonical ServerName of the server serving the request.

%...V

The server name according to the UseCanonicalName setting.

%...X

Connection status when response is completed

X = connection aborted before the response completed.

+ = connection may be kept alive after the response is sent.

- = connection will be closed after the response is sent.

(The %..X directive was %...c in late versions of Apache 1.3,

but this conflicted with the historical ssl %...[var]c syntax.)

:%...I:

Bytes received, including request and headers, cannot be zero. You need to enable mod_logio to use this.

:%...O:

Bytes sent, including headers, cannot be zero. You need to enable mod_logio to use this.

Note A: Some W3C format specifiers are shown as containing a "..." indication or a "<" or ">", which are optional controls on suppressing or redirecting the output of that specifier. The "..." can either be empty (as in the COMMON specification "\%h %u %r \%s %b") or it can indicate conditions for including the item. The conditions are a list of HTTP status codes, possibly preceded by "!", and if the specified condition is not met, then the column or field returned shows "-".
 

For example, as described in the Apache documentation, specifying "%400,501[User-agent]i" will log the User-agent only on 400 errors and 501 errors (Bad Request, Not Implemented). Similarly, "%!200,304,302[Referer]i" will log the Referer: on all requests that fail to return some sort of normal status.
 

The modifiers "<" and ">" can be used to choose whether the original or final (respectively) request should be consulted when a request has been internally redirected. By default, the % directives %s, %U, %T, %D, and %r look at the original request while all others look at the final request. So for example, %>s can be used to record the final status of the request and %<u can be used to record the original authenticated user on a request that is internally redirected to an unauthenticated resource.
 

For security reasons, starting with Apache 2.0.46, non-printable and other special characters are escaped mostly by using \xhh sequences, where hh stands for the hexadecimal representation of the raw byte. Exceptions from this rule are " and \ which are escaped by prepending a backslash, and all whitespace characters which are written in their C-style notation (\n, \t etc). In httpd 2.0 versions prior to 2.0.46, no escaping was performed on the strings from %...r, %...i and %...o, so great care was needed when dealing with raw log files, since clients could have inserted control characters into the log.
 

Also, in httpd 2.0, the B format strings represent simply the size in bytes of the HTTP response (which will differ, for instance, if the connection is aborted, or if SSL is used). For the actual number of bytes sent over the network to the client, use the %O format provided by mod_logio.

W3C format specifiers alphabetically by function or category

The categories are bytes sent, connection status, content of environmental variable, filename, host, IP, notes, protocol, query string, replies, requests, and time.
For the markings "..." or "<" or "<", see Note A.

Bytes sent, excluding HTTP headers:

--

with a "0" when no bytes are sent:

%...B

--

with a "-" (CLF format) when no bytes are sent:

%...b

--

Bytes received, including request and headers, cannot be zero: (Must enable mod_logio to use this.)

:% ... I:

--

Bytes sent, including headers, cannot be zero: (Must enable mod_logio to use this.)

:%... O:

Connection status when response is completed

--

Connection aborted before the response completed:

X

--

Connection may be kept alive after the response is sent

+

--

Connection will be closed after the response is sent:

-

--

(The %..X directive was %...c in late versions of Apache 1.3,

but this conflicted with the historical ssl %...[var]c syntax.)

Environment variable FOOBAR:

--

contents:

%...[FOOBAR]e

Filename:

%...f

Host (remote):

%...h

--

Protocol:

%...H

IP addresses:

--

Remote:

%...a

--

Local:

%...A

Notes:

--

Contents of note Foobar from another module

%...[Foobar]n

Protocol (request):

%...H

Query string:

%...q

--

if query exists, prepended with a ?

--

if not, the empty string

Replies:

--

Contents of Foobar: header line(s) in the reply

%...[Foobar]o

The W3C format specifiers for the response and time categories are as follows:

Requests:

--

Canonical port of the server serving the request

%...p

--

Contents of cookie Foobar in the request sent to server:

%... [Foobar]C

--

Contents of BAR:header line(s):

%... [BAR]i

--

First line sent:

%...r

--

Microseconds taken to serve a request:

%...D

--

Protocol:

%...H

--

Process ID of the child that serviced the request:

%...P

--

Process ID or thread id of the child that serviced the request.

Valid formats are pid and tid. (Apache 2.0.46 and later)

%...[format]P

--

Remote logname (from identd, if supplied):

%...l

--

Remote user: (from auth; may be bogus if return status (%s) is 401)

%...u

--

Server (canonical ServerName) serving the request:

%...v

--

Server name by the UseCanonicalName setting:

%...V

--

Request method:

%...m

--

Return status:

%s

--

Seconds taken to serve the request:

%...T

--

Status of the *original* request that was internally redirected:

%...s

--

Status of the last request:

%...>s

--

URL path requested, not including any query string:

%...U

Time:

--

Common log format time format (standard english format):

%...t

--

Time in strftime(3) format, potentially localized:

%...[format]t

--

Seconds taken to serve the request:

%...T

W3C examples

The example below shows the W3C function being specified as one of the OPTIONS in a CREATE FOREIGN STREAM command or within a SELECT STREAM statement in a CREATE VIEW command. The example is taken from the Log File Adapter topic of the SQLstream s-Server Integration Guide.

See also the examples in the page entitled Parsing Log File Records, which use the %h specifier and the COMMON predefined format.

W3C_LOG_PARSE supports access to logs generated by W3C-compliant applications like the Apache web server, producing output rows with one column for each specifier. The data types are derived from the log entry description specifiers listed in the Apache mod_log_config specification.

Example #1:

The input in this example is taken from an Apache log file and is representative of the COMMON log format.

Input:

(192.168.254.30 - John [24/May/2004:22:01:02 -0700]

                    "GET /icons/apache_pb.gif HTTP/1.1" 304 0),

(192.168.254.30 - Jane [24/May/2004:22:01:02 -0700]

                    "GET /icons/small/dir.gif HTTP/1.1" 304 0);

 

DDL:

   CREATE OR REPLACE FOREIGN STREAM schema1.weblog

       SERVER logfile_server

       OPTIONS (LOG_PATH '/path/to/logfile',

                ENCODING 'UTF-8',

                SLEEP_INTERVAL '10000',

                MAX_UNCHANGED_STATS '10',

                PARSER 'W3C',

                PARSER_FORMAT 'COMMON');

  or

 

    CREATE FOREIGN STREAM "Schema1".weblog_read

    SERVER "logfile_server"

    OPTIONS (log_path '/path/to/logfile',

    encoding 'UTF-8',

    sleep_interval '10000',

    max_unchanged_stats '10');

    CREATE OR REPLACE VIEW "Schema1".weblog AS

       SELECT STREAM

           l.r.COLUMN1,

           l.r.COLUMN2,

           l.r.COLUMN3,

           l.r.COLUMN4,

           l.r.COLUMN5,

           l.r.COLUMN6,

           l.r.COLUMN7

       FROM (SELECT STREAM W3C_LOG_PARSE(message, 'COMMON')

             FROM "Schema1".weblog_read) AS l(r);

Output:

select stream CAST(COLUMN1 AS VARCHAR(15)) as COL1,

             CAST(COLUMN2 AS VARCHAR(5)) AS COL2,

             CAST(COLUMN3 AS VARCHAR(5)) as COL3,

             CAST(COLUMN4 AS VARCHAR(30)) AS COL4,

             CAST(COLUMN5 AS VARCHAR(34)) AS COL5,

             CAST(COLUMN6 AS VARCHAR(5)) AS COL6,

             CAST(COLUMN7 AS VARCHAR(5)) AS COL7 from "Schema1".weblog;

COL1         COL2 COL3  COL4

                                                         COL5          COL6  COL7

192.168.254.30 -  John  [24/May/2004:22:01:02 -0700]

                                     GET /icons/apache_pb.gif HTTP/1.1  304  0

192.168.254.30 -  Jane  [24/May/2004:22:01:02 -0700]

                                     GET /icons/small/dir.gif HTTP/1.1  304  0

 

The specification of COMMON in the FROM clause means the Common Log Format (CLF), which uses the specifiers %h %l %u %t "%r" %>s %b.

The W3C-predefined formats topic above shows the COMMON and other predefined specifier sets.

The output shown below uses the following view definitions:

select stream CAST(COLUMN1 AS VARCHAR(15)) as COL1,

             CAST(COLUMN2 AS VARCHAR(5)) AS COL2,

             CAST(COLUMN3 AS VARCHAR(5)) as COL3,

             CAST(COLUMN4 AS VARCHAR(30)) AS COL4,

             CAST(COLUMN5 AS VARCHAR(34)) AS COL5,

             CAST(COLUMN6 AS VARCHAR(5)) AS COL6,

             CAST(COLUMN7 AS VARCHAR(5)) AS COL7 from "Schema1".weblog;

COL1         COL2 COL3  COL4

                                                         COL5          COL6  COL7

192.168.254.30 -  John  [24/May/2004:22:01:02 -0700]

                                     GET /icons/apache_pb.gif HTTP/1.1  304  0

192.168.254.30 -  Jane  [24/May/2004:22:01:02 -0700]

                                     GET /icons/small/dir.gif HTTP/1.1  304  0

 

The specification of COMMON in the FROM clause means the Common Log Format (CLF), which uses the specifiers %h %l %u %t "%r" %>s %b.

The table below, Specifiers used by the Common Log Format, describes the specifiers used by COMMON in the FROM clause.

Specifiers used by the Common Log Format

The output column named:

Comes from this corresponding COMMON Log Format specifier:

Which returns this data item from the log.

COLUMN1

%h

The IP address of the remote host

COLUMN2

%l

The remote logname

COLUMN3

%u

The remote user

COLUMN4

%t

The time

COLUMN5

"%r"

The first line of the request

COLUMN6

%>s

The status: For internally redirected requests,

the status of the *original* request

--- %...>s for the last.

COLUMN7

%b

The number of bytes sent, excluding HTTP headers

Example #2:

The DDL in this example shows how to rename output columns and filter out unneeded columns.

DDL:

   CREATE OR REPLACE VIEW "Schema1".weblogreduced AS

       SELECT STREAM CAST(s.COLUMN3 AS VARCHAR(5)) AS LOG_USER,

       CAST(s.COLUMN1 AS VARCHAR(15)) AS ADDRESS,

       CAST(s.COLUMN4 AS VARCHAR(30)) as TIME_DATES

       FROM "Schema1".weblog s;

Output:

   +----------+-----------------+--------------------------------+

   | LOG_USER |     ADDRESS     |           TIME_DATES           |

   |          |                 |                                |

   +----------+-----------------+--------------------------------+

   | Jane     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |

   |          |                 |                                |

   | John     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |

   +----------+-----------------+--------------------------------+

W3C customized formats

The same results would be created by naming the specifiers directly rather than using the "COMMON" name:
 

   CREATE OR REPLACE FOREIGN STREAM schema1.weblog

       SERVER logfile_server

       OPTIONS (LOG_PATH '/path/to/logfile',

                ENCODING 'UTF-8',

                SLEEP_INTERVAL '10000',

                MAX_UNCHANGED_STATS '10',

                PARSER 'W3C',

                PARSER_FORMAT '%h %l %u %t \"%r\" %>s %b');

   or

    CREATE FOREIGN STREAM "Schema1".weblog_read

    SERVER "logfile_server"

    OPTIONS (log_path '/path/to/logfile',

    encoding 'UTF-8',

    sleep_interval '10000',

    max_unchanged_stats '10');

    CREATE OR REPLACE VIEW "Schema1".weblog AS

       SELECT STREAM

           l.r.COLUMN1,

           l.r.COLUMN2,

           l.r.COLUMN3,

           l.r.COLUMN4,

           l.r.COLUMN5,

           l.r.COLUMN6

       FROM (SELECT STREAM W3C_LOG_PARSE(message, '%h %l %u %t \"%r\" %>s %b')

             FROM "Schema1".weblog_read) AS l(r);

 

Note that by changing %t to [%t], the date column will contain the following:

       24/May/2004:22:01:02 -0700

 

instead of this:

       [24/May/2004:22:01:02 -0700]