System-Defined Log-Parsing Functions


Note: This topic describes a legacy adapter. Unless you have used these in the past, we recommend using the Extensible Common Data Adapter and Agent instead of these legacy adapters and agents.

For convenience, SQLstream provides a number of special parsing functions for log entry manipulation. The parsing functions allow the user to define specific sub-groups of data, to give them column names, and to convert data from text to other SQL data types. If there is an error during parsing, an error is logged and the row is skipped.

The parsing functions can be applied in two ways. The first is to apply them as functions in a SELECT statement, either directly or embedded in a VIEW for repeated use. The second method is to specify the parsing function to be used, along with supporting parameters, directly in the CREATE FOREIGN STREAM statement. The second method reduces the number of DDL statements needed and gives more concise definitions of actions being performed on the stream.

VARIABLE_COLUMN_LOG_PARSE

VARIABLE_COLUMN_LOG_PARSE processes each input row based on characteristics provided by the user. Each column of data can be an arbitrary length, and the user can set delimiters, escape, and quote characters, among other parameters.
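Taken together, the examples below imply two call shapes for this function: the second argument is either a column count or a column-specification string, followed by the delimiter set and optional escape and quote characters. A sketch inferred from the examples (the argument names here are illustrative, not official):

   -- VARIABLE_COLUMN_LOG_PARSE(message, column_count, delimiters [, escape [, quote]])
   -- VARIABLE_COLUMN_LOG_PARSE(message, column_spec,  delimiters [, escape [, quote]])
   SELECT STREAM VARIABLE_COLUMN_LOG_PARSE(message, 3, ',', '', '"')   -- count form
   FROM schema1.transaction_log;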

Example #1:

In this example, the input data is parsed into 3 columns that will have the default names. Note the usage of the quote parameter (PARSER_QUOTES) so that the quoted address is taken as a single string, and the commas are not recognized as valid.

Input:

John Doe,"514 West Bay, Newport Beach, CA, 92661-1122",$32.16

Jane Doe,"4919 Lido Sands Drive, Newport Beach, CA, 92661",$15.22

DDL:

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'VARIABLE',
               "PARSER_COLUMN COUNT" '3',
                PARSER_DELIMITERS ',',
               "PARSER_ESCAPE CHAR" '',
                PARSER_QUOTES '"');

or

 

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (log_path '/var/log',
                file_pattern 'userlog\.\d{4}\.\d{2}\.\d{2}',
                encoding 'UTF-8',
                sleep_interval '10000',
                max_unchanged_stats '10');

   CREATE VIEW schema1.transaction_log_view AS
       SELECT STREAM l.r.COLUMN1, l.r.COLUMN2, l.r.COLUMN3
       FROM (SELECT STREAM VARIABLE_COLUMN_LOG_PARSE(message, 3, ',', '', '"')
             FROM schema1.transaction_log) AS l(r);

Output:

    SELECT STREAM * FROM schema1.transaction_log
    SELECT STREAM * FROM schema1.transaction_log_view

    COLUMN1       COLUMN2                                          COLUMN3
    John Doe      514 West Bay, Newport Beach, CA, 92661-1122      $32.16
    Jane Doe      4919 Lido Sands Drive, Newport Beach, CA, 92661  $15.22

Example #2:

In this example, the columns are fully specified with names and SQL data types. The escape character is set to backslash, so the escaped commas (\,) inside the address are not treated as delimiters, which again forces the parser to treat the address portion of the data as a single unit.

Input:

John Doe,514 West Bay\, Newport Beach\, CA\, 92661-1122,$32.16

Jane Doe,4919 Lido Sands Drive\, Newport Beach\, CA\, 92661,$15.22

DDL:

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'VARIABLE',
                PARSER_COLUMNS 'NAME TYPE VARCHAR(32) NOT NULL,
                                ADDRESS TYPE VARCHAR(80) NOT NULL,
                                AMOUNT TYPE VARCHAR(10)',
                PARSER_DELIMITERS ',',
               "PARSER_ESCAPE CHAR" '\');

 

or (with a VIEW defined)

 

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10');

   CREATE VIEW schema1.transaction_log_view AS
       SELECT STREAM t.r.NAME, t.r.ADDRESS, t.r.AMOUNT
       FROM (SELECT STREAM VARIABLE_COLUMN_LOG_PARSE
                  (message,
                   'NAME TYPE VARCHAR(32) NOT NULL,
                    ADDRESS TYPE VARCHAR(80) NOT NULL,
                    AMOUNT TYPE VARCHAR(10)',
                   ',', '\') AS r
             FROM schema1.transaction_log) AS t;

Output:

    SELECT STREAM * FROM schema1.transaction_log
    SELECT STREAM * FROM schema1.transaction_log_view

    NAME          ADDRESS                                          AMOUNT
    John Doe      514 West Bay, Newport Beach, CA, 92661-1122      $32.16
    Jane Doe      4919 Lido Sands Drive, Newport Beach, CA, 92661  $15.22

Example #2A:

This example replicates Example #2 but uses the Tab character as the delimiter instead of commas. Since the Tab character is non-printing, the input below shows <TAB>, but the actual input uses the Tab character. The PARSER_DELIMITERS option is set to u&'\0009', a Unicode string literal consisting of a single tab character.

Input:

John Doe<TAB>514 West Bay<TAB> Newport Beach<TAB> CA<TAB> 92661-1122<TAB>$32.16

Jane Doe<TAB>4919 Lido Sands Drive<TAB> Newport Beach<TAB> CA<TAB> 92661<TAB>$15.22

DDL:

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'VARIABLE',
                PARSER_COLUMNS 'NAME TYPE VARCHAR(32) NOT NULL,
                                ADDRESS TYPE VARCHAR(80) NOT NULL,
                                AMOUNT TYPE VARCHAR(10)',
                PARSER_DELIMITERS u&'\0009',
               "PARSER_ESCAPE CHAR" '\');

   or

   CREATE VIEW schema1.transaction_log_view AS
       SELECT STREAM t.r.NAME, t.r.ADDRESS, t.r.AMOUNT
       FROM (SELECT STREAM VARIABLE_COLUMN_LOG_PARSE
                  (message,
                   'NAME TYPE VARCHAR(32) NOT NULL,
                    ADDRESS TYPE VARCHAR(80) NOT NULL,
                    AMOUNT TYPE VARCHAR(10)',
                   u&'\0009', '\') AS r
             FROM schema1.transaction_log) AS t;

Output:

    SELECT STREAM * FROM schema1.transaction_log
    SELECT STREAM * FROM schema1.transaction_log_view

    NAME          ADDRESS                                          AMOUNT
    John Doe      514 West Bay, Newport Beach, CA, 92661-1122      $32.16
    Jane Doe      4919 Lido Sands Drive, Newport Beach, CA, 92661  $15.22

Example #3:

In this example, a column count is used, so output will consist of VARCHAR(1024) columns with default column names. No escape or quote characters are defined.

Input:

John Doe,514 West Bay,Newport Beach,CA,92661-1122,$32.16

Jane Doe,4919 Lido Sands Drive,Newport Beach,CA,92661,$15.22

DDL:

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'VARIABLE',
               "PARSER_COLUMN COUNT" '6',
                PARSER_DELIMITERS ',');

   or

   CREATE FOREIGN STREAM schema1.transaction_log
       SERVER logfile_server
       OPTIONS (log_path '/path/to/logfile',
                encoding 'UTF-8',
                sleep_interval '10000',
                max_unchanged_stats '10');

   CREATE VIEW schema1.transaction_log_view AS
       SELECT STREAM
           l.r.COLUMN1,
           l.r.COLUMN2,
           l.r.COLUMN3,
           l.r.COLUMN4,
           l.r.COLUMN5,
           l.r.COLUMN6
       FROM (SELECT STREAM VARIABLE_COLUMN_LOG_PARSE(message, 6, ',')
             FROM schema1.transaction_log) AS l(r);

Output:

    SELECT STREAM * FROM schema1.transaction_log
    SELECT STREAM * FROM schema1.transaction_log_view

    COLUMN1       COLUMN2                COLUMN3        COLUMN4  COLUMN5     COLUMN6
    John Doe      514 West Bay           Newport Beach  CA       92661-1122  $32.16
    Jane Doe      4919 Lido Sands Drive  Newport Beach  CA       92661       $15.22

 

If splitting the source text at the delimiters yields fewer columns than expected, the remaining columns are left null. If it yields more columns than expected, the extra columns are silently dropped.
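For instance, a quick sqllineClient check in the style of Examples #4 and #5 below (the input string here is hypothetical) shows the padding behavior:

   -- Two fields in the input but four columns requested: c and d come back null.
   values(variable_column_log_parse('abcd,efgh', 'a, b, c, d', ','));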

Example #4:

In this sqllineClient example, two delimiters are specified (comma and semicolon), causing an input string containing both to be split at each occurrence of each delimiter. (The output lines are folded and indented for readability within this document's page width.)

0: jdbc:sqlstream:sdp://bento> !set outputformat csv
0: jdbc:sqlstream:sdp://bento>  values(variable_column_log_parse('abcd,efgh;ijkl,mnop;1234',
                                 'a, b, c, d, e',  ',;'));
'EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0'
'abcd                    ','efgh                    ','ijkl                    ',
                           'mnop                      ','1234                    '

Example #5:

In this example, using sqllineClient, a third delimiter (a space) is specified, causing an input string containing spaces, commas, and semicolons to be split at each occurrence of each delimiter. (The input and output lines are folded and indented for readability within this document's page width.)

0: jdbc:sqlstream:sdp://bento> values(variable_column_log_parse('abc d,ef gh; i jkl,mn op; 12 34',
                                 'a, b, c, d, e', ', ;'));
'EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0'
'abc                          ','d                            ',
                                'ef                            ',
                                'gh                           ','i                            '
1 row selected

0: jdbc:sqlstream:sdp://bento>  values(variable_column_log_parse('abc d,ef gh;i jkl,mn op;
                                 12 34','a, b, c, d, e, f, g, h, i, j', ', ;'));
'EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0','EXPR$0'
'abc                          ','d                            ',
                                'ef                            ',
                                'gh                           ',
                                'i                            ',
                                'jkl                           ',
                                'mn                           ',
                                'op                           ',
                                '12                            ',
                                '34                           '

FIXED_COLUMN_LOG_PARSE

FIXED_COLUMN_LOG_PARSE processes fixed-width fields and automatically converts them to specified SQL types.

Example:

This example demonstrates how to specify column definitions. Each column specification consists of a name, the data type, and a position, given as a beginning index and a length.
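The shape of each entry in PARSER_COLUMNS, inferred from the example below (the FOR clause appears to be optional for the final, open-ended column):

   -- Inferred column-specification grammar (sketch):
   --   <column_name> TYPE <sql_type> START <begin_index> [FOR <length>]
   -- e.g.: severity TYPE VARCHAR(5) START 16 FOR 5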

Input:

   Mar  7 06:30:03 INFO  sendmail[29353]: i27EU2725212: from=<sender@source.com>,

               size=881, class=0, nrcpts=1

   Mar  7 06:30:03 INFO  sendmail[29354]: i27EU2725212: to=<recipient@destination.com>,

               delay=00:00:01

   Mar  7 06:31:49 INFO  ipop3d[29360]: pop3 service init from 10.0.0.1

   Mar  7 06:31:49 ERROR ipop3d[29359]: Login user=recipient host=bar [10.0.0.1] failed

DDL:

   CREATE OR REPLACE FOREIGN STREAM schema1.mail
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'FIXED',
                PARSER_COLUMNS 'entry_time TYPE CHAR(15) START 0 FOR 15,
                                severity TYPE VARCHAR(5) START 16 FOR 5,
                                entry TYPE VARCHAR(84) START 22');

     or

   CREATE OR REPLACE FOREIGN STREAM schema1.mail
       SERVER logfile_server
       OPTIONS (log_path '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10');

   CREATE OR REPLACE VIEW schema1.mailview AS
       SELECT STREAM t.r.entry_time, t.r.severity, t.r.entry
       FROM (SELECT STREAM FIXED_COLUMN_LOG_PARSE
                        (message,
                         'entry_time TYPE CHAR(15) START 0 FOR 15,
                          severity TYPE VARCHAR(5) START 16 FOR 5,
                          entry TYPE VARCHAR(84) START 22') AS r
             FROM schema1.mail) AS t;

Output:

    ENTRY_TIME       SEVERITY  ENTRY
    Mar  7 06:30:03  INFO      sendmail[29353]: i27EU2725212: from=<sender@source.com>,
                               size=881, class=0, nrcpts=1
    Mar  7 06:30:03  INFO      sendmail[29354]: i27EU2725212: to=<recipient@destination.com>,
                               delay=00:00:01
    Mar  7 06:31:49  INFO      ipop3d[29360]: pop3 service init from 10.0.0.1
    Mar  7 06:31:49  ERROR     ipop3d[29359]: Login user=recipient host=bar [10.0.0.1] failed

W3C_LOG_PARSE

W3C_LOG_PARSE supports access to logs generated by W3C-compliant applications such as the Apache web server. The function produces output rows with one column for each specifier; the data types are derived from the log entry description specifiers listed in the Apache mod_log_config specification. Refer to the W3C_LOG_PARSE function description in the SQLstream SQL Reference Guide for details.

Example #1:

The input in this example is taken from an Apache log file and is representative of the COMMON log format.

Input:

(192.168.254.30 - John [24/May/2004:22:01:02 -0700]

                    "GET /icons/apache_pb.gif HTTP/1.1" 304 0),

(192.168.254.30 - Jane [24/May/2004:22:01:02 -0700]

                    "GET /icons/small/dir.gif HTTP/1.1" 304 0);

DDL:
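The DDL did not survive in this legacy topic. A minimal sketch, assuming the same logfile_server options as the earlier examples and a view that calls W3C_LOG_PARSE with the COMMON format (the stream name weblog_raw and the log path are illustrative):

   CREATE FOREIGN STREAM "Schema1".weblog_raw
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/access_log',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10');

   CREATE OR REPLACE VIEW "Schema1".weblog AS
       SELECT STREAM t.r.COLUMN1, t.r.COLUMN2, t.r.COLUMN3, t.r.COLUMN4,
              t.r.COLUMN5, t.r.COLUMN6, t.r.COLUMN7
       FROM (SELECT STREAM W3C_LOG_PARSE(message, 'COMMON')
             FROM "Schema1".weblog_raw) AS t(r);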

Output:

select stream CAST(COLUMN1 AS VARCHAR(15)) as COL1,
              CAST(COLUMN2 AS VARCHAR(5)) AS COL2,
              CAST(COLUMN3 AS VARCHAR(5)) as COL3,
              CAST(COLUMN4 AS VARCHAR(30)) AS COL4,
              CAST(COLUMN5 AS VARCHAR(34)) AS COL5,
              CAST(COLUMN6 AS VARCHAR(5)) AS COL6,
              CAST(COLUMN7 AS VARCHAR(5)) AS COL7 from "Schema1".weblog;

COL1            COL2  COL3  COL4                          COL5                               COL6  COL7
192.168.254.30  -     John  [24/May/2004:22:01:02 -0700]  GET /icons/apache_pb.gif HTTP/1.1  304   0
192.168.254.30  -     Jane  [24/May/2004:22:01:02 -0700]  GET /icons/small/dir.gif HTTP/1.1  304   0

 

The specification of COMMON in the FROM clause means the Common Log Format (CLF), which uses the specifiers %h %l %u %t "%r" %>s %b. All specifiers are described in the W3C_LOG_PARSE section of the SQLstream SQL Reference Guide.

The table below, Specifiers used by the Common Log Format, describes the specifiers used when the FROM clause specifies COMMON:

Specifiers used by the Common Log Format

   Output column   COMMON Log Format specifier   Returns
   COLUMN1         %h                            The IP address of the remote host
   COLUMN2         %l                            The remote logname
   COLUMN3         %u                            The remote user
   COLUMN4         %t                            The time
   COLUMN5         "%r"                          The first line of the request
   COLUMN6         %>s                           The status; for internally redirected requests,
                                                 the status of the original request (%...>s gives
                                                 the status of the last)
   COLUMN7         %b                            The number of bytes sent, excluding HTTP headers

 

Example #2:

The DDL in this example shows how to rename output columns and filter out unneeded columns.

DDL:

   CREATE OR REPLACE VIEW "Schema1".weblogreduced AS
       SELECT STREAM CAST(s.COLUMN3 AS VARCHAR(5)) AS LOG_USER,
              CAST(s.COLUMN1 AS VARCHAR(15)) AS ADDRESS,
              CAST(s.COLUMN4 AS VARCHAR(30)) as TIME_DATES
       FROM "Schema1".weblog s;

Output:

    +----------+-----------------+--------------------------------+
    | LOG_USER |     ADDRESS     |           TIME_DATES           |
    +----------+-----------------+--------------------------------+
    | Jane     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |
    | John     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |
    +----------+-----------------+--------------------------------+

SYS_LOG_PARSE

SYS_LOG_PARSE processes entries commonly found in UNIX/Linux system logs. System log entries start with a timestamp, followed by a free-form text field. SYS_LOG_PARSE output consists of two columns: the first, named "COLUMN1", is of SQL data type TIMESTAMP; the second, named "COLUMN2", is of SQL type VARCHAR.
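A quick way to see the two-column output is to call the function on a literal, in the style of the sqllineClient examples above (a hypothetical one-off check):

   values(sys_log_parse('Mar  7 06:31:49 INFO  ipop3d[29360]: pop3 service init from 10.0.0.1'));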

Example:

The sample data in this example are representative of log entries from a mail server application.

Input:

Mar  7 06:30:03 INFO  sendmail[29353]: i27EU2725212: from=<sender@source.com>, size=881,
class=0, nrcpts=1

Mar  7 06:30:03 INFO  sendmail[29354]: i27EU2725212: to=<recipient@destination.com>
delay=00:00:01

Mar  7 06:31:49 INFO  ipop3d[29360]: pop3 service init from 10.0.0.1

Mar  7 06:31:49 ERROR ipop3d[29359]: Login user=recipient host=bar [10.0.0.1] failed

DDL:

   CREATE OR REPLACE FOREIGN STREAM schema1.mail
       SERVER logfile_server
       OPTIONS (LOG_PATH '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10',
                PARSER 'SYSLOG');

   or

   CREATE OR REPLACE FOREIGN STREAM schema1.mail
       SERVER logfile_server
       OPTIONS (log_path '/path/to/logfile',
                ENCODING 'UTF-8',
                SLEEP_INTERVAL '10000',
                MAX_UNCHANGED_STATS '10');

   CREATE OR REPLACE VIEW schema1.mailview AS
       SELECT STREAM t.r.COLUMN1, t.r.COLUMN2
       FROM (SELECT STREAM SYS_LOG_PARSE(message)
             FROM schema1.mail) as t(r);

Output:

   SELECT STREAM * FROM schema1.mailview
   SELECT STREAM * FROM schema1.mail

   COLUMN1           COLUMN2
   Mar  7 06:30:03   INFO sendmail[29353]: i27EU2725212: from=<sender@source.com>,
                     size=881, class=0, nrcpts=1
   Mar  7 06:30:03   INFO sendmail[29354]: i27EU2725212: to=<recipient@destination.com>,
                     delay=00:00:01
   Mar  7 06:31:49   INFO ipop3d[29360]: pop3 service init from 10.0.0.1
   Mar  7 06:31:49   ERROR ipop3d[29359]: Login user=recipient host=bar [10.0.0.1] failed

 

Another example using PARSER_COLUMNS:

   CREATE OR REPLACE FOREIGN STREAM schema1."FX"
       SERVER logfile_server
       OPTIONS (
        LOG_PATH '/path/to/logfile',
        ENCODING 'UTF-8',
        SLEEP_INTERVAL '1000',
        MAX_UNCHANGED_STATS '10',
        PARSER 'SYSLOG',
        PARSER_COLUMNS 'tstamp TYPE VARCHAR(20), message TYPE VARCHAR(100)');
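With PARSER_COLUMNS in place, the renamed columns can be selected directly from the stream (a minimal sketch):

   SELECT STREAM tstamp, message FROM schema1."FX";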

REGEX_LOG_PARSE

REGEX_LOG_PARSE applies a regular expression to each input row and uses the expression's groups to generate the output columns.

The following sample data set will be used in the example below:

    2006-09-20 01:00:00,amy,dep,1
    2006-09-20 01:15:00,amy,with,2
    2006-09-20 01:37:00,amy,dep,3

Code for example:

   CREATE OR REPLACE FOREIGN STREAM schema1."FX"
       SERVER logfile_server
       OPTIONS (
        LOG_PATH '/path/to/logfile',
        ENCODING 'UTF-8',
        SLEEP_INTERVAL '1000',
        MAX_UNCHANGED_STATS '10',
        PARSER 'REGEX',
        PARSER_COLUMNS 'tstamp TYPE TIMESTAMP, username TYPE VARCHAR(10),
                        utype TYPE VARCHAR(4), xid TYPE INTEGER',
        PARSER_FORMAT '([^,]*),([^,]*),([^,]*),([^,]*)');

Output:

   +---------------------+------------+-------+------------+
   |       TSTAMP        |  USERNAME  | UTYPE |    XID     |
   +---------------------+------------+-------+------------+
   | 2006-09-20 01:00:00 | amy        | dep   | 1          |
   | 2006-09-20 01:15:00 | amy        | with  | 2          |
   | 2006-09-20 01:37:00 | amy        | dep   | 3          |
   +---------------------+------------+-------+------------+

 

The PARSER_FORMAT line establishes the expectation of a comma-delimited log. Each field is represented by a regular expression group, enclosed in parentheses. The regular expression [^,]* matches zero or more non-comma characters.
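Like the other parsing functions, REGEX_LOG_PARSE can presumably also be applied in a SELECT over a raw stream; a sketch, assuming default COLUMN1..COLUMN4 output names as described for FAST_REGEX_LOG_PARSE below (the stream name "FX_raw" and view name are illustrative):

   CREATE OR REPLACE VIEW schema1."FX_view" AS
       SELECT STREAM t.r.COLUMN1, t.r.COLUMN2, t.r.COLUMN3, t.r.COLUMN4
       FROM (SELECT STREAM REGEX_LOG_PARSE(message, '([^,]*),([^,]*),([^,]*),([^,]*)')
             FROM schema1."FX_raw") AS t(r);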

FAST_REGEX_LOG_PARSE

FAST_REGEX_LOG_PARSE operates like REGEX_LOG_PARSE but with no backtracking. It processes entries commonly found in UNIX/Linux system logs by using a regular expression to create the output columns, which are named "COLUMN1", "COLUMN2", and so forth.

The following sample data set will be used in the example below:

    2006-09-20 01:00:00,amy,dep,1
    2006-09-20 01:15:00,amy,with,2
    2006-09-20 01:37:00,amy,dep,3

 

   CREATE OR REPLACE FOREIGN STREAM schema1."FX"
       SERVER logfile_server
       OPTIONS (
        LOG_PATH '/path/to/logfile',
        ENCODING 'UTF-8',
        SLEEP_INTERVAL '1000',
        MAX_UNCHANGED_STATS '10',
        PARSER 'FASTREGEX',
        PARSER_COLUMNS 'tstamp TYPE TIMESTAMP, username TYPE VARCHAR(10),
                        utype TYPE VARCHAR(4), xid TYPE INTEGER',
        PARSER_FORMAT '([^,]*),([^,]*),([^,]*),([^,]*)');

Output:

   +---------------------+------------+-------+------------+
   |       TSTAMP        |  USERNAME  | UTYPE |    XID     |
   +---------------------+------------+-------+------------+
   | 2006-09-20 01:00:00 | amy        | dep   | 1          |
   | 2006-09-20 01:15:00 | amy        | with  | 2          |
   | 2006-09-20 01:37:00 | amy        | dep   | 3          |
   +---------------------+------------+-------+------------+

 

The PARSER_FORMAT line establishes the expectation of a comma-delimited log. Each field is represented by a regular expression group, enclosed in parentheses. The regular expression [^,]* matches zero or more non-comma characters.