Regex Split


The Regex Split command parses (separates) a character string using Java regular expression patterns, as defined in java.util.regex.Pattern. For more information on using regular expressions, see http://docs.oracle.com/javase/tutorial/essential/regex/

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).


To use the Regex Split parser, you select the column to which you want to apply a Java Regular Expression pattern, then enter the pattern in the regular expression box. This topic provides a cursory explanation of RegEx, but you may want to consult an expert for help in creating RegEx strings.

For example, the following regular expression contains two groups, each capturing zero or more digits ([0-9]) from the string, so it returns two columns:

[^0-9]*([0-9]*)[^0-9]*([0-9]*)[^0-9]*

So if the selected column contained a string like

'abcde111fghij22klm'

the columns returned would be the following:

         +----------+-----------+
         | COLUMN1  | COLUMN2   |
         +----------+-----------+
         | 111      | 22        |
         +----------+-----------+
         1 row selected
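The mapping from groups to columns can be tried outside StreamLab with plain java.util.regex. The following is a minimal sketch, not StreamLab's implementation: the class RegexSplitSketch and its split helper are hypothetical names, and it assumes the pattern must match the whole string. It uses a pattern with two digit-capturing groups against the sample string above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSplitSketch {
    // Hypothetical helper: maps each capturing group to COLUMN1..COLUMNn.
    // Returns an empty map when the pattern does not match the whole input.
    static Map<String, String> split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        Map<String, String> columns = new LinkedHashMap<>();
        if (m.matches()) {
            // groupCount() is the number of capturing groups in the pattern.
            for (int i = 1; i <= m.groupCount(); i++) {
                columns.put("COLUMN" + i, m.group(i));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String regex = "[^0-9]*([0-9]*)[^0-9]*([0-9]*)[^0-9]*";
        // Prints {COLUMN1=111, COLUMN2=22}
        System.out.println(split(regex, "abcde111fghij22klm"));
    }
}
```

Only the parenthesized parts of the pattern become columns; the [^0-9]* segments outside the groups consume the letters without returning them.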

If the Regular Expression you enter is invalid, StreamLab returns an error.
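To check a pattern before entering it, you can compile it with java.util.regex directly. This is a minimal sketch with a hypothetical helper named validate; how StreamLab itself reports the error is not shown here and may differ.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexValidationSketch {
    // Hypothetical helper: returns null when the pattern compiles,
    // otherwise the description of the syntax error.
    static String validate(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("([0-9]*)")); // valid pattern: prints null
        System.out.println(validate("([0-9]*"));  // missing ")": prints an error description
    }
}
```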

You can choose either Full or Fast parsing. Full is more accurate, while Fast, as its name suggests, runs more quickly.

Full

Columns are based on the match groups defined in the regular expression. Each group defines a column, and the groups are processed from left to right. If the regular expression does not match the string in the selected column, NULL is returned.
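The left-to-right group order and the NULL result on a failed match can be sketched in plain Java. This is an illustration under assumptions, not StreamLab's code: the class FullParseSketch and the parse helper are hypothetical, the pattern is assumed to apply to the whole string, and a Java null stands in for the NULL result.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullParseSketch {
    // Hypothetical helper: one array element per group, in left-to-right order.
    // Returns null (standing in for NULL) when the pattern does not match.
    static String[] parse(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1); // groups are numbered from 1
        }
        return columns;
    }

    public static void main(String[] args) {
        String regex = "([a-z]+)=([0-9]+)";
        System.out.println(Arrays.toString(parse(regex, "port=8080")));  // [port, 8080]
        System.out.println(parse(regex, "no digits here") == null);      // true
    }
}
```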

Fast

Fast parsing uses FAST_REGEX_LOG_PARSE, which works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed-length portions at the start of an expression are moved to the end of the previous expression. If an expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parlance, "lazy" means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step.)
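The difference between lazy and greedy matching can be seen with java.util.regex quantifiers. The sketch below only illustrates the two terms; it does not reproduce the decomposition or the no-backtracking evaluation described above, and the class and helper names are hypothetical. Note that the greedy case relies on backtracking, which the Fast parser does not perform.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedySketch {
    // Returns the text captured by group 1, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a1b2c3";
        // Greedy: .* takes as much as it can, then backs off until [0-9] matches.
        System.out.println(firstGroup("(.*)[0-9]", input));  // a1b2c
        // Lazy: .*? takes as little as it can before [0-9] matches.
        System.out.println(firstGroup("(.*?)[0-9]", input)); // a
    }
}
```

Because the two modes can capture different text from the same input, a pattern that depends on greedy backtracking may produce different columns under Fast parsing than under Full parsing.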

Quick Regex Reference

For full details on Regex, see java.util.regex.Pattern

[xyz]        Find single character of: x, y or z

[^abc]        Find any single character except: a, b, or c

[r-z]        Find any single character between r-z

[r-zR-Z]        Find any single character between r-z or R-Z

^        Start of line

$        End of line

\A        Start of string

\z        End of string

.        Any single character

\s        Find any whitespace character

\S        Find any non-whitespace character

\d        Find any digit

\D        Find any non-digit

\w        Find any word character (letter, number, underscore)

\W        Find any non-word character

\b        Find any word boundary

(...)        Capture everything enclosed

(x|y)        Find x or y (also works with symbols such as \d or \s)

x?        Find zero or one of x (also works with symbols such as \d or \s)

x*        Find zero or more of x (also works with symbols such as \d or \s)

x+        Find one or more of x (also works with symbols such as \d or \s)

x{3}        Find exactly 3 of x (also works with symbols such as \d or \s)

x{3,}        Find 3 or more of x (also works with symbols such as \d or \s)

x{3,6}        Find between 3 and 6 of x (also works with symbols such as \d or \s)
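A few of the entries above can be checked with a short Java program. This is a minimal sketch with hypothetical class and helper names; it tests whether the whole input matches each pattern. In Java source code, each backslash in a pattern must be doubled inside a string literal (for example "\\d" for \d), which is an assumption that applies to Java code only, not to text typed into the regular expression box.

```java
import java.util.regex.Pattern;

class QuickReferenceSketch {
    // True when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[r-z]", "s"));       // true: s is between r and z
        System.out.println(matches("[^abc]", "a"));      // false: a is excluded
        System.out.println(matches("\\d{3}", "123"));    // true: exactly three digits
        System.out.println(matches("\\d{3,6}", "12"));   // false: fewer than three digits
        System.out.println(matches("(cat|dog)\\s\\w+", "dog food")); // true
        System.out.println(matches("x?y+", "yyy"));      // true: zero x, then one or more y
    }
}
```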