Generating Test Data with DataGen

<< Click to Display Table of Contents >>

Navigation:  »No topics above this level«

Generating Test Data with DataGen

Return to chapter overview

DataGen is a program which generates test data for SQLstream applications, and pushes the data into a defined s-Server stream using the SQLstream JDBC Driver .

It is supplied either as part of the distributed SQLstream product or as part of the Client Tools download from the SQLstream website (via SQLstream-5.1.0-clienttools-linux.run or SQLstream-client-tools-5.1.0-windows.exe).

DataGen uses two kinds of spec files written in XML, stored in a directory called "config." These can be combined into the same XML file, though separating the two is a recommended process.

A TupleRule XML document tells Datagen what kind of data to generate, describing the rules for generating sample data for each column in a tuple (row). The format of a TupleRule document is defined by TupleRule.xsd.

A DataSet XML document tells Datagen how to deliver this data, in terms of timeframe, pattern, intensity, and so on, describing the rules for generating a sequence of tuples. The format of a DataSet document is defined by DataSet.xsd. A DataSet doc may either refer to an external TupleRule file (recommended) or may contain an embedded TupleRule element.

SQLstream s-Server ships with sample spec files, which are installed in $opt/sqlstream/4.0.XXX/clienttools/datagen/conf/samples.

However, you will most likely want to write TupleRules and DataSets which mimic your real-world conditions.

DataGen performs a validating parse of TupleRule and DataSet instance documents using the XSD schema definition documents. To take advantage of schema validation and self-documenting annotations in the XSDs, we recommend you use a a schema-aware authoring tool such as XMLSpy to create TupleRule and DataSet instance documents.

Executing DataGen

You execute DataGen from the command line as follows.

Linux:

datagen.sh [options] rulefile rulefile ...

 

Windows:

datagen.cmd [options] rulefile rulefile ...

 

where "rulefile" is a spec file such as sales.bids.random.set.xml

DataGen Options

Options are specified from the command line and are defined below. In Linux typical use might be something like the following:

datagen.sh -t -v -u jdbc:sqlstream:sdp://localhost:5570 conf/samples/sales.bids.fast.set.xml

A Windows example would work similarly, with "datagen.cmd" substituted for "datagen.sh".

Option

Parameters

Explanation

-c, --config=

<dir>

Specifies filesystem path to local DataGen config directory.

--repeat

<num>

number of times to repeat the data set default is 1

--streamSuffixBegin

<num>

Specifies beginning value of the integer suffix for the stream name. Not commonly used.

--streamSuffixEnd

<num>

Specifies ending value of the integer suffix for the stream name. Not commonly used.

-f, --file=

<filename>

Specifies filesystem path to output CSV file. Only applicable if "csv" selected for sinktype.

-h, --help


Displays help.

-n, --username=

<username>

Specifies username (if necessary) for connection to SQLstream server.

-p, --password=

<password>

Specifies password (if necessary) for connection to SQLstream server.

-s, --sink=

<sinktype>

Specifies data sink type: "stream", "null", or "csv"
Default=stream.

-t, --timing


Requests a final message with times & row rates, which goes to stdout.

-u, --uri=

<uri>

Specifies JDBC URI for SQLstream server.

Example: jdbc:sqlstream:sdp://host:port

-v, --verbose


Enable verbose output. This option requests a few start/stop/progress messages, which go to stdout.

XML spec files

XML files are specified from the command line after options, and can be one or one or more Tuple or DataSet rules.

<TupleRule>

The TupleRule determines the content and form of the date itself. You write these to mimic the kind of data you want to test in SQLstream s-Server.

The tag itself contains two attributes

note: Optional note to describe purpose or characteristics of this tuple rule, for change history and so on. Attributes: date, version.

sink: Data sink specification. Attributes: schema (name for sink object), name (stream, view, table)

The following rules are enclosed within the <TupleRule> tag. Any of the following rules can be specified in any order. Attributes are defined in another table following.

Rule

Description

Attributes and Example

LongDataRule

Rule for generating long data.

CommonRuleAttributes: name, randomize, seed, wrap.

LongRuleAttributes: first, last, step

Example:

<LongDataRule name="shares" first="300" last="700" step="20"/>

 

This rule would generate a series of values for a column named "shares," starting at 300 and increasing in increments of 20 until it reaches 700.

TimeDataRule

Rule for generating datetime data.

CommonRuleAttributes: name, randomize, seed, wrap.

TimeRuleAttributes: first, last, pattern, step.

Example:

<TimeDataRule name="time" first="*" last="*+7200000" step="600000"/>

 

Note: "*" tells Datagen to start with current time. This rule would generate a series of values for a column named "time," starting at the current time when the rule is called and increasing in increments of 600000 milliseconds until it reaches 7200000 milliseconds from the starting time

DoubleDataRule

Rule for generating double-precision data.

CommonRuleAttributes: name, randomize, seed, wrap.

DoubleRuleAttributes: first, last, precision, step.

Example:

<DoubleDataRule name="price" first="15.00" last="20.00" step="0.50" precision="2"/>

 

This rule uses a double value in order to implement floating point variables (15.5, 16.5, 17.5, and so on). See http://www.homeandlearn.co.uk/java/double_variables.html for more information. This rule would generate a series of values for a column named "price," starting at 15 and increasing in increments of .50 until it reaches 20.

 

Note: Precision refers to the number of decimal points retained. If undeclared or zero, all decimal points are preserved

StringDataRule

Rule for generating string data.

CommonRuleAttributes: name, randomize, seed, wrap. StringRuleAttributes: phraseCount

Example:

<StringDataRule name="_commentAlpha">

       <item>A</item>

       <item>B</item>

       <item>C</item>

     </StringDataRule>

 

This rule would generate a series of values for a column called "_commentAlpha", starting at A and ending at C. Note: Any number of items can be defined using the <item> tag.

CompoundDataRule

Rule for generating formatted data from multiple constituent rules.

Name, format.

CompoundDataRule also constituent DataRules enclosed in its tag.

Example:

<CompoundDataRule name="comment" format="sample comment">

     <StringDataRule name="_commentAlpha">

       <item>A</item>

       <item>B</item>

       <item>C</item>

     </StringDataRule>

     <LongDataRule name="_commentNum" first="1" last="3" step="1"/>

   </CompoundDataRule>

 

This rule would generate a concatenated column called "comment," with values "A1," "B2," and "C3".

Definition of attributes for above

Attribute

Type

Description

Values

name

STRING

Name of column

 

randomize

BOOLEAN

Whether randomization should be done

true => generate randomized data;

false => generate linear data that might also wrap

seed

LONG

A positive randomization seed

Only used if randomize is true.

wrap

BOOLEAN

Wrap to beginning of data range.

true, false

Only applies if randomize is false.

first

Use the same type as in the rule being applied.

Used as starting or minimum value in generating data values.

 

last

Use the same type as in the rule being applied.

Used as ending or maximum value.

 

step

Use the same type as in the rule being applied.

Step size (default is 1).

0

precision

INT

Precision refers to the number of decimal points retained. If undeclared or zero, all decimal points are preserved.

< 0 means ALL (only applies to DoubleDataRule)

<TupleRule> Example

The rule below produces tuples with the following columns: time, ticker, shares, price, expiryMs, comment.

<?xml version="1.0" encoding="UTF-8"?>

<TupleRule

       xmlns="http://www.SQLstream.com/xml/datagen/0"

       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

       xsi:schemaLocation="http://www.SQLstream.com/xml/datagen/0 ../TupleRule.xsd">

   <note version="0.3" date="2008-06-18">linear rules for sales.bids stream</note>

   <sink schema="SALES" name="BIDS" />

   <!-- stream columns: time, ticker, shares, price, expiryMs, comment -->

   <TimeDataRule name="time" first="*" last="*+7200000" step="600000"/>

   <StringDataRule name="ticker">

       <item>SQLS</item>

       <item>ORCL</item>

       <item>MSFT</item>

       <item>ADBE</item>

   </StringDataRule>

   <LongDataRule name="shares" first="300" last="700" step="200"/>

   <DoubleDataRule name="price" first="15.00" last="20.00" step="0.50" precision="2"/>

   <LongDataRule name="expiryMs" first="600000" last="18000000" step="60000"/>

   <CompoundDataRule name="comment" format="sample comment %s-%d">

     <StringDataRule name="_commentAlpha">

       <item>A</item>

       <item>B</item>

       <item>C</item>

     </StringDataRule>

     <LongDataRule name="_commentNum" first="1" last="3" step="1"/>

   </CompoundDataRule>

</TupleRule>

DataGen and rowtimes

When working with streams, ROWTIME is a special column. ROWTIME is typically a system-defined column of type TIMESTAMP, included in every stream definition. Rows in a stream always have monotonically ascending values for ROWTIME. However, you can also explicitly set a value for ROWTIME. When you do so, the explicit value of ROWTIME substitutes for the explicit system-defined ROWTIME column.

To specify a ROWTIME value, you will need to write a TimeDataRule with a column named ROWTIME. The name "ROWTIME" tells s-Server to substitute this value for the system-generated value.

Using fixed ROWTIMEs

When you set an explicit ROWTIME, you cause the sink (target) stream rowtime bounds to be set to a datetime in the past.

Before running DataGen with an explicitly defined ROWTIME, you should restart SQLstream s-Server to reset the sink stream's rowtimes before running a DataSet with fixed rowtimes Otherwise, rows from the second run will be rejected as being older than the last ROWTIME as set by the 1st run. Restarting SQLstream leaves stream rowtime bounds unconstrained until the first row (or rowtime bound) has been processed.

Note: Even you set ROWTIME to a future time, are used in the test, the start timestamp is still earlier than the end timestamp, and the SQLstream s-Server should be restarted to clear stream rowtime bounds.

You can also clear stream rowtimes by dropping and recreating the stream.

Using current ROWTIMEs

Current times can be expressed using simple contemporaneous date-time specs wherever a timestamp value is required:

"*"

current-time when constructor runs.

"**"

current-time each time a row is generated

"*+nnnn"

current-time plus delta in milliseconds.

"*-nnnn"

current-time minus delta in milliseconds.

<DataSet> Attributes

The <DataSet> tag determines how data is delivered. The following attributes, specified in a DataSet document, describe the amount, pattern, and frequency of rows generated. The following are all attributes of the <DataSet> tag.

Attribute

Type

Description

start

 

DATE

Date/time when DataGen should begin streaming data. If value is zero, DataGen will start immediately.

Example:

2014-10-22T21:30:00.0Z

stop

DATE

Date/time when DataGen should stop streaming data. If value is zero, DataGen will run forever or until tuple count is reached.

Example:

2014-10-22T21:30:00.0Z

commit

BOOLEAN

Auto or manual commit mode.

True = Autocommit on

tupleStyle

CHAR

The tupleStyle enum must be one of {none, steady, random, bursty}. These indicate the pattern of data generation:

none. This generates no tuples, rowtime bounds (punctuation) are sent if punctStyle=TUPLE.

steady: tuples delivered by a millisecond interval determined by tupleInterval

random: tuples delivered at random interval, <i>n</i> millis, where 0 <= <i>n</i> <= tupleInterval

bursty: tuples delivered in bursts set by burstCount. Interval between tuples in milliseconds is determined by tupleInterval. Interval between bursts in milliseconds is determined by burstInterval.

tupleCount

LONG

Total tuples (rows) to be generated.

tupleInterval

LONG

Milliseconds between tuples (rows) generated.

burstCount

LONG

Number of tuples per burst.

Only applies if bursty is selected for tupleStyle.

burstInterval

LONG

Milliseconds between bursts.

Only applies if bursty is selected for tupleStyle.

punctStyle

STATIC ENUM

The punctStyle enum must be one of {none, tuple, burst}. Whether and how to send a rowtime bound.

none: no explicit rowtime bound (punctuation) sent.

tuple: explicit rowtime bound (punctuation) sent after each tuple.

burst: explicit rowtime bound (punctuation) sent after each burst.

finalPunct

DATE

Date/time of final rowtime bound. If value is zero, no final rowtime bound will be sent.

finalPause

LONG

Milliseconds between last tuple and stream close.

maximumRate

LONG

Maximum rate in tuples per second.

A DataSet document can also have a <note> tag, which has version and date attributes and contains a description of the dataset:

 <note version="0.1" date="2007-07-30">5000-row bursty test with linear data and burst punctuation</note>

 

A DataSet document can also link to an external TupleRuleFile document. We recommend defining TupleRules in external documents. This lets you define multiple datasets using the same TupleRule:

 <TupleRuleFile href="sales.bids.linear.row.xml"/>

Example

The following example defines a data set and links to an external TupleRule file.

<?xml version="1.0" encoding="UTF-8"?>

<DataSet

 xmlns="http://www.SQLstream.com/xml/datagen/0"

 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

 xsi:schemaLocation="http://www.SQLstream.com/xml/datagen/0 ../DataSet.xsd"

 start="2006-10-22T21:30:00.0Z"

 stop="2006-10-22T21:30:30.0Z"

 commit="auto"

 tupleStyle="bursty"

 tupleCount="5000"

 tupleInterval="0"

 burstCount="25"

 burstInterval="500"

 punctStyle="burst"

 finalPause="2000" >

 <note version="0.1" date="2007-07-30">5000-row bursty test with linear data and burst punctuation</note>

 <TupleRuleFile href="sales.bids.linear.row.xml"/>

</DataSet>

Combined Example

You can combine <DataSet> and <TupleRule> in the same file (though this practice is not recommended because it provides less flexibility in terms of separating data format from data delivery.

<?xml version="1.0" encoding="UTF-8"?>

<DataSet

 xmlns="http://www.SQLstream.com/xml/datagen/0"

 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

 xsi:schemaLocation=

       "http://www.SQLstream.com/xml/datagen/0 ../../datagen/DataSet.xsd"

 start="2006-10-22T21:30:00.0Z"

 stop="2006-10-22T21:30:30.0Z"

 tupleStyle="steady"

 tupleCount="5000"

 tupleInterval="100"

 burstCount="0"

 burstInterval="0"

 punctStyle="none"

 finalPause="2000" >

 <note version="0.1" date="2008-07-10">5000 random bids on several tickers, 10/sec</note>

 <TupleRule>

   <note version="0.1"

         date="2008-07-10">random rules for sales.bids stream</note>

   <sink schema="SALES" name="BIDS" />

   <!-- stream columns: time, ticker, shares, price, expiryMs, comment -->

   <TimeDataRule name="time" randomize="true" seed="13" first="*" last="*+7200000"/>

   <StringDataRule name="ticker" randomize="true" seed="17">

       <item>SQLS</item>

       <item>ORCL</item>

       <item>MSFT</item>

       <item>ADBE</item>

   </StringDataRule>

   <LongDataRule name="shares" randomize="true" seed="19"

       first="300" last="700" step="100"/>

   <DoubleDataRule name="price" randomize="true" seed="23"

       first="15.00" last="200.00" step="5.00" precision="2"/>

 </TupleRule>

</DataSet>

 

DataGen property file

Using the property file datagen.properties you can specify URI, username and password. For SQLstream these properties can usually be defaulted:

Property name

Default

Description

datagen.datasink.jdbc.uri

jdbc:sqlstream:

default sink (target) JDBC URI

datagen.datasink.jdbc.user

sa

username for server connections

datagen.datasink.jdbc.pass

(none)

password for server connections