Writing to Hadoop/HDFS





The Hadoop ECDA adapter writes files to a designated Hadoop file system location.

When you write to it, the adapter formats each file and uploads it to the location designated by the HDFS_OUTPUT_DIR property.

This file can be formatted as CSV, XML, JSON, or BSON. See Writing to Files with the ECD Adapter and Agent for more details. These files do not require further formatting in order to be integrated with Hadoop. (Once in the Hadoop file system, these files can be accessed by downstream systems such as Hive. See https://cwiki.apache.org/confluence/display/Hive/Home for more details on Hive.)
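
For example, once these CSV files land in HDFS, Hive can expose them as an external table over the output directory. The following HiveQL is a minimal sketch only; the table name, columns, and location are illustrative and would need to match the files your adapter actually writes:

-- Hypothetical Hive external table over the directory where the adapter writes CSV files.
CREATE EXTERNAL TABLE IF NOT EXISTS bus_positions (
    id INT,
    reported_at TIMESTAMP,
    latitude DOUBLE,
    longitude DOUBLE,
    speed INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/sqlstream/';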

Configuring the ECD Adapter for HDFS

To write data, you first define a server object with connection information, including the directory and information on file rotation. Once you define this server object, you can write to the file system by referencing it. See the topic CREATE SERVER in the s-Server Streaming SQL Reference Guide for more details.

For adapters, you configure and launch the adapter in SQL, using either server or foreign stream/table options. For agents, you configure such options using a properties file and launch the agent at the command line. Many of the options for the ECD adapter and agent are common to all I/O systems. The CREATE FOREIGN STREAM topic in the Streaming SQL Reference Guide has a complete list of options for the ECD adapter.

Note: Because of the nature of streaming data, you will need to set up a pump in order to move rows continually from an s-Server stream to another stream, file, Kafka topic, RDBMS table or other location. Pumps are INSERT macros that continually pass data from one point in a streaming pipeline to the other. A model for setting up a pump is provided below. See the topic CREATE PUMP in the s-Server Streaming SQL Reference Guide for more details.

Required parameters are:

- HADOOP_USER_NAME

- HDFS_OUTPUT_DIR

- FORMATTER

- FILENAME_PREFIX

- FILENAME_SUFFIX

- MAX_BYTES_PER_FILE

To write data to HDFS, you need to create a server object which references the data wrapper ECDA and is of type 'hdfs'.

CREATE OR REPLACE SERVER "ECDAWriterServer_1" TYPE 'hdfs'

FOREIGN DATA WRAPPER ECDA;

 

Note: ECD Adapter server definitions need to reference the ECD foreign data wrapper. You can do so with the syntax FOREIGN DATA WRAPPER ECDA.

Unlike server objects, all foreign streams need to be created in a schema. The following code first creates a schema in which to run the rest of the sample code below, then creates a foreign stream named "hadoop_test"."bus_hdfs_sink".
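
CREATE OR REPLACE SCHEMA "hadoop_test";

SET SCHEMA '"hadoop_test"';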

CREATE or REPLACE FOREIGN STREAM "hadoop_test"."bus_hdfs_sink" (

   "id" INTEGER,

   "shift_no" DOUBLE,

   "reported_at" TIMESTAMP NOT NULL,

   "trip_no" VARCHAR(10),

   "latitude" DOUBLE,

   "longitude" DOUBLE,

   "speed" INTEGER,

   "baring" INTEGER,

   "active" BOOLEAN)

   SERVER "ECDAWriterServer_1"

   OPTIONS (

   "HADOOP_USER_NAME" 'root',

   "HDFS_OUTPUT_DIR" 'hdfs://storm-s3.disruptivetech.com:8020/user/sqlstream/',

   "FORMATTER" 'CSV',

   "CHARACTER_ENCODING" 'UTF-8',

   "QUOTE_CHARACTER" '',

   "SEPARATOR" ',',

   "WRITE_HEADER" 'false',

   "DIRECTORY" '/tmp',

   "FILENAME_PREFIX" 'output-',

   "FILENAME_SUFFIX" '.csv',

   "FILENAME_DATE_FORMAT" 'yyyy-MM-dd-HH-mm-ss',

   "MAX_BYTES_PER_FILE" '1073741',

   "FORMATTER_INCLUDE_ROWTIME" 'false');

 

Options for HDFS

You configure adapter options through foreign streams/tables. You configure agent options through the ECD agent property file.

- hadoop_user_name: User name for HDFS.

- hdfs_output_dir: Address of the HDFS name node, such as hdfs://storm-s3.disruptivetech.com:8020/user/sqlstream/

Sample Property File for ECD Agent

Note: Before using the ECD agent, you need to create a source stream for it. In the example below, you would need to create the foreign stream "HADOOPWRITERSTREAM" (a sketch follows the property file).

 

SEPARATOR=^A

DIRECTORY=/home/sqlstream/customers/sample

FILENAME_PREFIX=test-

FILENAME_DATE_FORMAT=yyyy-MM-dd-HH-mm-ss

MAX_BYTES_PER_FILE=25000000

FILENAME_SUFFIX=.csv

CHARACTER_ENCODING=US-ASCII

ROWTYPE=RecordType(INTEGER COL1, INTEGER COL2, VARCHAR(256) COL3, VARCHAR(256) COL4)

SCHEMA_NAME=HADOOPWRITERTEST

TABLE_NAME=HADOOPWRITERSTREAM

HDFS_OUTPUT_DIR=hdfs://storm-s3.disruptivetech.com:8020/user/sample/

NOCLOBBER=false
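
As referenced in the note above, the stream named by SCHEMA_NAME and TABLE_NAME must exist before you start the agent. The following is a minimal sketch, shown here as an ordinary (native) stream for brevity; adjust if your pipeline requires a foreign stream. The column names and types are taken from the ROWTYPE property above.

CREATE OR REPLACE SCHEMA "HADOOPWRITERTEST";
SET SCHEMA '"HADOOPWRITERTEST"';

CREATE OR REPLACE STREAM "HADOOPWRITERSTREAM" (
    "COL1" INTEGER,
    "COL2" INTEGER,
    "COL3" VARCHAR(256),
    "COL4" VARCHAR(256));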

 

Invoking the ECD Agent

From the directory $SQLSTREAM_HOME/../clienttools/EcdaAgent/, enter the following at the command line:

$ ./commondataagent.sh --output --props sample.test.properties --io hdfs