OSH Basic Tutorial
DataStage
Skill Level: Intermediate
Blayne Chard
Software Engineer
IBM
08 Feb 2007
Create a simple DataStage operator, then learn how to load the operator into
DataStage Designer. An operator is the basic building block of a DataStage job.
Operators can read records from input streams, modify or use the data from the input
stream, and then write the results to an output stream.
Objectives
In this tutorial you learn:
1. How to create a simple DataStage operator
2. How to compile the operator on Windows
3. How to run the operator from an OSH script
4. How to load your operator into the DataStage Designer so you can use it
on any job you create
Prerequisites
This tutorial is written for Windows programmers whose skills and experience are at
an intermediate level. You should have a solid understanding of IBM WebSphere
DataStage and a working knowledge of the C++ language.
System requirements
To run the examples in this tutorial, you need a Windows computer with the
following:
• Microsoft Visual Studio .NET 2003
• IBM WebSphere DataStage 8.0
• MKS Toolkit
MyHelloWorld
The first operator takes one input stream and one output stream. MyHelloWorld
takes a single column as input: an integer that determines how many times
"Hello World!" is printed into one of the columns in the output stream. The output
stream consists of two columns: a counter showing how many times "Hello World!"
was printed, and the printed result. Alongside the input and output streams, there is
one option, "uppercase", which determines whether the text "Hello World!" is printed
in uppercase or not.
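Before diving into the DataStage API, the operator's per-row behavior can be sketched in plain, self-contained C++. The function name and types here are illustrative only (they are not part of the operator or the Orchestrate framework); the real operator expresses the same logic through cursors and accessors:

```cpp
#include <string>
#include <utility>

// Illustrative sketch of what MyHelloWorld computes for one input row:
// given inCount and the uppercase option, produce (outCount, outHello).
std::pair<int, std::string> helloWorldRow(int inCount, bool uppercase) {
    const std::string hello = uppercase ? "HELLO WORLD!" : "Hello World!";
    std::string outHello;
    for (int i = 0; i < inCount; i++) {
        outHello += hello;  // "Hello World!" is printed inCount times
    }
    // outCount simply mirrors the input counter
    return std::make_pair(inCount, outHello);
}
```

For example, an input row of 2 with uppercase on yields the row (2, "HELLO WORLD!HELLO WORLD!").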
The parameter description shown below tells the operator that there is one optional
parameter, "uppercase", and that there is exactly one input and one output stream.
#define HELLO_ARGS_DESC \
"{ uppercase={optional," \
" description='capitalize or not'" \
" }," \
" otherInfo={ " \
" inputs={ " \
" input={ " \
" description='source data for helloworld', " \
" once" \
" } " \
" }, " \
" outputs={" \
" output={ " \
" description='output data for helloworld', " \
" minOccurrences=1, maxOccurrences=1 " \
" } " \
" }, " \
" description='myhelloworld operator:' " \
" } " \
"}"
You only need the functionality of APT_Operator, as you only need one
operator. APT_CompositeOperator is not needed, and because you are not wrapping
this operator around a third-party process, APT_SubProcessOperator is not required
either. To use this base class, you must implement two virtual methods,
describeOperator() and runLocally(). Also, because you have an input parameter,
uppercase, a third method, initializeFromArgs_(), is required.
The basic class definition for the MyHelloWorld operator looks like the code shown
below:
class APT_MyHelloWorldOp : public APT_Operator
{
APT_DECLARE_RTTI(APT_MyHelloWorldOp);
APT_DECLARE_PERSISTENT(APT_MyHelloWorldOp);
public:
APT_MyHelloWorldOp();
protected:
virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
APT_Operator::InitializeContext context);
virtual APT_Status describeOperator();
virtual APT_Status runLocally();
private:
bool uppercase_;
};
The macros in this definition play an important role in defining the operator.
• APT_DEFINE_OSH_NAME -- This macro defines the OSH name of this
operator. The OSH name is the way DataStage references operators and
is used whenever this operator is referenced.
• APT_IMPLEMENT_RTTI_ONEBASE and APT_DECLARE_RTTI -- These
macros set the runtime type identification for your operator.
• APT_IMPLEMENT_PERSISTENT and APT_DECLARE_PERSISTENT --
These macros declare that this operator can be serialized and moved to a
processing node.
To set up the interface for the streams, the operator needs to know what type of data
to expect in each of the streams. This is specified using
setInputInterfaceSchema() and setOutputInterfaceSchema(). Both of
these methods take two parameters, an APT_String, and an integer. The
APT_String is a schema; the integer is an index indicating which input or output
stream to apply the schema to. Below is the describeOperator() that is used
inside the MyHelloWorld operator.
APT_Status APT_MyHelloWorldOp::describeOperator(){
setKind(eParallel);
// Set the number of input/output links
setInputDataSets(1);
setOutputDataSets(1);
// Set the schema for the input link
// inCount:int32 requires the first column of the input stream to be of type int32
setInputInterfaceSchema(APT_UString("record (inCount:int32;)"), 0);
// Set up the output link to have two columns: an integer column outCount
// and a string column outHello
setOutputInterfaceSchema(APT_UString("record (outCount:int32; outHello:string;)"), 0);
return APT_StatusOk;
}
The operator implements three more methods:
• initializeFromArgs_()
• serialize()
• setUppercase()
initializeFromArgs_() is called when the operator is first run. It receives a list
of parameters that have been passed into the operator. Here you have to look for
the parameter you are using, "uppercase". To do this, cycle through all the
parameters that have been passed in; if you find the "uppercase" keyword, you then
set its value.
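The scan over the parameter list can be pictured with standard containers. This is a hedged analogue, not the DataStage API: the std::vector of keywords stands in for the real APT_PropertyList, and findUppercase() is an illustrative name of mine, but the pattern of cycling through the parameters and flipping the flag is the same:

```cpp
#include <string>
#include <vector>

// Analogue of the scan inside initializeFromArgs_(): walk the parsed
// parameter list and set the uppercase flag if the keyword was supplied.
bool findUppercase(const std::vector<std::string>& params) {
    bool uppercase = false;              // default, as set in the constructor
    for (size_t i = 0; i < params.size(); i++) {
        if (params[i] == "uppercase") {  // keyword found: set its value
            uppercase = true;
        }
    }
    return uppercase;
}
```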
APT_MyHelloWorldOp::APT_MyHelloWorldOp() : uppercase_(false) {
...
}
After all these steps have finished, you can access uppercase_ like any other local
variable.
APT_Status APT_MyHelloWorldOp::runLocally() {
...
// Allows read-only access to the input dataset; the cursor can only move forward
APT_InputCursor inCur;
setupInputCursor(&inCur, 0);
// allows access to output dataset
APT_OutputCursor outCur;
setupOutputCursor(&outCur, 0);
...
}
After setting up the cursors, you have direct access to the data in the
streams. To get this access you use accessor classes; there is one class per data
type. These accessor classes have most of the basic operators overloaded
(+, =, *, -, /), so you can assign and change their values. A few examples are given
below:
*field1out = *field1in;
*field1out = *field1in * 2;
*field1out = *field1in + 5;
*field2out = APT_String("Hello");
To set up the accessors, you initialize them by referencing the column name and the
input or output cursor that the column is in. The accessors for your input and output
columns and for your operator are shown below.
APT_Status APT_MyHelloWorldOp::runLocally() {
...
APT_InputAccessorToInt32 field1in("inCount",&inCur);
APT_OutputAccessorToInt32 field1out("outCount",&outCur);
APT_OutputAccessorToString field2out("outHello",&outCur);
...
}
Before attempting to access the data behind these accessors, you have to start
accessing the data inside the input streams. You do this by using the input cursor's
getRecord(). This method gets the next record from the input stream and loads all
the record's values into the accessors. You can then begin using the accessors.
Once you have finished with a row on the output accessors, you need to call
putRecord(). This flushes the output accessor's record into the output stream.
Add logic
Finally, by adding the logic behind the runLocally() method, you can finish off
your operator.
First, you should add some logic based on the input parameter "uppercase." Here
you have a simple if statement to decide if you should use an uppercase version of
"Hello World!".
Next, you loop through all the data on the input stream. You can use the input
cursor's getRecord() to exit the loop as it returns a boolean of true if there is still
more data on the link, and false when it has reached the end of the records.
For every record you loop through you output one record into the output stream
using the output stream's putRecord().
APT_Status APT_MyHelloWorldOp::runLocally(){
...
APT_String hello;
if(uppercase_){
hello = APT_String("HELLO WORLD!");
}else{
hello = APT_String("Hello World!");
}
...
}
SET APT_OPERATOR_REGISTRY_PATH=E:\osh\
SET APT_ORCHHOME=E:\IBM\InformationServer\Server\PXEngine
SET APT_CONFIG_FILE=E:\IBM\InformationServer\Server\Configurations\default.apt
SET PATH=%PATH%;E:\IBM\InformationServer\Server\PXEngine\bin;.;
6. The Windows LIB variable needs the PXEngine's lib directory and the
MKS Toolkit's lib directory.
set LIB=%LIB%;E:\IBM\InformationServer\Server\PXEngine\lib;c:\Program Files\MKS Toolkit\lib;
After making these changes, start a command prompt, navigate to your directory,
then run setup.bat. Your output should look like the following:
E:\osh>setup.bat
Setting environment for using Microsoft Visual Studio .NET 2003 tools.
(If you have another version of Visual Studio or Visual C++ installed and wish
to use its tools from the command line, run vcvars32.bat for that version.)
E:\osh>
To test whether everything was set up correctly, type osh into the command window.
The output should look similar to the following:
E:\osh>osh
##I IIS-DSEE-TFCN-00001 20:07:33(000) <main_program>
IBM WebSphere DataStage Enterprise Edition 8.0.0
Copyright IBM Corp. 2001, 2005
E:\osh>
This leaves you with a compiled DataStage operator named myhelloworld.dll in your
base directory.
OSH scripts
An OSH script represents a DataStage job. Instead of displaying it in a graphical
window like DataStage Designer, it is just a text representation of the operators, their
parameters, and the links between operators.
Inside the OSH script, there is a simple format to describe the basic structure of an
operator. The first line is the name of the operator. The following lines start with -.
These are the parameters for the operator. The streams between operators have a
prefix of < or >, based on the direction of the stream. Input streams start with < and
output streams start with >, then all the streams are suffixed with .v. Finally, a ; is
added to signify the end of the description for this operator. Additional operators are
appended after the semicolon.
operatorname
-parameter1
-parameter2 'hello world'
...
-parametern
< 'inputstream.v'
> 'outputstream.v'
;
For your myhelloworld operator, you have a very simple OSH flow: one input, one
myhelloworld operator, and one output.
Figure 1. Flow using sequential files as the input and output, as shown in the
DataStage Designer
The input operator for your example is a file reading operator called import. The
import operator has one parameter that needs to be changed based on your
environment.
Note: This -file parameter must point to the text file located in the input directory.
## Operator Name
import
## Operator options
-schema record
{final_delim=end, delim=',', quote=double}
(
inCount:int32;
)
-file 'e:\\osh\\input\\mhw.txt'
-rejects continue
-reportProgress yes
## Outputs
> 'inputFile.v'
;
The input file required by this operator is a text file that uses double quotes to
surround strings and commas to separate columns; each row is one line. As you only
require one integer column, the file looks like the following:
1
2
3
4
Next in the flow is the myhelloworld operator itself, which reads the import
operator's output stream:
## Operator Name
myhelloworld
## Operator options
-uppercase
##Inputs
< 'inputFile.v'
##Outputs
> 'outputFile.v'
;
The output operator for your example is a file-writing operator called export. The
export operator also has one parameter that needs to be changed: the file parameter
needs to point to a file in the output directory. This file is overwritten every time
the script is called, and it is created if it does not exist at runtime.
## Operator Name
export
## Operator options
-schema record
{final_delim=end, delim=',', quote=double}
(
outCount:int32;
outHello:string;
)
-file 'E:\\osh\\output\\mhw.txt'
-overwrite
-rejects continue
## Inputs
< 'outputFile.v'
;
Given this OSH script and the input file specified above, there are three columns
to look at: the input inCount and the two outputs, outCount and outHello. The
expected column values are:

inCount  outCount  outHello
1        1         HELLO WORLD!
2        2         HELLO WORLD!HELLO WORLD!
3        3         HELLO WORLD!HELLO WORLD!HELLO WORLD!
4        4         HELLO WORLD!HELLO WORLD!HELLO WORLD!HELLO WORLD!
Operator mapping
The operator.apt file tells the PXEngine the mappings between operator names in an
OSH script and the dll or executable located in your Windows PATH variable. Below
is an example operator.apt for your operator.
myhelloworld myhelloworld 1
The first myhelloworld is the operator name that is defined inside the source code
and is used inside an OSH script. The second myhelloworld is the name the
PXEngine is looking for in its PATH search. To find the actual operator, it looks for
executables first (.exe), then looks at dll files, so make sure you don't have a
myhelloworld.exe sitting somewhere in the directories specified by your PATH
variable. The 1 in the third column indicates that this mapping is enabled; if it is
set to 0, the PXEngine ignores the mapping.
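Each operator.apt line is just three whitespace-separated fields, so parsing one is straightforward. The struct and function names below are mine, sketched for illustration under the assumption that the format is exactly "oshname binaryname flag"; they are not part of the PXEngine:

```cpp
#include <sstream>
#include <string>

// One operator.apt mapping: OSH name, binary name, enabled flag.
struct OperatorMapping {
    std::string oshName;     // name used in source code and OSH scripts
    std::string binaryName;  // dll or exe searched for on the PATH
    bool enabled;            // 1 = active, 0 = ignored by the PXEngine
};

// Parse a single line such as "myhelloworld myhelloworld 1".
OperatorMapping parseMapping(const std::string& line) {
    std::istringstream in(line);
    OperatorMapping m;
    int flag = 0;
    in >> m.oshName >> m.binaryName >> flag;
    m.enabled = (flag == 1);
    return m;
}
```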
To run the operator, you pass the OSH script you just created to the PXEngine. You
do this by calling osh.exe -f myhelloworld.osh:
E:\osh>osh -f myhelloworld.osh
##I IIS-DSEE-TFCN-00001 13:27:10(000) <main_program>
IBM WebSphere DataStage Enterprise Edition 8.0.0
Copyright IBM Corp. 2001, 2005
After running the operator, you are left with a file in the output directory
e:\osh\output\mhw.txt. Inside this file, you see a list of comma-separated values. The
first column is outCount and the second column is outHello. The contents of this file
should look like the output below.
E:\osh>cat output/mhw.txt
"1","HELLO WORLD!"
"2","HELLO WORLD!HELLO WORLD!"
"3","HELLO WORLD!HELLO WORLD!HELLO WORLD!"
"4","HELLO WORLD!HELLO WORLD!HELLO WORLD!HELLO WORLD!"
E:\osh>
However, the operator is still not found by DataStage, as there is no mapping from
the dll to the OSH name. To fix this, open up the PXEngine's main operator.apt file,
located in e:\IBM\InformationServer\Server\PXEngine\etc\, then add myhelloworld
myhelloworld 1 to this file.
To load the operator into DataStage Designer, you need the following information:
• DataStage username
• DataStage password
• DataStage server name
• DataStage server port
• DataStage project
1. Start DataStage Designer and log into a project that you are able to use.
2. Once you are inside, right-click any folder in the repository view.
4. Fill in the tabs as shown in the following figures. You do not need to modify
the Mapping tab.
Fill in your information in the appropriate text boxes under the Creator tab.
5. Click OK. This brings up the Save as dialog box. Select an appropriate
place to save (the example uses the processing operator type folder),
then click Save.
Figure 7. Save as
6. You should now see a myhelloworld operator inside the Processing tab of
the palette.
8. Open the two sequential file operators and set the details of the files they
are reading from or writing to.
9. In the myhelloworld operator's properties, set the uppercase option to
True.
10. Inside the myhelloworld operator, go to Input > Columns, and add the
input column. Add a new column with the column name inCount and the
SQL type Integer.
11. Go to Output > Columns, and add the output columns. Add two new
columns one with the column name outCount and the SQL type Integer,
and another with the column name outHello and the SQL type Varchar.
13. Click the Run icon, which opens the Run dialog box, where you click
Run.
14. If all goes well, in a couple of seconds the links should turn green in your
canvas and display the number of records they have processed.
Downloads
Description                Name      Size  Download method
Code for all the examples  code.zip  4KB   HTTP
Resources
Learn
• WebSphere DataStage zone: Get more details, and access resources and
support for DataStage.
• Information Integration zone: Read articles and tutorials and access
documentation, support resources, and more, for the IBM Information
Integration suite of products.
• developerWorks Information Management zone: Learn more about DB2. Find
technical documentation, how-to articles, education, downloads, product
information, and more.
• Stay current with developerWorks technical events and webcasts.
Get products and technologies
• Build your next development project with IBM trial software, available for
download directly from developerWorks.
• Learn more about the MKS Toolkit.
Discuss
• Participate in the discussion forum for this content.
• Participate in developerWorks blogs and get involved in the developerWorks
community.