DataStage Parallel Routines

The document discusses how to create and use parallel routines in DataStage to extend its functionality. Parallel routines allow writing custom code in C/C++ to perform tasks that cannot be done using standard DataStage components. The typical steps involve writing C++ code, compiling it, linking it as a parallel routine, and then calling the routine from within a DataStage job.

Parallel routines in DataStage are C++ components that are built and compiled externally. They must be compiled as C++ without a main function and create an object (.o) or shared library (.so) file. This file is then linked as a parallel routine by specifying details like the routine name, function name, return type etc. The compiled file needs to be placed in the library path specified for parallel routines.

The typical steps are: 1) Create the C++ code 2) Compile it using the compiler options specified for parallel routines 3) Link the compiled file as a parallel routine 4) Include and execute the job calling the routine

DataStage Parallel routines made really easy

Joshy George(Consulting Employee) Posted 12/1/2007



DataStage is a powerful ETL tool with many built-in stages and routines that cover most
requirements. For the things DataStage EE can't do, there are parallel routines, which are
written in C++.
This primer shows how to create a parallel routine in a few minutes, whether or not you are
a C/C++ programmer. To write really good code, though, you may need to learn some C++
programming. "Starting C programming with Linux" is a good place to start.
Before we begin, a few points to note:
Parallel routines are C++ components built and compiled outside DataStage. Note: they
must be compiled as C++ components, not C.
The C++ program must not contain main(). Compile it with the compiler options specified
under APT_COMPILEOPT, which can be found among the Administrator parameters, to
create an object (*.o) or shared object (*.so) file. The result is a runtime library: compiled
code without main(), i.e. not a self-contained executable.
The compiler and compiler options can be found in
DataStage --> Administrator --> Properties --> Environment --> Parallel --> Compiler
Ex: compiler = g++
compiler options = -O -fPIC -Wno-deprecated -c
Compile command syntax:
{compiler} {compiler options} {filename with extension}
Ex: g++ -O -fPIC -Wno-deprecated -c {filename with extension}
In DataStage, parallel routines must reference a function in an object (*.o) or shared object
(*.so) file. If you are creating a shared library, the file name should begin with lib, i.e. lib*.so
To create a shared object/library (*.so) file, you must first have created a *.o (object) file.
Ex: g++ -shared -o lib{filename}.so {filename}.o
Here's the typical sequence of steps for creating a DataStage parallel routine:
Create --> Compile --> Link --> Execute
1) Create
Create a C++ program with main()
Test it and, if it works, remove main()
2) Compile
Compile using the compiler options specified under APT_COMPILEOPT to create an object
(*.o) or shared object (*.so) file. Note: the compiler and compiler options can be found in
"DataStage --> Administrator --> Properties --> Environment --> Parallel --> Compiler".
Put the resulting file in this directory (or in any library path you prefer):

Ex: /datastage/Ascential/DataStage/PXEngine/lib
I usually put it in the "lib" directory. You can locate your "lib" directory from the library
path (LD_LIBRARY_PATH).
3) Link
Link the above object (*.o) or shared object (*.so) to a DataStage parallel routine by making
the relevant entries in the General tab:
Routine Name: {Parallel Routine Name}
Type: External Function
Object Type: Object / Library
External subroutine name: {Function name specified inside your C++ program}
Library Path: {Directory chosen in 2) Compile + object (*.o) or shared object (*.so) file name}
Also specify the Return Type, and if the routine takes input parameters, specify them in the
Arguments tab.
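As an illustration of a routine that does take input parameters, here is a minimal sketch. The function name pxCountChar and both of its arguments are hypothetical and do not come from the original post; the sketch assumes the two arguments would be declared on the Arguments tab with types matching char* and int, and the Return Type set to int.

```cpp
// Hypothetical routine: counts how many times the character code `ch`
// occurs in the string `str`. Returns -1 if `str` is a null pointer.
// There is no main(): the file is meant to be compiled to a .o or .so.
int pxCountChar(char *str, int ch)
{
    if (str == 0)
        return -1;

    int count = 0;
    for (const char *p = str; *p != '\0'; ++p)
    {
        if (*p == static_cast<char>(ch))
            ++count;
    }
    return count;
}
```

The external subroutine name entered in the General tab would then be pxCountChar, exactly as spelled in the source file.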
4) Execute
Now your parallel routine is available inside your job. Include it in the job, then compile and
run the job.
FAQ:
**Why should the shared library file name always begin with lib?
--> The built-in DataStage linker looks for files with this naming convention. More on shared
libraries can be found in this link.
**When I move my code or release it to a different project or environment, where do I put
my parallel routines' object (*.o) or shared object (*.so) files?
--> You have to move your object (*.o) or shared object (*.so) files to the corresponding Library
Path (the one specified while linking the parallel routine) in the new project or environment.
**If I change my C++ program and recompile the object or shared object file, do I also need
to recompile all the DataStage jobs that call these routines?
--> For an "Object", yes: every DataStage job that calls the routine must be recompiled to pick
up the change. For a "shared object" this is not required.
Step by step Example:
Creating an object (*.o)
1) Create a C++ program with main()
Create a text file with a .cpp extension (Ex: OBJTEST.cpp)
Ex:
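#include <stdlib.h>
#include <stdio.h>
int main()
{
    // A string literal is read-only, so point to it through const char*
    const char* OutStr;
    OutStr = "Hello World - Object Testing";
    // Print through a format string rather than passing the text as the format
    printf("%s\n", OutStr);
    return 0;
}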
Test this program
Copy your compiler specification from
"DataStage --> Administrator --> Properties --> Environment --> Parallel --> Compiler"
and compile the created C++ program
Syntax: g++ program.cpp -o program
Ex: g++ OBJTEST.cpp -o OBJTEST
Run it using the command below
Syntax: ./program
Ex: ./OBJTEST
Output --> Hello World - Object Testing
If you get the above output, your program ran successfully.
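Rewrite the program without main()
Ex:
#include <stdlib.h>
#include <stdio.h>
char * ObjTestOne()
{
    // A static buffer stays valid after the function returns and,
    // unlike a string literal, can legally be returned as char*
    static char OutStr[] = "Hello World - Object Testing";
    return OutStr;
}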
2) Compile the program
Get the compiler and compiler options from:
DataStage --> Administrator --> Properties --> Environment --> Parallel --> Compiler
Ex: compiler = g++
compiler options = -O -fPIC -Wno-deprecated -c
Compile command syntax:
{compiler} {compiler options} {filename with extension}
Ex: g++ -O -fPIC -Wno-deprecated -c {filename with extension}
Execute the command below:
g++ -O -fPIC -Wno-deprecated -c OBJTEST.cpp
This will create an object file with a .o extension --> Ex: OBJTEST.o
Move this object file to any library path you prefer:
Ex: /datastage/Ascential/DataStage/PXEngine/lib
I usually put it in the "lib" directory. You can locate your "lib" directory from the library
path (LD_LIBRARY_PATH).
3) Link
Link the above object (*.o) to a DataStage Parallel routine.
In the repository palette, right-click, choose New parallel routine, and make these entries
in the General tab:
Routine Name: {Parallel Routine Name} Ex: OBJECTTEST
Type: External Function
Object Type: Object
External subroutine name: {Function name specified inside your C++ program}
Ex: ObjTestOne (Remember? This is the function that replaced main(), i.e. char *
ObjTestOne() )
Library Path: {Directory chosen in the Compile section + object (*.o) file name}
Ex: /datastage/Ascential/DataStage/PXEngine/lib/OBJTEST.o
Return Type: char*
Note: As there are no input parameters to pass, we make no entries in the
Arguments tab.
Now save and close the window.
4) Execute
Create a test job and call this parallel routine inside it.
Ex: Row Generator --> Transformer --> Sequential File
In the Transformer, call the routine in an output column derivation. Compile and run the job.

Generate multiple output files for a single output stage

Hi All,
Issue: There is a single-column (name) table with ten records. The requirement is to produce
a separate output file for each record, with the column value as the file name.
For example, with name values (1, 2, 3, 4, 5) the output must be 1.txt, 2.txt, 3.txt, 4.txt, 5.txt
How can this be achieved in a single run?


_________________
Regards
Naveen Kandukuri

sequence.
User variables activity to convert the list to delimited list.
Start Loop activity using user variable to control list.
It's been explained before. Search.

A parallel (C++) routine which creates files dynamically will be the best way to do this 'in a
single run'.
_________________
Joshy George

pxCreateFile: Parallel routine (C++) which creates files dynamically.


Code:
// Creates and writes on a text file for each record
// Pass the FileName with path as first parameter Ex: <Path>:LinkName.InputColumn:'.txt'
// Pass the Content of the file as second parameter Ex: LinkName.InputColumn
// Px routine call --> pxCreateFile(<Path>:LinkName.InputColumn:'.txt', LinkName.InputColumn)
// If file creation is success 0 is returned, else 1. Use this status for
// report/tracking/print to file/print to peek
// If file already exists, overwrite happens
// Pass '' (empty) as second Parameter if you don't want anything to be printed in the file

#include <iostream>
#include <fstream>
using namespace std;
int pxCreateFile(char *FileName, char *Msg)
{
    int StatusFlag = 0;
    ofstream myfile(FileName);
    if (myfile.is_open())
    {
        myfile << Msg;
        // Comment this line if a new line char is not required at the end of the file by default
        myfile << "\n";
        myfile.close();
    }
    else
    {
        StatusFlag = 1;
    }
    return StatusFlag;
}
_________________
Joshy George
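
Because pxCreateFile overwrites an existing file, two records with the same name value would leave only the last one on disk. If that matters, an appending variant is one option. The sketch below is an assumption-laden illustration, not code from the thread: the name pxAppendFile is hypothetical, and it has not been tested inside a DataStage job, where rows from different partitions writing to the same file could interleave.

```cpp
#include <fstream>

// Hypothetical variant of pxCreateFile: appends Msg as a new line
// instead of overwriting the file. Returns 0 on success, 1 on failure.
int pxAppendFile(char *FileName, char *Msg)
{
    // Treat missing arguments as a failure rather than dereferencing them
    if (FileName == 0 || Msg == 0)
        return 1;

    // std::ios::app positions every write at the end of the file
    std::ofstream myfile(FileName, std::ios::out | std::ios::app);
    if (!myfile.is_open())
        return 1;

    myfile << Msg << "\n";
    myfile.close();

    // fail() reports a write or close error on the stream
    return myfile.fail() ? 1 : 0;
}
```

It would be called from a Transformer derivation in the same way as pxCreateFile, with the file name as the first argument and the content as the second.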
