DataStage Stages 12-Dec-2013 12PM


What is an Environment Variable?

DataStage provides default variables during installation that can be used throughout the project. These are
called environment variables.

These variables are set at the project level. To access them, log in to the DataStage Administrator
client, select the project on the Projects tab, and go to Properties -> General tab -> Environment.

What is configuration file?


The DataStage configuration file is a master control file (a text file that sits on the server side) for jobs.
It describes the parallel system resources and architecture. The configuration file provides the hardware
configuration for supporting architectures such as SMP (a single machine with multiple CPUs, shared memory
and disk), Grid, Cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage
understands the architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing configuration, servers or
platform, your job designs are not affected, because all jobs depend on this configuration file for execution.
DataStage jobs determine which node to run a process on, where to store temporary data and where to store
data set data based on the entries provided in the configuration file. A default configuration file is
available whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the configuration file is that it separates
software and hardware configuration from job design, so hardware and software resources can be changed without
changing a job design. DataStage jobs can point to different configuration files by using job parameters, which means
that a job can use different hardware architectures without being recompiled.
The configuration file lists the processing nodes and specifies the disk space provided for each one. These are
logical processing nodes, so having more than one CPU does not mean the nodes in your configuration file
correspond to these CPUs. It is possible to have more than one logical node on a single physical node. However,
you should be careful when configuring the number of logical nodes on a single physical node. Increasing the
number of nodes increases the degree of parallelism, but it does not necessarily mean better performance, because
it results in more processes. If your underlying system does not have the capability to handle these loads, you
will have a very inefficient configuration on your hands.

Node Structure:
{
    node "node1"
    {
        fastname "dc4c37"
        pools ""
        resource disk "/dstage/dsdata/pxdataset" {pools ""}
        resource disk "/dstage/dsdata/pxfileset" {pools "export"}
        resource scratchdisk "/dstage/dstemp/dsscratch" {pools ""}
    }

    node "node2"
    {
        fastname "dc4c47"
        pools ""
        resource disk "/dstage/dsdata/pxdataset" {pools ""}
        resource disk "/dstage/dsdata/pxfileset" {pools "export"}
        resource scratchdisk "/dstage/dstemp/dsscratch" {pools ""}
    }
}
Here fastname is the server name. If the fastname is different for each node, the job runs in MPP mode; otherwise it
runs in SMP mode. The scratch disk is temporary storage for the data. The resource disk is where data set and file
set data is stored.

Schema File:

You can also specify the metadata for a stage in a plain text file known as a schema file. This is
not stored in the Repository but you could, for example, keep it in a document management or
source code control system, or publish it on an intranet site.

//Schema File is used to read Input data without specifying metadata in the Sequential File
//stage
//Created On : 11/17/2010
//Created By : Pavan Kumar Reddy
record
{final_delim=end,delim=none}
(
CUSTOMER_SSN: NULLABLE STRING[11];
CUSTOMER_NAME:STRING[30];
CUSTOMER_CITY:STRING[40];
CUSTOMER_ZIPCODE:STRING[10];
)
The format of each line describing a column is:

column_name:[nullability]datatype;

column_name. This is the name that identifies the column. Names must start with a
letter or an underscore (_), and can contain only alphanumeric or underscore characters.
The name is not case sensitive. The name can be of any length.
nullability. You can optionally specify whether a column is allowed to contain a null
value, or whether this would be viewed as invalid. If the column can be null, insert the
word 'nullable'. By default columns are not nullable.
You can also include 'nullable' at record level to specify that all columns are nullable,
then override the setting for individual columns by specifying 'not nullable'. For
example:
record nullable (
name:not nullable string[255];
value1:int32;
date:date)
datatype. This is the data type of the column. This uses the internal data types, see Data
Types, not the SQL data types as used on Columns tabs in stage editors.

Remember that you should turn runtime column propagation on if you intend to use schema
files to define column metadata.

Pipeline Parallelism: Instead of waiting for all source data to be read, DataStage passes rows
to the subsequent stages as soon as the source data stream starts. This method is called
pipeline parallelism.

Pipeline parallelism eliminates the need for intermediate storage on disk.
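
The streaming idea can be sketched with Python generators, where each stage consumes a row as soon as the previous stage yields it and nothing is written to disk in between. This is only an illustration: the stage names (read_source, transform, write_target) are hypothetical, and the code is single-threaded, unlike a real DataStage job.

```python
def read_source(rows):
    """Source stage: yields rows one at a time instead of returning them all."""
    for row in rows:
        yield row


def transform(rows):
    """Processing stage: handles each row as soon as it arrives."""
    for row in rows:
        yield row.upper()


def write_target(rows):
    """Target stage: collects the rows it receives."""
    return list(rows)


# No stage waits for the previous one to finish; rows flow through one by one.
result = write_target(transform(read_source(["a", "b", "c"])))
print(result)
```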

Partition Parallelism: The incoming stream of data is divided into subsets (partitions).
In other words, partition parallelism is the technique of distributing the records across the
nodes based on different partition techniques.

When large volumes of data are involved, you can use the power of partition parallelism to your
best advantage by partitioning the data into a number of separate sets, with each partition
being handled by a separate instance of the job stages.

Partition Techniques:
Partition techniques are used for performance tuning of jobs.

There are two types of partition techniques.

1. Keyless: Rows are distributed independently of data values.

a) Same
b) Round Robin
c) Random
d) Entire

2. Keyed: Rows are distributed based on values in specified keys.

a) Hash
b) Modulus
c) Range
d) DB2

Round Robin:

Rows are distributed evenly among partitions.
The first record goes to the first node, the second to the second node, and so on.
The round robin method always creates approximately equal-sized partitions.
This method is normally used when DataStage initially partitions data.
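
A minimal Python sketch of round robin distribution, assuming rows arrive as a simple list and partitions are numbered from 0. The function name round_robin_partition is hypothetical and is not a DataStage API.

```python
def round_robin_partition(rows, num_partitions):
    """Send row 0 to partition 0, row 1 to partition 1, and so on, wrapping around."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


parts = round_robin_partition(list(range(1, 11)), 3)
print(parts)  # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
```

Partition sizes never differ by more than one row, which is why the method gives approximately equal-sized partitions.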
Random:

Records are randomly distributed across all the nodes


Like RoundRobin, Random partition can rebalance the partitions of an input set to
guarantee that each processing node receives approximately equal sized partitions.
Same:

The stage using the data set as input performs no repartitioning and takes as input the
partitions output by the preceding stage.
With this partitioning method, records stay on the same processing node, i.e. they are not
redistributed.
Entire:

Every processing node receives the complete data set as input.
It is useful when you want the benefits of parallel execution but every instance of the
stage needs access to the entire input data set.
You are most likely to use this partitioning method with stages that create lookup tables from
their input.
Hash Partition:

Rows with the same key column values go to the same partition.
Partitioning is based on one or more columns in each record.
This method is useful for ensuring that related records are in the same partition.
It can become a bottleneck when key values are unevenly distributed, because some nodes are
required to process more records than others.
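
The following Python sketch illustrates both properties: equal keys always land together, and skewed keys produce uneven partitions. The use of zlib.crc32 as the hash function is an assumption made for a reproducible example; the source does not say which hash DataStage uses, and hash_partition is a hypothetical name.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition derived from a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is only a stand-in hash chosen because it is deterministic.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[h % num_partitions].append(row)
    return partitions


# Skewed input: most rows share the key value "IT".
rows = [{"dept": "IT"}] * 6 + [{"dept": "BS"}] * 2 + [{"dept": "HR"}]
parts = hash_partition(rows, "dept", 3)
sizes = [len(p) for p in parts]
print(sizes)
```

Whatever the hash, the partition holding "IT" receives at least six of the nine rows, so the node processing it does most of the work.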
Modulus:

Partitioning is based on a key column modulo the number of partitions. This method is
similar to hash by field but involves simpler computation.
Modulus partitioning assigns each record of an input set to a partition of its output
set as determined by a specified key field in the input set.
The partition number of each record is calculated as follows:
o partition_number = fieldname mod number_of_partitions
o For example, with a column value of 20 and 4 nodes (partitions 0, 1, 2, 3): 20 mod 4 = 0, so the record goes to partition 0.
Here fieldname should be a numeric field of the input set.
number_of_partitions is the number of processing nodes on which the partitioner
executes. If a partitioner executes on three processing nodes, it has three
partitions.

Suppose the input set has:

Deptno  15  20  22  9   44  16  25  33  30  10
Dname   A   B   C   D   E   F   G   H   I   J

Suppose we use three partitions:

Partition 0   Partition 1   Partition 2
15 A          22 C          20 B
9 D           16 F          44 E
33 H          25 G
30 I          10 J
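
The Deptno example can be reproduced with a short Python sketch. The function name modulus_partition is hypothetical; only the formula partition_number = fieldname mod number_of_partitions comes from the text.

```python
def modulus_partition(rows, key_index, num_partitions):
    """Place each row in partition (numeric key value mod number of partitions)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key_index] % num_partitions].append(row)
    return partitions


deptno = [15, 20, 22, 9, 44, 16, 25, 33, 30, 10]
dname = list("ABCDEFGHIJ")
parts = modulus_partition(list(zip(deptno, dname)), key_index=0, num_partitions=3)
for number, rows in enumerate(parts):
    print("Partition", number, rows)
```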

Range Partition:

Similar to hash, but the partition mapping is user determined and the partitions are ordered.
Divides the input data set into approximately equal-sized partitions, each of which contains
records with key columns within a specified range.
This method is also useful for ensuring that related records are in the same partition.
A range partition divides a data set into approximately equal-sized partitions based on
one or more partition keys.
All partitions are of approximately the same size. In an ideal distribution, every partition would
be exactly the same size.

DB2 Partition:

To use DB2 partitioning on a stage, select a partition type of DB2 on the partitioning tab
then click the properties button on the right.
In the partitioning/collection properties dialog box specify the details of the DB2 table
whose partitioning you want to replicate

Auto Partition:

The most common method you will see on DataStage stages is Auto.

This just means that DataStage determines the best partitioning method to use depending on
the type of stage.

Typically DataStage uses Round Robin when initially partitioning data, and Same for the
intermediate stages of a job.

Collecting Methods (Parallel to sequential):

Collecting is the process of joining the multiple partitions of a data set back together again
into a single partition.
There are various situations where you may want to do this.
Collecting applies only when data flows from a parallel stage to a sequential stage.
At the end of the job you may want to write all the data to a single database, in which
case you need to collect it before you write it.
There might be other cases where you do not want to collect the data at all. For example, you
may want to write each partition to a separate flat file.
The collecting methods are:
o Auto
o Round Robin
o Ordered
o Sort Merge

Auto:

The most common method you will see in the parallel stage is auto

This means that Data stage will read any row from any input partition as it becomes available.

This is the fastest collecting method.

Round Robin:

Reads a record from the first partition, then from the second partition, and so on.
Slower than Auto. Rarely used.
Ordered Collection:

Reads all records from the first partition, then all the records from the second partition, and so
on.
This collection method preserves the order of a totally sorted input data set.
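
The difference between the Round Robin and Ordered collectors can be sketched in Python as follows. Both function names are hypothetical, and partitions are modelled as plain lists rather than live data streams.

```python
from itertools import zip_longest

_MISSING = object()  # marks the end of a partition that ran out of rows


def round_robin_collect(partitions):
    """Take one row from each partition in turn until all are exhausted."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


def ordered_collect(partitions):
    """Take every row from partition 0, then every row from partition 1, and so on."""
    return [row for partition in partitions for row in partition]


parts = [[1, 2, 3], [4, 5], [6, 7, 8]]
print(round_robin_collect(parts))  # [1, 4, 6, 2, 5, 7, 3, 8]
print(ordered_collect(parts))      # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the input here is totally sorted across partitions, only the ordered collector returns it in sorted order.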
Sequential file stage
This is one of the file stages. It allows you to read data from one or more flat files or write data to one or
more flat files. The stage can have a single input link or a single output link, and a single rejects link.

If you are using it as a source, it can have 1 output link and 1 reject link. If you are using it as a target,
it can have 1 input link and 1 reject link.

The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one
file. By default a complete file will be read by a single node (although each node might read more than one file).
For fixed-width files, however, you can configure the stage to behave differently:

You can specify that single files can be read by multiple nodes. This can improve performance on
cluster systems. See "Read From Multiple Nodes"
You can specify that a number of readers run on a single node. This means, for example, that a single
file can be partitioned as it is read (even though the stage is constrained to running sequentially on the
conductor node). See "Number Of Readers Per Node".
(These two options are mutually exclusive.)

The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each
node writes to a single file, but a node can write more than one file.

When reading or writing a flat file, InfoSphere DataStage needs to know something about the format of the
file. The information required is how the file is divided into rows and how rows are divided into columns. You
specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the
Edit Column Metadata dialog box.

The stage editor has up to three pages, depending on whether you are reading or writing a file:

Stage Page. This is always present and is used to specify general information about the stage.
Input Page. This is present when you are writing to a flat file. This is where you specify details about
the file or files being written to.
Output Page. This is present when you are reading from a flat file or have a reject link. This is where
you specify details about the file or files being read from.

Sequential File stage: Options category

First Line is Column Names


Specifies that the first line of the file contains column names. This property is false by default.

Missing file mode


Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from
Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a
node name prefix of *: in which case it is OK. The default is Depends.

Keep file partitions


Set this to True to partition the imported data set according to the organization of the input file(s). So, for
example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode
Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue
to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, Output to
send rejected rows down a reject link. Defaults to Continue.

Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10%
interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed
length, and there is no filter on the file.

Filter
This is an optional property. You can use this to specify that the data is passed through a filter program after
being read from the files. Specify the filter command, and any required arguments, in the Property Value box.
File name column
This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the
pathname of the file the record is read from. You should also add this column manually to the Columns
definitions to ensure that the column is not dropped if you are not using runtime column propagation, or it is
turned off at some point.

Read first rows


Specify a number n so that the stage only reads the first n rows from the file.

Row number column


This is an optional property. It adds an extra column of type unsigned BigInt to the output of the stage,
containing the row number. You must also add the column to the columns tab, unless runtime column
propagation is enabled.

Number Of readers per node


This is an optional property and only applies to files containing fixed-length records. It is mutually exclusive with
the Read from multiple nodes property. Specifies the number of instances of the file read operator on a
processing node. The default is one operator per node per input data file. If numReaders is greater than one,
each instance of the file read operator reads a contiguous range of records from the input file. The starting
record location in the file for each operator, or seek location, is determined by the data file size, the record
length, and the number of instances of the operator, as specified by numReaders.

The resulting data set contains one partition per instance of the file read operator, as determined by
numReaders.

This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file
can be divided according to the number of readers per node, and written to separate partitions. This method
can result in better I/O performance on an SMP system.
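
How the seek location for each reader could follow from the file size, record length and number of readers is sketched below in Python. The even split used here is an assumption for illustration; the source does not give the exact rule DataStage applies, and reader_ranges is a hypothetical name.

```python
def reader_ranges(file_size, record_length, num_readers):
    """Return (seek_offset_in_bytes, record_count) for each reader instance."""
    total_records = file_size // record_length
    base, extra = divmod(total_records, num_readers)
    ranges = []
    start_record = 0
    for reader in range(num_readers):
        # Assumed rule: the first `extra` readers take one additional record.
        count = base + (1 if reader < extra else 0)
        ranges.append((start_record * record_length, count))
        start_record += count
    return ranges


# A 1000-byte file of 100-byte records read by 3 readers.
ranges = reader_ranges(file_size=1000, record_length=100, num_readers=3)
print(ranges)  # [(0, 4), (400, 3), (700, 3)]
```

Each reader gets a contiguous range of records, and the resulting data set has one partition per reader.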

Read from multiple nodes


This is an optional property and only applies to files containing fixed-length records. It is mutually exclusive with
the Number of Readers Per Node property. Set this to Yes to allow individual files to be read by several nodes.
This can improve performance on a cluster system.
InfoSphere DataStage knows the number of nodes available, and using the fixed length record size, and the
actual size of the file to be read, allocates the reader on each node a separate region within the file to process.
The regions will be of roughly equal size.

Schema file
This is an optional property. By default the Sequential File stage will use the column definitions defined on the
Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a
schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these
match the schema file). Type in a pathname or browse for a schema file.

Stages that can use a schema file are:

Sequential File
File Set
External Source
External Target
Column Import
Column Export

Limitations of Sequential File

The Sequential File stage has some limitations. In those cases we use a Data Set instead.

The limitations are:

a) Memory limit: It does not support more than 2 GB of data.

b) Sequential mode: By default it runs in sequential mode.

c) Conversion problem: Data needs to be converted from ASCII to native format from stage to stage.

d) It stores the data outside the repository.

e) By default it does not support null values.

Questions:
1. We have a list of .txt files and are asked to read only the first 3 files using the Sequential File stage.
How will you do it?

Ans: 1. In the Sequential File stage, set Read Method to Specific File(s).
2. In the File text field, enter the command below.
`ls /* | head -<Number of files to be read>`

Data Set:

The Data Set stage is a file stage used for staging data when we design dependent jobs.
It supports 1 input link or 1 output link, and there are no reject links.

The extension of a data set is .ds

A Data Set overcomes the limitations of the Sequential File stage:
By default a Data Set is processed in parallel mode.
A Data Set stores the data in native format.
A Data Set stores the data inside the repository (i.e. inside DataStage).
It supports more than 2 GB of data.

There are two types of Data Sets:

1) Virtual
2) Persistent

A virtual data set is the data that moves along a link between stages.
A persistent data set is the data loaded into the target.

Alias names of Data Sets are

1) Orchestrate File
2) Operating System file

A Data Set consists of multiple files:

a) Descriptor file
b) Data file
c) Control file
d) Header files

In the descriptor file, we can see the schema details and the address of the data.
In the data file, we can see the data in native format.
The control and header files reside in the operating system and act as an interface between the descriptor and
data files.

We can organize the data by using the Data Set utilities:

GUI (Data Set Management) in a Windows environment
CMD (orchadmin) in a Unix environment

How to delete dataset in datastage parallel jobs

Tools > Dataset Management > a small window pops up > select the dataset file which was
created > again a small window pops up showing you the node names and the segments >
select each one and hit the delete button on the top > it's gone!

File set stage


The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a
single input link, a single output link, and a single rejects link. It only executes in parallel mode.

Extension of the file set is .fs

Lookup file set stage


The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup.
The stage can have a single input link or a single output link. The output link must be a reference link. The
stage can be configured to execute in parallel or sequential mode when used with an input link.

Complex Flat File stage


The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you
cannot use the same stage to do both.

As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one
or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from
files that contain multiple record types. The source data can contain one or more of the following clauses:

GROUP

REDEFINES

OCCURS

OCCURS DEPENDING ON

CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the
stage to run sequentially if it is reading only one file with a single reader.

As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or
more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.
Figure 1. This job has a Complex Flat File source stage with a single reject link, and a Complex Flat File target
stage with a single reject link.

Link Sort:

This is a simple sort. We can use it on the input link of any stage. Sorting is done
using the key column.

Nav: Open stage properties -> Partitioning tab -> select a partitioning technique, which
enables the Perform Sort option. Ticking Perform Sort gives the data in sorted order only.
If you want unique data, you need to use the other two options, Stable and Unique.
Stable keeps rows with equal keys in their original order, and Unique removes duplicates,
so the data comes to the output in sorted order with duplicates removed.

Sort Stage:

This is one of the processing stages. It can have 1 input link and 1 output link.

Properties of the Sort stage:

Key: The key column on which you want to perform the sort.

Sort Key Mode: Sort, Don't Sort (Previously Sorted), Don't Sort (Previously Grouped)

Create Key Change Column: When set to True, the stage adds a column that is 1 for the first
record of each key group and 0 for the remaining records of the group.

Create Cluster Key Change Column: When set to True, the stage adds a column that is 1 for the
first record of each group and 0 for the remaining records, where the groups are defined by keys
whose Sort Key Mode is Don't Sort (Previously Sorted) or Don't Sort (Previously Grouped).
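
The key change flag can be sketched in Python as follows, assuming the rows are already sorted on the key. The function name add_key_change and the output column name keyChange are used here for illustration only.

```python
def add_key_change(sorted_rows, key):
    """Flag the first row of each key group with 1 and the rest with 0."""
    flagged = []
    previous = object()  # sentinel that is not equal to any real key value
    for row in sorted_rows:
        flag = 1 if row[key] != previous else 0
        flagged.append({**row, "keyChange": flag})
        previous = row[key]
    return flagged


rows = [{"dept": d} for d in ["A", "A", "B", "C", "C"]]
flags = [r["keyChange"] for r in add_key_change(rows, "dept")]
print(flags)  # [1, 0, 1, 1, 0]
```

Filtering on the flag downstream is one way to separate the first record of each group from its duplicates.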

Remove Duplicate Stage:

The Remove Duplicates stage is a processing stage. It can have a single input link and a
single output link.
The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate
rows, and writes the results to an output data set.

Removing duplicate records is a common way of cleansing a data set before you perform
further processing. Two rows are considered duplicates if they are adjacent in the input data
set and have identical values for the key column(s). A key column is any column you
designate to be used in determining whether two rows are identical.

The data set input to the Remove Duplicates stage must be sorted so that all records with
identical key values are adjacent. You can either achieve this using the in-stage sort facilities
available on the Input page Partitioning tab, or have an explicit Sort stage feeding the
Remove Duplicates stage.
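
The adjacency rule can be sketched in Python: only neighbouring rows are compared, so the input must be sorted on the key first. The function name remove_duplicates, the retain parameter and the sample rows are hypothetical and only mirror the behaviour described above.

```python
def remove_duplicates(sorted_rows, key_index=0, retain="first"):
    """Drop adjacent rows with the same key, keeping the first or the last of each run."""
    result = []
    for row in sorted_rows:
        if result and result[-1][key_index] == row[key_index]:
            if retain == "last":
                result[-1] = row  # overwrite so the last duplicate survives
        else:
            result.append(row)
    return result


customers = [
    ("C003", "third customer"),
    ("C001", "first customer"),
    ("C003", "third customer, later entry"),
    ("C002", "second customer"),
]
# sorted() puts rows with the same customer number next to each other.
deduped_first = remove_duplicates(sorted(customers))
deduped_last = remove_duplicates(sorted(customers), retain="last")
print(deduped_first)
print(deduped_last)
```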

Difference b/w link sort and Sort stage:

1. Link sort does not have the Don't Sort option, whereas the Sort stage has it.

2. The Sort stage has the cluster key change column, but link sort does not.

3. We cannot capture the duplicate records with link sort, whereas we can capture them
by using the Sort stage.

4. Link sort requires a partitioning technique to be selected, whereas for the Sort stage this is not mandatory.

Difference between link sort and Remove Duplicates?

1. Using link sort we can keep only the first duplicate record, but using the Remove Duplicates stage we can keep
either the first or the last record.

Example
In the example the data is a list of GlobalCo's customers. The data contains some duplicate entries, and you
want to remove these.

The first step is to sort the data so that the duplicates are actually next to each other. As with all sorting
operations, there are implications around data partitions if you run the job in parallel (see "Copy Stage," for a
discussion of these). You should hash partition the data using the sort keys as hash keys in order to guarantee
that duplicate rows are in the same partition. In the example you sort on the CUSTOMER_NUMBER column
and the sample of the sorted data shows up some duplicates:

"GC29834","AQUILUS CONDOS","917 N FIRST ST","8/29/1996 "
"GC29835","MINERAL COUNTRY","2991 TELLER CT","8/29/1996 "
"GC29836","ABC WATERWORKS","PO BOX 14093","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29838","ENCNG WASHINGTON","1305 JOHN SMALL AV","8/30/1996 "
"GC29839","DYNEGY FACILITIES","1000 LOUISIANA STE 5800","8/30/1996 "
"GC29840","LITTLE HAITI GATEWAY","NE 2ND AVENUE AND NE 62ND STRE","8/30/1996 "

Next, you set up the Remove Duplicates stage to remove rows that share the same values in the
CUSTOMER_NUMBER column. The stage will retain the first of the duplicate records:

Figure 1. Property settings

Here is a sample of the data after the job has been run and the duplicates removed:

"GC29834","AQUILUS CONDOS","917 N FIRST ST","8/29/1996 "
"GC29835","MINERAL COUNTRY","2991 TELLER CT","8/29/1996 "
"GC29836","ABC WATERWORKS","PO BOX 14093","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29838","ENCNG WASHINGTON","1305 JOHN SMALL AV","8/30/1996 "
"GC29839","DYNEGY FACILITIES","1000 LOUISIANA STE 5800","8/30/1996 "
"GC29840","LITTLE HAITI GATEWAY","NE 2ND AVENUE AND NE 62ND STRE","8/30/1996 "

Join Stage: This is one of the processing stages. It supports N inputs and 1 output, and no
reject links.
The first input is called Left, the last one is Right, and the remaining ones are
Intermediate.

A full outer join supports only 2 inputs and 1 output (and no reject link). The input records
should be sorted on the key. The key column name and data type should be the same on all inputs.

Join types: Inner join, Left outer join, Right outer join, Full outer join.

Memory usage is light. The Lookup stage stores its reference data in buffer memory, whereas the
Join stage works on sorted inputs, so it takes less memory to perform the join.

With an inner join, the output contains only the left table records that match records in
all the intermediate and right tables.

With a left outer join, the output contains all the left table records, together with any
matching data from the intermediate and right tables. Unmatched columns are populated with null values.

With a right outer join, the output contains all the right table records, together with any
matching data from the left and intermediate tables. Unmatched columns are populated with null values.

With a full outer join (two tables only), the output contains the matched and
unmatched data from both tables. Unmatched columns are populated with null values.
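
The four join types can be compared with a small Python sketch for two inputs. It assumes each key value occurs at most once per input and uses None for the null values; the function name join and the sample columns are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; unmatched columns are filled with None."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    all_columns = {c for row in left + right for c in row}
    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    elif how == "full":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError("unknown join type: " + how)
    result = []
    for k in keys:
        row = dict.fromkeys(all_columns)  # every column starts as None (null)
        row.update(left_by_key.get(k, {}))
        row.update(right_by_key.get(k, {}))
        result.append(row)
    return result


left = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
right = [{"id": 2, "sal": 200}, {"id": 3, "sal": 300}]
for how in ("inner", "left", "right", "full"):
    print(how, join(left, right, "id", how))
```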

To join tables using the Join stage we need common key columns in those tables. But
sometimes we get data without a common key column.

In that case we can use the Column Generator stage to create a common column in both tables.

You can design the job as follows:

Read and load the data from the Sequential Files.

Go to the Column Generator stage to create the column and sample data.

In Properties, select the name of the column to create, and drag and drop the columns into the target.

Now go to the Join stage and select the key column which we have created (you can give any name; based
on the business requirement, give an understandable name).

In Output, drag and drop all required columns and give a file name for the target file. Then compile and run the
job.

The Join stage doesn't support a reject link. How can I capture the rejected data?

Join the multiple inputs (Left, Right, Intermediate) with a left outer join -> for unmatched records, the resulting
data has null values in the columns that come from the right and intermediate tables -> use a Filter stage
to capture the rows with null values.
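
The left outer join plus filter approach can be sketched in Python as follows. It assumes a single right-hand column and that a real match never carries a null in that column, otherwise matched rows would be mistaken for rejects. All names are hypothetical.

```python
def left_outer_join(left, right, key, right_column):
    """Keep every left row; fill right_column with None when there is no match."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        value = match[right_column] if match is not None else None
        joined.append({**row, right_column: value})
    return joined


left = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
right = [{"id": 2, "sal": 200}]

joined = left_outer_join(left, right, "id", "sal")
# The filter step: rows with a null right-hand column are the "rejects".
matched = [row for row in joined if row["sal"] is not None]
rejects = [row for row in joined if row["sal"] is None]
print(matched)
print(rejects)
```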

Lookup Stage: This is one of the processing stages. It supports N inputs, 1 output and
1 reject link. The first input is the primary link and the remaining inputs are reference links. The
lookup key columns need not have the same names in the primary and reference links. A
sparse lookup allows only two inputs.

Data on the reject link comes from the primary source. If there are duplicate values in a
reference table, the stage throws a warning and ignores the duplicates.

Memory usage is heavy, because all the reference data is first stored in buffer memory
and then the lookup is performed.

Join types: Inner (Drop) and Left outer join (Continue)

If you are using a reject link, a lookup failure is treated as an inner join (Drop) on the output link.
If we are not writing a condition in the lookup constraints, there is no need to give a value for
Condition Not Met.

Lookup Failure: Continue (left outer join), Drop (inner), Fail (the job aborts if there is unmatched
data on the primary link), Reject (inner)

Lookup types: Range, Equality and Caseless. These types are set on the reference link in the
Lookup stage.

Range: this takes range values (1-10).

Equality: this works on the equals operation (exact match).

Caseless: this ignores the case of values.

Sparse: This works only for tables, and only for two tables. It is set on the
source reference table. The key columns should be the same in a sparse lookup.

Difference between normal lookup and sparse lookup?

In a normal lookup, the entire reference data is first loaded into buffer memory and then
the lookup against the primary data starts, whereas a sparse lookup does not create a special
buffer in memory.

The lookup stage in DataStage 8 is an enhanced version of what was present in earlier
DataStage releases. This article is going to take a deep dive into the new lookup stage and
the various options it offers. Even though the lookup stage can't be used in cases where
huge amounts of data are involved (since it requires data to be present in memory for
operations), it still warrants its own place in job designs. This is because the lookup stage
offers a bit more than the other conventional lookup stages like join and merge.

Let's look at the example shown below.

Source

Emp ID  EmpName  Dept
1001    AABB     IT
1002    BBCC     IT
1003    BBDD     BS

Reference

Emp ID  Salary  Dept  Quarter
1001    2000    IT    Q1
1001    3000    IT    Q2
1001    4000    IT    Q3

Now if you use the lookup stage with Emp ID as the key, the output would be as
below:

Emp ID  Salary  Dept  EmpName  Quarter
1001    2000    IT    AABB     Q1

But if you have a closer look at the data, you can see that the reference table actually has
three records for that ID. However, the lookup stage retrieved only one record.
If you need to retrieve all 3 records for that ID, then you will have to:

Go to the constraints page of the lookup stage

Go to the tab Multiple rows returned from link

Select the reference link

This will modify your output as below:

Emp ID  Salary  Dept  EmpName  Quarter
1001    2000    IT    AABB     Q1
1001    3000    IT    AABB     Q2
1001    4000    IT    AABB     Q3


A point to be noted is that only one reference link in the lookup stage can return multiple
rows. This can't be done for more than one reference link and can only be done for
in-memory lookups.
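
The effect of the multiple rows setting can be sketched in Python with the Emp ID data from the example. The function name lookup and the multiple_rows parameter are hypothetical, and source rows without a match are simply dropped in this sketch.

```python
from collections import defaultdict


def lookup(source, reference, key, multiple_rows=False):
    """Match each source row against in-memory reference rows on `key`."""
    ref_by_key = defaultdict(list)
    for ref in reference:
        ref_by_key[ref[key]].append(ref)
    output = []
    for row in source:
        matches = ref_by_key.get(row[key], [])
        if not multiple_rows:
            matches = matches[:1]  # only the first reference row is returned
        for ref in matches:
            output.append({**row, **ref})
    return output


source = [
    {"EmpID": 1001, "EmpName": "AABB", "Dept": "IT"},
    {"EmpID": 1002, "EmpName": "BBCC", "Dept": "IT"},
    {"EmpID": 1003, "EmpName": "BBDD", "Dept": "BS"},
]
reference = [
    {"EmpID": 1001, "Salary": 2000, "Dept": "IT", "Quarter": "Q1"},
    {"EmpID": 1001, "Salary": 3000, "Dept": "IT", "Quarter": "Q2"},
    {"EmpID": 1001, "Salary": 4000, "Dept": "IT", "Quarter": "Q3"},
]
single = lookup(source, reference, "EmpID")
multiple = lookup(source, reference, "EmpID", multiple_rows=True)
print(len(single), len(multiple))  # 1 3
```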

There are a host of other options also available on the constraints page.

In addition to the lookup, the stage also gives us the option of checking if the data satisfies a
particular condition like Salary > 2000, etc. All such additional conditions that you want to
check can be done in this area. How the job behaves during a lookup is determined by the
Condition Not Met or Lookup Failure option. The four options available for this tab are
Continue, Drop, Reject and Fail. Condition Not Met option will be applicable if you provide a
condition check. If you do not provide such a check then the values in the Condition Not
Met option will not make a difference.

The Continue option allows the job to continue without failing, and the retrieved
reference value is populated as NULL. If the value is specified as Drop, then the
records are dropped from the data set if the lookup/condition has failed. If the option is
specified as Reject, then all records that failed the lookup go to the reject link. You must
remember to provide a reject link to the lookup stage if this option is set; otherwise the
job will fail. If you specify the value as Fail, then the job moves to the aborted state
whenever a lookup fails against the reference dataset.
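The four failure actions can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `on_failure` parameter and the `ref_columns` argument are hypothetical names for this sketch, and the reference data is modelled as a plain dictionary keyed on the lookup column.

```python
def lookup_with_failure_mode(source_rows, reference_by_key, key, ref_columns,
                             on_failure="continue"):
    """Return (output_rows, reject_rows) for one of the four failure actions."""
    output, rejects = [], []
    for row in source_rows:
        ref = reference_by_key.get(row[key])
        if ref is not None:
            output.append({**row, **ref})
        elif on_failure == "continue":
            # Keep the row and fill the reference columns with NULL (None).
            output.append({**row, **{col: None for col in ref_columns}})
        elif on_failure == "drop":
            continue  # silently discard the row
        elif on_failure == "reject":
            rejects.append(row)  # row goes down the reject link
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for {key}={row[key]}")
        else:
            raise ValueError(f"unknown failure action: {on_failure}")
    return output, rejects


source = [{"EmpID": 1001, "EmpName": "AABB"}, {"EmpID": 1002, "EmpName": "BBCC"}]
reference_by_key = {1001: {"Salary": 2000}}

continued, _ = lookup_with_failure_mode(source, reference_by_key, "EmpID", ["Salary"])
```

Here `continued` keeps both rows, with `Salary` set to `None` for 1002; switching `on_failure` to `"drop"`, `"reject"` or `"fail"` changes only what happens to that unmatched row.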

The lookup stage gives us 3 different lookup options. The first is Equality, which is the
normal lookup: the data is looked up for an exact, case-sensitive match. The second option
is the Caseless match, which does exactly what the name indicates. The third and final
option is Range. This allows you to define a range lookup on the stream link or a reference
link of a Lookup stage. On the stream link, the lookup compares the value of a source
column to a range of values between two lookup columns. On the reference link, the lookup
compares the value of a lookup column to a range of values between two source columns.
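The caseless and range options can be illustrated with a short sketch. It shows the stream-link form of a range lookup, where a source value is compared with a low and a high column on each reference row. The band data, the column names `low`, `high` and `grade`, and the inclusive bounds are assumptions made for this example; in the stage the comparison operators are chosen by the designer.

```python
def caseless_equal(a, b):
    """Caseless match: compare two strings while ignoring case."""
    return a.casefold() == b.casefold()


def range_lookup(value, bands, low="low", high="high"):
    """Return the first reference row whose [low, high] range contains value."""
    for band in bands:
        if band[low] <= value <= band[high]:
            return band
    return None  # lookup failure


# Hypothetical reference data: salary bands.
bands = [
    {"low": 0, "high": 1999, "grade": "A"},
    {"low": 2000, "high": 3999, "grade": "B"},
]

match = range_lookup(3000, bands)
```

A salary of 3000 falls in the second band, so `match["grade"]` is `"B"`; a salary outside every band returns `None`, which corresponds to a lookup failure.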


Merge Stage: This is one of the processing stages. It can have any number of input links,
a single output link, and the same number of reject links as there are update input links.

The data sets input to the Merge stage must be sorted.


The Merge stage is one of three stages that join tables based on the values of key columns.
The other two are:

Join stage

Lookup Stage

The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for data being input (for example, whether it is
sorted).

The Merge stage combines a master data set with one or more update data sets. The
columns from the records in the master and update data sets are merged so that the output
record contains all the columns from the master record plus any additional columns from
each update record that are required. A master record and an update record are merged
only if both of them have the same values for the merge key column(s) that you specify.
Merge key columns are one or more columns that exist in both the master and update
records.

The Unmatched Masters Mode property gives the stage two join behaviours: Drop (like an
inner join) and Keep (like a left outer join).

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that
rows with the same key column values are located in the same partition and will be
processed by the same node. It also minimizes memory requirements because fewer rows
need to be in memory at any one time. Choosing the auto partitioning method will ensure
that partitioning and sorting is done. If sorting and partitioning are carried out on separate
stages before the Merge stage, InfoSphere DataStage in auto partition mode will detect
this and not repartition (alternatively you could explicitly specify the Same partitioning
method).
As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must
remove duplicate records from the update data sets as well.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links.

Example merge

This example shows what happens to a master data set and two update data sets when they
are merged. The key field is Horse, and all the data sets are sorted in descending order. Here
is the master data set:

Table 1. Master data set

Horse Freezemark Mchip Reg_Soc Level

William DAM7 N/A FPS Adv

Robin DG36 N/A FPS Nov

Kayser N/A N/A AHS N/A

Heathcliff A1B1 N/A N/A Adv

Fairfax N/A N/A FPS N/A

Chaz N/A a296100da AHS Inter

Here is the Update 1 data set:

Table 2. Update 1 data set

Horse vacc. last_worm

William 07.07.02 12.10.02

Robin 07.07.02 12.10.02

Kayser 11.12.02 12.10.02

Heathcliff 07.07.02 12.10.02

Fairfax 11.12.02 12.10.02

Chaz 10.02.02 12.10.02


Here is the Update 2 data set:

Table 3. Update 2 data set

Horse last_trim shoes

William 11.05.02 N/A

Robin 12.03.02 refit

Kayser 11.05.02 N/A

Heathcliff 12.03.02 new

Fairfax 12.03.02 N/A

Chaz 12.03.02 new

Here is the merged data set output by the stage:

Table 4. Merged data set

Horse       Freezemark  Mchip    Reg_Soc  Level  vacc.     last_worm  last_trim  shoes
William     DAM7        N/A      FPS      Adv    07.07.02  12.10.02   11.05.02   N/A
Robin       DG36        N/A      FPS      Nov    07.07.02  12.10.02   12.03.02   Refit
Kayser      N/A         N/A      AHS      Nov    11.12.02  12.10.02   11.05.02   N/A
Heathcliff  A1B1        N/A      N/A      Adv    07.07.02  12.10.02   12.03.02   New
Fairfax     N/A         N/A      FPS      N/A    11.12.02  12.10.02   12.03.02   N/A
Chaz        N/A         a2961da  AHS      Inter  10.02.02  12.10.02   12.03.02   New
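The merge behaviour described above can be sketched in Python. This is an illustration only: the real stage relies on sorted, key-partitioned input, whereas this sketch uses dictionaries for brevity. The function name, the `unmatched_masters` parameter and the extra update row for "Paddy" are assumptions added to show a reject.

```python
def merge(master, updates, key, unmatched_masters="keep"):
    """Merge a master data set with update data sets on a key column.

    Returns (merged_rows, rejects), where rejects holds one list per update
    data set containing the update rows that matched no master row.
    """
    master_keys = {row[key] for row in master}
    update_indexes = [{row[key]: row for row in upd} for upd in updates]

    merged = []
    for row in master:
        matches = [idx[row[key]] for idx in update_indexes if row[key] in idx]
        if not matches and unmatched_masters == "drop":
            continue  # Drop: behaves like an inner join
        out = dict(row)
        for match in matches:
            out.update(match)  # add the extra columns from each update row
        merged.append(out)

    rejects = [[r for r in upd if r[key] not in master_keys] for upd in updates]
    return merged, rejects


master = [
    {"Horse": "William", "Freezemark": "DAM7", "Level": "Adv"},
    {"Horse": "Robin", "Freezemark": "DG36", "Level": "Nov"},
]
update1 = [
    {"Horse": "William", "vacc": "07.07.02", "last_worm": "12.10.02"},
    {"Horse": "Paddy", "vacc": "01.01.02", "last_worm": "02.02.02"},  # hypothetical row
]
update2 = [{"Horse": "William", "last_trim": "11.05.02", "shoes": "N/A"}]

merged, rejects = merge(master, [update1, update2], "Horse")
```

With Keep, Robin is output unchanged even though no update row matches; with `unmatched_masters="drop"` only William remains. The hypothetical Paddy row has no master record, so it lands in the reject list for the first update data set.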

Funnel Stage:

The Funnel stage is a processing stage. It copies multiple input data sets to a single output
data set. This operation is useful for combining separate data sets into a single large data
set. The stage can have any number of input links and a single output link.

The Funnel stage can operate in one of three modes:


Continuous Funnel: combines the records of the input data in no guaranteed order. It takes one
record from each input link in turn. If data is not available on an input link, the stage skips to the next
link rather than waiting.

Sort Funnel: combines the input records in the order defined by the value(s) of one or more key
columns, and the order of the output records is determined by these sorting keys.

Sequence: copies all records from the first input data set to the output data set, then all the records
from the second input data set, and so on.

For all methods the metadata of all input data sets must be identical.

Additional points about the Funnel stage:

The sort funnel method has some particular requirements about its input data. All input data sets must be
sorted by the same key columns that the Funnel operation uses.

Typically all input data sets for a sort funnel operation are hash-partitioned before they're sorted (choosing the
auto partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the
same key column values are located in the same partition and so are processed on the same node. If sorting
and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be
preserved.

The sort funnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage
first examines the primary key in each input record. For multiple records with the same primary key value, it
then examines secondary keys to determine the order of records it will output.
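The three funnel modes can be approximated with the Python standard library. This is a sketch, not the stage's implementation: the function names are chosen for this example, the continuous mode is modelled as a simple round-robin (the real stage guarantees no order), and the sort funnel assumes every input is already sorted on the key.

```python
import heapq
from itertools import chain, zip_longest


def sequence_funnel(*inputs):
    """Sequence: all records of the first input, then the second, and so on."""
    return list(chain(*inputs))


def continuous_funnel(*inputs):
    """Continuous: one record from each input in turn, skipping empty inputs."""
    missing = object()
    output = []
    for group in zip_longest(*inputs, fillvalue=missing):
        output.extend(rec for rec in group if rec is not missing)
    return output


def sort_funnel(*inputs, key):
    """Sort: merge inputs that are already sorted on the same key."""
    return list(heapq.merge(*inputs, key=key))


a = [(1, "a"), (4, "a"), (7, "a")]
b = [(2, "b"), (5, "b")]
c = [(3, "c")]

by_sequence = sequence_funnel(a, b, c)
by_turn = continuous_funnel(a, b, c)
by_key = sort_funnel(a, b, c, key=lambda rec: rec[0])
```

`heapq.merge` never re-sorts its inputs, which mirrors the requirement that each input to a sort funnel must arrive sorted on the funnel keys.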

Filter Stage: This is one of the processing stages. It can have a single input link, any number of
output links and, optionally, a single reject link. To create the reject link, select an output
(stream) link, right-click it and convert the stream link to a reject link.

Conditions can be written on multiple columns, and multiple operators are supported
(=, <, >, <=, >=, AND, OR, NOT).

The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified
requirements and filters out all other records. You can specify different requirements to route rows down
different output links. The filtered out records can be routed to a reject link, if required.
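Routing rows by condition can be sketched as below. The predicates stand in for the where clauses of the output links; the function name and the `output_once` flag are names chosen for this sketch, and the sample rows are hypothetical.

```python
def filter_stage(rows, where_clauses, output_once=False):
    """Send each row, unmodified, to every output whose condition it satisfies.

    Returns (outputs, rejects): one list per where clause, plus the rows that
    satisfied none of them.
    """
    outputs = [[] for _ in where_clauses]
    rejects = []
    for row in rows:
        matched = False
        for i, predicate in enumerate(where_clauses):
            if predicate(row):
                outputs[i].append(row)
                matched = True
                if output_once:
                    break  # stop at the first matching output link
        if not matched:
            rejects.append(row)
    return outputs, rejects


rows = [
    {"Dept": "IT", "Salary": 2000},
    {"Dept": "IT", "Salary": 4000},
    {"Dept": "BS", "Salary": 1000},
]
where_clauses = [
    lambda r: r["Dept"] == "IT" and r["Salary"] > 2500,
    lambda r: r["Dept"] == "IT",
]

outputs, rejects = filter_stage(rows, where_clauses)
```

The BS row satisfies neither condition and ends up in `rejects`; the 4000 row satisfies both, so it appears on both outputs unless `output_once=True` is passed.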

External Filter: This is one of the processing stages. It supports one input link and one output
link. The filter itself is a UNIX command, with grep being the typical choice.

The External Filter stage allows you to specify a UNIX command that acts as a filter on the data you are
processing. This can be a quick and efficient way of filtering data.

Switch Stage:
The Switch stage is a processing stage. It can have a single input link, up to 128 output links and a single
reject link.

The condition is written on a single column, and only one operator (=) is supported.


The Switch stage takes a single data set as input and assigns each input row to an output data set based
on the value of a selector field. The Switch stage performs an operation analogous to a C switch
statement, which causes the flow of control in a C program to branch to one of several cases based on
the value of a selector variable. Rows that satisfy none of the cases are output on the rejects link.

Example

The example Switch stage implements the following switch statement:

switch (selector)
{
    case 0:     // if selector = 0,
                // write record to output data set 0
        break;
    case 10:    // if selector = 10,
                // write record to output data set 1
        break;
    case 12:    // if selector = discard value (12),
                // skip record
        break;
    default:    // if selector is invalid,
                // send row down reject link
        break;
}

The metadata input to the Switch stage is as follows:

Table 1. Column definitions

Column name  SQL Type
Select       Integer
col1         Char
col2         Char
col3         Char

The column called Select is the selector; its value determines which output link the rest of the
row will be output to. The properties of the stage are:

Figure 1. Properties tab
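The same routing can be sketched in Python. The case mapping (0 to output 0, 10 to output 1) and the discard value 12 come from the switch statement above; the function name, the parameter names and the sample rows are assumptions for this illustration.

```python
def switch_stage(rows, selector, cases, discard_value=None):
    """Route each row to the output link mapped to its selector value.

    cases maps a selector value to an output link number. Rows whose selector
    equals discard_value are skipped; rows matching no case are rejected.
    """
    outputs = {link: [] for link in cases.values()}
    rejects = []
    for row in rows:
        value = row[selector]
        if discard_value is not None and value == discard_value:
            continue  # discard: the row is written nowhere
        if value in cases:
            outputs[cases[value]].append(row)
        else:
            rejects.append(row)  # no case matched: reject link
    return outputs, rejects


rows = [
    {"Select": 0, "col1": "a"},
    {"Select": 10, "col1": "b"},
    {"Select": 12, "col1": "c"},
    {"Select": 99, "col1": "d"},
]

outputs, rejects = switch_stage(rows, "Select", {0: 0, 10: 1}, discard_value=12)
```

The row with `Select` 12 is discarded and the row with `Select` 99 matches no case, so it is the only row in `rejects`.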


Copy Stage: The Copy stage is a processing stage. It can have a single input link and any
number of output links.

The Copy stage copies a single input data set to a number of output data sets. Each record of the input
data set is copied to every output data set. Records can be copied without modification, or you can drop or
change the order of columns (to copy with more modification, for example changing column data types,
use the Modify stage). Copy lets you, for example, make a backup copy of a data set on
disk while performing an operation on another copy.

When you are using a Copy stage with a single input and a single output, you should ensure that you set
the Force property in the stage editor to True. This prevents InfoSphere DataStage from deciding that
the Copy operation is superfluous and optimizing it out of the job.

Stage Properties:

Example

In this example you are going to copy data from a table containing billing information for GlobalCo's
customers. You are going to copy it to three separate data sets, and in each case you are only copying a
subset of the columns. The Copy stage will drop the unwanted columns as it copies the data set.

The column names for the input data set are as follows:

BILL_TO_NUM
CUST_NAME
ADDR_1
ADDR_2
CITY
REGION_CODE
ZIP
ATTENT
COUNTRY_CODE
TEL_NUM
FIRST_SALES_DATE
LAST_SALES_DATE
REVIEW_MONTH
SETUP_DATE
STATUS_CODE
REMIT_TO_CODE
CUST_TYPE_CODE
CUST_VEND
MOD_DATE
MOD_USRNM
CURRENCY_CODE
CURRENCY_MOD_DATE
MAIL_INVC_FLAG
PYMNT_CODE
YTD_SALES_AMT
CNTRY_NAME
CAR_RTE
TPF_INVC_FLAG
INVC_CPY_CNT
INVC_PRT_FLAG
FAX_PHONE
FAX_FLAG
ANALYST_CODE
ERS_FLAG
Here is the job that will perform the copying:

Figure 1. Example job

The Copy stage properties are fairly simple. The only property is Force, and you do not need to set it in
this instance as you are copying to multiple data sets (and InfoSphere DataStage will not attempt to
optimize it out of the job). You need to concentrate on telling InfoSphere DataStage which columns to
drop on each output link. The easiest way to do this is using the Output page Mapping tab. When you
open this for a link, the left pane shows the input columns. Simply drag the columns you want to preserve
across to the right pane. You repeat this for each link as follows:
Figure 2. Mapping tab: first output link

Figure 3. Mapping tab: second output link


Figure 4. Mapping tab: third output link
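The effect of the three mappings can be sketched in Python. The column names are taken from the input list above, but the particular subsets chosen for each link and the sample row values are assumptions, since the mapping figures are not reproduced here.

```python
def copy_stage(rows, output_columns):
    """Copy every input row to each output, keeping only that output's columns."""
    return [
        [{col: row[col] for col in columns} for row in rows]
        for columns in output_columns
    ]


# Hypothetical billing rows using a few of the input columns.
rows = [
    {"BILL_TO_NUM": 1, "CUST_NAME": "Acme", "ADDR_1": "1 High St",
     "CITY": "Leeds", "TEL_NUM": "555-0100", "YTD_SALES_AMT": 1200.0},
    {"BILL_TO_NUM": 2, "CUST_NAME": "Bolt", "ADDR_1": "9 Low Rd",
     "CITY": "York", "TEL_NUM": "555-0101", "YTD_SALES_AMT": 800.0},
]

# One column list per output link (assumed subsets).
outputs = copy_stage(rows, [
    ["CUST_NAME", "ADDR_1", "CITY"],
    ["BILL_TO_NUM", "YTD_SALES_AMT"],
    ["CUST_NAME", "TEL_NUM"],
])
```

Each of the three outputs contains every input row, but only the columns mapped for that link.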
When the job is run, three copies of the original data set are produced, each containing a subset of the
original columns, but all of the rows. Here is some sample data from the data set on DSLink6,
which gives name and address information:
Modify Stage:
The Modify stage is a processing stage. It can have a single input link and a single output link.

The Modify stage alters the record schema of its input data set. The modified data set is then output. You can
drop or keep columns from the schema, or change the data type of a column.

Note:

1. We can drop and keep columns in the Modify stage (syntax: KEEP sal, DROP Empnos).

2. We can change the column data types.

3. We can handle null values. Ex: Target_column = handle_null(source_column, )

Important point: You need to define the metadata on the Modify stage output tab for the columns that you
want to appear in the output.
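The three kinds of modification listed above can be sketched in Python. This only illustrates the effect on the records; the function and parameter names are hypothetical, and the stage itself takes its own specification syntax rather than Python callables.

```python
def modify(rows, keep=None, drop=(), convert=None, null_defaults=None):
    """Keep or drop columns, replace nulls, and convert column types."""
    convert = convert or {}
    null_defaults = null_defaults or {}
    output = []
    for row in rows:
        record = {}
        for name, value in row.items():
            if keep is not None and name not in keep:
                continue  # not in the keep list
            if name in drop:
                continue  # explicitly dropped
            if value is None and name in null_defaults:
                value = null_defaults[name]  # null handling
            if value is not None and name in convert:
                value = convert[name](value)  # type conversion
            record[name] = value
        output.append(record)
    return output


rows = [{"empno": 1, "sal": "2000", "comm": None}]

modified = modify(rows, drop=("empno",), convert={"sal": int},
                  null_defaults={"comm": 0})
```

In this run `empno` is dropped, `sal` becomes an integer and the null `comm` is replaced by 0.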

Aggregator stage:
The Aggregator stage is a processing stage. It classifies data rows from a single input
link into groups and computes totals or other aggregate functions for each group. The
summed totals for each group are output from the stage via an output link.
Example
The example data is from a freight carrier who charges customers based on distance, equipment, packing, and
license requirements. They need a report of distance traveled and charges grouped by date and license type.

The following table shows a sample of the data:

Table 1. Sample of data

Ship Date   District  Distance  Equipment  Packing  License  Charge
...
2000-06-02  1         1540      D          M        BUN      1300
2000-07-12  1         1320      D          C        SUM      4800
2000-08-02  1         1760      D          C        CUM      1300
2000-06-22  2         1540      D          C        CUN      13500
2000-07-30  2         1320      D          M        SUM      6000
...

The stage will output the following columns:

Table 2. Output column definitions

Column name SQL Type

DistanceSum Decimal

DistanceMean Decimal

ChargeSum Decimal

ChargeMean Decimal

License Char

Shipdate Date

The stage first hash partitions the incoming data on the license column, then sorts it on license and date:

Figure 1. Partitioning tab


The properties are then used to specify the grouping and the aggregating of the data:

Figure 2. Properties tab


The following is a sample of the output data:

Table 3. Output data

Ship Date   License  Distance Sum  Distance Mean  Charge Sum   Charge Mean
...
2000-06-02  BUN      1126053.00    1563.93        20427400.00  28371.39
2000-06-12  BUN      2031526.00    2074.08        22426324.00  29843.55
2000-06-22  BUN      1997321.00    1958.45        19556450.00  19813.26
2000-06-30  BUN      1815733.00    1735.77        17023668.00  18453.02
...
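The grouping and aggregation can be sketched in Python using the sample rows from Table 1. The function name and the output naming pattern are choices made for this sketch, and the last input row is a hypothetical addition so that one group contains more than one record.

```python
from collections import defaultdict


def aggregate(rows, group_keys, measures):
    """Group rows on group_keys and compute the sum and mean of each measure."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)

    output = []
    for key_values, members in sorted(groups.items()):
        record = dict(zip(group_keys, key_values))
        for measure in measures:
            values = [member[measure] for member in members]
            record[f"{measure}Sum"] = sum(values)
            record[f"{measure}Mean"] = sum(values) / len(values)
        output.append(record)
    return output


rows = [
    {"ShipDate": "2000-06-02", "License": "BUN", "Distance": 1540, "Charge": 1300},
    {"ShipDate": "2000-07-12", "License": "SUM", "Distance": 1320, "Charge": 4800},
    {"ShipDate": "2000-08-02", "License": "CUM", "Distance": 1760, "Charge": 1300},
    {"ShipDate": "2000-06-22", "License": "CUN", "Distance": 1540, "Charge": 13500},
    {"ShipDate": "2000-07-30", "License": "SUM", "Distance": 1320, "Charge": 6000},
    # Hypothetical extra row so that one group has two members.
    {"ShipDate": "2000-06-02", "License": "BUN", "Distance": 1000, "Charge": 700},
]

summary = aggregate(rows, ["License", "ShipDate"], ["Distance", "Charge"])
```

The two BUN rows for 2000-06-02 collapse into one output record with `DistanceSum` 2540 and `ChargeMean` 1000.0; every other group has a single member.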

Pivot Stage: The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
The Pivot Enterprise stage is in the Processing section of the Palette pane.

Figure 1: Pivot Enterprise stage is in the Processing section of the Palette pane

Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows. The output
data of the horizontal pivot action typically has fewer columns, but more rows than the input data. With vertical
pivoting, you can map several sets of input columns to several output columns.

Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array size
determines the number of rows in the output data. The output data of the vertical pivot action typically has more
columns, but fewer rows than the input data.
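Both pivot directions can be sketched in Python. The function names, the column names and the sample data are hypothetical; in this sketch `array_size` sets how many input values are spread across output columns for each key.

```python
def horizontal_pivot(rows, key, columns, out_name):
    """Turn a set of columns in each input row into one column over several rows."""
    return [{key: row[key], out_name: row[col]} for row in rows for col in columns]


def vertical_pivot(rows, key, value, array_size, prefix):
    """Turn the rows sharing a key into one row with several value columns."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row[value])

    output = []
    for key_value, values in grouped.items():
        record = {key: key_value}
        for i in range(array_size):
            # Pad with None when a key has fewer rows than array_size.
            record[f"{prefix}{i + 1}"] = values[i] if i < len(values) else None
        output.append(record)
    return output


wide = [{"id": 1, "q1": 10, "q2": 20, "q3": 30}]
tall = horizontal_pivot(wide, "id", ["q1", "q2", "q3"], "sales")
back = vertical_pivot(tall, "id", "sales", 3, "sales_")
```

The horizontal pivot turns one row with three quarter columns into three rows with a single `sales` column, and the vertical pivot reverses that shape.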
Compress stage
The Compress stage is a processing stage. It can have a single input link and a single output link.

The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set
from a sequence of records into a stream of raw binary data. The complement to the Compress stage is the
Expand stage.

A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set
stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until
its rows are returned to their normal format. Stages that do not perform column-based processing or reorder
the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of
the compressed data set.

Because compressing a data set removes its normal record boundaries, the compressed data set must not be
repartitioned before it is expanded.

Note: It is suggested to use a Data Set stage as the target, and in this example there is no need to define
the metadata on the output tab of the Compress stage.
Expand Stage
The Expand stage is a processing stage. It can have a single input link and a single output link.

The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously
compressed data set back into a sequence of records from a stream of raw binary data. The complement to the
Expand stage is the Compress stage.
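The round trip between records and a compressed binary stream can be sketched with Python's `gzip` module. Serialising each record as a JSON line is an assumption made for this example and says nothing about the format the stages use internally.

```python
import gzip
import json


def compress_records(records):
    """Turn a sequence of records into one stream of compressed binary data."""
    raw = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return gzip.compress(raw)


def expand_records(blob):
    """Turn the compressed stream back into the original sequence of records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


records = [{"EmpID": 1001, "Dept": "IT"}, {"EmpID": 1003, "Dept": "BS"}]
blob = compress_records(records)
restored = expand_records(blob)
```

The compressed `blob` is a single byte string with no record boundaries, which is why it can be stored or copied but not processed column by column until it is expanded.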

Note: No need to load the metadata for the source dataset for this example.
Change Capture Stage:
The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of
the differences.

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set
whose records represent the changes made to the before data set to obtain the after data set. The stage
produces a change data set, whose table definition is transferred from the after data set's table definition with
the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit.
The preserve-partitioning flag is set on the change data set.
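The comparison can be sketched in Python. The four actions are the ones named above; the function name, the `change_code` column name and the use of text labels instead of numeric codes are choices made for this sketch, and the sample data is hypothetical.

```python
def change_capture(before, after, key):
    """Compare two data sets on a key and label each record with a change action."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    changes = []
    for key_value, row in after_by_key.items():
        if key_value not in before_by_key:
            code = "insert"  # only in the after data set
        elif row == before_by_key[key_value]:
            code = "copy"    # identical in both data sets
        else:
            code = "edit"    # same key, different values
        changes.append({**row, "change_code": code})

    for key_value, row in before_by_key.items():
        if key_value not in after_by_key:
            changes.append({**row, "change_code": "delete"})  # only in before
    return changes


before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
after = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]

changes = change_capture(before, after, "id")
```

Record 1 is a copy, record 2 an edit, record 4 an insert and record 3 a delete. The output columns follow the after data set plus the added change code column.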
