DataStage Stages 12-Dec-2013 12PM
DataStage provides default variables during installation, called environment variables, which can be used throughout a project.
These variables are set at the project level. You can access them by logging into the DataStage Administrator client,
selecting the project on the Projects tab, and choosing Properties -> General Tab -> Environment.
Node Structure:
{
node "node1"
{
fastname "dc4c37"
pools ""
resource disk "/dstage/dsdata/pxdataset" {pools ""}
resource disk "/dstage/dsdata/pxfileset" {pools "export"}
resource scratchdisk "/dstage/dstemp/dsscratch" {pools ""}
}
}
Here fastname is the server (host) name. If the fastname is different for each node, the job runs in MPP mode; if all nodes share the
same fastname it runs in SMP mode. The scratch disk is temporary storage used while the job runs (for example, for sorting), and the
resource disk is where persistent data such as data sets and file sets are stored.
Schema File:
You can also specify the Meta data for a stage in a plain text file known as a schema file. This is
not stored in the Repository but you could, for example, keep it in a document management or
source code control system, or publish it on an intranet site.
//Schema File is used to read Input data without specifying metadata in the Sequential File
//stage
//Created On : 11/17/2010
//Created By : Pavan Kumar Reddy
record
{final_delim=end,delim=none}
(
CUSTOMER_SSN: NULLABLE STRING[11];
CUSTOMER_NAME:STRING[30];
CUSTOMER_CITY:STRING[40];
CUSTOMER_ZIPCODE:STRING[10];
)
The format of each line describing a column is:
column_name:[nullability]datatype;
column_name. This is the name that identifies the column. Names must start with a
letter or an underscore (_), and can contain only alphanumeric or underscore characters.
The name is not case sensitive. The name can be of any length.
nullability. You can optionally specify whether a column is allowed to contain a null
value, or whether this would be viewed as invalid. If the column can be null, insert the
word 'nullable'. By default columns are not nullable.
You can also include 'nullable' at record level to specify that all columns are nullable,
then override the setting for individual columns by specifying `not nullable'. For
example:
record nullable (
name:not nullable string[255];
value1:int32;
date:date)
datatype. This is the data type of the column. This uses the internal data types, see Data
Types, not the SQL data types as used on Columns tabs in stage editors.
Remember that you should turn runtime column propagation on if you intend to use schema
files to define column metadata.
Pipeline Parallelism: Instead of waiting for all of the source data to be read, records are passed to
subsequent stages as soon as the source data stream starts. This method is called
pipeline parallelism.
Partition Parallelism: Partition parallelism is the technique of dividing the incoming stream of data into subsets (partitions)
and distributing the records across the nodes using different partitioning techniques.
When large volumes of data are involved you can use the power of partition parallelism to your
best advantage by partitioning the data into a number of separate sets, with each partition
being handled by a separate instance of the job stages.
Partition Techniques:
Partitioning techniques are used for performance tuning of jobs.
Same:
The stage using the data set as input performs no repartitioning; it takes as input the
partitions output by the preceding stage.
With this partitioning method records stay on the same processing node, that is, they are not
redistributed.
Hash:
Rows with the same key column values go to the same partition.
Partitioning is based on one or more key columns in each record.
This method is useful for ensuring that related records are in the same partition.
It can become a bottleneck when the key values are skewed, because some nodes are then
required to process more records than others.
Modulus:
Partitioning is based on a key column modulo the number of partitions. This method is
similar to hash by field but involves simpler computation.
Modulus partitioning assigns each record of an input set to a partition of its output
set as determined by a specified key field in the input set.
The partition number of each record is calculated as follows:
o partition_number = fieldname mod number_of_partitions
o For example, partition number (0, 1 or 2) = 20 (column value) mod 3 (3 nodes) = 2
Here fieldname must be a numeric field of the input set.
number_of_partitions is the number of processing nodes on which the partitions
execute. If an operator executes on three processing nodes it has three
partitions.
For example, with the following input data and three partitions:
Deptno: 15 20 22 9 44 16 25 33 30 10
Dname:  A  B  C  D  E  F  G  H  I  J
Modulus partitioning on Deptno distributes the records as:
Partition 0 (Deptno mod 3 = 0): (15, A), (9, D), (33, H), (30, I)
Partition 1 (Deptno mod 3 = 1): (22, C), (16, F), (25, G), (10, J)
Partition 2 (Deptno mod 3 = 2): (20, B), (44, E)
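The same calculation can be sketched in Python (a conceptual illustration of the modulus rule only, not DataStage code; the Deptno/Dname pairs are the sample values above):

# Conceptual sketch of modulus partitioning: partition = key mod number_of_partitions.
records = [(15, "A"), (20, "B"), (22, "C"), (9, "D"), (44, "E"),
           (16, "F"), (25, "G"), (33, "H"), (30, "I"), (10, "J")]
number_of_partitions = 3  # one partition per processing node

partitions = {p: [] for p in range(number_of_partitions)}
for deptno, dname in records:
    partitions[deptno % number_of_partitions].append((deptno, dname))

for p, rows in partitions.items():
    print("Partition", p, rows)
# Partition 0 [(15, 'A'), (9, 'D'), (33, 'H'), (30, 'I')]
# Partition 1 [(22, 'C'), (16, 'F'), (25, 'G'), (10, 'J')]
# Partition 2 [(20, 'B'), (44, 'E')]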
Range Partition:
Similar to hash, but the partition mapping is user determined and the partitions are ordered.
Range partitioning divides the input data set into approximately equal-sized partitions, each of which
contains records with key column values within a specified range, based on one or more partitioning keys.
This method is also useful for ensuring that related records are in the same partition.
All partitions are of approximately the same size; in an ideal distribution every partition would
be exactly the same size.
DB2 Partition:
To use DB2 partitioning on a stage, select a partition type of DB2 on the Partitioning tab,
then click the Properties button on the right.
In the Partitioning/Collection Properties dialog box, specify the details of the DB2 table
whose partitioning you want to replicate.
Auto Partition:
The most common method you will see on the parallel stages is Auto.
This simply means that DataStage determines the best partitioning method to use depending on
the type of stage.
Typically DataStage uses round robin when initially partitioning data.
Collecting:
Collecting is the process of joining the multiple partitions of a data set back together again
into a single partition.
There are various situations where you may want to do this.
Collecting is used only when the data flows from parallel to sequential mode.
At the end of the job you may want to collect all the data into a single database table, in which
case you need to collect it before you write it.
There are other cases where you do not want to collect the data at all; for example, you
may want to write each partition to a separate flat file.
The collecting methods are:
o Auto
o Round Robin
o Ordered
o Sort Merge
Auto:
The most common collection method you will see in parallel jobs is Auto.
This means that DataStage will read any row from any input partition as it becomes available.
Round Robin:
Reads a record from the first partition, then from the second partition, and so on.
This is slower than Auto and rarely used.
Ordered Collection:
Reads all records from the first partition, then all the records from the second partition, and so
on.
This collection method preserves the order of a totally sorted input data set.
Sort Merge:
Reads records in an order based on one or more collecting key columns, so data that is sorted
within each partition comes out as a single sorted stream.
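To make the difference concrete, here is a small Python sketch (an illustration of the idea only, not DataStage code; the partition contents are made up):

# Round robin collection takes one record from each partition in turn;
# ordered collection reads partition 0 completely, then partition 1, and so on.
partitions = [
    [1, 4, 7],   # partition 0
    [2, 5, 8],   # partition 1
    [3, 6, 9],   # partition 2
]

def round_robin_collect(parts):
    parts = [list(p) for p in parts]
    out, i = [], 0
    while any(parts):
        p = parts[i % len(parts)]
        if p:
            out.append(p.pop(0))
        i += 1
    return out

def ordered_collect(parts):
    return [record for p in parts for record in p]

print(round_robin_collect(partitions))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(ordered_collect(partitions))      # [1, 4, 7, 2, 5, 8, 3, 6, 9]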
Sequential File Stage:
This is one of the file stages. It allows you to read data from one or more flat files or write data to a flat file. The stage
can have a single input link or a single output link, and a single reject link.
If you are using it as a source it can have 1 output link and 1 reject link, and if you are using it as a target
it can have 1 input link and 1 reject link.
The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one
file. By default a complete file will be read by a single node (although each node might read more than one file).
For fixed-width files, however, you can configure the stage to behave differently:
You can specify that single files can be read by multiple nodes. This can improve performance on
cluster systems. See "Read From Multiple Nodes"
You can specify that a number of readers run on a single node. This means, for example, that a single
file can be partitioned as it is read (even though the stage is constrained to running sequentially on the
conductor node). See "Number Of Readers Per Node".
(These two options are mutually exclusive.)
The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each
node writes to a single file, but a node can write more than one file.
When reading or writing a flat file, InfoSphere DataStage needs to know something about the format of the
file. The information required is how the file is divided into rows and how rows are divided into columns. You
specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the
Edit Column Metadata dialog box.
The stage editor has up to three pages, depending on whether you are reading or writing a file:
Stage Page. This is always present and is used to specify general information about the stage.
Input Page. This is present when you are writing to a flat file. This is where you specify details about
the file or files being written to.
Output Page. This is present when you are reading from a flat file or have a reject link. This is where
you specify details about the file or files being read from.
Reject mode
Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue
to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, Output to
send rejected rows down a reject link. Defaults to Continue.
Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10%
interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed
length, and there is no filter on the file.
Filter
This is an optional property. You can use this to specify that the data is passed through a filter program after
being read from the files. Specify the filter command, and any required arguments, in the Property Value box.
File name column
This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the
pathname of the file the record is read from. You should also add this column manually to the Columns
definitions to ensure that the column is not dropped if you are not using runtime column propagation, or it is
turned off at some point.
The resulting data set contains one partition per instance of the file read operator, as determined by
numReaders.
This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file
can be divided according to the number of readers per node, and written to separate partitions. This method
can result in better I/O performance on an SMP system.
Schema file
This is an optional property. By default the Sequential File stage will use the column definitions defined on the
Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a
schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these
match the schema file). Type in a pathname or browse for a schema file.
Stages that support reading or writing with a schema file in this way include:
Sequential File
File Set
External Source
External Target
Column Import
Column Export
Conversion problem: data read from a flat file needs to be converted from ASCII to the native internal format as it is passed from stage to stage.
Questions:
1. We have a list of .txt files and are asked to read only the first 3 files using the Sequential File stage.
How will you do it?
Ans: 1. In the Sequential File stage, set the Read Method to Specific File(s).
2. Then, in the file text field, put a command along these lines:
`ls <directory>/*.txt | head -3`
(that is, list the files and take only the first 3).
Data Set:
Data Set is a file stage which is used for staging the data when we design dependent jobs.
The Data Set stage supports 1 input link or 1 output link, and there are no reject links on the Data Set stage.
Data sets are of two kinds:
1) Virtual
2) Persistent
Virtual is nothing but the data as it is passed along a link.
Persistent is nothing but the data landed to disk in the target.
Physically a data set is made up of two kinds of files:
1) Orchestrate files
2) Operating system files
In the descriptor file we can see the schema details and the address of the data.
In the data file we can see the data in native format.
The control and header files reside in the operating system and act as an interface between the descriptor and
data files.
To delete a data set: Tools > Data Set Management > a small window pops up > select the data set file which was
created > again a small window pops up showing the node names and the segments >
select each one and hit the Delete button at the top, and it's gone.
Complex Flat File (CFF) Stage:
As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one
or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from
files that contain multiple record types. The source data can contain one or more of the following clauses:
GROUP
REDEFINES
OCCURS
OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the
stage to run sequentially if it is reading only one file with a single reader.
As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or
more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.
Figure 1. This job has a Complex Flat File source stage with a single reject link, and a Complex Flat File target
stage with a single reject link.
Link Sort:
This is a simple sort that can be done on the input link of any stage. Sorting is done on a
key column.
Navigation: open the stage properties -> Partitioning tab -> select a partitioning technique; this
enables the Perform Sort option. Ticking Perform Sort alone gives you the data in sorted
order only. If you also want unique data, you need to use the other two options,
Stable and Unique. With these options the data comes to the output in sorted order with
duplicates removed.
Sort Stage:
This is one of the processing stages. It can have 1 input link and 1 output link.
Key = the key column(s) on which you want to perform the sort.
Sort Key Mode: Sort, Don't Sort (Previously Sorted), Don't Sort (Previously Grouped).
Create Key Change Column: when set to True, this adds a column to the output that contains 1 for
the first record of each group and 0 for the remaining records of the group.
Create Cluster Key Change Column: this behaves the same way, but applies when the Sort Key Mode
is set to one of the Don't Sort options for previously sorted or grouped data.
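As a quick illustration of the key change column (a Python sketch of the behaviour only, not DataStage code; the key/value rows are made up):

# The added column is 1 for the first record of each key group and 0 for the rest.
rows = [("IT", 10), ("IT", 20), ("IT", 30), ("HR", 40), ("HR", 50)]  # already sorted on the key

previous_key = object()  # sentinel that matches no real key
for key, value in rows:
    key_change = 1 if key != previous_key else 0
    previous_key = key
    print(key, value, key_change)
# IT 10 1
# IT 20 0
# IT 30 0
# HR 40 1
# HR 50 0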
Remove Duplicates Stage:
The Remove Duplicates stage is a processing stage. It can have a single input link and a
single output link.
The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate
rows, and writes the results to an output data set.
Removing duplicate records is a common way of cleansing a data set before you perform
further processing. Two rows are considered duplicates if they are adjacent in the input data
set and have identical values for the key column(s). A key column is any column you
designate to be used in determining whether two rows are identical.
The data set input to the Remove Duplicates stage must be sorted so that all records with
identical key values are adjacent. You can either achieve this using the in-stage sort facilities
available on the Input page Partitioning tab, or have an explicit Sort stage feeding the
Remove Duplicates stage.
Differences between the link sort and the Sort stage:
1. The link sort does not have the Don't Sort option, whereas the Sort stage has this facility.
2. The Sort stage has the Create Cluster Key Change Column option, but the link sort does not.
3. We cannot capture the duplicate records with a link sort, whereas we can capture the duplicate
values by using the Sort stage.
4. A link sort requires a partitioning technique to be selected, whereas for the Sort stage this is not mandatory.
5. Using a link sort we can pick only the first record of each group of duplicates, but using the Remove Duplicates
stage we can pick either the first or the last record (see the sketch below).
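Here is a Python sketch of the keep-first versus keep-last behaviour on a sorted input (an illustration only, not DataStage code; the key/value rows are made up):

# Remove adjacent duplicates from a data set sorted on the key (first field),
# keeping either the first or the last record of each key group.
rows = [(100, "a"), (100, "b"), (101, "c"), (102, "d"), (102, "e")]

def remove_duplicates(sorted_rows, keep="first"):
    out = []
    for row in sorted_rows:
        if out and out[-1][0] == row[0]:      # same key as the previous record
            if keep == "last":
                out[-1] = row                 # last record of the group wins
            # keep == "first": ignore the duplicate
        else:
            out.append(row)
    return out

print(remove_duplicates(rows, keep="first"))  # [(100, 'a'), (101, 'c'), (102, 'd')]
print(remove_duplicates(rows, keep="last"))   # [(100, 'b'), (101, 'c'), (102, 'e')]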
Example
In the example the data is a list of GlobalCo's customers. The data contains some duplicate entries, and you
want to remove these.
The first step is to sort the data so that the duplicates are actually next to each other. As with all sorting
operations, there are implications around data partitions if you run the job in parallel (see "Copy Stage," for a
discussion of these). You should hash partition the data using the sort keys as hash keys in order to guarantee
that duplicate rows are in the same partition. In the example you sort on the CUSTOMER_NUMBER columns
and the sample of the sorted data shows up some duplicates:
Here is a sample of the data after the job has been run and the duplicates removed:
Join Stage: This is one of the processing stages. It supports N inputs and 1 output and no
reject links.
The input links are named as follows: the first source is Left, the last one is Right, and the remaining links are
Intermediate.
A full outer join supports only 2 inputs and 1 output and no reject link. The primary and secondary records
should be in sorted order, and the key column name and data type should be the same on each input.
Join types: inner join, left outer join, right outer join, full outer join.
Memory usage is light. Compared with a lookup, which stores the reference data in buffer memory, the Join stage
takes less memory to perform the join.
If we are performing an inner join, the output contains the left table records that match all the
intermediate and right tables.
If we are performing a left outer join, the output contains all the left table records, matched
against the intermediate and right tables; for unmatched columns it populates null values.
If we are performing a right outer join, the output contains all the right table records, matched
against the left and intermediate tables; for unmatched records it populates null values.
If we are performing a full outer join (only two tables), the output contains both the matched and
unmatched data from both tables; for unmatched columns it populates null values.
If we want to join tables using the Join stage we need to have common key columns in those tables, but
sometimes we get the data without a common key column.
In that case we can use a Column Generator stage to create a common column in both tables.
In its properties, select the name of the column to create, and drag and drop the columns to the target.
Now go to the Join stage and select as the key the column which we have created (you can give it any name; based
on the business requirement give it an understandable name).
In the output, drag and drop all the required columns and give a file name to the target file, then compile and run the
job.
The Join stage doesn't support a reject link. I want to capture the reject data, but how?
Join the multiple inputs (left, right, intermediate) using a left outer join -> the resulting data will have null
values in the columns coming from the right and intermediate tables for unmatched rows -> use a Filter stage
to capture the rows with those null values, as sketched below.
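A Python sketch of that workaround (an illustration of the idea only, not DataStage code; the keys and values are made up):

# Left outer join, then filter the rows whose right-hand columns are NULL (None)
# to recover the records that an inner join would have dropped.
left = [(1, "AABB"), (2, "BBCC"), (3, "BBDD")]
right = {1: 2000, 2: 3000}                      # keyed on the join column

joined   = [(key, name, right.get(key)) for key, name in left]   # left outer join
matched  = [row for row in joined if row[2] is not None]
rejected = [row for row in joined if row[2] is None]             # the would-be reject data

print(matched)   # [(1, 'AABB', 2000), (2, 'BBCC', 3000)]
print(rejected)  # [(3, 'BBDD', None)]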
Lookup Stage: This is one of the processing stages. It supports N inputs and 1 output and
1 reject link. The first input is the primary (stream) link and the remaining inputs are reference links. The
lookup key columns need not have the same names in the primary and reference links. Only
two inputs are allowed in the case of a sparse lookup.
On the reject link, the data comes from the primary source. If there are duplicate values in a
reference table, the stage throws a warning that the duplicates are being ignored; apart from this warning the lookup proceeds.
Memory usage is heavy, because all the reference data is first stored in buffer memory
and then the lookup is performed, so it takes more memory.
If you are using a reject link, by default the stage treats a lookup failure as an inner join (drop).
If we are not writing a condition in the lookup constraints, there is no need to give a value for
the Condition Not Met option.
Lookup Failure options: Continue (left outer join), Drop (inner join), Fail (the job aborts if there is unmatched data
on the primary link), Reject (inner join, with unmatched rows sent to the reject link).
Lookup Types: Range, Equality and Caseless. These types are set on the reference link in the Lookup
stage.
Sparse: This works only with database tables, and with two inputs only. It is specified on the
source reference (database) stage, and the key columns should be the same in a sparse lookup.
In a normal lookup the stage first loads the entire reference data into buffer memory and then
performs the lookup against the primary data, whereas a sparse lookup does not create this special buffer
memory.
The Lookup stage in DataStage 8 is an enhanced version of what was present in earlier
DataStage releases. This article takes a deep dive into the new Lookup stage and
the various options it offers. Even though the Lookup stage can't be used in cases where
huge amounts of data are involved (since it requires the reference data to be present in memory for
operations), it still warrants its own place in job designs. This is because the Lookup stage
offers a bit more than the other conventional lookup-style stages like Join and Merge.
Source
1001 AABB IT
1002 BBCC IT
1003 BBDD BS
Reference
1001 2000 IT Q1
1001 3000 IT Q2
1001 4000 IT Q3
Now if you use the Lookup stage with EmpID as the key, only one matching reference record is
retrieved for each primary record.
But if you have a closer look at the data you can see that the reference table actually has
three records for that ID, yet the lookup retrieved only one of them.
If you need to retrieve all 3 records for that ID, you will have to enable the option on the
constraints page that returns multiple rows from the reference link.
There are a host of other options also available on the constraints page.
In addition to the lookup, the stage also gives us the option of checking whether the data satisfies a
particular condition, such as Salary > 2000. All such additional conditions that you want to
check can be done in this area. How the job behaves during a lookup is determined by the
Condition Not Met or Lookup Failure option. The four options available for this tab are
Continue, Drop, Reject and Fail. Condition Not Met option will be applicable if you provide a
condition check. If you do not provide such a check then the values in the Condition Not
Met option will not make a difference.
The Continue option will allow the job to continue without failing and the retrieved
reference value will be populated as NULL. If the value is specified as Drop, then the
records will be dropped from the data set if the lookup/condition has failed. If the option is
specified as Reject, then all records that failed lookup will go to the reject link. You should
remember to provide a reject link to the lookup stage if this option is set. Else your job will
fail. If you specify the value as Fail, then the job will move to the aborted state whenever a
lookup fails against the reference dataset.
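The four behaviours can be sketched in Python as follows (a conceptual illustration only, not DataStage code; the sample keys echo the source/reference data above):

# Lookup failure handling: continue, drop, reject or fail.
primary   = [(1001, "AABB"), (1002, "BBCC"), (1003, "BBDD")]
reference = {1001: "IT", 1002: "IT"}            # 1003 has no match

def lookup(rows, ref, on_failure="continue"):
    output, rejects = [], []
    for key, name in rows:
        if key in ref:
            output.append((key, name, ref[key]))
        elif on_failure == "continue":
            output.append((key, name, None))    # unmatched reference columns become NULL
        elif on_failure == "drop":
            pass                                # record silently dropped
        elif on_failure == "reject":
            rejects.append((key, name))         # record goes down the reject link
        elif on_failure == "fail":
            raise RuntimeError("lookup failed for key %s" % key)  # job aborts
    return output, rejects

print(lookup(primary, reference, on_failure="continue"))
print(lookup(primary, reference, on_failure="reject"))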
The Lookup stage gives us 3 different lookup options. The first is Equality, which is the
normal lookup: the data is looked up for an exact match (case sensitive). The second option is
the Caseless match, which does exactly what the name indicates. The third and final option is
the Range. This allows you to define a range lookup on the stream link or a reference link of
a Lookup stage. On the stream link, the lookup compares the value of a source column to a
range of values between two lookup columns. On the reference link, the lookup compares
the value of a lookup column to a range of values between two source columns.
Merge Stage: This is a one of the processing stage. It can have any number of input links,
a single output link, and the same number of reject links as there are update input links.
The Merge stage is one of three stages, along with the Join stage and the Lookup stage, that combine
data sets on key columns.
The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for the data being input (for example, whether it is
sorted).
The Merge stage combines a master data set with one or more update data sets. The
columns from the records in the master and update data sets are merged so that the output
record contains all the columns from the master record plus any additional columns from
each update record that are required. A master record and an update record are merged
only if both of them have the same values for the merge key column(s) that you specify.
Merge key columns are one or more columns that exist in both the master and update
records.
For unmatched master records the stage has two options: 1. Drop (inner join) 2. Keep (left outer join).
The data sets input to the Merge stage must be key partitioned and sorted. This ensures that
rows with the same key column values are located in the same partition and will be
processed by the same node. It also minimizes memory requirements because fewer rows
need to be in memory at any one time. Choosing the auto partitioning method will ensure
that partitioning and sorting is done. If sorting and partitioning are carried out on separate
stages before the Merge stage, InfoSphere DataStage in auto partition mode will detect
this and not repartition (alternatively you could explicitly specify the Same partitioning
method).
As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must
remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links
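The basic idea of merging a master data set with an update data set can be sketched in Python (an illustration only, not DataStage code; the horse names, colours and dates are made up):

# The output carries the master columns plus the extra columns from the matching update record.
master  = [("Dolly", "brown"), ("Trigger", "white")]          # (Horse, colour)
update1 = {"Dolly": "2009-01-01", "Champion": "2008-07-12"}   # Horse -> last vaccination date

merged = []
for horse, colour in master:
    if horse in update1:
        merged.append((horse, colour, update1[horse]))   # matched: update columns added
    else:
        merged.append((horse, colour, None))             # Keep mode: master row kept, update columns NULL
# update records with no matching master can be sent down that update link's reject link
rejects = [horse for horse in update1 if horse not in dict(master)]

print(merged)   # [('Dolly', 'brown', '2009-01-01'), ('Trigger', 'white', None)]
print(rejects)  # ['Champion']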
Example merge
This example shows what happens to a master data set and two update data sets when they
are merged. The key field is Horse, and all the data sets are sorted in descending order. Here
is the master data set:
[Master and update data set tables not reproduced here; the column headings visible in the sample are Horse, Freezemark, Mchip, Reg. Soc, Level, last vacc., worm, last trim, and Shoes.]
Funnel Stage:
The Funnel stage is a processing stage. It copies multiple input data sets to a single output
data set. This operation is useful for combining separate data sets into a single large data
set. The stage can have any number of input links and a single output link.
For all methods the meta data of all input data sets must be identical.
The sort funnel method has some particular requirements about its input data. All input data sets must be
sorted by the same key columns that are to be used by the funnel operation.
Typically all input data sets for a sort funnel operation are hash-partitioned before they're sorted (choosing the
auto partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the
same key column values are located in the same partition and so are processed on the same node. If sorting
and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be
preserved.
The sortfunnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage
first examines the primary key in each input record. For multiple records with the same primary key value, it
then examines secondary keys to determine the order of records it will output.
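The effect of the sort funnel can be sketched in Python (an illustration only, not DataStage code; the key/value rows are made up):

# Each input is already sorted on the same key; the funnel merges them into one
# output stream that stays sorted on that key.
import heapq

input1 = [(1, "a"), (4, "d"), (7, "g")]
input2 = [(2, "b"), (5, "e")]
input3 = [(3, "c"), (6, "f")]

funnelled = list(heapq.merge(input1, input2, input3, key=lambda row: row[0]))
print(funnelled)
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f'), (7, 'g')]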
Filter Stage: This is one of the processing stages. It can have a single input link and any number of
output links and, optionally, a single reject link. To get the reject link you need to follow this navigation:
select an output stream link -> right-click -> convert the stream link to a reject link.
We can write conditions on multiple columns. And it will support multiple operators
(=,<,>,<=,>=,AND,OR,NOT).
The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified
requirements and filters out all other records. You can specify different requirements to route rows down
different output links. The filtered out records can be routed to a reject link, if required.
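The routing behaviour can be sketched in Python (a conceptual illustration only, not DataStage code; the column names and where clauses are made up):

# Each where clause sends matching rows, unmodified, down its own output link;
# rows that match no clause can be routed to the reject link.
rows = [{"empno": 1, "sal": 900}, {"empno": 2, "sal": 2500}, {"empno": 3, "sal": 5000}]

where_clauses = [
    ("low_paid",  lambda r: r["sal"] < 1000),
    ("high_paid", lambda r: r["sal"] >= 3000),
]

outputs = {name: [] for name, _ in where_clauses}
reject = []
for row in rows:
    matched = False
    for name, predicate in where_clauses:
        if predicate(row):
            outputs[name].append(row)
            matched = True
    if not matched:
        reject.append(row)

print(outputs)  # {'low_paid': [{'empno': 1, 'sal': 900}], 'high_paid': [{'empno': 3, 'sal': 5000}]}
print(reject)   # [{'empno': 2, 'sal': 2500}]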
External Filter: This is one of the processing stages. It supports 1 input and 1 output. It runs a UNIX
filter command against the data, grep being the most common example.
The External Filter stage allows you to specify a UNIX command that acts as a filter on the data you are
processing. This can be a quick and efficient way of filtering data.
Switch Stage:
The Switch stage is a processing stage. It can have a single input link, up to 128 output links and a single
reject link.
Example
switch (selector)
{
case 0: // if selector = 0,
// write record to output data set 0
break;
case 10: // if selector = 10,
// write record to output data set 1
break;
case 12: // if selector = discard value (12)
// skip record
break;
default: // if selector is invalid,
// send row down reject link
};
The metadata input to the Switch stage is as follows (Table 1. Column definitions):
Select (Integer)
col1 (Char)
col2 (Char)
col3 (Char)
The column called Select is the selector; the value of this determines which output link the rest of the
row will be output to.
Copy Stage:
The Copy stage copies a single input data set to a number of output data sets. Each record of the input
data set is copied to every output data set. Records can be copied without modification, or you can drop or
change the order of columns (to copy with more modification - for example changing column data types -
use the Modify stage as described in Modify stage). Copy lets you make a backup copy of a data set on
disk while performing an operation on another copy, for example.
When you are using a Copy stage with a single input and a single output, you should ensure that you set
the Force property in the stage editor to True. This prevents InfoSphere DataStage from deciding that
the Copy operation is superfluous and optimizing it out of the job.
Example
In this example you are going to copy data from a table containing billing information for GlobalCo's
customers. You are going to copy it to three separate data sets, and in each case you are only copying a
subset of the columns. The Copy stage will drop the unwanted columns as it copies the data set.
The column names for the input data set are as follows:
BILL_TO_NUM
CUST_NAME
ADDR_1
ADDR_2
CITY
REGION_CODE
ZIP
ATTENT
COUNTRY_CODE
TEL_NUM
FIRST_SALES_DATE
LAST_SALES_DATE
REVIEW_MONTH
SETUP_DATE
STATUS_CODE
REMIT_TO_CODE
CUST_TYPE_CODE
CUST_VEND
MOD_DATE
MOD_USRNM
CURRENCY_CODE
CURRENCY_MOD_DATE
MAIL_INVC_FLAG
PYMNT_CODE
YTD_SALES_AMT
CNTRY_NAME
CAR_RTE
TPF_INVC_FLAG
INVC_CPY_CNT
INVC_PRT_FLAG
FAX_PHONE
FAX_FLAG
ANALYST_CODE
ERS_FLAG
Here is the job that will perform the copying:
The Copy stage properties are fairly simple. The only property is Force, and you do not need to set it in
this instance as you are copying to multiple data sets (and InfoSphere DataStage will not attempt to
optimize it out of the job). You need to concentrate on telling InfoSphere DataStage which columns to
drop on each output link. The easiest way to do this is using the Output page Mapping tab. When you
open this for a link the left pane shows the input columns, simply drag the columns you want to preserve
across to the right pane. You repeat this for each link as follows:
Figure 2. Mapping tab: first output link
Modify Stage:
The Modify stage alters the record schema of its input data set. The modified data set is then output. You can
drop or keep columns from the schema, or change the data type of a column.
Note:
1. We can drop and keep columns in the Modify stage (specification syntax, for example: KEEP sal or DROP empno).
Important point: you need to define the metadata on the Modify stage output tab for the columns which you
need to appear in the output.
Aggregator stage:
The Aggregator stage is a processing stage. It classifies data rows from a single input
link into groups and computes totals or other aggregate functions for each group. The
summed totals for each group are output from the stage via an output link.
Example
The example data is from a freight carrier who charges customers based on distance, equipment, packing, and
license requirements. They need a report of distance traveled and charges grouped by date and license type.
[Table 1. Sample of data: not reproduced here.]
The output column definitions are:
DistanceSum (Decimal)
DistanceMean (Decimal)
ChargeSum (Decimal)
ChargeMean (Decimal)
License (Char)
Shipdate (Date)
The stage first hash partitions the incoming data on the license column, then sorts it on license and date.
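The grouping and aggregation itself can be sketched in Python (an illustration only, not DataStage code; the shipdate, license and numeric values are made up):

# Group on license and shipdate, then compute the sum and mean of distance and charge.
from collections import defaultdict

rows = [                      # (shipdate, license, distance, charge)
    ("2013-06-10", "BUN", 1540.0, 1300.0),
    ("2013-06-10", "BUN", 1320.0, 1200.0),
    ("2013-06-12", "SUM", 1760.0, 1500.0),
]

groups = defaultdict(list)
for shipdate, license_type, distance, charge in rows:
    groups[(license_type, shipdate)].append((distance, charge))

for (license_type, shipdate), values in groups.items():
    distances = [d for d, _ in values]
    charges   = [c for _, c in values]
    print(license_type, shipdate,
          sum(distances), sum(distances) / len(distances),   # DistanceSum, DistanceMean
          sum(charges),   sum(charges) / len(charges))       # ChargeSum, ChargeMean
# BUN 2013-06-10 2860.0 1430.0 2500.0 1250.0
# SUM 2013-06-12 1760.0 1760.0 1500.0 1500.0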
Pivot Stage:The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
The Pivot Enterprise stage is in the Processing section of the Palette pane.
Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows. The output
data of the horizontal pivot action typically has fewer columns, but more rows than the input data. With vertical
pivoting, you can map several sets of input columns to several output columns.
Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array size
determines the number of rows in the output data. The output data of the vertical pivot action typically has more
columns, but fewer rows than the input data.
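Horizontal pivoting can be sketched in Python (an illustration only, not DataStage code; the custid and quarter columns are made up):

# A set of quarter columns in one input row is mapped to a single column
# spread over several output rows.
input_rows = [{"custid": 1, "q1": 100, "q2": 120, "q3": 90}]

pivoted = []
for row in input_rows:
    for quarter in ("q1", "q2", "q3"):
        pivoted.append({"custid": row["custid"], "quarter": quarter, "sales": row[quarter]})

print(pivoted)
# [{'custid': 1, 'quarter': 'q1', 'sales': 100},
#  {'custid': 1, 'quarter': 'q2', 'sales': 120},
#  {'custid': 1, 'quarter': 'q3', 'sales': 90}]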
Compress stage
The Compress stage is a processing stage. It can have a single input link and a single output link.
The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set
from a sequence of records into a stream of raw binary data. The complement to the Compress stage is the
Expand stage.
A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set
stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until
its rows are returned to their normal format. Stages that do not perform column-based processing or reorder
the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of
the compressed data set.
Because compressing a data set removes its normal record boundaries, the compressed data set must not be
repartitioned before it is expanded.
Note: it is suggested to use a Data Set stage for the target, and in this example there is no need to define the metadata on
the output tab of the Compress stage.
Expand Stage
The Expand stage is a processing stage. It can have a single input link and a single output link.
The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously
compressed data set back into a sequence of records from a stream of raw binary data. The complement to the
Expand stage is the Compress stage which is described in Compress stage.
Note: there is no need to load the metadata for the source data set in this example.
Change Capture Stage:
The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of
the differences.
The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set
whose records represent the changes made to the before data set to obtain the after data set. The stage
produces a change data set, whose table definition is transferred from the after data set's table definition with
the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit.
The preserve-partitioning flag is set on the change data set.
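The idea of tagging each output row with a change code can be sketched in Python (a conceptual illustration only, not DataStage code; the keys and values are made up, and the textual codes stand in for whatever codes the stage is configured to emit):

# Compare the before and after data sets on a key and tag each row as
# insert, delete, copy or edit.
before = {1: "red", 2: "green", 3: "blue"}      # key -> value
after  = {1: "red", 2: "yellow", 4: "black"}

changes = []
for key, value in after.items():
    if key not in before:
        changes.append((key, value, "insert"))
    elif before[key] != value:
        changes.append((key, value, "edit"))
    else:
        changes.append((key, value, "copy"))    # copy records are typically dropped by default
for key, value in before.items():
    if key not in after:
        changes.append((key, value, "delete"))

print(changes)
# [(1, 'red', 'copy'), (2, 'yellow', 'edit'), (4, 'black', 'insert'), (3, 'blue', 'delete')]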