DataStage Interview Questions

1.What is IBM DataStage?

Ans. DataStage is an ETL tool offered by IBM. It is used to design, develop, and execute
applications that extract data from databases, including those on Windows servers, and
load it into data storage such as a data warehouse. It also provides graphical visualizations
of data integration flows, and it can extract data from many kinds of sources.
2. What are the characteristics of DataStage?
Ans. The main characteristics of DataStage are:
 It transforms large volumes of data using a scalable parallel processing approach.
 It supports Big Data Hadoop, with access to Big Data on a distributed file system, JSON support, and a JDBC integrator.
 It is easy to use, with improved speed, flexibility, and efficacy for data integration.
 It can be deployed on-premises or in the cloud, as needed.
3. What are Links in DataStage?


Ans. A link is a model of a data flow that connects job stages. A link connects data
sources to processing stages, processing stages to target systems, and processing
stages to each other. The data passes through links like pipes from one stage to the
next.
4. What are table definitions?
Ans. Table definitions specify the format of the data to be used at each stage of a job.
They can be shared across all projects in InfoSphere DataStage and by all the jobs in a
project. Table definitions are normally loaded at the source stages and are sometimes
also applied to target and intermediate stages.
5. What is Infosphere in DataStage?
Ans. The InfoSphere Information Server can handle high-volume requirements and
delivers high-quality, fast results. It gives firms a single platform to manage their
data, helping them to understand, cleanse, transform, and deliver vast amounts of
data.
6. What is the aggregator stage in DataStage?
Ans. The Aggregator stage in DataStage processes rows. It classifies the rows from the
input link into groups and computes totals or other aggregate values for each group,
producing one summary row per group on the output.
7. What is the Merge Stage?
Ans. The Merge stage combines a sorted master data set with one or more sorted
update data sets. The output record includes all the columns from the master record,
along with any additional columns from each update record.
8. What are the benefits of Flow Designer?
Ans. There are many benefits of Flow Designers. For instance:
1. No need to migrate jobs
2. Quickly work with your favorite jobs
3. Easily continue working where you left off
4. Efficiently search for any job
5. Cloning a job
6. Highlighting all compilation errors
7. Running a job
9. What is an HBase connector?
Ans. The HBase connector is used to connect to tables kept in the HBase database. It
performs functions like reading data from or writing data to the HBase database and
reading data in parallel mode. And using the HBase table as a lookup table in a sparse
or standard way.

10. How does Datastage manage rejected rows?


Ans. Rejected rows are managed through the Transformer's constraints. There are two
ways to achieve this. First, a reject link can be defined in the Transformer's properties to
capture the rejected rows. And second, temporary storage can be created for them by
using the REJECTED command.
11. What is the method for removing duplicates in DataStage?
Ans. The sort function in DataStage can be used to remove duplicates. We must set
the option that allows duplicates to false when running the sort function.
12. What is Hive Connector?
Ans. A Hive connector is a tool used to support partition modes while reading the data.
It can be done in two ways:
 modulus partition mode
 minimum-maximum partition mode
13. Who are the DataStage clients or users?
Ans. The DataStage tool can be used by the following:
 DataStage Administrator
 DataStage Designer
 DataStage Manager
 DataStage Director
14. How is DataStage different from Informatica?
Ans. Both DataStage and Informatica are capable ETL tools. However, there are some
differences between the two. DataStage supports parallelism and partitioning concepts
for node configuration, but Informatica lacks parallelism abilities in node configuration.
DataStage is also simpler to use when compared to Informatica.
15. What are the Stages in DataStage?
Ans. Stages serve as InfoSphere DataStage's structural building blocks. It offers a
unique set of functions for performing complex or simple data integration tasks. The
steps that will be taken to process the data are stored and described in stages.
Intermediate DataStage Interview Questions
After revising the concepts in the easy level DataStage Interview Questions, we can
move forward with some Medium level questions. So let's discuss some Medium level
questions in DataStage Interview Questions.
16. What are Operators in DataStage?
Ans. Operators are used in the parallel job stages. A single stage may correspond to a
single operator or to multiple operators. The number of operators depends on the
properties you have selected. InfoSphere DataStage evaluates your job design during
compilation and occasionally optimizes operators.
17. Explain the Metadata Repository tier of the Infosphere
Information Server briefly.
Ans. The computer, the analysis database, and the metadata repository are all parts of
the InfoSphere Information Server's metadata repository tier. It stores the shared
metadata, data, and configuration information used by the product modules.
18. How do we clean a DataStage repository?
Ans. Go to the DataStage Manager > Job in the menu bar > Clean Up Resources to
clean a DataStage repository. We must go to the respective jobs and clean up the log
files if we want to remove the logs further.
19. What are the jobs available in DataStage?
Ans. There are mainly four jobs present in DataStage.
 Server job
 Parallel job
 Sequencer job
 Container job
20. What is NLS in DataStage?
Ans. NLS stands for National Language Support. It can be used to incorporate other
languages, such as French, German, and Spanish, into the data that the data
warehouse needs to process, and it also supports multi-byte character languages such
as Chinese and Japanese.
21. Describe the feature of data type conversion in DataStage.
Ans. The data conversion function in DataStage can be used to convert data types. For
this to execute correctly, the input and output interfaces of the operator must match,
and the record schema must be compatible with the operator.
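For example (the link and column names here are only illustrative), a parallel Transformer derivation such as StringToDate(in_link.order_date, "%yyyy-%mm-%dd") converts a string column to a date, provided the output column is defined with a compatible date type.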
22. Explain the different types of hash files in the Datastage.
Ans. Static and Dynamic Hash Files are DataStage's two kinds of hash files. A static
hash file is used when only a finite amount of data needs to be put into the targeted
database. A dynamic hash file is used when loading an unknown amount of data from
the source file.
23. Explain the Services tier of Infosphere Information Server
briefly.
Ans. The Infosphere Information Server's services tier gives many standard services.
For instance, metadata, logging, and other module-specific services. In addition to
services for the different product modules, it includes an application server.
24. How to validate and compile a job in DataStage?
Ans. Validating and compiling are checks run on a DataStage job. When validating a
job, the DataStage engine verifies that connections and other properties are correctly
declared; when compiling, it confirms that all defined properties are valid and produces
an executable job.
25. Explain the DataStage architecture briefly.
Ans. The architecture of IBM DataStage is client-server, with multiple architecture kinds
for the various versions. The following are the components of the client-server
architecture:
1. Client components
2. Stages
3. Servers
4. Table definitions
5. Containers
6. Projects
7. Jobs
26. Explain the different types of Lookups in Datastage.
Ans. In DataStage, there are two kinds of lookups: normal and sparse. In a normal
lookup, the reference data is first loaded into memory before the lookup is performed. In
a sparse lookup, the lookup query is sent directly to the database for each input row. A
sparse lookup can therefore be faster than a normal lookup when the number of input
rows is small compared with the reference data.
27. Describe the engine tier in the information server.
Ans. The logical group of components (the InfoSphere Information Server engine
components, service agents, etc.) and the machine on which those components are
installed are both included in the engine tier. For product modules, the engine runs jobs
and other tasks.
28. What is Data Pipelining?
Ans. The process of extracting records from the data source system and moving them
through the sequence of processing stages defined in the job's data flow is known as
data pipelining. Records can be processed without writing them to disk since they are
moving through the pipeline.
29. How do I optimize the performance of DataStage jobs?
Ans. We need to select the proper configuration files first. The next step is to choose
the appropriate partition and buffer memory. Data sorting and handling null-time values
are challenges we should address. As an alternative to the transformer, we should try
using modify, copy, or filter. Reduce the spread of unwarranted metadata between
various stages.
30. What are Players in DataStage?
Ans. Players are the processes that do the main work in a parallel job. Typically, there is
one player per operator on each node. There is one section leader per processing node,
and players are the children of section leaders. The conductor process, which runs on
the conductor node, starts the section leaders (the conductor node is defined in the
configuration file).
Advanced Level DataStage Interview Questions
It’s time to practice some problematic DataStage Interview Questions. Below is the list
of some hard-level DataStage Interview Questions.
31. What are the different types of join stages in DataStage?
In DataStage, the different types of join stages are:
 Join Stage: Performs inner, left outer, right outer, or full outer joins on sorted, key-partitioned inputs.
 Merge Stage: Merges a sorted master data set with one or more sorted update data sets based on specified keys.
 Lookup Stage: Performs a join by looking up values from a reference dataset.
32. How does DataStage handle rejects in a job?
DataStage handles rejects by redirecting them to a reject link in the job. The Reject Link
captures records that fail to meet the conditions specified in the stage.
33. Explain the difference between Sequential File Stage and
Dataset Stage in DataStage.
 Sequential File Stage: Reads data from or writes data to a sequential file. It
is suitable for small to medium-sized datasets.
 Dataset Stage: Reads data from or writes data to a dataset, which is a
collection of files. It is optimized for handling large volumes of data.
34. What is a Transformer Stage in DataStage, and how is it
used?
The Transformer Stage in DataStage is used for performing complex data
transformations. It provides a graphical interface to design transformation logic using
drag-and-drop functionality, making it user-friendly and efficient.
35. How can you optimize performance in DataStage jobs?
Performance optimization in DataStage jobs can be achieved by:
 Using efficient stage configurations.
 Partitioning data to parallelize processing.
 Limiting unnecessary data movements.
 Optimizing SQL queries and database connections.
 Utilizing job design best practices.
36. What is the difference between persistent and transient
DataStage variables?
 Persistent Variables: Retain their values between job runs and are stored
in the DataStage repository. They are useful for passing values between job
runs.
 Transient Variables: Exist only during the execution of a job and are not
stored in the repository. They are suitable for temporary calculations within a
job.
37. Explain the concept of job control in DataStage.
Job control in DataStage refers to the process of managing job execution, including
scheduling, monitoring, and error handling. It involves defining job dependencies,
setting job parameters, and orchestrating the execution flow to ensure smooth job
execution.
38. What are the different types of stages available in DataStage?
DataStage provides various types of stages for performing specific tasks:
 Input Stages: Read data from external sources.
 Processing Stages: Perform transformations and manipulations on the
data.
 Output Stages: Write data to target systems.
 Control Stages: Control the flow of data and job execution.
39. How do you handle incremental loading in DataStage?
Incremental loading in DataStage involves loading only the new or changed data since
the last load. It can be achieved by using techniques such as the following (a short example appears after this list):
 Using change data capture (CDC) mechanisms.
 Implementing job parameters to track the last loaded timestamp.
 Utilizing lookup stages to identify new or updated records.
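As a simple illustration (the table, column, and parameter names below are hypothetical), the source stage's user-defined SQL can reference a job parameter that holds the last load time, for example a filter such as WHERE updated_at > '#LastLoadTimestamp#'; after a successful run, the controlling job or sequence updates the parameter value so that the next run picks up only newer rows.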
40. Explain the concept of parallel processing in DataStage.
Parallel processing in DataStage involves dividing data processing tasks into smaller
units and executing them simultaneously on multiple processing nodes. It improves job
performance and scalability by leveraging the processing power of distributed
computing environments.
41. How can you handle errors and exceptions in DataStage
jobs?
Errors and exceptions in DataStage jobs can be handled using:
 Reject links to capture and handle invalid records.
 Exception handling stages such as the Exception Stage to handle runtime
errors.
 Job sequencers to define error-handling workflows and recovery
mechanisms.
42. What is the purpose of the Balanced Optimization option in
DataStage?
The Balanced Optimization option in DataStage optimizes job performance by
dynamically adjusting data partitioning and processing to ensure a balanced workload
distribution across processing nodes. It helps maximize resource utilization and
minimize job execution time.
43. How do you monitor and manage DataStage jobs in a
production environment?
In a production environment, DataStage jobs can be monitored and managed using:
 DataStage Director for job monitoring, debugging, and job execution control.
 IBM Control Center for centralized management, monitoring, and reporting
of DataStage jobs across multiple environments.
 Custom scripts or automation tools for scheduling, job orchestration, and
alerting.
44. What are the key components of a DataStage job design?
Key components of a DataStage job design include:
 Stages: Input, processing, output, and control stages.
 Links: Connections between stages to define the flow of data.
 Parameters: Variables used to customize job behavior.
 Job Sequencers: Control elements to orchestrate job execution flow.
 Job Properties: Configuration settings such as job name, description, and
environment details.
45. How do you handle data quality issues in DataStage?
Data quality issues in DataStage can be addressed by:
 Implementing data validation rules using constraints and business rules.
 Using data cleansing techniques such as standardization, deduplication, and
error correction.
 Integrating data quality tools and libraries to identify and resolve data
anomalies.
 Establishing data governance practices and quality monitoring mechanisms.
DataStage is a popular ETL (Extract, Transform, Load) tool that is part of the IBM
InfoSphere Information Server suite. DataStage is used by organizations working with
large data sets and warehouses for data integration from the data source system
to the target system. Top DataStage job roles are DataStage Developer, ETL
Developer, DataStage Production Manager, etc.

In this article, we have shared a list of the most frequently asked IBM
DataStage interview questions and the answers to the questions. These
DataStage interview questions and answers are beneficial for beginners as
well as experienced professionals to crack DataStage interviews.

These questions will cover key concepts like DataStage and Informatica,
DataStage routine, lookup stages comparison, join, merge, performance
tuning of jobs, repository table, data type conversions, quality state, job
control, etc.

Basic DataStage Interview Questions

1. The most basic DataStage interview question is to define DataStage.

DataStage is an ETL (extract, transform, load) tool for Windows servers that integrates data from
source databases into a data warehouse. It is used to design, develop, and run different
applications that fill data into data warehouses and data marts. DataStage is an essential part of the
IBM InfoSphere Data Integration Suite.

Aspect-wise comparison of a Data Warehouse and a Data Mart:

 Definition: a data warehouse is a centralized repository that consolidates data from multiple sources for analysis and reporting; a data mart is a subset of a data warehouse focused on a specific business area or department.
 Scope: a data warehouse covers the entire organization, supporting enterprise-wide data analysis; a data mart is limited to a specific function, team, or department, such as marketing, sales, or finance.
 Size: a data warehouse is larger, holding vast amounts of data from multiple domains; a data mart is smaller, containing focused datasets relevant to a specific area.
 Data Source: a data warehouse pulls data from various sources, including internal and external systems; a data mart is typically sourced from a data warehouse or specific operational systems.
 Complexity: a data warehouse has high complexity due to diverse data integration and transformation processes; a data mart has lower complexity, tailored for quick and specific analyses.
 Usage: a data warehouse is used for enterprise-wide strategic decision-making and comprehensive reporting; a data mart is used for tactical or departmental decision-making and operational tasks.
 Users: a data warehouse serves business analysts, data scientists, and executives across the organization; a data mart serves departmental users or managers focused on specific metrics.
 Time to Build: a data warehouse takes longer to implement due to extensive integration, modeling, and scalability requirements; a data mart is faster to implement as it deals with a smaller scope of data.
 Maintenance: a data warehouse requires significant maintenance for updates, backups, and scalability; a data mart is easier to maintain due to its smaller size and specific focus.
 Data Granularity: a data warehouse contains detailed, granular data as well as aggregated summaries; a data mart primarily contains aggregated data relevant to the specific business area.

Summary:

 A Data Warehouse is a comprehensive, centralized system designed for enterprise-wide analytics.
 A Data Mart is a smaller, more focused system designed to meet the needs of a specific department or business unit.

2. What are DataStage characteristics?

DataStage supports the transformation of large volumes of data using a scalable parallel
processing approach.

It supports Big Data Hadoop by accessing Big Data in many ways, like on a distributed file
system, JSON support, and JDBC integrator.

DataStage is easy to use, with its improved speed, flexibility, and efficacy for data integration.

DataStage can be deployed on-premises or in the cloud as need be.

3. How is a DataStage source file populated?

A source file can be filled in many ways, like by creating a SQL query in Oracle, or through a
row generator extract tool.

4. How is merging done in DataStage?


Merging or joining of two or more tables can be done based on the primary key column in the
tables.

5. One of the most frequently asked dataStage interview questions is what is the
difference between DataStage 7.0 and 7.5?

DataStage 7.5 comes with many new stages added to version 7.0 for increased stability and
smoother performance. The new features include the command stage, procedure stage,
generating the report, and more.

6. What are data and descriptor files?

A data file contains only data, while a descriptor file contains all information or description
about data in data files.

7. One of the most frequently asked DataStage interview questions is: differentiate between DataStage and Informatica.

Both DataStage and Informatica are powerful ETL tools. While DataStage has the concept of
parallelism and partition for node configuration, Informatica does not support parallelism in node
configuration. DataStage is simpler to use than Informatica, but Informatica is more scalable.

8. What is a routine, and what are the types?


A routine is a collection of functions defined by the DataStage manager. There are 3 types of
routines, namely parallel routines, mainframe routines, and server routines.

9. How to write parallel routines.

Parallel routines are written in C or C++ and compiled into a shared library. We can create such
routines in the DataStage Manager, and they can then be called from the Transformer stage.
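Below is a minimal, hypothetical sketch of such a routine, assuming it is compiled into a shared library (for example, libmyroutines.so) and defined in the Designer as an external function whose return type and argument types match the C signature; the function and library names are illustrative only.

#include <cctype>

// Hypothetical parallel routine: upper-cases a null-terminated string in place.
// After building this into a shared library, the routine is registered in the
// Designer (routine type: parallel, external function) and can then be called
// from a Transformer derivation like a built-in function.
extern "C" char* str_to_upper(char* input)
{
    for (char* p = input; p != nullptr && *p != '\0'; ++p) {
        *p = static_cast<char>(std::toupper(static_cast<unsigned char>(*p)));
    }
    return input;
}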

10. How are duplicates removed in DataStage?

The sort function can be used to remove duplicates in DataStage. While running the sort
function, the user should set the option that allows duplicates to false.

11. What is the difference between the join, merge, and lookup stages?
These stages differ in how they use memory, in their input requirements, and in how they
treat records such as duplicates and unmatched rows. Join and Merge need less memory than Lookup.

12. How to convert server job to parallel job in DataStage?

We can convert a server job into a parallel job with the help of an IPC collector and a link
collector.

13. What is an HBase connector?

It is a tool used to connect databases and tables that are present in the HBase database. It can be
used to carry out tasks like:

Read data in parallel mode

Read/write data from and to the HBase database.

Use HBase as a view table

Intermediate DataStage Interview Questions

14. What steps should be taken to improve DataStage jobs?

First, we have to establish baselines. Also, we shouldn't use only one flow for performance
testing. Work should be done incrementally. Evaluate data skews and, thereafter, isolate and
solve the problems one by one. Next, distribute the file systems to avoid bottlenecks, if any. Do
not include RDBMS at the beginning of the testing phase. Finally, identify and examine the
available tuning knobs.

15. What is the quality state in DataStage?

QualityStage is used for data cleansing with the DataStage tool. It is client-server software
provided as part of the IBM Information Server.
16. One of the most frequently asked DataStage interview questions is: define job
control.

Job control is a tool used for controlling a job or executing multiple jobs parallelly. The Job
Control Language within the IBM datastage tool is used to deploy job control.

17. How to do DataStage job performance tuning?

First, we choose the right configuration files, partition, and buffer memory. We take care of data
sorting and handling null-time values. We should try to use copy, modify or filter rather than the
transformer. The propagation of unnecessary metadata between stages needs to be reduced.

18. What is a repository table in DataStage?

A repository table or data warehouse is used for answering ad-hoc, historical, analytical, or
complex queries. A repository can be centralized or distributed.

19. Another frequently asked DataStage interview question is: how can you kill a
DataStage job?

We need to first kill the individual process ID of the job so that the DataStage job is killed.

20. Compare Validated OK with Compiled Process in DataStage.

The Validated OK process ensures that the connections are valid, whereas the Compiled process
makes sure that important stage parameters are correctly mapped so that it creates an executable
job.

21. Explain Datatype conversion in DataStage.

We can use the data conversion function for data type conversion in DataStage. We must make
sure that the input and output interfaces of the operator match, and the record schema must be
compatible with the operator.

22. What is an exception activity?

If there is an unfamiliar error occurring during the execution of the job sequencer, all the stages
following the exception activity are executed. Hence, exception activity is very important in
DataStage.
23. Describe DataStage architecture briefly.

The DataStage architecture follows a client-server model with different architecture types for
different versions. The main components of the model are:

Client components

Servers

Jobs

Stages

Projects

Table Definitions

Containers

Advanced DataStage Interview Questions

24. What are the command line functions that can help to import and export DS jobs?

The dsimport.exe is used to import DS jobs, and the dsexport.exe is used for export.

25. Name the different types of lookups in DataStage.

There are normal, sparse, range, and caseless lookups.

26. How do you run a job using command line?

This is how we run a job using command line:

dsjob -run -jobstatus <projectname> <jobname>
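For example, assuming a project named dstage1 and a job named LoadCustomers (names chosen only for illustration):

dsjob -run -jobstatus dstage1 LoadCustomers

The -jobstatus option makes dsjob wait for the job to finish and return the job's status through the command's exit code.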


27. What is Usage Analysis?

To check whether a certain job is part of a sequence, we right-click the job in the DataStage
Manager and select Usage Analysis.

28. Another frequently asked DataStage interview question is: what is the difference
between a sequential file and a hash file?

A hash file, being based on a hash algorithm, can be used with a key value. However, a sequential
file does not have any key-value column.

A hash file can be used as a lookup reference, while a sequential file cannot be used for a lookup. It is
easier to search a hash file due to the presence of the hash key.

1. Mention DataStage characteristics.

Support for Big Data Hadoop: access Big Data on a distributed file system, JSON support, and a JDBC integrator.
Ease of use: improved speed, flexibility, and efficacy for data integration.
Deployment: on-premises or in the cloud, as the need dictates.

2. What is IBM DataStage?

DataStage is an extract, transform, and load tool that is part of the


IBM Infosphere suite. It is a tool that is used for working with large
data warehouses and data marts for creating and maintaining a
data repository.

3. How is a DataStage source file filled?


We can develop a SQL query, or we can use a row generator extract tool, through which we can
fill the source file in DataStage.

4. How is merging done in DataStage?

In DataStage, merging is done when two or more tables are


expected to be combined based on their primary key column.

5. What are data and descriptor files?

Both these files are serving different purposes in DataStage. A


descriptor file contains all the information or description, while a
data file is the one that just contains data.

6. How is DataStage different from Informatica?

DataStage and Informatica are both powerful ETL tools, but there
are a few differences between the two. DataStage has parallelism
and partition concepts for node configuration; whereas in
Informatica, there is no support for parallelism in node
configuration. Also, DataStage is simpler to use as compared to
Informatica.


7. What is a routine in DataStage?


DataStage Manager defines a collection of functions within a
routine. There are basically three types of routines in DataStage,
namely, job control routine, before/after subroutine, and transform
function.

8. What is the process for removing duplicates in


DataStage?

Duplicates in DataStage can be removed using the sort function.


While running the sort function, we need to specify the option which
allows for duplicates by setting it to false.

9. What is the difference between join, merge, and


lookup stages?

The fundamental difference between these three stages is the


amount of memory they take. Other than that, how they treat the input requirements and
the various records are also factors that differentiate them. Based on memory usage, the
join and merge stages use a relatively small amount of memory, while the lookup stage
uses a large amount of memory because the reference data is held in memory.
10.Explain how a source file is populated?

We can populate a source file in several ways like by using a row


generator extract tool, or creating a SQL query in Oracle, etc.

11.How to convert a server job to a parallel job in


DataStage?
Using a Link collector and an IPC collector we can convert the server
job to a parallel job.

12.What is an HBase connector?

An HBase connector in DataStage is a tool used to connect


databases and tables present in the HBase database. It is used to
perform tasks like:

1. Reading data in the parallel mode.

2. Read and write data from and to the HBase database.

3. Using HBase as a view table

Intermediate DataStage Interview Questions

13. What is the quality state in DataStage?

The quality state is used for cleansing the data with the DataStage
tool. It is a client-server software tool that is provided as part of the
IBM Information Server.

14. What is job control in DataStage?

This tool is used for controlling a job or executing multiple jobs in a


parallel manner. It is deployed using the Job Control Language
within the IBM DataStage tool.

15. How to do DataStage jobs performance tuning?


First, we have to select the right configuration files. Then, we need
to select the right partition and buffer memory. We have to deal
with the sorting of data and handling null-time values. We need to
try to use modify, copy, or filter instead of the transformer. Reduce
the propagation of unnecessary metadata between various stages.

16. What is a repository table in DataStage?

The term ‘repository’ is another name for a data warehouse. It can


be centralized or distributed. The repository table is used for
answering ad-hoc, historical, analytical, or complex queries.

17. Compare massive parallel processing with


symmetric multiprocessing.

In massively parallel processing, many computers are present in the same chassis, while
in symmetric multiprocessing, many processors share the same hardware resources.
Massively parallel processing is called 'shared nothing' because nothing is shared
between the various computers, and it is faster than symmetric multiprocessing.

18. How can we kill a DataStage job?

To kill a DataStage job, we need to first kill the individual process ID, which ensures
that the DataStage job is stopped.

19. How do we compare the Validated OK with the


Compiled Process in DataStage?
The Compiled Process ensures that the important stage parameters
are mapped and these are correct such that it creates an executable
job. Whereas in the Validated OK, we make sure that the
connections are valid.

20. Explain the feature of data type conversion in


DataStage.

If we want to do data conversion in DataStage, then we can use the


data conversion function. For this to be successfully executed, we
need to ensure that the input or the output to and from the operator
is the same, and the record schema needs to be compatible with the
operator.

21. What is the significance of the exception activity in


DataStage?

Whenever there is an unfamiliar error happening while executing


the job sequencer, all the stages after the exception activity are run.
So, this makes the exception activity so important in the DataStage.

22.What is the difference between Datastage 7.5 and


7.0?

Datastage 7.5 is more robust and performs smoothly due to many


new stages which are added, such as Procedure Stage, Generate
Report, Command Stage, etc.

23.Describe the DataStage architecture briefly.


IBM DataStage preaches a client-server model as its architecture
and has different types of architecture for its various versions. The
different components of the client-server architecture are :

 Client components

 Servers

 Stages

 Table definitions

 Containers

 Projects

 Jobs

24. What are the main features of the Flow Designer?

The main features of the Flow Designer are:

1. There is no need to migrate the jobs to use the flow designer.

2. It is very useful to perform jobs with a large number of


stages.

3. We can use the provided palette to add and remove


connectors and operators on the designer canvas using the
drag and drop feature.

Advanced DataStage Interview Questions and
Answers for Experienced Professionals

25. Name the command line functions to import and


export the DS jobs?

The dsimport.exe function is used to import the DS jobs, and to


export the DS jobs, dsexport.exe is used.

26. What are the various types of lookups in DataStage?

There are different types of lookups in DataStage. These include


normal, sparse, range, and caseless lookups.

27. How can we run a job using the command line in


DataStage?

The command for running a job using the command line in


DataStage: dsjob -run -jobstatus <projectname> <jobname>

28. When do we use a parallel job and a server job?

Using the parallel job or a server job depends on the processing


need, functionality, time to implement, and cost. The server job
usually runs on a single node, it executes on a DataStage Server
Engine and handles small volumes of data. The parallel job runs on
multiple nodes; it executes on a DataStage Parallel Engine and
handles large volumes of data.
29. What is Usage Analysis in DataStage?

If we want to check whether a certain job is part of the sequence,


then we need to right-click on the Manager on the job and then
choose the Usage Analysis.

30. How to find the number of rows in a sequential file?

For counting the number of rows in a sequential file, we should use


the @INROWNUM variable.
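For example, a Transformer stage variable defined with the derivation @INROWNUM holds the current input row number as each row passes through, so after the last row it reflects the total row count; how the final value is written out depends on the job design.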

31. What is the difference between a sequential file and


a hash file?

The hash file is based on a hash algorithm, and it can be used with a
key value. The sequential file, on the other hand, does not have any
key-value column. The hash file can be used as a reference for a
lookup, while a sequential file cannot be used for a lookup. Due to
the presence of the hash key, the hash file is easier to search than a
sequential file.

32. How do we clean a DataStage repository?

For cleaning a DataStage repository, we have to go to DataStage


Manager > Job in the menu bar > Clean Up Resources.

If we want to further remove the logs, then we need to go to the


respective jobs and clean up the log files.

33. How do we call a routine in DataStage?


Routines are stored in the Routine branch of the DataStage
repository. This is where we can create, view, or edit all the
Routines. The Routines in DataStage could be the following: Job
Control Routine, Before-after Subroutine, and Transform function.

34. What is the difference between an Operational


DataStage and a Data Warehouse?

An Operational DataStage can be considered as a staging area for


real-time analysis for user processing; thus it is a temporary
repository. Whereas, the data warehouse is used for long-term data
storage needs and has the complete data of the entire business.

35. What does NLS mean in DataStage?

NLS means National Language Support. This means we can use this
IBM DataStage tool in various languages like multi-byte character
languages (Chinese or Japanese). We can read and write in any
language and process it as per the requirement.

36. In DataStage, how can you fix the truncated data error?

The truncated data error can be fixed by using the environment variable
'IMPORT_REJECT_STRING_FIELD_OVERRUN'.
37. What is a Hive connector?

A Hive connector is a tool to support partition modes while reading


the data. This can be done in two ways:

1. modulus partition mode

2. minimum-maximum partition mode

1) What is IBM DataStage?


DataStage is one of the most powerful ETL tools. It comes with the feature of
graphical visualizations for data integration. It extracts, transforms, and loads
data from source to the target.

DataStage is an integrated set of tools for designing, developing, running,


compiling, and managing applications. It can extract data from one or more data
sources, achieve multi-part conversions of the data, and load one or more target
files or databases with the resultant data.

2) Describe the Architecture of DataStage?


DataStage follows the client-server model. It has different types of client-server
architecture for different versions of DataStage.
DataStage architecture contains the following components.

o Projects
o Jobs
o Stages
o Servers
o Client Components

3) Explain the DataStage Parallel Extender (PX) or Enterprise


Edition (EE)?
DataStage PX is an IBM data integration tool. It is one of the most widely used
extractions, transformation, and loading (ETL) tools in the data warehousing
industry. This tool collects the information from various sources to perform
transformations as per the business needs and load data into respective data
warehouses.

DataStage PX is also called as DataStage Enterprise Edition.

4) Describe the main features of DataStage?


The main features of DataStage are as follows.

o DataStage provides partitioning and parallel processing techniques


which allow the DataStage jobs to process an enormous volume of
data quite faster.
o It has enterprise-level networking.
o It's a data integration component of IBM InfoSphere information server.
o It's a GUI based tool.
o In DataStage, we need to drag and drop the DataStage objects, and
also we can convert it to DataStage code.
o DataStage is used to perform the various ETL operations (Extract,
transform, load)
o It provides connectivity with different sources & multiple targets at the
same time

5) What are some prerequisites for DataStage?


For DataStage, following set ups are necessary.

o InfoSphere
o DataStage Server 9.1.2 or above
o Microsoft Visual Studio .NET 2010 Express Edition C++
o Oracle client (full client, not an instant client) if connecting to an Oracle
database
o DB2 client if connecting to a DB2 database

6) How to read multiple files using a single DataStage job if


files have the same metadata?
o Confirm that the files share the same metadata, then specify the file
names in the Sequential File stage.
o Attach the metadata to the Sequential File stage in its properties.
o Select the Read Method as 'Specific File(s)', then add all files by selecting
the 'File' property from the 'available properties to add'.
It will look like:

1. File= /home/myFile1.txt
2. File= /home/myFile2.txt
3. File= /home/myFile3.txt
4. Read Method= Specific file(s)

7) Explain IBM InfoSphere information server and highlight its


main features?
IBM InfoSphere Information Server is a leading data integration platform which
contains a group of products that enable you to understand, filter, monitor,
transform, and deliver data. The scalable solution facilitates with massively
parallel processing capabilities to help you to manage small and massive data
volumes. It assists you in forwarding reliable information to your key business
goals such as big data and analytics, data warehouse modernization,
and master data management.

Features of IBM InfoSphere information server

o IBM InfoSphere can connect with multiple source systems as well as


write to various target systems. It acts as a single platform for data
integration.
o It is based on centralized layers. All the modules of the suit can share
the baseline architecture of the suite.
o It has some additional layers for the unified repository, for integrated
metadata services, and sharing a parallel engine.
o It has tools for analysis, monitoring, cleansing, transforming and
delivering data.
o It has extremely parallel processing capabilities that provide high-
speed processing.

8) What is IBM DataStage Flow Designer?


IBM DataStage Flow Designer allows you to create, edit, load, and run jobs in
DataStage. DataStage Flow Designer (DFD) is a thin-client, web-based version of
DataStage. It is a web-based UI for DataStage, unlike the DataStage Designer,
which is a Windows-based thick client.

9) How do you run a DataStage job from the command line?

To run a DataStage job, use the "dsjob" command as follows.

1. dsjob -run -jobstatus projectname jobname

10) What are some different alternative commands


associated with "dsjob"?
Many optional commands can be used with the dsjob command to perform specific
tasks. These commands are used in the below format.

1. $dsjob <option> <arguments>

A list of commonly used options of the dsjob command is given below; a short usage example follows the list.

stop: used to stop the running job

lprojects: used to list the projects

ljobs: used to list the jobs in a project

lparams: used to list the parameters in a job

paraminfo: returns the parameters info

linkinfo: returns the link information

logdetail: used to display details like event_id, time, and message

lognewest: used to display the newest log id

log: used to add a text message to the log

logsum: used to display the log

lstages: used to list the stages present in the job

llinks: used to list the links

projectinfo: returns the project information (hostname and project name)

jobinfo: returns the job information (job status, job runtime, end time, etc.)

stageinfo: returns the stage name, stage type, input rows, etc.

report: used to display a report which contains generated time, start time, elapsed time, status, etc.

jobid: used to provide job id information
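For instance (the project and job names here are only illustrative):

dsjob -ljobs dstage1
dsjob -jobinfo dstage1 LoadCustomers
dsjob -logsum dstage1 LoadCustomers

The first command lists the jobs in project dstage1, the second returns status and run-time information for the LoadCustomers job, and the third displays that job's log summary.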

11) What is a Quality Stage in DataStage tool?


A Quality Stage helps in integrating different types of data from multiple
sources.

It is also termed as the Integrity Stage.

12) What is the process of killing a job in DataStage?


To kill a job, you must destroy the particular processing ID.

13) What is a DS Designer?


DataStage Designer is used to design the job. It also develops the work area and
adds various links to it.

14) What are the Stages in DataStage?


Stages are the basic structural blocks in InfoSphere DataStage. It provides a
rich, unique set of functionality to perform advanced or straightforward data
integration task. Stages hold and represent the processing steps that will be
performed on the data.

15) What are Operators in DataStage?


The parallel job stages are made on operators. A single-stage might belong to a
single operator or a number of operators. The number of operators depends on
the properties you have set. During compilation, InfoSphere DataStage
estimates your job design and sometimes will optimize operators.

16) Explain connectivity between DataStage with


DataSources?
IBM InfoSphere Information Server supports connectors and enables jobs for
data transfer between InfoSphere Information Server and data sources.

IBM InfoSphere DataStage and QualityStage jobs can access data from
enterprise applications and data sources such as:

o Relational databases
o Mainframe databases
o Enterprise Resource Planning (ERP) or Customer Relationship
Management (CRM) databases
o Online Analytical Processing (OLAP) or performance management
databases
o Business and analytic applications

17) Describe Stream connector?


The Stream connector allows integration between the Streams and the
DataStage. InfoSphere Stream connector is used to send data from a DataStage
job to a Stream job and vice versa.

InfoSphere Streams can perform close to real-time analytic processing in


parallel to the data loading into a data warehouse. Alternatively, the InfoSphere
Streams job performs RTAP processing. After RTAP processing, it forwards the
data to InfoSphere DataStage to transform, enrich, and store the details for
archival purposes.

18) What is the use of HoursFromTime() Function in


Transformer Stage in DataStage?
HoursFromTime Function is used to return hour portion of the time. Its input is
time, and Output is hours (int8).
Examples: If myexample1.time contains the time 22:30:00, then the following
two functions are equivalent and return the integer value 22.

1. HoursFromTime(myexample1.time)
2. HoursFromTime("22:30:00")

19) What is the Difference between Informatica and


DataStage?
The DataStage and Informatica both are powerful ETL tools. Both tools do
almost the same work in nearly the same manner. In both tools, the
performance, maintainability, and learning curve are similar and comparable.
Below are the few differences between both tools.

Parameter-wise comparison of DataStage and Informatica:

o Multiple Partitions: DataStage's pipeline partitioning uses multiple partitions; Informatica offers dynamic partitioning.
o User Interface: DataStage offers 3 GUIs: IBM DataStage Designer, Job Sequence Designer (workflow design), and Director (for monitoring); Informatica offers 4 GUIs: PowerCenter Designer, Repository Manager, Workflow Designer, and Workflow Manager.
o Data Encryption: in DataStage, data encryption needs to be done before the data reaches the DataStage server; Informatica allows a Data Masking transformation inside PowerCenter Designer as a separate transformation.
o Transformations: DataStage becomes a powerful transformation engine by using functions (Oconv and Iconv) and routines; it offers about 40 data transforming stages/objects, so almost any transformation can be performed in DataStage. Informatica allows the necessary transformations to process incoming data.
o Reusability: in DataStage, we can achieve reusability of a job by using containers (local and shared); to reuse a Job Sequence, you have to make a copy, compile it, and run it. Informatica offers reusability through Mapplets and by reusing mappings and workflows.

20) How We Can Covert Server Job To A Parallel Job?


We can convert a server job into a parallel job by using Link Collector and IPC
Collector.

21) What are the different layers in the information server


architecture?
The different layers of information server architecture are as follows.

o Unified user interface


o Common services
o Unified parallel processing
o Unified Metadata
o Common connectivity

22) If you want to use the same piece of code in different


jobs, how will you achieve it?
DataStage provides a feature called shared containers, which allows the same piece of code to be
shared across different jobs. The containers are shared for reusability. A shared container consists
of a reusable job element of stages and links, and it can be called in different DataStage jobs.

23) How many types of Sorting methods are available in


DataStage?
There are two types of sorting methods available in DataStage for parallel jobs.

o Link sort
o Standalone Sort stage

24) Describe Link Sort?


The Link sort supports fewer options than other sorts. It is easy to maintain in a
DataStage job as there are only few stages in the DataStage job canvas.

Link sort is used unless a specific option is needed over Sort Stage. Most often,
the Sort stage is used to specify the Sort Key mode for partial sorts.

Sorting on a link option is available on the input/partitioning stage options. We


cannot specify a keyed partition if we use auto partition method.

25) Which commands are used to import and export the


DataStage jobs?
We use the following commands for the given operations.

For Import: we use the dsimport.exe command

For Export, we use the dsexport.exe command

26) Describe routines in DataStage? Enlist various types of


routines.
Routine is a set of tasks which are defined by the DS manager. It is run via the
transformer stage.

There are three kinds of routines

o Parallel routines
o Mainframe routines
o Server routines

27) What is the different type of jobs in DataStage?


There are two types of jobs in DataStage

o Server jobs: These jobs run in a sequential manner


o Parallel jobs: These jobs get executed in a parallel way

28) State the difference between an Operational DataStage


and a Data Warehouse?
An Operational DataStage can be considered as a presentation area for user
processing and real-time analysis. Thus, operational DataStage is a temporary
repository. Whereas the Data Warehouse is used for durable data storage needs
and holds the complete data of the entire business.

29) What is the importance of the exception activity in


DataStage?
The reason behind the importance of exception activity is that during the job
execution, exception activity handles all the unfamiliar error activity.

30) What is "Fatal Error/RDBMS Code 3996" error?


This error occurs while testing jobs in DataStage 8.5 during Teradata 13 to 14
upgrade.

It is because the user tries to assign a longer string to a shorter string


destination, and sometimes if the length of one or more range boundaries in a
RANGE_N function is a string literal with a length higher than that of the test
value.

Best DataStage Interview Questions


1. What is Datastage?
DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM
InfoSphere. It uses a graphical notation to construct data integration solutions and is available in
various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.


2. Explain the DataStage parallel Extender or Enterprise Edition (EE)?

Parallel extender in DataStage is the data extraction and transformation application for parallel
processing.

There are two types of parallel processing's are available they are:

1. Pipeline Parallelism

2. Partition Parallelism

3. What is a conductor node in DataStage?

Every parallel job run has a conductor process, where execution starts, a section leader process
for each processing node, a player process for each set of combined operators, and an individual
player process for each uncombined operator.

Whenever we want to kill a job run, we have to destroy the player processes first, then the
section leader processes, and then the conductor process.

4. How do you run the DataStage job from the command line?

Using "dsjob" command as follows.


dsjob -run -jobstatus projectname jobname
Datastage Interview Questions for Beginners
5. What are the different options associated with "dsjob" command?

ex: $dsjob -run and also the options like

 stop -To stop the running job

 lprojects - To list the projects

 ljobs - To list the jobs in the project

 lstages - To list the stages present in the job.


 llinks - To list the links.

 projectinfo - returns the project information(hostname and project name)

 jobinfo - returns the job information(Job-status,job runtime,endtime, etc.,)

 stageinfo - returns the stage name, stage type, input rows, etc.,)

 linkinfo - It returns the link information

 lparams - To list the parameters in a job

 paraminfo - returns the parameters info

 log - add a text message to log.

 logsum - To display the log

 logdetail - To display with details like event_id, time, message

 lognewest - To display the newest log id.

 report - display a report contains Generated time, start time, elapsed time, status, etc.,

 jobid - Job id information.


6. Can you explain the difference between sequential file, dataset, and fileset?

Sequential File:
1. Extract/load from/to a sequential file, with a maximum of 2 GB.

2. When used as a source, at the time of compilation it will be converted into a native format from ASCII.

3. Does not support null values.

4. A sequential file can only be accessed on one node.

Dataset:
1. It preserves partitioning. It stores data on the nodes, so when you read from a dataset you don't have to repartition the data.

2. It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset to any other source/target.

3. You cannot view the data without DataStage.

4. It creates two types of files to store the data:

o Descriptor file: created in a defined folder/path.

o Data file: created in the dataset folder mentioned in the configuration file.

5. A dataset (.ds) file cannot be opened directly; you can follow alternative ways to view it, such as the Data Set Management utility in the client tools (Designer and Manager) or the command-line utility ORCHADMIN.

Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.

2. You can view the data, but in the order defined in the partitioning scheme.

3. A fileset creates a .fs file, which is stored in ASCII format, so you can directly open it to see the path of the data files and their schema.

8. What are the features of DataStage Flow Designer?

DataStage Flow Designer Features:


 IBM DataStage Flow Designer has many features to enhance your job-building experience.

 We can use the palette to drag and drop connectors and operators onto the designer

canvas.

 We can link nodes by selecting the previous node and dropping the next node or drawing

the link between the two nodes.

 We can edit stage properties on the sidebar, and make changes to your schema in the

Column Properties tab.

 We can zoom in and zoom out using your mouse, and leverage the mini-map on the lower-

right of the window to focus on a particular part of the DataStage job.

 This is very useful when you have a very large job with tens or hundreds of stages.

9. What are the benefits of Flow Designer?

There are many benefits with Flow designer, they are:

 No need to migrate jobs - You do not need to migrate jobs to a new location in order to

use the new web-based IBM DataStage Flow Designer user interface.

 No need to upgrade servers and purchase virtualization technology licenses

- Getting rid of a thick client means getting rid of keeping up with the latest version of the

software, upgrading servers, and purchasing Citrix licenses. IBM DataStage Flow Designer

saves time AND money!

 Easily work with your favorite jobs - You can mark your favorite jobs in the Jobs

Dashboard, and have them automatically show up on the welcome page. This gives you

fast, one-click access to jobs that are typically used for reference, saving you navigation

time.
 Easily continue working where you left off - Your recent activity automatically shows

up on the welcome page. This gives you fast, one-click access to jobs that you were

working on before, so you can easily start where you left off in the last session.

 Efficiently search for any job - Many organizations have thousands of DataStage jobs.

You can very easily find your job with the built-in type-ahead Search feature on the Jobs

Dashboard.

 Cloning a job - Instead of always starting Job Design from scratch, you can clone an

existing job on the Jobs Dashboard and use that to jump-start your new Job Design.

 Automatic metadata propagation - IBM DataStage Flow Designer comes with a

powerful feature to automatically propagate metadata. Once you add a source connector

to your job and link it to an operator, the operator automatically inherits the metadata.

You do not have to specify the metadata in each stage of the job.

 Storing your preferences - You can easily customize your viewing preferences and have

the IBM DataStage Flow Designer automatically save them across sessions.

 Saving a job - IBM DataStage Flow Designer allows you to save a job in any folder. The job

is saved as a DataStage job in the repository, alongside other jobs that might have been

created using the DataStage Designer thick client.

 Highlighting of all compilation errors - The DataStage thick client identifies

compilation errors one at a time. Large jobs with many stages can take longer to

troubleshoot in this situation. IBM DataStage Flow Designer highlights all errors and gives

you a way to see the problem with a quick hover over each stage, so you can fix multiple

problems at the same time before recompiling.


 Running a job - IBM DataStage Flow Designer allows you to run a job. You can refresh the

status of your job on the new user interface. You can also view the Job Log, or launch the

Ops Console to see more details of job execution

10. What is an HBase connector?

HBase connector is used to connect to tables stored in the HBase database and perform the
following operations:

 Read data from or write data to HBase database.

 Read data in parallel mode.

 Use HBase table as a lookup table in sparse or normal mode.

11. What is a Hive connector?

Hive connector supports modulus partition mode and minimum-maximum partition mode during
the read operation.

12. What is Kafka connector?

The Kafka connector has been enhanced with the following new capabilities:

 Continuous mode, where incoming topic messages are consumed without stopping the

connector.

 Transactions, where a number of Kafka messages is fetched within a single transaction.

After the record count is reached, an end of the wave marker is sent to the output link.

 TLS connection to Kafka.

 Kerberos keytab locality is supported.

13. What is the Amazon S3 connector?

Amazon S3 connector now supports connecting by using an HTTP proxy server.


14. What is a File connector?

File connector has been enhanced with the following new capabilities:

 Native HDFS FileSystem model is supported.

 You can import metadata from the ORC files.

 New data types are supported for reading and writing the Parquet-formatted files: Date/Time and Timestamp.

15. What is InfoSphere Information Server?

InfoSphere Information Server is capable of scaling to meet any information volume requirement so that companies can deliver business results faster and with higher quality. InfoSphere Information Server provides a single unified platform that enables companies to understand, cleanse, transform, and deliver trustworthy and context-rich information.

16. What are the different Tiers available in the InfoSphere Information Server?

There are four tiers available in the InfoSphere Information Server:

1. Client Tier

2. Engine Tier

3. Services Tier

4. Metadata Repository Tier

17. What is the Client tier in the Information server?

The client tier includes the client programs and consoles that are used for development and
administration and the computers where they are installed.

18. What is the Engine tier in the Information server?

The engine tier includes the logical group of components (the InfoSphere Information Server
engine components, service agents, and so on) and the computer where those components are
installed. The engine runs jobs and other tasks for product modules.

19. What is the Services tier in the Information server?


The services tier includes the application server, common services, and product services for the
suite and product modules, and the computer where those components are installed. The
services tier provides common services (such as metadata and logging) and services that are
specific to certain product modules. On the services tier, the WebSphere® Application Server
hosts the services. The services tier also hosts InfoSphere Information Server applications that
are web-based.

20. What is the Metadata repository tier in the Information server?

The metadata repository tier includes the metadata repository, the InfoSphere Information
Analyzer analysis database (if installed), and the computer where these components are installed.
The metadata repository contains the shared metadata, data, and configuration information for
InfoSphere Information Server product modules. The analysis database stores extended analysis
data for InfoSphere Information Analyzer.

Datastage Scenario Based Interview Questions for Experienced


21. What are the key elements of Datastage?

DataStage provides the elements that are necessary to build data integration and transformation
flows.

These elements include

 Stages

 Links

 Jobs

 Table definitions

 Containers

 Sequence jobs

 Projects

22. What are Stages in Datastage?

Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of
functionality that performs either a simple or advanced data integration task. Stages represent
the processing steps that will be performed on the data.
23. What are Links in Datastage?

A link is a representation of a data flow that joins the stages in a job. A link connects data sources
to processing stages, connects processing stages to each other, and also connects those
processing stages to target systems. Links are like pipes through which the data flows from one
stage to the next.

24. What are Jobs in Datastage?

Jobs include the design objects and compiled programmatic elements that can connect to data
sources, extract and transform that data, and then load that data into a target system. Jobs are
created within a visual paradigm that enables instant understanding of the goal of the job.

25. What are Sequence jobs in Datastage?

A sequence job is a special type of job that you can use to create a workflow by running other jobs
in a specified order. This type of job was previously called a job sequence.

26. What are Table definitions?

Table definitions specify the format of the data that you want to use at each stage of a job. They
can be shared by all the jobs in a project and between all projects in InfoSphere DataStage.
Typically, table definitions are loaded into source stages. They are sometimes loaded into target
stages and other stages.

27. What are Containers in Datastage?

Containers are reusable objects that hold user-defined groupings of stages and links. Containers
create a level of reuse that allows you to use the same set of logic several times while reducing
the maintenance. Containers make it easy to share a workflow because you can simplify and
modularize your job designs by replacing complex areas of the diagram with a single container.

28. What are Projects in Datastage?

A project is a container that organizes and provides security for objects that are supplied, created,
or maintained for data integration, data profiling, quality monitoring, and so on.

29. What is Parallel processing design?

InfoSphere DataStage brings the power of parallel processing to the data extraction and
transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data
pipelining and data partitioning, allowing you to design an integration process without concern
for data volumes or time constraints, and without any requirements for hand-coding.

30. What are the types of parallel processing?


InfoSphere DataStage jobs use two types of parallel processing:

1. Data pipelining

2. Data partitioning

31. What is Data pipelining?

Data pipelining is the process of extracting records from the data source system and moving them
through the sequence of processing functions that are defined in the data flow that is defined by
the job. Because records are flowing through the pipeline, they can be processed without writing
the records to disk.

32. What is Data partitioning?

Data partitioning is an approach to parallelism that involves breaking the records into partitions,
or subsets of records. Data partitioning generally provides linear increases in application
performance.

When you design a job, you select the type of data partitioning algorithm that you want to use
(hash, range, modulus, and so on). Then, at runtime, InfoSphere DataStage uses that selection for
the number of degrees of parallelism that are specified dynamically at run time through the
configuration file.

33. What are Operators in Datastage?

A single stage might correspond to a single operator, or a number of operators, depending on the
properties you have set, and whether you have chosen to partition or collect or sort data on the
input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will
sometimes optimize operators out if they are judged to be superfluous, or insert other operators
if they are needed for the logic of the job.

34. What is OSH in Datastage?

OSH is the scripting language used internally by the parallel engine.

35. What are Players in Datastage?

Players are the workhorse processes in a parallel job. There is generally a player for each operator
on each node. Players are the children of section leaders; there is one section leader per
processing node. Section leaders are started by the conductor process running on the conductor
node (the conductor node is defined in the configuration file).
36. What are the two major ways of combining data in an InfoSphere DataStage
Job? How do you decide which one to use?

The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage:

1. Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of a manageable size or are pre-sorted, Join is the preferred solution.

2. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory.

37. What is the advantage of using Modular development in the data stage?

You should aim to use modular development techniques in your job designs in order to maximize the reuse of parallel jobs and components and save development time.

38. What is Link buffering?

InfoSphere DataStage automatically performs buffering on the links of certain stages. This is
primarily intended to prevent deadlock situations arising (where one stage is unable to read its
input because a previous stage in the job is blocked from writing to its output).

39. How do you import and export data into Datastage?

Data is imported and exported in Datastage using the import/export utility, which consists of two operators:

 The import operator: imports one or more data files into a single data set.

 The export operator: exports a data set to one or more data files.
40. What is the collection library in Datastage?

The collection library is a set of related operators that are concerned with collecting partitioned
data.

41. What are the collectors available in the collection library?

The collection library contains three collectors:

1. The ordered collector

2. The round-robin collector

3. The sortmerge collector

42. What is the ordered collector?

The Ordered collector reads all records from the first partition, then all records from the second
partition, and so on. This collection method preserves the sorted order of an input data set that
has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as
well as the partitions themselves, are ordered.

43. What is the round-robin collector?

The round-robin collector reads a record from the first input partition, then from the second
partition, and so on. After reaching the last partition, the collector starts over. After reaching the
final record in any partition, the collector skips that partition.

44. What is the sortmerge collector?

The sortmerge collector reads records in an order based on one or more fields of the record. The
fields used to define record order are called collecting keys.

45. What is the aggtorec restructure operator and what does it do?

The aggtorec restructure operator groups records that have the same key-field values into an output record.

46. What is the field_export restructure operator and what does it do?

The field_export restructure operator combines the input fields specified in your output schema into a string- or raw-valued field.

47. What is the field_import restructure operator and what does it do?

The field_import restructure operator exports an input string or raw field to the output fields specified in your import schema.

48. What is the makesubrec restructure operator and what does it do?

The makesubrec restructure operator combines specified vector fields into a vector of subrecords.

49. What is the makevect restructure operator and what does it do?

The makevect restructure operator combines specified fields into a vector of fields of the same type.

50. What is the promotesubrec restructure operator and what does it do?

The promotesubrec restructure operator converts input sub-record fields to output top-level fields.

Advanced DataStage Interview Questions


51. What is the splitsubrec restructure operator and what does it do?

The splitsubrec restructure operator separates input sub-records into sets of output top-level vector fields.

52. What is the splitvect restructure operator and what does it do?

The splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly-named top-level fields.

53. What is the tagbatch restructure operator and what does it do?

The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.

54. What is the tagswitch restructure operator and what does it do?

The tagswitch restructure operator converts the contents of tagged aggregates into InfoSphere DataStage-compatible records.

Datastage UNIX Interview Questions


55. How do you print/display the first line of a file?

The easiest way to display the first line of a file is using the [head] command.
$> head -1 file.txt
If you specify [head -2] then it would print the first 2 lines of the file.

Another way is to use the [sed] command. [sed] is a very powerful stream editor that can be used for various text-manipulation purposes like this.
$> sed '2,$ d' file.txt

56. How do you print/display the last line of a file?

The easiest way is to use the [tail] command.


$> tail -1 file.txt

If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt

57. How to display n-th line of a file?

The easiest way to do it is by using the [sed] command:

$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' file.txt

Of course, you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1

58. How to remove the first line/header from a file?

We already know how [sed] can be used to delete a certain line from the output, by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt

But the issue with the above command is, it just prints out all the lines except the first line of the
file on the standard output. It does not really change the file in-place. So if you want to delete the
first line from the file itself, you have two options.

Either you can redirect the output of the file to some other file and then rename it back to original
file like below:
$> sed '1 d' file.txt > new_file.txt

$> mv new_file.txt file.txt

Or, you can use the inbuilt [sed] switch '-i', which changes the file in-place. See below:
$> sed -i '1 d' file.txt

59. How to remove the last line/ trailer from a file in Unix script?

Always remember that the [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt

60. How to remove certain lines from a file in Unix?

If you want to remove lines <m> to <n> from a given file, you can accomplish the task in a similar method as shown above. Here is an example:
$> sed -i '5,7 d' file.txt

The above command will delete line 5 to line 7 from the file file.txt


Q1. What are the characteristics of DataStage?


Ans. Following are the characteristics of IBM DataStage:

 The tool can be deployed on local servers and the cloud as per the requirement.
 It is easy to use and can efficiently increase the speed and flexibility of data
integration.
 It can support and access big data through JSON support, JDBC integrator and
distributed file systems.
Q2. Describe the architecture of DataStage.
Ans. DataStage has a client-server model as its architecture. There are different
architecture types according to the version. The components of the client-server
architecture are:
 Client components
 Servers
 Stages
 Table definitions
 Containers
 Projects
 Jobs
Q3. What are DataStage jobs?
Ans. Datastage jobs determine data sources, the required transformations, and data destinations. Jobs are compiled together to create reusable components and parallel job flows.
Q4. What is the need for exception activity in Datastage?
Ans. Exception activity is necessary in Datastage because all the stages after the exception activity are executed whenever an unknown error occurs while executing the job sequencer.
Q5. How to perform usage analysis in Datastage?
Ans. You can perform Usage Analysis in a few clicks. First, you need to launch the
Datastage Manager and then, right-click the job. After that, select Usage Analysis.
Q6. How can you perform date conversion in IBM Datastage?
Ans. Datastage handles date conversion with the Oconv() and Iconv() functions, i.e. Oconv(Iconv(FieldName, "Existing Date Format"), "New Date Format").
Q7. What are the different types of Lookups in the Datastage?
Ans. Datastage has two types of Lookups: Normal lkp and Sparse lkp.
 Normal lkp: First, data is saved in the memory and then the lookup is performed.
 Sparse lkp: Data is directly saved into the database. Sparse lkp is faster than the
Normal lkp.
Q8. What is APT_CONFIG?
Ans. APT_CONFIG is the environment variable that identifies the *.apt configuration file in the tool. The configuration file stores the node information, disk storage information, and scratch disk information.
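As a minimal sketch (the host name and resource paths below are placeholders, not values from any particular installation), a two-node *.apt configuration file typically looks like this:
{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/ibm/ds/datasets" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_host"
    pools ""
    resource disk "/ibm/ds/datasets" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
}
The number of node entries in this file determines the default degree of parallelism that the parallel engine uses at run time.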
Q9. Explain OConv () and IConv () functions?
Ans. In Datastage, OConv () and IConv() are two functions that are used for converting
from one format to another etc. OConv () converts formats for users to understand
whereas IConv () converts formats for the system to understand.
Q10. What is the difference between Sequential and Hash files?
Ans. A sequential file does not have any key value for saving the data. A hash file saves the data based on a hash algorithm and a hash key value, due to which searching in a hash file is faster than in a sequential file.
Q11. Define flow designer in IBM DataStage?
Ans. Flow designer is the web-based user interface that is used for creating, editing,
loading, and running jobs in IBM DataStage. It has the following features:
 Performs jobs with a large number of stages.
 No need for migrating the jobs to use the flow designer.
 Add/remove connectors and operators through the palette onto the designer canvas using the drag and drop feature.
Q12. Differentiate between Operational Datastage (ODS) and Data
warehouse?
Ans. Operational Datastage is a mini data warehouse since it does not contain
information for more than one year. A data warehouse contains in-depth information
related to the entire business.
Q13. What is the use of NLS in Datastage?
Ans. National language support (NLS) can include multiple languages in the data as per
the data warehouse processing requirements.
Q14. What are the different types of hash files in the Datastage?
Ans. DataStage has two types of hash files i.e. Static and Dynamic Hash File. Static
hash file is used in cases where limited data has to be loaded within the target
database. Dynamic hash file is used in case of loading an unknown amount of data from
the source file.
Q15. Why do we need to use the surrogate key instead of the unique
key?
Ans. The surrogate key is used instead of a unique key as it can retrieve the data faster
through the retrieval operation.
Q16. Which command line is used for running a job in the tool?
Ans. You can run jobs using ‘dsjob -run -jobstatus <projectname> <jobname>’
command.
Q17. How can you find bugs in a job sequence?
Ans. You can use DataStage Director for finding bugs in the job sequence.
Q18. Name the types of views in a Datastage Director?
Ans. There are three view types in a Datastage Director: Job View, Log View and
Status View.
Q19. How can you improve performance in Datastage?
Ans. It is advisable not to use more than 20 stages in each job. It is better to use
another job if you have to use more than 20 stages.
Q20.Which functions are used for importing and exporting DS jobs?
Ans. ‘dsimport.exe’ is used for importing and ‘dsexport.exe’ is used for exporting DS
jobs.
Q21. How can you fix truncated data errors in Datastage?
Ans. You can fix truncated data errors by using the ENVIRONMENT variable
‘IMPORT_REJECT_STRING_FIELD_OVERRUN’.
Q22. What is the difference between a data file and a descriptor file?
Ans. A data file contains the data and a descriptor file contains the description of this
data contained within the data file.
Q23. List down some functions that you can execute using ‘dsjob’
command.
Ans. You can execute the following functions using the 'dsjob' command (a sample invocation is shown after the list):
 $dsjob -run: to run DataStage job
 $dsjob -jobid: for providing the job information
 $dsjob -stop: to stop the job that is currently present in the process
 $dsjob -report: to display complete job report
 $dsjob -llinks: for listing all the links
 $dsjob -lprojects: for listing all present projects
 $dsjob -ljobs: to list all jobs present in the project
 $dsjob -lstages: for listing every stage of the current job
 $dsjobs -lparams: for listing all parameters of the job
 $dsjob -projectinfo: to retrieve project information
 $dsjob -jobinfo: for retrieving information of the job
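For instance, a typical invocation from the shell (the project and job names here are hypothetical) might look like this:
$> dsjob -run -jobstatus MyProject MyJob
$> dsjob -jobinfo MyProject MyJob
The -jobstatus option makes dsjob wait for the job to finish and return an exit status that reflects the job's final state, which is useful when the job is triggered from a script or scheduler.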
Q24. What are Routines and how can we call them in Datastage jobs?
Ans. Routines are the collection of functions defined by the DS manager. There are
three types of routines including parallel, mainframe and server routines. You can call a
routine from the transformer stage of the tool.
Q25. How can you write parallel routines?
Ans. You can write parallel routines in the C and C++ compilers.
Q26. Why do we use an HBase connector in Datastage?
Ans. HBase connector is used for connecting databases and tables that are present in
the HBase database. Through the HBase connector, you can read and write the data in
the HBase database. You can also read the data in parallel mode.
Q27. What is the importance of Data partitioning?
Ans. Data partitioning involves the process of segmenting records into partitions for
processing. This increases processing efficiency in a linear model. Overall, it is a
parallel approach for data processing.
Q28. How does Datastage manage rejected rows?
Ans. Rejected rows are managed through constraints in the transformer. There are two
ways to do so. Either you can place the rejected rows in the transformer’s properties or
you can create temporary storage for these rejected rows using the REJECTED
command.
Q29. What is the use of link partitioner and link connector in Datastage?
Ans. Link Partitioner divides the data into different parts through partitioning methods. Link Collector collects data from the different segments into a single data stream and saves it into the target table.
Q30. What are the tiers of Datastage InfoSphere Information Server?
Ans. There are four tiers in the InfoSphere Information Server are:
 Client tier: used for development and administration of computers using
client programs and consoles.
 Services tier: to provide standard and module-specific services. It contains
an application server, product modules as well as product services.
 Engine tier: it has a set of logical components which are used for running
the jobs and other tasks for product modules.
 Metadata repository tier: it includes metadata repository and analysis
database. The repository is used for sharing metadata, shared data and
configuration information.
Q31. How does DataStage job performance tuning take place?
Ans. For performance tuning, first select the appropriate configuration file. Then choose the right partitioning method and buffer memory, and handle data sorting and null values properly. Use Copy, Filter, or Modify stages where possible and avoid unnecessary Transformer stages. Finally, reduce the propagation of redundant metadata between stages.
Q32. How does DataStage handle merging?
Ans. The primary key column in the tables can be used to merge or join two or more
tables.
Q33. What’s the distinction between Datastage 7.5 and 7.0?
Ans. Many new stages, such as Procedure Stage, Command Stage, Generate Report,
and so on, have been added to Datastage 7.5, which were not present in the 7.0
version.

1) Define Data Stage?


DataStage is basically a tool that is used to design, develop, and execute various applications that populate multiple tables in a data warehouse or data marts. It is a program for Windows servers that extracts data from databases and loads it into data warehouses. It has become an essential part of the IBM WebSphere Data Integration suite.


2) Explain how a source file is populated?


We can populate a source file in many ways, such as by creating a SQL query in Oracle, or by using the Row Generator stage, etc.

3) Name the command line functions to import and export the DS jobs?
To import the DS jobs, dsimport.exe is used and to export the DS jobs,
dsexport.exe is used.
4) What is the difference between Datastage 7.5
and 7.0?
In Datastage 7.5 many new stages are added for more robustness and
smooth performance, such as Procedure Stage, Command Stage,
Generate Report etc.


5) In Datastage, how can you fix the truncated data error?
The truncated data error can be fixed by using the ENVIRONMENT VARIABLE 'IMPORT_REJECT_STRING_FIELD_OVERRUN'.

6) Define Merge?
Merge means to join two or more tables. The two tables are joined on
the basis of Primary key columns in both the tables.

7) Differentiate between data file and descriptor file?
As the name implies, data files contain the data and the descriptor file contains the description/information about the data in the data files.

8) Differentiate between Datastage and Informatica?
In datastage, there is a concept of partition, parallelism for node
configuration. While, there is no concept of partition and parallelism in
informatica for node configuration. Also, Informatica is more scalable
than Datastage. Datastage is more user-friendly as compared to
Informatica.

9) Define Routines and their types?


Routines are basically a collection of functions that are defined by the DS Manager. They can be called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines, and server routines.
10) How can you write parallel routines in
datastage PX?
We can write parallel routines in C or C++. Such routines are also created in the DS Manager and can be called from the Transformer stage.

11) What is the method of removing duplicates without the Remove Duplicates stage?
Duplicates can be removed by using the Sort stage with the option Allow Duplicates = False.
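As a loose UNIX analogy (this is just a shell command, not the DataStage Sort stage itself), deduplicating a plain text file works the same way:
$> sort -u file.txt > deduped.txt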

12) What steps should be taken to improve Datastage jobs?
In order to improve the performance of Datastage jobs, we have to first establish the baselines. Secondly, we should not use only one flow for performance testing. Thirdly, we should work in increments. Then, we should evaluate data skews. Then we should isolate and solve the problems, one by one. After that, we should distribute the file systems to remove bottlenecks, if any. Also, we should not include the RDBMS at the start of the testing phase. Last but not least, we should understand and assess the available tuning knobs.

13) Differentiate between the Join, Merge and Lookup stages?
All three stages differ in the way they use memory storage, in their input requirements, and in how they treat various records. Join and Merge need less memory as compared to the Lookup stage.

14) Explain Quality stage?


Quality stage is also known as Integrity stage. It assists in integrating
different types of data from various sources.

15) Define Job control?


Job control can be best performed by using Job Control Language (JCL).
This tool is used to execute multiple jobs simultaneously, without using
any kind of loop.

16) Differentiate between Symmetric Multiprocessing and Massive Parallel Processing?
In Symmetric Multiprocessing, the hardware resources are shared by the processors. The processors run one operating system and communicate through shared memory. In Massive Parallel Processing, each processor accesses the hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared, and it is faster than Symmetric Multiprocessing.

17) What are the steps required to kill a job in Datastage?
To kill a job in Datastage, we have to kill the respective process ID.
18) Differentiate between Validated and Compiled in Datastage?
In Datastage, validating a job means executing it: while validating, the Datastage engine verifies whether all the required properties are provided or not. On the other hand, while compiling a job, the Datastage engine verifies whether all the given properties are valid or not.

19) How to manage date conversion in Datastage?
We can use the date conversion function for this purpose, i.e.
Oconv(Iconv(FieldName, "Existing Date Format"), "Another Date Format").

20) Why do we use exception activity in Datastage?
All the stages after the exception activity in Datastage are executed in case any unknown error occurs while executing the job sequencer.

21) Define APT_CONFIG in Datastage?


It is the environment variable that is used to identify the *.apt file in
Datastage. It is also used to store the node information, disk storage
information and scratch information.

22) Name the different types of Lookups in Datastage?
There are two types of Lookups in Datastage i.e. Normal lkp and
Sparse lkp. In Normal lkp, the data is saved in the memory first and
then the lookup is performed. In Sparse lkp, the data is directly saved
in the database. Therefore, the Sparse lkp is faster than the Normal
lkp.

23) How can a server job be converted to a parallel job?
We can convert a server job into a parallel job by using the IPC stage and Link Collector.

24) Define Repository tables in Datastage?


In Datastage, the Repository is another name for a data warehouse. It
can be centralized as well as distributed.

25) Define the OConv() and IConv() functions in Datastage?
In Datastage, the OConv() and IConv() functions are used to convert formats from one format to another, i.e. conversions of Roman numerals, time, date, radix, numeral ASCII, etc. IConv() is basically used to convert formats for the system to understand, while OConv() is used to convert formats for users to understand.

26) Explain Usage Analysis in Datastage?


In Datastage, Usage Analysis is performed within a few clicks. Launch the Datastage Manager and right-click the job. Then, select Usage Analysis and that's it.
27) How do you find the number of rows in a
sequential file?
To find the number of rows in a sequential file, we can use the system variable @INROWNUM.
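Outside the job itself, if the sequential file is accessible on the engine host, a quick shell cross-check (just a UNIX command, not a DataStage feature) is:
$> wc -l file.txt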

28) Differentiate between Hash file and Sequential file?
The only difference between the Hash file and the Sequential file is that the Hash file saves data based on a hash algorithm and a hash key value, while the Sequential file doesn't have any key value to save the data. Based on this hash key feature, searching in a Hash file is faster than in a Sequential file.

29) How to clean the Datastage repository?


We can clean the Datastage repository by using the Clean Up
Resources functionality in the Datastage Manager.

30) How is a routine called in a Datastage job?


In Datastage, routines are of two types i.e. Before Sub Routines and
After Sub Routines. We can call a routine from the transformer stage in
Datastage.

31) Differentiate between Operational Datastage (ODS) and Data warehouse?
We can say, ODS is a mini data warehouse. An ODS doesn’t contain
information for more than 1 year while a data warehouse contains
detailed information regarding the entire business.
32) NLS stands for what in Datastage?
NLS means National Language Support. It can be used to incorporate other languages such as French, German, and Spanish, etc. in the data, as required for processing by the data warehouse. These languages have the same script as the English language.

33) Can you explain how one can drop the index before loading the data into the target in Datastage?
In Datastage, we can drop the index before loading the data into the target by using the Direct Load functionality of the SQL Loader utility.

34) Does Datastage support slowly changing dimensions?
Yes. Version 8.5 and later support this feature.

35) How can one find bugs in job sequence?


We can find bugs in job sequence by using DataStage Director.

36) How are complex jobs implemented in Datastage to improve performance?
In order to improve performance in Datastage, it is recommended not to use more than 20 stages in a job. If you need to use more than 20 stages, then it is better to use another job for those stages.
37) Name the third party tools that can be used
in Datastage?
The third party tools that can be used in Datastage, are Autosys, TNG
and Event Co-ordinator. I have worked with these tools and possess
hands on experience of working with these third party tools.

38) Define Project in Datastage?


Whenever we launch the Datastage client, we are asked to connect to
a Datastage project. A Datastage project contains Datastage jobs,
built-in components and Datastage Designer or User-Defined
components.

39) How many types of hash files are there?


There are two types of hash files in DataStage i.e. Static Hash File and
Dynamic Hash File. The static hash file is used when limited amount of
data is to be loaded in the target database. The dynamic hash file is
used when we don’t know the amount of data from the source file.

40) Define Meta Stage?


In Datastage, MetaStage is used to save metadata that is helpful for
data lineage and data analysis.

41) Have you ever worked in a UNIX environment, and why is it useful in Datastage?
Yes, I have worked in a UNIX environment. This knowledge is useful in Datastage because sometimes one has to write UNIX programs, such as batch programs, to invoke batch processing, etc.

42) Differentiate between Datastage and Datastage TX?
Datastage is a tool from ETL (Extract, Transform and Load) and
Datastage TX is a tool from EAI (Enterprise Application Integration).

43) What do transaction size and array size mean in Datastage?
Transaction size means the number of rows written before committing the records to a table. Array size means the number of rows written/read to or from the table respectively.

44) How many types of views are there in a Datastage Director?
There are three types of views in a Datastage Director i.e. Job View,
Log View and Status View.

45) Why do we use a surrogate key?


In Datastage, we use Surrogate Key instead of unique key. Surrogate
key is mostly used for retrieving data faster. It uses Index to perform
the retrieval operation.
46) How are rejected rows managed in Datastage?
In the Datastage, the rejected rows are managed through constraints
in transformer. We can either place the rejected rows in the properties
of a transformer or we can create a temporary storage for rejected
rows with the help of REJECTED command.

47) Differentiate between ODBC and DRS stage?


DRS stage is faster than the ODBC stage because it uses native
databases for connectivity.

48) Define Orabulk and BCP stages?


The Orabulk stage is used to load a large amount of data into one target table of an Oracle database. The BCP stage is used to load a large amount of data into one target table of Microsoft SQL Server.

49) Define DS Designer?


The DS Designer is used to design the work area and add various links to it.

50) Why do we use Link Partitioner and Link Collector in Datastage?
In Datastage, Link Partitioner is used to divide data into different parts through certain partitioning methods. Link Collector is used to gather data from various partitions/segments into a single data stream and save it in the target table.
