Datastage Interview Questions
1. What is IBM DataStage?
Ans. DataStage is an ETL tool offered by IBM. It is used to design, develop, and execute applications that extract data from databases on Windows servers and load it into data warehouses. It also provides graphical visualizations of data integration flows and can extract data from many different sources.
In this article, we have shared a list of the most frequently asked IBM DataStage interview questions along with their answers. These DataStage interview questions and answers are beneficial for beginners as well as experienced professionals preparing to crack DataStage interviews.
The questions cover key concepts like DataStage vs. Informatica, DataStage routines, the lookup, join, and merge stages, performance tuning of jobs, the repository table, data type conversions, QualityStage, job control, and more.
2. What are the characteristics of DataStage?
Ans. The characteristics of DataStage are summarized below.
DataStage is an ETL (extract, transform, load) tool for Windows servers that integrates data from databases into the data warehouse. It is used to design, develop, and run applications that fill data into data warehouses and data marts. DataStage is an essential part of the IBM InfoSphere Data Integration Suite.
Summary:
DataStage supports the transformation of large volumes of data using a scalable parallel
processing approach.
It supports Big Data and Hadoop by accessing Big Data in several ways, such as via a distributed file system, JSON support, and a JDBC integrator.
DataStage is easy to use, with improved speed, flexibility, and efficacy for data integration.
A source file can be populated in many ways, such as by creating a SQL query in Oracle or through a row generator extract tool.
5. One of the most frequently asked DataStage interview questions: what is the difference between DataStage 7.0 and 7.5?
DataStage 7.5 adds many new stages on top of version 7.0 for increased stability and smoother performance. The new features include the command stage, the procedure stage, report generation, and more.
A data file contains only data, while a descriptor file contains all information or description
about data in data files.
Both DataStage and Informatica are powerful ETL tools. While DataStage has the concept of
parallelism and partition for node configuration, Informatica does not support parallelism in node
configuration. DataStage is simpler to use than Informatica, but Informatica is more scalable.
Parallel routines can be written with a C or C++ compiler. We can create such routines in the DS Manager, and they can be called from the transformer stage.
The sort function can be used to remove duplicates in DataStage. While running the sort function, the user should set the option that allows duplicates to false.
11. What is the difference between the join, merge, and lookup stages?
These stages differ from each other in how they use memory, in their input requirements, and in how they treat records. Join and Merge need less memory than Lookup.
We can convert a server job into a parallel job with the help of an IPC stage and a Link Collector.
It is a tool used to connect to databases and tables present in the HBase database. It can be used to carry out tasks like reading data from and writing data to HBase tables, including reading data in parallel mode.
First, we have to establish baselines. Also, we shouldn't use only one flow for performance testing. Work should be done incrementally. Evaluate data skews and, thereafter, isolate and solve the problems one by one. Next, distribute the file systems to avoid bottlenecks, if any. Do not include the RDBMS at the beginning of the testing phase. Finally, identify and examine any remaining bottlenecks.
QualityStage (the quality stage) is used for data cleansing with the DataStage tool. It is client-server software provided as part of IBM Information Server.
16. One of the most frequently asked DataStage interview questions: define job control.
Job control is a tool used for controlling a job or executing multiple jobs in parallel. The Job Control Language within the IBM DataStage tool is used to deploy job control.
First, we choose the right configuration file, partitioning, and buffer memory. We take care of data sorting and the handling of null values. We should try to use copy, modify, or filter stages rather than the transformer. The propagation of unnecessary metadata between stages also needs to be reduced.
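As a rough, hedged sketch of the configuration and buffer tuning mentioned above (the path and value below are assumptions and must be adapted to the actual installation), the configuration file and per-link buffer memory can be controlled through environment variables before a run:
$> export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt   # hypothetical path to the chosen configuration file
$> export APT_BUFFER_MAXIMUM_MEMORY=6291456   # raise the per-buffer memory in bytes; the value here is only an example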
A repository table or data warehouse is used for answering ad-hoc, historical, analytical, or
complex queries. A repository can be centralized or distributed.
19. Another frequently asked DataStage interview question: how can you kill a DataStage job?
To kill a DataStage job, we first need to kill the individual process IDs so that the job is stopped.
The Validated OK process ensures that the connections are valid, whereas the Compiled process
makes sure that important stage parameters are correctly mapped so that it creates an executable
job.
We can use the data conversion function for data type conversion in DataStage. We must make sure that the input to and output from the operator are the same, and the record schema must be compatible with the operator.
If there is an unfamiliar error occurring during the execution of the job sequencer, all the stages
following the exception activity are executed. Hence, exception activity is very important in
DataStage.
23. Describe DataStage architecture briefly.
The DataStage architecture follows a client-server model with different architecture types for
different versions. The main components of the model are:
Client components
Servers
Jobs
Stages
Projects
Table Definitions
Containers
24. What are the command line functions that can help to import and export DS jobs?
The dsimport.exe is used to import DS jobs, and the dsexport.exe is used for export.
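As a hedged illustration only (the exact switch syntax differs between client versions, and the domain, host, credentials, project, and file names below are placeholders), the two executables are typically run from the Windows client machine roughly like this; confirm the supported options with each tool's built-in help before relying on them:
dsimport.exe /D=services-host:9080 /H=engine-host /U=dsadm /P=secret dstage1 C:\export\MyJob.dsx
dsexport.exe /D=services-host:9080 /H=engine-host /U=dsadm /P=secret /JOB=MyJob dstage1 C:\export\MyJob.dsx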
To check whether a certain job is part of a sequence, we right-click the job in the DataStage Manager and select Usage Analysis.
28. Another frequently asked DataStage interview question: what is the difference between a sequential file and a hash file?
A hash file, being based on a hash algorithm, can be used with a key value, whereas a sequential file does not have any key-value column.
A hash file can be used as a lookup reference, while a sequential file cannot be used for lookups. It is also easier to search a hash file due to the presence of the hash key.
Criteria | Characteristics
Support for Big Data Hadoop | Accesses Big Data on a distributed file system, with JSON support and a JDBC integrator
Ease of use | Improved speed, flexibility, and efficacy for data integration
Deployment | On-premise or cloud, as the need dictates
NLS stands for National Language Support. It means the IBM DataStage tool can handle multi-byte character languages such as Chinese or Japanese; data can be read, written, and processed in any supported language as required.
DataStage architecture contains the following components:
o Projects
o Jobs
o Stages
o Servers
o Client Components
The prerequisites for setting up DataStage include:
o InfoSphere DataStage Server 9.1.2 or above
o Microsoft Visual Studio .NET 2010 Express Edition C++
o Oracle client (full client, not an instant client) if connecting to an Oracle database
o DB2 client if connecting to a DB2 database
To read multiple files with a single Sequential File stage, the file names and the read method are set in the stage properties, for example:
1. File = /home/myFile1.txt
2. File = /home/myFile2.txt
3. File = /home/myFile3.txt
4. Read Method = Specific file(s)
The dsjob command can return the following information:
jobinfo: returns the job information (job status, job run time, end time, etc.)
stageinfo: returns the stage name, stage type, input rows, etc.
report: displays a report containing the generated time, start time, elapsed time, status, etc.
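For example, assuming a project named dstage1, a job named Job_Load_Customers, and a stage named Seq_Input (all hypothetical names), this information can be pulled from the command line as follows:
$> dsjob -jobinfo dstage1 Job_Load_Customers
$> dsjob -stageinfo dstage1 Job_Load_Customers Seq_Input
$> dsjob -report dstage1 Job_Load_Customers DETAIL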
IBM InfoSphere DataStage and QualityStage jobs can access data from
enterprise applications and data sources such as:
o Relational databases
o Mainframe databases
o Enterprise Resource Planning (ERP) or Customer Relationship
Management (CRM) databases
o Online Analytical Processing (OLAP) or performance management
databases
o Business and analytic applications
For example, the HoursFromTime() transformer function returns the hours portion of a time value:
1. HoursFromTime(myexample1.time)
2. HoursFromTime("22:30:00"), which returns 22
Multiple partitions: DataStage's pipeline partitioning uses multiple partitions, whereas Informatica offers dynamic partitioning.
There are two ways of sorting data in DataStage parallel jobs:
o Link sort
o Standalone Sort stage
Link sort is used unless a specific option is needed on the Sort stage. Most often, the Sort stage is used to specify the Sort Key mode for partial sorts.
DataStage supports three types of routines:
o Parallel routines
o Mainframe routines
o Server routines
Parallel extender in DataStage is the data extraction and transformation application for parallel processing.
There are two types of parallel processing available:
1. Pipeline parallelism
2. Partition parallelism
Every parallel job run contains a conductor process, where the execution starts, a section leader process for each processing node, a player process for each set of combined operators, and an individual player process for each uncombined operator.
Whenever we want to kill a job's processes, we should destroy the player processes first, then the section leader processes, and then the conductor process.
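On the engine host, these parallel-engine processes usually appear as osh processes; here is a rough sketch of inspecting them and killing them in the order described above (players first, then section leaders, then the conductor), where the process IDs are obviously placeholders:
$> ps -ef | grep osh    # identify the conductor, section leader, and player processes for the job
$> kill 24031           # kill the player process(es) first
$> kill 24012           # then the section leader on that node
$> kill 24001           # and finally the conductor process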
4. How do you run a DataStage job from the command line?
A DataStage job can be run from the command line with the dsjob command, for example:
dsjob -run -jobstatus <project name> <job name>
The dsjob command can also return stage information (stage name, stage type, input rows, and so on) and a job report (generated time, start time, elapsed time, status, and so on).
6. Can you explain the difference between sequential file, dataset, and fileset?
Sequential File:
1. Can extract/load from/to a sequential file of at most 2 GB.
2. When used as a source, it is converted at compilation time from ASCII into a native format.
Dataset:
1. It preserves partitioning. It stores data on the nodes, so when you read from a dataset the data does not need to be repartitioned.
2. It stores data in binary in the internal format of DataStage, so it takes less time to read or write.
3. A dataset creates two kinds of files:
o Descriptor File: Contains the schema details and the location of the data.
o Data File: Created in the Dataset folder mentioned in the configuration file.
4. A dataset (.ds) file cannot be opened directly; you have to use alternative ways, such as the Data Set Management utility in the client tools (Designer and Manager) or the command-line ORCHADMIN utility.
Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
2. You can view the data, but only in the order defined by the partitioning scheme.
3. A fileset creates a .fs file, and the .fs file is stored in ASCII format, so you can open it directly.
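As noted above, the command-line ORCHADMIN utility is one way to work with dataset (.ds) files. A minimal sketch, assuming APT_CONFIG_FILE is already set and the dataset path is a placeholder:
$> orchadmin describe /data/ds/customers.ds   # show the schema and partition layout of the dataset
$> orchadmin dump /data/ds/customers.ds       # print the dataset records to standard output
$> orchadmin rm /data/ds/customers.ds         # remove the descriptor file and its underlying data files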
We can use the palette to drag and drop connectors and operators onto the designer canvas.
We can link nodes by selecting the previous node and dropping the next node, or by drawing a link between them.
We can edit stage properties on the sidebar and make changes to the schema in the editor.
We can zoom in and zoom out using the mouse, and leverage the mini-map on the lower right of the canvas to navigate. This is very useful when you have a very large job with tens or hundreds of stages.
No need to migrate jobs - You do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface.
No thick client to maintain - Getting rid of a thick client means getting rid of keeping up with the latest version of the software, upgrading servers, and purchasing Citrix licenses. IBM DataStage Flow Designer is accessed through the web-based interface instead.
Easily work with your favorite jobs - You can mark your favorite jobs in the Jobs
Dashboard, and have them automatically show up on the welcome page. This gives you
fast, one-click access to jobs that are typically used for reference, saving you navigation
time.
Easily continue working where you left off - Your recent activity automatically shows
up on the welcome page. This gives you fast, one-click access to jobs that you were
working on before, so you can easily start where you left off in the last session.
Efficiently search for any job - Many organizations have thousands of DataStage jobs.
You can very easily find your job with the built-in type-ahead Search feature on the Jobs
Dashboard.
Cloning a job - Instead of always starting Job Design from scratch, you can clone an
existing job on the Jobs Dashboard and use that to jump-start your new Job Design.
Automatic metadata propagation - IBM DataStage Flow Designer has a powerful feature to automatically propagate metadata. Once you add a source connector to your job and link it to an operator, the operator automatically inherits the metadata. You do not have to specify the metadata in each stage of the job.
Storing your preferences - You can easily customize your viewing preferences and have
the IBM DataStage Flow Designer automatically save them across sessions.
Saving a job - IBM DataStage Flow Designer allows you to save a job in any folder. The job is saved as a DataStage job in the repository, alongside other jobs that might have been created with the DataStage Designer client.
Seeing all compilation errors at once - The traditional client shows compilation errors one at a time. Large jobs with many stages can take longer to troubleshoot in this situation. IBM DataStage Flow Designer highlights all errors and gives you a way to see the problem with a quick hover over each stage, so you can fix multiple errors before recompiling.
Viewing job status - You can see the status of your job on the new user interface. You can also view the Job Log.
The HBase connector is used to connect to tables stored in the HBase database and perform operations such as reading data from and writing data to HBase tables, including reading data in parallel mode.
Hive connector supports modulus partition mode and minimum-maximum partition mode during
the read operation.
The Kafka connector has been enhanced with the following new capabilities:
Continuous mode, where incoming topic messages are consumed without stopping the connector.
After the specified record count is reached, an end-of-wave marker is sent to the output link.
The File connector has been enhanced with the following new capabilities:
New data types, such as Date, are supported for reading and writing Parquet formatted files.
InfoSphere Information Server is capable of scaling to meet any information volume requirement
so that companies can deliver business results faster and with higher quality results. InfoSphere
Information Server provides a single unified platform that enables companies to understand,
cleanse, transform, and deliver trustworthy and context-rich information.
16. What are the different Tiers available in the InfoSphere Information Server?
In the InfoSphere Information Server, four tiers are available:
1. Client Tier
2. Engine Tier
3. Services Tier
4. Metadata Repository Tier
The client tier includes the client programs and consoles that are used for development and
administration and the computers where they are installed.
The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.
The services tier includes the application server, product modules, and product services, along with the computer where those components are installed. It provides both common and module-specific services.
The metadata repository tier includes the metadata repository, the InfoSphere Information
Analyzer analysis database (if installed), and the computer where these components are installed.
The metadata repository contains the shared metadata, data, and configuration information for
InfoSphere Information Server product modules. The analysis database stores extended analysis
data for InfoSphere Information Analyzer.
DataStage provides the elements that are necessary to build data integration and transformation
flows.
Stages
Links
Jobs
Table definitions
Containers
Sequence jobs
Projects
Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of
functionality that performs either a simple or advanced data integration task. Stages represent
the processing steps that will be performed on the data.
23. What are Links in Datastage?
A link is a representation of a data flow that joins the stages in a job. A link connects data sources
to processing stages, connects processing stages to each other, and also connects those
processing stages to target systems. Links are like pipes through which the data flows from one
stage to the next.
Jobs include the design objects and compiled programmatic elements that can connect to data
sources, extract and transform that data, and then load that data into a target system. Jobs are
created within a visual paradigm that enables instant understanding of the goal of the job.
A sequence job is a special type of job that you can use to create a workflow by running other jobs
in a specified order. This type of job was previously called a job sequence.
Table definitions specify the format of the data that you want to use at each stage of a job. They
can be shared by all the jobs in a project and between all projects in InfoSphere DataStage.
Typically, table definitions are loaded into source stages. They are sometimes loaded into target
stages and other stages.
Containers are reusable objects that hold user-defined groupings of stages and links. Containers
create a level of reuse that allows you to use the same set of logic several times while reducing
the maintenance. Containers make it easy to share a workflow because you can simplify and
modularize your job designs by replacing complex areas of the diagram with a single container.
A project is a container that organizes and provides security for objects that are supplied, created,
or maintained for data integration, data profiling, quality monitoring, and so on.
InfoSphere DataStage brings the power of parallel processing to the data extraction and
transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data
pipelining and data partitioning, allowing you to design an integration process without concern
for data volumes or time constraints, and without any requirements for hand-coding.
1. Data pipelining
2. Data partitioning
Data pipelining is the process of extracting records from the data source system and moving them
through the sequence of processing functions that are defined in the data flow that is defined by
the job. Because records are flowing through the pipeline, they can be processed without writing
the records to disk.
Data partitioning is an approach to parallelism that involves breaking the records into partitions,
or subsets of records. Data partitioning generally provides linear increases in application
performance.
When you design a job, you select the type of data partitioning algorithm that you want to use
(hash, range, modulus, and so on). Then, at runtime, InfoSphere DataStage uses that selection for
the number of degrees of parallelism that are specified dynamically at run time through the
configuration file.
A single stage might correspond to a single operator, or a number of operators, depending on the
properties you have set, and whether you have chosen to partition or collect or sort data on the
input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will
sometimes optimize operators out if they are judged to be superfluous, or insert other operators
if they are needed for the logic of the job.
Players are the workhorse processes in a parallel job. There is generally a player for each operator
on each node. Players are the children of section leaders; there is one section leader per
processing node. Section leaders are started by the conductor process running on the conductor
node (the conductor node is defined in the configuration file).
36. What are the two major ways of combining data in an InfoSphere DataStage
Job? How do you decide which one to use?
The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage.
1. Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of a manageable size or are pre-sorted, Join is the preferred solution.
2. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory.
37. What is the advantage of using modular development in DataStage?
Modular development techniques should be used in job designs to maximize the reuse of parallel jobs and components, which saves development time.
InfoSphere DataStage automatically performs buffering on the links of certain stages. This is
primarily intended to prevent deadlock situations arising (where one stage is unable to read its
input because a previous stage in the job is blocked from writing to its output).
Data is imported into and exported from DataStage data sets using the following operators:
The import operator: imports one or more data files into a single data set.
The export operator: exports a data set to one or more data files.
40. What is the collection library in Datastage?
The collection library is a set of related operators that are concerned with collecting partitioned
data.
The Ordered collector reads all records from the first partition, then all records from the second
partition, and so on. This collection method preserves the sorted order of an input data set that
has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as
well as the partitions themselves, are ordered.
The round-robin collector reads a record from the first input partition, then from the second
partition, and so on. After reaching the last partition, the collector starts over. After reaching the
final record in any partition, the collector skips that partition.
The sortmerge collector reads records in an order based on one or more fields of the record. The
fields used to define record order are called collecting keys.
aggtorec restructure operator groups records that have the same key-field values into an output
record
field_export restructure operator combines the input fields specified in your output schema into
a string- or raw-valued field
47. What is the field_import restructure operator and what does it do?
field_import restructure operator exports an input string or raw field to the output fields specified
in your import schema.
makesubrec restructure operator combines specified vector fields into a vector of subrecords
makevect restructure operator combines specified fields into a vector of fields of the same type
promotesubrec restructure operator converts input sub-record fields to output top-level fields
splitsubrec restructure operator separates input sub-records into sets of output top-level vector
fields
splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly-
named top-level fields
tagbatch restructure operator converts tagged fields into output records whose schema supports
all the possible fields of the tag cases.
The easiest way to display the first line of a file is using the [head] command.
$> head -1 file.txt
If you specify [head -2] then it would print the first 2 records of the file.
Another way is to use the [sed] command. [sed] is a very powerful stream editor which can be used for various text manipulation purposes like this.
$> sed '2,$ d' file.txt
To display the last line of a file using the [sed] command, here is what you should write:
$> sed -n '$ p' test
To display the n-th line of a file, use $> sed -n '<n> p' file, replacing <n> with the actual line number. So if you want to print the 4th line, the command will be:
$> sed -n '4 p' test
Of course you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1
Again, you need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be:
$> head -4 file.txt | tail -1
We already know how [sed] can be used to delete a certain line from the output, by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
But the issue with the above command is that it just prints out all the lines except the first line of the file on the standard output. It does not really change the file in place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output to some other file and then rename it back to the original file, like below:
$> sed '1 d' file.txt > new_file.txt
Or, you can use the inbuilt [sed] switch '-i', which changes the file in place. See below:
$> sed -i '1 d' file.txt
59. How to remove the last line/ trailer from a file in Unix script?
Always remember that the [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt
If you want to remove lines <m> to <n> from a given file, you can accomplish the task in a similar manner. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete lines 5 to 7 from the file file.txt.
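Building on the examples above, both the header (first line) and the trailer (last line) can be stripped in a single in-place pass, assuming a [sed] that supports the '-i' switch:
$> sed -i '1d;$d' file.txt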
The main characteristics of DataStage are:
The tool can be deployed on local servers or on the cloud, as per the requirement.
It is easy to use and can efficiently increase the speed and flexibility of data integration.
It can support and access big data through JSON support, a JDBC integrator, and distributed file systems.
Q2. Describe the architecture of DataStage.
Ans. DataStage has a client-server model as its architecture. There are different
architecture types according to the version. The components of the client-server
architecture are:
Client components
Servers
Stages
Table definitions
Containers
Projects
Jobs
Q3. What are DataStage jobs?
Ans. DataStage jobs determine the data sources, the required transformations, and the data destinations. Jobs are compiled to create reusable components and parallel job flows.
Q4. What is the need for exception activity in DataStage?
Ans. Exception activity is important in DataStage because, whenever an unknown error occurs while executing the job sequencer, every stage after the exception activity is executed.
Q5. How to perform usage analysis in Datastage?
Ans. You can perform Usage Analysis in a few clicks. First, you need to launch the
Datastage Manager and then, right-click the job. After that, select Usage Analysis.
Q6. How can you perform date conversion in IBM Datastage?
Ans. DataStage performs date conversion using the Oconv() and Iconv() functions, i.e. Oconv(Iconv(Fieldname, "Existing Date Format"), "New Date Format").
Q7. What are the different types of Lookups in the Datastage?
Ans. Datastage has two types of Lookups: Normal lkp and Sparse lkp.
Normal lkp: First, data is saved in the memory and then the lookup is performed.
Sparse lkp: Data is directly saved into the database. Sparse lkp is faster than the
Normal lkp.
Q8. What is APT_CONFIG?
Ans. APT_CONFIG is the environment variable that identifies the *.apt file in the tool. It
also stores the disk storage information, node information and scratch information.
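As an illustration only (the node name, fastname, and resource paths below are assumptions for a simple single-node setup), the *.apt configuration file pointed to by APT_CONFIG has a structure along these lines:
{
  node "node1"
  {
    fastname "engine-host"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/tmp/scratch" {pools ""}
  }
}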
Q9. Explain OConv () and IConv () functions?
Ans. In Datastage, OConv () and IConv() are two functions that are used for converting
from one format to another etc. OConv () converts formats for users to understand
whereas IConv () converts formats for the system to understand.
Q10. What is the difference between Sequential and Hash files?
Ans. The sequential file does not have any key value for saving the data. Hash file
saves the data on the hash algorithm and on hash key value due to which searching in
Hash file is faster than in sequential file.
Q11. Define flow designer in IBM DataStage?
Ans. Flow designer is the web-based user interface that is used for creating, editing,
loading, and running jobs in IBM DataStage. It has the following features:
Performs jobs with a large number of stages.
No need for migrating the jobs to use the flow designer.
Add/remove connectors and operators through the palette onto the designer canvas using the drag-and-drop feature.
Q12. Differentiate between Operational Datastage (ODS) and Data
warehouse?
Ans. Operational Datastage is a mini data warehouse since it does not contain
information for more than one year. A data warehouse contains in-depth information
related to the entire business.
Q13. What is the use of NLS in Datastage?
Ans. National language support (NLS) can include multiple languages in the data as per
the data warehouse processing requirements.
Q14. What are the different types of hash files in the Datastage?
Ans. DataStage has two types of hash files i.e. Static and Dynamic Hash File. Static
hash file is used in cases where limited data has to be loaded within the target
database. Dynamic hash file is used in case of loading an unknown amount of data from
the source file.
Q15. Why do we need to use the surrogate key instead of the unique
key?
Ans. The surrogate key is used instead of a unique key as it can retrieve the data faster
through the retrieval operation.
Q16. Which command line is used for running a job in the tool?
Ans. You can run jobs using ‘dsjob -run -jobstatus <projectname> <jobname>’
command.
Q17. How can you find bugs in a job sequence?
Ans. You can use DataStage Director for finding bugs in the job sequence.
Q18. Name the types of views in a Datastage Director?
Ans. There are three view types in a Datastage Director: Job View, Log View and
Status View.
Q19. How can you improve performance in Datastage?
Ans. It is advisable not to use more than 20 stages in each job. It is better to use
another job if you have to use more than 20 stages.
Q20.Which functions are used for importing and exporting DS jobs?
Ans. ‘dsimport.exe’ is used for importing and ‘dsexport.exe’ is used for exporting DS
jobs.
Q21. How can you fix truncated data errors in Datastage?
Ans. You can fix truncated data errors by using the ENVIRONMENT variable
‘IMPORT_REJECT_STRING_FIELD_OVERRUN’.
Q22. What is the difference between a data file and a descriptor file?
Ans. A data file contains the data and a descriptor file contains the description of this
data contained within the data file.
Q23. List down some functions that you can execute using ‘dsjob’
command.
Ans. You can execute the following functions using the ‘dsjob’ command:
$dsjob -run: to run DataStage job
$dsjob -jobid: for providing the job information
$dsjob -stop: to stop the job that is currently present in the process
$dsjob -report: to display complete job report
$dsjob -llinks: for listing all the links
$dsjob -lprojects: for listing all present projects
$dsjob -ljobs: to list all jobs present in the project
$dsjob -lstages: for listing every stage of the current job
$dsjob -lparams: for listing all parameters of the job
$dsjob -projectinfo: to retrieve project information
$dsjob -jobinfo: for retrieving information of the job
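A short, hedged sketch of how some of these switches could be chained together in practice; the project and job names are placeholders:
$> dsjob -lprojects                                   # list the projects on the server
$> dsjob -ljobs dstage1                               # list the jobs in project dstage1
$> dsjob -run -jobstatus dstage1 Job_Load_Customers   # run the job and wait for its finishing status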
Q24. What are Routines and how can we call them in Datastage jobs?
Ans. Routines are the collection of functions defined by the DS manager. There are
three types of routines including parallel, mainframe and server routines. You can call a
routine from the transformer stage of the tool.
Q25. How can you write parallel routines?
Ans. You can write parallel routines in the C and C++ compilers.
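A rough sketch of how a parallel routine source file might be compiled into an object that a Parallel Routine definition can reference; the file name is hypothetical, and the real compiler and flags must match the ones DataStage was configured with (see the APT_COMPILER and APT_COMPILEOPT environment variables):
$> g++ -O -fPIC -c MyStringUtils.cpp -o MyStringUtils.o   # compile the routine as position-independent object code
$> g++ -shared MyStringUtils.o -o libMyStringUtils.so     # only needed if the routine is registered as a library rather than an object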
Q26. Why do we use an HBase connector in Datastage?
Ans. HBase connector is used for connecting databases and tables that are present in
the HBase database. Through the HBase connector, you can read and write the data in
the HBase database. You can also read the data in parallel mode.
Q27. What is the importance of Data partitioning?
Ans. Data partitioning involves the process of segmenting records into partitions for
processing. This increases processing efficiency in a linear model. Overall, it is a
parallel approach for data processing.
Q28. How does Datastage manage rejected rows?
Ans. Rejected rows are managed through constraints in the transformer. There are two
ways to do so. Either you can place the rejected rows in the transformer’s properties or
you can create temporary storage for these rejected rows using the REJECTED
command.
Q29. What is the use of link partitioner and link connector in Datastage?
Ans. The Link Partitioner divides the data into different parts through the partitioning methods. The Link Collector collects data from the different partitions into a single stream and saves it into the target table.
Q30. What are the tiers of Datastage InfoSphere Information Server?
Ans. The four tiers in the InfoSphere Information Server are:
Client tier: used for development and administration of computers using
client programs and consoles.
Services tier: to provide standard and module-specific services. It contains
an application server, product modules as well as product services.
Engine tier: it has a set of logical components which are used for running
the jobs and other tasks for product modules.
Metadata repository tier: it includes metadata repository and analysis
database. The repository is used for sharing metadata, shared data and
configuration information.
Q31. How is performance tuning of DataStage jobs done?
Ans. For performance tuning, first of all, one needs to select the appropriate configuration file. After that, select the right partitioning and buffer memory. We then need to deal with data sorting and the handling of null values. Use copy, filter, or modify stages and avoid using the transformer where possible. After that, reduce the propagation of redundant metadata between stages.
Q32. How does DataStage handle merging?
Ans. The primary key column in the tables can be used to merge or join two or more
tables.
Q33. What’s the distinction between Datastage 7.5 and 7.0?
Ans. Many new stages, such as Procedure Stage, Command Stage, Generate Report,
and so on, have been added to Datastage 7.5, which were not present in the 7.0
version.