InfoSphere DataStage
Parallel Framework
Standard Practices
Develop highly efficient and scalable
information integration applications
Julius Lerm
Paul Christensen
ibm.com/redbooks
International Technical Support Organization
September 2010
SG24-7830-00
Note: Before using this information and the product it supports, read the information in
“Notices” on page xiii.
This edition applies to Version 8, Release 1 of IBM InfoSphere Information Server (5724-Q36)
and Version 9, Release 0, Modification 1 of IBM InfoSphere Master Data Management Server
(5724-V51), and Version 5.3.2 of RDP.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . xix
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
Chapter 3. Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Directory structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 Metadata layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.2 Data, install, and project directory structures . . . . . . . . . . . . . . . . . . 23
3.1.3 Extending the DataStage project for external entities . . . . . . . . . . . . 24
3.1.4 File staging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Naming conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Key attributes of the naming convention . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Designer object layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Documentation and metadata capture . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4 Naming conventions by object type . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Documentation and annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Working with source code control systems . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.1 Source code control standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Using object categorization standards . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.3 Export to source code control system . . . . . . . . . . . . . . . . . . . . . . . . 51
9.1.4 Conditionally aborting jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.1.5 Using environment variable parameters . . . . . . . . . . . . . . . . . . . . . 142
9.1.6 Transformer decimal arithmetic. . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.1.7 Optimizing Transformer expressions and stage variables . . . . . . . 143
9.2 Modify stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.2.1 Modify and null handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.2.2 Modify and string trim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.3 Filter and Switch stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Chapter 14. Connector stage guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . 221
14.1 Connectors and the connector framework . . . . . . . . . . . . . . . . . . . . . . 222
14.1.1 Connectors in parallel jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
14.1.2 Large object (LOB) support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
14.1.3 Reject Links. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
14.1.4 Schema reconciliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
14.1.5 Stage editor concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
14.1.6 Connection objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
14.1.7 SQL Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
14.1.8 Metadata importation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
14.2 ODBC Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
14.3 WebSphere MQ Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
14.4 Teradata Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
14.4.1 Teradata Connector advantages. . . . . . . . . . . . . . . . . . . . . . . . . . 237
14.4.2 Parallel Synchronization Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
14.4.3 Parallel Transport operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
14.4.4 Cleanup after an aborted load or update . . . . . . . . . . . . . . . . . . . 238
14.4.5 Environment variables for debugging job execution . . . . . . . . . . . 239
14.4.6 Comparison with existing Teradata stages . . . . . . . . . . . . . . . . . . 239
14.5 DB2 Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
14.5.1 New features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
14.5.2 Using rejects with user-defined SQL . . . . . . . . . . . . . . . . . . . . . . . 244
14.5.3 Using alternate conductor setting . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.5.4 Comparison with existing DB2 stages. . . . . . . . . . . . . . . . . . . . . . 246
14.6 Oracle Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
14.6.1 New features and improvements . . . . . . . . . . . . . . . . . . . . . . . . . 251
14.6.2 Comparison with Oracle Enterprise . . . . . . . . . . . . . . . . . . . . . . . 252
14.7 DT stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
14.8 SalesForce Connector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
14.9 Essbase connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
14.10 SWG Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
16.6.10 Rejecting messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
16.6.11 Database contention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
16.6.12 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
16.6.13 Design patterns to avoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
16.7 InfoSphere Information Services Director . . . . . . . . . . . . . . . . . . . . . . . 346
16.7.1 The scope of this section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
16.7.2 Design topology rules for always-on ISD jobs. . . . . . . . . . . . . . . . 351
16.7.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
16.7.4 Synchronizing database stages with ISD output . . . . . . . . . . . . . 353
16.7.5 ISD with DTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
16.7.6 ISD with connectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
16.7.7 Re-partitioning in ISD jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
16.7.8 General considerations for using ISD jobs . . . . . . . . . . . . . . . . . . 359
16.7.9 Selecting server or EE jobs for publication through ISD . . . . . . . . 361
16.8 Transactional support in message-oriented applications . . . . . . . . . . . 362
16.9 Payload processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
16.10 Pipeline Parallelism challenges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
16.10.1 Key collisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
16.10.2 Data stubbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
16.10.3 Parent/Child processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
16.11 Special custom plug-ins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
16.12 Special considerations for QualityStage . . . . . . . . . . . . . . . . . . . . . . . 373
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Java, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
The advice provided in this document is the result of the combined proven
experience from a number of expert practitioners in the field of high performance
information integration, evolved over several years.
Document history
Prior to this Redbooks publication, DataStage release documentation written by Julius Lerm and dated April 15, 2009 was made available. This book updates and extends the information in that initial release documentation, covering terminology, import/export mechanisms, and job parameter handling (including parameter sets).
Paul Christensen is a Technical Architect and member of the worldwide IBM
Information Agenda™ Architecture team. With 19 years of experience in
enterprise data management and parallel computing technologies, he has led
the successful design, implementation, and management of large-scale Data
Integration and Information Management solutions using the IBM Information
Agenda and partner portfolios. Paul's experience includes early hardware-based
parallel computing platforms, massively parallel databases including Informix®
and DB2®, and the parallel framework of IBM Information Server and DataStage.
To facilitate successful customer and partner deployments using IBM Information
Server, he has helped to develop standard practices, course material, and
technical certifications. Paul holds a Bachelor’s degree in Electrical Engineering
from Drexel University, and is an IBM Certified Solution Developer.
Other Contributors
We would like to give special thanks to the following contributing authors whose
input added significant value to this publication.
Mike Carney - Technical Architect, IBM Software Group, Information
Management, Westford, MA
Tony Curcio - DataStage Product Manager, IBM Software Group, Information
Management, Charlotte, NC
Patrick Owen - Technical Architect, IBM Software Group, Information
Management, Little Rock, AR
Steve Rigo - Technical Architect, IBM Software Group, Information Management,
Atlanta, GA
Ernie Ostic - Technical Sales Specialist, IBM Software Group, Worldwide Sales,
Newark, NJ
Paul Stanley - Product Development Engineer, IBM Software Group, Information
Management, Boca Raton, FL
In the following sections we thank others who have contributed to the
development and publication of this IBM Redbooks publication.
From IBM Locations Worldwide
Tim Davis - Executive Director, Information Agenda Architecture Group, IBM
Software Group, Information Management, Littleton, MA
Susan Laime - IM Analytics and Optimization Software Services, IBM Software
Group, Information Management, Littleton, MA
Margaret Noel - Integration Architect, IBM Software Group, Information
Management, Atlantic Beach, FL
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/pages/IBM-Redbooks/178023492563?ref=ts
Follow us on Twitter:
http://twitter.com/ibmredbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the
IBM Redbooks weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
IBM InfoSphere DataStage integrates data across multiple, high-volume data sources and target applications. It integrates data on demand with a high-performance parallel framework, extended metadata management, and enterprise connectivity.
With these components and a great set of standard practices, you are on your
way to a highly successful data integration effort. To help you further along the
way, this book also provides a brief overview of a number of services and
education offerings by IBM.
IBM InfoSphere Information Server 8 is installed in layers that are mapped to the
physical hardware. In addition to the main product modules, product components
are installed in each tier as needed.
From a job design perspective, the product has interesting new features:
New stages, such as Database Connectors, Slowly Changing Dimensions,
and Distributed Transaction.
Job Parameter Sets
Balanced optimization, a capability that automatically or semi-automatically
rewrites jobs to make use of RDBMS capabilities for transformations.
Information Server also provides new features for job developers and
administrators, such as a more powerful import/export facility, a job comparison
tool, and an impact analysis tool.
(Figure: InfoSphere Information Server tiers, showing the Client tier, InfoSphere QualityStage, product-specific services modules, the Application Server, Service Agents (Logging, Communications (ASB), Job Monitor, Resource Tracker), and the Metadata Repository tier.)
Staff Augmentation and Mentoring: Whether through workshop delivery, project leadership, or mentored augmentation, the Professional Services staff of IBM Information Platform and Solutions use IBM methodologies, standard practices, and experience developed through thousands of successful engagements in a wide range of industries and government entities.
Learning Services: IBM offers a variety of courses covering the IBM Information Management product portfolio. The IBM blended learning approach is based on the principle that people learn best when provided with a variety of learning methods that build upon and complement each other. With that in mind, courses are delivered through a variety of mechanisms: classroom, on-site, and Web-enabled FlexLearning.
Certification: IBM offers a number of professional certifications through independent testing centers worldwide. These certification exams provide a reliable, valid, and fair method of assessing product skills and knowledge gained through classroom and real-world experience.
Client Support Services: IBM is committed to providing our customers with reliable technical support worldwide. All client support services are available to customers who are covered under an active IBM InfoSphere maintenance agreement. Our worldwide support organization is dedicated to assuring your continued success with IBM InfoSphere products and solutions.
Virtual Services: The low cost Virtual Services offering is designed to supplement the global IBM InfoSphere delivery team, as needed, by providing real-time, remote consulting services. Virtual Services has a large pool of experienced resources that can provide IT consulting, development, migration, and training services to customers for IBM InfoSphere DataStage.
Table 1-2 lists the workshop offerings and descriptions for project startup.
Information Exchange and Discovery Workshop: Targeted for clients new to the IBM InfoSphere product portfolio, this workshop provides high-level recommendations on how to solve a customer's particular problem. IBM analyzes the data integration challenges outlined by the client and develops a strategic approach for addressing those challenges.
Requirements Definition, Architecture, and Project Planning Workshop: Guiding clients through the critical process of establishing a framework for a successful future project implementation, this workshop delivers a detailed project plan, as well as a project blueprint. These deliverables document project parameters, current and conceptual end states, network topology, and data architecture and hardware and software specifications. It also outlines a communication plan, defines scope, and captures identified project risk.
Iterations® 2: IBM Iterations 2 is a framework for managing enterprise data integration projects that integrates with existing customer methodologies. Iterations 2 is a comprehensive, iterative, step-by-step approach that leads project teams from initial planning and strategy through tactical implementation. This workshop includes the Iterations 2 software, along with customized mentoring.
Installation and Configuration Workshop: This workshop establishes a documented, repeatable process for installation and configuration of IBM InfoSphere Information Server components. This might involve review and validation of one or more existing Information Server environments, or planning, performing, and documenting a new installation.
Information Analysis Workshop: This workshop provides clients with a set of standard practices and a repeatable methodology for analyzing the content, structure, and quality of data sources using a combination of IBM InfoSphere Information Analyzer, QualityStage, and Audit stage.
Data Flow and Job Design Standard Practices Workshop: This workshop helps clients establish standards and templates for the design and development of parallel jobs using IBM InfoSphere DataStage through practitioner-led application of IBM standard practices to a client's environment, business, and technical requirements. The delivery includes a customized standards document as well as custom job designs and templates for a focused subject area.
Data Quality Management Standard Practices Workshop: This workshop provides clients with a set of standard processes for the design and development of data standardization, matching, and survivorship processes using IBM InfoSphere QualityStage. The data quality strategy formulates an auditing and monitoring program to ensure on-going confidence in data accuracy, consistency, and identification through client mentoring and sharing of IBM standard practices.
Administration, Management, and Production Automation Workshop: This workshop provides customers with a customized tool kit and a set of proven standard practices for integrating IBM InfoSphere DataStage into a client's existing production infrastructure (monitoring, scheduling, auditing/logging, change management) and for administering, managing and operating DataStage environments.
Health Check Evaluation: This workshop is targeted for clients currently engaged in IBM InfoSphere development efforts that are not progressing according to plan, or for clients seeking validation of proposed plans prior to the commencement of new projects. It provides review of, and recommendations for, core Extract Transform and Load (ETL) development and operational environments by an IBM expert practitioner.
Sizing and Capacity Planning Workshop: The workshop provides clients with an action plan and set of recommendations for meeting current and future capacity requirements for data integration. This strategy is based on analysis of business and technical requirements, data volumes and growth projections, existing standards and technical architecture, and existing and future data integration projects.
Performance Tuning Workshop: This workshop guides a client's technical staff through IBM standard practices and methodologies for review, analysis, and performance optimization using a targeted sample of client jobs and environments. This workshop identifies potential areas of improvement, demonstrates IBM processes and techniques, and provides a final report with recommended performance modifications and IBM performance tuning guidelines.
High-Availability Architecture Workshop: Using IBM InfoSphere Standard Practices for high availability, this workshop presents a plan for meeting a customer's high availability requirements using the parallel framework of IBM InfoSphere DataStage. It implements the architectural modifications necessary for high availability computing.
Grid Computing Discovery, Architecture and Planning Workshop: This workshop teaches the planning and readiness efforts required to support a future deployment of the parallel framework of IBM InfoSphere DataStage on Grid computing platforms. This workshop prepares the foundation on which a follow-on grid installation and deployment is executed, and includes hardware and software recommendations and estimated scope.
Grid Computing Installation and Deployment Workshop: In this workshop, the attendee installs, configures, and deploys the IBM InfoSphere DataStage Grid Enabled Toolkit in a client's grid environments, and provides integration with Grid Resource Managers, configuration of DataStage, QualityStage, and Information Analyzer.
For more details on any of these IBM InfoSphere Professional Services offerings, and to find a local IBM Information Management contact, visit the following web page:
http://www.ibm.com/software/data/services/ii.html
Figure 1-4 Services workshops for the IBM InfoSphere DataStage Parallel Framework
(Figure: typical job flow with error handling. The phases Read Input Data, Perform Validations, Perform Transformations, and Perform Load and/or Create Intermediate Datasets run in sequence. After each phase the flow tests Halt on Error?: Yes exits with failure, No creates limited reject files and continues. The final phase also exits with failure if the job exceeds the job warning threshold.)
These job sequences control the interaction and error handling between
individual DataStage jobs, and together form a single end-to-end module in a
DataStage application.
Job sequences also provide the recommended level of integration with external
schedulers (such as AutoSys, Cron, CA7, and so forth). This provides a level of
granularity and control that is easy to manage and maintain, and provides an
appropriate use of the respective technologies.
Transformation: Data must not be changed by any method unless jobs transforming an entire subject area have successfully completed, or where the resource requirements for data transformation are large. Candidates are reference tables upon which all subsequent jobs and the current data target (usually a database) depend, or long running provisioning processes. This prevents partial replacement of reference data in the event of transformation failure, and preserves the compute effort of long running transformation jobs.
Hybrid: Data can be changed regardless of success or failure. Non-reference data or independent data are candidates. The data target (usually a database) must allow subsequent processing of error or reject rows and tolerate partial or complete non-update of targets. Neither the transformation nor provisioning requirements are large.
Provisioning: Data must not be changed by any method unless jobs transforming an entire subject area have successfully completed, or where the resource requirements for data provisioning are large. Candidates are any target where either all sources have been successfully transformed or where the resources required to transform the data must be preserved in the event of a load failure.
If the entire target table is regenerated with each run, and no other external or
subsequent processing alters the contents of the target table, the output dataset
qualifies as a write-through cache that can be used by subsequent DataStage
jobs instead of reading the entire target table.
When we say “subsequent jobs” we mean any jobs executed as part of the same
transformation cycle, until the target table is updated in the target database by a
provisioning job. A transformation cycle might correspond, for instance, to daily,
weekly, or monthly batch processes.
The content of the Seg1_DS parallel dataset can be used by subsequent jobs, in
the same transformation cycle, as input to other joins, lookup, or transform
operations, for instance. Those jobs can thus avoid re-extracting data from that
table unnecessarily. These operations might result in new versions of this
dataset. Those new versions replace Seg1_DS (with different names) and are
used by other subsequent downstream jobs.
Chapter 3. Standards
Establishing consistent development standards helps to improve developer
productivity and reduce ongoing maintenance costs. Development standards can
also make it easier to integrate external processes such as automated auditing
and reporting, and to build technical and support documentation.
By default, these directories (except for file staging) are created during
installation as subdirectories under the base InfoSphere DataStage installation
directory.
Data file systems store individual segment files of DataStage parallel datasets.
Scratch file systems are used by the DataStage parallel framework to store temporary files such as sort and buffer overflow files.
(Figure: install, scratch, and data file systems. The install file system (the figure notes 1 Gigabyte) holds /Ascential with /patches, /DSEngine, /Configurations, and /Projects, the latter containing /<phase>_Project_A through /<phase>_Project_Z. The scratch file systems /Scratch_<phase>0 through /Scratch_<phase>N and the data file systems /Data_<phase>0 through /Data_<phase>N each contain a subdirectory per project, /<phase>_Project_A through /<phase>_Project_Z. Project naming standards include the deployment phase (dev, it, uat, prod) prefix, as indicated by <phase>.)
Note: The file system where the Project_Plus hierarchy is stored must be
expandable without requiring destruction and re-creation.
Project_Plus directory structure
Figure 3-2 shows typical components and the structure of the Project_Plus
directory hierarchy.
/dev_Project_A: Subdirectory created for each DataStage project (the actual directory name dev_Project_A should match the corresponding DataStage Project Name).
/bin: Location of custom programs, DataStage routines, BuildOps, utilities, and shells.
/params: Location of parameter files for automated program control, a backup copy of dsenv, and backup copies of DSParams:$ProjectName project files.
Project_Plus environment variables
The Project_Plus directory structure is made to be transparent to the DataStage
application, through the use of environment variable parameters used by the
DataStage job developer. Environment variables are a critical portability tool that
enables DataStage applications to be deployed through the life cycle without any
code changes.
In parallel job designs, the Project_Plus parameters are added as job parameters
using the $PROJDEF default value. These parameters are used in the stage
properties to specify the location of DataSet header files, job parameter files,
orchestrate schemas, and external scripts in job flows.
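As a minimal sketch of this practice (the parameter name $PROJECT_PLUS_DATASETS follows the Project_Plus convention discussed here, but the file name is illustrative and whether a path separator is needed depends on how the parameter value is defined), a Data Set stage file property might be set to:
#$PROJECT_PLUS_DATASETS#/customer_dim.ds
At run time the environment variable parameter resolves to the Project_Plus dataset directory for the current deployment phase, so the same job design works unchanged in development, test, and production.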
Mount this directory on all members of the cluster after installing IBM InfoSphere
DataStage, but before creating any DataSets.
However, to completely isolate support files in a manner that is easy to assign to
separate file systems, an additional level of directory structure can be used to
enable multiple phases of application deployment (development, integration test,
user acceptance test, and production) as appropriate. If the file system is not
shared across multiple servers, not all of these development phases might be
present on a local file system.
In each deployment directory, files are separated by project name. See Table 3-4.
/dev_Project_A: Subdirectory created for each DataStage project (the actual directory name dev_Project_A should match the corresponding DataStage Project Name); location of source data files, target data files, error and reject files.
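As a sketch only (the /Staging root name is an assumption; actual mount points vary by site), the resulting hierarchy of deployment phase and project directories might look like:
/Staging/dev/dev_Project_A
/Staging/it/it_Project_A
/Staging/uat/uat_Project_A
/Staging/prod/prod_Project_A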
3.2 Naming conventions
As a graphical development environment, DataStage offers (within certain restrictions) flexibility to developers when naming various objects and components used to build a data flow. By default, the Designer tool assigns default names based on the object type and the order in which the item is placed on the design canvas. Though the default names might create a functional data flow,
they do not facilitate ease of maintenance over time, nor do they adequately
document the business rules or subject areas. Providing a consistent naming
standard is essential to achieve the following results:
Maximize the speed of development
Minimize the effort and cost of downstream maintenance
Enable consistency across multiple teams and projects
Facilitate concurrent development
Maximize the quality of the developed application
Increase the readability of the objects in the visual display medium
Increase the understanding of components when seen in external systems
(for example in IBM InfoSphere Business Glossary, Metadata Workbench, or
an XML extract)
Throughout this section, the term standard refers to those principles that are
required. The term guideline refers to recommended, but not required, principles.
In a DataStage job, there might be multiple stages that correlate to a certain area
or task. The naming convention described in the previous section is meant to
reflect that logical organization in a straightforward way. A subject corresponds to
a so-called area in a job, a subject modifier corresponds to a more specific
operation in that area, and the class word indicates the nature of an object.
One example might be a job in which records for data sources such as accounts and customers are prepared and correlated, credit scores are calculated, and results are produced.
A few examples, in which subject, subject modifier and class word are separated
by underscores, are shown in Table 3-5.
In the context of DataStage, the class word is used to identify either a type of
object, or the function that a particular type of object performs. In certain cases
objects can be sub-typed (for example, a Left Outer Join). In these cases the
class word represents the subtype.
For example, in the case of a link object, the class word refers to the functions of
reading, reference (Lookup), moving or writing data (or in a Sequence Job, the
moving of a message).
In the case of a data store the class word refers to the type of data store (as
examples Dataset, Sequential File, Table, View, and so forth).
Where there is no sub classification required, the class word refers to the object.
As an example, a Transformer might be named Data_Block_Split_Tfm.
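For illustration only (these names are hypothetical; the class word abbreviations for specific stage types are listed in Appendix C, "DataStage naming reference" on page 391), names built from a subject, a subject modifier, and a class word might look like:
Account_Customer_Correlate_Tfm (a Transformer that correlates account and customer records)
Credit_Score_Calculate_Tfm (a Transformer that calculates credit scores)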
One benefit of using the subject, subject modifier, class word approach instead of
using the prefix approach, is to enable two levels of sorting or grouping. In
InfoSphere Metadata Workbench, the object type is defined in a separate field.
There is a field that denotes whether the object is a column, a derivation, a link, a
stage, a job design, and so forth. This is the same or similar information that is
carried in a prefix approach. Carrying this information as a separate attribute
enables the first word of the name to be used as the subject matter, allowing sort
either by subject matter or by object type. In addition, the class word approach
enables sub-classification by object type to provide additional information.
For the purposes of documentation, all word abbreviations are referenced by their long form, so that readers become used to saying the name in full even when reading the abbreviation. When creating the object, however, the abbreviated form is used, as with a logical name. This reinforces wider understanding of the subjects.
The key issue is readability. Though DataStage imposes limitations on the type of characters and length of various object names, the standard, where possible, is to separate words by an underscore, which allows clear identification of each word in a name. This is enhanced by also using word capitalization (for example, the first letter of each word is capitalized).
Projects
Each DataStage Project is a standalone repository. It might have a one-to-one relationship with an organization's project of work. This factor can cause terminology issues, especially in teams where both business staff and developers are involved.
Dev: Development
IT: Integration Test
Prod: Production
Development Projects
Shared development projects should contain the phase in the life cycle, the
name, and a version number. Examples are shown in the following list:
Dev_ProjectA_p0 (initial development phase 0…phase N)
Dev_ProjectA_v1 (maintenance)
Dev_ProjectA_v2
Individual developers can be given their own sandbox projects, which should
contain the user ID or initials, the application name, the phase in the life cycle
and a version number. This is difficult to do with 18 characters. The following list
shows some examples:
JoeDev_ProjectA_v2
SueDev_ProjectA_v1
Test projects
Test project names should contain the phase in the life cycle, project name, and
version. The following project names are intended for Integration Testing (IT) and
User Acceptance Testing (UAT):
IT_ProjectA_v1_0 (first release)
IT_ProjectA_v1_1 (patch or enhancement)
UAT_ProjectA_v1_0
Production projects
Although it is preferable to create a new project for each minor and major change
to a production project, making a change to the project name could require
changes to external objects. For example, an enterprise scheduler requires the
project name. Therefore, it is not a requirement that a project name contain
version information.
Using version numbers could allow you to run parallel versions of a DataStage
application, without making changes to the always-on system.
The following list shows examples of acceptable names for the production
project:
Prod_ProjectA_v1
ProjectA_v1
ProjectA
The following examples are project names where the project is single application
focused:
Accounting Engine NAB Development is named Dev_AcctEngNAB_v1_0
Accounting Engine NAB Production is named Prod_AcctEngNAB
The following examples are project names where the project is multiapplication
focused:
Accounting Engine Development or Dev_AcctEngine_v1_0
Accounting Engine Production or Prod_AcctEngine
DataStage 7.5 enforced the top level directory structure for various types of
objects, such as jobs, routines, shared containers, and table definitions.
Developers had the flexibility to define their own directory or category hierarchy
beneath that level.
Figure 3-5 presents the top level view, with a list of default folders. As stated
before, objects are not restricted to a top level folder named after the
corresponding type.
Figure 3-6 shows an example of a custom top-level folder that aggregates
objects of several types.
Information Server 8 maintains the restriction that there can only be a single
object of a certain type with a given name.
Categorization by developer
In development projects, folders might be created for each developer as their
personal sandbox. That is the place where they perform unit test activities on
jobs they are developing.
Again, object names must be unique in a given project for the given object type.
Two developers cannot save a copy of the same job with the same name in their
individual sandbox categories. A unique job name must be given.
Table definition categories
Unlike DataStage 7.5, in which table definitions were categorized using two level
names (based on the data source type and the data source name), Information
Server 8 allows them to be placed anywhere in the repository hierarchy. This is
depicted in Figure 3-8.
When saving temporary TableDefs (usually created from output link definitions to
assist with job creation), developers are prompted for the folder in the “Save
Table Definition As” window. The user must pay attention to the folder location,
as these objects are no longer stored in the Table Definition category by default.
Jobs and job sequences are all held under the Category Directory Structure, of
which the top level is the category Jobs.
A job is suffixed with the class word Job and a job sequence is suffixed with the
class word Seq.
Jobs must be organized under category directories to provide grouping such that
a directory should contain a sequence job and all the jobs that are contained in
that sequence. This is discussed further in “Folder hierarchy” on page 37.
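As an illustrative sketch (the subject-area and job names are hypothetical), such a folder might group a sequence with the jobs it runs:
Jobs/AccountInterface/AccountInterfaceSeq
Jobs/AccountInterface/AccountExtractJob
Jobs/AccountInterface/AccountTransformJob
Jobs/AccountInterface/AccountLoadJob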
Shared containers
Shared containers have the same naming constraints as jobs in that the name
can be long but cannot contain underscores, so word capitalization must be used
for readability. Shared containers might be placed anywhere in the repository
tree and consideration must be given to a meaningful directory hierarchy. When a
shared container is used, a character code is automatically added to that
instance of its use throughout the project. It is optional as to whether you decide
to change this code to something meaningful.
Parameters
A parameter can be a long name consisting of alphanumeric characters and
underscores. The parameter name must be made readable using capitalized
words separated by underscores. The class word suffix is parm.
Links
In a DataStage job, links are objects that represent the flow of data from one
stage to the next. In a job sequence, links represent the flow of a message from
one activity or step to the next.
Stage names
DataStage assigns default names to stages as they are dragged onto the
Designer canvas. These names are based on the type of stage (Object) and a
unique number, based on the order the object was added to the flow. In a job or
job sequence, stage names must be unique.
Instead of using the full object name, a 2, 3, or 4 character abbreviation must be
used for the class word suffix, after the subject name and subject modifier. A list
of frequently-used stages and their corresponding class word abbreviation can
be found in Appendix C, “DataStage naming reference” on page 391.
Data stores
For the purposes of this section, a data store is a physical piece of disk storage
where data is held for a period of time. In DataStage terms, this can be either a
table in a database structure or a file contained in a disk directory or catalog
structure. Data held in a database structure is referred to as either a table or a
view. In data warehousing, two additional subclasses of table might be used:
dimension and fact. Data held in a file in a directory structure is classified
according to its type, for example: Sequential File, Parallel Dataset, Lookup File
Set, and so on.
The concepts of “source” and “target” can be applied in a couple of ways. Every
job in a series of jobs could consider the data it gets in to be a source and the
data it writes out as being a target. However, for the sake of this naming
convention a source is only data that is extracted from an original system. A
target is the data structures that are produced or loaded as the final result of a
particular series of jobs. This is based on the purpose of the project: to move
data from a source to a target.
Data stores used as temporary structures to land data between jobs, supporting
restart and modularity, should use the same names in the originating job and any
downstream jobs reading the structure.
DataStage routines
DataStage BASIC routine names should indicate their function and be grouped in sub-categories by function under a main category that corresponds to the subject area. For example:
Routines/Automation/SetDSParamsFromFile
DataStage custom Transformer routine names should indicate their function and
be grouped in sub-categories by function under a main category that
corresponds to the subject area. For example:
Routines/Automation/DetectTeradataUnicode
Source code, a makefile, and the resulting object for each Custom Transformer
routine must be placed in the Project_Plus source directory. For example:
/Project_Plus/projectA_dev/bin/source
File names
Source file names should include the name of the source database or system
and the source table name or copybook name. The goal is to connect the name
of the file with the name of the storage object on the source system. Source flat
files have a unique serial number composed of the date, “_ETL_” and time. For
example:
Client_Relationship_File1_In_20060104_ETL_184325.psv
Intermediate datasets are created between modules. Their names include the name of the module that created the dataset or the contents of the dataset, because more than one module might use the dataset after it is written. For example:
BUSN_RCR_CUST.ds
Target output files include the name of the target subject area or system, the
target table name or copybook name. The goal is the same as with source
files—to connect the name of the file with the name of the file on the target
system. Target flat files have a unique serial number composed of the date,
_ETL_ and time. For example:
Client_Relationship_File1_Out_20060104_ETL_184325.psv
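As one possible way to build that serial number in a wrapper script (a sketch only; the variable name and the use of the shell date command are assumptions, not part of the standard):
OUTFILE=Client_Relationship_File1_Out_$(date +%Y%m%d)_ETL_$(date +%H%M%S).psv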
Files and datasets have suffixes that allow easy identification of the content and
type. DataStage proprietary format files have required suffixes. They are
identified in italics in Table 3-7, which defines the types of files and their suffixes.
The “Short Description” field is also displayed on summary lines in the Director
and Designer clients. At a minimum, although there is no technical requirement
to do so, job developers should provide descriptive annotations in the “Short
Description” field for each job and job sequence, as in Figure 3-10.
In a job, the Annotation tool must be used to highlight steps in a given job flow.
By changing the vertical alignment properties (for example, Bottom) the
annotation can be drawn around the referenced stages, as in Figure 3-11 on
page 48.
Figure 3-11 Sample job annotation
Each stage should have a short description of its function specified in the stage
properties. These descriptions appear in the job documentation automatically
generated from jobs and sequencers adhering to the standards in this document.
More complex operators or operations should have correspondingly longer and
more complex explanations on this tab.
3.4 Working with source code control systems
The DataStage built-in repository manages objects (jobs, sequences, table
definitions, routines, custom components) during job development. However, this
repository is not capable of managing non-DataStage components (as examples,
UNIX® shell scripts, environment files, and job scheduler configurations) that
might be part of a completed application.
Source code control systems (such as ClearCase®, PVCS and SCCS) are
useful for managing the development life cycle of all components of an
application, organized into specific releases for version control.
As of release 8.1, DataStage does not directly integrate with source code control
systems, but it does offer the ability to exchange information with these systems.
It is the responsibility of the DataStage developer to maintain DataStage objects
in the source code system.
The DataStage Designer client is the primary interface to the DataStage object
repository. Using Designer, you can export objects (such as job designs, table
definitions, custom stage types, and user-defined routines) from the repository as
clear-text format files. These files can be checked into the external source code
control system.
There are three export file formats for DataStage 8.X objects:
DSX (DataStage eXport format)
XML
ISX
DSX and XML are established formats that have remained the same since
pre-8.X versions. ISX is a new format introduced in Information Server 8, which
can be imported and exported with the new command-line utility ISTool.
Client-only tools
The DataStage client includes Windows® command-line utilities for automating
the export process. These utilities (dsexport, dscmdexport, dsimport and
XML2DSX) are documented in the DataStage Designer Client Guide,
LC18-9893.
All exports from the DataStage repository to DSX or XML format are performed
on the Windows workstation.
In the example of Figure 3-12, the add link opens up a dialog box from which
individual items or folders can be selected.
The import of the .DSX file places the object in the same DataStage folder from which it originated, creating the job folder if it does not already exist.
If the objects were not exported with the job executables, then compile the
imported objects from Designer, or from the multi-job compile tool.
There is an equivalent GUI option to import XML files. The import of XML files first converts the input file from XML to DSX by means of an XSL stylesheet (this is done behind the scenes). The DSX file is then imported into the repository.
XML format can be imported using the Designer client or the dsimport,
dscmdimport or XML2DSX client tools.
One can now export an entire project with the following syntax (the same syntax applies to both the client and server environments):
istool export -domain <domain> -username <user> -password <passwd> -archive <archive_name> -datastage '<hostname>/project/*.*'
However, to export just the Jobs category the wildcard syntax changes a little (notice the extra /*/). Without it, the folders within the Jobs category are skipped:
istool export -do <domain> -u <user> -p <passwd> -ar <archive_name> -ds '<hostname>/project/Jobs/*/*.*'
The output archive file is a compressed file. When uncompressed, it creates a directory structure similar to the one shown in the GUI. The import option can be used to import the .isx archive, similar to the .xml or .dsx files.
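As a hedged sketch of the reverse operation (assuming the import command accepts the same connection options as the export shown above; check the IBM Information Server Manager User Guide cited below for the exact syntax), an archive could be loaded back into a project with:
istool import -domain <domain> -username <user> -password <passwd> -archive <archive_name> -datastage '<hostname>/project/'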
It is much faster to run the export/import on the server side with the ISTool when
compared to the DS Designer client tools described in 3.4.3, “Export to source
code control system” on page 51.
The ISTool is documented in the IBM Information Server Manager User Guide,
LC19-2453-00.
Shell script (if dsjob -local is specified): Only DataStage processes spawned by dsjob
The daemon for managing client connections to the DataStage engine is called
dsrpcd. By default (in a root installation), dsrpcd is started when the server is
installed, and should start whenever the machine is restarted. dsrpcd can also be
manually started and stopped using the $DSHOME/uv -admin command. For more
information, see the IBM InfoSphere DataStage and QualityStage Administrator
Client Guide, LC18-9895.
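For example, assuming the conventional -stop and -start arguments to the uv -admin command mentioned above (consult the Administrator Client Guide for the exact procedure and required authority on your platform), the engine can be stopped and restarted with:
$DSHOME/uv -admin -stop
$DSHOME/uv -admin -start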
Environment variable settings for particular projects can be set in the DataStage
Administrator client. Any project-level settings for a specific environment variable
override any settings inherited from dsrpcd.
$PROJDEF: Causes the project default value for the environment variable (as shown in the Administrator client) to be picked up and used to set the environment variable and job parameter for the job.
Note: $ENV must not be used for specifying the default $APT_CONFIG_FILE value because, during job development, Designer parses the corresponding parallel configuration file to obtain a list of node maps and constraints (Advanced stage properties).
$APT_CONFIG_FILE (filepath): Specifies the full path name to the parallel configuration file. This variable must be included in all job parameters so that it can be easily changed at runtime.
$APT_PERFORMANCE_DATA ($UNSET): If set, specifies the directory in which to capture advanced job runtime performance statistics.
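For reference, a minimal parallel configuration file of the kind $APT_CONFIG_FILE points to might look like the following sketch (the node name, fastname, and resource paths are illustrative assumptions, not recommendations):
{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/Data_dev0/dev_Project_A" {pools ""}
    resource scratchdisk "/Scratch_dev0/dev_Project_A" {pools ""}
  }
}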
On Solaris platforms only: When working with large parallel datasets (where
the individual data segment files are larger than 2 GB), you must define the
environment variable $APT_IO_NOMAP.
Any project-level environment variables must be set for new projects using the
Administrator client, or by carefully editing the DSPARAMS file in the project.
The scope of job parameters depends on their type, as follows:
Is specific to the job in which it is defined and used. Job parameters are
stored internally in DataStage for the duration of the job, and are not
accessible outside that job.
Can be extended by the use of a job sequencer, which can manage and pass
the job parameter among jobs in the sequence.
If you answer yes to either of these questions, you should create a job parameter
and set the property to that parameter.
Job parameters are required for the following DataStage programming elements:
File name entries in stages that use files or datasets must never use a
hard-coded operating system path name.
– Staging area files must always have path names as follows:
/#$STAGING_DIR##$DEPLOY_PHASE_parm#[filename.suffix]
– DataStage datasets always have path names as follows:
/#$PROJECT_PLUS_DATASETS#[headerfilename.ds]
Database stages must always use variables for the server name, schema (if appropriate), user ID, and password, as sketched in the example after this list.
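A hypothetical illustration of these rules (the parameter names beyond those shown above are examples only; see Appendix D for the recommended set, and note that required path separators depend on how the parameter values are defined):
Staging file property: #$STAGING_DIR##$DEPLOY_PHASE_parm#Client_Relationship_File1_In_20060104_ETL_184325.psv
Dataset file property: #$PROJECT_PLUS_DATASETS#BUSN_RCR_CUST.ds
Database stage connection properties: server #$DB_SERVER_parm#, user ID #$DB_USER_parm#, password #$DB_PASSWORD_parm#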
A list of recommended job parameters is summarized in Appendix D, “Example
job template” on page 397.
As with directory path delimiters, parameters for database schema names and similar values should contain any required delimiter.
Passwords must be set to type encrypted, and the default value maintained using
the DataStage Administrator.
The intent of this standard practice is to ensure a job is portable. It thus requires
the value of a parameter to be set independent of the job. During development of
a job, consider using the standard practice of always using a test harness
sequencer to execute a parallel job. The test harness allows the job to be run
independently and ensures the parameter values are set. When the job is ready
for integration into a production sequencer, the test harness can be cut, pasted,
and linked into the production sequencer. The test harness is also useful in test
environments, as it allows you to run isolated tests on a job.
A parameter set is assigned a name, and as such can be passed into jobs,
shared containers and sequences collectively. It is an entity on its own, stored
anywhere in the repository tree. We recommend creating a folder named
“parameter sets” for this purpose.
The multiple values that parameter set MDMIS can assume are defined in the
Values tab. The list of parameters is presented horizontally. The first column is
the name of a value file. Subsequent columns contain values for each individual
parameter.
In this example, there is a single value file, but there might be multiple such value
files for the same parameter set. This is depicted in Figure 4-2 on page 65.
Parameter sets are stored in the metadata layer along with the rest of the
project’s design metadata. They might be exported and imported individually or
with other objects using the DS Designer’s Import/Export facilities.
Value files are actual flat files stored in the DS Project in the DSEngine host file
system. Figure 4-2 presents an example of a shell session displaying the location
and content of the STATIC_MDMIS value file for the MDMIS parameter set.
There is a directory named ParameterSets/MDMIS under the MDMRDP project
directory. The value file STATIC_MDMIS is stored in that directory. The output
showing the first 10 lines of output is depicted in Figure 4-3.
Figure 4-3 Location and content of a value file in the DS project directory
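A comparable session is sketched below (the /opt/IBM/InformationServer/Server/Projects path reflects a default installation, the name=value layout is assumed, and the parameter values shown are invented):
cd /opt/IBM/InformationServer/Server/Projects/MDMRDP/ParameterSets/MDMIS
head -10 STATIC_MDMIS
DB_SERVER_parm=mdmdb01
DB_USER_parm=etl_user
STAGING_DIR_parm=/Staging/dev/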
Figure 4-4 shows a sample job sequence, with a parameter set named MDMIS.
The parameters for the job invoked by this job activity are defined in a way similar
to the one depicted in Figure 4-4 on page 66.
Figure 4-6 Setting the value file upon job sequence invocation
If the tool is not available then you must enter the environment variable
parameters one at a time in the DataStage Administrator. If this becomes too
cumbersome, consider the fact that environment variable parameters are stored
in the DSParams file. The DSParams is a text file that can be modified by hand.
However, if you choose to modify this file by hand you do so at your own risk.
We then delve into a greater level of detail on how to use the various stage types
in subsequent chapters of this book.
Though it might be possible to construct a large, complex job that satisfies given
functional requirements, this might not be appropriate. The following list details
factors to consider when establishing job boundaries:
Establishing job boundaries through intermediate datasets creates
checkpoints that can be used in the event of a failure when processing must
be restarted. Without these checkpoints, processing must be restarted from
the beginning of the job flow. It is for these reasons that long-running tasks
are often segmented into separate jobs in an overall sequence.
– For example, if the extract of source data takes a long time (such as an
FTP transfer over a wide area network) land the extracted source data to a
parallel data set before processing. To continue processing to a database
can cause conflict with locking tables when waiting for the FTP to
complete.
– As another example, land data to a parallel dataset before loading to a
target database unless the data volume is small, the overall time to
process the data is minimal, or if the data volume is so large that it cannot
be staged on the extract, transform, and load (ETL) server.
As a rule of thumb, keeping job designs to less than 50 stages is a good starting
point. But this is not a hard-and-fast rule. The proper job boundaries are
ultimately dictated by functional/restart/performance requirements, expected
throughput and data volumes, degree of parallelism, number of simultaneous
jobs and their corresponding complexity, and the capacity and capabilities of the
target hardware environment.
In addition, template jobs might contain any number of stages and pre-built logic,
allowing multiple templates to be created for various types of standardized
processing.
The default job design specifically supports the creation of write-through cache in
which data in load-ready format is stored in parallel datasets for use in the load
process or in the event the target table becomes unavailable.
Each subject area is broken into sub-areas and each sub-area might be further
subdivided. These sub-areas are populated by a DataStage job sequencer using
two types of DataStage jobs at a minimum:
A job that reads source data and then performs one of the following tasks:
– Transforms it to load-ready format
– Optionally stores its results in a write-through cache DataStage dataset or
loads the data to the target table.
A job that reads the DataStage dataset and loads it to the target table.
Other sections discuss in detail each of the components and give examples of
their use in a working example job sequencer.
Because parallel shared containers are inserted when a job is compiled, all jobs
that use a shared container must be recompiled when the container is changed.
The Usage Analysis and Multi-Job Compile tools can be used to recompile jobs
that use a shared container.
The exact policy for each reject is specified in the job design document. Whether
the job or ETL processing is to continue is specified on a per-job, per-sequence,
and per-script basis, based on business requirements.
Reject files include those records rejected from the ETL stream due to
Referential Integrity failures, data rule violations or other reasons that would
disqualify a row from processing. The presence of rejects might indicate that a
job has failed and prevent further processing. Specification of this action is the
responsibility of the Business Analyst and is published in the design document.
Error files include those records from sources that fail quality tests. The presence
of errors might not prevent further processing. Specification of this action is the
responsibility of the Business Analyst and is published in the design document.
Both rejects and errors are archived and placed in a special directory for
evaluation or other action by support staff. The presence of rejects and errors is
detected, and notification is sent by email to selected staff. These activities are the
responsibility of the job sequencers used to group jobs at a reasonable grain, or of
a federated scheduler.
The ETL actions to be taken for each record type are specified for each stage in the
job design document. These actions include:
1. Ignore – some process or event downstream of the ETL process is
responsible for handling the error.
2. Reprocess – rows are reprocessed and re-enter the data stream.
3. Push back – rows are sent to a Data Steward for corrective action.
The default action is to push back reject and error rows to a Data Steward.
The Sequential File stage offers the reject options listed in Table 5-1:
Continue – Drop read failures from the input stream. Pass successful reads to the
output stream. (No reject link exists.)
Fail – Abort the job on a read format failure. (No reject link exists.)
Output – Reject read failures to the reject stream. Pass successful reads to the
output stream. (Reject link exists.)
The reject option must be used in all cases where active management of the
rejects is required.
If a file is created by this option, it must have a *.rej file extension. Alternatively, a
shared container error handler can be used.
Rejects are categorized in the ETL job design document using the ranking listed
in Table 5-2.
Rejects are expected and can be ignored – Use the Continue option. Only records
that match the given table definition and format are output. Rejects are tracked by
count only.
Rejects should not exist but should not stop the job, and must be reviewed by the
Data Steward – Use the Output option. Send the reject stream to a *.rej file.
The Lookup stage offers the following reject options:
Continue – Ignore Lookup failures and pass the Lookup fields as nulls to the output
stream. Pass successful Lookups to the output stream.
Drop – Drop Lookup failures from the input stream. Pass successful Lookups to the
output stream.
Reject – Reject Lookup failures to the reject stream. Pass successful Lookups to the
output stream.
The reject option must be used in all cases where active management of the
rejects is required. Furthermore, to enforce error management only one
reference link is allowed on a Lookup stage. If there are multiple validations to
perform, each must be done in its own Lookup.
If a file is created by this option, it must have a *.rej or *.err file extension. The
*.rej extension is used when rejects require investigation after a job run, the *.err
extension when rejects can be ignored but need to be recorded. Alternatively, a
local error handler based on a shared container can be used.
Rejects are expected and can be ignored – Drop if the lookup fields are necessary
downstream, or Continue if the lookup fields are optional.
Rejects can exist in the data, but only need to be recorded, not acted on – Send the
reject stream to an *.err file, or tag and merge with the output stream.
Rejects should not exist but should not stop the job, and must be reviewed by the
Data Steward – Send the reject stream to a *.rej file, or tag and merge with the
output stream.
If a file is created from the reject stream, it must have a *.rej or *.err file extension.
The *.rej extension is used when rejects require investigation after a job run, the
*.err extension when rejects can be ignored but need to be recorded.
Alternatively, a shared container error handler can be used.
Rejects are expected and can be ignored – Funnel the reject stream back to the
output streams.
Rejects can exist in the data, but only need to be recorded, not acted on – Send the
reject stream to an *.err file, or tag and merge with the output stream.
Rejects should not exist but should not stop the job, and must be reviewed by the
Data Steward – Send the reject stream to a *.rej file, or tag and merge with the
output stream.
Rejects should not exist and should stop the job – Send the reject stream to a reject
file and halt the job.
Target Database stages offer the reject options listed in Table 5-6.
Reject link exists – Pass rows that fail to be written to the reject stream.
The reject option must be used in all cases where active management of the
rejects is required.
If a file is created by this option, it must have a *.rej file extension. Alternatively, a
shared container error handler is used.
Rejects are expected and can be ignored – No reject link exists. Only records that
match the given table definition and database constraints are written. Rejects are
tracked by count only.
Rejects should not exist but should not stop the job, and must be reviewed by the
Data Steward – Reject link exists. Send the reject stream to a *.rej file.
Rows are converted to the common file record format with 9 columns (as shown
in Figure 5-2 on page 81) using Column Export and Transformer stages for each
reject port, and gathered using a Funnel stage that feeds a Sequential File stage.
The Column Export and Transformer stages might be kept in a template Shared
Container the developer makes local in each job.
One of these columns, STAGE_NAME (required), holds the name of the stage from
which the error came.
In Figure 5-1 we depict the stages that process the errors produced by a job.
Lookup – A failed Lookup rejects an intact input row whose key fails to match the
reference link key. One or more columns might have been selected for replacement
when a reference key is found. Connect the reject port to a Transformer stage where
the columns selected for replacement are set to specific values. Connect the output
streams of the Transformer and Lookup stages to a Funnel stage to merge the two
streams.
Switch – A failed Switch rejects an intact input row whose key fails to resolve to one
of the Switch output streams. Connect the reject port to a Transformer stage where
columns are set to specific values. Connect the output stream of the Transformer
stage and one or more output streams of the Switch stage to a Funnel stage to merge
the two (or more) streams.
Transformer – A Transformer rejects an intact input row that cannot pass the
conditions specified on the output streams, or whose columns contain illegal values
for operations performed on those columns. In either case, attaching a non-specific
reject stream (referred to as the stealth reject stream) gathers rows from either
condition into the reject stream. Connect the reject port to a Transformer stage
where columns are set to specific values. Connect the output stream of the
corrective Transformer stage and one or more output streams of the original
Transformer stage to a Funnel stage to merge the two (or more) streams.
The corresponding figure shows a parallel shared container in which the reject link
of a Lookup stage (lnk_Validate_Something) feeds a tagging Transformer
(xfrm_Tag_Lkup_Rejects), whose output (lnk_Merge_Tagged_Rejects) is merged with
the Lookup output (lnk_lkup_output) into lnk_merged_output.
BASIC routines are still appropriate, and necessary, for the job control
components of a DataStage Job Sequence and Before/After Job Subroutines for
parallel jobs.
Changing the name of a properly acquired table or file does not break the
metadata connection, nor does deleting it and recreating it.
Once enabled, the relationships can be viewed in the stage editor on the Edit
Column panel, as shown in Figure 5-6.
The default partitioning method used when links are created is Auto partitioning.
The partitioning method is specified in the input stage properties using the
partitioning option, as shown in Figure 6-3 on page 94.
In the Designer canvas, links with Auto partitioning are drawn with the link icon,
depicted in Figure 6-4.
The Preserve Partitioning flag is an internal hint that Auto partitioning uses to
attempt to preserve previously ordered data (for example, on the output of a
parallel sort). This flag is set automatically by certain stages (sort, for example),
although it can be explicitly set or cleared in the advanced stage properties of a
given stage, as shown in Figure 6-5.
The Preserve Partitioning flag is part of the dataset structure, and its state is
stored in persistent datasets.
There are cases where a stage's input requirements prevent partitioning from
being preserved. For example, the upstream partitioning scheme might be
round-robin, but the stage at hand is a Join, which requires the data to be
partitioned by hash on the Join key. In these instances, if the Preserve
Partitioning flag was set, a warning is placed in the Director log indicating that
the parallel framework was unable to preserve partitioning for the specified stage.
Same partitioning
Same partitioning performs no repartitioning of the input dataset. Instead, it retains
the partitioning from the output of the upstream stage, as shown in Figure 6-6.
Figure 6-6 Same partitioning
Same partitioning does not move data between partitions (or, in the case of a
cluster or grid, between servers), and is appropriate when trying to preserve the
grouping of a previous operation (for example, a parallel Sort).
In the Designer canvas, links that have been specified with Same partitioning are
drawn with a horizontal line partitioning icon, as in Figure 6-7.
If you read a parallel dataset with Same partitioning, the downstream stage runs
with the degree of parallelism used to create the dataset, regardless of the
current $APT_CONFIG_FILE.
Note: Minimize the use of Same partitioning, using it only when necessary.
Round-robin partitioning
Round-robin partitioning evenly distributes rows across partitions in a
round-robin assignment, similar to dealing cards. Round-robin partitioning has a
fairly low overhead. It is shown in Figure 6-8.
Figure 6-8 Round-robin partitioning
Because optimal parallel processing occurs when all partitions have the same
workload, round-robin partitioning is useful for redistributing data that is highly
skewed (there are an unequal number of rows in each partition).
Random partitioning
Like Round-robin, Random partitioning evenly distributes rows across partitions,
but uses a random assignment. As a result, the order in which rows are assigned to
a particular partition differs between job runs.
Though in theory Random partitioning is not subject to regular data patterns that
might exist in the source data, it is rarely used in functional data flows because,
though it shares the basic principle of Round-robin partitioning, it has a slightly
higher overhead.
Entire partitioning
Entire partitioning copies every input row to all partitions, as shown in Figure 6-9.
Entire partitioning is useful for distributing the reference data of a Lookup task
(this might or might not involve the Lookup stage).
Hash – Assigns rows with the same values in one or more key columns to the same
partition using an internal hashing algorithm.
Modulus – Assigns rows with the same values in a single integer key column to the
same partition using a simple modulus calculation.
Range – Assigns rows with the same values in one or more key columns to the same
partition using a specified range map generated by pre-reading the dataset.
DB2 – For DB2 Enterprise Server Edition with DPF (DB2/UDB) only. Matches the
internal partitioning of the specified source or target table.
Hash partitioning
Hash partitioning assigns rows with the same values in one or more key columns to
the same partition, as shown in Figure 6-10.
Figure 6-10 Hash partitioning
If the source data values are evenly distributed in these key columns, and there
are a large number of unique values, then the resulting partitions are of relatively
equal size.
Hashing on the LName key column produces the results depicted in Table 6-4
and Table 6-5.
Table 6-4 Partition 0
ID LName FName Address
Using the same source dataset, hash partitioning on the LName and FName key
columns yields the distribution with a 4-node configuration file depicted in
Table 6-6, Table 6-7, Table 6-8, and Table 6-9.
In this example, the key column combination of LName and FName yields
improved data distribution and a greater degree of parallelism. Only the unique
combination of key column values appear in the same partition when used for
hash partitioning. When using hash partitioning on a composite key (more than
one key column), individual key column values have no significance for partition
assignment.
Modulus partitioning
Like hash partitioning, modulus partitioning produces equally sized partitions as
long as the data values in the key column are equally distributed. Because modulus
partitioning is simpler and faster than hash, it must be used if you have a single
integer key column. Modulus partitioning cannot be used for composite keys, or for
a non-integer key column.
Range partitioning
As a keyed partitioning method, Range partitioning assigns rows with the same
values in one or more key columns to the same partition. Given a sufficient
number of unique values, Range partitioning ensures balanced workload by
assigning an approximately equal number of rows to each partition, unlike Hash
and Modulus partitioning where partition skew is dependent on the actual data
distribution. This is depicted in Figure 6-11.
Figure 6-11 Range partitioning
To achieve this balanced distribution, Range partitioning must read the dataset
twice: once to create a Range Map file, and again to actually partition the data
in a flow using the Range Map. A Range Map file is specific to a given parallel
configuration file.
It is important to note that if the data distribution changes without recreating the
Range Map, partition balance is skewed, defeating the intention of Range
partitioning. Also, if new data values are processed outside of the range of a
given Range Map, these rows are assigned to either the first or the last partition,
depending on the value.
DB2 partitioning
The DB2/UDB Enterprise Stage (or EE Stage) matches the internal database
partitioning of the source or target DB2 Enterprise Server Edition with Data
Partitioning Facility database (previously called DB2/UDB EEE). Using the
DB2/UDB Enterprise stage, data is read in parallel from each DB2 node. And, by
default, when writing data to a target DB2 database using the DB2/UDB
Enterprise stage, data is partitioned to match the internal partitioning of the target
DB2 table using the DB2 partitioning method.
DB2 partitioning can only be specified for target DB2/UDB Enterprise stages. To
maintain partitioning on data read from a DB2/UDB Enterprise stage, use Same
partitioning on the input to downstream stages.
This information is detailed in the parallel job score, which is output to the
Director job log when the environment variable APT_DUMP_SCORE is set to
True. Specific details on interpreting the parallel job score can be found in
Appendix E, “Understanding the parallel job score” on page 401.
To display row counts per partition in the Director Job Monitor window, right-click
anywhere in the window, and select the Show Instances option, as shown in
Figure 6-13. This is useful in determining the distribution across parallel
partitions (skew). In this instance, the stage named Sort_3 is running across four
partitions (shown as x 4 next to the stage name), and each partition is processing
an equal number (12,500) of rows, for an optimally balanced workload.
However, on closer inspection, the partitioning and sorting of this scenario can be
optimized. Because the Join and Aggregator use the same partition keys and
sort order, we can move the Hash partition and Sort before the Copy stage, and
apply Same partitioning to the downstream links, as shown in Figure 6-17.
Figure 6-18 Standard partitioning assignment for a Join stage
Although Hash partitioning guarantees correct results for stages that require
groupings of related records, it is not always the most efficient solution,
depending on the business requirements. Although functionally correct, the
solution has one serious limitation. Remembering that the degree of parallel
operation is limited by the number of distinct partitioning key values, a join column
containing a single value assigns all rows to a single partition, resulting in
sequential processing.
To optimize partitioning, consider that the single header row is really a form of
reference data. An optimized solution is to alter the partitioning for the input links
to the Join stage, as depicted in Figure 6-19.
Use round-robin partitioning on the detail input to distribute rows across all
partitions evenly.
Use Entire partitioning on the header input to copy the single header row to all
partitions.
Figure 6-19 Optimized partitioning assignment based on business requirements
Because we are joining on a single value, there is no need to pre-sort the input to
the Join. We revisit this in the Sorting discussion.
If defined in reverse of this order, the Join attempts to read all detail rows from
the right input (because they have the same key column value) into memory.
For advanced users, there is one further detail in this example. Because the Join
waits until it receives an End of Group (new key value) or End of Data (no more
rows on the input dataset) from the Right input, the detail rows in the Left input
buffer to disk to prevent a deadlock. (See 12.4, “Understanding buffering” on
page 180). Changing the output derivation on the header row to a series of
numbers instead of a constant value establishes the End of Group and prevents
buffering to disk.
Assuming the data is not repartitioned in the job flow and that the number of rows
is not reduced (for example, through aggregation), then a Round-robin collector
can be used before the final sequential output to reconstruct a sequential output
stream in the same order as the input data stream. This is because a
Round-robin collector reads from partitions using the same partition order that a
Round-robin partitioner assigns rows to parallel partitions.
Ordered collectors are generally only useful if the input dataset has been Sorted
and Range partitioned on the same key columns. In this scenario, an Ordered
collector generates a sequential stream in sort order.
Chapter 7. Sorting
Traditionally, the process of sorting data uses one primary key column and,
optionally, one or more secondary key columns to generate a sequential ordered
result set. The order of key columns determines the sequence and groupings in
the result set. Each column is specified with an ascending or descending sort
order. This is the method the SQL databases use for an ORDER BY clause, as
illustrated in the following example, sorting on primary key LName (ascending),
secondary key FName (descending).
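For instance (table name is illustrative; the column names follow the earlier
examples), the equivalent SQL would be:
   SELECT ID, LName, FName, Address
   FROM CUSTOMER
   ORDER BY LName ASC, FName DESC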
However, in most cases there is no need to globally sort data to produce a single
sequence of rows. Instead, sorting is most often needed to establish order in
specified groups of data. This sort can be done in parallel.
For example, the Remove Duplicates stage selects either the first or last row
from each group of an input dataset sorted by one or more key columns. Other
stages (for example, Sort Aggregator, Change Capture, Change Apply, Join,
Merge) require pre-sorted groups of related records.
In the following example, the input dataset from Table 7-1 on page 115 is
partitioned on the LName and FName columns. Given a 4-node configuration
file, you would see the results depicted in Table 7-3, Table 7-4, Table 7-5, and
Table 7-6 on page 118.
Applying a parallel sort to this partitioned input dataset, using the primary key
column LName (ascending) and secondary key column FName (descending),
would generate the resulting datasets depicted in Table 7-7, Table 7-8, Table 7-9,
and Table 7-10.
This is similar to the way parallel database engines perform their parallel sort
operations.
By default, both methods use the same internal sort package (the tsort operator).
The Link sort offers fewer options, but is easier to maintain in a DataStage job, as
there are fewer stages on the design canvas. The Standalone sort offers more
options, but as a separate stage makes job maintenance more complicated.
In general, use the Link sort unless a specific option is needed on the
stand-alone stage. Most often, the standalone Sort stage is used to specify the
Sort Key mode for partial sorts.
Key column options let the developer specify the following options:
Key column usage: sorting, partitioning, or both
Sort direction: Ascending or Descending
Case sensitivity (strings)
Sorting character set: ASCII (default) or EBCDIC (strings)
Position of nulls in the result set (for nullable columns)
Specifically, the following properties are not available when sorting on a link:
Sort Key Mode (a particularly important performance optimization)
Create Cluster Key Change Column
Create Key Change Column
Output Statistics
Sort Utility (do not change this)
Restrict Memory Usage
Of the options only available in the standalone Sort stage, the Sort Key Mode is
most frequently used.
Note: The Sort Utility option is an artifact of previous releases. Always specify
the DataStage Sort Utility, which is significantly faster than a UNIX sort.
By default, the Stable sort option is disabled for sorts on a link and enabled with
the standalone Sort stage.
7.5 Subsorts
In the standalone Sort stage, the key column property, Sort Key Mode, is a
particularly powerful feature and a significant performance optimizer. It is used
when resorting a sub-grouping of a previously sorted input dataset, instead of
performing a complete sort. This subsort uses significantly less disk space and
CPU resource, and can often be performed in memory (depending on the size of
the new subsort groups).
To resort based on a sub-group, all key columns must still be defined in the Sort
stage. Re-used (previously sorted) keys are given the Sort Key Mode value “Do not
Sort (Previously Sorted)”, and new sort keys are given the value “Sort”, as shown
in Figure 7-4.
If the input data does not match the key column definition for a sub-sort, the job
aborts.
Typically, the parallel framework inserts sorts before any stage that requires
matched key values or ordered groupings (Join, Merge, Remove Duplicates, Sort
Aggregator). Sorts are only inserted automatically when the flow developer has
not explicitly defined an input sort.
To perform a sort, rows in the input dataset are read into a memory buffer on
each partition. If the sort operation can be performed in memory (as is often the
case with a sub-sort) no disk I/O is performed.
By default, each sort uses 20 MB of memory per partition for its memory buffer.
This value can be changed for each standalone Sort stage using the Restrict
Memory Usage option (the minimum is 1 MB/partition). On a global basis, the
APT_TSORT_STRESS_BLOCKSIZE environment variable can be used to
specify the size of the memory buffer, in MB, for all sort operators (link and
standalone), overriding any per-sort specifications.
The file system configuration and number of scratch disks defined in parallel
configuration file can impact the I/O performance of a parallel sort. Having a
greater number of scratch disks for each node allows the sort to spread I/O
across multiple file systems.
For bounded-length variable-length fields, the parallel framework always allocates
space equivalent to the maximum specified length. If most values are much shorter
than their maximum length, a large amount of unused space is moved around
between operators, as well as to and from datasets and fixed-format files. That
happens, for instance, when an address field is defined as "varchar(500)" but most
addresses are 30 characters long.
This severely impacts the performance of sort operations: the more unused
bytes a record holds, the more unnecessary data is moved to and from scratch
space.
This rule of keeping declared maximum lengths close to the actual data must be
applied judiciously, but it can result in great performance gains.
Sequential File – Read and write standard files in a single format. Limitations:
cannot write to a single file in parallel; performance penalty of format conversion;
does not support hierarchical data files.
Complex Flat File – Read source data in complex (hierarchical) format, such as
mainframe sources with COBOL copybook file definitions. Limitations: cannot write
in parallel; performance penalty of format conversion.
SAS Parallel – Share data with an external Parallel SAS application (requires a SAS
connectivity license for DataStage). Limitations: requires Parallel SAS; can only be
read from or written to by DataStage or Parallel SAS.
Lookup File Set – Rare instances where Lookup reference data is required by
multiple jobs and is not updated frequently. Limitations: can only be written;
contents cannot be read or verified; can only be used as a reference link on a
Lookup stage.
No parallel file stage supports update of existing records. Certain stages (parallel
dataset) support Append, to add new records to an existing file. But this is not
recommended, as it imposes risks for failure recovery.
We provide further information about these File stage types in the remaining
sections of this chapter.
Data is stored in datasets in fixed-length format and variable length fields are
padded up to their maximum length. This allows the parallel framework to
determine field boundaries quickly without having to scan the entire record
looking for field delimiters.
This yields the best performance when most of the fields are of fixed length and
unused positions in variable length fields tend to be minimal.
However, when the overall amount of unused space in variable-length fields is
significant, the dataset advantages tend to be offset by the cost of storing that
much space. For instance, if an address field is defined as "varchar(500)" and
most addresses are 30 characters long, there is a significant amount of unused
space across the entire dataset. When dealing with millions or billions of records,
this cost is significant.
Datasets can only be read from and written to using a DataStage parallel job. If
data is to be read or written by other applications (including DS Server jobs), then
a different parallel stage such as a SequentialFile should be adopted instead.
Also, archived datasets can only be restored to DataStage instances that are on
the exact same OS platform.
The behavior of the Sequential File stage depends on the Read Method and related
options:
Read Method: Specific Files, only one file specified. The file might be a file or a
named pipe.
Read Method: Specific Files, only one file specified, with the Readers Per Node
option greater than 1. This is useful for SMP configurations; the file may be either
fixed-width or variable-width.
Read Method: Specific Files, more than one file specified. Each file specified within
a single Sequential File stage must be of the same format.
Read Method: Specific Files, with the Read From Multiple Nodes option set to Yes.
This is useful for cluster and grid configurations; the file may only be fixed-width.
When reading in parallel, input row order is not maintained across readers.
A better option for writing to a set of sequential files in parallel is to use the
FileSet stage. This creates a single header file (in text format) and corresponding
data files, written in parallel across partitions.
This method is also useful for External Source and FTP Sequential Source
stages.
The format of the schema file, including sequential file import/export format
properties is documented in the Orchestrate Record Schema manual, which is
included with the product documentation. This document is required, because
the Import/Export properties used by the Sequential File and Column Import
stages are not documented in the DataStage Parallel Job Developers Guide.
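As a sketch of what such a schema file might contain (field names and the chosen
properties are illustrative only, not a definitive format reference):
   record {final_delim=end, delim=',', quote=double}
   (
       ID:      int32;
       LName:   string[max=30];
       FName:   string[max=30];
       Address: nullable string[max=500];
   )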
Note: The Complex Flat File stage cannot read from sources with OCCURS
DEPENDING ON clauses. (This is an error in the DataStage documentation.)
When used as a target, the stage allows you to write data to one or more
complex flat files. It does not write to MVS datasets.
In the Complex Flat File stage, COBOL group items are represented as subrec
(subrecord) columns.
8.4 Filesets
Filesets are a type of hybrid between datasets and sequential files.
Just like a dataset, records are stored in several partition files, and there is a
descriptor file that lists the paths to those partition files as well as the schema for
the records contained in them. However, data is stored in text format, as in
sequential files, as opposed to datasets, which store column values in an internal
binary format.
Always include reject links in a parallel Transformer. This makes it easy to identify
reject conditions (by row counts). To create a Transformer reject link in Designer,
right-click an output link and choose Convert to Reject, as in Figure 9-1.
The parallel Transformer rejects NULL derivation results (including output link
constraints) because the rules for arithmetic and string handling of NULL values
are, by definition, undefined. Even if the target column in an output derivation
allows nullable results, the Transformer rejects the row instead of sending it to the
output links.
For example, the following stage variable expression would convert a null value
to a specific empty string:
If ISNULL(link.col) Then “” Else link.col
Because the Transformer aborts the entire job flow immediately, it is possible that
valid rows have not yet been flushed from sequential file (export) buffers, or
committed to database tables. It is important to set the database commit
parameters or adjust the Sequential File buffer settings (see 8.2.5, “Sequential
file (Export) buffering” on page 131).
The stage variables and the columns in a link are evaluated in the order in which
they are displayed in the Transformer editor. Similarly, the output links are also
evaluated in the order in which they are displayed.
From this sequence, it can be seen that certain constructs are inefficient to include
in output column derivations, because they are evaluated once for every output
column that uses them. One such construct is where the same part of an expression
is used in multiple column derivations.
For example, if multiple columns in output links use the same substring of an input
column, the following test might appear in a number of output column derivations:
IF (DSLINK1.col[1,3] = “001”) THEN ...
This can be made more efficient by moving the substring calculation into a stage
variable. By doing this, the substring is evaluated only once for every input row. In
this case, the stage variable derivation would be as follows:
DSLINK1.col[1,3]
This reduces both the number of substring functions evaluated and string
comparisons made in the Transformer.
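As a sketch of the resulting design (the stage variable name svColPrefix is
hypothetical):
   Stage variable svColPrefix, derivation:  DSLINK1.col[1,3]
Output column derivations then test the variable instead of repeating the substring:
   IF (svColPrefix = “001”) THEN ...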
For example, a column definition might include a function call that returns a
constant value, for example:
Str(“ “,20)
This returns a string of 20 spaces. In this case, the function is evaluated every
time the column derivation is evaluated. It is more efficient to calculate the
constant value just once for the whole Transformer.
This can be achieved using stage variables. This function can be moved into a
stage variable derivation. In this case, the function is still evaluated once for
every input row. The solution here is to move the function evaluation into the
initial value of a stage variable.
A stage variable can be assigned an initial value from the stage Properties dialog
box/Variables tab in the Transformer stage editor. In this case, the variable would
have its initial value set as follows:
Str(“ “,20)
Leave the derivation of the stage variable on the main Transformer page empty.
Any expression that previously used this function is changed to use the stage
variable instead.
The initial value of the stage variable is evaluated once, before any input rows are
processed. Because the derivation expression of the stage variable is empty, it is
not re-evaluated for each input row. Therefore, its value for the whole Transformer
processing is unchanged from the initial value.
When using stage variables to evaluate parts of expressions, the data type of the
stage variable must be set correctly for that context. Otherwise, needless
conversions are required wherever that variable is used.
As noted in the previous section, the Output Mapping properties for any parallel
stage generate an underlying modify for default data type conversions, dropping
and renaming columns.
The standalone Modify stage can be used for non-default type conversions
(nearly all date and time conversions are non-default), null conversion, and string
trim. The Modify stage uses the syntax of the underlying modify operator,
documented in the Parallel Job Developers Guide, LC18-9892, as well as the
Orchestrate Operators Reference.
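As a hedged sketch of Modify specifications (column names are hypothetical, and
the exact conversion names should be confirmed against the Orchestrate Operators
Reference), each specification takes the form destination:type = conversion(source):
   cust_id:int32 = int32_from_decimal(cust_id_dec)
   start_date:date = date_from_string(start_date_str)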
Note: The DataStage Parallel Job Developers Guide gives incorrect syntax for
converting an out-of-band null to an in-band null (value) representation.
Use this function to remove the characters used to pad variable-length strings
when they are converted to fixed-length strings of greater length. By default,
these characters are retained when the fixed-length string is converted back to a
variable-length string.
The character argument is the character to remove. By default, this is NULL. The
value of the direction and justify arguments can be either begin or end. Direction
defaults to end, and justify defaults to begin. Justify has no effect when the target
string has variable length.
The following example removes all leading ASCII NULL characters from the
beginning of name and places the remaining characters in an output
variable-length string with the same name:
name:string = string_trim[NULL, begin](name)
The following example removes all trailing Z characters from color, and
left-justifies the resulting hue fixed-length string:
hue:string[10] = string_trim[‘Z’, end, begin](color)
Use of the Filter and Switch stages must be limited to instances where the entire
filter or switch expression must be parameterized at runtime. In a parallel
Transformer, link constraint expressions, but not data, are fixed by the developer.
Limit the use of database Sparse Lookups (available in the DB2 Enterprise,
Oracle Enterprise, and ODBC Enterprise stages) to scenarios where the number
of input rows is significantly smaller (for example, 1:100 or more) than the
number of reference rows. (see 13.1.7, “Database sparse lookup versus join” on
page 197).
During an Outer Join, when a match does not occur, the Join stage inserts values
into the unmatched non-key columns using the following rules:
If the non-key column is defined as nullable (on the Join input links), the DS
parallel framework inserts NULL values in the unmatched columns
If the non-key column is defined as not-nullable, the parallel framework inserts
default values based on the data type. For example, the default value for an
Integer is zero, the default value for a Varchar is an empty string (“”), and the
default value for a Char is a string of padchar characters equal to the length of
the Char column.
For this reason, care must be taken to change the column properties to allow
NULL values before the Join. This is done by inserting a Copy stage and
mapping a column from NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
The Sort Aggregation method must be used when the number of key values is
unknown or large. Unlike the Hash Aggregator, the Sort Aggregator requires
presorted data, but only maintains the calculations for the current group in
memory.
You can also specify that the result of an individual calculation or recalculation is
decimal by using the optional Decimal Output sub-property.
In Figure 10-1, two Aggregators are used to prevent the sequential aggregation
from disrupting upstream processing.
The comparison stages are based on operators that came into existence long
before DataStage (DS) itself. Those stages perform field-by-field comparisons on
two pre-sorted input datasets.
In 10.5, “Checksum” on page 155, we discuss the use of the Checksum stage,
which implements the MD5 Checksum algorithm. Although originally developed
for quite different purposes, its common use in the DataStage world tends to be
as a more efficient record comparison method.
The Slowly Changing Dimension (SCD) stage is included in this chapter as well.
It packs a considerable number of features specifically tailored to the SCD
problem. This stage helps identify, for instance, whether new records must be
created or updated.
These stages are described in detail in the DataStage Parallel Job Developer
Guide, LC18-9892.
There are differences between the workings of each stage, so one type might be
better suited than the others depending on the task at hand. For instance, the
ChangeCapture stage is supposed to be used in conjunction with the
ChangeApply. The ChangeCapture produces a set of records containing what
needs to be applied by ChangeApply to the before records to produce the after
records.
This example reads data from sequential files. The data can be stored in a
database, a persistent dataset or a fileset. Note that we are not discussing the
various techniques and optimizations for the extraction of data for the before
dataset. This is left for subsequent chapters of this book.
The before dataset is typically extracted from a data mart or a data warehouse.
The after represents the set of input records for a given processing cycle.
One could implement similar logic in various ways using other standard
components, such as a normal lookup followed by a Transformer. In this
scenario, the datasets do not have to be pre-sorted. The comparison of individual
columns is implemented by expressions inside the Transformer.
The point here is there are several ways of implementing the same thing in
DataStage. However, there are components that are suited for specific tasks. In
the case of comparing records, the Compare, Change Capture, and Difference
stages are specific to the problem of identifying differences between records.
They reduce the coding effort and clutter.
10.5 Checksum
The Checksum stage is a new component in Information Server 8.1 that
generates an MD5 checksum value for a set of columns in a given dataset.
The MD5 algorithm (a message digest designed by RSA's Ron Rivest) was
originally developed and historically used for integrity checking. However, in the
realm of DataStage applications for business intelligence and data warehousing,
its most common use tends to be record comparison for incremental loads.
In Figure 10-3 on page 156 we present an example of the Checksum stage for
record comparison. The flow assumes the Before dataset already contains a
pre-calculated checksum, most likely stored in a target DW table. A new
checksum value is calculated for incoming records (After records). The
checksum must be calculated on the exact same set of fields for both datasets.
Records from the Before and After datasets are correlated with a Left Outer Join
(the left side being the After dataset). The subsequent Transformer contains the
logic to compare the checksums and route records to appropriate datasets.
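A hedged sketch of that routing logic, expressed as Transformer output link
constraints (link and column names are hypothetical):
   Insert link constraint:   ISNULL(lnk_join.ChecksumBefore)
   Changed link constraint:  NOT(ISNULL(lnk_join.ChecksumBefore)) AND
                             lnk_join.ChecksumAfter <> lnk_join.ChecksumBefore
Rows matching neither constraint carry equal checksums; they are unchanged and
can be routed to a separate output or simply dropped.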
The approach to use depends on how frequently the set of relevant attributes
changes, and on the number and size of the columns involved in the comparison.
You can find a detailed description of the SCD stage in the InfoSphere
DataStage 8.1 Parallel Job Developer Guide, LC18-9892. The SCD stage is
depicted in Figure 10-4.
The input stream link contains records from which data for both Fact and
Dimension tables are derived. There is one output link for fact records and a
second output link for updates to the dimension table.
The entire set of reference records is read from the reference link and stored in
memory in the form of a lookup table. This is done at initialization. The reference
data remains in memory throughout the duration of the entire job.
This release of the Data Flow Standard Practices proposes techniques for batch
incremental loads that include a discussion on how to restrict reference datasets
(See Chapter 15, “Batch data flow design” on page 259).
There is a chance that this technique will not make the reference dataset small
enough to fit in physical memory. That depends on how much memory is
available and how large are the data volumes involved.
Avoid overflowing the physical memory when running parallel jobs. This must be
avoided for any type of application, not only DataStage jobs. There must be
enough room for all processes (OS, DataStage jobs and other applications) and
user data, such as lookup tables, to fit comfortably in physical memory.
The moment the physical memory fills up, the operating system starts paging in
and out to swap space, which is bad for performance.
Even more grave is when the entire swap space is consumed. That is when
DataStage starts throwing fatal exceptions related to fork() errors.
As previously stated, the SCD packs in lots of features that make slowly
changing dimension easier, so we are not necessarily ruling out its use.
One might consider making use of the SCD stage when the following
circumstances are in effect:
The SCD stage meets the business requirements (that is, the processing of
slowly changing dimensions).
It is guaranteed the reference dataset fits in memory
– Use the techniques outlined in Chapter 15, “Batch data flow design” on
page 259.
• Avoid extracting and loading the entire contents of a dimension table
• Make sure only the relevant reference subset is extracted and loaded
into memory through the “reference link”.
If there is a possibility the reference data is too large to fit in memory, another
technique, sorted joins, must be adopted instead.
One of the original purposes of the underlying DataStage parallel framework was
to parallelize COBOL applications. As such, it has been implemented with ways
of importing/exporting and representing those COBOL record formats.
11.1.1 Vectors
In Figure 11-1 we show an example of a single record, whose Temps field
consists of a vector of float values. The notation used here for vector elements
consists of an integer indicating the element index, followed by a colon and the
actual value.
A record might have multiple fields of vector types, as shown in Figure 11-2.
There is one additional field for SampleTimes. But, SampleTimes and Temps are
paired fields, so perhaps there can be a better representation.
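In Orchestrate record schema notation, a sketch of such a record might look as
follows (the vector length and data types are assumptions for illustration):
   record (
       Id:             int32;
       SampleTimes[4]: time;
       Temps[4]:       sfloat;
   )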
One typically uses tagged fields when importing data from a COBOL data file
when the COBOL data definition contains a REDEFINES statement. A COBOL
REDEFINES statement specifies alternative data types for a single field.
In Figure 11-5, we show examples of records for the definition in Figure 11-4.
Each line represents a separate record. More than one record has the same
value for the key field.
Tagged data types are not described by the Information Server documentation.
They are described in Orchestrate 7.5 Operators Reference. However, tagged
types can still be used by means of the Orchestrate record schema definition, as
depicted in the previous examples.
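As a rough sketch of such a schema, loosely based on the tag cases shown in the
figures (field names are approximate, and the exact syntax should be checked
against the Orchestrate 7.5 Operators Reference):
   record (
       Id:   int32;
       type: int8;
       rec:  tagged (
           person: subrec (fname: string; kname: string;);
           income: int32;
           dates:  subrec (Birth: date; Retire: date;);
       );
   )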
These documents describe the corresponding stage types, with the exception of the
tagbatch and tagswitch operators, which have no corresponding stage type (N/A).
In Figure 11-6 we put all Restructure stages and Operators in perspective. You
can see the paths that can be followed to convert from one type to another.
The Column Import and Column Export stages can import into and export from
any of the record types.
The tagswitch operator writes each tag case to a separate output link, along with
a copy of the top-level fields.
In Figure 11-7, we show the input and output schemas of a tagswitch operator.
The input schema contains a top level key field, along with three tag cases.
Instead of the nested subrecord fields being flattened out to the same output
record format, each tag case is redirected to a separate output link. The top level
fields are copied along to the corresponding output link.
Figure 11-8 Using the Generic stage to invoke the tagswitch operator
A related figure illustrates the alternative of mimicking tagswitch with standard
stages: the type indicator is imported as a column value (Id:int32; type:int8;
taggedField:string), and a Transformer (or Switch) stage sends each record to a
different output link depending on the type, each link carrying the fields of one tag
case (fname and kname, income, or birth and retire dates).
The Tagbatch operator flattens the record definitions of tagged fields so all fields
from the tag cases are promoted to top level fields.
By default, multiple records with the same key value are combined into the same
output record. Or they might be written out as separate records.
A further figure shows the equivalent of tagbatch built from standard stages: the
source (Sequential_File_0) is split by type (SplitByType), each tag case (tagCaseA,
tagCaseB, tagCaseC) is parsed by its own Column Import stage (toParseTagCaseA,
toParseTagCaseB, toParseTagCaseC), duplicate key values are removed, and the
parsed cases are joined back together (Join_32) into a single flattened record
containing the fields of all tag cases.
You can see from the examples that the native Tagbatch and Tagswitch operators
provide a convenient way of dealing with tagged record structures. Instead of
having multiple stages, the flow can be simplified by using components that are
specific to the task at hand. This is often the case when dealing with complex
COBOL record types.
The Sequential File stage can be used in conjunction with the Column Import
stage. This might be done for two purposes:
De-couple the file import process from the parsing of input records: the record
parsing can be done in parallel, with a higher degree of parallelism than the
one used to read the input file.
Simplify the parsing by using multiple column import and Transformer stages
in tandem.
The full import/export record syntax is described in the DataStage 7.5 Operators
Reference (Chapter 25: Import/Export Properties).
You can generate a pivot index that assigns an index number to each row with a
set of pivoted data.
This stage was introduced in Information Server 8.1 and complements the set of
Restructure stages that were already present in previous releases, as discussed
in the previous sections.
One could mimic the functionality of the Pivot stage by using a combination of the
Restructure stages depicted in Figure 11-6 on page 164. To go from a record
structure with fields SampleTime0…SampleTime3 and Temp0…Temp3 to multiple
records, the following sequence of stages can be implemented:
1. MakeVector
2. MakeSubRecord
3. PromoteSubRecord
This must be done twice. First for SampleTimeN, and again for TempN.
The Pivot stage overcomes those limitations and does the transformation from a
single record into multiple records in a single stage. It provides for a more natural
and user-friendly restructure mechanism for certain scenarios.
This section provides tips for designing a job for optimal performance, and for
optimizing the performance of a given data flow using various settings and
features in DataStage (DS).
When composing the score, the DS parallel framework attempts to reduce the
number of processes by combining the logic from two or more stages (operators)
into a single process (per partition). Combined operators are generally adjacent
to each other in a data flow.
In general, it is best to let the framework decide what to combine and what to
leave uncombined. However, when other performance tuning measures have
been applied and still greater performance is needed, tuning combination might
yield additional performance benefits.
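As a sketch, combination can be influenced per stage through the Combinability
mode setting on the stage Advanced tab, or globally (for diagnosis only) with an
environment variable:
   APT_DISABLE_COMBINATION=True
Disabling combination globally generates more processes and more overhead, so it
is normally used only to isolate problems.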
There are many factors that can reduce the number of processes generated at
runtime:
Use a single-node configuration file
Remove all partitioners and collectors (especially when using a single-node
configuration file)
Enable runtime column propagation on Copy stages with only one input and
one output
Minimize join structures (any stage with more than one input, such as Join,
Lookup, Merge, Funnel)
Minimize non-combinable stages (as outlined in the previous section) such as
Join, Aggregator, Remove Duplicates, Merge, Funnel, DB2 Enterprise, Oracle
Enterprise, ODBC Enterprise, BuildOps, BufferOp
Selectively (being careful to avoid deadlocks) disable buffering. (Buffering is
discussed in more detail in section 12.4, “Understanding buffering” on
page 180.)
Records are moved between connected stages in transport blocks, with a pair of
blocks for each partition of each link. The first block is used by the upstream
(producer) stage to output data it is done with. The second block is used by the
downstream (consumer) stage to obtain data that is ready for the next processing
step. After the upstream block is full and the downstream block is empty, the
blocks are swapped and the process begins again.
This type of buffering (or record blocking) is rarely tuned. It usually only comes
into play when the size of a single record exceeds the default size of the transport
block. Setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE to a multiple of (or
equal to) the record size resolves the problem. Remember, there are two of these
transport blocks for each partition of each link, so setting this value too high can
result in a large amount of memory consumption.
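For example (the value is illustrative only), a job whose records approach 256 KB
might set:
   APT_DEFAULT_TRANSPORT_BLOCK_SIZE=262144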
Note: The following environment variables are used only with fixed-length
records:
APT_MIN/MAX_TRANSPORT_BLOCK_SIZE
APT_LATENCY_COEFFICIENT
APT_AUTO_TRANSPORT_BLOCK_SIZE
In this example, the Transformer creates a fork with two parallel Aggregators,
which feed an Inner Join. Note, however, that fork-join describes the shape of the
flow; it does not necessarily have to involve a Join stage.
The first of the two accompanying figures shows this fork-join flow with one of the
Join inputs queued and the other waiting, the condition that can lead to deadlock.
The second shows the same flow with BufferOp operators (BufferOp1 and
BufferOp2) automatically inserted on the Join inputs.
Because BufferOp is always ready to read or write, Join cannot be stuck waiting
to read from either of its inputs, breaking the circular dependency and
guaranteeing no deadlock occurs.
BufferOps are also placed on the input partitions to any Sequential stage that is
fed by a Parallel stage, as these same types of circular dependencies can result
from partition-wise Fork-Joins.
Tip: For wide rows, it might be necessary to increase the default buffer size
(APT_BUFFER_MAXIMUM_MEMORY) to hold more rows in memory.
Aside from ensuring that no deadlock occurs, BufferOps also have the effect of
smoothing out production/consumption spikes. This allows the job to run at the
highest rate possible, even when a downstream stage is ready for data at different
times than its upstream stage is ready to produce it. When attempting to address
these mismatches in production and consumption, it is best to tune the buffers on a
per-stage basis, instead of globally through environment variable settings.
Important: Choosing which stages to tune buffering for and which to leave
alone is as much an art as a science, and must be considered among the last
resorts for performance tuning.
By using the performance statistics in conjunction with this buffering, you might
be able to identify points in the data flow where a downstream stage is waiting on
an upstream stage to produce data. Each such point might offer an opportunity for
buffer tuning.
As implied, when a buffer has consumed its RAM, it asks the upstream stage to
slow down. This is called pushback. Because of this, if you do not have force
buffering set and APT_BUFFER_FREE_RUN set to at least approximately 1000,
you cannot determine that any one stage is waiting on any other stage, as
another stage far downstream can be responsible for cascading pushback all the
way upstream to the place you are seeing the bottleneck.
Existing database stages are discussed in this chapter for ease of reference,
especially when dealing with earlier releases of Information Server (IS) and
DataStage.
Note: Not all database stages (for example, Teradata API) are visible in the
default DataStage Designer palette. You might need to customize the palette
to add hidden stages.
Because there are exceptions to this rule (especially with Teradata), specific
guidelines for when to use various stage types are provided in the
database-specific topics in this section.
Native parallel stages always pre-query the database for actual runtime
metadata (column names, types, attributes). This allows DataStage to match
return columns by name, not position in the stage Table Definitions. However,
care must be taken to assign the correct data types in the job design.
The benefit of ODBC Enterprise stage comes from the large number of included
and third-party ODBC drivers to enable connectivity to all major database
platforms. ODBC also provides an increased level of data virtualization that can
be useful when sources and targets (or deployment platforms) change.
From a design perspective, plug-in database stages match columns by order, not
name, so Table Definitions must match the order of columns in a query.
Runtime metadata
At runtime, the DS native parallel database stages always pre-query the
database source or target to determine the actual metadata (column names,
data types, null-ability) and partitioning scheme (in certain cases) of the source
or target table.
For each native parallel database stage, the following elements are true:
Rows of the database result set correspond to records of a parallel dataset.
Columns of the database row correspond to columns of a record.
The name and data type of each database column corresponds to a parallel
dataset name and data type using a predefined mapping of database data
types to parallel data types.
Both the DataStage Parallel Framework and relational databases support null
values, and a null value in a database column is stored as an out-of-band
NULL value in the DataStage column.
One disadvantage to the graphical orchdbutil metadata import is that the user
interface requires each table to be imported individually.
For example, the following SQL assigns the alias “Total” to the calculated column:
SELECT store_name, SUM(sales) Total
FROM store_info
GROUP BY store_name
The only exception to this rule is when building dynamic database jobs that use
runtime column propagation to process all columns in a source table.
If the connection fails, an error message might appear. You are prompted to view
additional detail. Clicking YES, as in Figure 13-3, displays a detailed dialog box
with the specific error messages generated by the database stage that can be
useful in debugging a database connection failure.
As another example, the OPEN command can be used to create a target table,
including database-specific options (such as table space, logging, and
constraints) not possible with the Create option. In general, it is not a good idea
to let DataStage generate target tables unless they are used for temporary
storage. There are limited capabilities to specify Create table options in the
stage, and doing so might violate data-management (DBA) policies.
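As a hedged illustration only, the kind of DDL that might be passed through the
Open command could resemble the following sketch. The table, table space, and
options shown are hypothetical and DB2-flavored; actual options must follow local
DBA standards.

-- Hypothetical DB2-flavored DDL supplied through the stage Open command;
-- the table space and logging clauses are examples of options that the
-- stage's Create option cannot express.
CREATE TABLE dw.cust_stage (
    cust_sk   BIGINT      NOT NULL,
    cust_ssk  VARCHAR(32) NOT NULL,
    cust_name VARCHAR(100),
    load_ts   TIMESTAMP   NOT NULL WITH DEFAULT CURRENT TIMESTAMP
)
IN ts_stage_data
NOT LOGGED INITIALLY;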
When directly connected as the reference link to a Lookup stage, the DB2/UDB
Enterprise, ODBC Enterprise, and Oracle Enterprise stages allow the Lookup
type to be changed to Sparse, sending individual SQL statements to the
reference database for each incoming Lookup row. Sparse Lookup is only
available when the database stage is directly connected to the reference link,
with no intermediate stages.
For scenarios where the number of input rows is significantly smaller (for
example, 1:100 or more) than the number of reference rows in a DB2 or Oracle
table, a Sparse Lookup might be appropriate.
Though there are extreme scenarios when the appropriate technology choice is
clearly understood, there might be less obvious areas where the decision must
be made based on factors such as developer productivity, metadata capture and
re-use, and ongoing application maintenance costs. The following guidelines can
assist with the appropriate use of SQL and DataStage technologies in a given job
flow:
When possible, use a SQL filter (WHERE clause) to limit the number of rows
sent to the DataStage job. This minimizes impact on network and memory
resources, and makes use of the database capabilities.
Use a SQL Join to combine data from tables with a small number of rows in
the same database instance, especially when the join columns are indexed. A
join that reduces the result set significantly is also often appropriate to do in
the database (see the SQL sketch after this list).
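A minimal sketch of both guidelines follows. The schema, table, and column names
are hypothetical, and the date arithmetic syntax varies by database.

-- Filter and join in the database, so only the rows and columns the job
-- needs cross the network into DataStage.
SELECT o.order_id,
       o.order_ts,
       c.cust_name
FROM sales.orders o
JOIN sales.customers c
  ON c.cust_id = o.cust_id              -- join columns assumed to be indexed
WHERE o.order_ts >= CURRENT_DATE - 1    -- WHERE clause limits rows sent to the job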
Using the DB2/UDB API stage or the Dynamic RDBMS stage, it might be
possible to write to a DB2 target in parallel, because the DataStage parallel
framework instantiates multiple copies of these stages to handle the data that
has been partitioned across the processing nodes.
The DB2/API (plug-in) stage must be used to read from and write to DB2
databases on non-UNIX platforms (such as mainframe editions through
DB2-Connect). Sparse Lookup is not supported through the DB2/API stage.
To connect to multiple DB2 instances, use separate jobs with their respective
DB2 environment variable settings, landing intermediate results to a parallel
dataset. Depending on platform configuration and I/O subsystem performance,
separate jobs can communicate through named pipes, although this incurs the
overhead of Sequential File stage (corresponding export/import operators),
which does not run in parallel.
If the data volumes are sufficiently small, DB2 plug-in stages (DB2 API, DB2
Load, Dynamic RDBMS) can be used to access data in other instances.
DB2 data type            DataStage data type
DATE                     date
DOUBLE-PRECISION         dfloat
FLOAT                    dfloat
INTEGER                  int32
MONEY                    decimal
REAL                     sfloat
SERIAL                   int32
SMALLFLOAT               sfloat
SMALLINT                 int16
Important: DB2 data types that are not listed in the table cannot be used in
the DB2/UDB Enterprise stage, and generate an error at runtime.
When writing to a DB2 database in parallel, the DB2/UDB Enterprise stage offers
the choice of SQL (insert/update/upsert/delete) or fast DB2 loader methods. The
choice between these methods depends on required performance, database log
usage, and recoverability.
The Write Method (and corresponding insert/update/upsert/delete)
communicates directly with the DB2 database nodes to execute instructions
in parallel. All operations are logged to the DB2 database log, and the target
tables might be accessed by other users. Time and row-based commit
intervals determine the transaction size, and the availability of new rows to
other applications.
The DB2 Load method requires that the DataStage user running the job have
DBADM privilege on the target DB2 database. During the load operation, the
DB2 Load method places an exclusive lock on the entire DB2 table space into
which it loads the data. No other tables in that table space can be accessed
by other applications until the load completes. The DB2 load operator
performs a non-recoverable load. That is, if the load operation is terminated
before it is completed, the contents of the table are unusable and the table
space is left in a load pending state. In this scenario, the DB2 Load DataStage
job must be rerun in Truncate mode to clear the load pending state.
Informix data type       DataStage data type
CHAR(n)                  string[n]
DATE                     date
DOUBLE-PRECISION         dfloat
FLOAT                    dfloat
INTEGER                  int32
MONEY                    decimal
NCHAR(n,r)               string[n]
NVARCHAR(n,r)            string[max=n]
REAL                     sfloat
SERIAL                   int32
SMALLFLOAT               sfloat
SMALLINT                 int16
VARCHAR(n)               string[max=n]
Important: Informix data types that are not listed in the table cannot be used
in the Informix Enterprise stage, and generate an error at runtime.
ODBC data type           DataStage data type
SQL_BIGINT               int64
SQL_BINARY               raw(n)
SQL_CHAR                 string[n]
SQL_DOUBLE               decimal[p,s]
SQL_FLOAT                decimal[p,s]
SQL_GUID                 string[36]
SQL_INTEGER              int32
SQL_BIT                  int8 [0 or 1]
SQL_REAL                 decimal[p,s]
SQL_SMALLINT             int16
SQL_TINYINT              int8
SQL_TYPE_DATE            date
SQL_TYPE_TIME            time[p]
SQL_TYPE_TIMESTAMP       timestamp[p]
SQL_VARBINARY            raw[max=n]
SQL_VARCHAR              string[max=n]
SQL_WCHAR                ustring[n]
SQL_WVARCHAR             ustring[max=n]
Important: ODBC data types that are not listed in the table cannot be used in
the ODBC Enterprise stage, and generate an error at runtime.
To read in parallel with the ODBC Enterprise stage, specify the Partition Column
option. For optimal performance, this column must be indexed in the database.
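Conceptually, each player issues the user-defined query with an extra predicate
on the partition column. The following sketch shows what the query for one of
four partitions might resemble; the exact predicate is generated by the stage,
and the table and column names here are hypothetical.

-- Sketch of the query issued by one node in a 4-node run (node number 2).
-- cust_id is the indexed column chosen as the Partition Column.
SELECT cust_id, cust_name, cust_status
FROM dw.customers
WHERE MOD(cust_id, 4) = 2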
Oracle data type         DataStage data type
DATE                     timestamp
NUMBER                   decimal[38,10]
Important: Oracle data types that are not listed in the table cannot be used in
the Oracle Enterprise stage, and generate an error at runtime.
The Upsert Write Method can be used to insert rows into a target Oracle table
without bypassing indexes or constraints. To generate the SQL required by the
Upsert method, the key columns must be identified using the check boxes in the
column grid.
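For a hypothetical target keyed on CUST_ID, the generated Upsert is conceptually
equivalent to the following pair of statements (an "update then insert" sketch;
this is not the literal SQL produced by the stage, and all names are hypothetical).

-- Conceptual equivalent of the generated "update then insert" Upsert;
-- the ORCHESTRATE.column placeholders stand for values from the incoming row.
UPDATE dw.customers
SET cust_name   = ORCHESTRATE.cust_name,
    cust_status = ORCHESTRATE.cust_status
WHERE cust_id = ORCHESTRATE.cust_id;

INSERT INTO dw.customers (cust_id, cust_name, cust_status)
VALUES (ORCHESTRATE.cust_id, ORCHESTRATE.cust_name, ORCHESTRATE.cust_status);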
Sybase data type         DataStage data type
BINARY(n)                raw(n)
BIT                      int8
DATE                     date
DATETIME                 timestamp
MONEY                    decimal[15,4]
REAL                     sfloat
SERIAL                   int32
SMALLDATETIME            timestamp
SMALLFLOAT               sfloat
SMALLINT                 int16
SMALLMONEY               decimal[10,4]
TINYINT                  int8
TIME                     time
VARBINARY(n)             raw[max=n]
Important: Sybase data types that are not listed in the table cannot be used in
the Sybase Enterprise stage, and generate an error at runtime.
Note: Unlike the FastLoad utility, the Teradata Enterprise stage supports
Append mode, inserting rows into an existing target table. This is done
through a shadow terasync table.
Teradata data type       DataStage data type
byte(n)                  raw[n]
byteint                  int8
char(n)                  string[n]
date                     date
float                    dfloat
graphic(n)               raw[max=n]
integer                  int32
numeric(p,s)             decimal[p,s]
real                     dfloat
smallint                 int16
time                     time
timestamp                timestamp
varbyte(n)               raw[max=n]
varchar(n)               string[max=n]
vargraphic(n)            raw[max=n]
Important: Teradata data types that are not listed in the table cannot be used
in the Teradata Enterprise stage, and generate an error at runtime.
Aggregates and most arithmetic operators are not allowed in the SELECT clause
of a Teradata Enterprise stage.
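For example, a SELECT of the following form (hypothetical names) must be avoided
in the stage; read the detail rows instead and perform the aggregation in a
downstream Aggregator stage.

-- Not suitable as the Teradata Enterprise stage SELECT: the aggregate
-- belongs in a downstream Aggregator stage.
SELECT store_id,
       SUM(sale_amt) AS total_sales
FROM retail.sales
GROUP BY store_id

-- Read the detail rows instead and aggregate inside the job:
SELECT store_id, sale_amt
FROM retail.sales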
Setting the SessionsPerPlayer too low on a large system can result in so many
players that the job fails due to insufficient resources. In that case
SessionsPerPlayer must be increased, and RequestedSessions must be
decreased.
Documentation for the Netezza Enterprise stage is installed with the DataStage
client, but is not referenced in the documentation bookshelf. You can find the
following installation and developer documentation for the Netezza Enterprise
stage in the Docs/ subdirectory in the installed DataStage client directory:
Connectivity Reference Guide for Netezza Servers
Netezza Interface Library
Netezza Load: Uses the NPS nzload utility to load directly to the target table.
Use when you have LOAD privileges for the target table and the data in the source
database is consistent, contains no default values, uses single-byte characters
only, and uses a predefined format.
External Table: Writes to an external table in NPS; data is then streamed into
the target table. Use when the data source contains default values for table
columns and uses a variable format for data encoding, such as UTF-8.
When writing data to the Netezza Performance Server by using the External
Table method, the log files are created in the /tmp directory in the Netezza
Performance Server. The following names are used for the log files:
/tmp/external table name.log
/tmp/external table name.bad
Note: The log files are appended to each time an error occurs during the write
operation.
The Connector library is constantly evolving, so the reader should always consult
the latest documentation and release notes.
The Connector library provides a common framework for accessing external data
sources in a reusable way across separate IS layers. The generic design of
Connectors makes them independent of the specifics of the runtime environment
in which they run.
Connectors provide a consistent user experience. They have a similar look and
feel, with minor GUI differences depending on the target type. The need for
custom stage editors is eliminated; there is a common stage editor and metadata
importer with rich capabilities.
The APIs and layers are not exposed to the general public as development APIs.
They are restricted to IBM, and enable the creation of new tools and
components. One example of a component built on top of the Connector API is
the Distributed Transaction stage (DTS), which supports the execution of
distributed transactions across heterogeneous systems through the XA Protocol.
A similar bridge for DS Server jobs is named SE Bridge. The same Connector
runtime can be used in Java™ applications through a JNI layer.
The job is designed in the DataStage Designer client application, which runs on
Windows only. The job can be started and stopped from the DataStage Director
client application. The clients access the ISF server through the internal HTTP
port 9080 on the WebSphere Application Server. The stage types, data schemas, and job
definitions are saved to the common metadata repository.
When the job starts, the DataStage Enterprise engine starts a PX Bridge
Operator module for each Connector stage in the job. The PX Bridge acts as the
interface between the DataStage Enterprise framework and Connectors. The
Connectors are unaware of the type of the runtime environment.
The PX Bridge performs schema reconciliation on the server on which the job is
initiated. The schema reconciliation is the negotiation process in which the PX
Bridge serves as the mediator between the framework that offers the schema
definition provided by the job design (the schema defined on the Connector
stage’s link) and the Connector that offers the actual schema definition as
defined in the data resource to which it connects. During this process, the
attempt is made to agree on the schema definition to use in the job. The
differences that might be acceptable are as follows:
Separate column data types for which data can be converted without data
loss
Unused columns that can be dropped or ignored
The reject links in Connector stages are depicted in Figure 14-4. One major
implication of this reject functionality is the ability to output all records, regardless
of the presence of errors. This simplifies the development of real-time ISD jobs,
which need to synchronize the output from database stages with the InfoSphere
Information Services Director (ISD) Output stage. This is discussed in detail in
Chapter 16, “Real-time data flow design” on page 293.
The goal is to reduce unpredictable problems that can stem from non-matching
metadata in the following ways:
Identifying matching fields
Performing data conversions for compatible data types
Failing jobs that cannot be automatically corrected
Connection objects might be dragged onto existing Connector stages, or onto the
canvas. When dragged into the canvas, a new Connector stage is created.
Table definitions are associated with the data connection used to import the
metadata. This is illustrated in Figure 14-9.
These connectors use DMDI to display a GUI to browse metadata objects, then
to import them and to set stage properties. The DMDI is launched from the stage
editor.
When used in the source context, the Connector extracts data from the database
by executing SELECT SQL statements. It provides this data to other stages in
the job for transformation and load functions. When used in the target context,
the Connector loads, updates, and deletes data in the database by executing
INSERT, UPDATE, and DELETE SQL statements. The lookup context is similar
to the source context, with the difference that the SELECT SQL statements are
parameterized. The parameter values are provided dynamically on the input link
by the other stages in the job.
The ODBC Connector supports a special type of link called reject link. The
Connector can be configured to direct the data that it cannot process to the reject
link, from which it can be sent to any stage in the job (for example, the
Sequential File stage). The rejected data can later be inspected and
re-processed. Reject links provide the option to continue running the job when
an error is detected, instead of interrupting the job at that moment. As shown in
Figure 14-10 on page 233, reject links are supported in lookup and target
context.
The Connector allows the option of passing LOBs by reference, rather than by
extracting the data and passing it inline into the job flow. When configured to
pass LOB values by reference, the ODBC Connector assembles a special block
of data, called a locator or reference, that it passes into the job dataflow. Other
Connectors are placed at the end of the job dataflow. When the LOB locator
arrives at the other Connector stage, the Connector framework initiates the
retrieval of the actual ODBC data represented by the reference and provides the
data to the target Connector so that it can be loaded into the represented
resource. This way it is possible to move LOB data from one data resource to
another.
The Connector supports array insert operations in target context. The Connector
buffers the specified number of input rows before inserting them to the database
in a single operation. This provides for better performance when inserting large
numbers of rows.
The Connector user designs the SQL statements with the SQL Builder, a graphical
tool that enables construction of SQL statements in a drag-and-drop fashion.
The IBM Information Server comes with a set of branded ODBC drivers that are
ready for use by the ODBC Connector. On Windows, the built-in driver manager
is used. On UNIX, a driver manager is included with the IBM Information Server
installation.
It supports Teradata server versions V2R6.1 and V2R6.2 and Teradata client
TTU versions V8.1 and V8.2. The Connector uses CLIv2 API for immediate
operations (SELECT, INSERT, UPDATE, DELETE) and Parallel Transporter
Direct API (formerly TEL-API) for bulk load and bulk extract operations.
Parallel bulk load is supported through LOAD, UPDATE, and STREAM operators
in Parallel Transporter. This corresponds to the functionality provided by the
FastLoad, MultiLoad, and TPump Teradata utilities, respectively. When the
UPDATE operator is used it supports the option for deleting rows of data
(MultiLoad delete task).
The Teradata usage patterns are illustrated in Figure 14-12 on page 237.
Load (FastLoad): Fastest load method. Restrictions: uses a utility slot, INSERT
only, locks the table, no views, no secondary indexes.
Update (MultiLoad): Supports INSERT, UPDATE, DELETE, views, and non-unique
secondary indexes. Restrictions: uses a utility slot, locks the table, no unique
secondary indexes, table inaccessible on abort.
Figure 14-13 Teradata stages and their relation to Teradata client APIs
In Table 14-3, we present a feature comparison among all Teradata stage types.
Read, bulk (Export): YES | Export only | Export only | No | Read only
Write, bulk (Load): YES | Load only | Load only | Load only | Write only
Multiple input links: YES | No | No | No | No
Teradata PT client (TPT): Required | Not required | Not required | Not required | Not required
The Connector is based on the CLI client interface. It can connect to any
database cataloged on the DB2 client. The DB2 client must be collocated with
the Connector, but the actual database might be local or remote to the
Connector.
The Connector provides separate sets of connection properties for the job setup
phase (conductor) and the execution phase (player nodes), so the same database
might be cataloged differently on the conductor and the player nodes.
The following list describes common scenarios when the alternate conductor
setting might be needed:
Scenario 1: All nodes run on the same physical machine
If there are multiple DB2 client versions installed on the system, or if multiple
instances are defined, it is important that all DB2 stages in the job specify the
same conductor connection parameters; otherwise, the job fails because of
these restrictions. Using the alternate conductor setting helps achieve this.
Scenario 2: Each node runs on a separate machine
In this case, the conductor might not run on the same machine as the player
processes. The same remote database that the Connector is trying to access
might be cataloged differently on each node, and it might appear to the conductor
under a different instance or database name than it does to the players. Using
the alternate conductor setting allows you to use a separate instance/database
for the conductor process.
In terms of functionality, the DB2 Connector offers more capabilities than all of
the existing stages, with new features as listed in 14.5.1, “New features” on
page 244. There are four DB2 stages available to DataStage users, as listed in
Table 14-5 on page 247.
The plug-in stages collectively offer approximately the same functionality as both
the operator and connector. However, on the parallel canvas the two plug-in
stages perform significantly worse than the operator and connector. Because
these stages offer poor performance and do not provide any functionality that is
not offered by the connector or operator, the emphasis of this document shall be
on a comparison of the DB2 Connector and DB2 EE Operator.
The DB2 Connector must be considered the stage of choice in almost every
case. It offers more functionality and significantly better performance than the
DB2 API and DB2 Load stages, and it must be used instead of the EE stage in
most cases.
In the case of the connector, the player nodes might correspond to DB2 partitions
(as is the requirement for the EE stage), but this is not a requirement. Because
the connector communicates with DB2 through the remote client interface, the
deployment options are much more flexible. This is depicted in Figure 14-16.
$ORACLE_HOME and $ORACLE_SID: Oracle environment variables. They apply to the
connector the same way they apply to the Enterprise stage, and are used by the
Oracle client library to resolve the Oracle service name to which to connect.
$APT_ORACLE_LOAD_OPTIONS: Not applicable to the connector. The connector does not
use SQL*Loader for bulk load, so the loader control and data files are not
applicable. The connector uses the OCI Direct Path Load API; the settings that
control the load process are provided through connector properties.
$APT_ORA_WRITE_FILES: None. The connector does not use SQL*Loader for bulk-load
operations.
The DT stage reads messages from the work queue and updates external
resources with the data rows corresponding to those work queue messages. The
reading of messages and writing to external resources is done in atomic
distributed transactions using the two-phase XA protocol.
The DT stage can also be used outside the context of a queuing application.
Transactions are committed upon EOW markers. This has application in ISD jobs
that need to guarantee transactional consistency, but do not involve the
processing of messages.
It is implemented in C++ using a third-party interface library from a partner, and
uses DMDI to allow selection from cube data. See Figure 14-20.
Parallel batch jobs must be implemented with parallel techniques that yield the
necessary scalability, which is the focus of this chapter. The basic assumption
throughout this discussion is that no applications other than DataStage (DS)
update the target databases during batch windows.
This chapter does not focus on individual stage parameters, but rather on data
flow design patterns. Individual stage types are the subject of other chapters in
this document. For instance, when dealing with database interfaces, there are
several options for each database type, along with their tuning parameters.
In most cases, startup times tend to be small, but they grow as jobs grow in
size, and they can add up to considerable time when a large number of jobs is run.
Jobs must be designed in a way that allows them to process large amounts of
data as part of a single run. This means input files must be accumulated and
submitted as input to a single job run. This way, instead of starting up each and
every job in a long sequence several times (such as once for each input file), the
sequences and their jobs are started up only once.
Once up and running, well-designed jobs are able to cope with large amounts of
data. The use of parallel techniques leads to more streamlined and efficient
designs that help mitigate the impact of startup times.
Network and disk I/O are the two slowest hardware resources in most systems.
Therefore, they must be used judiciously. For small amounts of data, they are not
a concern. But when there is a need to meet the demands of large environments,
these two resources are no longer negligible. There are a few aspects to keep in
mind that have an impact on one or more of these resources:
Minimize the landing of data to disk.
– Instead of writing and reading data to and from disk after each and every
small job, create jobs that do more processing as part of the same
execution.
– This was the typical approach for DS Server applications, which must be
avoided in the parallel world.
Minimize interaction with the database Server.
– Extract/Load data in bulk.
– Avoid sparse lookups.
– Avoid extracting unnecessary data for lookups
For instance, avoid full table extracts when the number of input rows is
significantly smaller. One does not need to extract 100 million rows from a
database as a lookup table to process only a few thousand incoming
records.
Avoid unnecessary re-partitioning.
Re-partitioning cost is not pronounced in SMP systems, but it adds to private
network usage in DataStage grid environments.
Limit the amount of reference data for normal lookups.
– Single large normal lookup reference tables can consume a huge amount
of memory.
– Avoid full table extracts when the number of rows being processed is
significantly smaller.
– If the number of rows cannot be restricted, use a sorted merge instead of a
normal lookup.
Without going into detail about the origins of bad practices, the following
sections discuss a few recurring patterns that are unfortunately common in the
field and lead to less than optimal results, to say the least. Bad practices
include, but are not limited to, the topics discussed in the following subsections.
The active/passive nature of DS Server job stages limits the types of stages and
prevents a data flow execution model. This ultimately prevents joins and merges
from being made available in DS Server. As a result, DS Server jobs follow
design patterns that include the following attributes:
Singleton lookups.
A SQL statement is executed for each and every incoming row.
Small jobs with frequent landing of data to disk.
The main reason being checkpoint and job restartability.
All correlation must be done using hash files and Transformers.
Results from lack of join/merge functionality.
Manual partitioning with multiple instances of a same job.
This invocation involves preparing and sending one or more packets to the
database server; the database server must execute the statement with the given
parameters, and the results are returned to the Lookup stage.
Even if the database server is collocated with DataStage, it is still a bad solution.
There is at least one round-trip communication between the Lookup stage and
the database server, as well as a minimum set of instructions on both sides
(DataStage and database) that is executed for each and every row.
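To make the per-row cost concrete, a singleton (sparse) lookup amounts to the
database executing a parameterized statement such as the following once for every
incoming row (table and column names are hypothetical).

-- Executed once per incoming row: one network round trip and one
-- execute cycle per row, which dominates run time at high volumes.
SELECT cust_sk, cust_name
FROM dw.customers
WHERE cust_ssk = ?    -- bound to the lookup key of the current row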
By using bulk database loads and unloads, larger sets of data are transferred
between DataStage and the database. The usage of the network transports as
well as the code paths on each side are optimized. The best solution is then
achieved, at least in terms of DataStage interfacing.
In rare occasions, sparse lookups might be tolerated. However, that is only when
it is guaranteed that the number of incoming rows, for which the Lookup must be
performed, is limited to at most a few thousand records.
Those systems are not equipped with the right tools to only extract the net
difference since the last refresh. As a result, incremental transfers are not
possible and the entire content of the source database must be copied to the
target environment in each processing cycle (frequently on a daily basis).
Source systems must implement a form of change data capture to avoid this
unnecessary burden.
The size of reference datasets and lookup file sets must be restricted using the
techniques presented in this chapter.
It might not be a single lookup, but rather the combination of all normal
lookups across multiple concurrent jobs, that eats up all available memory and
leaves little or no space in physical memory for other processes and applications.
Even worse, such projects can consume not only the physical memory, but all
swap space as well. In these severe cases, DataStage throws fatal exceptions
related to fork() errors. The application must be changed to use the right
mechanisms:
Restrict the size of the reference data
Use sorted joins
The place to store data that might be re-referenced in the same batch cycle is in
persistent datasets or file sets. As jobs process incoming data, looking up data,
resolving references and applying transformations, the data must be kept local to
DataStage. Only at the last phase of the Extract, Transform, and Load (ETL)
process can data be loaded to the target database by provisioning jobs.
We understand that combining too much logic inside a single job might saturate
the hardware with an excessive number of processes and excessive memory use.
That is why it is important to know how much memory and processing power is
available. However, considering the difference between memory and disk access
speeds, we tend to favor larger jobs.
The number of stages for each job must be limited by the following factors:
The amount of hardware resources available: processors and memory
The amount of data in the form of normal lookup tables
– One or more large lookup tables might end up consuming most of
available memory.
– Excessively large normal lookups must be replaced by joins.
– The amount of data on the reference link must be restricted to what is
minimally necessary.
Natural logic boundaries
– Such as when producing a reusable dataset or lookup fileset. The creation
of such reusable set must be implemented as a separate job.
– Database extraction and provisioning jobs must be isolated to separate
jobs.
– There are times when an entire dataset must be written out to disk before
the next step can proceed.
This is the case, for instance, when we need to upload a set of natural
keys to extract relevant reference data (see 15.6.1, “Restricting incoming
data from the source” on page 270).
15.4 Checkpoint/Restart
The parallel framework does not support checkpoints inside jobs. The
checkpoint/restart unit is the job.
DataStage sequences must be designed in a way that they can catch exceptions and
avoid re-executing jobs that have already completed. DataStage sequences provide
the ability to design restartable sequences. The designer, however, must
explicitly use constructs to enable this behavior.
The DataStage Server approach has always been to land data frequently, after
small steps, as pointed out in previous sections. It is not the same with parallel
jobs. Again, the question is: where is the biggest cost? The answer is, in
descending order:
1. Network access
2. Disk
3. Memory
4. Processor
As a result, it is better to restart a bigger scalable, optimized job than to land data
after each and every small step.
There are certain tasks that ETL tools do better, and tasks at which databases
excel. It is not only a matter of capabilities, but also how many resources are
available to each, and their financial, operational, and licensing costs.
As stated before, network and database interfacing have the highest costs (that
is, loading data into and extracting data from databases are costly operations).
The following list details the conditions that, if true, indicate that you should
consider performing transformations inside the DB:
The database server is massively parallel and scalable
There is enough HW capacity and licensing on the database side for the extra
processing cost for transformations
All the data is contained inside the DB
Data is already cleansed
There are guarantees that set-oriented SQL statements do not fail
For example, you are certain that a “SELECT .. INTO…” always runs
successfully.
The database is one of the database types supported by the Balanced
Optimizer
If you decide to perform the transformations inside the DB, then you can
delegate transformation logic to the database server. That is what the Balanced
Optimizer (Teradata) was built for (and is a new component of the InfoSphere
product suite in Information Server 8.1). There is no need to extract and load
data back into the database because it is already there.
The key is the use of Balanced Optimizer, so the logic is designed inside the DS
paradigm, keeping the ability to draw data lineage reports. The use of pure SQL
statements defeats the purpose, as there is no support for data lineage and the
application becomes increasingly complex to maintain.
There is a multitude of topics covered in other chapters that are all still valid and
relevant. For example, consider the type of database stage for extracting and
loading data from and to databases. You need to use bulk database stages, or
loads and unloads take too long. This is dependent upon how efficiently DS
interoperates with target database systems.
The specifics on databases and other stage types and their properties are
described in the other chapters of this IBM Redbooks publication.
Networks have a certain bandwidth and do not yield a higher throughput unless
the entire network infrastructure is upgraded (assuming there is a faster network
technology available).
This means using Change Data Capture on the mainframe, for instance, and not
resending the entire source database. If it is not possible to restrict the source
data sent to DataStage, the application has to resort to full target table extracts.
Under those circumstances, you must resort to sorted joins instead of normal
lookups, as there is a high probability that the reference data does not fit in
memory.
The first solution is inefficient if the size of the input data is small when compared
to the database table size. The second solution involves a method that must be
used for really small input files.
In the case of SCD Type 2, doing a full extract implies restricting the set to
rows with the Current Flag set to 1, or a filter on Expiration Date. Even
restricting the extracted set to the most current records might represent an
amount of data that is too large to hold in memory, or that contains far more
records than are needed for a given processing cycle. Instead, we propose an
approach that is based on the
optimization of the interaction between DataStage and the target database,
which involves the following elements:
Restriction of the reference dataset to only what is minimally relevant for a
given run.
Avoid extracting reference data that is not referenced in the input files.
Use of the optimal correlation strategy inside DataStage.
Use elements such as the join, normal lookup, change capture, and SCD.
Use data already stored locally to DataStage in the form of datasets or
filesets.
Avoid repeated and redundant database extracts.
Each input file maps directly to a table. The Source System Key (SSK) fields are
marked as *_SSK. Surrogate key (SK) fields follow the pattern *_SK.
In this chapter we abstract the type of slowly changing dimension. However, the
techniques described in the following sections apply to any SCD types.
Another pair of jobs, similar to the ones depicted in Figure 15-2 on page 273
and Figure 15-3, must be implemented for Product SSKs.
The net result of this phase is that each temp table (one for each SSK type)
contains all the unique SSKs for that type across all input files.
QS matching
QualityStage matching requires candidate records based on blocking keys.
There is no exact key match that can be used to extract candidate records from
the database. The load process needs to create a dataset containing unique
blocking keys, instead of unique SSKs. This set of unique blocking keys is
loaded into the aforementioned temporary table.
SCD Type 2
For SCD Type 2, you still need to upload a set of unique SSKs. The record
version to be extracted is determined by the SQL statement that implements the
database join, as discussed in the next section.
Figure 15-4 on page 275 shows an example where the target table contains 100
million records. The uploaded temp table contains less than a million records.
The result from the join is at most the same as the number of records in the
temporary table.
The number of records coming out of the join is typically less than the number in
the temp table, because newly arrived SSKs are not yet present in the target table.
Figure 15-4 Extracting existing customer SKs with a database join
The join in Figure 15-4 guarantees that no useless data is extracted for the given
run. Only what is necessary (that is, referenced by any of the input files) is
extracted.
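A hedged sketch of the extraction query behind this technique follows, assuming a
temporary table loaded with the unique incoming SSKs and a large customer target
table (all names are hypothetical).

-- Only customers referenced by the current batch are extracted; the temp
-- table is small, so the join touches a fraction of the 100-million-row target.
SELECT c.cust_ssk,
       c.cust_sk
FROM stage.temp_cust_ssks t
JOIN dw.customer c
  ON c.cust_ssk = t.cust_ssk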
QS matching
As described in the previous section, for QS matching the temp table would have
been loaded with a set of unique blocking keys.
For QS Matching, the database join extracts the set of candidate records for all
input records, which serves as reference input for a QS Matching stage.
SCD Type 2
For SCD Type 2, the database join implements a filter to restrict records based
on one of the following elements (a sketch follows the list):
Current flag
Expiration Date
Effective Date
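A hedged SCD Type 2 variant of the same extraction join, assuming hypothetical
current_flag and expiry_dt columns:

-- Restrict the extraction to the current version of each dimension row.
SELECT c.cust_ssk,
       c.cust_sk
FROM stage.temp_cust_ssks t
JOIN dw.customer c
  ON c.cust_ssk = t.cust_ssk
WHERE c.current_flag = 1          -- or: c.expiry_dt = DATE '9999-12-31'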
This makes it even more important to resort to Joins. The assumption is we are
dealing with large amounts of data, so performing a full table scan is much more
efficient.
The SCD stage embeds logic specific to the processing of slowly changing
dimensions. When using the SCD stage, the data for the reference link must still
be restricted only to what is relevant for a given run. The SCD stage receives in
its reference input whatever the corresponding source stage produces.
In Figure 15-5 on page 278 we show an example of using a Join to bring input
and reference records together based on the SSK.
The join type is left outer join. If there are incoming SSKs that were not found in
the target database table (that is, they are not present in the
Existing_Cust_SSKs dataset), the join still writes those records out. The
subsequent Transformer directs matching and non-matching records to separate
branches.
Non-matched records are directed to a surrogate key generator. All records, with
new and existing SKs, are saved to the same dataset (Resolved_Cust_SSKs).
The top part of the example figure shows using a join to bring input and reference
records together based on the SSK.
The dataset containing the result from the database join (Existing_Cust_SSKs)
might have to include not only the existing SK values, but other fields as well.
This would be the situation, for example, where business requirements dictate
that other data values be compared to determine whether or not a given record
should be updated.
The Transformer after the join could compare checksum values. When doing
comparisons on checksums, the following tasks must be performed:
A checksum value must be stored in the database table.
The input dataset must be augmented with a newly created checksum value
on the same set of columns used originally to create the checksums stored in
the target table.
In Figure 15-7 on page 280, we describe how to use the resolved surrogate key
datasets for Customers and Products to resolve foreign key references in the fact
table (Sales).
For simplicity, we assume that sales records are never updated.
Records are always added to the target table. The update process follows the
same approach as outlined for the Customer table.
The reference datasets are already sorted and partitioned. The input Sales file
needs to be re-partitioned twice: first in the input to the first sort (for Customer
lookup) and then in the input to the second sort (for Product lookup).
After the SOR is loaded with the day’s incremental load, the DW database must
be updated.
One might think of adopting a path as depicted in Figure 15-8. That diagram
represents a scenario in which the second database is updated with information
extracted from the first database.
Instead, Information Server must be kept as the transformation hub. Any data
that is fed into the second and subsequent databases is derived from datasets
stored locally by Information Server as persistent datasets or filesets. This is
depicted in Figure 15-9.
There is minimal interaction with the database. The SOR database is only
queried to the extent of obtaining restricted reference datasets as outlined in the
previous section.
Information Server should keep as datasets, for instance, rolling aggregations for
a limited number of days, weeks and months. Those datasets act as shadows to
content in the second and subsequent DBs in the chain.
One good example is banking. Most likely there are separate feeds during the
same processing window that relate to each other. For instance, new customers
are received at one time, and then later the account movements are received as
well. In this case, why first load and then re-extract the same data? Save the new
customer SKs in a local dataset or file set, and use this one later on for
subsequent files.
This approach avoids uploading to the temp table SSKs that were already
resolved by the previous run.
This way, we avoid reloading and re-extracting SSKs that were already resolved
by the previous run. We combine newly created SKs and newly extracted SKs
into a single dataset that serves as input to the next run.
Figure 15-11 Resolving customer SKs, taking into account already resolved ones
The goal is to minimize the impact of the startup time for all jobs. You must strike
a balance between the urgency to process files as they arrive, and the need to
minimize the aggregate startup time for all parallel jobs.
The ODBC stage can be set to run a query in parallel, depending on the number
of partitions on which the stage is run. The stage modifies the SQL query with an
extra where clause predicate based on a partition column.
The Oracle Enterprise stage requires one to set the partition table property.
The DB2 UDB Enterprise stage requires DB2 DPF to be installed on the target
database, to take advantage of direct connections to separate DB2 partitions.
DB2 connectors do not need DPF installed on the target database, but the
degree of parallelism of the query is still determined by the partitioning scheme of
the source table.
See Chapter 13, “Database stage guidelines” on page 189 for details on how to
do parallel reads with the various database stage types supported by DataStage.
Figure 15-12 Making use of the sort order of a database query result
If the output from the query is naturally sorted according to the criteria necessary
to perform the join, one can set properties as marked in red in Figure 15-12.
By naturally sorted we mean the query itself does not include an ORDER BY
clause, but still returns a properly sorted result set. This might happen, for
instance, if the database query plan involves a scan on index pages and the
index definition matches the sort order required in the DS job flow.
However, if the result set is not naturally sorted, a sort must be performed in one
of two places:
In the DS parallel job
Inside the database by means of an ORDER BY clause (a sketch follows)
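A minimal sketch of the second option follows, assuming the downstream stages
expect the data sorted on a hypothetical cust_ssk key; the ORDER BY columns must
match the sort keys expected in the DS job so that the link can be treated as
already sorted.

-- Push the sort into the database; the ORDER BY key must match the
-- join/sort key expected by the downstream stages.
SELECT cust_ssk, cust_sk, cust_name
FROM dw.customer
ORDER BY cust_ssk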
The first method uses highly optimized APIs to load large volumes of data the
fastest possible way. The underlying mechanisms tend to be specific to each
database, such as Oracle, DB2, and Teradata. With bulk loads, no SQL
statements are specified and they tend to be much faster than Upserts.
Bulk loads have the option of turning off or temporarily disabling indices and
constraints. These can be rebuilt and re-enabled at the end of the load process.
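As an illustration only, when the disable/rebuild bracket is handled outside the
stage, it might look like the following Oracle-flavored statements (the index
name is hypothetical).

-- Before the bulk load: take the index out of the way.
ALTER INDEX dw.ix_customer_ssk UNUSABLE;

-- ... bulk load runs here ...

-- After the load: rebuild once, instead of maintaining the index row by row.
ALTER INDEX dw.ix_customer_ssk REBUILD;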
The Upsert method implies the execution of SQL statements, either generated
automatically or specified by the user as a custom SQL. Upserts rely on
database call-level interfaces and follow a record-at-a-time processing model, as
opposed to bulk loads. Most database CLIs support the execution of SQL
statements by sending arrays of rows to the DB.
Upserts invariably require indices to be present and enabled; otherwise, the cost
of executing an update or delete statement with a WHERE clause is prohibitive,
requiring a full table scan per row.
Figure 15-14 presents the table action options for the DB2 Connector, when the
Write Mode is set to Bulk Load.
Although DataStage presents a single stage type with different options, these
options map to underlying Orchestrate operators and database load utilities at
runtime.
The question we address in this section is how to decide which database load
method to use for a given task. The answer depends on a number of
factors:
Size of the input dataset versus the size of the target table
How clean the data to be uploaded is, and whether there might be any rejects
Ratio of inserts to updates
To help determine the database load method to use, we provide the following key
selection criteria:
Bulk Loads
– All rows of the input dataset are new to the target DB
– There is an option of disabling indices and constraints:
• Used when the cost of re-enabling indices and constraints is less than
the cost of updating indices and evaluating constraints for each and
every input row.
• Used when the data is guaranteed to be thoroughly cleansed by the
transformation phase, so when indices and constraints are re-enabled,
they do not fail.
Upserts
– Required whenever the following is needed:
• An existing record must be removed (Delete mode);
• An existing record must be replaced (Delete then Insert mode);
• An existing record must be updated.
– For new records, the following options are available:
• Insert
• Update then Insert
• Insert then Update
These must be used when the data is not sufficiently clean (for instance,
there is a chance that new records might already be present in the target
table). Duplicate and violating records can be caught through reject links.
Otherwise, you must consider doing Bulk Loads for new records instead.
The use of Upsert mode for updates is the natural way with DataStage database
stages.
Bulk updates
There is one additional alternative that can be explored for applying bulk updates
to target tables, assuming the incoming update records are guaranteed to be
clean. Perform the following steps:
1. Bulk load the update records to a database temporary (or scratch) table
a. This table should have neither indices nor constraints;
b. The assumption is the input update records are cleansed and are
guaranteed not to throw any SQL exceptions when processed against the
target table. This cleansing must have been accomplished in the
transformation phase of the ETL process;
2. Execute a single Update SQL statement that updates the target table from the
records contained in the temporary table.
Figure 15-15 Combination of bulk load and SQL statement for processing bulk updates
The Bulk Update technique avoids that overhead because it assumes the data
does not raise any SQL exceptions, so there is no need to check the return
status on a row-by-row basis. This technique requires closer cooperation with
DBAs, who have to set up the necessary table spaces for the creation of the
temp/scratch tables. Ideally, the target database can execute each of the two
steps in parallel.
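A hedged sketch of step 2, the single set-oriented statement applied after the
temp table has been bulk loaded (a correlated UPDATE in DB2/Oracle style; all
names are hypothetical):

-- One statement applies every update from the bulk-loaded temp table;
-- no per-row round trips and no per-row status checking.
UPDATE dw.customer c
SET (cust_name, cust_status) =
    (SELECT t.cust_name, t.cust_status
     FROM stage.temp_cust_updates t
     WHERE t.cust_ssk = c.cust_ssk)
WHERE EXISTS (SELECT 1
              FROM stage.temp_cust_updates t
              WHERE t.cust_ssk = c.cust_ssk);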
We are aware that there are concerns related to limiting the number of job
executions, processing as much data as possible in a single job run, and
optimizing the interaction with the source and target databases (by using bulk
SQL techniques). All these are related to bulk data processing.
However, there are several applications that do not involve the processing of
huge amounts of data at once, but rather deal with several small, individual
requests. We are not saying the number and size of requests are minimal. They
might actually be relatively large, but not as large as in high-volume batch
applications.
The types of applications in this chapter are focused on scenarios where there is
a large number of requests, each of varying size, spanning a long period of time.
There are two capabilities that have been added to DataStage over time, to
address the needs of real-time:
MQ/DTS (Distributed Transaction stage)
MQ/DTS addresses the need for guaranteed delivery of source messages to
target databases, with the once-and-only-once semantics.
This type of delivery mechanism was originally made available in DataStage
7.5, in the form of the UOW (Unit-of-Work) stage. The original target in DS 7.5
was Oracle. In InfoSphere DataStage 8.X, this solution has been substantially
upgraded (incorporating the new database Connector technology for various
database flavors) and re-branded as the Distributed Transaction stage.
DTS and ISD work in different ways and serve different purposes. In this section
we discuss the specifics of each. However there are common aspects that relate
to both:
Job topologies
Real-time jobs of any sort need to obey certain rules so they can operate in a
request/response fashion.
Transactional support
– Most real-time applications require updating a target database. Batch
applications can tolerate failures by restarting jobs, as long as the results
at the end of processing windows are consistent.
– Real-time applications cannot afford the restarting of jobs. For each and
every request, the net result in the target database must be consistent.
End-of-wave
– The parallel framework implements virtual datasets (memory buffers) that
are excellent for batch applications. They are a key optimization
mechanism for high volume processing.
– Real-time applications cannot afford to have records sitting in buffers
waiting to be flushed.
– The framework was adapted to support End-of-Waves, which force the
flushing of memory buffers so responses are generated or transactions
committed.
Payload processing
– Frequently batch applications have to deal with large payloads, such as
big XML documents.
– Common payload formats are COBOL and XML.
Pipeline parallelism challenges
Fundamental for performance, the concept of pipeline parallelism introduces
challenges in real-time, but those can be circumvented.
We do not go into the details of the deployment or installation of DTS and ISD.
For that, the reader must see the product documentation.
The term near-real-time has been used, and applies to the following types of
scenarios:
Message delivery
– The data is delivered and expected to be processed immediately.
– Users can accept a short lag time (ranging from seconds to a few
minutes).
– There is no person waiting for a response.
– Examples:
• Reporting systems
• Active warehouses
– The following DS solutions can be applied:
• MQ->DTS/UOW
• ISD with Text over JMS binding
The high cost of starting up and shutting down jobs demanded that DataStage be
enhanced with additional capabilities to support these types of scenarios,
because implementing them as batch applications is not feasible.
This execution model fits well for cases when there are absolutely no
interdependencies between the requests, and the methods that implement the
services are simple. As the picture becomes more complicated, this model
reaches its limits, and that is where the parallel framework has an advantage.
Pipeline and partitioning parallelism allow for the breaking up of the application
logic into several concurrent steps, and execution paths. This is depicted in
Figure 16-1.
There are at least two scenarios in which the parallel framework helps break
through the limitations of a single-threaded execution model:
Services of increased complexity
In real-time decision-support applications (such as in the sale of insurance
policies), the amount of input and output data is small, but the service itself
must fetch, correlate, and apply rules on relatively large amounts of data from
several tables across data sources. This might involve doing lots of database
queries, sorts/aggregations, and so forth. All this processing takes a
considerable amount of time when processed by a single-threaded enterprise
JavaBean method. With the parallel framework, the several steps can not only run
as concurrent pipelined stages, but can also be partitioned across multiple
execution paths.
16.4.2 End-of-wave
The ability for jobs to remain always-on introduces the need to flush records
across process boundaries.
The parallel framework was originally optimized for batch workloads, and as
such, implements the concept of buffers in the form of virtual datasets. These
buffers are an important mechanism even for real-time jobs. However, if for a
given request (either an MQ message or an ISD SOAP request) there are not
enough records to fill the virtual dataset buffers, those records can sit in the
buffers waiting for more requests to arrive before being flushed downstream.
Figure 16-2 shows an example of how EOW markers are propagated through a
parallel job. The source stage can be an MQConnector for the purpose of this
illustration (in this case, the target stage is a DTS/Distributed Transaction stage).
EOW markers are propagated as normal records. However, they do not carry
any data. They cause the state of stages to be reset, as well as records to be
flushed out of virtual datasets, so a response can be forced.
The ISD and MQConnector stages generate EOW markers the following way:
ISD Input
– It issues an EOW for each and every incoming request (SOAP, EJB, JMS).
EOWs modify the behavior of regular stages. Upon receiving EOW markers, the
stage’s internal state must be reset, so a new execution context begins. For most
stages, (Parallel Transformers, Modify, Lookups) this does not have any practical
impact from a job design perspective.
For database sparse lookups, the record flow for the Lookup stage is the same,
but the stage needs to keep its connections to the database across waves, instead
of re-connecting after each and every wave; otherwise, performance is poor.
However, there are a few stages whose results are directly affected by EOWs:
Sorts
Aggregations
For these two stages, the corresponding logic is restricted to the set of records
belonging to a certain wave. Instead of consuming all records during the entire
execution of the job, the stage produces a partial result, just for the records that
belong to a certain wave. This means that a sort stage, for instance, writes out
sorted results that are sorted only in the wave, and not across waves. The stage
continues with the records for the next wave, until a new EOW marker arrives.
The transaction context for database stages prior to Information Server 8.5
always involved a single target table. Those stages support a single input link.
Pre-8.1, Information Services Director (ISD) and 7.X RTI jobs had no option
other than resorting to multiple separate database stages when applying
changes to target databases as part of a single job flow. The best they could do
was to synchronize the reject output of the database stages before sending the
final response to the ISD Output stage (See 16.7.4, “Synchronizing database
stages with ISD output ” on page 353 for a discussion on this technique).
The Connector and DTS stages are discussed in the following subsections.
Connector stages
Information Server 8.5 has Connector stages for the following targets:
DB2
Oracle
Teradata
ODBC
MQSeries
IS 8.5 Connectors support multiple input links, instead of a single input link in
pre-8.5 connectors and Enterprise database stage types. With multiple input
links, a Connector stage executes all SQL statements for all rows from all input
links as part of single unit of work. This is depicted in Figure 16-4 on page 305.
With database Connectors, ISD jobs no longer have to cope with potential
database inconsistencies in the event of failure. ISD requests might still have to
be re-executed (either SOAP, EJB, or JMS, depending on the binding type), but
the transactional consistency across multiple tables in the target database is
guaranteed as a unit of work.
For transactions spanning across multiple database types and those that include
guaranteed delivery of MQ messages, one must use the Distributed Transaction
stage, which is described next.
Figure 16-6 on page 308 illustrates how a real-time job must be structured. The
examples in this chapter use WISD stages, but they might be replaced with the
MQConnector as the source and the DTS as the target. The same constraints
apply to both message-oriented and SOA jobs.
All the data paths must originate from the same source, the source real-time
stage (either WISD Input or MQConnector). Figure 16-6 uses concentric blue
dotted semicircles to illustrate the waves originating from the same source.
Real-time jobs might split the source into multiple branches, but they must
converge back to the same target: either a DTS or WISD Output stage.
There can be multiple waves flowing through the job at the same time. You
cannot easily determine the number of such waves. It is dependent on the
number of stages and their combinability. The larger the job, the higher the
number of possible waves flowing through the job simultaneously.
Although originating from the exact same source, EOW markers might flow at
various speeds through different branches. This is dependent on the nature of
the stages along the data paths. In the example of Figure 16-6, the upper branch
might be slower because it involves a remote Web Service invocation, which
tends to be significantly slower than the standardization on the lower branch.
The need to restrict flow designs so that all data paths originate from the same
source restricts the types of jobs and stages that can be used. There is
one exception to this rule that is discussed later (consisting of data paths that
lead to the reference link of Normal Lookups).
Figure 16-7 presents one example of an invalid flow. The Funnel stage has two
inputs, one of them originating from the WISD input stage. The second input
(marked in red waves) originates from an Oracle Enterprise stage. The
semantics of this layout remain undefined and as a result it cannot be adopted as
a valid real-time construct. All data paths entering the Funnel stage must
originate from the same source. This example also applies to other Join stages,
such as Join, Merge, and ChangeCapture.
Figure 16-7 Invalid real-time data flow with multiple Source stages
Figure 16-8 presents another scenario, similar to the one in Figure 16-7 on
page 309. The Merge stage has an input that originates from an Oracle
Enterprise stage. As stated before, Merge and Join stages are perfectly legal
stages in real-time job flows. However, this is the case in which records from two
separate sources are being correlated by a common key. In other words, this
Merge stage is acting as a lookup.
By contrast, a flow in which the branch that does not originate from the WISD Input stage feeds the reference link of a Normal Lookup is valid, because the output from the Oracle Enterprise stage is read only once, during the startup phase. For
the duration of the entire job, the lookup table remains unchanged in memory.
Therefore, one can have branches originating from independent sources, as long
as those branches lead to reference links of Normal Lookups. Such branches
can have multiple stages along the way and do not necessarily have to attach the source stage directly to the reference link.
It is important to note that for Normal Lookups the in-memory reference table
does not change for the entire duration of the job. This means Normal Lookups
must only be used for static tables that fit comfortably in memory.
The same layout of Figure 16-8 on page 310 can be used for a Sparse Lookup, in
case the reference data changes frequently and the incoming records need to do
a Lookup against the always-on data in the target database.
Using a Normal Lookup in that same position, by contrast, is invalid because of the stage's two-phase nature. The lookup table is built only once, at stage startup. However, the flow requires
EOWs to flow throughout all branches. For the second EOW, the upper branch
remains stuck, as the Lookup stage does not consume any further records.
Therefore, this flow does not work.
These ISD job types are restarted for each request and might contain any data flow pattern, including those that are valid for batch scenarios. The only difference between them is that the first requires data flow branches to be synchronized before reaching the ISD Output. However, they can both have branches that originate independently of each other.
16.6 MQConnector/DTS
The solution for transactional processing of MQ messages was originally
implemented in DataStage 7.X as a combination of two stages:
MQRead
Unit-Of-Work
These stages provided guaranteed delivery from source MQ queues to target Oracle databases, using MQSeries as the transaction manager. Those stages were re-engineered in Information Server 8, and are now known as the MQConnector and DTS stages.
The main difference is that they now implement a pluggable solution, taking advantage of DataStage Connectors and supporting multiple types of database targets, such as DB2, Oracle, Teradata, ODBC, and MQ.
DataStage historically had a set of database stages, some of them from the
existing DS Server (referred to as Plug-in stages) as well as a set of native
parallel database operators (such as Oracle EE, DB2 EE, Teradata EE, and so
forth). IBM defined a new common database interface framework, known as
Connector stages, which is the foundation for a new set of database stages.
Information Server 8.1 supports a single database type as the target for DTS.
Information Server 8.5 supports different target types as part of the same unit of
work.
The job design must focus on obtaining a job flow that addresses business
requirements and, at the same time, complies with the constraints laid out in
16.5, “Job topologies” on page 307 and 16.6.4, “Design topology rules for DTS
jobs” on page 316.
There might be cases when there are concurrency conflicts that prevent multiple messages from flowing through the job at the same time. In other words, the job must somehow hold back messages from the source queue until the message that is currently being processed is committed to the database. This concurrency aspect is further elaborated in 16.6.11, “Database contention” on page 339 and 16.10, “Pipeline Parallelism challenges” on page 366. The job design tends to be independent from the runtime configuration.
Real-time jobs tend to be quite large, but that is the basic pattern to which they
tend to adhere.
The Column Import stage is sometimes replaced with an XMLInput stage, but the
result of the second step is always a set of records with which the rest of the DS
stages in the job can work.
The following rules from 16.5, “Job topologies” on page 307 must be obeyed:
All data paths must originate from the same source stage (the MQConnector).
The exception is reference links to Normal Lookups, which can originate from independent branches.
All data paths must lead to the same target DTS stage.
There might be dangling branches leading to Sequential File stages, for
instance, but these are to be used only for development and debugging. They
must not be relied on for production jobs.
What makes this possible is that the MQConnector and DTS stages interact with
each other by means of work queues.
Work queues are not visible on the DS canvas. They are referenced by means of
stage properties.
Figure: A local MQ transaction, under syncpoint control, moves each message from the source queue to the work queue. A distributed XA transaction then deletes the message(s) comprising a unit of work from the work queue and outputs the corresponding rows to the DB2 target.
The MQConnector transfers messages from the source to a work queue through
a local MQ transaction. It also forwards the payload to the next stage through its
output link. Typically, the next stage is a Column Import or XMLInput stage, to
convert the payload to DS records.
On the target side, the DTS receives records from transformation stages along
multiple data paths. There might have been Parallel Transformers, Sparse
Lookups, Joins, Merges, and so forth. Those rows are stored by DTS in internal
arrays.
Upon receiving an EOW marker (which was, again, originated from the source
MQConnector), the target DTS stage performs the following tasks:
1. Invokes the transaction manager (MQSeries) to begin a global transaction.
2. Executes the various SQL statements for the rows that arrived as part of that
wave according to a certain order.
3. Removes from the work queue one or more messages that were part of this
wave.
4. Requests the transaction manager to commit the transaction.
This is a global transaction that follows the XA protocol, involving MQSeries
and the target DBs.
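Conceptually, and with hypothetical table names (CUSTOMER, ACCOUNT) and the XA verbs shown only as comments (they are issued by the DTS stage through MQSeries as transaction manager, not coded by the job designer), one unit of work amounts to the following sequence:

   -- Sketch of one unit of work (one wave); illustrative only.
   -- xa_start(xid)                              -- 1. begin the global transaction
   INSERT INTO CUSTOMER (CUST_ID, NAME)          -- 2. SQL for the rows of this wave,
     VALUES (42, 'ACME CORP');                   --    in the configured link order
   UPDATE ACCOUNT
     SET BALANCE = BALANCE - 100.00
     WHERE ACCT_ID = 7;
   -- MQGET the wave's message(s) from the work queue (destructive read)  -- 3.
   -- xa_end(xid); xa_prepare(xid); xa_commit(xid)                        -- 4. two-phase commit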
Figure 16-12 illustrates how an MQ/DTS job fits in the Information Server
Framework.
Figure 16-12 shows the following components: the DS Director client, the IS Admin endpoint, the ASB adapter and ASB agent, the OSH conductor, the parallel job (section leaders and players), the target databases (DB2, Oracle, Teradata, ODBC), and the source, work, and target MQ queues. The legend distinguishes the path to activate a job, the runtime parent/child process relationships, and the runtime data path.
The illustration does not include the XMeta layer, which is not relevant for this
discussion. The XMeta database is accessed directly by the domain layer, mostly
at design time.
There are several other processes and modules, so the figure and this discussion could go down to a much finer level of detail. We chose the
elements that are relevant for the level of abstraction that is appropriate for this
discussion.
Job activation
To activate a job, the operator user opens DS Director, selects a job or sequence
for execution, and issues the execute command. The DS Director client sends
the request to the domain layer. Inside the domain layer, an administration
endpoint forwards the request to the Application Service Backbone (ASB) adapter, which passes it to the ASB agent; the request finally reaches the DSEngine (at this moment, there is a dsapi_slave process, which is the DSEngine process that is
activated on behalf of a client connection). Note that we do not provide any
discussion on how DSEngine connections are established.
The dsapi_slave becomes the parent of the OSH conductor that, in turn, is the
parent of all other parallel processes, including the MQConnector and DTS
operators.
The parent/child relationship between the DSEngine and the OSH processes is
denoted by the dotted green line.
Job execution
The activated processes establish communication channels among themselves; these channels are the virtual datasets.
The path followed by the actual data being processed is denoted by the small
dashes (red) line. The data flows entirely in the realm of parallel framework
processes. There is no participation by the domain layer (as opposed to
Information Services Director).
The reader should see the product documentation for a detailed description of all
stage and link properties for these two stages.
In our discussion, the pictures combine aspects of the job design, with the job
flow towards the top of the picture and stage and link properties at the bottom.
Associations between the job flow and stage/link properties are marked as blue
arrows. Comments on individual properties are marked in red. We believe this
provides for a more concise characterization of the properties that are otherwise
scattered throughout separate windows.
Linking by means of a queue manager and a work queue is not apparent on the
design canvas. It is, instead, expressed as stage and link properties.
For the DTS connector, the Queue Manager Name can be set either on the stage
or link properties (it supports a single output link).
At runtime, the work queue name might be appended with a partition number,
depending on the chosen runtime topology. This is discussed in 16.6.8, “Runtime
Topologies for DTS jobs” on page 326.
Figure 16-13 Queue manager and work queue information in Source and Target stages
The options in this example make the job wait indefinitely for messages to arrive
in a source queue SOURCEQ. Every message is written to the work queue as a
separate transaction. For each transaction, an EOW marker is sent downstream
through its output link.
Figure 16-15 presents a minimal set of columns that must be present in the
output link for the MQConnector stage.
The message payload must be sufficiently large to contain the largest expected
message. The APT_DEFAULT_TRANSPORT_BLOCK_SIZE environment
variable might have to be adjusted accordingly.
Each and every output row must have a field containing the MQ message ID
(DTS_msgID). For the MQConnector, this field can assume any name.
MQConnector knows it must assign the message ID to this field by means of the
data element property, which must be set to WSMQ.MSGID.
There are other MQ header field values that can be added to the record
definition, by selecting the appropriate type from the list of Data Element values.
Figure 16-17 depicts an example of input link metadata for the DTS.
For each topology, we discuss the corresponding stage and link properties.
No ordering, no relationships
Appendix A, “Runtime topologies for distributed transaction jobs” on page 375
presents the ideal case, from a scalability standpoint, which is illustrated in
Figure 16-18 on page 328. For as long as there are no ordering constraints and
no relationships between the incoming messages, the job can scale across
multiple partitions. There are multiple MQConnectors and multiple DTS
instances.
All MQConnectors read messages off the same source queue. However, they
transfer messages to work queues that are dedicated to each partition.
Upon job restart after failure, the job must be restarted with the exact same
number of nodes in the config file. Messages are either reprocessed out of the
private work queues, or read from the source queue.
Figure: messages 1 through 4 on the source queue (SQ) are split across two MQConnector/DTS pairs; messages 1 and 3 flow through work queue WQ.0 and messages 2 and 4 through work queue WQ.1.
Figure 16-19 on page 329 indicates the properties that must be set in a parallel job to enable the topology of Figure 16-18:
APT_CONFIG_FILE must point to a config file containing more than one node in the default node pool.
The MQConnector and DTS stages must point to the same work queue name
(WORKQ).
The target DTS stage must have the “Append node number” option set to Yes.
Strict ordering
This is the typical case when transferring data from operational mainframe-based banking applications to reporting databases. Database transactions must be
executed in a strict order. Otherwise, the result in the target reporting database
does not make any sense when compared to the original source.
When that is the case, adopt the topology of Figure 16-20. There is a single node
in the configuration file and as a consequence, there is a single work queue.
Figure 16-21 on page 331 presents the properties to enable the topology of
Figure 16-20:
APT_CONFIG_FILE must point to a config file containing a single node in the default node pool.
The MQConnector and DTS stages must point to the same work queue name (WORKQ).
The target DTS stage must have the option Append node number set to No.
Even though the job execution is restricted to a single partition, such applications
can still benefit from pipeline parallelism. That is because the processing is
broken down into separate processes. That is one of the major advantages of the
parallel framework.
The job designers use the same DS constructs to express logic that meets
business rules regardless of partitioning. At runtime, the framework still makes
seamless use of multiple processes, scaling much better than an equivalent
single-threaded implementation, which is typically the case with J2EE based
applications.
Messages with the same hash key are still processed in the same order in which
they are posted on the source queue.
The SQL statements for all rows sent to the first input link are processed, then the SQL statements for all rows sent to the second input link, and so on. The order of processing of the records from the input links is highlighted in the box in Figure 16-24, which is pointed to with an arrow.
All SQL statements are processed as part of the same transaction. Input links for
parent tables must be listed first, followed by the links for children, or dependent
tables, so the referential integrity in the target database is maintained.
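For example, with hypothetical tables CUSTOMER (parent) and CUST_ORDER (child, holding a foreign key to CUSTOMER), the statements issued within one wave are ordered as follows:

   -- Parent-table link listed first, so its rows are inserted before the dependent
   -- rows, all inside the same transaction.
   INSERT INTO CUSTOMER   (CUST_ID, NAME)             VALUES (42, 'ACME CORP');
   INSERT INTO CUST_ORDER (ORDER_ID, CUST_ID, AMOUNT) VALUES (9001, 42, 150.00);
   INSERT INTO CUST_ORDER (ORDER_ID, CUST_ID, AMOUNT) VALUES (9002, 42, 75.00);
   COMMIT;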
There might be cases when this processing order is not adequate. The designer might choose to adopt a different order: instead of processing all records on a link-by-link basis, records can be processed in the order in which they arrived. The DTS stage properties that cause records to be processed in this alternative order are highlighted in the box shown in Figure 16-25. The order is maintained across all input links: the DTS stage processes records from different input links according to the order of the DTS_SeqNum values.
Figure 16-26 shows how you can enable the use of a reject queue in the DTS stage. The DTS stage properties that send messages for failed units of work to a
reject queue REJECTQ are highlighted by the box in Figure 16-26. You have the
option to stop the job upon the first error.
In the event of a failure or rejection of a unit of work, the unit is rolled back so that
no data is written to the targets. Additionally:
The source messages might be as follows:
– Moved to a reject queue (the default behavior)
– Left on the source queue
The job might be aborted after a specified number of units of work have been
rejected.
int32 The link number containing the failure. If there are multiple, only the link
number for the first occurrence is listed.
12 12 string The error code from the resource (if the status of the unit of work was 2)
24 236 string The error code from the resource (if the Status is 2), or the text message
from the DTS_Message column (if the Status is 4). In the latter case, this
contains the text from the first DTS_Message column encountered for this
message that is not null or an empty string.
270 242 string The text message from the DTS_Message column. This contains the text
from the first DTS_Message column encountered for this message that is
not null or an empty string.
If there are multiple failures, then the status reflects the status of the first
detected error. The count field provides a count for the number of rows in the
message that failed.
Job-induced reject
The job RejectedByJobLogic demonstrates one way to reject a transaction by
detection of an error condition, such as an error in the data. The job looks like
Figure 16-27.
There are three links to the DTS. Two of the links are to insert or update DB2
tables, and the third is purely for sending reject signals to the stage.
Real-time jobs cannot afford the luxury of minimizing the interface with target
databases. Lookups, for instance, tend to be Sparse (the exception being
lookups against static tables). Transactions must be successfully committed for
every wave.
DTS jobs are tightly coupled with databases. There are multiple queries and
update/insert/delete statements being executed concurrently, for separate
waves, and for records that are part of the same wave.
These two challenges pertain to both ISD and MQ/DTS jobs. Key collisions are
discussed in 16.10, “Pipeline Parallelism challenges” on page 366.
Although lock-related exceptions can occur in both ISD and MQ/DTS jobs, they
are more pronounced in the latter.
Figure 16-29 presents an example of a DTS job that upserts records into three
separate tables. One might think that one upsert does not affect the others and that the job will work perfectly well.
This might be true in a development and test environment, with only a few
messages submitted to the source queue at a time. Things go pretty well, up until
several more messages are submitted to that same source queue.
This type of problem tends to be more pronounced when testing real-time jobs
against empty or small tables. In this case, most of the activity is going against
the same table and index pages, increasing the chance of contention. As there
are more records being fed into the pipeline by the source MQConnector, these
exceptions increase in number.
These exceptions are further aggravated when running the job with multiple
partitions. They even occur with a Strict Ordering topology (See “Strict ordering”
on page 330).
Setting the work queue depth to 1 and running the job on a single partition might
be done as a starting point. The queue depth and the number of partitions can
then be increased, as long as the lock-related exceptions do not occur.
There are a number of aspects related to the database that require close attention by the DBAs:
Make sure the SQL plans are fully optimized and take advantage of indices.
Avoid sequential scans in SQL plans.
Make sure the locking is set at the record level, not at the page level (a sketch follows this list).
Enable periodic execution of update statistics.
DBAs must closely monitor database statistics and query plans.
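As a hypothetical example of the record-level locking recommendation above (the table name is illustrative, and the exact statement depends on the database product), on DB2 this can be done as follows:

   -- Switch a hypothetical target table to row-level locking.
   ALTER TABLE TGT_ACCOUNT LOCKSIZE ROW;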
The tuning factor that has the most impact is the length of the waves. Having more than one message in a single UOW is beneficial because each two-phase commit has a significant overhead. A significant improvement can be seen as the length of the wave is increased by adding more messages.
So-called mini-batches are to be avoided as well. They are not even considered a valid real-time pattern, which is one of the reasons why they are not listed in this section. The topic of mini-batches deserves a discussion of its own, and is presented in 16.2, “Mini-batch approach” on page 297.
One example is the case of a retail bank that implemented a DS job to transfer
data from a source mainframe to a target system, using MQ Series as the
standard middleware interface. This is depicted in Figure 16-32. DataStage jobs
pick incoming messages, apply transformations, and deliver results to the target
queue.
This is the scenario depicted in Figure 16-33 on page 345. The real-time DS
logic is broken into multiple smaller steps.
In the case of real-time processes, the net effect, in addition to higher overhead,
is a longer lag time until the DataStage results are delivered to the target queue
for consumption by the target system.
Real-time jobs have a different profile of use of OS resources than batch jobs.
Batch jobs move and transform large amounts of data at once, incurring much
higher memory and disk usage.
Although real-time jobs normally have a larger number of stages (it is common to
see hundreds of them in a single MQ/DTS or ISD flow), they tend to deal with
smaller input data volumes. The overall memory usage for the data and heap
segment of parallel operators, as well as disk use, tends to be smaller.
The goal of real-time jobs from a performance standpoint must be delivering
results to the targets (result queues, target database tables, ISD outputs) as
quickly as possible with minimal lag time.
The approach depicted in Figure 16-33 works against that goal.
ISD is notable for the way it simplifies the exposure of DS jobs as SOA services,
letting users bypass the underlying complexities of creating J2EE services for the
various binding types.
ISD controls the invocation of those services, supporting request queuing and
load balancing across multiple service providers (a DataStage Engine is one of
the supported provider types).
A single DataStage job can be deployed as different service types, and can
retain a single dataflow design. DS jobs exposed as ISD services are referred to
as “ISD Jobs” throughout this section.
Figure 16-34 on page 347 is reproduced from the ISD manual and depicts its
major components. The top half of the diagram shows components that execute
inside the WebSphere Application Server on which the Information Server
Domain layer runs. Information Server and ISD are types of J2EE applications
that can be executed on top of J2EE containers.
The bottom half of Figure 16-34 on page 347 presents components that belong
to the Engine Layer, which can reside either on the same host as the Domain, or
on separate hosts.
Each WISD Endpoint relates to one of the possible bindings: SOAP, JMS, or
EJB. Such endpoints are part of the J2EE applications that are seamlessly
installed on the Domain WebSphere Application Server when an ISD application
is successfully deployed by means of the Information Server Console.
The endpoints forward incoming requests to the ASB adapter, which provides for
load balancing and interfacing with multiple services providers. Load balancing is
another important concept in SOA applications. In this context it means the
spreading of incoming requests across multiple DataStage engines.
For ISD applications, there is a direct mapping between a service request and a
wave. For each and every service request, an end-of-wave marker is generated
(See 16.4.2, “End-of-wave” on page 300).
Once an ISD job is compiled and ready, the ISD developer creates an operation
for that job using the Information Server Console. That operation is created as
part of a service, which is an element of an ISD application.
Once the ISD operations, services, and application are ready, the Information
Server Console can be used to deploy that application, which results in the
installation of a J2EE application on the WebSphere Application Server instance.
The deployment results in the activation of one or more job instances in the
corresponding service provider, namely the DS engines that participate in this
deployment.
The DS engine, in turn, spawns one or more parallel job instances. A parallel job
is started by means of an OSH process (the conductor). This process performs
the parsing of the OSH script that the ISD job flow was compiled into and
launches the multiple section leaders and players that actually implement the
runtime version of the job. This is represented by the dashed (green) arrows.
Incoming requests follow the path depicted with the red arrows. They originate
from remote or local applications, such as SOAP or EJB clients, and even
messages posted onto JMS queues. There is one endpoint for each type of
binding for each operation.
All endpoints forward requests to the local ASB Adapter. The ASB adapter
forwards a request to one of the participating engines according to a load
balancing algorithm. The request reaches the remote ASB agent, which puts the
request in the pipeline for the specific job instance.
The response flows back to the caller through the same components that the request originally traversed. As opposed to the MQ/DTS solution, the WebSphere Application Server actively participates in the processing of requests.
For these aspects, the reader should see the Information Services Director
product manual.
For tutorials on SOA, Web Services and J2EE there are countless resources
available in books and on the web.
The reason we exclude ISD job topologies that are not of always-on type is
because those types of jobs should follow the recommendations outlined in
Chapter 15, “Batch data flow design” on page 259.
Figure: incoming requests from an HTTP application server are load balanced by ISD across DS engines; each DS engine's ASB agent dispatches requests to its pool of job instances.
The first task is to make sure the logic is efficient, which includes, among other
things, making sure database transactions, transformations and the entire job
flow are optimally designed.
Once the job design is tuned, assess the maximum number of requests that a single job instance can handle. This is a function of the job complexity and the number of requests (in other words, EOW markers) that can flow through the job.
However, there might be cases when one reaches the maximum number of
requests a single ASB agent can handle. This means the limit of a single DS
engine has been reached. This can be verified when no matter how many job
instances are instantiated the engine cannot handle more simultaneous
requests. If this is the case, add more DS Engines either on the same host (if
there is enough spare capacity) or on separate hosts.
Keep in mind that throughout this tuning exercise, the assumption is that there
are enough hardware resources. Increasing the number of job instances and DS
Engines does not help if the CPUs, disks, and network are already saturated.
In an always-on job, the database stages must complete the SQL statements
before the response is returned to the caller (that is, before the result is sent to
the WISD Output stage).
Pre-8.5 jobs required a technique, illustrated in Figure 16-37, that involves using a sequence of stages connected to the reject link of a standard database stage. A Column Generator creates a new column, which is used as the aggregation key in the subsequent Aggregator. The output from the Aggregator and the result from the main job logic (depicted as a local container) are synchronized with a Join stage.
The Join stage guarantees that the response is sent only after the database statements for the wave complete (either successfully or with errors).
Again, that is what had to be done in pre 8.5 releases. In IS 8.5, the database
Connectors are substantially enhanced to support multiple input links and output
links that can forward not only rejected rows, but also processed rows.
There are two alternatives for always-on 8.5 jobs when it comes to database
operations:
DTS
Database connector stages
The DTS stage supports an output link, whose table definition can be found in
category Table Definitions/Database/Distributed Transaction in the DS repository
tree.
When used in ISD jobs, the Use MQ Messaging DTS property must be set to
NO. Note that although source and work queues are not present, MQ Series
must still be installed and available locally, because it acts as the XA transaction
manager.
DTS must be used in ISD jobs only when there are multiple target database
types and multiple target database instances. If all SQL statements are to be
executed on the same target database instance, a database connector must be
used instead. This is discussed in the following section.
For connector stages, an output link carries the input link data plus an optional
error code and error message.
If configured to output successful rows to the reject link, each output record
represents one incoming row to the stage. Output links were already supported
by the Teradata Connector in IS 8.0.1, although that connector was still restricted
to a single input link.
Figure 16-39 on page 358 shows an example in which there are multiple input
links to a DB2 Connector (units of work with Connector stage in an Information
Services Director Job). All SQL statements for all input links are executed and
committed as part of a single transaction, for each and every wave. An EOW
marker triggers the commit.
All SQL statements must be executed against the same target database
instance. If more than one target database instance is involved (of the same or
different types), then the DTS stage must be used instead.
Running a job with multiple partitions provides a way of scaling a single job instance. This is a key advantage for complex services that receive
request payloads of an arbitrary size and, as part of the processing of those
requests, require the correlation of large amounts of data originating from sparse
lookups.
One such example is an application developed for an insurance company for the
approval of insurance policies. Jobs are of great complexity, with several
Lookups and lots of transformation logic. That type of job design benefits from
intra-job partitioning, instead of executing the entire job logic in a single partition.
With multiple partitions, there are multiple instances of stages along the way, which significantly cuts the response time for such complex services.
Helpful hint: Use node pools in the configuration file to restrict the default
node pool to one node, and have other node pools available so that stages
can run in parallel for the conditions described.
Test your jobs fully without having them deployed as services. Use Flat File
Source to Flat File Target to perform QA on the logic and processing of the
job, or alternatively, RowGenerator (or in server, a leading Transformer with
Variable and Constraint stages). Replace the row generation stage and flat
files with WISDInput and WISDOutput once tested and verified.
Consider using shared containers for critical logic that is shared among classic-style batch jobs and always-on jobs. Though not always possible because of logic constraints, this offers the opportunity to keep the core logic in one place for simpler maintenance, and to alter only the sources and targets.
Beware of binary data in character columns. Binary data, especially binary
00’s, is incompatible with SOAP processing and might cause problems for the
ISD Server, the eventual client, or both.
Beware of DS Parallel settings that pad character fields with NULLs. This
usually only happens with fixed length columns. Varchar is a solution, as is
setting $APT_STRING_PADCHAR to a single blank (no quotes around the
blank).
When performing INSERTs to relational databases, be safe and set array and
transaction sizes to 1 (one). Otherwise you might not see the new key values
immediately, or until the job is complete.
Do not expect values written to flat file and other targets to be immediately
available or visible. Use a transaction size of 1 for such things.
The text bindings (Text over JMS and Text over HTTP) exist primarily for always-on jobs. Because they have no formal metadata definition for their payload, they work best for single-column input and output (XML is a common choice for such a payload).
The following scenarios justify the publication of DS Server jobs as ISD services:
Publishing existing DS Server jobs;
Invocation of DS Sequences
The obvious difference is that the first is for interoperability with JMS, and the second is for MQ.
One might use ISD with JMS for the processing of MQ Series messages, or MQ/DTS for the processing of JMS messages, by setting up a bridge between MQ and JMS by means of WebSphere ESB capabilities. However, setting up such a bridge between JMS and MQ is relatively complex.
We put both solutions side-by-side in a diagram, Figure 16-40 on page 363. The goal of this illustration is to compare how transactions are handled and the paths along which the data flows.
Both ISD and MQ/DTS jobs are parallel jobs, composed of processes that
implement a pipeline, possibly with multiple partitions. The parent/child
relationships between OS processes are represented by the dotted green lines.
The path followed by the actual data is represented by solid (red) lines.
ISD jobs deployed with JMS bindings have the active participation of the
WebSphere Application Server and the ASB agent, whereas in an MQ/DTS job,
the data flow is restricted to the parallel framework processes.
However, there is one additional transaction context on the JMS side, managed by EJBs in the WAS J2EE container as JTA transactions.
JTA transactions make sure no messages are lost. If any component along the way (WAS, the ASB agent, or the parallel job) fails during the processing of an incoming message before a response is placed on the response queue, the message is rolled back to the queue and redelivered.
This means database transactions in ISD jobs exposed as JMS services must be
idempotent. For the same input data, they must yield the same result on the
target DB.
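One common way to achieve this, sketched here against a hypothetical DB2 target table TGT_CUSTOMER keyed on CUST_ID, is to code the statements as upserts so that redelivery of the same message produces the same end state:

   -- Re-executing this statement for the same input row leaves the table unchanged,
   -- which is what makes a redelivered JMS message harmless.
   MERGE INTO TGT_CUSTOMER T
   USING (VALUES (42, 'ACME CORP')) AS S (CUST_ID, NAME)
      ON T.CUST_ID = S.CUST_ID
   WHEN MATCHED THEN
      UPDATE SET NAME = S.NAME
   WHEN NOT MATCHED THEN
      INSERT (CUST_ID, NAME) VALUES (S.CUST_ID, S.NAME);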
In summary, the strengths of the MQ/DTS and ISD/JMS solutions are as follows:
MQ/DTS
Guaranteed delivery from a source queue to a target database.
ISD/JMS
Adequate for request/response scenarios when JMS queues are the delivery
mechanism.
There are a few important aspects that need attention when taking advantage of
ISD with JMS bindings:
The retry attempt when JTA transactions are enabled is largely dependent
upon the JMS provider. In WebSphere 6, with its embedded JMS support, the
default is five attempts (this number is configurable). After five attempts the
message is considered a poison message and goes into the dead letter
queue.
One problem is that there are subtle differences in all of this from provider to
provider. It becomes a question of exactly how things operate when MQ is the
provider, or when the embedded JMS in WAS is the provider.
The JMS binding also creates a spectrum of other issues because of the pool of EJBs. Multiple queue listeners can result in messages ending up out of order, and a flood of concurrent clients going into ISD can overwhelm the number of job instances that you have established for DataStage.
If the input payload is too large, you might run into memory issues. The
environment variable APT_DEFAULT_TRANSPORT_BLOCK_SIZE must be
tuned accordingly.
The nature of the parallel framework introduces interesting challenges that are the focus of this section. This discussion applies to both MQ/DTS and ISD jobs.
The challenges result from pipeline parallelism, in which there are multiple
processes acting on a dataflow pipeline concurrently. The most notable
challenge is key collision, for which a solution is described in the next section.
Figure 16-41 illustrates two contiguous records (R1 and R2), containing the
same business key (“AA”) for which either an existing surrogate (SK) must be
returned from the database, or a new SK must be created if such SK does not
exist yet.
R1 arrives and goes through the first lookup first. An SK is not found, so it is
forwarded to the Generate New SK stage. This one invokes a NextVal function in
the target database to obtain a new SK value that is from now on associated with
business key AA. The Generate New SK stage returns a new value, SK1, which
is now propagated downstream. This value, though, is not saved in the target
database, so for all purposes, AA’s surrogate is not SK1 in the database yet.
R2 arrives immediately next. It also goes through the same initial lookup, and
because AA is not saved yet in the target database, it is also forwarded to the
Generate New SK stage.
In Figure 16-42, you can see that both records were sent downstream to the
Insert New SK stage. R2 has been assigned a surrogate key value of SK2.
However, only the first one is inserted successfully. The second one violates a
unique key constraint and therefore from now on, R2 becomes invalid.
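The race can be shown directly in SQL; SK_MAP is a hypothetical business-key-to-surrogate-key table, and the two interleaved sessions stand in for the two pipelined waves:

   -- Session 1 carries R1, session 2 carries R2; the calls interleave as follows.
   SELECT SK FROM SK_MAP WHERE BUSKEY = 'AA';          -- R1: no row found
   SELECT SK FROM SK_MAP WHERE BUSKEY = 'AA';          -- R2: still no row found
   INSERT INTO SK_MAP (BUSKEY, SK) VALUES ('AA', 101); -- R1: succeeds (SK1)
   INSERT INTO SK_MAP (BUSKEY, SK) VALUES ('AA', 102); -- R2: unique key violation (SK2)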
Any solution involving caching or a global lookup table in DataStage would not solve this problem; it would only further complicate things. The lookup must always be done against the target database. At a minimum, a DataStage cache would introduce the problem of keeping that cache in sync with the target database, so this approach is not suitable.
Figure: a single DS job with a source MQ queue as input, a parse/switch step, DB2 access, and log outputs (shown as a logical dependency).
The UDF returns a column named SK, which is propagated downstream. This
column is defined in the output link for the DB2 Connector stage.
DB2 functions
The code for the sample DB2 UDF is listed in Figure 16-45. The function is
defined as ATOMIC.
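That listing is not reproduced here; the following is a minimal sketch of the same idea in DB2 inline SQL PL, with hypothetical names (GET_SK, SK_MAP, SK_SEQ) rather than the ones used in the figure:

   -- Get-or-create surrogate key function; the atomic body performs the lookup and,
   -- if needed, the insert of a newly generated key as one indivisible unit.
   CREATE FUNCTION GET_SK (P_BUSKEY VARCHAR(32))
     RETURNS BIGINT
     LANGUAGE SQL
     MODIFIES SQL DATA
   BEGIN ATOMIC
     DECLARE V_SK BIGINT;
     SET V_SK = (SELECT SK FROM SK_MAP WHERE BUSKEY = P_BUSKEY);
     IF V_SK IS NULL THEN
       SET V_SK = NEXT VALUE FOR SK_SEQ;
       INSERT INTO SK_MAP (BUSKEY, SK) VALUES (P_BUSKEY, V_SK);
     END IF;
     RETURN V_SK;
   END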
The syntax to invoke this Oracle stored function is shown in Figure 16-48.
The assumption is that the SKs are not generated automatically at the time of
insert. Instead, the SK can be obtained by invoking a stubbing UDF to obtain the
surrogate key for the parent table. This SK value is propagated downstream, in
both parent and children links. The DTS or target database Connector stage then
commits the records as a single unit of work.
Also, the JMSPlug-in runs entirely outside of a J2EE container, making the
interoperability with the third-party JMS provider straightforward. This is
illustrated in Figure 16-49 on page 373.
However, because the source and target JMSPlug-in stages run on standalone JVMs, outside of J2EE containers, there is no support for JTA transactions. Messages are never lost, but they might be re-processed in the event of a restart after failure.
This solution is described in the document Integrating JMS and DataStage Jobs
with the JMSPlug-in available from IBM Information Management Professional
Services.
Instead of having a fixed data source attached to the reference link, perform a
Lookup based upon your high level blocking factors in the incoming WISD
request.
This involves using a Copy stage that splits the incoming row, sending one row to the primary input as before while sending the other row to a Lookup where multiple reference candidates can be dynamically retrieved.
Be sure to make this a Sparse Lookup if you expect that the source data could
change when the WISD job is on (enabled), and check it carefully to be sure you
have set it to return multiple rows.
In the diagrams in this section the numbers under the queues represent
message sequence numbers and illustrate how messages might be distributed
across queues. Letters represent hash partitioning key fields. The solid arrows show the movement of MQ messages to and from queues. The dashed lines represent the data records flowing through the parallel job.
Multiple work queues also aid performance, because there is no contention for work queues, and the DTS stage is more likely to find its message at the head of the queue.
In this scenario, there is a single work queue, because the MQ Connector cannot
determine which node is targeted for a specific message. By contrast, the
MQRead operator is implemented as a combinable operator, and can be
combined with the hash partitioner. This permits MQRead to learn to which
partition the message is sent, and can therefore direct the message to the
appropriate work queue if multiple work queues are used. MQRead can therefore
use multiple work queues, but the MQ Connector cannot.
The revised topologies are depicted in Figure A-5 and Figure A-6 on page 380.
When no ordering is required, it is as depicted in Figure A-5.
One advantage of running without a work queue is that restart after failure is
simpler. The job can be restarted, and it continues from where it was aborted.
Additionally, evidence with the MQRead operator indicates that reading from a
source queue and writing to a work queue under sync point control for a small
transaction size (small number of messages) is an expensive operation. By
omitting the need to write to a work queue, the overall performance is improved.
There are dangers in this approach however. Prior work with MQ and
WebSphere TX determined two scenarios where source messages can be
missed due to the message cursor not detecting messages:
If multiple processes are writing to the source queue, the queue browser might
miss a message if the PUT and COMMIT calls from these processes are
interspersed in a certain order.
If the writing processes use message priorities, the queue browser does not see
messages of a higher priority, as they jump ahead of the current cursor position.
The solution offers support for all of these scenarios. It is the responsibility of the
job designer to select the appropriate settings to configure the stages to enable a
particular scenario.
Directory Structures
IBM InfoSphere DataStage requires file systems to be available for the following
elements:
Software Install Directory
IBM InfoSphere DataStage executables, libraries, and pre-built components
DataStage Project Directory
Runtime information (compiled jobs, OSH scripts, generated BuildOps and
Transformers, logging info);
Data Storage
– DataStage temporary storage: Scratch, temp, buffer
– DataStage parallel dataset segment files
– Staging and Archival storage for any source files
By default, these directories (except for file staging) are created during
installation as subdirectories under the base InfoSphere DataStage installation
directory.
In addition to the file systems listed, a DataStage project also requires a proper
amount of space in the Metadata layer (which is a relational database system).
As opposed to the Project directory, the Metadata layer stores the design time
information, including job and sequence designs, and table definitions.
This section does not include requirements and recommendations for other Information Server layers (Metadata and Services). The discussion here is strictly in terms of the Engine layer.
All DataStage jobs must be documented with a short description field, as well as
with annotation fields. See 3.3, “Documentation and annotation” on page 47.
Job parameters must be used for file paths, file names, and database login
settings.
Use default type conversions using the Copy stage or across the Output
mapping tab of other stages.
Using the above objectives as a guide, the following methodology can be applied:
Start with Auto partitioning (the default).
Specify Hash partitioning for stages that require groups of related records.
– Specify only the key columns that are necessary for correct grouping as
long as the number of unique values is sufficient.
– Use Modulus partitioning if the grouping is on a single integer key column.
– Use Range partitioning if the data is highly skewed and the key column
values and distribution do not change significantly over time (Range Map
can be reused).
Across jobs, persistent datasets can be used to retain the partitioning and sort
order. This is particularly useful if downstream jobs are run with the same degree
of parallelism (configuration file) and require the same partition and sort order.
The Lookup stage is most appropriate when reference data is small enough to fit
into available memory. If the datasets are larger than available memory
resources, use the Join or Merge stage. See 10.1, “Lookup versus Join versus
Merge” on page 150.
Be particularly careful to observe the nullability properties for input links to any
form of Outer Join. Even if the source data is not nullable, the non-key columns
must be defined as nullable in the Join stage input to identify unmatched records.
See 10.2, “Capturing unmatched records from a Join” on page 150.
Use Hash method Aggregators only when the number of distinct key column
values is small. A Sort method Aggregator must be used when the number of
distinct key values is large or unknown.
The ODBC Enterprise stage can only be used when a native parallel stage is not
available for the given source or target database.
If possible, use a SQL where clause to limit the number of rows sent to a
DataStage job.
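For example, with a hypothetical source table SRC_CUSTOMER, the filter can be pushed into the database rather than applied inside the job:

   -- Only the rows changed in the last day are sent to the DataStage job.
   SELECT CUST_ID, NAME, LAST_UPDATED
   FROM   SRC_CUSTOMER
   WHERE  LAST_UPDATED >= CURRENT TIMESTAMP - 1 DAY;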
Check the Director log for warnings, which might indicate an underlying problem
or data type conversion issue. All warnings and failures must be addressed (and
removed if possible) before deploying a DataStage job.
Development Dev_<proj>
Production Prod_<proj>
BuildOp BdOp<name>
Wrapper Wrap<name>
Load Load<job>
Sequence <job>_Seq
Parameter <name>_parm
Notify Notify
Input In
Output Out
Delete Del
Insert Ins
Update Upd
Database DB
Stored Procedure SP
Table Tbl
View View
Dimension Dim
Fact Fact
Source Src
Target Tgt
Head Head
Peek Peek
Sample Smpl
Tail Tail
Sequential File SF
File Set FS
Parallel dataset DS
Aggregator Agg
Copy Cp
Filter Filt
Funnel Funl
Lookup Lkp
Merge Mrg
Modify Mod
Pivot Pivt
Sort Srt
Switch Swch
Stage Variable SV
At runtime, the DS parallel framework uses the given job design and
configuration file to compose a job score that details the processes created,
degree of parallelism and node (server) assignments, and interconnects
(datasets) between them. Similar to the way a parallel database optimizer builds
a query plan, the parallel job score performs the following tasks:
Identifies degree of parallelism and node assignments for each operator
Details mappings between functional (stage/operator) and actual operating
system processes
Includes operators automatically inserted at runtime:
– Buffer operators to prevent deadlocks and optimize data flow rates
between stages
– Sorts and Partitioners that have been automatically inserted to ensure
correct results
As shown in Figure E-1, job score entries start with the phrase “main_program: This step has n datasets”. Two separate scores are written to the log for each job run. The first score is from the license operator, not the actual job, and can be ignored. The second score entry is the actual job score.
The number of virtual datasets and the degree of parallelism determines the
amount of memory used by the inter-operator transport buffers. The memory
used by deadlock-prevention BufferOps can be calculated based on the number
of inserted BufferOps.
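As a rough, hedged estimate (the exact accounting depends on the framework release and on buffer-related settings; the defaults quoted here are assumptions to be verified against your installation), let N_ds be the number of virtual datasets, P the degree of parallelism, B the transport block size ($APT_DEFAULT_TRANSPORT_BLOCK_SIZE, typically 128 KB by default), N_buf the number of inserted buffer operators, and M_buf the per-buffer memory limit ($APT_BUFFER_MAXIMUM_MEMORY, typically 3 MB by default). Then:

\[
M_{\text{transport}} \approx N_{ds} \times P \times B,
\qquad
M_{\text{buffer}} \approx N_{buf} \times P \times M_{buf}
\]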
Datasets are identified in the first section of the parallel job score, with each
dataset identified by its number (starting at zero). In this example, the first
dataset is identified as “ds0”, and the next “ds1”.
The degree of parallelism is identified in brackets after the operator name. For
example, operator zero (op0) is running sequentially, with one degree of
parallelism [1p]. Operator 1 (op1) is running in parallel with four degrees of
parallelism [4p].
Each dataset entry in the job score lists a producer, the partitioner and collector used between them, and a consumer.
The notation between producer and consumer is used to report the type of
partitioning or collecting (if any) that is applied. The partition type is associated
with the first term, collector type with the second. The symbol between the
partition name and collector name indicates the partition type and consumer. A
list of the symbols and their description is shown in Table E-1.
Finally, if the Preserve Partitioning flag has been set for a particular dataset, the
notation “[pp]” appears in this section of the job score.
The lower portion of the parallel job score details the mapping between stages
and actual processes generated at runtime. For each operator, this includes (as
illustrated in the job score fragment) the following elements:
Operator name (opn) numbered sequentially from zero (example “op0”)
Degree of parallelism in brackets (example “[4p]”)
Sequential or parallel execution mode
Components of the operator, which have the following characteristics:
– Typically correspond to the user-specified stage name in the Designer
canvas
– Can include combined operators (APT_CombinedOperatorController),
which include logic from multiple stages in a single operator
– Can include framework-inserted operators such as Buffers, Sorts
– Can include composite operators (for example, Lookup)
Using this information together with the output from the $APT_PM_SHOW_PIDS
environment variable, you can evaluate the memory used by a lookup. Because
the entire structure needs to be loaded before actual lookup processing can
begin, you can also determine the delay associated with loading the lookup
structure.
Integers: 4 bytes
Float: 8 bytes
Time: 4 bytes (8 bytes with microsecond resolution)
Date: 4 bytes
Timestamp: 8 bytes (12 bytes with microsecond resolution)
For the overall record width, calculate and add the following values:
(# nullable fields)/8 for null indicators
one byte per column for field alignment (worst case is 3.5 bytes per field)
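Expressed as a formula, with w_i the per-column sizes from the list above, n_null the number of nullable fields, n_cols the number of columns, and b_align the per-column alignment overhead:

\[
W_{\text{record}} \;\approx\; \sum_i w_i \;+\; \left\lceil \frac{n_{\text{null}}}{8} \right\rceil \;+\; n_{\text{cols}} \times b_{\text{align}},
\qquad 1 \le b_{\text{align}} \le 3.5
\]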
The environment variable settings in this appendix are only examples. Set values that are optimal for your environment.
$APT_EXPORT_FLUSH_COUNT [nrows]: Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS (DataStage v7.01 and later): Setting this environment variable directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.
$APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL [set]: When set, allows a zero-length null_field value with fixed-length fields. Use this with care, as poorly formatted data causes incorrect results. By default, a zero-length null_field value causes an error.
$APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE [Kbytes]: Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. The default is 128 (128 KB), with a minimum of 8. Increasing these values on heavily loaded file servers can improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE [bytes]: In certain disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE [bytes]: Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. This setting must be set to a lower value when reading from streaming inputs (for example, socket or FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE [bytes]: By default, Sequential File (import) reads ahead 500 bytes to get the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound, which is, by default, 100,000 bytes. When more than 500 bytes of read-ahead is desired, use this variable instead of APT_DELIMITED_READ_SIZE.
$APT_IMPORT_PATTERN_USES_FILESET [set]: When this environment variable is set (present in the environment), file pattern reads are done in parallel by dynamically building a File Set header based on the list of files that match the given expression. For disk configurations with multiple controllers and disks, this significantly improves file pattern reads.
$APT_PHYSICAL_DATASET_BLOCK_SIZE [bytes]: Specifies the size, in bytes, of the unit of data set I/O. Dataset segment files are written in chunks of this size. The default is 128 KB (131,072).
$APT_DBNAME [database]: Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the Use Database Environment Variable option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.
$APT_ORACLE_LOAD_DELIMITED [char] (DataStage 7.01 and later): Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM: When set, a target Oracle stage with the Load method limits the number of players to the number of data files in the table’s table space.
$APT_TERA_SYNC_DATABASE [name]: Starting with V7, specifies the database used for the terasync table.
$APT_TERA_SYNC_USER [user]: Starting with V7, specifies the user that creates and writes to the terasync table.
$APT_MONITOR_TIME [seconds]: In V7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE.
Unknown, Char, VarChar, LongVarChar: string, 1 byte per character; ASCII character string of fixed or variable length (Unicode Extended option NOT selected).
Timestamp: timestamp, 9 bytes; single field containing both a date and a time value (with resolution to microseconds when the microseconds Extended option is specified).
a. BigInt values map to long long integers on all supported platforms except Tru64 where they map to longer
integer values.
String data represents unmapped bytes, ustring data represents full Unicode
(UTF-16) data.
The Char, VarChar, and LongVarChar SQL types relate to underlying string types
where each character is 8-bits and does not require mapping because it
represents an ASCII character. You can, however, specify that these data types
are extended, in which case they are taken as ustrings and require mapping.
(They are specified as such by selecting the “Extended” check box for the column
in the Edit Meta Data dialog box.) An Extended field appears in the columns grid,
and extended Char, VarChar, or LongVarChar columns have Unicode in this field.
The NChar, NVarChar, and LongNVarChar types relate to underlying ustring
types, so do not need to be explicitly extended.
The conversion matrix lists source field types against target field types. In each cell, d indicates that there is a default type conversion from the source field type to the destination field type; e indicates that you can use a Modify or a Transformer conversion function to explicitly convert from the source field type to the destination field type; a blank cell indicates that no conversion is provided. The types covered are int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, and timestamp; the individual cell values of the matrix are not reproduced here.
The Transformer and Modify stages can change a null representation from an
out-of-band null to an in-band null and from an in-band null to an out-of-band null.
When reading from dataset and database sources with nullable columns, the
DataStage parallel framework uses the internal, out-of-band null representation
for NULL values.
Source field nullable, destination field not nullable: if the source value is not null, the source value propagates; if the source value is null, a fatal error occurs.
Using RCP judiciously in a job design facilitates re-usable job designs based on
input metadata, rather than using a large number of jobs with hard-coded table
definitions to perform the same tasks. Certain stages, for example the Sequential
File stage, allow their runtime schema to be parameterized, further extending
re-use through RCP.
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this book.
IBM Redbooks
For information about ordering these publications, see “How to get Redbooks” on
page 432. Note that some of the documents referenced here might be available
in softcopy only.
IBM WebSphere QualityStage Methodologies, Standardization, and
Matching, SG24-7546
Other publications
These publications are also relevant as further information sources:
IBM Information Server 8.1 Planning, Installation and Configuration Guide,
GC19-1048
IBM Information Server Introduction, GC19-1049
Online resources
These Web sites are also relevant as further information sources:
IBM Information Server information center
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp
IBM Information Server Quick Start Guide
http://www-01.ibm.com/support/docview.wss?uid=swg27009391&aid=1