Pentaho Data Integration 4 Cookbook
Pentaho Data Integration 4 Cookbook - Adrián Sergio Pulvirenti
Table of Contents
Pentaho Data Integration 4 Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Working with Databases
Introduction
Sample databases
Pentaho BI platform databases
Connecting to a database
Getting ready
How to do it...
How it works...
There's more...
Avoiding creating the same database connection over and over again
Avoiding modifying jobs and transformations every time a connection changes
Specifying advanced connection properties
Connecting to a database not supported by Kettle
Checking the database connection at run-time
Getting data from a database
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from a database by providing parameters
Getting ready
How to do it...
How it works...
There's more...
Parameters coming in more than one row
Executing the SELECT statement several times, each for a different set of parameters
See also
Getting data from a database by running a query built at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Inserting or updating rows in a table
Getting ready
How to do it...
How it works...
There's more...
Alternative solution if you just want to insert records
Alternative solution if you just want to update rows
Alternative way for inserting and updating
See also
Inserting new rows where a simple primary key has to be generated
Getting ready
How to do it...
How it works...
There's more...
Using the Combination lookup/update for looking up
See also
Inserting new rows where the primary key has to be generated based on stored values
Getting ready
How to do it...
How it works...
There's more...
See also
Deleting data from a table
Getting ready
How to do it...
How it works...
See also
Creating or altering a database table from PDI (design time)
Getting ready
How to do it...
How it works...
There's more...
See also
Creating or altering a database table from PDI (runtime)
How to do it...
How it works...
There's more...
See also
Inserting, deleting, or updating a table depending on a field
Getting ready
How to do it...
How it works...
There's more...
Insert, update, and delete all-in-one
Synchronizing after merge
See also
Changing the database connection at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Loading a parent-child table
Getting ready
How to do it...
How it works...
See also
2. Reading and Writing Files
Introduction
Reading a simple file
Getting ready
How to do it...
How it works...
There's more...
Alternative notation for a separator
About file format and encoding
About data types and formats
Altering the names, order, or metadata of the fields coming from the file
Reading files with fixed width fields
Reading several files at the same time
Getting ready
How to do it...
How it works...
There's more...
Reading unstructured files
Getting ready
How to do it...
How it works...
There's more...
Master/detail files
Log files
Reading files having one field by row
Getting ready
How to do it...
How it works...
There's more...
See also
Reading files with some fields occupying two or more rows
Getting ready
How to do it...
How it works...
See also
Writing a simple file
Getting ready
How to do it...
How it works...
There's more...
Changing headers
Giving the output fields a format
Writing an unstructured file
Getting ready
How to do it...
How it works...
There's more...
Providing the name of a file (for reading or writing) dynamically
Getting ready
How to do it...
How it works...
There's more...
Get System Info
Generating several files simultaneously with the same structure, but different names
Using the name of a file (or part of it) as a field
Getting ready
How to do it...
How it works...
Reading an Excel file
Getting ready
How to do it...
How it works...
See also
Getting the value of specific cells in an Excel file
Getting ready
How to do it...
How it works...
There's more...
Labels and values horizontally arranged
Looking for a given cell
Writing an Excel file with several sheets
Getting ready
How to do it...
How it works...
There's more...
See also
Writing an Excel file with a dynamic number of sheets
Getting ready
How to do it...
How it works...
See also
3. Manipulating XML Structures
Introduction
Reading simple XML files
Getting ready
How to do it...
How it works...
There's more...
XML data in a field
XML file name in a field
ECMAScript for XML
See also
Specifying fields by using XPath notation
Getting ready
How to do it...
How it works...
There's more...
Getting data from a different path
Getting data selectively
Getting more than one node when the nodes share their XPath notation
Saving time when specifying XPath
Validating well-formed XML files
Getting ready
How to do it...
How it works...
See also
Validating an XML file against DTD definitions
Getting ready
How to do it...
How it works...
There's more...
See also
Validating an XML file against an XSD schema
Getting ready
How to do it...
How it works...
There's more...
See also
Generating a simple XML document
Getting ready
How to do it...
How it works...
There's more...
Generating fields with XML structures
See also
Generating complex XML structures
Getting ready
How to do it...
How it works...
See also
Generating an HTML page using XML and XSL transformations
Getting ready
How to do it...
How it works...
There's more...
See also
4. File Management
Introduction
Copying or moving one or more files
Getting ready
How to do it...
How it works...
There's more...
Moving files
Detecting the existence of the files before copying them
Creating folders
See also
Deleting one or more files
Getting ready
How to do it...
How it works...
There's more...
Figuring out which files have been deleted
See also
Getting files from a remote server
Getting ready
How to do it...
How it works...
There's more...
Specifying files to transfer
Some considerations about connecting to an FTP server
Access via SFTP
Access via FTPS
Getting information about the files being transferred
See also
Putting files on a remote server
Getting ready
How to do it...
How it works...
There's more...
See also
Copying or moving a custom list of files
Getting ready
How to do it...
How it works...
See also
Deleting a custom list of files
Getting ready
How to do it...
How it works...
See also
Comparing files and folders
Getting ready
How to do it...
How it works...
There's more...
Comparing folders
Working with ZIP files
Getting ready
How to do it...
How it works...
There's more...
Avoiding zipping files
Avoiding unzipping files
See also
5. Looking for Data
Introduction
Looking for values in a database table
Getting ready
How to do it...
How it works...
There's more...
Taking some action when the lookup fails
Taking some action when there are too many results
Looking for non-existent data
See also
Looking for values in a database (with complex conditions or multiple tables involved)
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a database with extreme flexibility
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a variety of sources
Getting ready
How to do it...
How it works...
There's more...
Looking for alternatives when the Stream Lookup step doesn't meet your needs
Speeding up your transformation
Using the Value Mapper step for looking up from a short list of values
See also
Looking for values by proximity
Getting ready
How to do it...
How it works...
There's more...
Looking for values consuming a web service
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values over an intranet or Internet
Getting ready
How to do it...
How it works...
There's more...
See also
6. Understanding Data Flows
Introduction
Splitting a stream into two or more streams based on a condition
Getting ready
How to do it...
How it works...
There's more...
Avoiding the use of Dummy steps
Comparing against the value of a Kettle variable
Avoiding the use of nested Filter Rows steps
Overcoming the difficulties of complex conditions
Merging rows of two streams with the same or different structures
Getting ready
How to do it...
How it works...
There's more...
Making sure that the metadata of the streams is the same
Telling Kettle how to merge the rows of your streams
See also
Comparing two streams and generating differences
Getting ready
How to do it...
How it works...
There's more...
Using the differences to keep a table up to date
See also
Generating all possible pairs formed from two datasets
How to do it...
How it works...
There's more...
Getting variables in the middle of the stream
Limiting the number of output rows
See also
Joining two or more streams based on given conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Interspersing new rows between existent rows
Getting ready
How to do it...
How it works...
See also
Executing steps even when your stream is empty
Getting ready
How to do it...
How it works...
There's more...
Processing rows differently based on the row number
Getting ready
How to do it...
How it works...
There's more...
Identifying specific rows
Identifying the last row in the stream
Avoiding using an Add sequence step to enumerate the rows
See also
7. Executing and Reusing Jobs and Transformations
Introduction
Sample transformations
Sample transformation: Hello
Sample transformation: Random list
Sample transformation: Sequence
Sample transformation: File list
Launching jobs and transformations
Executing a job or a transformation by setting static arguments and parameters
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation whose name is determined at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Executing part of a job once for every row in a dataset
Getting ready
How to do it...
How it works...
There's more...
Accessing the copied rows from jobs, transformations, and other entries
Executing a transformation once for every row in a dataset
Executing a transformation or part of a job once for every file in a list of files
See also
Executing part of a job several times until a condition is true
Getting ready
How to do it...
How it works...
There's more...
Implementing loops in a job
Using the JavaScript step to control the execution of the entries in your job
See also
Creating a process flow
Getting ready
How to do it...
How it works...
There's more...
Serializing/De-serializing data
Other means for transferring or sharing data between transformations
Moving part of a transformation to a subtransformation
Getting ready
How to do it...
How it works...
There's more...
8. Integrating Kettle and the Pentaho Suite
Introduction
A sample transformation
Creating a Pentaho report with data coming from PDI
Getting ready
How to do it...
How it works...
There's more...
Configuring the Pentaho BI Server for running PDI jobs and transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a PDI transformation as part of a Pentaho process
Getting ready
How to do it...
How it works...
There's more...
Specifying the location of the transformation
Supplying values for named parameters, variables and arguments
Keeping things simple when it's time to deliver a plain file
See also
Executing a PDI job from the Pentaho User Console
Getting ready
How to do it...
How it works...
There's more...
See also
Generating files from the PUC with PDI and the CDA plugin
Getting ready
How to do it...
How it works...
There's more...
Populating a CDF dashboard with data coming from a PDI transformation
Getting ready
How to do it...
How it works...
There's more...
See also
9. Getting the Most Out of Kettle
Introduction
Sending e-mails with attached files
Getting ready
How to do it...
How it works...
There's more...
Sending logs through an e-mail
Sending e-mails in a transformation
Generating a custom log file
Getting ready
How to do it...
How it works...
There's more...
Filtering the log file
Creating a clean log file
Isolating log files for different jobs or transformations
See also
Programming custom functionality
Getting ready
How to do it...
How it works...
There's more...
Data type's equivalence
Generalizing your code
Looking up information with additional steps
Customizing logs
Scripting alternatives to the UDJC step
Generating sample data for testing purposes
How to do it...
How it works...
There's more...
Using Data grid step to generate specific data
Working with subsets of your data
See also
Working with Json files
Getting ready
How to do it...
How it works...
There's more...
Reading Json files dynamically
Writing Json files
Getting information about transformations and jobs (file-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation XML nodes
Job XML nodes
Steps and entries information
See also
Getting information about transformations and jobs (repository-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation tables
Job tables
Database connections tables
A. Data Structures
Book's data structure
Books
Authors
Museum's data structure
Museums
Cities
Outdoor data structure
Products
Categories
Steel Wheels structure
Index
Pentaho Data Integration 4 Cookbook
Copyright © 2011 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2011
Production Reference: 1170611
Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.
ISBN 978-1-849515-24-5
www.packtpub.com
Cover Image by Ed Maclean (<[email protected]>)
Credits
Authors
Adrián Sergio Pulvirenti
María Carina Roldán
Reviewers
Jan Aertsen
Pedro Alves
Slawomir Chodnicki
Paula Clemente
Samatar Hassan
Nelson Sousa
Acquisition Editor
Usha Iyer
Development Editor
Neha Mallik
Technical Editors
Conrad Sardinha
Azharuddin Sheikh
Project Coordinator
Joel Goveya
Proofreaders
Stephen Silk
Aaron Nash
Indexer
Tejal Daruwale
Graphics
Nilesh Mohite
Production Coordinator
Kruthika Bangera
Cover Work
Kruthika Bangera
About the Authors
Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.
He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.
I'd like to thank my lovely kids Camila and Nicolas, who understood that I couldn't share with them the usual videogame sessions during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.
María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's degree in Computer Science at UNLP in La Plata; after that she did a postgraduate course in Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has lived since 1994.
She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using Pentaho Suite. Currently she works for Webdetails, one of the main Pentaho contributors.
She is the author of Pentaho 3.2 Data Integration: Beginner's Guide published by Packt Publishing in April 2010.
You can follow her on Twitter at @mariacroldan.
I'd like to thank those who have encouraged me to write this book. On the one hand, the Pentaho community, who gave me rewarding feedback after the Beginner's book. On the other hand, my husband, who without hesitation agreed to write the book with me. Without them I'm not sure I would have embarked on a new book project.
I'd also like to thank the technical reviewers for the time and dedication that they have put in reviewing the book. In particular, thanks to my colleagues at Webdetails; it's a pleasure and a privilege to work with them every day.
About the Reviewers
Jan Aertsen has worked in IT and decision support for the past 10 years. Since the beginning of his career he has specialized in data warehouse design and business intelligence projects. He has worked on numerous global data warehouse projects within the fashion industry, retail, banking and insurance, telco and utilities, logistics, automotive, and public sector.
Jan holds the degree of Commercial Engineer in international business affairs from the Catholic University of Leuven (Belgium) and extended his knowledge in the field of business intelligence through a Masters in Artificial Intelligence.
In 1999 Jan started up the business intelligence activities at IOcore together with some of his colleagues, rapidly making this the most important revenue area of the Belgian affiliate. They quickly gained access to a range of customers such as KPN Belgium, Orange (now Base), Mobistar, and other Belgian Telcos.
After this experience Jan joined Cap Gemini Ernst & Young in Italy and rapidly became one of their top BI project managers. After having managed some large BI projects (worth up to 1 million €) Jan decided to leave the company and pursue his own ambitions.
In 2002, he founded kJube as an independent platform to develop his ambitions in the world of business intelligence. Since then this has resulted in collaborations with numerous companies such as Volvo, Fendi-LVMH, ING, MSC, Securex, SDWorx, Blinck, and Beate Uhse.
Over the years Jan has worked his way through every possible aspect of business intelligence, from KPI and strategy definition through budgeting, tool selection, and software investment acquisition to project management and all implementation aspects with most of the available tools. He knows the business side as well as the IT side of business intelligence, and is therefore one of the rare people able to give you sound, all-round, vendor-independent advice on the subject.
He continues to share his experiences in the field through his blog (blog.kjube.be) and can be contacted at
Pedro Alves is the founder of Webdetails. He is a physicist by training, a serious video gamer, a volleyball player, an open source enthusiast, and the dad of two lovely children.
Since his early professional years he has been responsible for Business Software development and his career led him to work as a Consultant in several Portuguese companies.
In 2008 he decided it was time to take his accumulated experience and share his knowledge of the Pentaho Business Intelligence platform on his own. He founded Webdetails and joined the Mozilla metrics team. Now he leads an international team of BI Consultants and keeps nurturing Webdetails as a world reference Pentaho BI solutions provider and community contributor. He is the Ctools (CDF, CDA, CDE, CBF, CST, CCC) architect and, on a daily basis, keeps developing and improving new components and features to extend and maximize Pentaho's capabilities.
Slawomir Chodnicki specializes in data warehousing and ETL, with a background in web development using various programming languages and frameworks. He has established his blog at http://type-exit.org to help fellow BI developers embrace the possibilities of PDI and other open source BI tools.
I would like to thank all regular members of the ##pentaho IRC channel for their endless patience and support regarding PDI related questions. Very special thanks go to María Carina and Adrián Sergio for creating the Kettle Cookbook and inviting me to be part of the project.
Paula Clemente was born in Sintra, Portugal, in 1983. Divided between the idea of spending her life caring about people and animals or spending quality time with computers, she started studying Computer Science at IST Engineering College ("the Portuguese MIT") at a time when Internet Social Networking was a synonym of IRC. She graduated in 2008 after completing her Master thesis on Business Processes Management. Since then she has proudly worked as a BI Consultant for Webdetails, a Portuguese company specialized in delivering Pentaho BI solutions that earned the "Pentaho Best Community Contributor 2011" award.
Samatar Hassan is an application developer focusing on data integration and business intelligence. He has been involved in the Kettle project since the year it was open sourced. He tries to help the community by contributing in different ways: leading the French translation effort, participating in the forums, resolving bugs, and adding new features to the software.
He contributed to the Pentaho Kettle Solutions book, published by Wiley and written by Matt Casters, the founder of Kettle.
I would first like to thank Adrián Sergio and María Carina Roldán for taking the time to write this book. It is a great idea to show how to take advantage of Kettle through step-by-step recipes. Kettle users have their own ETL bible now.
Finally, I'd like to thank all community members. They are the real power of open source software.
Nelson Sousa is a business intelligence consultant at Webdetails. He's part of the Metrics team at Mozilla where he helps develop and maintain Mozilla's Pentaho server and solution. He specializes in Pentaho dashboards using CDF, CDE, and CDA and also in PDI, processing vast amounts of information that are integrated daily in the various dashboards and reports that are part of the Metrics team day-to-day life.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
We dedicate this book to our family and especially our adorable kids.
- María Carina and Adrián -
Preface
Pentaho Data Integration (PDI, also called Kettle), one of the leading data integration tools, is broadly used for all kinds of data manipulation, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. Do you need quick solutions to the problems you face while using Kettle?
Pentaho Data Integration 4 Cookbook explains Kettle features in detail through clear and practical recipes that you can quickly apply to your solutions. The recipes cover a broad range of topics including processing files, working with databases, understanding XML structures, integrating with Pentaho BI Suite, and more.
Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to find quick solutions to your needs. The initial chapters explain the details about working with databases, files, and XML structures. Then you will see different ways for searching data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools.
Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks. There are examples and code that are ready for adaptation to individual needs.
Learn to solve data manipulation problems using the Pentaho Data Integration tool Kettle.
What this book covers
Chapter 1, Working with Databases helps you to deal with databases in Kettle. The recipes cover creating and sharing connections, loading tables under different scenarios, and creating dynamic SQL statements, among other topics.
Chapter 2, Reading and Writing Files shows you not only the basics for reading and writing files, but also all the how-tos for dealing with files. The chapter includes parsing unstructured files, reading master/detail files, generating multi-sheet Excel files, and more.
Chapter 3, Manipulating XML Structures teaches you how to read, write, and validate XML data. It covers both simple and complex XML structures.
Chapter 4, File Management helps you to pick and configure the different options for copying, moving, and transferring lists of files or directories.
Chapter 5, Looking for Data explains the different methods for searching information in databases, text files, web services, and more.
Chapter 6, Understanding Data Flows focuses on the different ways for combining, splitting, or manipulating streams or flows of data in simple and complex situations.
Chapter 7, Executing and Reusing Jobs and Transformations explains in a simple fashion topics that are critical for building complex PDI projects, for example, building reusable jobs and transformations, iterating the execution of a transformation over a list of data, and transferring data between transformations.
Chapter 8, Integrating Kettle and the Pentaho Suite shows you how to run Kettle jobs and transformations as part of the Pentaho Business Intelligence Suite. PDI (also known as Kettle) can interact with other components of the suite, for example as the data source for a report or as part of a bigger process.
Chapter 9, Getting the Most Out of Kettle covers a wide variety of topics, such as customizing a log file, sending e-mails with attachments, or creating a custom functionality.
Appendix, Data Structures describes some structures used in several recipes throughout the book.
What you need for this book
PDI is a multiplatform tool, meaning that you will be able to install the tool no matter what your operating system is. The only prerequisite to work with PDI is to have JVM 1.5 or a higher version installed. It is also useful to have Excel or Calc, a nice text editor, and access to a database engine of your preference.
Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum, where you may search for answers or post questions if you are stuck with something.
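As a quick way to check the JVM prerequisite mentioned above, the following is a minimal sketch (not part of the original book) that decides whether a Java version string satisfies the "1.5 or higher" requirement. The function names `parse_java_version` and `meets_pdi_requirement` are hypothetical, and the sketch assumes the version string follows either the legacy `1.x` scheme (for example `1.5.0_22`) or the newer scheme in which the first number is the major release (for example `17.0.2`).

```python
import re


def parse_java_version(version_text):
    """Return (major, minor) from text such as '1.5.0_22' or '17.0.2'.

    Only the first dotted number found in the text is considered.
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError("no version number found in: %r" % version_text)
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def meets_pdi_requirement(version_text, minimum=(1, 5)):
    """Check a Java version string against the minimum (1.5 by default).

    Assumption: a leading number greater than 1 (9, 11, 17, ...) is a
    newer-style major release, treated here as equivalent to '1.<major>'.
    """
    major, minor = parse_java_version(version_text)
    if major > 1:
        major, minor = 1, major
    return (major, minor) >= minimum
```

You would typically feed the function the first line printed by `java -version`; note that the `java` launcher writes that banner to standard error rather than standard output.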
Who this book is for
If you are a software developer or anyone involved or interested in developing ETL solutions, or in general, doing any kind of data manipulation, this book is for you. It does not cover PDI basics, SQL basics, or database concepts. You are expected to have a basic understanding of the PDI tool, SQL language, and databases.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "Copy the .jar file containing the driver to the libext/JDBC directory inside the Kettle installation directory."
A block of code is set as follows:
NUMBER, LASTNAME, FIRSTNAME, EXT, OFFICE, REPORTS, TITLE
1188, Firrelli, Julianne,x2174,2,1143, Sales Manager
1619, King, Tom,x103,6,1088,Sales Rep
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Add a Delete file entry from the File management category."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.
If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or e-mail
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you