TalendOpenStudio DI GettingStarted 6.4.1 en
6.4.1
Contents
Copyleft
Introduction to Talend Open Studio for Data Integration
Prerequisites to using Talend Open Studio for Data Integration
Downloading and installing Talend Open Studio for Data Integration
Configuring and setting up your Talend product
Performing data integration tasks
Talend Open Studio for Data Integration 6.4.1 © Talend 2017
Copyleft
Adapted for 6.4.1. Supersedes previous releases.
Publication date: June 29th, 2017
This documentation is provided under the terms of the Creative Commons Public License (CCPL).
For more information about what you can and cannot do with this documentation in accordance with
the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/.
Notices
Talend is a trademark of Talend, Inc.
All brands, product names, company names, trademarks and service marks are the properties of their
respective owners.
License Agreement
The software described in this documentation is licensed under the Apache License, Version 2.0 (the
"License"); you may not use this software except in compliance with the License. You may obtain
a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.html. Unless required by
applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing permissions and limitations under the
License.
This product includes software developed at AOP Alliance (Java/J2EE AOP standards), ASM,
Amazon, AntlR, Apache ActiveMQ, Apache Ant, Apache Axiom, Apache Axis, Apache Axis 2,
Apache Batik, Apache CXF, Apache Chemistry, Apache Common Http Client, Apache Common
Http Core, Apache Commons, Apache Commons Bcel, Apache Commons JxPath, Apache Commons
Lang, Apache Derby Database Engine and Embedded JDBC Driver, Apache Geronimo, Apache
Hadoop, Apache Hive, Apache HttpClient, Apache HttpComponents Client, Apache JAMES, Apache
Log4j, Apache Lucene Core, Apache Neethi, Apache POI, Apache ServiceMix, Apache Tomcat,
Apache Velocity, Apache WSS4J, Apache WebServices Common Utilities, Apache Xml-RPC, Apache
Zookeeper, Box Java SDK (V2), CSV Tools, DataStax Java Driver for Apache Cassandra, Ehcache,
Ezmorph, Ganymed SSH-2 for Java, Google APIs Client Library for Java, Google Gson, Groovy,
Guava: Google Core Libraries for Java, H2 Embedded Database and JDBC Driver, Hector: A high
level Java client for Apache Cassandra, Hibernate Validator, HighScale Lib, HsqlDB, Ini4j, JClouds,
JLine, JSON, JSR 305: Annotations for Software Defect Detection in Java, JUnit, Jackson Java JSON-
processor, Java API for RESTful Services, Java Agent for Memory Measurements, Jaxb, Jaxen,
Jettison, Jetty, Joda-Time, Json Simple, LightCouch, MetaStuff, Mondrian, OpenSAML, Paraccel
JDBC Driver, PostgreSQL JDBC Driver, Resty: A simple HTTP REST client for Java, Rocoto,
SLF4J: Simple Logging Facade for Java, SQLite JDBC Driver, Simple API for CSS, SshJ, StAX API,
StAXON - JSON via StAX, The Castor Project, The Legion of the Bouncy Castle, W3C, Woden,
Woodstox: High-performance XML processor, Xalan-J, Xerces2, XmlBeans, XmlSchema Core,
Xmlsec - Apache Santuario, Zip4J, atinject, dropbox-sdk-java: Java library for the Dropbox Core API,
google-guice. Licensed under their respective licenses.
Talend Open Studio for Data Integration Getting Started Guide (2017-06-29) | 3
Memory requirements
To make the most out of your Talend product, please consider the following memory and disk space
usage:
Software requirements
To make the most out of your Talend product, please consider the following system and software
requirements:
Required software
Yosemite/10.10 64-bit
Mavericks/10.9 64-bit
Optional software
Installing Java
To use your Talend product, you need Oracle Java Runtime Environment installed on your computer.
1. From the Java SE Downloads page, under Java Platform, Standard Edition, click the JRE
Download.
2. From the Java SE Runtime Environment 8 Downloads page, click the radio button to Accept
License Agreement.
3. Select the appropriate download for your Operating System.
4. Follow the Oracle installation steps to install Java.
When Java is installed on your computer, you need to set up the JAVA_HOME environment variable.
For more information, see:
Example:
export JAVA_HOME=/usr/lib/jvm/jre1.8.0_65
export PATH=$JAVA_HOME/bin:$PATH
3. Add these lines at the end of the user profiles in the ~/.profile file or, as a superuser, at the end
of the global profiles in the /etc/profile file.
4. Log on again.
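After logging on again, it can be useful to confirm that the variables took effect. The following is a minimal Python sketch, not part of the Talend product: the function name check_java_home is hypothetical, and it only checks that JAVA_HOME is set, that a java launcher exists in its bin directory, and that this directory appears on PATH.

```python
import os
from pathlib import Path


def check_java_home(env):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    bin_dir = Path(java_home) / "bin"
    # The launcher is named java on Linux/macOS and java.exe on Windows.
    if not any((bin_dir / name).is_file() for name in ("java", "java.exe")):
        return False, f"no java launcher found in {bin_dir}"
    # PATH entries are compared as plain strings, so a symlinked or
    # differently spelled path would be reported as missing.
    path_entries = env.get("PATH", "").split(os.pathsep)
    if str(bin_dir) not in path_entries:
        return False, f"{bin_dir} is not on PATH"
    return True, f"JAVA_HOME points to {java_home}"


if __name__ == "__main__":
    ok, message = check_java_home(os.environ)
    print("OK:" if ok else "Problem:", message)
```

Passing the environment as a parameter keeps the check easy to try against a test directory before relying on it.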
For Windows, Talend recommends that you install 7-Zip and use it to extract files. For more information,
see Installing 7-Zip (Windows) on page 6.
To install the studio, follow the steps below:
1. Navigate to your local folder, locate the TOS zip file and move it to another location with a path as
short as possible and without any space character.
Example: C:/Talend/
2. Unzip it by right-clicking on the compressed file and selecting 7-Zip > Extract Here.
If you do not want to use 7-Zip, you can use the default Windows unzipping tool.
1. Right-click the compressed file and select Extract All.
2. Click Browse and navigate to the C: drive.
3. Select Make new folder and name the folder Talend. Click OK.
4. Click Extract to begin the installation.
After your Studio successfully launches, you can also click the Videos link on the top of the Studio
main window to watch a couple of short videos that help you get started with your Talend Studio.
For some operating systems, you may need to install an MP4 decoder/player to play the videos.
Now you have successfully logged in to the Talend Studio. Next you need to install additional
packages required for the Talend Studio to work properly.
the online version of this page at https://help.talend.com, and then save the source files in your local
directory C:\getting_started\input_data\.
In this step of the wizard, Name is the only mandatory field. The information you provide in the
Description field will appear as hover text when you move your mouse pointer over the Job in the
Repository tree view.
5. Click Finish to create your Job.
An empty Job is opened in the Studio.
1. Drop a tFileInputDelimited and a tLogRow component from the Palette onto the design
workspace.
You can find the tFileInputDelimited component in the Input group of the File family and the
tLogRow component in the Logs & Errors family in the Palette.
2. Click the tFileInputDelimited component so that an o icon appears, then drag and drop the o icon onto
the tLogRow component.
The two components are now connected via a Row > Main connection.
Now you have added the required components to the Job. In the next steps you will need to prepare the
required metadata and configure the Job.
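Conceptually, the two connected components read records from a delimited file and print each record. The Python sketch below illustrates that data flow only; it is not the code the Studio generates. The function names, the comma delimiter, the "|" output separator, and the sample records are all assumptions made for illustration.

```python
import csv
import io


def read_delimited(stream, delimiter=","):
    """Yield each record of a delimited text stream as a list of fields."""
    yield from csv.reader(stream, delimiter=delimiter)


def log_rows(rows, separator="|"):
    """Format each record on one line, in the spirit of a logging component."""
    return [separator.join(fields) for fields in rows]


# Hypothetical sample data; the real movies.csv is not reproduced here.
sample = io.StringIO("1,Movie A,1999\n2,Movie B,2005\n")
for line in log_rows(read_delimited(sample)):
    print(line)
```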
• You have the source file movies.csv ready in the directory C:\getting_started\input_data\.
1. In the Repository tree view, expand the Metadata node, right-click File delimited, and select
Create file delimited from the contextual menu to open the New Delimited File wizard.
2. In the New Delimited File wizard, enter a name for the file metadata, movies in this example, and
other useful information to better describe your file metadata, and then click Next to go to the next
step and define the general properties of the file.
In this step of the wizard, Name is the only mandatory field. The information you provide in
the Description field will appear as a tooltip when you move your mouse pointer over the file
connection.
3. In the File field specify the path of the source file, or click Browse to browse to the file.
The file is loaded, and the File Viewer area displays an excerpt of the file, allowing you to check
the file consistency, the presence of a header and, more generally, the file structure.
4. From the Format list, select your operating system, and click Next to parse the file.
5. On the Preview tab, select the Set heading row as column names check box to retrieve the file
column names from the first row, and then click Refresh Preview.
The file preview is refreshed, and the Header check box in the Rows To Skip area is automatically
selected, with the number of header rows to be skipped incremented by 1.
6. If the file contains more than one heading row to be skipped during parsing, specify the
number of rows in the Header field and click Refresh Preview again.
7. Click Next to retrieve the file schema.
The Description of the Schema table displays the generated file schema.
8. Name the schema movies_schema, then check the file schema and edit it according to your actual
needs.
In this example, increase the length of the title and url columns.
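To see why column lengths may need adjusting, it can help to sketch how a schema might be guessed from a sample of the file. The Python function below is a hypothetical illustration, not the wizard's actual algorithm: it takes column names from the first row, skips the heading rows, and reports the longest value seen per column. A length guessed from a sample can be too small for later records, which is one reason to increase it by hand.

```python
import csv
import io


def guess_schema(stream, delimiter=",", header_rows=1):
    """Return a list of (column_name, max_length) pairs for a delimited file."""
    rows = list(csv.reader(stream, delimiter=delimiter))
    if header_rows < 1 or len(rows) < header_rows:
        raise ValueError("the file needs at least one heading row")
    # Column names come from the first row; all heading rows are skipped.
    names = rows[0]
    lengths = [0] * len(names)
    for record in rows[header_rows:]:
        for index, value in enumerate(record[: len(names)]):
            lengths[index] = max(lengths[index], len(value))
    return list(zip(names, lengths))


# Hypothetical sample data using column names mentioned in this guide.
sample = io.StringIO(
    "title,url,releaseYear\n"
    "Movie A,http://example.com/a,1999\n"
)
for name, length in guess_schema(sample):
    print(name, length)
```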
You now have the movies file metadata ready for use. Next, you need to apply the created metadata to
the component that reads the source file.
design workspace. When asked whether to propagate the changes to the output component, click
Yes.
In the Basic settings tab of the Component view, you'll find that all the parameters of the
component have been automatically filled.
3. Double-click the tLogRow component to open its Basic settings tab view.
4. In the Mode area, select the Vertical (each row is a key/value list) option for better readability of
long fields on the Run console.
5. Press F6 or click the Run button on the Run view to execute your Job.
The Run console displays the movies information read from the source file.
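The idea behind the vertical mode, where each row is shown as a key/value list, can be sketched in a few lines of Python. This is an illustration only: the function name format_vertical and the sample values are hypothetical, and the exact console layout produced by tLogRow is not reproduced.

```python
def format_vertical(columns, record):
    """Render one record as a key/value list, one column per line."""
    width = max(len(name) for name in columns)
    return "\n".join(
        f"{name.ljust(width)} | {value}" for name, value in zip(columns, record)
    )


# Hypothetical record; long values stay readable because each gets its own line.
print(format_vertical(["title", "url"], ["Movie A", "http://example.com/a"]))
```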
1. In the Repository tree view, expand the Metadata node, right-click File delimited, and select
Create file delimited from the contextual menu to open the New Delimited File wizard.
2. Enter a name for the file connection, directors in this example, and other useful information to
better describe your file metadata, and then click Next to go to the next step and define the general
properties of the file.
3. In the File field specify the path of the source file, or click Browse to browse to the file.
The file is loaded, and the File Viewer area displays an excerpt of the file, allowing you to check
the file consistency, the presence of a header and, more generally, the file structure.
4. Select Windows from the Format list, and click Next to parse the file.
5. From the Field Separator list of the File Settings area, select Comma.
You now have the directors file metadata ready for use when you set up the component to read the
reference file.
• You have created and successfully executed the Job named movies as described in Reading movies
information from a CSV file on page 8.
1. In the Repository tree view, right-click the Job named movies and select Duplicate from the
contextual menu.
2. In the Duplicate dialog box, enter a name for the new Job, filter_movies in this example, and
click OK to validate the Job creation and close the dialog box.
The Job named filter_movies is created, which is a duplicate of the Job named movies.
The procedure below shows how to add a mapping component by typing the component name directly
on the existing connection.
1. In the new Job named filter_movies, select the Row connection linking the tFileInputDelimited
and tLogRow components, and type the name of tMap or part of it.
When you start typing the component name, a list of components that match your input appears.
You can select a component to view its description beside the component list.
2. Double-click tMap on the list to add it onto the connection.
The newly added tMap component is now connected with the input component, and a dialog box
opens asking you to give a name to the new output connection.
3. Enter a name for the new output connection, Valid_movies in this example, and click OK.
When asked whether you want to propagate the input schema to the target output component, click
Yes.
The tMap component is now added to the Job and connected with the two existing components via
Row > Main connections.
• You have centralized the metadata for directors.txt in the Repository as described in
Preparing directors file metadata on page 15.
1. In the Repository tree view, expand Metadata > File delimited, drag and drop the file connection
directors or its schema directors_schema onto the design workspace.
The Components dialog box opens, showing a list of components you can add to the Job from this
metadata item.
3. Right-click the newly added tFileInputDelimited component, select Row > Main from the
contextual menu, and click the tMap component.
The tFileInputDelimited is now connected to the tMap via a lookup connection.
4. In the Advanced settings tab of the new tFileInputDelimited component, select the Trim all
columns check box.
Some records of the reference input file directors.txt contain leading white spaces. This
option allows you to remove such white spaces from the lookup input flow when the Job is
executed.
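The effect of trimming on a later join can be illustrated with a short Python sketch. The function name trim_all_columns and the sample record are hypothetical, and the sketch assumes that white space is stripped from both ends of every field.

```python
def trim_all_columns(record):
    """Strip leading and trailing white space from every field of a record."""
    return [field.strip() for field in record]


# Hypothetical lookup record with leading white space in both fields.
raw = [" 3", "  Jane Doe"]
print(raw[0] == "3")                    # the untrimmed ID does not match "3"
print(trim_all_columns(raw)[0] == "3")  # the trimmed ID does
```

Without trimming, an ID such as " 3" would not equal "3", so a join on that column would miss the record.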
You now have all the components in the Job needed for filtering the movies information. Next you'll
need to configure mappings in the tMap component to filter the main input flow against the lookup
flow and output the desired information.
2. Select the directorID column in the row1 table, and drop it onto the directorID column in the
row2 table to create a join between the two input data sets based on the director IDs.
3. Click the tMap settings button, click the Value field for Join Model, and then click the [...]
button that appears to open the Options dialog box. In the dialog box, select Inner Join and click
OK to define the join as an inner join.
With this setting, only the movie records whose director IDs match those in the reference
file are passed to the output.
4. In the Schema editor at the bottom of the map editor, select the directorID column of the output
schema, Valid_movies in this example, and click the [X] button to remove it.
5. Click the [+] button beneath the output table to add a new column, name it directedBy, set its
length to 20, and move it up so that it's between the title and releaseYear columns.
6. Select the directorName column in the row2 table, and drop it to the Expression field
corresponding to the directedBy column in the output table.
A new mapping is created between the lookup table and the output table.
7. Click OK to validate the mappings and close the map editor, and click Yes when asked whether to
propagate the changes.
The mapping configurations are saved and the output schema is synchronized to the output
component tLogRow.
8. Press F6 or click the Run button on the Run view to execute your Job.
Only movie records with valid director information are displayed on the Run console.
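The mapping configured above amounts to an inner join on directorID, followed by replacing the directorID column with a directedBy column placed after title. A minimal Python sketch of that logic follows. It is not the code generated by the Studio; the function name inner_join_movies and the sample records are hypothetical, and the sketch assumes every movie record has a title column.

```python
def inner_join_movies(movies, directors):
    """Keep movies whose directorID exists in the lookup and add directedBy."""
    # Build the lookup once: directorID -> directorName.
    names_by_id = {row["directorID"]: row["directorName"] for row in directors}
    valid = []
    for movie in movies:
        director_id = movie.get("directorID")
        if director_id not in names_by_id:
            continue  # inner join: unmatched records are dropped from this output
        output = {}
        for column, value in movie.items():
            if column == "directorID":
                continue  # removed from the output schema
            output[column] = value
            if column == "title":
                # directedBy sits between title and the columns that follow it.
                output["directedBy"] = names_by_id[director_id]
        valid.append(output)
    return valid


# Hypothetical sample data.
movies = [
    {"title": "Movie A", "directorID": "1", "releaseYear": "1999"},
    {"title": "Movie B", "directorID": "9", "releaseYear": "2005"},
]
directors = [{"directorID": "1", "directorName": "Jane Doe"}]
print(inner_join_movies(movies, directors))
```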
• You have created and successfully executed the Job filter_movies as described in Filtering the
movies information on page 15.
1. Create a new Job by duplicating the Job created in the previous scenario, and name the new Job
write_movies_to_db, and then double-click the Job to open it in the design workspace.
2. Right-click the tLogRow component and select Delete from the contextual menu to delete it.
3. Click where the tLogRow was on the design workspace and type the name of tMySqlOutput
or part of it, and then select and double-click tMySqlOutput on the list to add it onto the design
workspace.
When you start typing the component name, a list of components that match your input appears.
You can select a component to view its description beside the component list.
4. Right-click the tMap component, select Row > Valid_movies from the context menu, and click the
tMySqlOutput to link it with the tMap.
The connection name Valid_movies corresponds to the name of the existing output table in tMap.
5. Click the tMap component, and drag and drop the o icon onto the design workspace.
A text field and a list of suggested components appear. You can select a component to view its
description beside the component list.
6. In the text field, type the name of tMySqlOutput, select the component on the list, and press Enter
to add another tMySqlOutput component onto the design workspace.
A dialog box appears, asking you to enter a name for the output connection.
7. In the dialog box, enter Invalid_movies and click OK to connect tMap to the second
tMySqlOutput component.
Now you have added and connected the database output components you need to write the processed
movies information to a MySQL database. Next, you'll need to configure new mappings in the tMap
and database settings in the tMySqlOutput components.
3. Click the tMap settings button on the Invalid_movies table, click the Value field for Catch
lookup inner join reject, and then click the [...] button that appears to open the Options dialog
box. In the dialog box, select true and click OK.
With this setting, any records without director IDs or with director IDs that do not match those
in the reference file will be passed to this output.
4. Click OK to validate the mappings and close the map editor, and click Yes when asked whether to
propagate the changes.
The mapping configurations are saved and the output schema is synchronized to the output
component.
Now you have configured mappings for the rejected output. Next, you'll need to configure the output
components to write the output flows to database tables.
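The behavior of catching inner join rejects can be pictured as splitting the main flow in two: records whose director ID is found in the lookup, and everything else. The Python sketch below illustrates this under stated assumptions; the function name split_by_lookup and the sample records are hypothetical, and a missing or empty directorID is treated as a reject.

```python
def split_by_lookup(movies, director_ids):
    """Return (valid, rejected) lists based on whether directorID is known."""
    valid, rejected = [], []
    for movie in movies:
        director_id = movie.get("directorID")
        if director_id and director_id in director_ids:
            valid.append(movie)
        else:
            # No director ID, or an ID absent from the reference data.
            rejected.append(movie)
    return valid, rejected


# Hypothetical sample data.
movies = [
    {"title": "Movie A", "directorID": "1"},
    {"title": "Movie B", "directorID": "9"},
    {"title": "Movie C", "directorID": ""},
]
valid, rejected = split_by_lookup(movies, {"1", "2"})
print(len(valid), len(rejected))
```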
2. Provide the connection details needed to access your database, including the host name or IP
address, port number, database name, user name and password, in the relevant fields.
When entering your password, first click the [...] button next to the Password field to
open a dialog box, enter your password between double quotation marks in the text field, and then
click OK.
3. In the Table field, enter the name of the target database table.
In this example, the table for valid movies information is valid_movies.
4. Select the Action on table and Action on data options according to your needs.
In this example, we want to remove the table first if it already exists and then create an empty one,
and use the default option for the action on data.
5. In the Basic settings of the second tMySqlOutput component, use the same settings as in the first
tMySqlOutput except the name for the target database table.
In this example, the table for invalid movies information is invalid_movies.
6. Press F6 or click the Run button on the Run view to execute your Job.
The movie records with valid director information are saved to the database table named
valid_movies, and those without valid director information are saved to the database table named
invalid_movies.
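The table action used in this example, dropping the table if it exists, creating it again, and then inserting the rows, can be sketched in Python. To stay self-contained, the sketch uses the standard sqlite3 module with an in-memory database as a stand-in for MySQL; the function name write_table, the column lists, and the sample rows are hypothetical, and a plain insert is assumed for the action on data.

```python
import sqlite3


def write_table(connection, table, columns, rows):
    """Drop the table if it exists, recreate it, and insert the given rows."""
    # Table and column names are interpolated into the SQL text,
    # so they must come from trusted code, never from user input.
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    with connection:  # commits on success, rolls back on error
        connection.execute(f"DROP TABLE IF EXISTS {table}")
        connection.execute(f"CREATE TABLE {table} ({column_list})")
        connection.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
        )


# In-memory SQLite database standing in for the MySQL server (assumption).
connection = sqlite3.connect(":memory:")
write_table(
    connection,
    "valid_movies",
    ["title", "directedBy", "releaseYear"],
    [("Movie A", "Jane Doe", "1999")],
)
write_table(
    connection,
    "invalid_movies",
    ["title", "releaseYear"],
    [("Movie B", "2005")],
)
print(connection.execute("SELECT COUNT(*) FROM valid_movies").fetchone()[0])
```

Because the table is dropped and recreated on each run, running the Job again replaces the previous contents rather than appending to them.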
What's next?
You have seen how Talend Studio helps you manage your data using Talend Jobs. You have learned
how to access your data via Talend Studio, filter and transform your data, and store the filtered and
transformed data in a database. Along the way, you have learned how to centralize frequently used
connections in the Repository and easily reuse these connections in your Jobs.
To learn more about Talend Studio, see:
• Talend Studio User Guide
• Talend components documentation
To ensure that your data is clean, you can try Talend Open Studio for Data Quality and Talend Data
Preparation Free Desktop.
To learn more about Talend products and solutions, visit www.talend.com.