Tuesday, September 28, 2010

ETL: Importing data with Talend

ETL is used for Operational Data Stores, Data Warehouse and Data Mart. ETL tools can also be handy for simple importing into existing application databases as well.

Importing data into your application is as important as reporting out of your application. Importing can be achieved while distributing the workload through several people.

On one end you want a BA to decide when to run an import process, which components to assembly, which data sources to use, the mappings.

There is an existing model, datasources representing feeds that need to get to your model and a mapping to make that transformation happen. ETL processes are commonly used for this task. You can do something as simple as manual SQL scripting, something more elaborated using a Rules Engine or even more polished like using an ETL tool.

Talend has a very good tool for ETL (which JasperETL uses as well). The fact that latest JasperETL 3.2.3 does not work that well on MAC OSX even after tweaking made me decide to stick to Talend. These tools can be used for:
  1. Construct a Business model with a graphical interface that allows BA to drop the general blocks for example an Excel source file to be used in conjunction with a web service output to fill out records in an existing database.
  2. Design a Job to implement the business model blocks.
  3. Schedule Jobs.

This tutorial is about designing a Job with Talend/Jasper ETL. I am not interested here in covering points 1 and 3 as they are really not needed for our task: Importing data from Excel into an application database.

For the impatient

Create a local MySQL DB and name it myapp. Use the model from http://thinkinginsoftware.blogspot.com/2010/09/jasper-real-time-report-services.html

Download Talend ( I used TOS-All-r48998-V4.1.0RC3.zip ) and uncompress in c:\TOS-All-r48998-V4.1.0RC3

  1. Create folder c:\projects\talend to be used as workspace
  2. Open Talend (use the non-wpf exe file for example TalendOpenStudio-win32-x86.exe ) and create a project named DbImport. Use as workspace folder c:\projects\talend
  3. After Talend is done updating the project close it.
  4. Checkout http://nestorurquiza.googlecode.com/svn/trunk/talend/DbImport/ in a temporary directory
  5. Copy all files from the temporary directory to the workspace DbImport folder
  6. I have included a sample Excel employees.xls with the project. Create folder c:\projects\DbImport and drop the excel there
  7. Start Talend and open project DbImport
  8. Run PopulateAll job and confirm offices, departments and employees have been added to “myapp” database
  9. Note: Even if no changes are made after closing the IDE you will get differences in some projects files. It is good idea to always update from svn before starting to work on a project as other developers might commit their local project files to the repository.

Let us review in detail what I have done in this simple project.

Showcase

  1. Go ahead and create a mySQL database named “myapp”. We are going to use the simple model we created on http://thinkinginsoftware.blogspot.com/2010/09/jasper-real-time-report-services.html
  2. We want to import employees from an existing Excel spreadsheet into our new database.
  3. Our Excel import file contains a de-normalized data we need to import into our normalized tables. The columns are:
first_name
last_name
office_name
department_name

Using Talend / Jasper ETL

  1. Install Talend (I am using version 4.1.0) or Jasper ETL (I have tested version 3.2.3 which at least in MAC OSX needs a little tweak). Install it near the File System root. I will describe here everything for Windows OS but you should be able to to the same in other OS. The important is to keep paths reusable through your team. So install Talend then in “c:\talend”
  2. Start the program (If using Windows XP use the executable, for example TalendOpenStudio-win32-x86.exe. The Eclipse -wpf- version flicks).Set up a connection. Click on the button near “Repository” (In Talend is “Email” button), provide your email (this will end up stamped in many files from now on so use a real personal or work email). For the workspace folder use a common folder that other users can later use as well in their own machines for example “C:\projects\talend”. You will need to hit “restart” if using Talend and changing the workspace.
  3. Select “Create a new Local project” from “Project” section and click on “Go”. Use as project name “DbImport” and as language generation “Java”. Pick the new project from the last dropdown. Click on “Open”.
  4. After you close the welcome windows you should see the “Window|Perspective|Design Workspace”. Right click on Job Designs on the left and create the first Job called “PopulateLookup” with Purpose “Import from Excel to lookup MySQL tables”. This is a job that will populate department and office tables. We need office_id and department_id for the employee table, that is why we must be sure the department and office exist in the DB before.
  5. On the repository View (left panel by default) right click on Metadata/Db Connection and create a MySQL connection to the database containing the tables employee, department and office tables. Right click on the connection and “retrieve schema” for the three tables.
  6. Create an Excel file named c:\projects\DbImport\employees.xls containing the data in the appendix.
  7. Right click on Metadata/File Excel and point to a local Excel file (Use Path button to point to the file)
  8. Select the sheet, click next, select “Set heading row as column names”, click next and select as name “employee”
  9. Drag and drop one by one (into the job area) the department and office metadata (when prompted select tMySQLOutput)
  10. Look for the Palette (Components Library). If it is not showing up use “Window|Show View|General|Palette.”
  11. Drag and drop a tMap component from the palette “Processing” section.
  12. Right click on the employee inner square (you must select the inner square otherwise the option will not be available) and select “Row|Main” a line will be started and will end wherever you click as final destination component. In this case click on the tMap component.
  13. Right click on tMap, select “Row|New Output” and drop the line into department and name it outputDepartment. Do the same for Office. When prompted to get the schema from the target respond yes as that helps to see the available destination fields.
  14. Double click on the tMap and drag and drop the fields from the input to the output
  15. Run the Job from the Run tab. If there are problems the specific component will be red and double clicking on it will show up a description of the problem. You might notice that is the situation as we have specified office_name instead of just name as the destination field in the tMap, so correct that and rerun.
  16. Save your job and create a second job named “PopulateEmployee” purpose “Populate table employee”
  17. Drop department and office boxes into the working area. Be sure to select type tMySQLInput
  18. Drop the employee Excel
  19. Drop a tMap
  20. Drop the employee MySQL as tMySQLOutput
  21. Create input and output connections as explained before. Use naming conventions for example inputEmployee, inputOffice, inputDepartment and outputEmployee
  22. Open the tMap and in the input panel drag and drop the inputOffice.name to inputEmployee.office_name and inputDepartment.name to inputEmployee.department_name. Here you are defining the necessary joins from input sources.
  23. Drag and drop inputOffice.id, inputDepartment.id, inputEmployee.first_name and inputEmployee.last_name into the output panel left colum right next to the destination field.
  24. Run the project to get the data imported. Check the data from the mySQL tables
  25. Of course both jobs are related. We want to run PopulateLookup and then later PopulateEmployee. That is why we need to create a third job now. Name it “PopulateAll”
  26. Drop two components type “tRunJob” from the palette. From the Component tab select for the first “PopulateLookup” and for the second “PopulateEmployee”
  27. Right click on the first and select “Row|Main”. Drop the line into the second sub job.
  28. Cleanup the records from the database so you can see all recreated.


  29. delete from office;
    delete from employee;
    delete from office;
    
  30. Run “populateAll” job and your data will be in the destination.

Sharing the project

Talend and so JasperETL are designed in a way that they have version control through a server. To avoid using an extra sever you could use export/import (but that would be limiting):
1. To export: Right click on “Business Models” and export “all” to the root folder (in our case C:\projects\) that is shared let us say on a subversion repository. This will create/update “c:\projects\DbImport”. Now you can share that on SVN.

2. To import: Checkout from SVN. Go to Talend and import.

As a better option you can (at least in Talend version 4.1.0) share the whole project (which does not include any binaries) Below is the list of all files for the project in this tutorial:
|-- TDQ_Data Profiling
|   |-- Analyses
|   `-- Reports
|-- TDQ_Libraries
|   |-- Indicators
|   |-- JRXML Template
|   |-- Patterns
|   `-- Rules
|-- businessProcess
|-- businessProcessSVG
|-- code
|   |-- jobscripts
|   |-- routines
|   |   `-- system
|   |       |-- DataOperation_0.1.item
|   |       |-- DataOperation_0.1.properties
|   |       |-- Mathematical_0.1.item
|   |       |-- Mathematical_0.1.properties
|   |       |-- Numeric_0.1.item
|   |       |-- Numeric_0.1.properties
|   |       |-- Relational_0.1.item
|   |       |-- Relational_0.1.properties
|   |       |-- StringHandling_0.1.item
|   |       |-- StringHandling_0.1.properties
|   |       |-- TalendDataGenerator_0.1.item
|   |       |-- TalendDataGenerator_0.1.properties
|   |       |-- TalendDate_0.1.item
|   |       |-- TalendDate_0.1.properties
|   |       |-- TalendString_0.1.item
|   |       `-- TalendString_0.1.properties
|   `-- snippets
|-- components
|-- context
|-- documentations
|-- images
|   |-- job_outlines
|   `-- joblet_outlines
|-- joblets
|-- libs
|-- metadata
|   |-- BRMSconnections
|   |-- FTPconnections
|   |-- LDAPSchema
|   |-- MDMconnections
|   |-- SalesforceSchema
|   |-- WSDLSchema
|   |-- connections
|   |   |-- myapp_0.1.item
|   |   `-- myapp_0.1.properties
|   |-- fileDelimited
|   |-- fileEBCDIC
|   |-- fileExcel
|   |   |-- employee_0.1.item
|   |   `-- employee_0.1.properties
|   |-- fileHL7
|   |-- fileLdif
|   |-- filePositional
|   |-- fileRegex
|   |-- fileXml
|   |-- genericSchema
|   |-- header_footer
|   |-- rules
|   `-- sapconnections
|-- process
|   |-- PopulateEmployee_0.1.item
|   |-- PopulateEmployee_0.1.properties
|   |-- PopulateLookup_0.1.item
|   |-- PopulateLookup_0.1.properties
|   |-- populateAll_0.1.item
|   `-- populateAll_0.1.properties
|-- sqlPatterns
|   |-- Generic
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- Aggregate_0.1.item
|   |       |-- Aggregate_0.1.properties
|   |       |-- Commit_0.1.item
|   |       |-- Commit_0.1.properties
|   |       |-- DropSourceTable_0.1.item
|   |       |-- DropSourceTable_0.1.properties
|   |       |-- DropTargetTable_0.1.item
|   |       |-- DropTargetTable_0.1.properties
|   |       |-- FilterColumns_0.1.item
|   |       |-- FilterColumns_0.1.properties
|   |       |-- FilterRow_0.1.item
|   |       |-- FilterRow_0.1.properties
|   |       |-- MergeInsert_0.1.item
|   |       |-- MergeInsert_0.1.properties
|   |       |-- MergeUpdate_0.1.item
|   |       |-- MergeUpdate_0.1.properties
|   |       |-- Rollback_0.1.item
|   |       `-- Rollback_0.1.properties
|   |-- Hive
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- HiveAggregate_0.1.item
|   |       |-- HiveAggregate_0.1.properties
|   |       |-- HiveCreateSourceTable_0.1.item
|   |       |-- HiveCreateSourceTable_0.1.properties
|   |       |-- HiveCreateTargetTable_0.1.item
|   |       |-- HiveCreateTargetTable_0.1.properties
|   |       |-- HiveDropSourceTable_0.1.item
|   |       |-- HiveDropSourceTable_0.1.properties
|   |       |-- HiveDropTargetTable_0.1.item
|   |       |-- HiveDropTargetTable_0.1.properties
|   |       |-- HiveFilterColumns_0.1.item
|   |       |-- HiveFilterColumns_0.1.properties
|   |       |-- HiveFilterRow_0.1.item
|   |       `-- HiveFilterRow_0.1.properties
|   |-- MySQL
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- MySQLAggregate_0.1.item
|   |       |-- MySQLAggregate_0.1.properties
|   |       |-- MySQLCreateSourceTable_0.1.item
|   |       |-- MySQLCreateSourceTable_0.1.properties
|   |       |-- MySQLCreateTargetTable_0.1.item
|   |       |-- MySQLCreateTargetTable_0.1.properties
|   |       |-- MySQLDropSourceTable_0.1.item
|   |       |-- MySQLDropSourceTable_0.1.properties
|   |       |-- MySQLDropTargetTable_0.1.item
|   |       |-- MySQLDropTargetTable_0.1.properties
|   |       |-- MySQLFilterColumns_0.1.item
|   |       |-- MySQLFilterColumns_0.1.properties
|   |       |-- MySQLFilterRow_0.1.item
|   |       `-- MySQLFilterRow_0.1.properties
|   |-- Netezza
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- NetezzaAggregate_0.1.item
|   |       |-- NetezzaAggregate_0.1.properties
|   |       |-- NetezzaCreateSourceTable_0.1.item
|   |       |-- NetezzaCreateSourceTable_0.1.properties
|   |       |-- NetezzaCreateTargetTable_0.1.item
|   |       |-- NetezzaCreateTargetTable_0.1.properties
|   |       |-- NetezzaDropSourceTable_0.1.item
|   |       |-- NetezzaDropSourceTable_0.1.properties
|   |       |-- NetezzaDropTargetTable_0.1.item
|   |       |-- NetezzaDropTargetTable_0.1.properties
|   |       |-- NetezzaFilterColumns_0.1.item
|   |       |-- NetezzaFilterColumns_0.1.properties
|   |       |-- NetezzaFilterRow_0.1.item
|   |       `-- NetezzaFilterRow_0.1.properties
|   |-- Oracle
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- OracleAggregate_0.1.item
|   |       |-- OracleAggregate_0.1.properties
|   |       |-- OracleCreateSourceTable_0.1.item
|   |       |-- OracleCreateSourceTable_0.1.properties
|   |       |-- OracleCreateTargetTable_0.1.item
|   |       |-- OracleCreateTargetTable_0.1.properties
|   |       |-- OracleDropSourceTable_0.1.item
|   |       |-- OracleDropSourceTable_0.1.properties
|   |       |-- OracleDropTargetTable_0.1.item
|   |       |-- OracleDropTargetTable_0.1.properties
|   |       |-- OracleFilterColumns_0.1.item
|   |       |-- OracleFilterColumns_0.1.properties
|   |       |-- OracleFilterRow_0.1.item
|   |       |-- OracleFilterRow_0.1.properties
|   |       |-- OracleMerge_0.1.item
|   |       `-- OracleMerge_0.1.properties
|   |-- ParAccel
|   |   |-- UserDefined
|   |   `-- system
|   |       |-- ParAccelAggregate_0.1.item
|   |       |-- ParAccelAggregate_0.1.properties
|   |       |-- ParAccelCommit_0.1.item
|   |       |-- ParAccelCommit_0.1.properties
|   |       |-- ParAccelDropSourceTable_0.1.item
|   |       |-- ParAccelDropSourceTable_0.1.properties
|   |       |-- ParAccelDropTargetTable_0.1.item
|   |       |-- ParAccelDropTargetTable_0.1.properties
|   |       |-- ParAccelFilterColumns_0.1.item
|   |       |-- ParAccelFilterColumns_0.1.properties
|   |       |-- ParAccelFilterRow_0.1.item
|   |       |-- ParAccelFilterRow_0.1.properties
|   |       |-- ParAccelRollback_0.1.item
|   |       `-- ParAccelRollback_0.1.properties
|   `-- Teradata
|       |-- UserDefined
|       `-- system
|           |-- TeradataAggregate_0.1.item
|           |-- TeradataAggregate_0.1.properties
|           |-- TeradataColumnList_0.1.item
|           |-- TeradataColumnList_0.1.properties
|           |-- TeradataCreateSourceTable_0.1.item
|           |-- TeradataCreateSourceTable_0.1.properties
|           |-- TeradataCreateTargetTable_0.1.item
|           |-- TeradataCreateTargetTable_0.1.properties
|           |-- TeradataDropSourceTable_0.1.item
|           |-- TeradataDropSourceTable_0.1.properties
|           |-- TeradataDropTargetTable_0.1.item
|           |-- TeradataDropTargetTable_0.1.properties
|           |-- TeradataFilterColumns_0.1.item
|           |-- TeradataFilterColumns_0.1.properties
|           |-- TeradataFilterRow_0.1.item
|           |-- TeradataFilterRow_0.1.properties
|           |-- TeradataTableList_0.1.item
|           `-- TeradataTableList_0.1.properties
|-- talend.project
`-- temp

Unfortunately SVN support is not included in the IDE. Following some steps though you can still share the project.

Commit the project to SVN

  1. Create “DbImport” project as explained before.
  2. Delete temp directory
  3. Import in your SVN
  4. Checkout the project from SVN
  5. Add svn:ignore for the temp directory (svn propset svn:ignore "temp" .)
  6. Commit the project.

Check out the project from SVN

  1. Create a new local “DbImport” project. Close the IDE.
  2. Outside the workspace folder checkout “DbImport” from SVN.
  3. Replace the content of the workspace “DbImport” directory with the checked from SVN files.
  4. Open the IDE and modify the project as you wish.
  5. Close the IDE and use svn update and/or commit commands as you need.

Documentation

  1. http://sourceforge.net/projects/jasperetl/files/
  2. http://talend.dreamhosters.com/tos/user-guide-download/V402/DocumentationSet_UG&RG_40b_EN.zip
  3. Help from the GUI

Appendix


first_namelast_nameoffice_namedepartment_name
JohnSmithLondonLegal
MathewParkerUSAMarketing
AndreaPoliniRomeSales

No comments:

Followers