Mar 21, 2019

Talend Introduction & Tutorial: Merging Files with the Same Schema

Extract, Transform, Load (ETL) is the process of extracting data from various sources, organizing it, and storing it in a single database for later use, such as decision making and business insights. ETL used to be done by hand-coding in SQL or .NET, but today many ETL tools are available that simplify the process. ETL is generally used for data migration, data replication, operational processes, data transformation, and data synchronization.

ETL Process

Many ETL tools are available on the market, both commercial and open source, including Informatica PowerCenter, IBM InfoSphere Information Server, Oracle Data Integrator, Microsoft SQL Server Integration Services (SSIS), Ab Initio, Sybase ETL, and many more.

ETL plays a big role in the web scraping process. Data scraped from public websites or other sources is not always well formatted, and sometimes it is messy. ETL tools such as Talend help to transform the data into the required format, validate it, merge it, and load it into a database such as MySQL, SQLite, Oracle, or a NoSQL store, or into a storage target such as Amazon S3, FTP, Azure, or Dropbox.
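To make the transform, validate, and load steps concrete, here is a minimal sketch in plain Python rather than Talend. The sample rows, the `products` table, and the function names are all hypothetical and only illustrate the idea of cleaning scraped values before loading them into SQLite.

```python
import sqlite3

# Hypothetical scraped rows: inconsistent whitespace, casing, and price formats.
RAW_ROWS = [
    {"name": "  Widget A ", "price": "$1,200.50"},
    {"name": "widget b", "price": "300"},
    {"name": "", "price": "n/a"},  # invalid: empty name and unparseable price
]


def transform(row):
    """Normalize one scraped row; return None if it fails validation."""
    name = row["name"].strip().title()
    price_text = row["price"].replace("$", "").replace(",", "").strip()
    try:
        price = float(price_text)
    except ValueError:
        return None  # reject rows whose price is not a number
    if not name:
        return None  # reject rows without a name
    return (name, price)


def load(rows, conn):
    """Create the target table if needed and insert the cleaned rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
    conn.commit()


def run_etl(raw_rows, conn):
    """Transform and validate every row, load the valid ones, return their count."""
    cleaned = [t for t in map(transform, raw_rows) if t is not None]
    load(cleaned, conn)
    return len(cleaned)
```

Calling `run_etl(RAW_ROWS, sqlite3.connect(":memory:"))` loads the two valid rows and skips the invalid one. An ETL tool offers the same kind of steps as configurable components instead of hand-written functions.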

Talend is one of the best free, open source ETL tools available in this era of big data. It easily integrates various types of data sources, including CSV files, spreadsheets, databases, and almost all cloud-based or on-premises data warehouse solutions. Talend makes the data warehouse developer's job easier and more enjoyable. It also offers extended functionality for data cleansing, data profiling, Enterprise Application Integration (EAI), big data, data quality, and master data management.

Some of the key features of Talend:

  • It has a free, open source Community version, which gives enhanced flexibility.
  • It is widely used for data integration.
  • It has more than 900 built-in components for connecting to various data sources.
  • It provides an easy-to-use drag-and-drop graphical user interface.
  • It can be easily deployed to a single-cloud, multi-cloud, or hybrid-cloud environment.
  • Community support: Community.Talend.com is Talend’s technical community site. Sections available to users include a support forum, a wiki, a bug tracker, components, tutorials, and the translation tool.

Here is a real use case: merging multiple Excel files using Talend, which is easy and straightforward. Done manually, this would take many human hours; the alternative would be to write code in another language to achieve the same result.

The primary components used in this job are:

  • tFileList iterates over the files in a directory that match a file mask. Here it supplies the files, all with the same schema, whose data will be merged into a single file or a database using Talend Open Studio for Data Integration.
  • tFlowToIterate processes the files one by one.
  • tIterateToFlow holds the data until all the files have been processed and the data from all of them has been fetched.
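The iterate-then-collect pattern these components describe can be sketched outside Talend in plain Python. This is only an analogy, not how Talend implements the components: the function names are hypothetical, and CSV files stand in for Excel files so that the sketch needs nothing beyond the standard library.

```python
import csv
import fnmatch
import os


def list_files(directory, mask):
    """Yield paths in `directory` whose names match `mask` (the file-list step)."""
    for name in sorted(os.listdir(directory)):
        if fnmatch.fnmatch(name, mask):
            yield os.path.join(directory, name)


def read_rows(path):
    """Read one file and yield its data rows as dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def collect_rows(directory, mask):
    """Process matching files one by one and gather every row into one list."""
    collected = []
    for path in list_files(directory, mask):  # one iteration per file
        collected.extend(read_rows(path))     # accumulate until all files are done
    return collected
```

For example, `collect_rows("input", "*.csv")` would return the rows of every CSV file in a hypothetical `input` directory, in file-name order, while ignoring files that do not match the mask.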

Talend Merge Files

 

There were 2,000 Excel files, each with 10 columns, that needed to be merged into one big Excel file for data analysis. This job assumes that all the files have the same schema. Once the job finishes, it generates one big file containing all the data.
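For comparison, a hand-coded version of the same merge might look like the sketch below. It is an assumption-laden stand-in, not the Talend job itself: it uses CSV instead of Excel to stay within the standard library, `merge_same_schema` is a hypothetical name, and it treats "same schema" as "identical header row".

```python
import csv
import glob
import os


def merge_same_schema(input_dir, mask, output_path):
    """Merge every file matching `mask` into `output_path`; return rows written."""
    paths = sorted(glob.glob(os.path.join(input_dir, mask)))
    # Never read the output file as an input, even if it matches the mask.
    paths = [p for p in paths if os.path.abspath(p) != os.path.abspath(output_path)]
    expected_header = None
    written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if expected_header is None:
                    expected_header = header
                    writer.writerow(header)  # write the header only once
                elif header != expected_header:
                    raise ValueError(f"schema mismatch in {path}: {header}")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

The explicit header check makes the "same schema" assumption fail loudly: a file with different columns raises an error instead of silently corrupting the merged output.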

If there are multiple files with different schemas and you want to merge them into one, you need to use the tMap component to normalize the data into a common format and then apply the merge process to it.
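The normalization step can be pictured as a column mapping applied per source before merging. The sketch below shows only that one idea in plain Python, and tMap itself can do considerably more. The source names, column names, and target schema are hypothetical.

```python
TARGET_COLUMNS = ["id", "name", "city"]

# Hypothetical per-source mappings: source column -> target column.
MAPPINGS = {
    "crm": {"customer_id": "id", "full_name": "name", "town": "city"},
    "shop": {"ID": "id", "Name": "name"},  # this source has no city column
}


def normalize(row, mapping, target_columns=TARGET_COLUMNS):
    """Rename a row's columns to the target schema, filling gaps with ''."""
    renamed = {mapping[key]: value for key, value in row.items() if key in mapping}
    return {column: renamed.get(column, "") for column in target_columns}


def merge_sources(sources):
    """`sources` maps a source name to its rows; return all rows in one schema."""
    merged = []
    for source_name, rows in sources.items():
        mapping = MAPPINGS[source_name]
        merged.extend(normalize(row, mapping) for row in rows)
    return merged
```

Columns that a source lacks are filled with empty strings and unmapped columns are dropped, so every output row has exactly the target columns and the same-schema merge can then be applied.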

I will be writing more articles and publishing more video tutorials on Talend in the near future.
