
Lab 3 on Talend Open Studio and Apache Hadoop

INFO703 – Big Data and Analytics - Lab

In this lab you will work with Talend Open Studio and Apache Hadoop to learn the map/reduce model and run some examples.
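As a rough reminder of the map/reduce model, word counting can be sketched in plain Python. This is only an illustration of the three phases (map, shuffle, reduce), not Hadoop code; the function names and the sample input are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: combine all values of one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["big data big", "data lab"]  # made-up input for illustration
    print(word_count(sample))
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the framework performs the shuffle; here everything runs in one process to keep the idea visible.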

Talend Open Studio is already installed in the virtual machine you imported into Oracle VM VirtualBox. To start it, run these commands in a terminal:

cd talend

cd tos_bd-20161216_1026-v6.3.1

./TOS_BD-linux-gtk-x86_64

Select local_project2 and continue. In the left pane, expand Job Designs and choose Wordcount. Then run the example and look at the output:

Big Data Sample Assignments 3 img1

This is the WordCount example, which uses the Map/Reduce paradigm and which you already ran on Hadoop. As you can see in the design window, several components are connected to each other to accomplish the job. It uses the following components:

- tFileInputDelimited: It reads the input file, defines the data structure, and sends the data to the next component. If you double-click on it, you will see its properties, such as the source file and the delimiter (see the figure below):

Big Data Sample Assignments 3 img2

- tNormalize: It normalizes the data by splitting a multi-valued field into separate rows, one per value. Double-click on it and click on Schema to see the normalization task associated with this component.

Big Data Sample Assignments 3 img3

- tAggregateRow: It groups rows and applies an aggregation function to each group. In this example, the aggregation function is count:

Big Data Sample Assignments 3 img4

- tMap: It transforms and routes data from single or multiple sources to single or multiple destinations. Double-click on it to see its transformation job:

Big Data Sample Assignments 3 img5

- tLogRow: It displays the processed data in the Run console so that you can monitor it.

Big Data Sample Assignments 3 img6
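The chain of components described above can be imitated with a few Python functions, one per component. This is a loose sketch and not the Java code that Talend generates: the function names are hypothetical, and the space separator and the `word|count` output format are assumptions, since the actual settings of the job are only visible in the component properties.

```python
from collections import Counter


def file_input_delimited(text):
    """Like tFileInputDelimited: yield one row per line of the input."""
    for line in text.splitlines():
        yield line


def normalize(rows, separator=" "):
    """Like tNormalize: split each row on a separator, one output row per item."""
    for row in rows:
        for item in row.split(separator):
            if item:  # skip empty items caused by repeated separators
                yield item


def aggregate_row(rows):
    """Like tAggregateRow with the count function: group by word and count."""
    return Counter(rows)


def t_map(counts):
    """Like tMap: reshape each (word, count) pair into an output row."""
    for word, count in counts.items():
        yield f"{word}|{count}"


def log_row(rows):
    """Like tLogRow: print every row so the result can be inspected."""
    for row in rows:
        print(row)


if __name__ == "__main__":
    sample_text = "a b a\nb c"  # made-up input standing in for the source file
    log_row(t_map(aggregate_row(normalize(file_input_delimited(sample_text)))))
```

Each function consumes the rows produced by the previous one, which mirrors how the links between components carry rows from left to right in the design window.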

For further information about the Talend Open Studio components see the following document:

docs.huihoo.com/talend/TalendOpenStudio_Components_RG_32a_EN.pdf

Another example you can see in the left pane is test 0.1, which is the WordCount example but reads the file from HDFS rather than from the local file system. To read the input from HDFS, use the tHDFSInput component; to write the output to HDFS, use the tHDFSOutput component.

Big Data Sample Assignments 3 img7

Double-click on tHDFSInput to see its settings:

Big Data Sample Assignments 3 img8
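Outside Talend, a file stored in HDFS can also be read from a script through Hadoop's command-line client. The sketch below wraps the `hdfs dfs -cat` command in Python; it is only a rough analogue of what tHDFSInput provides, not how the component works internally. The file path is hypothetical, so replace it with the path shown in the component settings.

```python
import shutil
import subprocess


def hdfs_cat_command(path):
    """Build the command line that prints an HDFS file to standard output."""
    return ["hdfs", "dfs", "-cat", path]


def read_hdfs_lines(path):
    """Run the command and return the file content as a list of lines."""
    result = subprocess.run(
        hdfs_cat_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise an error if the hdfs command fails
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    example_path = "/user/student/wordcount/input.txt"  # hypothetical path
    if shutil.which("hdfs") is None:
        print("hdfs command not found; run this inside the Hadoop virtual machine")
    else:
        for line in read_hdfs_lines(example_path):
            print(line)
```

The Hadoop services must be running for the command to succeed.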