Tutorial: Counting Words in File(s) Using MapReduce
1 Introduction
In order to run an application, a job client submits the job, which can be a JAR file or an executable, to a single master in Hadoop called the ResourceManager. This master then distributes, configures, monitors, and schedules the tasks across the nodes. Moreover, all the files the framework operates on need to reside in the Hadoop Distributed File System (HDFS); the user has to feed input files into an HDFS directory, and the output files will also be saved in HDFS directories.
This tutorial will walk through these main steps by running an application that counts the number of words in file(s). The application will be run in a Single Node setup.
Note:
For the purpose of this tutorial, the application is run on an Ubuntu 12.04 Linux virtual machine.
Username: hadoop
Password: hadoop
2 Setup
2.1 Prerequisites:
1. A Linux system or virtual machine.
2. Java must be installed on the system.
3. ssh, sshd, and rsync must be installed. Link
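One can first check whether the environment variables JAVA_HOME and HADOOP_CLASSPATH are already set. A minimal check, assuming a Bash shell:
> echo $JAVA_HOME
> echo $HADOOP_CLASSPATH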
If the variables are empty, the commands above will simply return a blank line.
In order to set the correct path for JAVA_HOME, first find the appropriate version of the Java compiler.
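For example, typing the following command (assuming javac is on the PATH) reports the installed compiler version:
> javac -version
javac 1.7.0_95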
As the version of the Java compiler here is 1.7.0_95, the environment variable JAVA_HOME can be updated as below.
> export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
After updating the above variable, one can set the HADOOP_CLASSPATH variable as follows:
> export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
Later, one can check whether the variables indeed contain the intended values:
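Assuming the values set above, the check should look like this:
> echo $JAVA_HOME
/usr/lib/jvm/java-1.7.0-openjdk-amd64
> echo $HADOOP_CLASSPATH
/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar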
Note:
/usr/lib/jvm/java-1.7.0-openjdk-amd64 is the actual path pointing to the Java files residing on this system; it may differ on other installations.
2.4 Checking bin/hadoop
Now, for the next step, navigate to the folder that contains the Hadoop framework by simply typing the following:
> cd ~/Desktop/hadoop-2.7.2
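To confirm that the framework is usable, one can run the hadoop script without any arguments; it should display its usage documentation:
> bin/hadoop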
2.5 Configurations
Before continuing, some simple configurations need to be performed. Edit the files core-site.xml and hdfs-site.xml, which can be found at ~/Desktop/hadoop-2.7.2/etc/hadoop/.
Add the details mentioned below to the respective files. To do so, type the following command; it will open gedit, which is a text editor:
> gedit etc/hadoop/core-site.xml
Add the following details; refer to the screenshot below for further clarification:
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save the file (Ctrl + S) and then close it. Repeat the procedure for the hdfs-site.xml file as well; its configuration details are mentioned below.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2.6 Check ssh to localhost
In order to start the daemons, one first needs to check ssh to localhost:
> ssh localhost
If ssh to localhost is not successful after typing y or Yes, then passphraseless ssh needs to be set up.
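The following commands, taken from the standard Hadoop single-node setup, generate a key pair and authorize it:
> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> chmod 0600 ~/.ssh/authorized_keys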
If prompted, enter the password. The screenshot below shows the prompts to enter the password.
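Once ssh to localhost succeeds, format the filesystem and start the NameNode and DataNode daemons. These are the standard steps from the Hadoop single-node guide, run from within the Hadoop folder:
> bin/hdfs namenode -format
> sbin/start-dfs.sh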
Check the web interface for the NameNode; by default, it is available at http://localhost:50070/.
Note:
The daemons can be stopped by typing the following command; however, it is recommended to keep them running while the MapReduce application is in use.
> sbin/stop-dfs.sh
3 Execution Steps:
3.1 Compiling WordCount.java
In order to continue, one needs to create a local directory for the application, where the .java files and input files can be stored. One can create this directory outside the directory containing the Hadoop framework. Type the following:
> mkdir ../tutorial01
Next, the snippet of code below can be pasted into a file called WordCount.java; this file should reside in the newly created directory. To do that, open a text editor (e.g. gedit) on a new file called WordCount.java, copy in the snippet provided below, then save and close. Type the following command:
> gedit ../tutorial01/WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
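The imports above are only the beginning of WordCount.java. The rest of the file follows the standard WordCount example from the official Hadoop MapReduce tutorial, reproduced below for completeness:

public class WordCount {

  // The mapper tokenizes each input line and emits (word, 1) pairs.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reducer sums the counts for each word; it also serves as a combiner.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Configures and submits the job; args[0] is the input path, args[1] the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}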
Save the file (Ctrl + S) and then close it. The following screenshot shows the same.
The next step is to compile the code and create a JAR file. Before that, please copy the Java file to the current directory by typing the following:
> cp ../tutorial01/WordCount.java .
> bin/hadoop com.sun.tools.javac.Main WordCount.java
> jar cf wc.jar WordCount*.class
This operation will create several files. To check, perform a listing of the current directory, sorted by the most recently created files.
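Assuming a standard shell, such a listing can be obtained with:
> ls -lt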
The above commands will display details similar to those in the screenshot below:
3.2 Creating Directories in HDFS
Now, after the above steps, one needs to create directories for the current application in the Hadoop Distributed File System (HDFS). As the directories are not present yet, one must create them one by one, as shown below.
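Assuming the paths used in the rest of this tutorial, the directories can be created level by level as follows (alternatively, the -p option of -mkdir creates the whole path in one step):
> bin/hdfs dfs -mkdir /user
> bin/hdfs dfs -mkdir /user/hadoop
> bin/hdfs dfs -mkdir /user/hadoop/wordcount
> bin/hdfs dfs -mkdir /user/hadoop/wordcount/input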
One can always check the contents of the HDFS directories. For example, to determine the directories within the /user/hadoop directory, simply type the following command:
> bin/hdfs dfs -ls /user/hadoop
3.3 Creating Input Files
Now, in the local file system, one creates the input files. These files are filled with the texts viz. "Hello World Bye World" and "Hello Hadoop GoodBye Hadoop" for file01 and file02 respectively, as shown below.
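A minimal way to create these files, assuming the ../tutorial01 directory created earlier:
> echo "Hello World Bye World" > ../tutorial01/file01
> echo "Hello Hadoop GoodBye Hadoop" > ../tutorial01/file02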
Move these files to the HDFS input directory using the following command:
> bin/hadoop fs -copyFromLocal ../tutorial01/file0* /user/hadoop/wordcount/input
One can also verify that the files were copied correctly. Simply type the commands below; they will display the contents of the files.
> bin/hdfs dfs -cat /user/hadoop/wordcount/input/file01
> bin/hdfs dfs -cat /user/hadoop/wordcount/input/file02
The screenshot above shows that file01 and file02 have been moved successfully to the input directory residing in HDFS.
3.4 Execute the JAR
Run the following command to execute the JAR:
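Assuming the wc.jar file built earlier and the HDFS directories created above:
> bin/hadoop jar wc.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output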
This will create a directory /user/hadoop/wordcount/output and two files within it, viz. _SUCCESS and part-r-00000.
The output of this application, i.e. the word counts, will be stored in the part-r-00000 file.
One can view the contents of the file just like above:
> bin/hdfs dfs -cat /user/hadoop/wordcount/output/part*
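For the two input files created above, the expected contents of the output file are:
Bye 1
GoodBye 1
Hadoop 2
Hello 2
World 2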
4 Errors
1. If ssh localhost is not working and it reports that ssh is not installed, then please enter the following:
> sudo apt-get install ssh
Enter password if prompted.
2. If there is an error while executing a command, please check the variables JAVA_HOME and HADOOP_CLASSPATH, as described in the setup section above. Then reset their values and proceed.
3. If there is a ClassNotFoundException, then please find the details in the link here: LINK. Alternatively, one could compile and save the .jar file within the same source directory instead of ../tutorial01/.
4. If an error like the following appears on trying to create directories in HDFS, viz.
5 References
The references for running this application were found on the Hadoop website. Moreover, some difficulties were faced while setting up and running the application; the contents of the following websites were quite useful: