Install Apache Hadoop Using Cloudera

Install Apache Hadoop – Single Node RHEL 7

by Mandeep K Sandhu. In Big data. 1 Comment on Install Apache Hadoop – Single Node RHEL 7

Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and provides high-throughput access to application data.

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you
can play around with the software and learn more about it.

Environment: This blog has been tested with the following software versions.

 RHEL (Red Hat Enterprise Linux 7.4) on VirtualBox 5.2
 Hadoop 2.7.3

Update the /etc/hosts file with the hostname and IP address.

[root@cdhs ~]# cat /etc/hosts


10.0.2.5 cdhs

Dedicated Hadoop system user:

After the VM is set up, add a non-sudo user dedicated to Hadoop, which will be used to
configure Hadoop. The following commands add the group hadoop and the user hduser to the VM.

[root@cdhs ~]# groupadd hadoop
[root@cdhs ~]# useradd -g hadoop hduser
# Password set for hduser
[root@cdhs ~]# passwd hduser
Changing password for user hduser.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Apache Software:
Download the Apache Hadoop software from the official site.

[root@cdhs yum.repos.d]# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
--2018-05-03 14:57:40--  https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Resolving archive.apache.org (archive.apache.org)... 163.172.17.199
Connecting to archive.apache.org (archive.apache.org)|163.172.17.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 214092195 (204M) [application/x-gzip]
Saving to: ‘hadoop-2.7.3.tar.gz’
100%[=========================================>] 214,092,195  6.33MB/s   in 46s
[root@cdhs ~]# mv hadoop-2.7.3.tar.gz /home/hduser
[root@cdhs hduser]# chown hduser:hadoop /home/hduser/*
[root@cdhs hduser]# su - hduser
[hduser@cdhs ~]$ tar zxvf hadoop-2.7.3.tar.gz
[hduser@cdhs ~]$ ln -s hadoop-2.7.3 hadoop

Install Java:

Hadoop is written in Java, so before installing Apache Hadoop we need to install Java first. To
install Java on your system, first download the JDK archive from the Oracle website.

wget "http:/ / download.oracle.co


[root@cdhs ~]# mv jdk-7u75-linu
[root@cdhs hduser]# su - hduse
Last login: Thu May 3 14:44:38

1 wget "http://download.oracle.com/otn-pub/java/jdk/jdk-7u75-linux-x64.tar"
2 [root@cdhs ~]# mv jdk-7u75-linux-x64.tar /home/hduser
3 [root@cdhs hduser]# su - hduser
4 Last login: Thu May 3 14:44:38 NZST 2018 on pts/0
5 [hduser@cdhs ~]$ tar -xvf jdk-7u75-linux-x64.tar.gz
6 [hduser@cdhs ~]$ ln -s jdk1.7.0_75 jdk

Once Java is installed on your system, you can check the version using the following command.

[hduser@cdhs ~]$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Now edit the .bash_profile file using your favorite editor and add the Hadoop and Java home directories.

[hduser@cdhs ~]$ cat .bash_profile
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/home/hduser/jdk
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin

# Initialize the variables using the following command.
[hduser@cdhs ~]$ source .bash_profile

After unpacking the downloaded Hadoop distribution, edit the file
$HADOOP_HOME/etc/hadoop/hadoop-env.sh to define the following parameter. It is important to
set the Java path here, otherwise Hadoop will not be able to find Java.

# set to the root of your Java installation
export JAVA_HOME=/home/hduser/jdk

Once done, check that the environment variables are set by running the following commands.

echo $JAVA_HOME
echo $HADOOP_HOME

Configuring Hadoop:

Hadoop has many configuration files, which are located in the $HADOOP_HOME/etc/hadoop
directory. As we are installing Hadoop on a single node in pseudo-distributed mode, we need to
edit a few of these files for Hadoop to work.

The first file is core-site.xml, which contains the HDFS URI (host and port) used by the cluster.

[hduser@cdhs hadoop]$ vi core-site.xml

<configuration>
  <property>
    <description>NameNode Setup</description>
    <name>fs.defaultFS</name>
    <value>hdfs://10.0.2.5:8020</value>
  </property>
</configuration>

The next file, hdfs-site.xml, holds the configuration for the NameNode and DataNode. On a single node, dfs.replication is set to 1, and the dfs.blocksize value of 134217728 bytes corresponds to a 128 MB block size.

[hduser@cdhs hadoop]$ vi hdfs-site.xml

<configuration>
  <property>
    <description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
    <name>dfs.namenode.name.dir</name>
    <value>/data/dfs/nn</value>
  </property>

  <property>
    <description>List of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dfs/dn</value>
  </property>

  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/dfs/snn</value>
  </property>

  <property>
    <description>Secondary NameNode</description>
    <name>dfs.namenode.secondary.http-address</name>
    <value>10.0.2.5:50090</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>

Next, edit yarn-site.xml, which holds the configuration for the ResourceManager and NodeManager.

[hduser@cdhs hadoop]$ vi yarn-site.xml

<configuration>
  <property>
    <description>Shuffle service that needs to be set for Map Reduce applications.</description>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <description>aggregation logs</description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///data/yarn/nm-local-dir</value>
  </property>

  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///var/log/hadoop-yarn/containers</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://10.0.2.5:8020/var/log/hadoop-yarn/apps</value>
  </property>

  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
  </property>

  <property>
    <description>ResourceManager address for clients to submit jobs.</description>
    <name>yarn.resourcemanager.address</name>
    <value>10.0.2.5:8032</value>
  </property>

  <property>
    <description>ResourceManager Host:port for applicationMaster to talk to Scheduler to obtain resources.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>10.0.2.5:8030</value>
  </property>

  <property>
    <description>ResourceManager Host:port for NodeManagers</description>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>10.0.2.5:8025</value>
  </property>
</configuration>

Now copy mapred-site.xml.template to mapred-site.xml using the following command, and edit
mapred-site.xml for MapReduce applications and the JobHistory server.

[hduser@cdhs hadoop]$ cp mapred-site.xml.template mapred-site.xml
[hduser@cdhs hadoop]$ vi mapred-site.xml

<configuration>
  <property>
    <description>Execution framework set to Hadoop YARN.</description>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <description>MapReduce JobHistory server.</description>
    <name>mapreduce.jobhistory.address</name>
    <value>10.0.2.5:10020</value>
  </property>

  <property>
    <description>MapReduce JobHistory server web UI.</description>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>10.0.2.5:19888</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
</configuration>

Finally, add the node's IP address to the slaves file.

[hduser@cdhs hadoop]$ vi slaves
10.0.2.5

Now create the local directories used by the NameNode, DataNode, Secondary NameNode, and YARN, and give ownership of them to hduser, using the following commands.

[root@cdhs sf_Share_VM]# mkdir -p /data/dfs/nn
[root@cdhs sf_Share_VM]# mkdir -p /data/dfs/dn
[root@cdhs sf_Share_VM]# mkdir -p /data/dfs/snn
[root@cdhs sf_Share_VM]# chown -R hduser:hadoop /data/dfs
[root@cdhs sf_Share_VM]# mkdir -p /data/yarn/nm-local-dir
[root@cdhs sf_Share_VM]# chown -R hduser:hadoop /data/yarn

Now configure SSH keys for the new user so that hduser can log in to the node over SSH without a password.

[hduser@cdhs ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:wNDLCv5DMd+kUDcJQ//+m9sb5ZgA9iyrIlMxejmNmUo hduser@cdhs
The key's randomart image is:
+---[RSA 2048]----+
| .o+. .          |
| ooo+            |
| ooo..o          |
| . + =..o +      |
| . . B %S o + .  |
| . E @ o. o . =  |
| + + . o + .     |
|          *. ..o.|
|        + .. =oo.|
+----[SHA256]-----+
[hduser@cdhs ~]$ ssh-copy-id -i .ssh/id_rsa.pub 10.0.2.5
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: ".ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hduser@10.0.2.5's password:

Number of key(s) added: 1
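As a quick optional check (my addition, not part of the original post), confirm that passwordless SSH now works for hduser; the command should print the hostname (cdhs) without asking for a password.

[hduser@cdhs ~]$ ssh 10.0.2.5 "hostname -f"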

We have now configured Hadoop to work as a single-node cluster. Before starting it, initialize the
HDFS filesystem by formatting the NameNode directory, running the following command from the
$HADOOP_HOME/bin directory.

[hduser@cdhs ~]$ cd hadoop/bin/
[hduser@cdhs bin]$ hdfs namenode -format
--- End lines only
18/05/03 16:08:35 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
18/05/03 16:08:35 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
18/05/03 16:08:35 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/05/03 16:08:35 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/05/03 16:08:35 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/05/03 16:08:35 INFO util.GSet: VM type = 64-bit
18/05/03 16:08:35 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
18/05/03 16:08:35 INFO util.GSet: capacity = 2^15 = 32768 entries
18/05/03 16:08:35 INFO namenode.FSImage: Allocated new BlockPoolId: BP-644753539-10.0.2.5-1525320515324
18/05/03 16:08:35 INFO common.Storage: Storage directory /data/dfs/nn has been successfully formatted.
18/05/03 16:08:35 INFO namenode.FSImageFormatProtobuf: Saving image file /data/dfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
18/05/03 16:08:35 INFO namenode.FSImageFormatProtobuf: Image file /data/dfs/nn/current/fsimage.ckpt_0000000000000000000 of size 353 bytes saved in 0 seconds.
18/05/03 16:08:35 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/05/03 16:08:35 INFO util.ExitUtil: Exiting with status 0
18/05/03 16:08:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cdhs/10.0.2.5
************************************************************/

You can start Hadoop services by executing the following commands.

cd /home/hduser/hadoop/sbin
[hduser@cdhs sbin]$ start-dfs.sh
Starting namenodes on [cdhs]
cdhs: starting namenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-namenode-cdhs.out
10.0.2.5: starting datanode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-datanode-cdhs.out
Starting secondary namenodes [cdhs]
cdhs: starting secondarynamenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-secondarynamenode-cdhs.out

Now start YARN using the following command.

[hduser@cdhs sbin]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop-2.7.3/logs/yarn-hduser-resourcemanager-cdhs.out
10.0.2.5: starting nodemanager, logging to /home/hduser/hadoop-2.7.3/logs/yarn-hduser-nodemanager-cdhs.out

You can check the status of the services using the following command.
[hduser@cdhs ~]$ jdk/bin/jps
3542 ResourceManager
3098 NameNode
3648 NodeManager
3227 DataNode
3940 Jps
3392 SecondaryNameNode

This shows that Hadoop is successfully running on the server.
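As an optional smoke test (my addition, not part of the original steps), you can create a directory in HDFS, upload one of the config files, and run the WordCount example that ships with the Hadoop 2.7.3 distribution; the jar path below is the standard location inside the extracted tarball.

# Create an input directory in HDFS and upload a sample file.
[hduser@cdhs ~]$ hdfs dfs -mkdir -p /user/hduser/input
[hduser@cdhs ~]$ hdfs dfs -put hadoop/etc/hadoop/core-site.xml /user/hduser/input
# Run the bundled WordCount example and inspect the result.
[hduser@cdhs ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hduser/input /user/hduser/output
[hduser@cdhs ~]$ hdfs dfs -cat /user/hduser/output/part-r-00000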

You can now browse the Apache Hadoop services through your browser. By default, the Apache
Hadoop NameNode service listens on port 50070. Go to the following address using your favorite
browser.

http://10.0.2.5:50070

To view the Hadoop cluster and all applications, open the following address in your browser.

http://10.0.2.5:8088

The NodeManager information is available at:

http://10.0.2.5:8042

Secondary NameNode information is available via the following link.

http://10.0.2.5:50090/

If you have problems opening any of the URLs above, disable the iptables service.

[root@cdhs ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@cdhs ~]# service iptables disable
The service command supports only basic LSB actions (start, stop, restart, try-restart, reload, force-reload, status). For other actions, please try to use systemctl.
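Note that on a default RHEL 7 install the active firewall is usually firewalld rather than iptables; if that is the case on your VM, the following commands (my suggestion, not from the original post) stop and disable it instead.

[root@cdhs ~]# systemctl stop firewalld
[root@cdhs ~]# systemctl disable firewalld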

Conclusion:
In this tutorial we have learnt how to install Apache Hadoop on a single node in pseudo-distributed mode.

Set up Hadoop Cluster – Multi-Node

by Mandeep K Sandhu. In Big data. Leave a Comment on Set up Hadoop Cluster – Multi-Node

From my previous blog, we learnt how to set up a Hadoop single-node installation. Now, I will
show how to set up a Hadoop multi-node cluster. A multi-node cluster in Hadoop contains
two or more DataNodes in a distributed Hadoop environment. This is practically used in
organizations to store and analyse their petabytes and exabytes of data.

Here in this blog, we are taking three machines to set up the multi-node cluster – MN and DN1/DN2.

 Master node (MN) will run the NameNode and ResourceManager daemons.
 Data nodes (DN1 and DN2) will store the actual data and provide processing power to run the
jobs. Both hosts will run the DataNode and NodeManager daemons.

Software Required:

 RHEL 7 – Set up MN and DN1/DN2 with the RHEL 7 operating system – Minimal Install.
 Hadoop 2.7.3
 Java 7
 SSH

Configure the System

First of all, edit the hosts file in the /etc/ folder on the master node (MN) and specify the IP
address of each system followed by its hostname.

# vi /etc/hosts
# Enter the following lines in the /etc/hosts file.
192.168.1.77 MN
192.168.1.78 DN1
192.168.1.79 DN2

Disable the firewall restrictions.

[root@MN yum.repos.d]# yum install iptables-services
[root@MN yum.repos.d]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@MN yum.repos.d]# chkconfig iptables off
Note: Forwarding request to 'systemctl disable iptables.service'.

Now set up the OS group and user for the Hadoop software.

[root@MN yum.repos.d]# groupadd hadoop
[root@MN yum.repos.d]# useradd -g hadoop hduser
[root@MN yum.repos.d]# passwd hduser

Add the directories to keep the HDFS files.

[root@MN ~]# mkdir -p /data/dfs/nn
[root@MN ~]# mkdir -p /data/dfs/dn
[root@MN ~]# chown -R hduser:hadoop /data/dfs
[root@MN ~]# mkdir -p /data/yarn/nm-local-dir
[root@MN ~]# chown -R hduser:hadoop /data/yarn

Download and Unpack Hadoop/Java Binaries.

Download and extract the Java tar file on the master node. Similarly, download the Hadoop 2.7.3
package on the master node (MN) and extract the Hadoop tar file.

[root@MN ~]# wget "https:/ / arc
[root@MN ~]# wget "http:/ / dow
[root@MN ~]# mv hadoop-2.7.3
[root@MN ~]# mv jdk-7u75-linux

[root@MN ~]# wget "https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-


1 2.7.3.tar.gz"
2 [root@MN ~]# wget "http://download.oracle.com/otn-pub/java/jdk/jdk-7u75-linux-
3 x64.tar.gz"
4 [root@MN ~]# mv hadoop-2.7.3.tar.gz&nbsp; /home/hduser
5 [root@MN ~]# mv jdk-7u75-linux-x64.tar.gz /home/hduser
6 [root@MN ~]# chown hduser:hadoop /home/hduser/*
7 [root@MN ~]#su - hduser
8 [hduser@MN ~]$ tar zxvf hadoop-2.7.3.tar.gz
9 [hduser@MN ~]$ tar -xvf jdk-7u75-linux-x64.tar.gz
10 [hduser@MN ~]$ ln -s hadoop-2.7.3 hadoop
[hduser@MN ~]$ ln -s jdk1.7.0_75 jdk

Set Environment Variables:

Add the Hadoop and Java binaries to your PATH. Edit /home/hduser/.bash_profile on the master
node and add the following lines. Then save the file and close it.

vi /home/hduser/.bash_profile
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/home/hduser/jdk
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin

For applying all these changes to the current Terminal, execute the source command.

[hduser@MN ~]$ source .bash_profile

To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the terminal, execute the java -version and hadoop version commands.

[hduser@MN ~]$ javac -version
javac 1.7.0_75

[hduser@MN ~]$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

[hduser@MN ~]$ hadoop version
Hadoop 2.7.3

Now clone the master node MN to the data nodes DN1 and DN2.

Distribute Authentication Key-pairs for the Hadoop User:

Log in to MN as the hduser user and generate an SSH key. Copy the generated key to the master
node's authorized keys.

[hduser@MN .ssh]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:KN1GTdk/tbG74BYLvKpg9J64L3eYAwZJaT9Bjv8GTwU hduser@MN
The key's randomart image is:
+---[RSA 2048]----+
| o. E .o         |
| +o. .o. . ..    |
| o.o.. ... . .+  |
| o.+ +. oo       |
| o++.S . ..      |
| .+*. o o .      |
| .o.=o + + .     |
| ..==.. . + .    |
| o==+.. .        |
+----[SHA256]-----+
[hduser@MN ~]$ ssh-copy-id -i .ssh/id_rsa.pub 192.168.1.77

Copy the master node’s ssh key to DN1 and DN2 authorized keys.

[hduser@MN ~]$ for i in `cat hosts`; do scp /home/hduser/.ssh/authorized_keys $i:/home/hduser/.ssh; done
authorized_keys                         100%  391   715.9KB/s   00:00
hduser@192.168.1.78's password:
authorized_keys                         100%  391   545.2KB/s   00:00
hduser@192.168.1.79's password:
authorized_keys                         100%  391   482.8KB/s   00:00

Now test passwordless login through SSH.

[hduser@MN ~]$ for i in `cat hosts`; do ssh $i "hostname -f"; done
MN
DN1
DN2

Configure Hadoop:

Now edit the configuration files in the hadoop/etc/hadoop directory on the master node. Set the
NameNode location.

[hduser@MN hadoop]$ vi core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://MN:8020</value>
  </property>
</configuration>

Set path for HDFS:

Edit hdfs-site.xml on the master node to set the NameNode and DataNode file locations.

[hduser@MN hadoop]$ vi hdfs-site.xml

<configuration>
  <property>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
    <name>dfs.namenode.name.dir</name>
    <value>/data/dfs/nn</value>
  </property>

  <property>
    <description>List of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dfs/dn</value>
  </property>

  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/dfs/snn</value>
  </property>

  <property>
    <description>Secondary NameNode.</description>
    <name>dfs.namenode.secondary.http-address</name>
    <value>MN:50090</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

The last property, dfs.replication, indicates how many times data is replicated in the cluster. You
can set 2 to have all the data duplicated on the two nodes. Don’t enter a value higher than the
actual number of data nodes.
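If you later want to confirm or change the replication factor of files already in HDFS, the standard HDFS commands below can be used (an optional aside from my side; /user/hduser/somefile is just a hypothetical path).

# Show the replication factor recorded for each block.
[hduser@MN ~]$ hdfs fsck / -files -blocks | grep "repl="
# Change the replication factor of a single file and wait for it to complete.
[hduser@MN ~]$ hdfs dfs -setrep -w 2 /user/hduser/somefile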

Set YARN as the Job Scheduler:

Copy mapred-site.xml from the template in the configuration folder, then edit mapred-site.xml on
the master node. Set yarn as the default framework for MapReduce operations.

[hduser@MN hadoop]$ cp mapred-site.xml.template mapred-site.xml
[hduser@MN hadoop]$ vi mapred-site.xml

<configuration>
  <property>
    <description>Execution framework set to Hadoop YARN.</description>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure YARN:

Edit yarn-site.xml on master node.

[hduser@MN hadoop]$ vi yarn-site.xml

<configuration>
  <property>
    <description>Shuffle service that needs to be set for Map Reduce applications.</description>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <description>aggregation logs</description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///data/yarn/nm-local-dir</value>
  </property>

  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///var/log/hadoop-yarn/containers</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://MN:8020/var/log/hadoop-yarn/apps</value>
  </property>

  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
  </property>

  <property>
    <description>ResourceManager address for clients to submit jobs.</description>
    <name>yarn.resourcemanager.address</name>
    <value>MN:8032</value>
  </property>

  <property>
    <description>ResourceManager Host:port for applicationMaster to talk to Scheduler to obtain resources.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>MN:8030</value>
  </property>

  <property>
    <description>ResourceManager Host:port for NodeManagers</description>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>MN:8025</value>
  </property>
</configuration>

Configure Slaves

The slaves file is used by the startup scripts to start the required daemons on all nodes. Edit
~/hadoop/etc/hadoop/slaves as follows.

[hduser@MN hadoop]$ vi slaves
DN1
DN2

Format HDFS:

Format the NameNode (only on the master machine).

[root@MN ~]# cd /home/hduser/hadoop/bin
[root@MN ~]# hdfs namenode -format
18/05/17 21:53:54 INFO common.Storage: Storage directory /data/dfs/nn has been successfully formatted.
18/05/17 21:53:54 INFO namenode.FSImageFormatProtobuf: Saving image file /data/dfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
18/05/17 21:53:55 INFO namenode.FSImageFormatProtobuf: Image file /data/dfs/nn/current/fsimage.ckpt_0000000000000000000 of size 353 bytes saved in 0 seconds.
18/05/17 21:53:55 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/05/17 21:53:55 INFO util.ExitUtil: Exiting with status 0
18/05/17 21:53:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at MN/192.168.1.77
************************************************************/

Run and Monitor HDFS:

Now start the Hadoop services by executing the following commands. This will start the NameNode
and SecondaryNameNode on MN, and a DataNode on DN1 and DN2, according to the
configuration in the slaves config file.

[hduser@MN ~]$ cd /home/hduser/hadoop/sbin
[hduser@MN sbin]$ start-dfs.sh
Starting namenodes on [MN]
MN: starting namenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-namenode-MN.out
DN1: starting datanode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-datanode-DN1.out
DN2: starting datanode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-datanode-DN2.out
Starting secondary namenodes [MN]
MN: starting secondarynamenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-secondarynamenode-MN.out

In addition to the HDFS daemons above, you should see a ResourceManager on MN and a
NodeManager on DN1 and DN2.

[hduser@MN sbin]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hduser/hadoop-2.7.3/logs/yarn-hduser-resourcemanager-MN.out
DN2: starting nodemanager, logging to /home/hduser/hadoop-2.7.3/logs/yarn-hduser-nodemanager-DN2.out
DN1: starting nodemanager, logging to /home/hduser/hadoop-2.7.3/logs/yarn-hduser-nodemanager-DN1.out

Check all the daemons running on both master and slave machines.

[hduser@MN ~]$ for i in `cat hosts`; do ssh -t $i "hostname -f; /home/hduser/jdk/bin/jps; echo -e '\n' "; done
MN
17185 ResourceManager
17035 SecondaryNameNode
16851 NameNode
17593 Jps
Connection to 192.168.1.77 closed.

DN1
16353 DataNode
16655 Jps
16446 NodeManager
Connection to 192.168.1.78 closed.

DN2
17201 NodeManager
17410 Jps
17107 DataNode
Connection to 192.168.1.79 closed.
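You can also confirm from the master that both DataNodes have registered with the NameNode (an optional check from my side, not in the original post); the report should show two live DataNodes with the DN1 and DN2 addresses.

# Summarize the cluster, showing only the live-node count and node names.
[hduser@MN ~]$ hdfs dfsadmin -report | grep -E "Live datanodes|Name:"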

Web Interface:

Finally, open the browser and go to MN:50070/dfshealth.html on your master machine; this
will give you the NameNode interface.
To view the Hadoop cluster and all applications, open the following address in your browser.

http://192.168.1.77:8088/cluster

The NodeManager information is available at:

http://192.168.1.78:8042/node
http://192.168.1.79:8042/node

Secondary NameNode information is available via the following link.

http://192.168.1.77:50090/status.html

Stop the Services:

Services can be stopped in the following order.

[hduser@MN ~]$ cd /home/hduser/hadoop/sbin
[hduser@MN sbin]$ stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
DN2: stopping nodemanager
DN1: stopping nodemanager
no proxyserver to stop
[hduser@MN sbin]$ stop-dfs.sh
Stopping namenodes on [MN]
MN: stopping namenode
DN2: stopping datanode
DN1: stopping datanode
Stopping secondary namenodes [MN]
MN: stopping secondarynamenode

I hope you have successfully installed a Hadoop multi-node cluster. If you are facing any
problems, you can comment below and we will reply shortly.

Mandy!!!

Hadoop Cluster via Cloudera Manager

by Mandeep K Sandhu. In Big data. Leave a Comment on Hadoop Cluster via Cloudera Manager

I have written a couple of blogs on setting up Hadoop as a single-node or multi-node cluster, and
deploying, configuring, and running a Hadoop cluster manually is rather time- and cost-consuming.
Here's a helping hand to create a fully distributed Hadoop cluster with Cloudera
Manager. In this blog, we'll see how fast and easy it is to install a Hadoop cluster with Cloudera
Manager.

Software used:

 CDH5
 Cloudera Manager – 5.7
 OS – RHEL 7
 VirtualBox – 5.2

Prepare Servers:

For a minimal non-production cluster, we need 3 servers.

 CM – Cloudera Manager + other Hadoop services (minimum 8 GB)
 DN1/DN2 – Data nodes

Please do the following steps on the Cloudera Manager machine (CM).

Disable SELinux:

vi /etc/selinux/config
SELINUX=disabled

Setup NTP:

[root@CM ~]# yum install ntp -y
[root@CM ~]# chkconfig ntpd on
Note: Forwarding request to 'systemctl enable ntpd.service'.
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.
[root@CM ~]# service ntpd start
Redirecting to /bin/systemctl start ntpd.service
[root@CM ~]# hwclock --systohc
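If you want to confirm that NTP is actually synchronizing (an optional check, not in the original steps), ntpq can list the peers the daemon is talking to.

# An asterisk in the first column marks the currently selected time source.
[root@CM ~]# ntpq -p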

Disable firewall:

[root@CM ~]# yum install iptables-services
[root@CM ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@CM ~]# chkconfig iptables off
Note: Forwarding request to 'systemctl disable iptables.service'.

Distribute Authentication Key-pairs:

[root@CM ~]# ssh-keygen
[root@CM ~]# ssh-copy-id -i .ssh/id_rsa.pub 192.168.1.80

Define host names:

Edit the hosts file in the /etc/ folder on the Cloudera Manager node (CM) and specify the IP
address of each system followed by its hostname. Each machine needs a static IP address, and all
VMs should be able to ping each other.

[root@CM ~]# cat /etc/hosts
192.168.1.80 CM
192.168.1.81 DN1
192.168.1.82 DN2

Now clone the machine to DN1/DN2, update the IP address and hostname on each, and test
passwordless SSH by displaying the hostnames.

[root@CM ~]# for i in `cat hosts`; do ssh $i "hostname -f"; done
CM
DN1
DN2

Install Cloudera Manager and Agents:

The installation can be divided into the following steps:

 Install the MySQL database
 Set up Java
 Install and run the Cloudera Manager server and agents

Install MySQL Database:

# If MariaDB is installed, remove it first.
[root@CM ~]# yum remove mariadb mariadb-server
[root@CM ~]# rpm -ivh "https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm"
[root@CM ~]# yum install mysql mysql-server
[root@CM ~]# systemctl start mysqld
[root@CM log]# grep password /var/log/mysqld.log
[root@CM log]# /usr/bin/mysql_secure_installation
Securing the MySQL server deployment.
Enter password for user root:
The existing password for the user account root has expired. Please set a new password.
New password:
Re-enter new password:
The 'validate_password' plugin is installed on the server.
The subsequent steps will run with the existing configuration of the plugin.
Using existing password for root.
Estimated strength of the password: 100
Change the password for root ? ((Press y|Y for Yes, any other key for No) : y
New password:
Re-enter new password:
Estimated strength of the password: 100
Do you wish to continue with the password provided?(Press y|Y for Yes, any other key for No) : y
By default, a MySQL installation has an anonymous user, allowing anyone to log into MySQL without having to have a user account created for them. This is intended only for testing, and to make the installation go a bit smoother. You should remove them before moving into a production environment.
Remove anonymous users? (Press y|Y for Yes, any other key for No) : y
Success.
Normally, root should only be allowed to connect from 'localhost'. This ensures that someone cannot guess at the root password from the network.
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : y
Success.
By default, MySQL comes with a database named 'test' that anyone can access. This is also intended only for testing, and should be removed before moving into a production environment.
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : y
- Dropping test database...
Success.
- Removing privileges on test database...
Success.
Reloading the privilege tables will ensure that all changes made so far will take effect immediately.
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : y
Success.
All done!

Java Set up:

Hadoop is written in Java so we need to set up Java. Install the Oracle Java Development Kit
(JDK) as below on all nodes.

[root@CM ~] mkdir /usr/java/
[root@CM ~] mv /root/software/jdk-7u75-linux-x64.tar.gz /usr/java/
[root@CM ~] cd /usr/java/
[root@CM java] tar -xvf jdk-7u75-linux-x64.tar.gz
[root@CM java] ln -s jdk1.7.0_75 default
[root@CM java] vim /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME:$JAVA_HOME/bin
# Make the file executable.
[root@CM java] chmod +x /etc/profile.d/java.sh

Log out and log back in, and you can check the Java version.

[root@CM ~] java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Copy the JDK to the other nodes DN1/DN2 as well, for example with the loop shown below.
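This is a minimal sketch of that copy from my side (not in the original post); it assumes root SSH access to DN1/DN2 and the same /usr/java layout on each node.

[root@CM ~]# for host in DN1 DN2; do
>   ssh $host "mkdir -p /usr/java"                               # create the target directory
>   scp -r /usr/java/jdk1.7.0_75 $host:/usr/java/                # copy the extracted JDK
>   ssh $host "ln -s /usr/java/jdk1.7.0_75 /usr/java/default"    # recreate the 'default' symlink
>   scp /etc/profile.d/java.sh $host:/etc/profile.d/             # copy the profile script
> done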

Get the MySQL JDBC connector:

[root@CM ~]# wget "https:/ / dev


[root@CM ~]# tar -xvf mysql-con
[root@CM ~]# mkdir / usr/ share/
[root@CM]# cd / root/ mysql-con

1 [root@CM ~]# wget "https://dev.mysql.com/get/downloads/connector-j/mysql-connector-java-


2 5.1.43.tar.gz"
3 [root@CM ~]# tar -xvf mysql-connector-java-5.1.43.tar.gz
[root@CM ~]# mkdir /usr/share/java
4 [root@CM]# cd /root/mysql-connector-java-5.1.43
5 [root@CM mysql-connector-java-5.1.43]# cp mysql-connector-java-5.1.43-bin.jar
/usr/share/java/mysql-connector-java.jar

Install Cloudera Manager and Agents:

Now deploy the Cloudera Manager server and agents.

Set up the Cloudera repository:

[root@CM ~]# Wget "http:/ / arc


[root@CM ~]# mv cloudera-man

[root@CM ~]# Wget "http://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-


1
manager.repo"
2
[root@CM ~]# mv cloudera-manager.repo /etc/yum.repos.d/

Now install the Cloudera Manager packages on the CM node.

[root@CM ~]# yum install cloudera-manager-server cloudera-manager-agent cloudera-manager-daemons -y

On the other two nodes, please install the daemons and agent only.

[root@DN1 ~]# yum install cloudera-manager-agent cloudera-manager-daemons -y
[root@DN2 ~]# yum install cloudera-manager-agent cloudera-manager-daemons -y

Prepare Cloudera Manager Database:

[root@CM ~]# /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -uroot -p --scm-host localhost scm scm

Start the Cloudera Manager server.

[root@CM ~]# systemctl start cloudera-scm-server

Before starting the agents, update the server host entry in the agent config file on each node.

# Update this file on CM/DN1/DN2
[root@CM ~]# vi /etc/cloudera-scm-agent/config.ini
# Hostname of the CM server
server_host=CM

[root@CM ~]# systemctl start cloudera-scm-agent
[root@DN1 ~]# systemctl start cloudera-scm-agent
[root@DN2 ~]# systemctl start cloudera-scm-agent
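To verify that the server and each agent came up and that the agents can reach the server, you can check the service status and tail the logs; this is an optional check from my side, and the paths below are the default log locations for Cloudera Manager 5.

[root@CM ~]# systemctl status cloudera-scm-server cloudera-scm-agent
# Watch the server and agent logs for errors or heartbeat problems.
[root@CM ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
[root@CM ~]# tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log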

Hadoop cluster set up via Cloudera Manager:

Go to http://192.168.1.80:7180/cmf/ and the login page will appear. Log in with admin/admin as the
username/password.

Then read and accept the license agreement and choose "Cloudera Enterprise Data Hub Edition
Trial" on the next page. After that you'll be offered to set up a new cluster.
As you have already installed the agents, you can see the list of hosts. Select all hosts.

Press Continue, select the CDH version, and select the parcel method.
Press "Continue" and wait for distribution and activation.

Wait for Cluster Inspector to finish the inspection and you’ll see all installed components.
Install Hadoop cluster:

Then you can choose the distribution of cluster roles across the cluster. Accept the default options.
You can see the summary view via "Host view detail".
The next part is the database set-up. Please provide the database access details.

Accept the defaults and continue.

Wait for Cloudera Manager to set up the cluster roles.
When the cluster is installed you can see it in Cloudera Manager and start monitoring the cluster state,
add and remove services in this cluster, change configurations, identify problems in the
cluster and so on. The yellow signs shown near the services are warnings that can be ignored for
now but should be analyzed and fixed if you are going to bring the cluster into production.

Summary
Cloudera Manager makes the creation and maintenance of Hadoop clusters significantly easier than
managing them manually. With these instructions it is possible to create a Hadoop
cluster in less than an hour, whereas manual configuration and deployment could take a few hours
or even days.

Install ClouderaManager from Local Repository

by Mandeep K Sandhu. In Big data. Leave a Comment on Install ClouderaManager from Local Repository

This section explains how to set up a local yum repository to install CDH on the machines in
your cluster. There are a number of reasons you might want to do this, for example:
 Servers in your cluster don't have access to the internet. You can still use yum to do an
installation on those machines by creating a local yum repository.
 To make sure that each node will have the same version of the software installed.
 A local repository is more efficient.

We need an internet connection to download the repo/packages.

Set up Local Repo:

Create a local web publishing directory, install a web server such as Apache httpd on the
machine that hosts the RPMs, and start the HTTP server.

[root@cm ~]# mkdir -p /var/www/html/yum/cm
[root@cm ~]# yum install httpd
[root@cm ~]# service httpd start
Redirecting to /bin/systemctl start httpd.service
[root@cm ~]# service httpd status

Install the createrepo and yum-utils RPMs.

[root@cm ~]# yum install yum-utils createrepo

Download the CM tar file.

[root@cm ~]# wget http://archive.cloudera.com/cm5/repo-as-tarball/5.12.0/cm5.12.0-centos7.tar.gz
# Alternatively, download the file on your PC and move it to the machine that hosts the RPMs:
# http://archive.cloudera.com/cm5/repo-as-tarball/5.12.0/

Move the tar file to the web directory and unpack it.

[root@cm ~]# mv cm5.12.0-centos7.tar.gz /var/www/html/yum
[root@cm ~]# cd /var/www/html/yum
[root@cm yum]# tar -xvf cm5.12.0-centos7.tar.gz

Now create the yum repository.

[root@cm cm]# createrepo /var/www/html/yum/cm/5.12.0
Spawning worker 0 with 7 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

Ensure that you can access the URL.

http://192.168.1.83/yum/cm/
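A quick way to confirm the repository is being served (an optional check from my side) is to request the repodata that createrepo generated above with curl from any node.

[root@cm ~]# curl -I http://192.168.1.83/yum/cm/5.12.0/repodata/repomd.xml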

Now create the cm.repo file and list the available repo.

[root@cm html]# cd /etc/yum.repos.d/
[root@cm yum.repos.d]# vi cm.repo
[cm]
name=cm
baseurl=http://192.168.1.83/yum/cm/5
enabled=1
gpgcheck=0
[root@cm yum.repos.d]# yum repolist

Set up Parcels:

Create the parcel directory.

[root@cm ~]# mkdir -p /var/www/html/parcels
[root@cm ~]# cd /var/www/html/parcels

Now download the required parcels.

[root@cm ~]# cd /var/www/html/parcels/
[root@cm parcels]# wget https://archive.cloudera.com/cdh5/parcels/5.14.0/CDH-5.14.0-1.cdh5.14.0.p0.24-el5.parcel
[root@cm parcels]# wget https://archive.cloudera.com/cdh5/parcels/5.14.0/manifest.json
[root@cm parcels]# wget https://archive.cloudera.com/cdh5/parcels/5.14.0/CDH-5.14.0-1.cdh5.14.0.p0.24-el5.parcel.sha1
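Two notes from my side, not in the original post: Cloudera Manager expects a checksum file ending in .sha next to each parcel, so the downloaded .sha1 file is usually renamed; and since the cluster machines here run RHEL 7, the el7 variant of the parcel from the same 5.14.0 directory would normally be the one to serve rather than el5.

# Assumed step: give the checksum file the .sha extension Cloudera Manager looks for.
[root@cm parcels]# mv CDH-5.14.0-1.cdh5.14.0.p0.24-el5.parcel.sha1 CDH-5.14.0-1.cdh5.14.0.p0.24-el5.parcel.sha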

Check the URL for parcels. You can use this URL in Cloudera Manager.

http://192.168.1.83/parcels/

Install CM using the local repository:

Download the installer for CM and change the permissions. Then run the CM installer with the
option --skip_repo_package=1 to install Cloudera Manager from the local repo.

[root@cm ~]# wget http://archive.cloudera.com/cm5/installer/5.12.0/cloudera-manager-installer.bin
[root@cm ~]# chmod u+x cloudera-manager-installer.bin
[root@cm ~]# sudo ./cloudera-manager-installer.bin --skip_repo_package=1

You'll get the following window (note that this installer path is not recommended for production).


Accept the Licence.

Accept Oracle Binaries Licence.


JDK/ClouderaManager installation is in progress.

Installation complete.

Now open the URL.

http://cm:7180/cmf/login or http://192.168.1.83:7180/cmf/login

Log in to the page with admin/admin. Accept the licence and choose Cloudera Enterprise Data Hub.
The next part is the cluster set-up and will be explained in future blogs.

Thanks

Mandy
ClouderaManager – Installation on Google Cloud

by Mandeep K Sandhu. In Big data. Leave a Comment on ClouderaManager – Installation on Google Cloud

In this post, I am going to tell you about how to set-up a Hadoop cluster on Google Cloud
Platform.

Register on Google Cloud Platform

First of all, you have to register on Google Cloud. It's easy: just sign in with your Gmail ID and
fill in your credit card details. Once registered, you will get a 300 USD, 1-year free subscription on
Google Cloud.

How to create Virtual Machines

 Create a new project. Give a name to your project or leave as it is provided by Google.
 Now click on the icon on the top left corner of your homepage. A list of products and
services will appear which the Google cloud provides. Click on Compute Engine and
then click on VM instances. The VM Instances page will open, select Create Instance.
Create Four Machines as below.

 Name – CM, DN1, DN2 and RM.


 Zone – Select nearest Zone.
 Machine Type – 3 vCPU x 12 GB (CM), 1 vCPU x 3.5 GB (DN1 and DN2), and 2 vCPU x 7 GB (RM).
 Boot disk – Click on Change. I am familiar with Red Hat Enterprise Linux 7, so I chose
that. Leave the boot disk type as it is and increase the disk size to 150 GB for all.
 Identity and API access – Leave as it is.
 Firewall – Allow HTTP traffic.
 Click on create.

Once created, the instances will appear as shown below.


Configure the Machines:

First of all, click on SSH (SSH is a network protocol that allows you to access a remote
computer in a secure way). A terminal will open. Now do the following steps:

Log in as the root user with the command: sudo su

Disable firewall –

Disable the firewall and stop the currently running firewall service.

chkconfig iptables off
service iptables stop

Disable SELinux –

Write "disabled" in place of "enforcing".

vi /etc/selinux/config
SELINUX=disabled

Setup SSHD –

Update the sshd_config file as below.

[root@cm ~]# vi /etc/ssh/sshd_config
# Enable root login - change no to yes.
PermitRootLogin yes
# Similarly change the following two parameters.
PasswordAuthentication yes
ChallengeResponseAuthentication yes
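For the edited settings to take effect, the SSH daemon has to be reloaded; this step is not spelled out in the original post, but on RHEL 7 it would typically be:

[root@cm ~]# systemctl restart sshd
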
Password change –

Set the password for all machines.

[root@cm ~]# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Once done, restart the node by running init 6.

ClouderaManager Install:

Download the cloudera-manager-installer.bin file on the CM node, change the permissions, and
run the installer.

[root@cm ~]# wget http://archive.cloudera.com/cm5/installer/5.12.0/cloudera-manager-installer.bin
[root@cm ~]# chmod u+x cloudera-manager-installer.bin
[root@cm ~]# sudo ./cloudera-manager-installer.bin

The installation part is simple and straightforward. Accept the licence for both Cloudera and the
Oracle JDK.
Click Next and install the JDK and CDH.
Once the installation is complete, open the URL, which is the external IP address of the machine
where you ran the installer (CM in my case).

http://35.197.189.176:7180/cmf/login

Log in to the page with admin/admin. Accept the licence and choose Cloudera Enterprise Data Hub.
The next screen will show all the services which are available. Press Continue.

On the next page you need to specify the host IP address/hostname for all your instances and then
click Search.
After all the hosts have been found, it will display the following page.

The next page which will appear is where you select the repository.

 Choose Method -> select Use Parcels
 Select the version of CDH -> select CDH 5.14

Press Next and click the box to install the JDK.

The next page is for enabling Single User Mode. Just click Continue; there is no need to enable it.
Provide the SSH login credentials: enter the password you set while configuring the servers and hit
Continue.

It will install the required packages, which will take some time.

The last step is validation, and the cluster installation part is then complete.

Setup Cluster:

Now you have arrived at the Cluster Setup page. You can select which services you want to
install. At the bottom, you will find Custom Services, through which you can choose whichever
services you want to assign to your cluster. I selected HDFS and YARN and clicked Continue.
On the role assignment page, assign roles to the different nodes and view the final distribution by
host. I assigned them as below:

 CM – NameNode, SecondaryNameNode and ClouderaManager related services.
 RM – ResourceManager, JobHistoryServer
 DN1/DN2 – DataNode 1/2 and NodeManager 1/2

Set up the database for the Reports Manager. Use the embedded DB, but in production use a custom
database such as MySQL/MSSQL. Click on Test Connection and then Continue.
Keep the default settings for block size and data directories and press Continue.
Now it will start all the services on your cluster, which will also take some time. This concludes the
cluster installation part.

You will see the Cloudera Manager home page.

The yellow signs shown near the services are warnings that can be ignored for now but should be
analyzed and fixed if you are going to bring the cluster into production.

If you like this article, please share it further.

Mandy

Configure High Availability – HDFS/YARN via Cloudera

by Mandeep K Sandhu. In Big data. Leave a Comment on Configure High Availability – HDFS/YARN via Cloudera

In earlier releases, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each
cluster had a single NameNode, and if that machine or process became unavailable, the cluster as
a whole would be unavailable until the NameNode was either restarted or brought up on a
separate machine. The Secondary NameNode did not provide failover capability. The HA
architecture solved this problem of NameNode availability by allowing us to have two
NameNodes in an active/passive configuration. The NameNode is the centerpiece of an HDFS
file system.

To enable NameNode HA in Cloudera, you must ensure that the two nodes have the same
configuration in terms of memory, disk, etc. for optimal performance. Here are the steps.

ZooKeeper:

First of all, install ZooKeeper to set up HA for the NameNode.

Select cluster -> Action -> Add Service and a pop-up will appear.
Add ZooKeeper from the listed services.

Add ZooKeeper to 3 servers.

Add the ZooKeeper service to the cluster and click Next.

The next step will initialize and start the ZooKeeper services. Click Next, and the ZooKeeper service
is successfully added to the cluster.
HDFS HA set up:

Now select the HDFS service from the cluster and see the status page.

Select Action -> Enable HA.

Give the nameservice a name and click Next.
Now select the location of the second NameNode service. In my case, I selected RM as the second
NameNode location and 3 JournalNodes (odd numbers are required for QJM; see here for more
information).

Now change the directory for the JournalNode edits.

Click Next and you can see the progress steps.
The NameNode formatting step will fail; that is fine. Wait till all the steps are finished.

Now it is completed.


You can see the active/standby NameNode setup as below.

See the ZooKeeper status: one node will be the leader and the others followers.

HA Test:

The HA test is very simple and quick. Check the status of the NameNode services, select the active
NameNode -> Actions for selected -> Stop.
Now check the status; after a few seconds, the standby node will come up as the active node.

Now start the stopped NameNode service and it will come up as the standby NameNode after a while.

Setup HA for YARN:

Go to the YARN service.

Select Action -> "Enable High Availability".
Select the second node for the YARN service. I selected the CM node.
Click Next and it will enable HA.

Now go to the ResourceManager and you'll see both the active and standby RM nodes.


Once you have configured the YARN resource manager for HA, if the active resource manager
is down or is no longer on the list, one of the standby resource managers becomes active and
resumes resource manager responsibilities. This way, jobs continue to run and complete
successfully.

High Availability Set up – HDFS/YARN using Quorum

by Mandeep K Sandhu. In Big data. 1 Comment on High Availability Set up – HDFS/YARN using Quorum

In this blog, I am going to talk about how to configure and manage a highly available HDFS
(CDH 5.12.0) cluster. In earlier releases, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became
unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted
or brought up on a separate machine. The Secondary NameNode did not provide failover
capability.

The HA architecture solved this problem of NameNode availability by allowing us to have two
NameNodes in an active/passive configuration. So, we have two running NameNodes at the
same time in a High Availability cluster:

 Active NameNode
 Standby/Passive NameNode.

We can implement the Active and Standby NameNode configuration in the following two ways:

 Using Quorum Journal Nodes


 Shared Storage using NFS

Using the Quorum Journal Manager (QJM) is the preferred method for achieving high
availability for HDFS. Read here to know more about the QJM and NFS methods. In this blog, I'll
implement the HA configuration with quorum-based storage; here are the IP addresses and
corresponding machine names/roles.
 NameNode machines – NN1/NN2, of equivalent hardware and specs.
 JournalNode machines – The JournalNode daemon is relatively lightweight, so these
daemons can reasonably be collocated on machines with other Hadoop daemons, for
example the NameNodes, the JobTracker, or the YARN ResourceManager. There must be at
least three JournalNode daemons, since edit log modifications must be written to a
majority of JournalNodes. So three JNs run on NN1, NN2, and the MGT server.
 Note that when running with N JournalNodes, the system can tolerate at most (N – 1) / 2
failures and continue to function normally (for example, with 3 JournalNodes the cluster
tolerates one JournalNode failure).
 The ZKFailoverController (ZKFC) is a ZooKeeper client that also monitors and
manages the NameNode status. Each NameNode also runs a ZKFC, which is responsible
for periodically monitoring the health of its NameNode.
 ResourceManagers run on the same NameNode hosts, NN1/NN2.
 Two DataNodes – DN1 and DN2.

Pre-requirements:

First of all, edit the hosts file in the /etc/ folder on the NameNode (NN1) and specify the IP address
of each system followed by its hostname. Each machine needs a static IP address, and all
VMs should be able to ping each other.

vi /etc/hosts
192.168.1.150 NN1
192.168.1.151 NN2
192.168.1.152 DN1
192.168.1.153 DN2
192.168.1.154 MGT

All VMs are set up with the RHEL 7 operating system. Disable the firewall restrictions
on all VMs.

[root@NN1 yum.repos.d]# yum install iptables-services
[root@NN1 yum.repos.d]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@NN1 yum.repos.d]# chkconfig iptables off
Note: Forwarding request to 'systemctl disable iptables.service'.

Setup Java:

Hadoop is written in Java so we need to set up Java first. Install the Oracle Java Development
Kit (JDK) as below on all nodes.

[root@NN1 ~] mkdir /usr/java/
[root@NN1 ~] mv /root/software/jdk-7u75-linux-x64.tar.gz /usr/java/
[root@NN1 ~] cd /usr/java/
[root@NN1 java] tar -xvf jdk-7u75-linux-x64.tar.gz
[root@NN1 java] ln -s jdk1.7.0_75 default
[root@NN1 java] vim /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME:$JAVA_HOME/bin
# Make the file executable.
[root@NN1 java] chmod +x /etc/profile.d/java.sh

Log out and back in, and you can check the Java version.

java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Download/install CDH 5.13.X

If you want to create your own YUM repository, download the appropriate repo file, create the
repo, distribute the repo file as described under:

http://archive.cloudera.com/cdh5/repo-as-tarball/5.13.0/

You can use Hadoop 2.x distribution as well.
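
A rough sketch of building the local repo from the repo-as-tarball is below; the exact tarball name and extracted layout are assumptions (check the archive listing), and the extracted directory must end up matching the baseurl used in the repo file that follows.

# Sketch only: tarball name/layout are assumptions, verify against the archive listing
mkdir -p /root/cdhrepo && cd /root/cdhrepo
wget http://archive.cloudera.com/cdh5/repo-as-tarball/5.13.0/cdh5.13.0-centos7.tar.gz
tar -xzf cdh5.13.0-centos7.tar.gz
# the repo should now be readable at file:///root/cdhrepo/cdh/5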

[root@NN1 ~] vi local-repos.repo
[cloudera-cdh5]
name=cloudera
baseurl=file:///root/cdhrepo/cdh/5
gpgcheck=0
enabled=1
[root@NN1 ~] yum repolist

Now install the components below on NameNodes 1 and 2 (NN1/NN2) and disable their run-level start.

[root@NN1 conf] yum install hadoop-hdfs-namenode -y
[root@NN1 conf] yum install hadoop-yarn-resourcemanager -y
[root@NN1 conf] yum install hadoop-mapreduce-historyserver -y
[root@NN1 conf] yum install hadoop-hdfs-journalnode hadoop-hdfs-zkfc zookeeper-server zookeeper -y
[root@NN2 conf] yum install hadoop-hdfs-journalnode hadoop-hdfs-zkfc zookeeper-server zookeeper -y
# Set run level start off
[root@NN1 ~] cd /etc/init.d/
[root@NN1 init.d] ls -altr
[root@NN1 init.d] chkconfig hadoop-hdfs-namenode off
[root@NN1 init.d] chkconfig zookeeper-server off
[root@NN1 init.d] chkconfig hadoop-hdfs-zkfc off
[root@NN1 init.d] chkconfig hadoop-hdfs-journalnode off
[root@NN1 init.d] chkconfig hadoop-yarn-resourcemanager off

Similarly install the DataNode and NodeManager rpms on DN1/DN2.

[root@DN1 conf] yum install hadoop-yarn-nodemanager -y
[root@DN1 conf] yum install hadoop-hdfs-datanode -y

The management server (MGT) will be installed with the JournalNode, ZooKeeper, and client services.

[root@DN1 conf] yum install hadoop-client -y
[root@MGT conf] yum install hadoop-hdfs-journalnode zookeeper-server zookeeper -y

Distribute Authentication Key-pairs:

Log in to NN1 as the root user and generate an SSH key. Copy the generated key to the
authorized_keys file of every node.

[root@NN1 .ssh]$ ssh-keygen
[root@NN1] ssh-copy-id -i .ssh/id_rsa.pub 192.168.1.150
# repeat ssh-copy-id for every node IP in the cluster
Now test the passwordless connections. Create a file "hosts" with the IP addresses of all nodes.

[root@NN1 ~]# vi hosts
192.168.1.150
192.168.1.151
192.168.1.152
192.168.1.153
192.168.1.154

# Display the hostname of each node from NN1 (should not prompt for a password)
[root@NN1 ~]# for i in `cat hosts`; do ssh $i "hostname -f"; done
NN1
NN2
DN1
DN2
MGT

Configuration Details:

To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml
configuration file. Choose a logical name for this nameservice, for example "ha-cluster". Open
the core-site.xml file on the active NameNode and add the properties below.

cd /etc/hadoop/conf

[root@NN1 conf]# vi core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ha-cluster</value>
</property>
</configuration>

Open the hdfs-site.xml file and add the DataNode directory path in the dfs.datanode.data.dir
property, along with the other properties below.
[root@NN1 conf]# vi hdfs-site.xml

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/data/dfs/nn</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>/data/dfs/dn</value>
</property>

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>

<property>
<name>dfs.nameservices</name>
<value>ha-cluster</value>
</property>

<property>
<name>dfs.ha.namenodes.ha-cluster</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
<value>192.168.1.150:8020</value>
</property>

<property>
<name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
<value>192.168.1.151:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.ha-cluster.nn1</name>
<value>192.168.1.150:50070</value>
</property>

<property>
<name>dfs.namenode.http-address.ha-cluster.nn2</name>
<value>192.168.1.151:50070</value>
</property>

<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://192.168.1.150:8485;192.168.1.151:8485;192.168.1.154:8485/ha-cluster</value>
</property>

<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/dfs/jn</value>
</property>

<property>
<name>dfs.client.failover.proxy.provider.ha-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<property>
<name>ha.zookeeper.quorum</name>
<value>192.168.1.150:2181,192.168.1.151:2181,192.168.1.154:2181</value>
</property>
</configuration>

Edit yarn-site.xml on NN1.


[root@NN1 conf]# vi yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<property>
<description>List of directories to store localized files in.</description>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/yarn/nm-local-dir</value>
</property>

<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///var/log/hadoop-yarn/containers</value>
</property>

<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://ha-cluster:8020/var/log/hadoop-yarn/apps</value>
</property>

<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>

<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>ha-cluster</value>
</property>

<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>

<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>

<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>

<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>

<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>192.168.1.150</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>192.168.1.151</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>192.168.1.150:8088</value>
</property>

<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>192.168.1.151:8088</value>
</property>

<property>
<name>yarn.resourcemanager.zk-address</name>
<value>192.168.1.150:2181,192.168.1.151:2181,192.168.1.154:2181</value>
</property>
</configuration>

Edit mapred-site.xml to set YARN as the default framework for MapReduce operations and add the JobHistory server properties.

[root@NN1 conf]# vi mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.1.151:10020</value>
</property>

<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>192.168.1.151:19888</value>
</property>

<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
</configuration>

Update the slaves file with the DataNode VMs DN1/DN2.

[root@NN1 conf]# vi slaves
192.168.1.152
192.168.1.153

Once all files are updated on NN1, copy the config files from NN1 to all other nodes.

[root@NN1 ~]# for i in `cat hosts`; do scp /etc/hadoop/conf/* $i:/etc/hadoop/conf/; done

Deploying Zookeeper:

In the conf directory you have a zoo_sample.cfg file; create zoo.cfg from it. Open zoo.cfg, set a
custom path for the dataDir property if you want (I kept the default directory), and add the details
of the remaining nodes at the end, as shown below. Then copy the ZooKeeper conf file to the NN2
and MGT servers.

[root@NN1 ~]# vi /etc/zookeeper/conf/zoo.cfg

# add at the end
server.1=192.168.1.150:2888:3888
server.2=192.168.1.151:2888:3888
server.3=192.168.1.154:2888:3888

[root@NN1 conf]# scp zoo.cfg 192.168.1.151:/etc/zookeeper/conf/
[root@NN1 conf]# scp zoo.cfg 192.168.1.154:/etc/zookeeper/conf/
Now initialize the zookeeper and start the Zookeeper service.

[root@NN1 conf]# service zookeeper-server init --myid=1
[root@NN2 ~]# service zookeeper-server init --myid=2
[root@MGT ~]# service zookeeper-server init --myid=3
# If the data directory already exists, use the force option
# service zookeeper-server init --myid=1 --force

# Start the ZK service.
[root@NN1 conf]# service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED
[root@NN2 ~]# service zookeeper-server start
[root@MGT ~]# service zookeeper-server start

Now start the JournalNodes.

[root@NN1 bin]# service hadoop-hdfs-journalnode start
[root@NN2 bin]# service hadoop-hdfs-journalnode start
[root@MGT bin]# service hadoop-hdfs-journalnode start
Format HDFS:

Format the NameNode on NN1 only if this is a new cluster. If you are converting an existing cluster
from non-HA to HA, re-initialize the shared edits instead, as I did.

[root@NN1 ~]# sudo -u hdfs hdfs namenode -initializeSharedEdits
---
18/05/21 19:16:03 INFO namenode.EditLogInputStream: Fast-forwarding stream '/data/dfs/nn/current/edits_0000000000000000209-0000000000000000209' to transaction ID 169
18/05/21 19:16:03 INFO namenode.FSEditLog: Starting log segment at 209
18/05/21 19:16:03 INFO namenode.FSEditLog: Ending log segment 209, 209
18/05/21 19:16:03 INFO namenode.FSEditLog: logSyncAll toSyncToTxId=209 lastSyncedTxid=209 mostRecentTxid=209
18/05/21 19:16:03 INFO namenode.FSEditLog: Done logSyncAll lastWrittenTxId=209 lastSyncedTxid=209 mostRecentTxid=209
18/05/21 19:16:03 INFO namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 7
18/05/21 19:16:03 INFO util.ExitUtil: Exiting with status 0
18/05/21 19:16:03 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at NN1/192.168.1.150
************************************************************/

# Format on a new cluster
sudo -u hdfs hdfs namenode -format

Start And Monitor HDFS Services:

Now start the NameNode Service on NN1.

[root@NN1 ~]# service hadoop-hdfs-namenode start

NN2 now has different metadata. To sync it with NN1, you can either copy the entire metadata
directory or use the bootstrapStandby command below. Then start the NameNode service on NN2.
[root@NN2 ~]# sudo -u hdfs hdfs namenode -bootstrapStandby
[root@NN2 ~]# service hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-NN2.out
Started Hadoop namenode: [ OK ]

Start the DataNode Services on DN1/DN2.

[root@DN1 ~]# service hadoop-hdfs-datanode start
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-DN1.out
Started Hadoop datanode (hadoop-hdfs-datanode): [ OK ]
[root@DN2 ~]# service hadoop-hdfs-datanode start
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-DN2.out
Started Hadoop datanode (hadoop-hdfs-datanode): [ OK ]

The next step is to initialize required state in ZooKeeper. You can do so by running the following
command from one of the NameNode hosts. This will create a znode in ZooKeeper inside of
which the automatic failover system stores its data.

[root@NN1 hadoop-hdfs]# sudo -u hdfs hdfs zkfc -formatZK

--- Check this line from the logs
18/05/21 19:45:06 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/ha-cluster in ZK.

Check whether ZooKeeper is connected.

[root@NN1 conf]# zookeeper-client -server=192.168.1.150:2181
WatchedEvent state:SyncConnected type:None path:null
JLine support is enabled
[zk: localhost:2181(CONNECTED) 0]
[zk: localhost:2181(CONNECTED) 1] ls /hadoop-ha/ha-cluster

Now start the ZKFC service. Note: whichever NameNode's ZKFC you start first becomes the active
node.

[root@NN1 hadoop-hdfs]# service hadoop-hdfs-zkfc start
starting zkfc, logging to /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-NN1.out
Started Hadoop zkfc: [ OK ]
[root@NN2 ~]# service hadoop-hdfs-zkfc start
starting zkfc, logging to /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-NN2.out
Started Hadoop zkfc: [ OK ]

Web Interface:

Open the browser and explore further.

http://192.168.1.150:50070/dfshealth.html#tab-overview
http://192.168.1.151:50070/dfshealth.html#tab-overview

Check the services from the command line.


[root@NN1 ~]# for i in `cat hosts`; do ssh -t $i "hostname -f;sudo -u hdfs /usr/java/default/bin/jps;sudo -u yarn /usr/java/default/bin/jps;sudo -u mapred /usr/java/default/bin/jps;sudo -u zookeeper /usr/java/default/bin/jps;echo -e '\n' "; done
NN1
3053 DFSZKFailoverController
2039 NameNode
1889 JournalNode
2755 QuorumPeerMain

NN2
2869 DFSZKFailoverController
1914 JournalNode
2147 NameNode
2690 QuorumPeerMain

DN1
1747 Jps
1605 DataNode

DN2
1637 DataNode

MGT
1822 JournalNode
2196 QuorumPeerMain

Test HA set up:

Check the process ID and kill the active NameNode on NN1.

[root@NN1 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn1
active
[root@NN1 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn2
standby

[root@NN1 ~]# kill -9 2039

[root@NN1 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn2
active

[root@NN1 ~]# service hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-NN1.out
Started Hadoop namenode: [ OK ]

[root@NN1 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn1
standby

Repeat the same test on NN2 and also check the web URLs.

YARN HA set up:

Now start the ResourceManager on NN1/NN2 and the NodeManager on DN1/DN2.

[root@NN1 ~]# service hadoop-yarn-resourcemanager start
[root@NN2 ~]# service hadoop-yarn-resourcemanager start

# Start the Node Manager
[root@DN1 ~]# service hadoop-yarn-nodemanager start
[root@DN2 ~]# service hadoop-yarn-nodemanager start

Complete the Failover test here.


[root@NN1 ~]# sudo -u yarn yarn rmadmin -getServiceState rm1
active
[root@NN1 ~]# sudo -u yarn yarn rmadmin -getServiceState rm2
standby

[root@NN1 ~]# kill -9 3787

[root@NN1 ~]# sudo -u yarn yarn rmadmin -getServiceState rm2
active
[root@NN1 ~]# service hadoop-yarn-resourcemanager start
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-NN1.out
Started Hadoop resourcemanager: [ OK ]
[root@NN1 ~]# sudo -u yarn yarn rmadmin -getServiceState rm1
standby

http://192.168.1.150:8088/cluster/cluster

http://192.168.1.151:8088/cluster/cluster

Once you have configured the YARN resource manager for HA, if the active resource manager
is down or is no longer on the list, one of the standby resource managers becomes active and
resumes resource manager responsibilities. This way, jobs continue to run and complete
successfully.

Please share and like if this blog was helpful for you.

Thanks
Mandy

Commissioning/Decommissioning –
Datanode in Hadoop
by Mandeep K Sandhu.In Big data.Leave a Comment on Commissioning/Decommissioning
– Datanode in Hadoop

Commissioning of nodes means adding a new DataNode to the cluster, and decommissioning stands for
removing a node from the cluster. You can't directly add/remove a DataNode in a large, real-time
cluster as it can cause a lot of disturbance. So if you want to scale your cluster, you need
commissioning; the steps are below.

Commission:

Pre-requirements:

 Clone existing Node.


 Change the IP address and hostname – 192.168.1.155 and DN3
 Update the hosts files on all nodes – add the entry "192.168.1.155 DN3" to /etc/hosts
 Make it passwordless (copy the SSH key from the NameNode)

Configuration changes:

We need to update the include file on both the ResourceManager and the NameNode. If it's not
present, create an include file on both nodes.

Go to your NameNode and reference the include file in the hdfs-site.xml file.

[root@NN conf] cd /etc/hadoop/conf
[root@NN conf] vi hdfs-site.xml
-- Add this property
<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/includes</value>
</property>

Also update the slaves and includes files on the NameNode and add the new DataNode IP address.

[root@NN conf] vi /etc/hadoop/conf/includes
192.168.1.152
192.168.1.153
192.168.1.155

Edit the “yarn-site.xml” file where ResourceManager is running.

[root@RM conf]# vi yarn-site.xml

# Add this property with the path of the include file.
<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>/etc/hadoop/conf/includes</value>
</property>

Now update the includes file on the RM.

[root@RM conf]# cat /etc/hadoop/conf/includes
192.168.1.152
192.168.1.153
192.168.1.155

New DataNode Set up:

Copy all configuration files from NameNode and then refresh the Nodes.
[root@NN conf]# scp -r * DN3:/etc/hadoop/conf/
[root@NN conf]# sudo -u hdfs hdfs dfsadmin -refreshNodes
Refresh nodes successful

[root@RM ~]# sudo -u yarn yarn rmadmin -refreshNodes
18/05/21 23:03:34 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033

Now start the services on DataNode.

[root@DN3 ~]# service hadoop-hdfs-datanode start
[root@DN3 ~]# service hadoop-yarn-nodemanager start

Commissioning of the new DataNode is complete. Check the Hadoop admin report using the command below.

sudo -u hdfs hadoop dfsadmin -report

Decommission a Node:

To decommission a DataNode, the exclude property is needed on the NameNode side. Do the
decommission activity during off-peak hours; any process running on the decommissioned node can fail.

Note: it's very important that the include/exclude files are mutually exclusive, i.e. you can't
have the same values in both the exclude and include files.

[root@NN conf] vi hdfs-site.xml

<property>
<name>dfs.hosts.exclude</name>
<value>/etc/hadoop/conf/excludes</value>
</property>

# create the exclude file
[root@NN conf]# cat /etc/hadoop/conf/excludes
192.168.1.155

Update yarn-site.xml on the ResourceManager with the exclude path.

[root@RM conf]# vi yarn-site.xml

<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/etc/hadoop/conf/excludes</value>
</property>

# Exclude file contents
[root@RM conf]# cat /etc/hadoop/conf/excludes
192.168.1.155

Refresh the nodes.

[root@NN ~]# sudo -u hdfs hdfs dfsadmin -refreshNodes
Refresh nodes successful
[root@RM ~]# sudo -u yarn yarn rmadmin -refreshNodes
18/05/21 00:00:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033

Hadoop Balancer:

The Hadoop balancer is a built-in utility which makes sure that no DataNode is over-utilized.
When you run the balancer, it checks whether some DataNodes are under-utilized or over-utilized
and rebalances the block distribution across them. Make sure the balancer runs only during
off-peak hours in a real cluster; if you run it during peak hours, it will cause a heavy
load on the network, as it transfers a large amount of data.
hadoop balancer

Hope this post was helpful in understanding the commissioning and decommissioning of
DataNodes in Hadoop.

Thanks

Mandy

Analyze Big data with EMR


by Mandeep K Sandhu.In AWS, Big data.Leave a Comment on Analyze Big data with
EMR

Amazon Elastic MapReduce (EMR) is a fully managed cluster platform that processes and analyzes
large amounts of data. When you work with a large amount of data you eventually run into processing
problems. By using a Hadoop cluster, EMR helps reduce large processing problems by splitting big
data sets into smaller jobs and distributing them across many compute nodes. EMR does
this with big data frameworks and open source projects, which include:

 Apache Hadoop, Spark, HBase


 Presto
 Zeppelin, Ganglia, Pig, Hive, etc.

Amazon EMR is mainly used for log processing and analysis, ETL processing, clickstream
analysis, and machine learning.

EMR Architecture:

The Amazon EMR architecture contains the following three types of nodes:

 Master node:
o EMR has a single master node and does not have another master node to fail over to.
o The master node manages the resources of the cluster.
o It coordinates the distribution and parallel execution of MapReduce executables.
o It tracks and directs HDFS.
o It monitors the health of the core and task nodes.
o The ResourceManager also runs on the master node and is responsible for
scheduling the resources.
 Core nodes:
o Core nodes are slave nodes and run the tasks as directed by the master node.
o Core nodes contain data as part of HDFS or EMRFS, so the data daemons run on core
nodes and store the data.
o Core nodes also run the NodeManager, which takes instructions from the ResourceManager
on how to manage the resources.
o The ApplicationMaster is a task which negotiates resources with the ResourceManager
and works with the NodeManager to execute and monitor application containers.
 Task nodes:
o Task nodes are also controlled by the master and are optional.
o These nodes provide extra capacity to the cluster in terms of CPU
and memory.
o They can be added/removed at any time from a running cluster.
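
If you want to see how these roles map onto the actual EC2 instances of a running cluster, the EMR CLI can filter instances by group type. A small sketch, assuming a placeholder cluster ID:

# Cluster ID is a placeholder; list instances per node type
aws emr list-instances --cluster-id j-XXXXXXXXXXXXX --instance-group-types MASTER
aws emr list-instances --cluster-id j-XXXXXXXXXXXXX --instance-group-types CORE
aws emr list-instances --cluster-id j-XXXXXXXXXXXXX --instance-group-types TASK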

Storage options:

Various storage options are available for an EMR cluster.

 Instance store: local storage attached to the EC2 instance, but the data is lost after terminating the
EMR cluster. It can be used where high I/O or high IOPS is needed at low cost.
 EBS volume: EBS volumes for data storage, but the data is also lost after EMR cluster termination.
 EMRFS: an implementation of the Hadoop file system which allows the cluster to store/ingest data directly
from S3. Data can be copied from S3 to HDFS via S3DistCp (see the sketch below).
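
A minimal S3DistCp invocation, run from the master node, might look like the sketch below; the bucket and paths are placeholders rather than values from this post.

# Copy a dataset from S3 into the cluster's HDFS (bucket/paths are examples)
s3-dist-cp --src s3://my-example-bucket/input/ --dest hdfs:///input/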

Launch a Cluster:

You can launch an EMR cluster with a few clicks. Sign up for AWS and go to the AWS console.

On the Services menu click EMR, then click Create cluster. There are two options to create a
cluster.

 Quick:
 Advanced:

I'll go through the quick option first. Fill in the following information (a roughly equivalent CLI sketch follows this list).

 Cluster Name: testcluster


 Logging: an S3 bucket to keep the Hadoop logs
 Release: choose the latest version
 Applications: only 4 application sets are available, and you need to choose one of them
 Instance type: m4.large (all nodes of the same type)
 Number of instances: 2 (1 master and 1 core)
 EC2 key pair: proceeded without a key pair
 EMR role and profile: leave the defaults (EMR will create the default roles for you)
 Click Create cluster; it will take about 5 minutes to create the cluster.
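
For reference, a roughly equivalent quick launch from the AWS CLI might look like the sketch below. The release label, log bucket, and key name are assumptions for illustration; substitute your own values.

# Quick-launch sketch: 1 master + 1 core node with default EMR roles
aws emr create-cluster \
  --name "testcluster" \
  --release-label emr-5.12.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m4.large \
  --instance-count 2 \
  --use-default-roles \
  --log-uri s3://my-example-bucket/logs/ \
  --ec2-attributes KeyName=my-key-pair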
Advanced option:

When you click Create cluster, choose "Go to advanced options".

 From here you can choose the EMR release and the custom applications that you want to
install. In the quick option you can't choose specific applications but have to
select one of the 4 sets. I chose a cluster with the Hive and Spark applications.
 Secondly, you can enter Hadoop configuration, for example changing the YARN job log settings, in JSON
format.
 You can also load a JSON configuration file from S3.

 In the advanced options you can specify the VPC and subnet settings, EBS volume sizes,
etc. This is not available in the quick option.
 The advanced options allow you to choose different configurations for the master, core, and task
nodes. In the quick option all nodes are of the same type.
 Auto scaling can also be configured in the advanced options.

 Termination protection is enabled by default. This protects the cluster from being
accidentally terminated.
 But for a transient cluster (a temporary cluster that does a particular task and automatically
terminates after it), disable termination protection.
 Bootstrap scripts can be executed via the advanced EMR options.

 Authentication and encryption: these can be changed.


 Security groups: additional security rules can be applied.
The cluster will be ready within 4-5 minutes. Download the key pair as well to connect to the
cluster.

EMR log file:

Like any Hadoop environment, Amazon EMR generates a number of log files. While some
of these log files are common to any Hadoop system, others are specific to EMR. Here is a brief
introduction to the different types of log files.

 Bootstrap Action Logs: These logs are specific to Amazon EMR. It's possible to run
bootstrap actions when the cluster is created. An example of bootstrapping is
installing a Hadoop component not included in EMR, or setting certain parameters in a
configuration file. The bootstrap action logs contain the output of these actions.
 Instance State Logs: These log files contain infrastructure-resource-related information,
like CPU, memory, or garbage collection.
 Hadoop/YARN Component Logs: These logs are associated with the Hadoop daemons,
such as those related to HDFS, YARN, Oozie, Pig, or Hive. I configured the cluster to write the
YARN-related log files to S3.
 Step Logs: This type of log is specific to Amazon EMR. As we said, an EMR cluster can
run one or more steps of a submitted job. These steps can be defined when the cluster is
created, or submitted afterwards. Each step of a job generates four types of log files.
Collectively, these logs can help troubleshoot any user-submitted job.
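
Since the cluster was configured to ship its logs to S3, you can browse them with the AWS CLI; the bucket, prefix, and cluster ID below are placeholders, and the steps/ prefix layout is the usual EMR convention.

# Browse the step logs EMR ships to the configured S3 log bucket (values are examples)
aws s3 ls s3://my-example-bucket/logs/j-XXXXXXXXXXXXX/steps/ --recursive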
Loading Data into Hive:
Once the cluster is ready, connect to it as the hadoop user with the private key. Now run the
hive program and create the table. The dataset has already been uploaded to the
"s3://testcluster-emr/input/restaurant.data" bucket.

Once the Hive table is created, query it to list the ratings for the restaurants. It took almost
33 seconds to execute the query.
# Create table statement in Hive
CREATE EXTERNAL TABLE `restaurants_data`(
Age string,
gender string,
budget string,
price string,
cuisine_type string,
rating string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testcluster-emr/input/';

# Run query to get the result.
select count(*), rating from restaurants_data group by rating;

Now exit hive and get ready with Spark SQL.

Before running Spark SQL, turn off the verbose logging. Spark SQL is compatible with the Hive
metastore, which means tables created with Hive don't need to be re-created for Spark SQL.
Now cache the restaurant table created by Hive in Spark SQL. Caching takes time depending
on how big the table is. Once done, run the same SELECT query that we ran with Hive.
You'll get the query result in about 3 seconds.

Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make
queries fast.

Spark SQL setting (turn off verbose logging):
sudo sed -i -e 's/rootCategory=INFO/rootCategory=WARN/' /etc/spark/conf/log4j.properties
# start the Spark SQL shell and run the same query
spark-sql
select count(*), rating from restaurants_data group by rating
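
The post doesn't show the caching command itself; in the spark-sql shell it would typically be a CACHE TABLE statement like the sketch below (the table name comes from the Hive example above).

-- Cache the Hive table in memory, then re-run the aggregation
CACHE TABLE restaurants_data;
select count(*), rating from restaurants_data group by rating;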

Submit your Hive script as a step:

Use the Add Step option to submit your Hive script to the cluster using the console. The Hive
script has been uploaded to Amazon S3. My Hive script contains the select statement only.

Go to the cluster and scroll to the Steps section and expand it, then choose Add step.

 Step type: Hive program


 Name: Count hive table (or anything)
 Script: the S3 location (my script contains the select statement only)
 Input S3 location: the dataset (you can also upload a script to create the table and use the dataset
as the input file)
 Output: where the query writes its results
 Arguments: include the argument to allow column names that are the same as
reserved words, if any
 For Action on failure, accept the default option Continue
 Click Add.
You can see the Hive step logs and jobs once completed.
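
The same step can also be submitted from the CLI; the cluster ID and S3 path below are placeholders, not values from this post.

# Submit a Hive script as an EMR step (cluster ID and script path are examples)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=HIVE,Name=Count hive table,ActionOnFailure=CONTINUE,Args=[-f,s3://my-example-bucket/scripts/count_ratings.q]'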

This example was a simple demonstration. In real life, there may be dozens of steps with
complex logic, each generating very large log files. Manually looking through thousands of lines
of log may not be practical for troubleshooting purposes. So S3 is good candidate for placing
step logs for troubleshooting.

EMR CLI:

EMR CLI commands are listed here. You can run these from the EMR nodes.

[hadoop@ip-172-31-64-121 emr]$ aws emr list-clusters
[hadoop@ip-172-31-64-121 emr]$ aws emr list-instances --cluster-id j-3RUZVHS6CKWYC

EMR console:

The EMR cluster management console provides a GUI to monitor, resize, terminate, and use a
lot of other features. The same can be achieved with the EMR CLI.

You can monitor the cluster status, node stats, CPU, I/O, completed application jobs, etc.
Really handy.

Application history can be monitored separately as well, which shows the type of job and its
success/failure status.
Select the cluster and see the overall status from one screen. You can resize the cluster from
the Hardware tab.

You can increase the number of core nodes from here. There is only one master node and you can't change
that; there is no HA for it at the moment, but metadata can be stored in an external metastore and later be
used to restore a failed cluster. Just update the core node count and click the green tick.
Resizing takes time, and once done the core nodes' status will be in the running state.

Additional task nodes can be added from same tab.


Terminate:

To terminate the cluster, you need to turn off the termination protection.

This option is available in the summary page.


In this post, we had a quick introduction to the Amazon EMR cluster, the different launch options,
running Hive/Spark SQL queries, and the different types of log files.

Thanks

Mandy

HDFS Command line – Manage files and


directories.
by Mandeep K Sandhu.In Big data.Leave a Comment on HDFS Command line – Manage
files and directories.

In my previous blogs, we configured Hadoop single-node and cluster setups. Now let's create
files and directories on the Hadoop Distributed File System (HDFS). You can see the full command list here.
When I started with the HDFS commands I got confused by three different command syntaxes. All three
commands appear to be the same but have some differences, as explained below.

 hadoop fs {args}

FS relates to a generic file system which can point to any file systems like local, HDFS etc. So
this can be used when you are dealing with different file systems such as Local FS, (S)FTP, S3,
and others.
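
For example, a quick way to see the generic nature of hadoop fs is to point it at a non-HDFS scheme; the local path below is just an example.

# hadoop fs works against any configured filesystem scheme, e.g. the local file system
hadoop fs -ls file:///tmp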

 hadoop dfs {args}

dfs is very specific to HDFS; it works only for operations related to HDFS. It has been
deprecated and we should use hdfs dfs instead.

 hdfs dfs {args}

Same as the 2nd, i.e. it works for all operations related to HDFS, and it is the recommended
command instead of hadoop dfs.

Create an OS user and test the various commands.

[root@cm ~]# useradd mandy
[root@cm ~]# usermod -G hdfs mandy

hadoop fs:

[root@cm ~]# su - hdfs
[hdfs@cm ~]$ hadoop fs -mkdir /user/mandy
[hdfs@cm ~]$ hadoop fs -chown mandy:hdfs /user/mandy

[hdfs@cm ~]$ su - mandy
[mandy@cm ~]$ cd /home/mandy
[mandy@cm ~]$ vi text_hadoop.txt
[mandy@cm ~]$ hadoop fs -put text_hadoop.txt
[mandy@cm ~]$ hadoop fs -ls /user/mandy
Found 1 items
-rw-r--r-- 3 mandy hdfs 12 2018-05-24 01:30 /user/mandy/text_hadoop.txt

hdfs dfs:

[hdfs@cm ~]$ hdfs dfs -mkdir /user/d1
[hdfs@cm ~]$ hdfs dfs -chown mandy:hdfs /user/d1

[hdfs@cm ~]$ su - mandy
[mandy@cm ~]$ cd /home/mandy
[mandy@cm ~]$ cp text_hadoop.txt text_hadoop.txt_old
[mandy@cm ~]$ hdfs dfs -put text_hadoop.txt_old
[mandy@cm ~]$ hdfs dfs -ls /user/mandy
Found 2 items
-rw-r--r-- 3 mandy hdfs 12 2018-05-24 01:30 /user/mandy/text_hadoop.txt
-rw-r--r-- 3 mandy hdfs 12 2018-05-24 01:41 /user/mandy/text_hadoop.txt_old

hadoop dfs:

[mandy@cm ~]$ hadoop dfs -ls /user/mandy
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
Found 2 items
-rw-r--r-- 3 mandy hdfs 12 2018-05-24 01:30 /user/mandy/text_hadoop.txt
-rw-r--r-- 3 mandy hdfs 12 2018-05-24 01:41 /user/mandy/text_hadoop.txt_old

Copy a file from HDFS to the local file system:


# syntax hdfs dfs -copyToLocal
[admin@cm jars]$ hdfs dfs -copy

1 # syntax hdfs dfs -copyToLocal

2 [admin@cm jars]$ hdfs dfs -copyToLocal /user/admin/out/part-r-00000 /home/admin/

Thanks
Mandy

Configure HA – HiveMetastore and Load


Balancing for HiveServer2
by Mandeep K Sandhu.In Big data.Leave a Comment on Configure HA – HiveMetastore
and Load Balancing for HiveServer2

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing
data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in
various databases and file systems that integrate with Hadoop.

Configuring High Availability for Hive requires the following components to be fail proof:

 Hive MetaStore – RDBMS (MySQL)


 ZooKeeper
 Hive MetaStore Server
 HiveServer2

Set up MySQL db:

First of all, set up the Hive metastore as a MySQL database. Here are the steps:
[root@CM ~]# rpm -ivh "https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm"
[root@CM ~]# yum install mysql mysql-server
[root@CM ~]# systemctl start mysqld
[root@CM log]# grep password /var/log/mysqld.log
[root@CM log]# /usr/bin/mysql_secure_installation
-- Provide new password
-- remove anonymous user - Y
-- disallow root login remotely - N
-- reload Privileges - Y

Now log in to the MySQL database, create the hive database and user, and grant the privileges.

[root@cm ~]# mysql -u root
mysql> create database hive;
mysql> create user 'hive' identified by 'hadoop_12';
mysql> grant all on hive.* to hive;
mysql> flush privileges;
[root@cm ~]# mysql -u hive -p

Install Hive:

Add the service to the cluster through Cloudera Manager.

Assign nodes as below (CM – master node).

Test the MySQL database connection.


On the next page, keep the default configuration.

Service addition is in progress, and the Hive service is added successfully.

Test hive:

[root@cm ~]# hive
hive> show databases;
OK
default
Time taken: 0.819 seconds, Fetched: 1 row(s)

Enabling High Availability for Hive Metastore Server:

 Select the Hive service -> Configuration


 Select Scope -> Hive Metastore Server and Category -> Advanced.
 Locate the Hive Metastore Delegation Token Store property, or search for it by typing its
name in the search box.
 Select org.apache.hadoop.hive.thrift.DBTokenStore.
 Click Save Changes.

 Click on the Instances tab and add a role instance.


 Click the text field under Hive Metastore Server.
 Click on Select Hosts for Hive Metastore Server.
 Choose another host (RM) to configure the Hive Metastore Server on.

 Click Finish. You should now see the new host added as a Hive Metastore Server.
 Restart the stale configurations.

Notice that you now have multiple instances of Hive Metastore Server.

Test HA set up for Hive Meta Store:

SSH to any DataNode. Connect to Hiveserver2 using Beeline.

# beeline -u “jdbc:hive2://cm:10000”

Issue "show databases".

Now from Cloudera Manager, select the first Hive Metastore Server and stop it.
Then stop the second Hive Metastore Server as well; this time the beeline command should fail, which is normal.

Configure Load Balancing for HiveServer2:

To enable high availability for multiple HiveServer2 hosts, configure a load balancer to manage
them. To increase stability and security, configure the load balancer on a proxy server.

Add a couple of HiveServer2 role instances; you should now see the new hosts added as HiveServer2.
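
The post doesn't show the proxy configuration itself; a minimal sketch, assuming HAProxy is installed on the proxy host and that the two HiveServer2 instances run on cm and rm, could look like this (all of these names and paths are assumptions):

# Hypothetical HAProxy round-robin front end for the two HiveServer2 instances
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
listen hiveserver2
    bind *:10000
    mode tcp
    balance roundrobin
    server hs2_cm cm:10000 check
    server hs2_rm rm:10000 check
EOF
service haproxy restart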

 Go to the Hive service.


 Click the Configuration tab -> Scope > HiveServer2 and Category -> Advanced
 Locate the HiveServer2 advanced Snippet property or search for it by typing its name in
the Search box.

The clients connecting to HiveServer2 now go through Zookeeper.


beeline -u "jdbc:hive2://dn1:2181,dn2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

The connection gets routed to the HiveServer2 instances in a round robin fashion.

Mandy!!!

Create/Restore a snapshot of an HDFS


directory
by Mandeep K Sandhu.In Big data.Leave a Comment on Create/Restore a snapshot of an
HDFS directory

In this tutorial, we focus on HDFS snapshots. Common use cases of HDFS snapshots include
backups and protection against user errors.

Create a snapshot of HDFS directory:

HDFS directories must be enabled for snapshots in order for snapshots to be created. Steps are:

 From the Clusters tab -> select HDFS service.


 Go to the File Browser tab. Select the file directory.
 Verify the Snapshottable Path and click Enable Snapshots.
With Command line:

su - hdfs
hdfs dfsadmin -allowSnapshot /user/mandy/snapshot_test
# Another method
hadoop dfsadmin -allowSnapshot /user/mandy/snapshot_test

Once the directory has been enabled for snapshots, take a snapshot.

 To take a snapshot, click Take Snapshot, specify the name of the snapshot, and click
Take Snapshot.
The snapshot is added to the snapshot list.

With Command line:

hdfs dfs -createSnapshot /user/mandy/snapshot_test

# List the snapshots
[mandy@cm ~]$ hdfs lsSnapshottableDir
drwxr-xr-x 0 mandy hdfs 0 2018-05-24 02:40 1 65536 /user/mandy/snapshot_test
[mandy@cm ~]$ hdfs dfs -ls /user/mandy/snapshot_test/.snapshot
Found 1 items
drwxr-xr-x - mandy hdfs 0 2018-05-24 02:40 /user/mandy/snapshot_test/.snapshot/s20180524-024025.045

Remove a file from the snapshottable directory:

Now, let's "accidentally" remove a file inside the snapshottable directory:

[mandy@cm ~]$ hdfs dfs -rm -r /user/mandy/snapshot_test/snap1.txt
18/05/24 03:05:51 INFO fs.TrashPolicyDefault: Moved: 'hdfs://cluster-ha/user/mandy/snapshot_test/snap1.txt' to trash at: hdfs://cluster-ha/user/mandy/.Trash/Current/user/mandy/snapshot_test/snap1.txt
Recover the file from snapshot:

To restore a snapshot, click the drop-down button near the folder name again and select Restore from
snapshot.

Select the snapshot and choose the restore method.

Restore is in progress.
With Command line:

Recovering from the snapshot is as simple as copying the file.

[mandy@cm ~]$ hdfs dfs -cp /user/mandy/snapshot_test/.snapshot/s20180524-030219.005 /user/mandy/snapshot_test

You can read the content of the file or list the file.

Disable Snapshot:

Try to disable snapshots on the snapshottable directory by typing the following command as the hdfs user. As
expected, the operation fails because the directory is snapshottable and already contains a
snapshot. Remove the snapshot first and retry.
Delete the snapshot.

From command line.


[hdfs@cm root]$ hdfs dfsadmin -disallowSnapshot /user/mandy/snapshot_test
disallowSnapshot: The directory /user/mandy/snapshot_test has snapshot(s). Please redo the operation after removing all the snapshots.

# Delete the snapshot first
[mandy@cm ~]$ hdfs dfs -deleteSnapshot /user/mandy/snapshot_test/ s20180524-024025.045
# Now you can disallow snapshots on the directory
[hdfs@cm root]$ hdfs dfsadmin -disallowSnapshot /user/mandy/snapshot_test

Thanks

Mandy

Decommission/Recommission – DataNode in
Cloudera
by Mandeep K Sandhu.In Big data.Leave a Comment on Decommission/Recommission –
DataNode in Cloudera

Commissioning nodes stands for adding new nodes to the current cluster that runs your Hadoop
framework. In contrast, decommissioning nodes stands for removing nodes from your cluster.
This is a very useful feature for handling node failure during the operation of a Hadoop cluster without
stopping all the Hadoop nodes in your cluster.

Decommission:

You can't decommission a DataNode, or a host with a DataNode, if the number of DataNodes equals
the replication factor. If you attempt to decommission a DataNode in such a situation, the
decommission process will not complete; you have to abort it and change the replication factor.
In my case, I have two DataNodes, and decommissioning one will leave only one DataNode. So before the
decommission process, change the replication factor to 1.

The same can be done via the command line.

hdfs dfs -setrep -R -w 1 /

Now restart the stale services.


Now you can decommission the DataNode.

 Go to Hosts
 Select the host or hosts that you want to decommission
 Click on Action -> Select "Hosts Decommission/Suppress Alert"

 Host decommission is in progress and will take some time.


 Once the host is decommissioned, the "Commission State" of the host will change to
"Decommissioned".

Recommission:

Recommission is applicable only for hosts decommissioned using Cloudera Manager.

 Go to CM -> Hosts -> select the decommissioned hosts -> Actions for Selected -> Hosts
Recommission -> confirm.

Remove host from cluster:

The host which is decommissioned can now be removed from the cluster. Remove the roles from the
host and leave the management role.
Remove host from Cloudera Manager:

 Go to hosts -> select host to delete


 Stop agent on the host first.

service cloudera-scm-agent stop

 Now click on Remove Host from Cloudera Manager.

Add New Host to Cluster:


A new host can be added through the Add Host wizard. Select the cluster -> Add Hosts wizard and follow
the steps.

 Search for the host by IP address or hostname and select it.

 Select the type of CDH software installation. I selected the one matching the existing setup.

 The above step will distribute and activate the CDH software.


 Create a new template, select the roles, and apply it to this host.
 Wait for the deployment step to complete and see the roles started on the new host.
 The addition is now complete.

Additional role instances can be added to the new host.

Rebalance the cluster:

In HDFS, the blocks of files are distributed among the DataNodes as per the replication factor.
Whenever you add a new DataNode, it starts receiving and storing the blocks of new
files only. Though this sounds alright, the cluster is not balanced from an administrative
point of view. HDFS provides a balancer utility that analyzes block placement and balances data
across the DataNodes. You can run it via Cloudera Manager after the addition of a new DataNode.
 Go to the HDFS service.
 Ensure the service has a Balancer role.
 Select Actions > Rebalance.
 Click Rebalance to confirm.

You can also do the same from command line.

hdfs balancer
# Set a different threshold
hdfs balancer -threshold 5

Mandy
