Install Apache Hadoop Using Cloudera
Install Apache Hadoop – Single Node RHEL 7
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data field and provides high-throughput access to application data.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you
can play around with the software and learn more about it.
Environment: this blog has been tested with RHEL 7, Hadoop 2.7.3 and Java 7.
After the VM is set up, add a non-sudo user dedicated to Hadoop; this user will be used to
configure Hadoop. The following commands add the group hadoop and the user hduser to the VM.
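The exact commands were not preserved in this copy; a minimal sketch, assuming the group is called hadoop and the user hduser as described above:

# run as root: create the hadoop group and the hduser account, then set its password
groupadd hadoop
useradd -g hadoop hduser
passwd hduser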
Apache Software:
Download the Apache Hadoop software from the official site.
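A sketch of the download step, assuming Hadoop 2.7.3 (the version used later in this post) fetched from the Apache archive mirror:

# as hduser: download and unpack Hadoop 2.7.3 into the home directory
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xvf hadoop-2.7.3.tar.gz
ln -s hadoop-2.7.3 hadoop    # matches the /home/hduser/hadoop path used below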
Install Java:
Hadoop is written in Java, hence before installing Apache Hadoop we need to install Java
first. To install Java on your system, first download the JDK from the Oracle website.
wget "http://download.oracle.com/otn-pub/java/jdk/jdk-7u75-linux-x64.tar.gz"
[root@cdhs ~]# mv jdk-7u75-linux-x64.tar.gz /home/hduser
[root@cdhs hduser]# su - hduser
Last login: Thu May 3 14:44:38 NZST 2018 on pts/0
[hduser@cdhs ~]$ tar -xvf jdk-7u75-linux-x64.tar.gz
[hduser@cdhs ~]$ ln -s jdk1.7.0_75 jdk
Once Java is installed on your system, you can check the Java version using the following
command.
[hduser@cdhs ~]$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
Now edit the .bash_profile file using your favorite editor and add the Hadoop and Java home directories.
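A sketch of the entries, assuming the jdk symlink and hadoop directory created above sit in /home/hduser:

# additions to ~/.bash_profile for the hduser account
export JAVA_HOME=/home/hduser/jdk
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin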
Once done, you can check whether the environment variables are set by running the
following commands.
echo $JAVA_HOME
echo $HADOOP_HOME
Configuring Hadoop:
The first file is core-site.xml, which contains the configuration of the port number used by HDFS.
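A minimal core-site.xml sketch for a single node, assuming the common HDFS port 9000 on localhost (adjust the host and port to your set-up):

# write $HADOOP_HOME/etc/hadoop/core-site.xml
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF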
Now create two directories to store the NameNode and DataNode data using the following
commands.
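The commands themselves were lost in this copy; a sketch, assuming the directories live under /home/hduser/hadoopdata (a hypothetical path) and are referenced from hdfs-site.xml:

mkdir -p /home/hduser/hadoopdata/hdfs/namenode
mkdir -p /home/hduser/hadoopdata/hdfs/datanode

# point hdfs-site.xml at them and keep a single replica on this one-node cluster
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:///home/hduser/hadoopdata/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:///home/hduser/hadoopdata/hdfs/datanode</value></property>
</configuration>
EOF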
Now configure SSH keys for the new user so that the user can log in to Hadoop securely
without a password.
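A sketch of the passwordless-SSH set-up for hduser on the single node:

# as hduser: create a key pair and authorize it for logins to localhost
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost hostname    # should not prompt for a password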
We have now configured Hadoop to work as a single-node cluster. Next, initialize the HDFS
file system by formatting the NameNode directory using the following command.
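The format step is typically:

# run once, as hduser
hdfs namenode -format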
Now we can start Hadoop cluster, navigate to $HADOOP_HOME/sbin directory using the
following command.
cd /home/hduser/hadoop/sbin
[hduser@cdhs sbin]$ start-dfs.sh
Starting namenodes on [cdhs]
cdhs: starting namenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-namenode-cdhs.out
10.0.2.5: starting datanode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-datanode-cdhs.out
Starting secondary namenodes [cdhs]
cdhs: starting secondarynamenode, logging to /home/hduser/hadoop-2.7.3/logs/hadoop-hduser-secondarynamenode-cdhs.out
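The jps output below also lists the ResourceManager and NodeManager, so the YARN daemons were presumably started as well; a sketch:

# start YARN from the same sbin directory
[hduser@cdhs sbin]$ start-yarn.sh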
You can check the status of the services using the following command.
[hduser@cdhs ~]$ jdk/bin/jps
3542 ResourceManager
3098 NameNode
3648 NodeManager
You can now browse the Apache Hadoop services through your browser. By default Apache
Hadoop namenode service is started on port 50070. Go to following address using your favorite
browser.
http://10.0.2.5:50070
To view Hadoop clusters and all applications, browse the following address into your browser.
http://10.0.2.5:8088
http://10.0.2.5:8042
http://10.0.2.5:50090/
If you have a problem opening any of the URLs above, disable the iptables service.
Conclusion:
In this tutorial we have learnt how to install Apache Hadoop on a single node in pseudo-distributed mode.
From my previous blog, we learnt how to set up a Hadoop Single Node Installation. Now, I will
show how to set up a Hadoop Multi Node Cluster. A Multi Node Cluster in Hadoop contains
two or more DataNodes in a distributed Hadoop environment. This is practically used in
organizations to store and analyse their Petabytes and Exabytes of data.
Here in this blog, we take three machines to set up the multi-node cluster – MN and DN1/DN2.
The master node (MN) will run the NameNode and ResourceManager daemons.
The data nodes (DN1 and DN2) store the actual data and provide processing power to run the
jobs. Both hosts will run the DataNode and NodeManager daemons.
Software Required:
RHEL 7 – Set up MN and DN1/DN2 with the RHEL 7 operating system – Minimal Install.
Hadoop-2.7.3
JAVA 7
SSH
First of all, we have to edit the hosts file in the /etc/ folder on the master node (MN) and specify
the IP address of each system followed by its hostname.
# vi /etc/hosts
Enter the following lines in the /etc/hosts file.
192.168.1.77 MN
192.168.1.78 DN1
192.168.1.79 DN2
Download and extract the Java Tar File on Master node. And Similarly download hadoop 2.7.3
Package on Master Node (MN) and extract the Hadoop tar File.
[root@MN ~]# wget "https://arc…
[root@MN ~]# wget "http://dow…
[root@MN ~]# mv hadoop-2.7.3…
[root@MN ~]# mv jdk-7u75-linux…
Add the Hadoop/Java binaries to your PATH. Edit /home/hadoop/.bash_profile on the master node
and add the following lines. Then save the file and close it.
vi /home/hadoop/.bash_profile
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/home/hduser/jdk
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
To apply these changes to the current terminal, execute the source command.
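For example:

source ~/.bash_profile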
To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
[hduser@MN ~]$ javac -version
javac 1.7.0_75
Log in to MN as the hadoop user and generate an SSH key. Copy the generated SSH key to the
master node's authorized keys.
Then copy the master node's SSH key to the authorized keys on DN1 and DN2.
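A sketch, assuming the hadoop user exists on all three machines:

# on MN, as the hadoop user
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-copy-id hadoop@DN1
ssh-copy-id hadoop@DN2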
Configure Hadoop:
Now edit the configuration files in the hadoop/etc/hadoop directory on the master node. Set the
NameNode location.
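The NameNode location is normally set in core-site.xml; a sketch, assuming port 9000 on the master host MN:

cat > ~/hadoop/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://MN:9000</value>
  </property>
</configuration>
EOF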
Edit hdfs-site.xml on the master node to set the NameNode and DataNode file locations.
The last property, dfs.replication, indicates how many times data is replicated in the cluster. You
can set it to 2 to have all the data duplicated on the two nodes. Don't enter a value higher than the
actual number of data nodes.
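A sketch of hdfs-site.xml on the master, assuming data directories under /home/hduser/hadoopdata (hypothetical paths) and the replication factor of 2 discussed above:

cat > ~/hadoop/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:///home/hduser/hadoopdata/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:///home/hduser/hadoopdata/hdfs/datanode</value></property>
</configuration>
EOF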
Copy mapred-site.xml from the template in the configuration folder and then edit mapred-site.xml on the
master node. Set yarn as the default framework for MapReduce operations.
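A sketch of those two steps:

cd ~/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
cat > mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF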
Configure YARN:
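The original YARN settings were not preserved here; a minimal yarn-site.xml sketch, assuming the ResourceManager runs on MN:

cat > ~/hadoop/etc/hadoop/yarn-site.xml <<'EOF'
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>MN</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF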
Configure Slaves
The slaves file is used by the startup scripts to start the required daemons on all nodes. Edit
~/hadoop/etc/hadoop/slaves as shown below.
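With the two data nodes described above, the file simply contains:

# ~/hadoop/etc/hadoop/slaves
DN1
DN2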
Format HDFS:
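As on the single node, the format step is run once on the master; a sketch:

hdfs namenode -format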
Now start the Hadoop services by executing the following commands. This starts the NameNode and
SecondaryNameNode on the master node (MN), and the DataNode on DN1 and DN2, according to the
configuration in the slaves file.
In addition to the HDFS daemons, you should see a ResourceManager on MN and a NodeManager
on DN1 and DN2.
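The start commands themselves were lost in this copy; they are presumably the same scripts used in the single-node post:

# on MN, as the hadoop user
start-dfs.sh     # NameNode + SecondaryNameNode on MN, DataNodes on DN1/DN2
start-yarn.sh    # ResourceManager on MN, NodeManagers on DN1/DN2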
Check all the daemons running on both master and slave machines.
Web Interface:
At last, open the browser and go to MN:50070/dfshealth.html from your master machine; this
will give you the NameNode interface.
To view Hadoop clusters and all applications, browse the following address into your browser.
http://192.168.1.77:8088/cluster
http://192.168.1.78:8042/node
http://192.168.1.79:8042/node
http://192.168.1.77:50090/status.html
I hope you have successfully installed a Hadoop multi-node cluster. If you are facing any
problem, you can comment below and we will reply shortly.
Mandy!!!
I have written a couple of blogs on setting up Hadoop in single-node and multi-node cluster
environments, and deploying, configuring and running a Hadoop cluster manually is rather time-
and cost-consuming. Here's a helping hand to create a fully distributed Hadoop cluster with Cloudera
Manager. In this blog, we'll see how fast and easy it is to install a Hadoop cluster with Cloudera
Manager.
Software used:
CDH5
Cloudera Manager – 5.7
OS – RHEL 7
VirtualBox – 5.2
Prepare Servers:
Disable Selinux:
vi /etc/selinux/config
SELINUX=disabled
Setup NTP:
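The NTP commands were not preserved; a sketch for RHEL 7, assuming the machines can reach the default NTP pool:

yum install -y ntp
systemctl enable ntpd
systemctl start ntpd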
Disable firewall:
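On RHEL 7 this is typically:

systemctl stop firewalld
systemctl disable firewalld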
Edit the hosts file in the /etc/ folder on the Cloudera Manager node (CM) and specify the IP address of each
system followed by its hostname. Each machine needs a static IP address and all the VMs
should be able to ping each other.
Now clone the machines to DN1/DN2. Update the IP address and hostname. Test the SSH
without password and display the hostname.
Hadoop is written in Java so we need to set up Java. Install the Oracle Java Development Kit
(JDK) as below on all nodes.
Before starting the agent, update the host entry in the agent config files.
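A sketch, assuming the standard agent config location and that the Cloudera Manager server runs on the host named CM:

# point every agent at the CM server and restart it
sed -i 's/^server_host=.*/server_host=CM/' /etc/cloudera-scm-agent/config.ini
systemctl restart cloudera-scm-agent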
Then read and accept the license agreement and choose “Cloudera Enterprise Data Hub Edition
Trial” on the next page. After that you’ll be offered to set up a new cluster.
As you have already installed the Agents, you can see the hosts lists. Select all hosts.
Press continue and select the CDH version and select parcel method.
Press “Continue” and wait for distribution and activation.
Wait for Cluster Inspector to finish the inspection and you’ll see all installed components.
Install Hadoop cluster:
Then you can choose the cluster roles distribution across the cluster. Accept the default options.
You can see the summary view via “Host view detail”.
The next part is database setup. Provide the database access details.
Summary
Cloudera Manager makes the creation and maintenance of Hadoop clusters significantly easier than
managing them manually. Following these instructions it is possible to create a Hadoop
cluster in less than one hour, whereas manual configuration and deployment could take a few hours
or even days.
This section explains how to set up a local yum repository to install CDH on the machines in
your cluster. There are a number of reasons you might want to do this, for example:
Servers in your cluster don't have access to the internet. You can still use yum to do an
installation on those machines by creating a local yum repository.
To make sure that each node will have the same version of software installed.
Local repository is more efficient.
Create a local web publishing directory and install a web server such as Apache httpd on the
machine that hosts the RPMs, then start the httpd server.
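A sketch of those steps, assuming the repo is published under /var/www/html/yum/cm to match the URL below:

yum install -y httpd createrepo
mkdir -p /var/www/html/yum/cm
# copy the Cloudera Manager RPMs into the directory, then build the repo metadata
createrepo /var/www/html/yum/cm
systemctl start httpd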
http://192.168.1.83/yum/cm/
Now create the cm.repo file and list the available repo.
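A sketch of the repo file, assuming the web server above at 192.168.1.83 and a hypothetical repo id:

cat > /etc/yum.repos.d/cm.repo <<'EOF'
[cloudera-manager]
name=Cloudera Manager (local)
baseurl=http://192.168.1.83/yum/cm/
gpgcheck=0
enabled=1
EOF
yum repolist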
Set up Parcels:
192.168.1.83/parcels/
Download the installer for CM and change the permissions. Then run the CM installer with the option
--skip_repo_package=1 to install Cloudera Manager from the local repo.
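A sketch of running the installer against the local repo:

chmod u+x cloudera-manager-installer.bin
./cloudera-manager-installer.bin --skip_repo_package=1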
Installation complete.
http://cm:7180/cmf/login or http://192.168.1.83:7180/cmf/login
Login to the page with admin/admin. Accept the licence and accept cloudera Enterprise Data
hub.
The next part is cluster setup, which I will explain in future blogs.
Thanks
Mandy
ClouderaManager – Installation on Google Cloud
In this post, I am going to tell you about how to set-up a Hadoop cluster on Google Cloud
Platform.
First of all, you have to register on Google Cloud. It's easy: just sign in with your Gmail id and
fill in your credit card details. Once registered you will get a 1-year free subscription with 300 USD
of credit on Google Cloud.
Create a new project. Give a name to your project or leave as it is provided by Google.
Now click on the icon in the top left corner of your homepage. A list of the products and
services which Google Cloud provides will appear. Click on Compute Engine and
then click on VM instances. The VM Instances page will open; select Create Instance.
Create four machines as below.
First of all, click on SSH (SSH is a network protocol that allows you to access a remote
computer in a secure way). A terminal will open. Now do the following steps:
Disable SELinux –
vi /etc/selinux/config
SELINUX=disabled
Setup SSHD –
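What exactly "Setup SSHD" involved isn't shown here; a common sketch for Cloudera Manager's SSH-based install on cloud images (an assumption on my part) is to allow root login with a password:

# /etc/ssh/sshd_config adjustments, then restart sshd
sed -i 's/^PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/^PasswordAuthentication.*/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd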
ClouderaManager Install:
Download the cloudera-manager-installer.bin file on the CM node, change the permissions and
run the installer.
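A sketch, assuming the installer is fetched from the public Cloudera archive path that was current for CM 5.x:

wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin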
The installation part is simple and straightforward. Accept the Licence for both cloudera and
Oracle JDK.
Click Next and install JDK and CDH.
Once the installation is complete, open the URL using the external IP address of the machine where you ran
the installer, which is CM in my case.
http://35.197.189.176:7180/cmf/login
Login to the page with admin/admin. Accept the licence and accept cloudera Enterprise Data
hub.
Next screen will show all the services which are available. Press continue.
Next page you need to specify the Host IP Address/Hostname for all your instances and then
click search.
After all the hosts have been searched, it will display the following page.
The next page which will emerge is where you select the repository.
Next page will be of enabling Single User Mode. Just click Continue. No need to enable that.
Provide the SSH login credentials. Enter the password which you set while configuring the
server and hit Continue.
Setup Cluster:
Now you have come on the Cluster Setup page. You can select which services you want to
install. At the bottom, you will find Custom Service through which you can choose whichever
services you want to assign to your cluster. I selected HDFS and YARN and click continue.
On the role assignment page, assign roles to the different nodes and view the final distribution by host. I
assigned them as below:
Set up the database for the Reports Manager. I used the embedded DB, but in production use a custom database
like MySQL/MSSQL. Click on Test Connection and then Continue.
Keep the default settings for block size and data directories and press Continue.
Now it will start all the services on your cluster and will also take some time. This concludes the
cluster installation part.
Mandy
In earlier releases, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each
cluster had a single NameNode, and if that machine or process became unavailable, the cluster as
a whole would be unavailable until the NameNode was either restarted or brought up on a
separate machine. The Secondary NameNode did not provide failover capability. The HA
architecture solves this problem of NameNode availability by allowing us to have two
NameNodes in an active/passive configuration. The NameNode is the centerpiece of an HDFS
file system.
To enable NameNode HA in Cloudera, you must ensure that the two nodes have the same
configuration in terms of memory, disk, etc. for optimal performance. Here are the steps.
ZooKeeper:
Select cluster -> Action -> Add Service and a pop-up will appear.
Add ZooKeeper from the listed services.
The next step initializes and starts the ZooKeeper services. Click Next and the ZooKeeper service is
successfully added to the cluster.
HDFS HA set up:
Now select the HDFS service from the cluster and view the status page.
Check the ZooKeeper status: one node will be the leader and the others followers.
HA Test:
The HA test is very simple and quick. Check the status of the NameNode services and select the active
NameNode -> Actions for Selected -> Stop.
Now check the status; after a few seconds the standby node will come up as the active node.
Now start the stopped NameNode service and it will come up as the standby NameNode after a while.
In this blog, I am going to talk about how to configure and manage a high-availability HDFS
(CDH 5.12.0) cluster. In earlier releases, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became
unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted
or brought up on a separate machine. The Secondary NameNode did not provide failover
capability.
The HA architecture solved this problem of NameNode availability by allowing us to have two
NameNodes in an active/passive configuration. So, we have two running NameNodes at the
same time in a High Availability cluster:
Active NameNode
Standby/Passive NameNode.
We can implement the Active and Standby NameNode configuration in the following two ways:
using the Quorum Journal Manager (QJM) or using shared NFS storage. Using the QJM is the preferred
method for achieving high availability for HDFS; read here to know more about the QJM and NFS methods.
In this blog, I'll implement the HA configuration with quorum-based storage, and here are the IP addresses
and corresponding machine names/roles.
NameNode machines – NN1/NN2 of equivalent hardware and spec
JournalNode machines – The JournalNode daemon is relatively lightweight, so these
daemons can reasonably be collocated on machines with other Hadoop daemons, for
example the NameNodes, the JobTracker, or the YARN ResourceManager. There must be at
least three JournalNode daemons, since edit log modifications must be written to a
majority of JournalNodes. So three JNs run on NN1, NN2 and the MGT server.
Note that when running with N JournalNodes, the system can tolerate at most (N – 1) / 2
failures and continue to function normally.
The ZooKeeperFailoverController (ZKFC) is a ZooKeeper client that also monitors and
manages the NameNode status. Each NameNode also runs a ZKFC. The ZKFC is
responsible for periodically monitoring the health of the NameNodes.
Resource Manager Running on same NameNode NN1/NN2.
Two Data Nodes – DN1 and DN2
Pre-requirements:
First of all, we have to edit the hosts file in the /etc/ folder on the NameNode (NN1) and specify the IP address
of each system followed by its hostname. Each machine needs a static IP address and all the
VMs should be able to ping each other.
vi /etc/hosts
192.168.1.150 NN1
192.168.1.151 NN2
192.168.1.152 DN1
192.168.1.153 DN2
192.168.1.154 MGT
All the VMs are set up with the RHEL 7 operating system, and the firewall restrictions are disabled
on all of them.
Setup Java:
Hadoop is written in Java so we need to set up Java first. Install the Oracle Java Development
Kit (JDK) as below on all nodes.
java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
If you want to create your own YUM repository, download the appropriate repo file, create the
repo, distribute the repo file as described under:
http://archive.cloudera.com/cdh5/repo-as-tarball/5.13.0/
[root@NN1 ~] vi local-repos.repo
[cloudera-cdh5]
name=cloudera
baseurl=file:///root/cdhrepo/cdh/5
gpgcheck=0
enabled=1
[root@NN1 ~] yum repolist
Now install the components below on NameNode 1/2 (NN1/NN2) and disable their run-level
auto-start.
The management server (MGT) will be installed with the JournalNode, ZooKeeper and client services.
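The exact package list was not preserved; a sketch using the CDH5 package names, assuming the role layout described above:

# on NN1 and NN2
yum install -y hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-hdfs-journalnode \
               hadoop-yarn-resourcemanager zookeeper-server
# on the MGT server
yum install -y hadoop-hdfs-journalnode zookeeper-server hadoop-client
# stop the services from auto-starting at boot until HA is configured
for svc in hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-hdfs-journalnode; do chkconfig $svc off; done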
Log in to NN1 as the root user and generate an SSH key. Copy the generated SSH key to the
authorized keys of all the nodes.
Configuration Details:
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml
configuration file. Choose a logical name for this nameservice, for example “ha-cluster”. Open
the core-site.xml file from the Active Name node and add the below properties.
cd /etc/hadoop/conf
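A sketch of core-site.xml, assuming the nameservice is called ha-cluster and ZooKeeper runs on NN1, NN2 and MGT (this sketch rewrites the whole file):

cat > /etc/hadoop/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ha-cluster</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>NN1:2181,NN2:2181,MGT:2181</value>
  </property>
</configuration>
EOF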
Open the hdfs-site.xml file and add the DataNode directory path in the dfs.datanode.data.dir property,
along with the other HA properties.
[root@NN1 conf]# vi hdfs-site.xml
(the configuration block with dfs.namenode.name.dir and the other HA properties was cut off here – see the sketch below)
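A sketch of the usual quorum-journal HA settings, assuming the nameservice ha-cluster, hypothetical local data paths, and JournalNodes on NN1/NN2/MGT:

cat > /etc/hadoop/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.namenode.name.dir</name><value>file:///data/dfs/nn</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:///data/dfs/dn</value></property>
  <property><name>dfs.nameservices</name><value>ha-cluster</value></property>
  <property><name>dfs.ha.namenodes.ha-cluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.ha-cluster.nn1</name><value>NN1:8020</value></property>
  <property><name>dfs.namenode.rpc-address.ha-cluster.nn2</name><value>NN2:8020</value></property>
  <property><name>dfs.namenode.http-address.ha-cluster.nn1</name><value>NN1:50070</value></property>
  <property><name>dfs.namenode.http-address.ha-cluster.nn2</name><value>NN2:50070</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://NN1:8485;NN2:8485;MGT:8485/ha-cluster</value></property>
  <property><name>dfs.journalnode.edits.dir</name><value>/data/dfs/jn</value></property>
  <property><name>dfs.client.failover.proxy.provider.ha-cluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
  <property><name>dfs.ha.fencing.methods</name><value>shell(/bin/true)</value></property>
</configuration>
EOF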
Edit mapred-site.xml to set yarn as the default framework for MapReduce operations, and add the other YARN HA properties.
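A compact sketch of the MapReduce/YARN side, assuming the ResourceManagers run on NN1 and NN2 as described above:

cat > /etc/hadoop/conf/mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
EOF

cat > /etc/hadoop/conf/yarn-site.xml <<'EOF'
<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>yarn-ha</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>NN1</value></property>
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>NN2</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>NN1:2181,NN2:2181,MGT:2181</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF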
Once all the files are updated on NN1, copy the config files from NN1 to all the other nodes.
Deploying ZooKeeper:
In the conf directory you have a zoo_sample.cfg file; create zoo.cfg from it. Open the zoo.cfg file,
add a custom directory path to the dataDir property if you want, and add the details of the remaining
nodes to the zoo.cfg file. I kept the default directory; see the sketch below. Then copy the ZooKeeper
conf file to NN2 and the MGT server.
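A sketch of the zoo.cfg entries and the per-node myid file, assuming the CDH default dataDir and the three ZooKeeper hosts above:

cat >> /etc/zookeeper/conf/zoo.cfg <<'EOF'
server.1=NN1:2888:3888
server.2=NN2:2888:3888
server.3=MGT:2888:3888
EOF
# each host writes its own id into the dataDir (1 on NN1, 2 on NN2, 3 on MGT)
echo 1 > /var/lib/zookeeper/myid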
Format the NameNode on NN1 only if this is a new cluster. If you are converting an existing cluster from
non-HA to HA, then re-initialize it as I did.
NN2 now has different metadata, so to sync it with NN1 you can either copy the entire metadata
directory or use the bootstrap command. Then start the NameNode service on NN2.
[root@NN2 ~]# sudo -u hdfs hdf
[root@NN2 ~]# service hadoop-
starting namenode, logging to / v
Started Hadoop namenode: [ OK
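The commands above were truncated in this copy; they most likely correspond to the standby bootstrap and the service start, roughly:

[root@NN2 ~]# sudo -u hdfs hdfs namenode -bootstrapStandby
[root@NN2 ~]# service hadoop-hdfs-namenode start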
The next step is to initialize required state in ZooKeeper. You can do so by running the following
command from one of the NameNode hosts. This will create a znode in ZooKeeper inside of
which the automatic failover system stores its data.
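For example, from NN1:

sudo -u hdfs hdfs zkfc -formatZK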
Now start the ZKFC service. Note: the NameNode on which you start the ZKFC first will become the active
node.
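With the CDH packages this is, for example:

service hadoop-hdfs-zkfc start   # run on NN1 first, then on NN2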
Web Interface:
http://192.168.1.150:50070/dfshealth.html#tab-overview
http://192.168.1.151:50070/dfshealth.html#tab-overview
Repeat the same test on NN2 and also check the web URL’s.
http://192.168.1.150:8088/cluster/cluster
http://192.168.1.151:8088/cluster/cluster
Once you have configured the YARN resource manager for HA, if the active resource manager
is down or is no longer on the list, one of the standby resource managers becomes active and
resumes resource manager responsibilities. This way, jobs continue to run and complete
successfully.
Please share and like if this blog was helpful for you.
Thanks
Mandy
Commissioning/Decommissioning –
Datanode in Hadoop
Commissioning of nodes means adding new DataNodes to the cluster, and decommissioning stands for
removing nodes from the cluster. You can't directly add/remove a DataNode in a large, real-time
cluster as it can cause a lot of disturbance. So if you want to scale your cluster, you need
commissioning, and the steps are below.
Commission:
Pre-requirements:
Configuration changes:
We need to update the include file on both the ResourceManager and the NameNode. If it's not
present, create an include file on both nodes.
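A sketch of the include-file wiring, assuming the file lives at /etc/hadoop/conf/include (a hypothetical path):

# list every permitted DataNode/NodeManager host, including the new one
echo "DN3" >> /etc/hadoop/conf/include

# hdfs-site.xml on the NameNode:
#   <property><name>dfs.hosts</name><value>/etc/hadoop/conf/include</value></property>
# yarn-site.xml on the ResourceManager:
#   <property><name>yarn.resourcemanager.nodes.include-path</name><value>/etc/hadoop/conf/include</value></property>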
Also update the slaves file on NameNode and add new DataNode IP address.
Copy all configuration files from NameNode and then refresh the Nodes.
[root@NN conf]# scp -r * DN3:/
[root@NN conf]# sudo -u hdfs h
Refresh nodes successful
Commissioning of the new DataNode is complete. Check the Hadoop admin report using the command below.
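For example:

sudo -u hdfs hdfs dfsadmin -report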
Decommission a Node:
To decommission a DataNode, the exclude property is needed on the NameNode side. Do the
decommissioning activity in non-peak hours, as any process running on the decommissioned node can fail.
Note: it's very important that the include/exclude files are mutually exclusive; you can't have the
same values in both the exclude and include files.
Update the exclude file with the node being removed and refresh the NameNode and ResourceManager, as sketched below.
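A sketch, assuming the exclude file lives at /etc/hadoop/conf/exclude (hypothetical path) and is referenced by dfs.hosts.exclude / yarn.resourcemanager.nodes.exclude-path:

echo "DN3" >> /etc/hadoop/conf/exclude     # the node being decommissioned
sudo -u hdfs hdfs dfsadmin -refreshNodes   # NameNode starts moving its blocks away
sudo -u yarn yarn rmadmin -refreshNodes    # ResourceManager stops scheduling on it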
Hadoop Balancer:
Hadoop Balancer is a built-in utility which makes sure that no DataNode is over-utilized.
When you run the balancer utility, it checks whether some DataNodes are under-utilized or over-utilized
and rebalances the block distribution across the DataNodes. But make sure the balancer runs only in
off-peak hours on a real cluster, because running it during peak hours will cause a heavy
load on the network, as it transfers a large amount of data.
hadoop balancer
Hope this post was helpful in understanding the commissioning and decommissioning of
datanodes in Hadoop.
Thanks
Mandy
Amazon Elastic MapReduce (EMR) is a fully managed cluster platform that processes and analyzes
large amounts of data. When you work with a large amount of data you eventually run into processing
problems. By using a Hadoop cluster, EMR can help reduce large processing problems by
splitting big data sets into smaller jobs and distributing them across many compute nodes. EMR does
this using Hadoop and other open source big data frameworks.
Amazon EMR is mainly used for log processing and analysis, ETL processing, clickstream
analysis and machine learning.
EMR Architecture:
Master Nodes:
o EMR has a single master node and there is no second master node to fail over to.
o The master node manages the resources of the cluster.
o It co-ordinates the distribution and parallel execution of MapReduce executables.
o It tracks and directs HDFS.
o It monitors the health of the core and task nodes.
o The ResourceManager also runs on the master node and is responsible for
scheduling the resources.
Core nodes:
o Core nodes are slave nodes and run the tasks as directed by the master node.
o Core nodes contain data as part of HDFS or EMRFS, so the data daemons run on core
nodes and store the data.
o Core nodes also run the NodeManager, which takes instructions from the ResourceManager
on how to manage the resources.
o The ApplicationMaster is a task which negotiates resources with the ResourceManager
and works with the NodeManagers to execute and monitor application containers.
Task Nodes:
o Task nodes are also controlled by the master and are optional.
o These nodes provide extra capacity to the cluster in terms of CPU
and memory.
o They can be added/removed at any time from a running cluster.
Storage option:
Instance store: local storage attached to the EC2 instance; the data is lost after terminating the
EMR cluster. Can be used where high I/O or high IOPS is needed at low cost.
EBS volume: EBS volumes can be used for data storage, but the data is also lost after EMR cluster termination.
EMRFS: an implementation of the Hadoop file system which allows the cluster to store/ingest data directly
from S3. Copying data from S3 to HDFS can be done via S3DistCp (see the sketch below).
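A sketch of an S3DistCp copy run on the cluster, with a hypothetical bucket and path:

s3-dist-cp --src s3://my-bucket/input/ --dest hdfs:///data/input/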
Launch a Cluster:
You can launch an EMR cluster with a few clicks. Sign up for the AWS cloud and go to the AWS console.
On the Services menu click EMR and then click Create cluster. There are two options to create a
cluster.
Quick:
Advanced:
From here you can choose the EMR release and the custom applications that you want to
install. In the quick option, you aren't able to choose specific applications but have to
select one of 4 predefined sets. I chose a cluster with the Hive and Spark applications.
Secondly, you can enter Hadoop configuration overrides, such as changing the YARN job log settings,
in JSON format.
You can also load the JSON file from S3.
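A sketch of such a configuration JSON; the classification/property layout follows the EMR configuration format, and the property value here is an illustrative assumption:

cat > yarn-config.json <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true"
    }
  }
]
EOF
# it can then be passed when creating the cluster, e.g.
# aws emr create-cluster ... --configurations file://yarn-config.json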
In the advanced option you can specify the VPC and subnet settings, EBS volume sizes,
etc. This is not available in the quick option.
The advanced option allows you to choose different configurations for the master, core and task
nodes. In the quick option all nodes are of the same type.
Auto scaling can also be configured in the advanced option.
Termination protection is enabled by default. This protects the cluster from being
accidentally terminated.
But for a transient cluster (a temporary cluster that does a particular task and automatically
terminates after it), disable termination protection.
Bootstrap scripts can be executed in the advanced EMR option.
Like any Hadoop environment, Amazon EMR would generate a number of log files. While some
of these log files are common for any Hadoop system, others are specific to EMR. Here is a brief
introduction to these different types of log files.
Bootstrap Action Logs: these logs are specific to Amazon EMR. It's possible to run
bootstrap actions when the cluster is created. Examples of bootstrapping are
installing a Hadoop component not included in EMR, or setting certain parameters in a
configuration file. The bootstrap action logs contain the output of these actions.
Instance State Logs: These log files contain infrastructure resource related information,
like CPU, memory, or garbage collection.
Hadoop/YARN Component Logs: these logs are associated with the Hadoop daemons,
like those related to HDFS, YARN, Oozie, Pig, or Hive. I configured the cluster to write the YARN-related
log files to S3.
Step Logs: this type of log is specific to Amazon EMR. As we said, an EMR cluster can
run one or more steps of a submitted job. These steps can be defined when the cluster is
created, or submitted afterwards. Each step of a job generates four types of log files.
Collectively, these logs can help troubleshoot any user-submitted job.
Loading Data into Hive:
Once the cluster is ready, connect to it as the hadoop user with your private key. Now run the
hive program and create the table. The dataset has already been uploaded to the
"s3://testcluster-emr/input/restaurant.data" bucket.
Once the Hive table is created, query it to list the ratings for the restaurants. It took almost
33 seconds to execute the query.
# Create table statement in Hive
CREATE EXTERNAL TABLE `re
Age string,
gender string,
Before running Spark SQL, turn off the verbose logging. Spark SQL is compatible with the Hive
metastore, which means tables created with Hive don't need to be re-created for Spark SQL.
Now cache the restaurant table created by Hive in Spark SQL. Caching takes time depending
on how big the table is. Once done, run the same select SQL query that we ran with Hive;
you'll get the query result in about 3 seconds.
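A sketch of that sequence from the spark-sql shell on the master node, assuming the Hive table is named restaurant and using an illustrative query:

spark-sql -e "CACHE TABLE restaurant; SELECT count(*) FROM restaurant;"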
Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast.
Use the Add Step option to submit your Hive script to the cluster using the console. The Hive
script has been uploaded to Amazon S3 for you. My Hive script contains the select statement
only.
Go to the cluster and scroll to the Steps section and expand it, then choose Add step.
This example was a simple demonstration. In real life, there may be dozens of steps with
complex logic, each generating very large log files. Manually looking through thousands of lines
of log may not be practical for troubleshooting purposes, so S3 is a good candidate for storing
step logs for troubleshooting.
EMR CLI:
The EMR CLI commands are listed here. You can run these from the EMR nodes.
[hadoop@ip-172-31-64-121 emr]$ aws emr list-clusters
[hadoop@ip-172-31-64-121 emr]$ aws emr list-instances --cluster-id j-3RUZVHS6CKWYC
EMR console:
The EMR cluster management console provides a GUI to monitor, resize, terminate and do a
lot of other things with the cluster. The same can be achieved via the EMR CLI.
You can monitor the cluster status, node stats, CPU, I/O and application jobs completed etc…
Really handy.
Application history can be monitored separately as well which will show the type of the job and
successful/failure status etc..
Select the cluster and see the overall status from one screen. You can resize the cluster from
the hardware option.
You can increase the number of core nodes from here. There is only one master node and you can't change
that; there is no HA for the master at the moment, but the metadata can be stored in a metastore and
later used to restore a failed cluster. Just update the count for the core nodes and click the green tick.
Resizing will take time, and once done the cluster core nodes will be in the running state.
To terminate the cluster, you need to turn off the termination protection.
Thanks
Mandy
In my previous blogs, we configured Hadoop single-node and cluster set-ups. Now let's try to create
files and directories on the Hadoop Distributed File System (HDFS); a few basic commands are sketched below, and you can see the full list here.
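A few basic examples (the paths are hypothetical):

hdfs dfs -mkdir -p /user/mandy/test              # create a directory on HDFS
hdfs dfs -put localfile.txt /user/mandy/test/    # copy a local file into it
hdfs dfs -ls /user/mandy/test                    # list the directory
hdfs dfs -cat /user/mandy/test/localfile.txt     # read the file back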
When I started using the HDFS commands I got confused by three different command syntaxes. All three
commands appear to be the same but have some differences, as explained below.
hadoop fs {args}
hadoop fs relates to a generic file system which can point to any file system, such as the local file system
or HDFS. So it can be used when you are dealing with different file systems such as the local FS, (S)FTP, S3,
and others.
hadoop dfs {args}
hadoop dfs is very specific to HDFS and works for operations related to HDFS. It has been
deprecated and we should use hdfs dfs instead.
hdfs dfs {args}
The same as the 2nd, i.e. it works for all the operations related to HDFS, and it is the recommended
command instead of hadoop dfs.
hadoop fs:
Found 1 items
hdfs dfs:
Found 2 items
hadoop dfs:
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs
command for it.
Found 2 items
Thanks
Mandy
Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides
data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in
various databases and file systems that integrate with Hadoop.
Configuring high availability for Hive requires the following components to be fail-proof: the Hive
metastore database, the Hive Metastore Server and HiveServer2.
First of all, set up the Hive metastore as a MySQL database. Here are the steps:
[root@CM ~]# rpm -ivh "https:/ /
[root@CM ~]# yum install mysql
[root@CM ~]# systemctl start my
[root@CM log]# grep password /
Now log in to the MySQL database, create the Hive database and user, and grant the privileges.
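A sketch of the MySQL side, with hypothetical database, user and password names:

mysql -u root -p <<'EOF'
CREATE DATABASE metastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
EOF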
Install Hive:
Test hive:
Click Finish. You should now see new hosts added as the Hive Metastore Server.
Re-start the stale configurations
Notice that you now have multiple instances of Hive Metastore Server.
# beeline -u "jdbc:hive2://cm:10000"
Now from CM, select the first Hive Metastore Server, stop it and test the connection.
Then stop the second Hive Metastore Server as well; this time the command should fail, which is normal.
To enable high availability for multiple HiveServer2 hosts, configure a load balancer to manage
them. To increase stability and security, configure the load balancer on a proxy server.
Add a couple of HiveServer2 role instances; you should now see the new hosts added as HiveServer2.
The connections get routed to the HiveServer2 instances in a round-robin fashion.
Mandy!!!
In this tutorial, we focus on HDFS snapshots. Common use cases of HDFS snapshots include
backups and protection against user errors.
HDFS directories must be enabled for snapshots in order for snapshots to be created. Steps are:
su - hdfs
hdfs dfsadmin -allowSnapshot /user/mandy/snapshot_test
# Another Method
hadoop dfsadmin -allowSnapshot /user/mandy/snapshot_test
To take a snapshot, click Take Snapshot, specify the name of the snapshot, and click
Take Snapshot.
The snapshot is added to the snapshot list.
To restore a snapshot, click drop down button near folder name again and select restore from
snapshot.
Restore in progress.
With Command line:
You can read the content of the file or list the file.
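A sketch of the command-line equivalents, using the same directory as above (file names are hypothetical):

# create a snapshot, then look inside the hidden .snapshot directory
hdfs dfs -createSnapshot /user/mandy/snapshot_test snap1
hdfs dfs -ls /user/mandy/snapshot_test/.snapshot/snap1
hdfs dfs -cat /user/mandy/snapshot_test/.snapshot/snap1/somefile.txt
# restore by copying out of the snapshot, and delete the snapshot when done
hdfs dfs -cp /user/mandy/snapshot_test/.snapshot/snap1/somefile.txt /user/mandy/snapshot_test/
hdfs dfs -deleteSnapshot /user/mandy/snapshot_test snap1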
Disable Snapshot:
Try to remove a snapshottable directory by typing the following command as the hdfs user. As
expected, the directory can't be deleted because it is snapshottable and it already contains a
snapshot. Remove the snapshot first and try again.
Delete the snapshot.
Thanks
Mandy
Decommission/Recommission – DataNode in
Cloudera
Commissioning nodes stands for adding new nodes to the current cluster which runs your Hadoop
framework. In contrast, decommissioning nodes stands for removing nodes from your cluster.
This is a very useful feature for handling node failure during the operation of a Hadoop cluster without
stopping all the Hadoop nodes in your cluster.
Decommission:
You can't decommission a DataNode, or a host with a DataNode, if the number of DataNodes equals
the replication factor. If you attempt to decommission a DataNode in that situation, the
decommission process will not complete; you have to abort the decommission process and
change the replication factor.
In my case, I have two DataNodes and decommissioning one will leave only one DataNode. Before the
decommission process, change the replication factor to 1.
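Lowering the factor for data that already exists can be sketched as follows (the HDFS default, dfs.replication, is changed in the CM configuration as well):

sudo -u hdfs hdfs dfs -setrep -w 1 /   # rewrite existing files down to a single replica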
Go to hosts
Select the host or hosts that you want to decommission
Click on Action -> Select “Hosts Decommission/suppress Alert”
Recommission:
The host which has been decommissioned can now be removed from the cluster. Remove the roles from the
host and leave only the management role.
Remove host from Cloudera Manager:
Select the type of CDH software installation; I selected the one matching the existing set-up.
In HDFS, the blocks of the files are distributed among the DataNodes as per the replication factor.
Whenever you add a new DataNode, the node will start receiving and storing the blocks of new
files. Though this sounds alright, the cluster is not balanced from an administrative point of
view. HDFS provides a balancer utility that analyzes block placement and balances data
across the DataNodes. You can run it from Cloudera after the addition of a new DataNode.
Go to the HDFS service.
Ensure the service has a Balancer role.
Select Actions > Rebalance.
Click Rebalance to confirm.
hdfs balancer
# Set a different threshold
hdfs balancer -threshold 5
Mandy