21ai402 Data Analytics Unit-2
LECTURE NOTES
1. INTRODUCING HADOOP
Today, Big Data seems to be the buzzword! Enterprises the world over are
beginning to realize that there is a huge volume of untapped information before
them in the form of structured, semi-structured and unstructured data. This wide
variety of data is spread across networks.
Let us look at a few statistics to get an idea of the amount of data that gets
generated every day, every minute, and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates trade data covering about 1.5 billion shares.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.
1.1 Data: The Treasure Trove
1. Provides business advantages such as generating product recommendations,
inventing new products, analyzing the market, and many more.
2. Provides a few early key indicators that can turn the fortunes of a business.
3. Provides room for precise analysis: the more data we have for analysis, the
greater the precision of the analysis.
To process, analyze, and make sense of these different kinds of data, we need a
system that scales and addresses the challenges shown in the figure below.
3. HADOOP OVERVIEW
Open-source software framework to store and process massive amounts of
data in a distributed fashion on large clusters of commodity hardware. Basically,
Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop Core Components:
1. HDFS (Hadoop Distributed File System):
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem:
The Hadoop ecosystem consists of support projects that enhance the functionality
of the Hadoop core components.
The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
The ecosystem is conceptually divided into a Data Storage Layer, which stores
huge volumes of data, and a Data Processing Layer, which processes the data in
parallel to extract richer and more meaningful insights (Figure 3.3).
4. HDFS (Hadoop Distributed File System)
Figure 5.14 describes the important key points of HDFS. Figure 5.15 describes the
Hadoop Distributed File System architecture. The client application interacts with
the NameNode for metadata-related activities and communicates with the DataNodes
to read and write files. The DataNodes converse with each other for pipeline reads
and writes.
Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data
block size (64 MB), it will be split into three blocks and replicated across the
nodes of the cluster based on the default replication factor (three).
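This arithmetic can be sketched in a few lines of Java (a minimal illustration
only, assuming the default 64 MB block size and replication factor of 3 from the
example above; it does plain arithmetic and does not call the HDFS API):

// Minimal sketch: how HDFS would split and replicate "Sample.txt".
public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSizeMb = 192;      // size of Sample.txt
        long blockSizeMb = 64;      // default HDFS block size
        int replicationFactor = 3;  // default replication factor

        // Number of blocks = ceiling(fileSize / blockSize)
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        long physicalBlocks = blocks * replicationFactor;

        System.out.println("Blocks: " + blocks);                        // 3
        System.out.println("Block replicas stored: " + physicalBlocks); // 9
    }
}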
4.1 HDFS Daemons
4.1.1 NameNode
➢ HDFS breaks a large file into smaller pieces called blocks. The NameNode
uses a rack ID to identify the DataNodes in a rack.
➢ A rack is a collection of DataNodes within the cluster. The NameNode keeps
track of the blocks of a file as they are placed on various DataNodes.
➢ NameNode manages file-related operations such as read, write, create and
delete. Its main job is managing the File System Namespace.
➢ A file system namespace is a collection of files in the cluster. The NameNode
stores the HDFS namespace.
➢ The file system namespace includes the mapping of blocks to files and file
properties, and is stored in a file called FsImage.
➢ The NameNode uses an EditLog (transaction log) to record every
transaction that happens to the file system metadata (Figure 5.16).
➢ When the NameNode starts up, it reads the FsImage and EditLog from disk
and applies all transactions from the EditLog to the in-memory
representation of the FsImage.
➢ Then it flushes out a new version of the FsImage to disk and truncates the
old EditLog, because the changes have been updated in the FsImage. There is
a single NameNode per cluster.
4.1.2 DataNode
➢ There are multiple DataNodes per cluster. During pipeline reads and
writes, DataNodes communicate with each other. A DataNode also
continuously sends a "heartbeat" message to the NameNode to ensure the
connectivity between the NameNode and the DataNode.
➢ In case there is no heartbeat from a DataNode, the NameNode replicates the
blocks that were stored on that DataNode to other DataNodes within the
cluster and keeps on running as if nothing had happened.
➢ The concept behind the heartbeat report sent by the DataNodes to the
NameNode is as follows:
PICTURE THIS …
You work for a renowned IT organization. Every day when you come to the office,
you are required to swipe in to record your attendance. This attendance record is
then shared with your manager to keep him posted on which of his team members
have reported for work. Your manager is able to allocate tasks to the team
members who are present in the office; the tasks for the day cannot be allocated
to team members who have not turned up. Likewise, the heartbeat report is the way
by which DataNodes inform the NameNode that they are up and functional and can
be assigned tasks. Figure 5.17 depicts this scenario.
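The same idea can be sketched in a few lines of Java (a toy simulation only, not
Hadoop's actual implementation; the class, method names, and timeout value below
are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Toy simulation of the heartbeat idea. The NameNode side records the last
// heartbeat time per DataNode and treats a node as dead when no heartbeat
// arrives within the timeout, after which its blocks would be re-replicated.
public class HeartbeatSketch {
    static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed timeout
    static Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a DataNode reports in.
    static void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodic liveness check on the NameNode side.
    static void checkLiveness() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                // Node presumed dead: schedule re-replication of its blocks.
                System.out.println(e.getKey() + " is dead; re-replicating its blocks");
            }
        }
    }
}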
4.1.3 Secondary NameNode
➢ The Secondary NameNode takes a snapshot of the HDFS metadata at
intervals specified in the Hadoop configuration. Since the memory
requirements of the Secondary NameNode are the same as those of the
NameNode, it is better to run the NameNode and the Secondary NameNode
on different machines.
➢ In case of failure of the NameNode, the Secondary NameNode can be
configured manually to bring up the cluster.
➢ However, the Secondary NameNode does not record any real-time changes
that happen to the HDFS metadata.
1. Data Replication:
There is absolutely no need for a client application to track all the blocks. The
NameNode directs the client to the nearest replica to ensure high performance.
2. Data Pipeline:
A client application writes a block to the first DataNode in the pipeline. Then this
DataNode takes over and forwards the data to the next node in the pipeline. This
process continues for all the data blocks, and subsequently all the replicas are
written to the disk.
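For reference, a client-side write looks roughly like the sketch below, which
uses the standard org.apache.hadoop.fs.FileSystem API; the block splitting and
pipeline replication happen transparently underneath the output stream (the path
and file contents here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of an HDFS write. The client only writes to the stream;
// HDFS splits the data into blocks and pipelines the replicas to the
// DataNodes behind the scenes.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/Sample.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs"); // data is pipelined to the DataNodes
        }
        fs.close();
    }
}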
5. Processing Data with Hadoop
• In MapReduce programming, the input dataset is split into independent chunks
that are processed by the map tasks in parallel.
• The output produced by the map tasks serves as intermediate data and is
stored on the local disk of that server. The outputs of the mappers are
automatically shuffled and sorted by the framework.
• Job inputs and outputs are stored in a file system. The MapReduce framework
also takes care of other tasks such as scheduling, monitoring, and re-executing
failed tasks.
• Once the client submits a job to the JobTracker, the JobTracker partitions it
and assigns the resulting MapReduce tasks to the TaskTrackers in the cluster.
Figure 5.22 depicts the JobTracker and TaskTracker interaction.
5.2 How Does MapReduce Work?
❖ MapReduce divides a data analysis task into two parts - map and reduce.
Figure 5.23 depicts how the MapReduce Programming works.
• In this example, there are two mappers and one reducer. Each
mapper works on the partial dataset that is stored on that node and the
reducer combines the output from the mappers to produce the reduced
result set.
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master and several worker processes
and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data that
was assigned to it. Each map worker uses the map function to extract only the
data that is present on its server and generates key/value pairs for the
extracted data.
4. The map worker uses the partitioner function to divide the data into regions.
The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce
workers to begin their work. The reduce workers in turn contact the map workers
to get the key/value data for their partition. The data thus received is shuffled
and sorted as per keys.
6. Each reduce worker then calls the reduce function for every unique key. The
reduce function writes its output to a file.
7. When all the reduce workers complete their work, the master transfers the
control to the user program.
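As a concrete illustration of these steps, here is the classic word-count example
written against the standard org.apache.hadoop.mapreduce API (a minimal sketch;
a matching Driver class is sketched later in these notes):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in the input split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // key/value pair for the shuffle
            }
        }
    }
}

// Reduce step: called once per unique key with all its values;
// sums the counts and writes the final (word, count) pair.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}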
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other
memory-intensive algorithms.
• In this architecture, the map slots might be "full" while the reduce slots
are empty, and vice versa.
• This causes resource-utilization issues, which need to be addressed for
proper resource utilization.
6.2 HDFS Limitations
• The NameNode saves all its file metadata in main memory. Although
main memory today is not as small and as expensive as it used to be
two decades ago, there is still a limit on the number of objects that
one can hold in the memory of a single NameNode.
• The NameNode can quickly become overwhelmed as the load on the
system increases.
• In Hadoop 2.x, this is resolved with the help of HDFS Federation.
HDFS 2 Features
1. Horizontal scalability
2. High availability
YARN Components:
1. Global ResourceManager:
• This is the master daemon that arbitrates cluster resources among all
the applications in the system.
2. NodeManager:
• This is a per-machine slave daemon.
• The NodeManager's responsibility is to launch the application containers
for application execution.
• The NodeManager monitors resource usage such as memory, CPU, disk,
network, etc., and reports the usage of resources to the global
ResourceManager.
3. Per-application ApplicationMaster:
• This is an application-specific entity.
• Its responsibility is to negotiate required resources for execution from
the ResourceManager.
• It works along with the NodeManager for executing and
monitoring component tasks.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (memory, CPU,
disk, network, etc.), for example:
(a) container_0 = 2GB, 1 CPU
(b) container_1 = 1GB, 6 CPU
3. Replaces the fixed map/reduce slots.
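A sketch of how an ApplicationMaster might describe such a fine-grained
container request using the org.apache.hadoop.yarn.api.records classes (the
values are illustrative and mirror the examples above; real code would run
inside an ApplicationMaster with an AMRMClient connected to the
ResourceManager):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch: describing a fine-grained container request in YARN. Unlike fixed
// map/reduce slots, each container asks for exactly the memory and CPU it needs.
public class ContainerRequestSketch {
    public static void main(String[] args) {
        Resource capability = Resource.newInstance(2048, 1); // 2 GB, 1 vCore
        Priority priority = Priority.newInstance(0);

        // nodes/racks set to null: let the scheduler place the container.
        AMRMClient.ContainerRequest request =
                new AMRMClient.ContainerRequest(capability, null, null, priority);
        System.out.println("Requested: " + request.getCapability());
    }
}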
YARN Architecture:
Figure 5.29 depicts YARN architecture.
7. During the application execution, the client that submitted the job directly
communicates with the ApplicationMaster to get status, progress updates, etc. via
an application-specific protocol.
7.1 Pig
• Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow.
• Pig is an alternative to MapReduce Programming. It abstracts some details
and allows us to focus on data processing.
• It consists of two components.
1. Pig Latin: The data processing language.
2. Compiler: To translate Pig Latin to MapReduce Programming.
• Figure 5.30 depicts Pig in the Hadoop ecosystem.
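A small sketch of driving a Pig Latin data flow from Java via the
org.apache.pig.PigServer class (the input file, schema, and output path are
illustrative; the same statements could equally be typed into the Grunt shell):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Sketch: a tiny Pig Latin data flow driven from Java. Each registerQuery
// line is ordinary Pig Latin that the compiler turns into MapReduce jobs.
public class PigSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster

        pig.registerQuery("students = LOAD 'students.txt' USING PigStorage(',') "
                + "AS (name:chararray, marks:int);");
        pig.registerQuery("toppers = FILTER students BY marks > 80;");
        pig.store("toppers", "toppers_out"); // writes the result to disk
    }
}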
7.2 Hive
• Hive is a Data Warehousing Layer on top of Hadoop.
• Analysis and queries can be done using an SQL-like language.
• Hive can be used to do ad-hoc queries, summarization, and data analysis.
Figure 5.31 depicts Hive in the Hadoop ecosystem.
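Since Hive exposes an SQL-like interface, it can also be queried from Java over
JDBC. A minimal sketch (the table, query, and connection URL are illustrative
and assume a running HiveServer2 on localhost:10000):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: an ad-hoc Hive query over JDBC.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}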
7.3 Sqoop
• Sqoop is a tool which helps to transfer data between Hadoop and
Relational Databases.
• With the help of Sqoop, we can import data from an RDBMS to HDFS and
vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
7.4 HBase
• HBase is a NoSQL database for Hadoop.
• HBase is a column-oriented NoSQL database.
• HBase is used to store billions of rows and millions of columns.
• HBase provides random read/write operations. It also supports record-level
updates, which are not possible using HDFS directly. HBase sits on top of
HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.
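A short sketch of HBase's record-level read/write API in Java (the table name,
column family, and values are illustrative and assume the table already exists):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: random read/write against an HBase table "users" with a
// column family "info".
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Record-level write (acts as an update if the cell already exists).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                          Bytes.toBytes("Chennai"));
            table.put(put);

            // Random read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}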
10. ASSIGNMENT : UNIT – II
1. Since the data is replicated thrice in HDFS, does it mean that any
calculation done on one node will also be replicated on the other two?
2. Why do we use HDFS for applications having large datasets and not
when we have small files?
3. Suppose Hadoop spawned 100 tasks for a job and one of the tasks
failed. What will Hadoop do?
4. How does Hadoop differ from volunteer computing?
5. How would you deploy the different components of Hadoop in production?
11. PART A : Q & A : UNIT – II
14. What is the difference between SQL and MapReduce? (CO2, K1)
15. What are the three important classes of MapReduce? (CO2, K1)
● MapReduce programming requires three classes:
1. Driver Class: This class specifies the job configuration details.
2. Mapper Class: This class overrides the Map Function based on the problem
statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem
statement.
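For reference, a typical Driver class for the word-count job sketched earlier in
these notes, using the standard org.apache.hadoop.mapreduce.Job API (the class
names match that earlier sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: specifies the job configuration, wires the Mapper and
// Reducer classes together, and submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);  // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}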
17. Which daemon is responsible for executing overall MapReduce job? ( CO2, K1)
• The JobTracker provides connectivity between Hadoop and our application.
When we submit code to the cluster, the JobTracker creates the execution plan by
deciding which task to assign to which node. It also monitors all the running
tasks.
• JobTracker is a master daemon responsible for executing overall MapReduce
job.