21ai402 Data Analytics Unit-2
LECTURE NOTES
1. INTRODUCING HADOOP
Today, Big Data seems to be the buzzword! Enterprises the world over are
beginning to realize that there is a huge volume of untapped information before
them in the form of structured, semi-structured and unstructured data. This wide
variety of data is spread across networks.
Let us look at a few statistics to get an idea of the amount of data that gets
generated every day, every minute, and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates trade data covering about 1.5 billion shares.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.
1.1 Data: The Treasure Trove
1. Provides business advantages such as generating product recommendations,
inventing new products, analyzing the market, and many more.
2. Provides a few early key indicators that can turn the fortunes of a business.
3. Provides room for precise analysis: the more data we have for analysis, the
greater the precision of the analysis.
To process, analyze, and make sense of these different kinds of data, we need a
system that scales and addresses the challenges shown in the figure below.
3. HADOOP OVERVIEW
Open-source software framework to store and process massive amounts of
data in a distributed fashion on large clusters of commodity hardware. Basically,
Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop Core Components:
1. HDFS (Hadoop Distributed File System):
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem:
The Hadoop ecosystem consists of support projects that enhance the functionality
of the Hadoop core components.
The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
The ecosystem is conceptually divided into a Data Storage Layer, which stores
huge volumes of data, and a Data Processing Layer, which processes the data in
parallel to extract richer and more meaningful insights (Figure 3.3).
4. HDFS (Hadoop Distributed File System)
Figure 5.14 describes the important key points of HDFS. Figure 5.15 describes the
Hadoop Distributed File System architecture. The client application interacts with
the NameNode for metadata-related activities and communicates with the DataNodes
to read and write files. The DataNodes converse with each other for pipeline reads
and writes.
Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data
block size (64 MB), it will be split into three blocks and replicated across the
nodes of the cluster based on the default replication factor (three).
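This arithmetic can be sketched in a few lines of Java (a minimal illustration
only, assuming the default 64 MB block size and replication factor of 3 from the
example above; it does plain arithmetic and does not call the HDFS API):

// Minimal sketch: how HDFS would split and replicate "Sample.txt".
public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSizeMb = 192;      // size of Sample.txt
        long blockSizeMb = 64;      // default HDFS block size
        int replicationFactor = 3;  // default replication factor

        // Number of blocks = ceiling(fileSize / blockSize)
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        long physicalBlocks = blocks * replicationFactor;

        System.out.println("Blocks: " + blocks);                        // 3
        System.out.println("Block replicas stored: " + physicalBlocks); // 9
    }
}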
4.1 HDFS Daemons
4.1.1 NameNode
➢ HDFS breaks a large file into smaller pieces called blocks. The NameNode
uses a rack ID to identify the DataNodes in a rack.
➢ A rack is a collection of DataNodes within the cluster. The NameNode keeps
track of the blocks of a file as they are placed on various DataNodes.
➢ NameNode manages file-related operations such as read, write, create and
delete. Its main job is managing the File System Namespace.
➢ A file system namespace is a collection of files in the cluster. The NameNode
stores the HDFS namespace.
➢ The file system namespace includes the mapping of blocks to files and file
properties, and is stored in a file called FsImage.
➢ The NameNode uses an EditLog (transaction log) to record every
transaction that happens to the file system metadata (Figure 5.16).
➢ When the NameNode starts up, it reads the FsImage and EditLog from disk
and applies all transactions from the EditLog to the in-memory
representation of the FsImage.
➢ Then it flushes out a new version of the FsImage to disk and truncates the
old EditLog, because the changes have been updated in the FsImage. There is
a single NameNode per cluster.
4.1.2 DataNode
➢ There are multiple DataNodes per cluster. During pipeline reads and
writes, DataNodes communicate with each other. A DataNode also
continuously sends a "heartbeat" message to the NameNode to ensure the
connectivity between the NameNode and the DataNode.
➢ In case there is no heartbeat from a DataNode, the NameNode replicates the
blocks that were stored on that DataNode to other DataNodes within the
cluster and keeps on running as if nothing had happened.
➢ The concept behind the heartbeat report sent by the DataNodes to the
NameNode is as follows:
PICTURE THIS …
You work for a renowned IT organization. Every day when you come to the office,
you are required to swipe in to record your attendance. This attendance record is
then shared with your manager to keep him posted on which of his team members
have reported for work. Your manager is able to allocate tasks to the team
members who are present in the office; the tasks for the day cannot be allocated
to team members who have not turned up. Likewise, the heartbeat report is the way
by which DataNodes inform the NameNode that they are up and functional and can
be assigned tasks. Figure 5.17 depicts this scenario.
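The same idea can be sketched in a few lines of Java (a toy simulation only, not
Hadoop's actual implementation; the class, method names, and timeout value below
are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Toy simulation of the heartbeat idea. The NameNode side records the last
// heartbeat time per DataNode and treats a node as dead when no heartbeat
// arrives within the timeout, after which its blocks would be re-replicated.
public class HeartbeatSketch {
    static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed timeout
    static Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a DataNode reports in.
    static void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodic liveness check on the NameNode side.
    static void checkLiveness() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                // Node presumed dead: schedule re-replication of its blocks.
                System.out.println(e.getKey() + " is dead; re-replicating its blocks");
            }
        }
    }
}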
4.1.3 Secondary NameNode
➢ The Secondary NameNode takes a snapshot of the HDFS metadata at
intervals specified in the Hadoop configuration. Since the memory
requirements of the Secondary NameNode are the same as those of the
NameNode, it is better to run the NameNode and the Secondary NameNode
on different machines.
➢ In case of failure of the NameNode, the Secondary NameNode can be
configured manually to bring up the cluster.
➢ However, the Secondary NameNode does not record any real-time changes
that happen to the HDFS metadata.
1. Data Replication:
There is absolutely no need for a client application to track all the blocks. The
NameNode directs the client to the nearest replica to ensure high performance.
2. Data Pipeline:
A client application writes a block to the first DataNode in the pipeline. Then this
DataNode takes over and forwards the data to the next node in the pipeline. This
process continues for all the data blocks, and subsequently all the replicas are
written to the disk.
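For reference, a client-side write looks roughly like the sketch below, which
uses the standard org.apache.hadoop.fs.FileSystem API; the block splitting and
pipeline replication happen transparently underneath the output stream (the path
and file contents here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of an HDFS write. The client only writes to the stream;
// HDFS splits the data into blocks and pipelines the replicas to the
// DataNodes behind the scenes.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/Sample.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs"); // data is pipelined to the DataNodes
        }
        fs.close();
    }
}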
5. Processing Data with Hadoop
• In MapReduce programming, the input dataset is split into independent chunks
that are processed by the map tasks in parallel.
• The output produced by the map tasks serves as intermediate data and is
stored on the local disk of that server. The outputs of the mappers are
automatically shuffled and sorted by the framework.
• Job inputs and outputs are stored in a file system. The MapReduce framework
also takes care of other tasks such as scheduling, monitoring, and re-executing
failed tasks.
• Once the client submits a job to the JobTracker, the JobTracker partitions it
and assigns the resulting MapReduce tasks to the TaskTrackers in the cluster.
Figure 5.22 depicts the JobTracker and TaskTracker interaction.
5.2 How Does MapReduce Work?
❖ MapReduce divides a data analysis task into two parts - map and reduce.
Figure 5.23 depicts how the MapReduce Programming works.
• In this example, there are two mappers and one reducer. Each
mapper works on the partial dataset that is stored on that node and the
reducer combines the output from the mappers to produce the reduced
result set.
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master and several worker processes
and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data that
was assigned to it. Each map worker uses the map function to extract only the
data that is present on its server and generates key/value pairs for the
extracted data.
4. The map worker uses the partitioner function to divide the data into regions.
The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce
workers to begin their work. The reduce workers in turn contact the map workers
to get the key/value data for their partition. The data thus received is shuffled
and sorted as per keys.
6. Each reduce worker then calls the reduce function for every unique key. The
reduce function writes its output to a file.
7. When all the reduce workers complete their work, the master transfers the
control to the user program.
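As a concrete illustration of these steps, here is the classic word-count example
written against the standard org.apache.hadoop.mapreduce API (a minimal sketch;
a matching Driver class is sketched later in these notes):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in the input split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // key/value pair for the shuffle
            }
        }
    }
}

// Reduce step: called once per unique key with all its values;
// sums the counts and writes the final (word, count) pair.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}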
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other
memory-intensive algorithms.
• In this architecture, the map slots might be "full" while the reduce slots
are empty, and vice versa.
• This causes resource-utilization issues, which need to be addressed for
proper resource utilization.
6.2 HDFS Limitations
• The NameNode saves all its file metadata in main memory. Although
main memory today is not as small and as expensive as it used to be
two decades ago, there is still a limit on the number of objects that
one can hold in the memory of a single NameNode.
• The NameNode can quickly become overwhelmed as the load on the
system increases.
• In Hadoop 2.x, this is resolved with the help of HDFS Federation.
HDFS 2 Features
1. Horizontal scalability
2. High availability
YARN Components:
1. Global ResourceManager:
• This is the master daemon that arbitrates cluster resources among all
the applications in the system.
2. NodeManager:
• This is a per-machine slave daemon.
• The NodeManager's responsibility is to launch the application containers
for application execution.
• The NodeManager monitors resource usage such as memory, CPU, disk,
network, etc., and reports the usage of resources to the global
ResourceManager.
3. Per-application ApplicationMaster:
• This is an application-specific entity.
• Its responsibility is to negotiate required resources for execution from
the ResourceManager.
• It works along with the NodeManager for executing and
monitoring component tasks.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (memory, CPU,
disk, network, etc.), for example:
(a) container_0 = 2GB, 1 CPU
(b) container_1 = 1GB, 6 CPU
3. Replaces the fixed map/reduce slots.
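A sketch of how an ApplicationMaster might describe such a fine-grained
container request using the org.apache.hadoop.yarn.api.records classes (the
values are illustrative and mirror the examples above; real code would run
inside an ApplicationMaster with an AMRMClient connected to the
ResourceManager):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch: describing a fine-grained container request in YARN. Unlike fixed
// map/reduce slots, each container asks for exactly the memory and CPU it needs.
public class ContainerRequestSketch {
    public static void main(String[] args) {
        Resource capability = Resource.newInstance(2048, 1); // 2 GB, 1 vCore
        Priority priority = Priority.newInstance(0);

        // nodes/racks set to null: let the scheduler place the container.
        AMRMClient.ContainerRequest request =
                new AMRMClient.ContainerRequest(capability, null, null, priority);
        System.out.println("Requested: " + request.getCapability());
    }
}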
YARN Architecture:
Figure 5.29 depicts YARN architecture.
7. During the application execution, the client that submitted the job directly
communicates with the ApplicationMaster to get status, progress updates, etc. via
an application-specific protocol.
7.1 Pig
• Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow.
• Pig is an alternative to MapReduce Programming. It abstracts some details
and allows us to focus on data processing.
• It consists of two components.
1. Pig Latin: The data processing language.
2. Compiler: To translate Pig Latin to MapReduce Programming.
• Figure 5.30 depicts Pig in the Hadoop ecosystem.
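A small sketch of driving a Pig Latin data flow from Java via the
org.apache.pig.PigServer class (the input file, schema, and output path are
illustrative; the same statements could equally be typed into the Grunt shell):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Sketch: a tiny Pig Latin data flow driven from Java. Each registerQuery
// line is ordinary Pig Latin that the compiler turns into MapReduce jobs.
public class PigSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster

        pig.registerQuery("students = LOAD 'students.txt' USING PigStorage(',') "
                + "AS (name:chararray, marks:int);");
        pig.registerQuery("toppers = FILTER students BY marks > 80;");
        pig.store("toppers", "toppers_out"); // writes the result to disk
    }
}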
7.2 Hive
• Hive is a Data Warehousing Layer on top of Hadoop.
• Analysis and queries can be done using an SQL-like language.
• Hive can be used to do ad-hoc queries, summarization, and data analysis.
Figure 5.31 depicts Hive in the Hadoop ecosystem.
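Since Hive exposes an SQL-like interface, it can also be queried from Java over
JDBC. A minimal sketch (the table, query, and connection URL are illustrative
and assume a running HiveServer2 on localhost:10000):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: an ad-hoc Hive query over JDBC.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}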
7.3 Sqoop
• Sqoop is a tool which helps to transfer data between Hadoop and
Relational Databases.
• With the help of Sqoop, we can import data from an RDBMS to HDFS and
vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
7.4 HBase
• HBase is a NoSQL database for Hadoop.
• HBase is a column-oriented NoSQL database.
• HBase is used to store billions of rows and millions of columns.
• HBase provides random read/write operations. It also supports record-level
updates, which are not possible using HDFS directly. HBase sits on top of
HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.
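A short sketch of HBase's record-level read/write API in Java (the table name,
column family, and values are illustrative and assume the table already exists):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: random read/write against an HBase table "users" with a
// column family "info".
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Record-level write (acts as an update if the cell already exists).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                          Bytes.toBytes("Chennai"));
            table.put(put);

            // Random read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}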
10. ASSIGNMENT : UNIT – II
1. Since the data is replicated thrice in HDFS, does it mean that any
calculation done on one node will also be replicated on the other two?
2. Why do we use HDFS for applications having large datasets and not
when we have small files?
3. Suppose Hadoop spawned 100 tasks for a job and one of the tasks
failed. What will Hadoop do?
4. How does Hadoop differ from volunteer computing?
5. How would you deploy the different components of Hadoop in production?
11. PART A : Q & A : UNIT – II
14. What is the difference between SQL and MapReduce? (CO2, K1)
15. What are the three important classes of MapReduce? (CO2, K1)
● MapReduce programming requires three classes:
1. Driver Class: This class specifies the job configuration details.
2. Mapper Class: This class overrides the Map Function based on the problem
statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem
statement.
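For reference, a typical Driver class for the word-count job sketched earlier in
these notes, using the standard org.apache.hadoop.mapreduce.Job API (the class
names match that earlier sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: specifies the job configuration, wires the Mapper and
// Reducer classes together, and submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);  // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}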
17. Which daemon is responsible for executing overall MapReduce job? ( CO2, K1)
• The JobTracker provides connectivity between Hadoop and our application.
When we submit code to the cluster, the JobTracker creates the execution plan by
deciding which task to assign to which node. It also monitors all the running
tasks.
• JobTracker is a master daemon responsible for executing overall MapReduce
job.