BDA - Unit 2


UNIT 2 : HADOOP

OUTLINE:
• History of Hadoop

• Hadoop Distributed File System

• Developing a Map Reduce Application

• Hadoop Environment

• Security in Hadoop

• Administering Hadoop

• Monitoring & Maintenance


History Of Hadoop
• Hadoop is an open-source software framework for storing and
processing large datasets ranging in size from gigabytes to
petabytes.
• Hadoop was created by Doug Cutting and Mike
Cafarella in 2005

• Hadoop was developed at the Apache Software Foundation.

• In 2008, Hadoop broke a world record by becoming the fastest
system on the planet for sorting a terabyte of data.

• There are basically two components in Hadoop:

1. Hadoop Distributed File System (HDFS): for storage
• It allows you to store data of various formats across a cluster.
2. YARN / MapReduce: for computation
• It handles resource management in Hadoop and allows parallel
processing over the data stored across HDFS.
Basics Of Hadoop
• Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing power and the ability to
handle virtually limitless concurrent tasks or jobs.
• Unlike data residing in the local file system of a personal computer, data in Hadoop resides in a
distributed file system called the Hadoop Distributed File System (HDFS).
• The processing model is based on the ‘Data Locality’ concept wherein computational logic is sent
to cluster nodes(servers)containing data.
• This computational logic is nothing, but a compiled version of a program written in a high-level
language such as Java.
• Such a program processes data stored in Hadoop HDFS.
Hadoop offers: redundant, fault-tolerant data storage; a parallel computation framework; and job
coordination.
Motivation Questions
Problem 1: Data is too big to store on one machine!
Solution (HDFS): Store the data on multiple machines!

Problem 2: Very high-end machines are too expensive!
Solution (HDFS): Run on commodity hardware!

Problem 3: Commodity hardware will fail!
Solution (HDFS): The software is intelligent enough to handle hardware failure!

Problem 4: What happens to the data if the machine storing the data fails?
Solution (HDFS): Replicate the data!

Problem 5: How can distributed machines organize the data in a coordinated way?
Solution (HDFS): Master-Slave architecture!
Key Distinctions Of
Hadoop
1. Accessible: Hadoop runs on large clusters of
commodity machines or on cloud services such as
Amazon’s Elastic Compute Cloud(EC2).
2. Robust: Because it is intended to run on commodity
hardware, Hadoop is architected with the assumption
of frequent hardware malfunctions and can gracefully
handle most such failures.
3. Scalable: Hadoop scales linearly to handle larger data
by adding more nodes to the cluster.
4. Simple: Hadoop allows users to quickly write efficient
parallel code.
Hadoop File System
Why HDFS?
About HDFS
• HDFS was developed using distributed file system design.
• It runs on commodity hardware.
• HDFS holds a very large amount of data and provides easier
access.
• Features of HDFS:
1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with
HDFS.
3. Streaming access to file system data.
4. HDFS provides file permissions and authentication.
HDFS Master-Slave
Architecture
Hadoop Core-Components
HDFS Architecture
Name Node
• The NameNode is a master node that manages and
maintains the Data Node ( Slave Node).
• Performs file system namespace operations like open, close,
rename files etc.
• It maintains the filesystem tree and the metadata for all the
files and directories in the tree.
• Keeps the metadata in the main memory.
• Metadata contains a list of files, blocks of each file, list of
data nodes and blocks on each data node, file attributes such
as replication factor, creation time etc.
• The NameNode is a single point of failure in the Hadoop cluster:
when the NameNode is down, the HDFS/Hadoop cluster is
inaccessible and considered down.
 Two main components of the NameNode:
 FSImage: a point-in-time snapshot of HDFS's namespace.
The namespace contains the mapping of files to blocks.
 EditLog: a log of every operation performed on the file
system, such as file creation or deletion.
Data Node
• It is the slave daemon process which runs on each slave
machine.
• The actual data is stored on DataNodes.
• It is responsible for serving read and write requests from
clients.
• It is also responsible for creating, deleting and replicating
blocks based on the decisions taken by the NameNode.
• It sends heartbeats to the NameNode periodically to report
the overall health of HDFS; by default this frequency is set
to 3 seconds.
Secondary Name Node
• The Secondary NameNode works concurrently with
the primary NameNode as a helper daemon/process.
• It constantly reads the file system metadata from
the RAM of the NameNode and writes it into the
hard disk or the file system.
• It is responsible for combining the EditLogs with Fsimage
from the name node.
• It downloads the EditLogs from the NameNode at regular
intervals and applies them to Fsimage.
• The new FsImage is copied back to the NameNode ,which
is used whenever the NameNode is started the next time.
Blocks
• A block is the minimum amount of data that HDFS can read or
write. HDFS blocks are 128 MB by default, and this is
configurable.
• Files in HDFS are broken into block-sized chunks, which
are stored as independent units.
• Unlike a traditional file system, if a file in HDFS is smaller than
the block size, it does not occupy the full block size;
e.g. a 5 MB file stored in HDFS with a block size of 128 MB
takes only 5 MB of space.
• The HDFS block size is large just to minimize the cost of
seeks.
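
The block size and block placement of a file can also be inspected programmatically. Below is a minimal Java sketch using the standard Hadoop FileSystem API; the path /data/sample.txt is a hypothetical example and the client is assumed to be configured to point at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize());
        // Each BlockLocation lists the DataNodes holding one block (and its replicas).
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}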
Advantages Of HDFS
Architecture
1. It is a highly scalable data storage system. This makes it ideal for
data-intensive applications like Hadoop and streaming analytics.
2. It is very easy to implement, yet very robust. There is a lot of
flexibility you get with Hadoop. It is a fast and reliable file
system.
3. This makes Hadoop a great fit for a wide range of data
applications. The most common one is analytics. You can use
Hadoop to process large amounts of data quickly, and then
analyze it to find trends or make recommendations.
4. You can increase the capacity of the cluster simply by adding
more nodes (servers).
5. Data locality reduces the overhead of data movement across
the cluster, and replication provides high availability of data.
HDFS READ AND WRITE
HDFS Read: the file is brought from the cluster to the local file system
of a data node.

Command (to be executed on the terminal of any data node):

hadoop fs -copyToLocal [path of the file in HDFS] [local destination path]

A Java program is executed by this command to perform the
series of operations, though at the user's end it remains transparent.

Steps:
1) The client node requests the NameNode to send all the block locations
of the given file (including replicas).
2) A list in the form of block number, replica number and data node
number is sent to the client.
3) Based on the location of the data nodes, the client will choose
the closest replica of the required block.
4) The process above gets repeated until all the blocks of the
given file are read.
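
A minimal Java sketch of the same read path, using the standard FileSystem API (the paths are hypothetical examples): copyToLocalFile mirrors the -copyToLocal command, while open streams the blocks directly, with the client reading each block from the closest replica.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent to: hadoop fs -copyToLocal /data/input.txt /tmp/input.txt
        fs.copyToLocalFile(new Path("/data/input.txt"), new Path("/tmp/input.txt"));
        // Or stream the file directly, block by block, from the closest replicas.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}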
HDFS Write: the file is brought from the local file system of a data node
to the cluster.

Command (to be executed on the terminal of any data node):

hadoop fs -copyFromLocal [path of the local file] [destination path in HDFS]

A Java program is executed by this command to
perform the series of operations,
though at the user's end it remains transparent.



•The client node requests the NameNode
to allocate the blocks for the file,
to choose the data nodes for each
block, and to decide where the replicas
of each block are to be stored.
•The NameNode checks the
permissions of the user and whether a
file with the same name already exists.
•If everything is found ok, it will
allocate the blocks.
•The NameNode will choose data
nodes that are not busy and have
enough space to place the blocks. To
place replicas, it uses replica
placement strategies.
•After selecting the nodes, a
data pipeline is created and
the data is written in the form
of packets.
•The first node stores the data
and passes it to the second
node (replica), and the second
passes it to the third.
•After successful
completion, all data nodes
send an acknowledgement of
the block written.
•The process above
continues until all the blocks
are written successfully.
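
A minimal Java sketch of the write path, again with hypothetical paths: copyFromLocalFile mirrors the -copyFromLocal command, while create writes directly through the DataNode replication pipeline described above.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent to: hadoop fs -copyFromLocal /tmp/local.txt /data/local.txt
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/local.txt"));
        // Or write directly; packets flow through the DataNode pipeline and are replicated.
        try (FSDataOutputStream out = fs.create(new Path("/data/notes.txt"), true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}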
MAP REDUCE
MapReduce
• MapReduce is a data processing tool which is used to
process data in parallel in a distributed manner.
• It was developed in 2004, on the basis of a paper titled
"MapReduce: Simplified Data Processing on Large Clusters,"
published by Google.
• The MapReduce is a paradigm which has two phases, the
mapper phase, and the reducer phase.
• In the Mapper, the input is given in the form of a key-value
pair.
• The output of the Mapper is fed to the reducer as input. The
reducer runs only after the Mapper is over.
• The reducer too takes input in key-value format, and the
output of the reducer is the final output.
Steps in MapReduce
Steps:
1. The map takes data in the form of <key, value> pairs and returns a list of
<key, value> pairs. The keys will not be unique in this case.

2. Using the output of Map, sort and shuffle are applied by the
Hadoop architecture. This sort and shuffle stage takes the list of
<key, value> pairs and sends out each unique key together with the
list of values associated with that key: <key, list(values)>.

3. The output of sort and shuffle is sent to the reducer phase. The
reducer performs a defined function on the list of values for each
unique key, and the final output <key, value> is
stored/displayed.
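
The classic word-count job illustrates these three steps. The sketch below uses the standard org.apache.hadoop.mapreduce API; the class names and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Step 1 (Map): emit <word, 1> for every word; keys are not unique here.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Step 3 (Reduce): sum the list of values that shuffle/sort grouped under each unique key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}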
Data Flow in MapReduce
Phases Of MapReduce
1. Input Reader
• The input reader reads the incoming data and splits it into
data blocks of the appropriate size (64 MB to 128
MB). Each data block is associated with a Map function.
• Once the input reader reads the data, it generates the corresponding
key-value pairs. The input files reside in HDFS.

2. Map function
• The map function processes the incoming key-value pairs
and generates the corresponding output key-value pairs.
• The map input and output types may be different from each
other.
3. Partition function
• The partition function assigns the output of each Map
function to the appropriate reducer.
• It is given the key (and value) of each output record and
returns the index of the reducer (see the Partitioner sketch
after this list).

4. Shuffling and Sorting


• The data is shuffled between/within nodes so that it moves
out of the map phase and gets ready to be processed by the
Reduce function.
• Sometimes, the shuffling of data can take much computation
time.
• The sorting operation is performed on input data for Reduce
function.
• Here, the data is compared using the comparison function
and arranged in a sorted form.
5. Reduce function
• The Reduce function is assigned to each unique key. These
keys are already arranged in sorted order.
• The Reduce function iterates over the values associated with
each key and generates the corresponding output.

6. Output writer
• Once the data has flowed through all the above phases, the Output
writer executes.
• The role of the Output writer is to write the Reduce output
to stable storage.
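
As referenced in the partition-function phase above, a custom partitioner is just a class that maps an intermediate key to a reducer index. The following is an illustrative Java sketch; the routing rule and class name are made up for the example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: keys starting with 'a'-'m' go to reducer 0,
// all other keys are spread over the remaining reducers by hash.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// Registered on the job with: job.setPartitionerClass(AlphabetPartitioner.class);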
MapReduce Architecture
Job Tracker
• Job Tracker is the one to which client applications submit
mapreduce programs(jobs).

• Job Tracker schedules clients' jobs and allocates tasks to
the slave task trackers that are running on individual
worker machines (data nodes).

• Job tracker manages overall execution of Map-Reduce job.

• Job tracker manages the resources of the cluster like:


1. Managing the data nodes, i.e. the task trackers.
2. Keeping track of the consumed and available resources.
3. Keeping track of already running tasks and providing
fault tolerance for tasks, etc.
Task Tracker
• Each Task Tracker is responsible for executing and managing
the individual tasks assigned by the Job Tracker.

• Task Tracker also handles the data motion between the


map and reduce phases.

• If the Job Tracker fails to receive a heartbeat from a Task


Tracker within a specified amount of time, it will
assume the Task Tracker has crashed and will resubmit
the corresponding tasks to other nodes in the cluster.
Example:
Phases Of MapReduce
Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides
various services to solve big data problems.
• It includes Apache projects and various commercial tools
and solutions.
• There are four major elements of Hadoop i.e. HDFS,
MapReduce, YARN, and Hadoop Common. Most of
the tools or solutions are used to supplement or support
these major elements.
• All these tools work collectively to provide services such
as absorption, analysis, storage and maintenance of data
etc.
Hadoop Ecosystem
1. HDFS
• HDFS is the primary or major component of Hadoop
ecosystem and is responsible for storing large data sets of
structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
• Name node
• Data Node
• Name Node is the prime node which contains metadata (data
about data), requiring comparatively fewer resources than the
data nodes that store the actual data. These data nodes are
commodity hardware in the distributed environment, which
undoubtedly makes Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and
hardware, thus working at the heart of the system.
2. YARN
• Yet Another Resource Negotiator, as the name implies, YARN
is the one who helps to manage the resources across the
clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
• Consists of three major components i.e.
Resource Manager
Node Manager
Application Manager
• Resource manager has the privilege of allocating resources for
the applications in a system whereas Node managers work on
the allocation of resources such as CPU, memory, bandwidth
per machine and later on acknowledge the resource manager.
• Application manager works as an interface between the
resource manager and node manager and performs negotiations
as per the requirement of the two.
3. MapReduce
• By making use of distributed and parallel algorithms,
MapReduce makes it possible to carry the processing
logic over to the data and helps to write applications which
transform big data sets into manageable ones.
• MapReduce makes the use of two functions i.e. Map() and
Reduce() whose task is:
• Map() performs sorting and filtering of data and thereby
organises them in the form of groups. Map generates a
key-value pair-based result which is later on processed
by the Reduce() method.
• Reduce(), as the name suggests does the summarization
by aggregating the mapped data. In simple, Reduce()
takes the output generated by Map() as input and
combines those tuples into a smaller set of tuples.
4. PIG
• Pig was developed by Yahoo. It works on Pig Latin,
a query-based language similar to SQL.
• It is a platform for structuring the data flow,
processing and analyzing huge data sets.
• Pig does the work of executing commands and in the
background, all the activities of MapReduce are taken care
of. After the processing, pig stores the result in HDFS.
• Pig Latin language is specially designed for this framework
which runs on Pig Runtime. Just the way Java runs on
the JVM.
• Pig helps to achieve ease of programming and optimization
and hence is a major segment of the Hadoop Ecosystem.
5. HIVE
• With the help of SQL methodology and interface, HIVE
performs the reading and writing of large data sets. However,
its query language is called HQL (Hive Query Language).
• It is highly scalable as it allows real-time processing and
batch processing. Also, all the SQL datatypes are supported
by Hive thus, making the query processing easier.
• Similar to the Query Processing frameworks, HIVE too
comes with two components: JDBC Drivers and HIVE
Command-Line.
• JDBC, along with ODBC drivers work on establishing the
data storage permissions and connection whereas the HIVE
Command line helps in the processing of queries.
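
A minimal sketch of querying Hive through its JDBC driver is shown below; the HiveServer2 URL, the credentials, and the words table are hypothetical and depend on the deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; host, port and database are deployment-specific.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL query against a hypothetical table named "words".
             ResultSet rs = stmt.executeQuery("SELECT word, count(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}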
6. Mahout
• Mahout provides machine learnability to a system or
application.
• Machine Learning, as the name suggests helps the system to
develop itself based on some patterns, user/environmental
interaction or on the basis of algorithms.
• It provides various libraries or functionalities such as
collaborative filtering, clustering, and classification which
are nothing but concepts of Machine learning.
• It allows invoking algorithms as per our needs with the help
of its own libraries.
7. Apache Spark
• It is a platform that handles all the process-consumptive
tasks like batch processing, interactive or iterative real-time
processing, graph conversions, and visualization, etc.
• It consumes in-memory resources, thus being faster
than prior frameworks in terms of optimization.
• Spark is best suited for real-time data whereas Hadoop is best
suited for structured data or batch processing, hence both are
used interchangeably in most companies.
8. HBase
• It is a NoSQL database which supports all kinds of data and is
thus capable of handling anything in a Hadoop database. It
provides capabilities similar to Google's BigTable, thus being
able to work on Big Data sets effectively.
• When we need to search or retrieve the occurrences
of something small in a huge database, the request must be
processed within a short span of time. At such times,
HBase comes in handy, as it gives us a tolerant way of storing
limited data.
• Apart from all of these, there are some other components too
that carry out a huge task in order to make Hadoop capable of
processing large datasets.
Security In Hadoop
1. KERBEROS
• Kerberos is an authentication protocol that is now used as a
standard to implement authentication in the Hadoop cluster.
• Hadoop, by default, does not do any authentication, which
can have severe effects on corporate data centres. To
overcome this limitation, Kerberos which provides a secure
way to authenticate users was introduced in the Hadoop
Ecosystem.
• The client performs three steps while using Hadoop with
Kerberos:
Authentication: In Kerberos, the client first authenticates itself
to the authentication server. The authentication server provides
the timestamped Ticket-Granting Ticket (TGT) to the client.
Authorization: The client then uses TGT to request a service
ticket from the Ticket-Granting Server.
Service Request: On receiving the service ticket, the client
directly interacts with the Hadoop cluster daemons such as
NameNode and ResourceManager.
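
On a Kerberos-enabled cluster, a Java client typically obtains its ticket before touching HDFS or YARN. A minimal sketch using Hadoop's UserGroupInformation API follows; the principal and keytab path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");   // enable Kerberos authentication
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab; this obtains a TGT from the authentication server.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
        // Subsequent calls to cluster daemons (NameNode, ResourceManager) use service tickets.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Authenticated as: " + UserGroupInformation.getLoginUser());
        System.out.println("Home dir exists: " + fs.exists(new Path("/user/analyst")));
        fs.close();
    }
}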
2. Transparent Encryption in HDFS
• For data protection, Hadoop HDFS implements transparent
encryption.
• Once it is configured, the data that is to be read from and
written to the special HDFS directories is encrypted and
decrypted transparently without requiring any changes to the
user application code.
• This encryption is end-to-end encryption, which means that
only the client will encrypt or decrypt the data.
• Hadoop HDFS will never store or have access to
unencrypted data or unencrypted data encryption keys,
satisfying at-rest encryption, and in-transit encryption.
• At-rest encryption refers to the encryption of data when data
is on persistent media such as a disk.
• HDFS encryption enables the existing Hadoop applications
to run transparently on the encrypted data.
• This HDFS-level encryption also prevents filesystem or OS-
level attacks.
3. HDFS file and directory permission.
• For authorizing the user, the Hadoop HDFS checks the files and
directory permission after the user authentication.
• The HDFS permission model is very similar to the POSIX model.
Every file and directory in HDFS has an owner and a group.
• The files or directories have different permissions for the owner,
group members, and all other users.
• HDFS performs a permission check for the file or directory accessed
by the client as follows:
1. If the user name of the client access process matches the owner of
the file or directory, then HDFS performs the test for the owner
permissions;
2. If the group of the file/directory matches any of the members of the
group list of the client access process, then HDFS performs the test
for the group permissions;
3. Otherwise, HDFS tests the other permissions of the file/directory.
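
Permissions and ownership can be managed from the shell (hdfs dfs -chmod / -chown) or programmatically. Below is a minimal Java sketch using the FileSystem API; the directory, user and group names are hypothetical, and changing ownership normally requires superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/reports");              // hypothetical directory
        // Owner: rwx, group: r-x, others: no access (equivalent to chmod 750).
        fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
        fs.setOwner(dir, "analyst", "analytics");          // hypothetical user and group
        System.out.println(fs.getFileStatus(dir).getPermission());
        fs.close();
    }
}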
Administering Hadoop
• Hadoop administration includes both HDFS and
MapReduce administration.

• HDFS administration includes monitoring the HDFS file


structure, locations, and updated files.

• MapReduce administration includes monitoring the list of


applications, the configuration of nodes, application status
etc.
Monitoring
HDFS Monitoring
• HDFS (Hadoop Distributed File System) contains the user
directories, input files, and output files.
• Use the HDFS shell commands put and get for storing
and retrieving files.

MapReduce Monitoring
• A MapReduce application is a collection of jobs(Map job,
Combiner, Partitioner, and Reduce job)
• It is mandatory to monitor and maintain the following:
• The configuration of the datanodes where the application
runs.
• The number of datanodes and resources used per
application.
Maintenance
• Hadoop Admin Roles and Responsibilities include setting
up Hadoop clusters.

• Other duties involve backup, recovery and maintenance.

• Hadoop administration requires good knowledge of
hardware systems and an excellent understanding of
Hadoop architecture.

• With the increased adoption of Hadoop in traditional
enterprise IT solutions and the increased number of
Hadoop implementations in production environments,
the need for Hadoop operations and administration
experts to take care of large Hadoop clusters keeps growing.
Thank You 
