Hadoop Overview: An Open Source Framework for Processing Large Amounts of Heterogeneous Data Sets in a Distributed Fashion


Hadoop overview

• Hadoop is an open source framework from the Apache Software Foundation.
• It is capable of processing large amounts of heterogeneous data sets in a
distributed fashion across clusters of commodity computers and hardware using
a simplified programming model.
• Hadoop provides a reliable shared storage and analysis system.
• Hadoop was created by Doug Cutting and Mike Cafarella.
• Hadoop originated from an open source web search engine called "Apache
Nutch", which is part of another Apache project called "Apache Lucene", a
widely used open source text search library.
• Hadoop core is hosted by the Apache Software Foundation.
• Hadoop has its origins at Google and is now used by companies such as Yahoo and Facebook.
Hadoop Use Cases
• Analytics
• Search
• Data Retention
• Log file processing
• Analysis of Text, Image, Audio, & Video content
• Recommendation systems like in E-Commerce Websites
Hadoop Distributions

• Apache Hadoop is a framework for distributed computing and storage.
• The volume of big data grew rapidly, and there was a need to handle
and use this data optimally.
• The introduction of Apache Hadoop helped handle this problem to a
great extent.
• Although Hadoop is free, various distributions offer an easier-to-use
bundle.
• There are a number of Hadoop distributions.
Top 5 Hadoop Distributions
HDFS and MapReduce - Two core
components of Hadoop
HDFS (Hadoop Distributed File System)
- HDFS is the Java-based file system for scalable and reliable storage of large datasets.
- Data in HDFS is stored in the form of blocks.
- It operates on a master/slave architecture.

Hadoop MapReduce -
- This is a Java-based programming paradigm of the Hadoop framework.
- It provides scalability across various Hadoop clusters.
- MapReduce distributes the workload into various tasks that can run in parallel.
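
For example, a quick way to see HDFS blocks in practice (a sketch assuming a running cluster and a hypothetical file name; on Hadoop 1.x the equivalent commands are "hadoop fs" and "hadoop fsck"):

# Copy a local file into HDFS (paths are hypothetical)
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put flights_1987.csv /user/demo/
# List the file and inspect how it was split into blocks and where the replicas live
hdfs dfs -ls /user/demo
hdfs fsck /user/demo/flights_1987.csv -files -blocks -locations
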
Main components of a Hadoop
Application
• Hadoop applications draw on a wide range of technologies that provide
a great advantage in solving complex business problems.
Core components of a Hadoop application -
• Hadoop Common
• HDFS
• Hadoop MapReduce
• YARN
The components used in Hadoop
Ecosystem
• Data Access Components are - Pig and Hive
• Data Storage Component is - HBase
• Data Integration Components are - Apache Flume, Sqoop
• Data Management and Monitoring Components are - Ambari, Oozie
and Zookeeper.
• Data Serialization Components are - Thrift and Avro
• Data Intelligence Components are - Apache Mahout and Drill.
Hadoop streaming
• The Hadoop distribution provides a generic application programming
interface (API).
• This API is used for writing Map and Reduce jobs in any desired
programming language like Python, Perl, Ruby, etc.
• This is referred to as Hadoop Streaming.
• Users can create and run jobs with any kind of shell script or
executable as the mapper or reducer.
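
A hedged sketch of how a streaming job is launched, assuming mapper.py and reducer.py are executable scripts in the current directory and that the input path exists in HDFS (the streaming jar location varies by Hadoop version):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/demo/input \
  -output /user/demo/streaming-output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
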
What is a commodity hardware?
• Hadoop is an open source framework capable of processing large amounts of
heterogeneous data sets in a distributed fashion across clusters of commodity
computers and hardware using a simplified programming model.
• Commodity hardware refers to inexpensive systems that do not offer high
availability or high-end quality.
• Commodity hardware still needs sufficient RAM, because specific services
(daemons) need to run in memory.
• Hadoop can be run on any commodity hardware and does not require any super
computers or high-end hardware configuration to execute jobs.
• A typical configuration for executing Hadoop jobs is dual-core machines or dual
processors with 4 GB or 8 GB of RAM.
What is cluster? How a Server
Cluster Works ?
Cluster –
• In a computer system, a cluster is a group of servers and other resources.
• A cluster acts like a single system and enables high availability, load balancing and parallel
processing.
Server cluster -
• A server cluster is a collection of servers, called nodes.
• The Nodes communicate with each other to make a set of services highly available to clients.
Hadoop cluster -
• A Hadoop cluster is a special type of computational cluster designed specifically for storing
and analyzing huge amounts of unstructured data in a distributed computing environment.
What it means to run a Hadoop job

• A Hadoop job runs as a set of daemons on different servers in the network.
• These daemons have specific roles.
• Some exist only on one server, some exist across multiple servers.
 Namenode
 Datanode
 Secondary Namenode
 Tasktracker
 Jobtracker

Topology of Hadoop Cluster
Name Node

Hadoop employs a master/slave architecture for both distributed
storage and distributed computation.
The NameNode is the master of HDFS.
 The NameNode is the bookkeeper of HDFS.
 A single NameNode stores all the metadata.
 Metadata is maintained entirely in RAM for fast lookup.
Data node

Each slave machine in your cluster will host a DataNode daemon to
perform the grunt work of the distributed filesystem.
When you want to read or write an HDFS file, the file is broken into blocks.
The NameNode tells your client which DataNode each block resides on.
Your client then communicates directly with the DataNode daemons to
process the local files corresponding to the blocks.
DataNodes are constantly reporting back to the NameNode.
Secondary Name node

The Secondary NameNode (SNN) is an assistant daemon for
monitoring the state of the cluster's HDFS.

The SNN differs from the NameNode in that this process doesn't
receive or record any real-time changes to HDFS.

Instead, it communicates with the NameNode to take snapshots of
the HDFS metadata at intervals defined by the cluster configuration.
Job tracker

The JobTracker daemon is the liaison between your application and Hadoop.
The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
JobTracker determines the execution plan by determining which files to process, assigns
nodes to different tasks, and
monitors all tasks as they’re running.
Should a task fail, the JobTracker will automatically relaunch
the task, possibly on a different node, up to a predefined limit of retries.
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
There is only one JobTracker daemon per Hadoop cluster.
Task tracker

Runs the tasks assigned by the JobTracker.
Sends progress reports (heartbeats) to the JobTracker.
Q&A
What happens to a NameNode that has no data?
• A NameNode without data does not exist.
• If it is a NameNode, then it will have some sort of data in it.

What happens when a user submits a Hadoop job when the NameNode is down - does the job go on hold or does it fail?
• The Hadoop job fails when the NameNode is down.

What happens when a user submits a Hadoop job when the JobTracker is down - does the job go on hold or does it fail?
• The Hadoop job fails when the JobTracker is down.

Whenever a client submits a Hadoop job, who receives it?
• The NameNode receives the Hadoop job; it then looks for the data requested by the client and provides the block
information.
• The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
HDFS: Architecture
HDFS Key features
Self-healing
Fault tolerant
Abstraction for parallel processing
Manages complex structured or unstructured data

HDFS weaknesses
• Block information stored in memory
• File system metadata size is limited to the amount of available RAM on the
NameNode (the small files problem)
• Single Point of Failure
• File system is offline if NameNode goes down
Where HDFS is not a fit
• Low latency data access
• Low latency describes a computer network that is optimized to process a
very high volume of data messages with minimal delay (latency).
• These networks are designed to support operations that require near real-
time access to rapidly changing data.
• HDFS is designed for very large data sets and high throughput, so data
access involves high latency.
• Lots of small files
• The NameNode holds the filesystem metadata in memory, so a very large
number of small files exhausts it.
Modes of Hadoop installation

• Standalone (or local) mode
• everything runs in a single JVM
• Pseudo-distributed mode
• the daemons run on the local machine
• Fully distributed mode
• the daemons run on a cluster of machines
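
As an illustration only, a minimal pseudo-distributed configuration for a Hadoop 1.x (MRv1) setup might look like the following; the host names and port numbers are assumptions and vary by installation:

<!-- core-site.xml: where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: address of the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
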
What is MapReduce
A parallel programming framework
MapReduce provides
Automatic parallelization
Fault tolerance
Monitoring and status updates
Map Reduce Overview
Tasks are distributed to multiple nodes
Each node processes the data stored in that node
Consists of two phases:
Map – Reads input data and produces intermediate keys and values as
output. Map output (intermediate keys and values) is stored on the local file system.
Reduce – Values with the same key are sent to the same reducer for further
processing.
Map Reduce Process
Anatomy of a MapReduce Job Run

• At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run.
• The tasktrackers, which run the tasks that the job has been split into.
• The distributed filesystem (normally HDFS), which is used for sharing
job files between the other entities.
MapReduce (MRv1): Hadoop
Architecture
Map Reduce: Job Tracker

Determines the execution plan by determining which files to process,
assigns nodes to different tasks, and monitors all tasks as they are
running.
When a task fails, the JobTracker will automatically relaunch
(ATTEMPT) the task, possibly on a different node, up to a predefined
limit of retries.
There is only one JobTracker daemon per Hadoop cluster.
MapReduce: Task Tracker

The TaskTracker is responsible for instantiating and monitoring individual
MapReduce tasks.
Maintains a heartbeat with the JobTracker.
There is one TaskTracker daemon per Hadoop data node.
Basic parameters of Mapper and
Reducer
The four basic parameters of a mapper are -
<LongWritable, Text, Text, IntWritable>

• LongWritable, Text - represent the input key and value types.
• Text, IntWritable - represent the intermediate output key and value types.

The four basic parameters of a reducer are -
<Text, IntWritable, Text, IntWritable>

• Text, IntWritable - represent the intermediate (map output) key and value types.
• Text, IntWritable - represent the final output key and value types.
MapReduce: Programming Model

Borrows from Functional Programming

Users implement two functions: Map and Reduce


MapReduce Data Flow in Hadoop
Map

Records are fed to the map function as key-value pairs.
Produces one or more intermediate values with an output key from
the input.
Shuffle and Sort

The output of every map task is copied to the reducers.
Every reduce task gets a set of keys to execute the task on.
The map task performs an in-memory sort of the keys and divides them
into partitions corresponding to the reducers that they will be sent to.
The reduce task performs a merge on the output it receives from
different map tasks.
Reduce

After map(), intermediate values are combined into a list.
Reduce() combines intermediate values into one or more final values
for that same output key.
Input format used in Mapper
What is InputFormat

 The InputFormat defines how to read data from a file into the Mapper instances.

 InputFormat divides the input data sources into fragments that make up the inputs to
individual map tasks.

 Examples of InputFormats used in Hadoop are TextInputFormat, SequenceFileInputFormat,
etc.

 TextInputFormat works with text files and describes the ways in which text files can be
processed.

 SequenceFileInputFormat is used for reading particular binary file formats.
InputFormat Class

 The InputFormat class is responsible for defining two main things:
• Input splits
• Record reader

 Files are processed locally by mappers as input splits, which are composed
of records. The input splits define the number of individual map tasks:
total number of input splits = total number of map tasks.
 The record reader is responsible for the actual reading of records from the
input file and submitting them (as key/value pairs) to the mapper.
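
A minimal sketch (assuming a Job object named job, as in the driver class shown later) of how the InputFormat is chosen; TextInputFormat is the default, so setting it explicitly is optional:

// In the driver, after the Job object has been created:
// TextInputFormat - each line becomes one record;
// key = byte offset in the file (LongWritable), value = the line itself (Text).
job.setInputFormatClass(TextInputFormat.class);

// For binary sequence files, SequenceFileInputFormat would be used instead:
// job.setInputFormatClass(SequenceFileInputFormat.class);
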
DATA BLOCK AND INPUT SPLITS IN HADOOP’S MAPREDUCE

• HDFS breaks down very large files into large blocks (for example,
measuring 128MB), and stores three copies of these blocks on different
nodes in the cluster.
• HDFS has no awareness of the content of these files.
• The key to efficient MapReduce processing is that, wherever possible,
data is processed locally — on the slave node where it’s stored.
• In Hadoop, files are composed of individual records, which are
ultimately processed one-by-one by mapper tasks.
• For example, the sample data set contains information about completed
flights within the United States between 1987 and 2008.
• Suppose, you have one large file for each year.
• Within every file, each individual line represents a single flight. In other
words, one line represents one record.
• The default block size for a Hadoop cluster is 64 MB or 128 MB, so the
data files are broken into chunks of exactly 64 MB or 128 MB.
• If each map task processes all records in a specific data block, what
happens to those records that span block boundaries?
• File blocks are exactly 64MB (or whatever you set the block size to be),
and because HDFS has no conception of what’s inside the file blocks, it
can’t gauge when a record might spill over into another block.
• To solve this problem, Hadoop uses a logical representation of the data stored in file blocks,
known as input splits.
• When a MapReduce job client calculates the input splits, it figures out where the first whole
record in a block begins and where the last record in the block ends.
• In cases where the last record in a block is incomplete, the input split includes location
information for the next block and the byte offset of the data needed to complete the record.
• The number of input splits that are calculated for a specific application determines the
number of mapper tasks.
• Each of these mapper tasks is assigned, where possible, to a slave node where the input split
is stored.
• The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
splits are processed locally.
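
As a rough worked example (the file size below is an assumption for illustration, not a figure from the slides): with a 128 MB block size, a 1 GB flight-data file is stored as 8 blocks, and with the split size left at the block size the job client computes 8 input splits, so the job runs 8 map tasks.

// Illustrative arithmetic only; the file and block sizes are assumptions.
public class SplitCount {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;  // one year's flight file, ~1 GB
        long blockSize = 128L * 1024 * 1024;   // HDFS block size, 128 MB
        // Round up, because a final partial block still becomes a split.
        long splits = (fileSize + blockSize - 1) / blockSize;
        System.out.println(splits);            // prints 8
    }
}
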
Data blocks and input splits in HDFS
Output Format used in Reducer
What is OutputFormat

 The OutputFormat determines where and how the results of your job are
persisted.
 Hadoop comes with a collection of classes and interfaces for different types
of format.

 TextOutputFormat writes line-separated, tab-delimited text files of key-value
pairs.

 Hadoop provides SequenceFileOutputFormat, which can write the binary
representation of objects instead of converting them to text, and can compress
the result.
RecordWriter

 The OutputFormat specifies how to serialize data by providing an implementation of RecordWriter.

 RecordWriter classes handle the job of taking an individual key-value pair and writing it to
the location prepared by the OutputFormat.

 A RecordWriter implements two functions: write and close.

 The write function takes key-value pairs from the MapReduce job and writes the bytes to disk.

 The default RecordWriter is LineRecordWriter.

 The close function closes the Hadoop data stream to the output file.
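
A minimal sketch (again assuming a Job object named job in the driver) of how the OutputFormat is chosen; TextOutputFormat is the default:

// TextOutputFormat - one key-value pair per line, tab-delimited.
job.setOutputFormatClass(TextOutputFormat.class);

// Alternatively, write compressed binary output with SequenceFileOutputFormat:
// job.setOutputFormatClass(SequenceFileOutputFormat.class);
// FileOutputFormat.setCompressOutput(job, true);
// SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
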
Word Count Example (Pseudo code)
What does the word count problem signify?
• The basic goal of this program is to count the unique words in a text
file.
Word Count Example
WordCountMapper.java
• Map class starts with import statements.
• It imports Hadoop specific data types for key and values.
• In Hadoop, key and value data types must be Hadoop-specific (Writable) types.
• LongWritable is similar to Long data type in Java which is used to take
care of a long number.
• Text is similar to String data type which is sequence of characters.
• IntWritable is similar to Integer in Java.
• Every Map class extends the abstract class Mapper and overrides the
map() function.
• LongWritable, Text - Data types for Input key and input value, which Hadoop
supplies to map().
• Text, IntWritable – Data types for output key and output value.
• Two fields have been declared - one and word - which are required in the
processing logic.
• map() has the parameters - input key, input value, and context.
• The role of context (of the Context data type) is to collect the output key-value pairs.
• In the processing logic of map(), we tokenize the string into words and write
each one into the context using the method context.write(word, one),
• where word is used as the key and one is used as the value.
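
A minimal sketch of a mapper consistent with the description above (the exact code listing from the slides is not reproduced here, so treat this as an illustration):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // The two fields mentioned above: the constant count and the reusable key.
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line into words and emit (word, 1) for each one.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
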
WordCountReducer.java
• Every Reduce class needs to extend the Reducer class (an abstract class.)
• <Text, IntWritable, Text, IntWritable> - Hadoop-specific type parameters.
• Text, IntWritable - data types used for the input key and value.
• Text, IntWritable - data types used for the output key and value.
• We need to override reduce(Text key, Iterable<IntWritable> values, Context
context).
• The input to reduce() is a key and a list of values.
• So values is specified as an Iterable. Here, context collects the output key-value
pairs.
Logic used in reduce()
• The logic used in the reduce function is a for loop which iterates over
the values.
• We add each value to the sum field.
• After all the values of a particular key are processed, we output the
key and value pair through context.write(key, new IntWritable(sum)).
• The data types used for the input key and value of reduce() should
match the data types of the output key and value of map().
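
A matching reducer sketch, again as an illustration of the logic described above rather than a copy of the slide's listing:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts that arrived for this word.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the word with its total count.
        context.write(key, new IntWritable(sum));
    }
}
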
The driver class
• The Job object controls the execution of the job; it is used to set the
job parameters so that Hadoop can take it from that point and
execute the job as specified by the programmer.
• We need to set the file input path and file output path in the driver
class. These paths will be passed as command line arguments.
• The job.waitForCompletion() method is used to trigger the submission
of the job to Hadoop.
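
A minimal driver sketch consistent with the description above (class names are the ones assumed in the earlier sketches; Job.getInstance is the newer factory method, while very old Hadoop 1.x releases use new Job(conf, ...) instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value types produced by the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed as command line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
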
Compile and Run the Hadoop job
Map and reduce job is running
Hadoop job completed
What is in the part file
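
The screenshots behind the four slide titles above are not reproduced here; as a hedged sketch, the compile-and-run workflow typically looks like this (jar, class and path names are assumptions, and the part file may be named part-00000 or part-r-00000 depending on the API and version):

# Compile against the Hadoop client libraries and package the classes into a jar
javac -classpath $(hadoop classpath) -d classes WordCount*.java
jar -cvf wordcount.jar -C classes/ .

# Run the job; the input and output paths are HDFS paths
hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output

# Inspect the result written by the single reducer
hdfs dfs -cat /user/demo/output/part-r-00000
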
Output from a MapReduce job
The output from running the job provides some useful information.
For example,
1) We can see that the job was given an ID of job_local_0001, and it ran one map task and one
reduce task, with the task attempt IDs attempt_local_0001_m_000000_0 and attempt_local_0001_r_000000_0.
2) Knowing the job and task IDs can be very useful when debugging MapReduce jobs.

3) "Counters" shows the statistics that Hadoop generates for each job it runs. These are very useful
for checking whether the amount of data processed is what you expected.
• For example, we can follow the number of records that went through the system:
• five map inputs produced five map outputs, then five reduce inputs in two groups produced two
reduce outputs.
