Chap 6 - MapReduce Programming


MapReduce Programming

Thinking in Parallel
After understanding HDFS, the next question is:
“How can I analyze or query my data?”

Google developed MapReduce – a data processing model.

Writing applications for distributed systems is completely
different from writing the same code for centralized systems.
Thinking in Parallel
Problems –
1. Sum a given list of 1000 numbers.
2. Count the number of flights for each carrier in our flight data set – flights-data-small.doc
MapReduce Architecture
MapReduce Example
Closer look at MapReduce Flow
The phases are:
• Mapper
• Shuffle & Sort
• Reducer
Mapper Phase
1. The mapper task is the first phase of processing: it processes each input record and generates an intermediate key-value pair (a minimal mapper sketch follows this list).
2. The mapper stores its intermediate output on the local disk.
3. In a mapper task, the output is the full collection of all these <key, value> pairs.
4. Sorting is done on the basis of keys.
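A minimal mapper sketch in Java, using Hadoop's org.apache.hadoop.mapreduce API and assuming plain-text input read line by line (the key is the line's byte offset). It emits <word, 1> for every word, as in the word-count example later in this chapter.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits an intermediate <word, 1> pair for every word in the input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate <key, value> pair
        }
    }
}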
Shuffle & Sort Phase
• Shuffling is the process that transfers the mappers’ intermediate output to the reducers.
• Data from the mappers are grouped by key, split among reducers, and sorted by key.
• Every reducer obtains all values associated with the same key.
Reduce Phase
• The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each (a reducer sketch follows this list).
• Reducer functions may aggregate, filter, and combine this data in a number of ways.
• Reducers run in parallel since they are independent of one another. The user decides the number of reducers; by default the number of reducers is 1.
• This phase is optional – e.g. a job that only converts the given text to uppercase needs no reducer.
• Note that the output of mapper tasks is not written to HDFS.
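A matching reducer sketch for the mapper above: it sums the 1s for each key. The number of reducers is chosen in the driver with job.setNumReduceTasks(n); as noted, the default is 1.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after shuffle & sort and emits <word, total>.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();             // aggregate all values for this key
        }
        result.set(sum);
        context.write(key, result);     // final <key, value> pair
    }
}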
Problems for Groups
• Count the number of flights for each carrier in our flight data set – flights-data-small.doc (a mapper sketch for this problem is given after this list)
• Sales data – find total sales by country
• Movie tags – find the movie which has the maximum number of tags
• Movie ratings – find the movie which has the maximum average rating
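A sketch for the first problem, counting flights per carrier. The layout of flights-data-small.doc is not reproduced here, so the comma delimiter and the CARRIER_FIELD index below are assumptions; paired with a sum reducer like the one above, this yields one flight count per carrier.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <carrier, 1> for each flight record; a sum reducer then totals them per carrier.
public class CarrierCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private static final int CARRIER_FIELD = 8;   // assumed column index of the carrier code
    private final Text carrier = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed comma-separated records
        if (fields.length > CARRIER_FIELD) {
            carrier.set(fields[CARRIER_FIELD].trim());
            context.write(carrier, ONE);
        }
    }
}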
MapReduce Cycle
Importance of MapReduce
• Hadoop & MapReduce framework handles all sorts of
complexity, thus relieving the burden on programmer

• Fairly difficult to code MapReduce

• There are very few MapReduce Programmers


Blocks
• A file is split into blocks for storage
• A file consists of records to be processed
• Logical records do not fit neatly into HDFS blocks
• Do you see a problem in processing a block?
Input Splits
Example
Data:
Ramesha 78
Kousalya 97

Block size: 8 bytes

Block 1 - Ramesha 7
Block 2 - 8Kousalya
Block 3 - 97

Split 1 - Ramesha 78
Split 2 - Kousalya 97
Issue – Records are split across blocks
• If each map task processes all records in a data block, what happens to those records that span block boundaries?

• HDFS has no conception of what is inside the file blocks, so it cannot gauge when a record might spill over into another block.

• To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits.
Input Splits
Who finds input splits? – The client application

How?
• It checks whether all records inside the block are complete.
• If not, it captures the location of the next/previous block and the byte offset of the data needed to complete the record (see the sketch below).
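As a rough illustration of this client-side step, the sketch below (a hypothetical SplitInspector utility, assuming the MRv2 API) asks an InputFormat to compute the logical splits for an input path and prints each split's length and block locations.

import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The client asks the InputFormat to compute logical splits over the input file(s).
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            // Each split knows its length and the hosts holding the underlying blocks.
            System.out.println(split + "  length=" + split.getLength()
                    + "  locations=" + String.join(",", split.getLocations()));
        }
    }
}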
Block vs Input Split
Size
• Block: The default size of an HDFS block is 128 MB, which we can configure as per our requirement. All blocks of a file are of the same size except the last block, which can be the same size or smaller.
• Input Split: By default, split size is approximately equal to the block size; it is based on the size of the data in the MapReduce program (see the configuration sketch below).

Data Representation
• Block: It is the physical representation of data. It contains the minimum amount of data that can be read or written.
• Input Split: It is the logical representation of the data present in the block. It is used during data processing in the MapReduce program.
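If the default "split size ≈ block size" behaviour is not what a job needs, FileInputFormat lets the driver bound the computed split size. A small configuration sketch; the 64 MB and 128 MB values are chosen purely for illustration.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Constrain the logical split size computed by FileInputFormat (values in bytes).
    static void configureSplitSize(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // at least 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);   // at most 128 MB
    }
}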
MapReduce Processing Flow
Speculative Execution
MapReduce breaks jobs into tasks, and these tasks run in parallel rather than sequentially, which reduces overall execution time.

This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job.
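To cope with such stragglers, Hadoop can launch speculative duplicate attempts of slow tasks; the first attempt to finish wins and the others are killed. A sketch of toggling this per job, assuming the MRv2 property names mapreduce.map.speculative and mapreduce.reduce.speculative.

import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionConfig {
    // Enable (or disable, by passing false) speculative attempts for map and reduce tasks.
    static void configure(Configuration conf, boolean enabled) {
        conf.setBoolean("mapreduce.map.speculative", enabled);
        conf.setBoolean("mapreduce.reduce.speculative", enabled);
    }
}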
Data locality
• Moving the computation unit to the data rather than the data to the computation unit.

• Processing of the data happens on the very node where the data blocks are present, by that node’s NodeManager.
Components of a MapReduce Application
• Driver (mandatory; a driver sketch follows this list)
• Mapper class (mandatory)
• Reducer class (optional)
• Combiner class (optional)
• Partitioner class (optional)
• Record reader and Record writer classes
(optional)
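A driver sketch that wires these components together, reusing the TokenizerMapper and IntSumReducer sketches from earlier slides. The combiner here simply reuses the reducer class, which works because summing is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and wires the mandatory and optional components together.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);     // mandatory
        job.setCombinerClass(IntSumReducer.class);     // optional "mini-reducer"
        job.setReducerClass(IntSumReducer.class);      // optional
        job.setNumReduceTasks(2);                      // default is 1

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));      // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}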
Record Reader
RecordWriter
RecordWriter is the class that takes an individual key-value pair (i.e. the output from the reducer) and writes it to HDFS at the location prepared by the OutputFormat.
Hadoop Combiner – a “mini-reducer” that summarizes the mapper output records with the same key before passing them to the reducer.
Partitioner
Determines which partition a given (key, value) pair will go to.

A hash function is applied on the key (or a subset of the key) to find the partition (see the sketch below).
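A custom partitioner sketch mirroring this hash-based approach (Hadoop's default HashPartitioner works the same way); it would be registered in the driver with job.setPartitionerClass(CarrierPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based partitioner: every occurrence of a key goes to the same reducer.
public class CarrierPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}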
Steps in MapReduce Application Flow
1. Determine the exact data sets to process from the data blocks. This involves calculating where the records are located.

2. Run the specified algorithm against each record in the data set until all the records are processed. The individual instance of the application running against a block of data in a data set is known as a mapper task.

3. Locally perform an interim reduction of the output of each mapper (the combiner). This phase is optional because, in some common cases, it isn’t desirable.

4. Based on partitioning requirements, group the applicable partitions of data from each mapper’s result sets.

5. Boil down the result sets from the mappers into a single result set – the Reduce part of MapReduce.
Word Count Example
Sales Data Example
Shuffle Phase
• To speed up the overall MapReduce process, data is moved to the reducer tasks’ nodes as soon as it is produced, to avoid a flood of network activity when the final mapper task finishes its work.

• A reduce task’s processing cannot begin until all mapper tasks have finished.

• Speculative execution – avoids scenarios where the performance of a MapReduce job is hampered by one straggling mapper task that is running on a poorly performing slave node.
Hadoop 2.x YARN Benefits

• High Scalability
• High Availability
• Supports Multiple Programming Models
• Supports Multi-Tenancy
• Supports Multiple Namespaces
• Improved Cluster Utilization
• Supports Horizontal Scalability
