Chap 6 - MapReduce Programming

MapReduce Programming
Thinking in Parallel
After understanding HDFS, the next question is:
“How can I analyze or query my data?”
Google developed MapReduce, a data processing model.
Writing applications for distributed systems is completely
different from writing the same code for centralized
systems.
Thinking in Parallel
Problems –
1. Sum 1000 given numbers.
2. Count the number of flights for each carrier in our
flight data set – flights-data-small.doc
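A minimal sketch of how the first problem can be thought about in parallel: split the numbers into chunks, sum each chunk independently, then add up the partial sums. The names `chunked` and `parallel_sum`, the chunk size, and the use of threads in place of cluster nodes are assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

def parallel_sum(values, chunk_size=100, workers=4):
    # "Map" step: each worker sums one chunk independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunked(values, chunk_size)))
    # "Reduce" step: combine the partial sums into a single total.
    return sum(partial_sums)

numbers = list(range(1, 1001))  # stand-in for the 1000 given numbers
total = parallel_sum(numbers)
print(total)  # 500500
```

The threads here only stand in for separate machines; the point is the shape of the computation, not the speed-up.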
MapReduce Architecture
MapReduce Example
Closer look at MapReduce Flow
The phases are:
• Mapper
• Shuffle & Sort
• Reducer
Mapper Phase
1. The mapper task is the first phase of processing:
it processes each input record and
generates intermediate key-value pairs.
2. The mapper stores its intermediate output on the
local disk.
3. The output of a mapper task is the full
collection of all these <key, value> pairs.
4. The intermediate output is sorted on the basis of keys.
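The mapper idea can be sketched for the flights-per-carrier problem as follows. The layout of flights-data-small.doc is not shown in these slides, so the comma-separated format, the `CARRIER_COLUMN` index, and the sample records are all assumptions.

```python
import csv

CARRIER_COLUMN = 1  # hypothetical position of the carrier code in each record

def flight_mapper(record):
    """Emit one (carrier, 1) pair for a single comma-separated flight record."""
    fields = next(csv.reader([record]))
    yield fields[CARRIER_COLUMN].strip(), 1

# Made-up sample records (date, carrier, flight number), not real data.
sample_records = [
    "2008-01-03,WN,335",
    "2008-01-03,AA,1024",
    "2008-01-04,WN,448",
]

intermediate = [pair for record in sample_records for pair in flight_mapper(record)]
print(intermediate)  # [('WN', 1), ('AA', 1), ('WN', 1)]
```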
Shuffle & Sort Phase
• Shuffling is the process that transfers the mappers'
intermediate output to the reducers.
• Data from the mappers is grouped by key, split
among the reducers, and sorted by key.
• Every reducer obtains all values associated with the
same key.
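A small sketch of what the shuffle and sort step does to the mapper output, assuming everything fits in memory on one machine (a real cluster moves this data over the network). The function name `shuffle_and_sort` and the sample pairs are illustrative only.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Group values by key across all mapper outputs, then order the keys."""
    grouped = defaultdict(list)
    for output in mapper_outputs:          # one list of pairs per mapper
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())         # reducers see keys in sorted order

mapper_1 = [("WN", 1), ("AA", 1)]
mapper_2 = [("WN", 1), ("DL", 1), ("AA", 1)]
reducer_input = shuffle_and_sort([mapper_1, mapper_2])
print(reducer_input)  # [('AA', [1, 1]), ('DL', [1]), ('WN', [1, 1])]
```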
Reduce Phase
• The reducer takes the set of intermediate key-value pairs produced
by the mappers as input and runs a reducer function on each key
and its list of values.
• Reducer functions may aggregate, filter, and combine
this data in a number of ways.
• Reducers run in parallel since they are independent of one
another. The user decides the number of reducers. By default,
the number of reducers is 1.
• This phase is optional, e.g. converting a given text to
uppercase needs only mappers.
• Note that the output of mapper tasks is not written to HDFS;
it is stored on the mapper's local disk.
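The reduce step can be sketched like this, using a sum as the aggregate. The helper names `sum_reducer` and `run_reducers` and the sample data are assumptions; the last lines show the map-only case mentioned above.

```python
def sum_reducer(key, values):
    """Aggregate: total of all values that share a key."""
    return key, sum(values)

def run_reducers(grouped_pairs, reducer):
    # Each key is reduced independently of the others, which is why
    # reducers can run in parallel on a real cluster.
    return [reducer(key, values) for key, values in grouped_pairs]

grouped = [("AA", [1, 1]), ("DL", [1]), ("WN", [1, 1, 1])]
flights_per_carrier = run_reducers(grouped, sum_reducer)
print(flights_per_carrier)  # [('AA', 2), ('DL', 1), ('WN', 3)]

# A map-only job needs no reducer at all, e.g. converting text to uppercase.
uppercased = [line.upper() for line in ["hello hadoop", "map only"]]
print(uppercased)  # ['HELLO HADOOP', 'MAP ONLY']
```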
Problems for Groups
• Count the number of flights for each carrier in our
flight data set – flights-data-small.doc
• Sales data – find the total sales by country
• Movie tags – find the movie that has the maximum number of tags
• Movie ratings – find the movie that has the maximum
average rating
MapReduce Cycle
Importance of MapReduce
• The Hadoop MapReduce framework handles all sorts of
complexity, relieving the programmer of that burden
• MapReduce is fairly difficult to code
• There are very few MapReduce programmers

Blocks
• A file is split into blocks for storage
• A file consists of records to be processed
• Logical records do not fit neatly into HDFS blocks
• Do you see a problem in processing a block?
Input Splits
Example
Data :
Ramesha 78
Kousalya 97
Block size - 9 bytes
Block1 - Ramesha 7
Block2 - 8Kousalya
Block3 - 97
Split1 - Ramesha 78
Split2 - Kousalya 97
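The block layout in this example can be reproduced with a short sketch. It assumes the two records are stored back to back with no separator and uses 9-byte blocks, since that is the size that yields the three block contents listed above. The helper name `split_into_blocks` is illustrative.

```python
def split_into_blocks(data, block_size):
    """Cut data into fixed-size blocks with no regard for record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# The two records stored back to back, as in the example (no separator byte).
data = "Ramesha 78" + "Kousalya 97"
blocks = split_into_blocks(data, block_size=9)
for number, block in enumerate(blocks, start=1):
    print(f"Block{number}: {block!r}")
# Block1: 'Ramesha 7'
# Block2: '8Kousalya'
# Block3: ' 97'
```

Neither record is complete inside a single block, which is exactly the situation input splits are meant to handle.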
Issue – Records are split across blocks
• If each map task processes all records in a data block, what
happens to those records that span block boundaries?
• HDFS has no conception of what’s inside the file blocks, so it
can’t gauge when a record might spill over into another
block.
• To solve this problem, Hadoop uses a logical representation
of the data stored in file blocks, known as input splits.
Input Splits
Who finds input splits? – The client application
How?
• It checks whether all records inside the block are
complete.
• If not, it captures the location information of the
next/previous block and the byte offset of the data needed
to complete the record.
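One way to picture the byte-offset idea is the simplified sketch below. It is not Hadoop's actual implementation. It assumes newline-terminated records, 9-byte blocks, and the rule that a split owns every record that begins inside its block, even if the record ends in the next block. The function names are hypothetical.

```python
def record_spans(data, delimiter="\n"):
    """Return (start, end) offsets of every delimiter-terminated record."""
    spans, start = [], 0
    while start < len(data):
        end = data.find(delimiter, start)
        end = len(data) if end == -1 else end + 1   # include the delimiter
        spans.append((start, end))
        start = end
    return spans

def input_splits(data, block_size):
    """One split per block: the whole records that begin inside that block."""
    spans = record_spans(data)
    splits = []
    for block_start in range(0, len(data), block_size):
        block_end = block_start + block_size
        owned = [(s, e) for s, e in spans if block_start <= s < block_end]
        if owned:
            # The split may end past block_end; the extra bytes are read
            # from the next block to complete the last record.
            splits.append((owned[0][0], owned[-1][1]))
    return splits

data = "Ramesha 78\nKousalya 97\n"
splits = input_splits(data, block_size=9)
for start, end in splits:
    print((start, end), repr(data[start:end]))
# (0, 11) 'Ramesha 78\n'
# (11, 23) 'Kousalya 97\n'
```

The third block contains no record start, so it produces no split of its own in this sketch.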
Block vs Input Split
Size
• Block: The default size of an HDFS block is 128 MB, which can be
configured as required. All blocks of a file are the same size
except the last block, which can be the same size or smaller.
• Input split: By default, the split size is approximately equal to
the block size. It is based on the size of the data in the
MapReduce program.
Data representation
• Block: It is the physical representation of the data. It contains
the minimum amount of data that can be read or written.
• Input split: It is the logical representation of the data present
in the block. It is used during data processing in a MapReduce
program.
MapReduce Processing Flow
Speculative Execution
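MapReduce breaks jobs into tasks, and these tasks run in parallel
rather than sequentially, which reduces the overall execution time.
This model of execution is sensitive to slow tasks, as they slow
down the overall execution of a job. Speculative execution addresses
this by running a duplicate of a slow task on another node and using
the result of whichever copy finishes first.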
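The effect of a slow task, and of launching a backup copy of it, can be illustrated with simple arithmetic. The task durations, the backup start time, and the function names below are made up for illustration and do not come from a real cluster.

```python
def job_time(task_times):
    """A job finishes only when its slowest task finishes."""
    return max(task_times)

def job_time_with_speculation(task_times, backup_start, backup_duration):
    """Launch a backup copy of the slowest task; the first copy to finish wins."""
    slowest = max(task_times)
    others = sorted(task_times)[:-1]
    straggler_finish = min(slowest, backup_start + backup_duration)
    return max(others + [straggler_finish])

task_times = [10, 11, 9, 40]   # made-up durations in seconds; 40 is the straggler
print(job_time(task_times))    # 40
print(job_time_with_speculation(task_times, backup_start=12, backup_duration=10))  # 22
```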
Data locality
• Data locality means moving the computation unit to the data
rather than the data to the computation unit.
• The data is processed by the Node Manager on the very
node where the data blocks are present.
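A toy sketch of a locality-aware placement decision: prefer a free node that already stores the block, and fall back to any free node otherwise. The cluster state, node names, and the `assign_task` helper are hypothetical. Real schedulers also consider rack locality and resource limits.

```python
def assign_task(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    local_candidates = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local_candidates:
        return local_candidates[0], "data-local"
    return sorted(free_nodes)[0], "remote"   # the block must be copied over the network

# Hypothetical cluster state: which nodes hold a replica of each block.
block_locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
free_nodes = {"node-b", "node-d"}

print(assign_task("blk_1", block_locations, free_nodes))  # ('node-b', 'data-local')
print(assign_task("blk_2", block_locations, free_nodes))  # ('node-b', 'remote')
```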
Components of a MapReduce Application
• Driver (mandatory)
• Mapper class (mandatory)
• Reducer class (optional)
• Combiner class (optional)
• Partitioner class (optional)
• Record reader and Record writer classes
(optional)
Record Reader
RecordWriter
RecordWriter is the class that takes each individual
key-value pair output by the reducer and writes it
to HDFS at the location prepared by the OutputFormat.
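A minimal sketch of the record-writing idea, assuming one tab-separated "key, value" line per pair and an in-memory stream in place of an HDFS file. The class `TextRecordWriter` is a hypothetical stand-in, not a Hadoop class.

```python
import io

class TextRecordWriter:
    """Hypothetical stand-in for a record writer: one 'key<TAB>value' line per pair."""

    def __init__(self, stream, separator="\t"):
        self.stream = stream          # stands in for the output file on HDFS
        self.separator = separator

    def write(self, key, value):
        self.stream.write(f"{key}{self.separator}{value}\n")

output = io.StringIO()
writer = TextRecordWriter(output)
for key, value in [("AA", 2), ("WN", 3)]:
    writer.write(key, value)
print(output.getvalue(), end="")
```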
Hadoop Combiner
A “mini-reducer” that summarizes the mapper output records
that share the same key before they are passed to the reducer.
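The combiner can be sketched as a local sum over one mapper's output, which shrinks the data sent through the shuffle. The function name `combine` and the sample pairs are assumptions for illustration.

```python
from collections import Counter

def combine(mapper_output):
    """Sum the values per key on the mapper's own node, before the shuffle."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())

mapper_output = [("WN", 1), ("AA", 1), ("WN", 1), ("WN", 1)]
combined = combine(mapper_output)
print(combined)  # [('AA', 1), ('WN', 3)]
```

Four pairs become two, and the reducer still reaches the same totals. This works for sums because addition gives the same result however the values are grouped; an average, for example, cannot be combined this way without also carrying the counts.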
Partitioner
Determines which partition a given (key, value) pair
goes to.
A hash function is applied to the key (or a subset of the
key) to find the partition.
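A hash partitioner can be sketched as "hash of the key, modulo the number of reducers". The sketch uses `zlib.crc32` as the hash only because it is stable between Python runs; that choice and the name `partition_for` are assumptions, not what Hadoop itself uses.

```python
import zlib

def partition_for(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers) with a stable hash."""
    # zlib.crc32 is used because Python's built-in hash() for str varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("WN", 1), ("AA", 1), ("WN", 1), ("DL", 1)]
partitions = {}
for key, value in pairs:
    partitions.setdefault(partition_for(key, 3), []).append((key, value))
print(partitions)
```

Because the partition depends only on the key, every pair with the same key lands at the same reducer.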
Steps in MapReduce Application Flow
1. Determine the exact data sets to process from the data blocks. This
involves calculating where the records are located.
2. Run the specified algorithm against each record in the data set until
all the records are processed. The individual instance of the application
running against a block of data in a data set is known as a mapper task.
3. Locally perform an interim reduction of the output of each mapper.
This phase is optional because, in some common cases, it isn’t
desirable.
4. Based on partitioning requirements, group the applicable partitions
of data from each mapper’s result sets.
5. Boil down the result sets from the mappers into a single result set:
the Reduce part of MapReduce.
Word Count Example
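A compact word count sketch that walks through the five application-flow steps in order, all in one process. The number of mappers and reducers, the way lines are dealt out to mappers, the `zlib.crc32` partitioning, and the sample sentences are assumptions chosen only to keep the example small.

```python
from collections import Counter, defaultdict
import zlib

def word_count(lines, num_mappers=2, num_reducers=2):
    # Step 1: decide which records each mapper processes (here: every n-th line).
    splits = [lines[i::num_mappers] for i in range(num_mappers)]

    partitions = defaultdict(lambda: defaultdict(list))
    for split in splits:
        # Step 2: the mapper emits (word, 1) for every word of every record.
        pairs = [(word.lower(), 1) for line in split for word in line.split()]
        # Step 3 (optional): the combiner pre-sums the counts of this mapper.
        combined = Counter()
        for word, one in pairs:
            combined[word] += one
        # Step 4: route each key to a reducer partition and group its values.
        for word, count in combined.items():
            reducer_id = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[reducer_id][word].append(count)

    # Step 5: each reducer sums the values of its keys; merge into one result set.
    result = {}
    for reducer_id in sorted(partitions):
        for word in sorted(partitions[reducer_id]):
            result[word] = sum(partitions[reducer_id][word])
    return result

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(lines)
print(sorted(counts.items()))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```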
Sales Data Example
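A sketch of total sales by country in the same map, shuffle, reduce shape. The record layout `order_id,country,amount` and the sample rows are made up, since the actual sales file is not shown in these slides.

```python
from collections import defaultdict

def sales_mapper(record):
    """Emit (country, amount) from one 'order_id,country,amount' record."""
    _order_id, country, amount = record.split(",")
    return country, float(amount)

def total_sales_by_country(records):
    grouped = defaultdict(list)
    for country, amount in map(sales_mapper, records):   # map, then group by key
        grouped[country].append(amount)
    # Reduce: sum the amounts collected for each country.
    return {country: sum(amounts) for country, amounts in sorted(grouped.items())}

# Made-up records in a hypothetical layout; the real sales file may differ.
records = ["1,India,120.50", "2,USA,80.00", "3,India,30.25", "4,UK,10.00"]
totals = total_sales_by_country(records)
print(totals)  # {'India': 150.75, 'UK': 10.0, 'USA': 80.0}
```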
Shuffle Phase
• To speed up the overall MapReduce process, data is
immediately moved to the reducer tasks’ nodes, to
avoid a flood of network activity when the final mapper
task finishes its work.
• A reduce task’s processing cannot begin until all mapper
tasks have finished.
• Speculative execution is used to avoid scenarios where the
performance of a MapReduce job is hampered by one
straggling mapper task that’s running on a poorly
performing slave node.
Hadoop 2.x YARN Benefits
• High Scalability
• High Availability
• Supports Multiple Programming Models
• Supports Multi-Tenancy
• Supports Multiple Namespaces
• Improved Cluster Utilization
• Supports Horizontal Scalability
