Chap 6 - MapReduce Programming
Thinking in Parallel
After understanding HDFS, the next question is:
“How can I analyze or query my data?”
Block1 - Ramesha 7
Block2 - 8Kousalya
Block3 - 97
Split1- Ramesha 78
Split2 - Kousalya 87
Issue – Records are split across blocks
• If each map task processes all records in a data block, what
happens to those records that span block boundaries?
How?
Does it actually check that all records inside the block are
complete?
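One common way to handle this, similar in spirit to how line-oriented record readers work, is sketched below. It assumes newline-terminated records; the function name read_split and its parameters are hypothetical, not Hadoop API. Every split except the first skips the (possibly partial) line it starts in, and every split keeps reading past its end until the last record it started is complete.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the complete records owned by the split [start, start + length).

    Hypothetical helper: `data` stands in for the whole file, which a real
    reader would access remotely instead of holding in memory.
    """
    end = min(start + length, len(data))
    pos = start
    if start > 0:
        # The line in progress at `start` belongs to the previous split,
        # so skip forward to just past the next newline.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Read every record that begins at or before `end`, even if it
    # finishes in the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records
```

With the file b"Ramesha 78\nKousalya 87\n" and a block size of 9 bytes, the first split yields the whole record "Ramesha 78" even though its last digit sits in the second block, and the second split yields "Kousalya 87". No record is lost or read twice.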
All blocks of the file are of the same size except the last block, which can
be of the same size or smaller.
Based on the size of data in MapReduce program.
2. Run the specified algorithm against each record in the data set until all
the records are processed. Each individual instance of the application
running against a block of data in the data set is known as a mapper task.
5. Boil down the result sets from the mappers into a single result set:
the Reduce part of MapReduce.
Word Count Example
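A minimal sketch of word count in plain Python, to show the map, shuffle, and reduce steps in one place. This is not the Hadoop API: the function names are hypothetical, and the "blocks" are just lists of lines that would each go to a separate mapper task on a real cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one record."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: boil the values for one key down to a single count."""
    return key, sum(values)

def word_count(blocks):
    pairs = []
    for block in blocks:          # one mapper task per block of data
        for line in block:        # run the mapper against each record
            pairs.extend(mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

For example, word_count([["the cat sat"], ["the dog sat down"]]) returns a count of 2 for "the" and "sat", and 1 for "cat", "dog", and "down".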
Sales Data Example
Shuffle Phase
• To speed up the overall MapReduce process, data is
immediately moved to the reducer tasks’ nodes, to
avoid a flood of network activity when the final mapper
task finishes its work.
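The sketch below illustrates the idea under simple assumptions: each mapper's output is routed to a reducer as soon as that mapper is done, and the target reducer is chosen by hashing the key, so all values for one key end up on the same reducer. The names partition and shuffle are hypothetical, and CRC32 is only a stand-in for whatever hash function the framework uses.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key (CRC32 used as a stable stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route (key, value) pairs to reducers, one finished mapper at a time."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:       # handled as each mapper finishes
        for key, value in output:
            target = partition(key, num_reducers)
            reducers[target][key].append(value)
    return reducers
```

Because the partition depends only on the key, the pairs ("the", 1) emitted by two different mappers land in the same reducer's list, ready to be summed.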
• High Scalability
• High Availability
• Supports Multiple Programming Models
• Supports Multi-Tenancy
• Supports Multiple Namespaces
• Improved Cluster Utilization
• Supports Horizontal Scalability