Data Analytics
NOTES ON HADOOP MAPREDUCE
HIMANSHU GUPTA
COLLEGE OF ENGINEERING ROORKEE
5/9/2022
DATA ANALYTICS (BCST-603)
NOTES UNIT-3 & 4
1. Distributes complex data: The Hadoop DistributedCache can distribute complex data like
jars and archives (a usage sketch follows this list). Archive files such as zip, tgz, and tar.gz are
un-archived at the worker nodes.
3. Consistent: Using a hashing algorithm, the cache engine always determines on which node
a particular key-value pair resides. Thus, the cache cluster is always in a single, consistent
state.
4. No Single Point of Failure: It runs as an independent unit across many nodes in the cluster,
so a failure in any one DataNode does not result in the failure of the entire cache.
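A minimal sketch of how a file and an archive might be added to the DistributedCache from the driver and read back in a mapper, using the org.apache.hadoop.mapreduce.Job API (the HDFS paths and class names here are only illustrative):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class CacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache example");
        // Illustrative HDFS paths; these files must already exist in HDFS.
        job.addCacheFile(new URI("/user/hadoop/lookup.txt"));
        job.addCacheArchive(new URI("/user/hadoop/libs.zip"));  // un-archived at the worker nodes
        // ... set mapper, reducer, input/output paths as usual ...
    }
    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws java.io.IOException, InterruptedException {
            // The cached files are available locally on every worker node running a task.
            URI[] cacheFiles = context.getCacheFiles();
        }
    }
}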
Disadvantages of DistributedCache
Object Serialisation:
Very slow – To inspect the type information at runtime, it uses reflection, which is very
slow compared to pre-compiled code.
Very bulky – Serialisation is very bulky, as it stores extra metadata such as the class name,
assembly, cluster details, and references to other instances held in member variables.
_____________________________________________________________________________
The Mapper task processes each input record (from the RecordReader) and generates key-value
pairs, and these key-value pairs are completely different from the input pair. The output of the
Mapper is also known as intermediate output and is written to the local disk. To compress mapper
output we should set conf.set("mapreduce.map.output.compress", true). Apart from setting this
property to enable compression for mapper output, we also need to consider some other factors,
such as which codec to use and what the compression type should be.
The following are the properties for configuring the same:-
mapred.map.output.compression.codec
mapred.output.compression.type
Of these two factors, the choice of the right codec is the more important one. As each codec has
some pros and cons, you need to figure out which suits your requirements. Generally, you would
want fast read/write, a good compression factor, and CPU-friendly decompression (since we have
a smaller number of reducers). Considering these factors, the Snappy codec is usually the best fit,
as it has fast read/write and a compression factor of about 3.
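As a sketch, map-output compression with the Snappy codec could be enabled in the driver as shown below (this uses the newer mapreduce.* property names that correspond to the mapred.* names listed above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
public class CompressedMapOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn on compression of the intermediate (map) output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Choose the codec used to compress the map output; Snappy gives fast read/write.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed map output example");
        // ... set mapper, reducer, input/output formats and paths as in any other driver ...
        job.waitForCompletion(true);
    }
}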
______________________________________________________________________________
The Mapper in Hadoop takes each record generated by the RecordReader as input, processes it,
and generates key-value pairs. These key-value pairs are completely different from the input pair.
The mapper output is known as intermediate output and is stored on the local disk. The Mapper
does not store its output on HDFS, as it is temporary data and storing it on HDFS would create
unnecessary multiple copies.
Before the mapper output is stored on the local disk, it is partitioned on the basis of the key and
then sorted. This partitioning ensures that all the values for each key are grouped together. The
Mapper in Hadoop only understands key-value pairs of data, so data should be converted into
key-value pairs before being passed to the mapper. Data is converted into key-value pairs by
InputSplit and RecordReader (a minimal mapper sketch is given after the formula below).
1) InputSplit- InputFormat generates the InputSplit, which is the logical representation of the data.
The MapReduce framework generates one map task for each InputSplit.
2) RecordReader- It communicates with the InputSplit and converts the data into key-value pairs.
The number of mappers depends on the total size of the input, i.e. the total number of blocks of
the input files:
Mappers = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB:
Mappers = (1000 * 1000 MB) / (100 MB) = 10,000
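As an illustration of how a mapper turns the (key, value) pair supplied by the RecordReader into a completely different intermediate pair, here is a minimal mapper sketch (the class name and the choice of output are only for illustration):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Input pair from the RecordReader: (byte offset of the line, the line of text).
// Intermediate output pair: (the line of text, its length) -- a completely different pair.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, new IntWritable(line.getLength()));
    }
}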
______________________________________________________________________________
MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a Hadoop cluster. It is the processing layer
of Hadoop.
MapReduce is a programming model designed for processing large volumes of data in parallel
by dividing the work into a set of chunks. We only need to write the business logic; the rest of the
work is taken care of by the framework. The problem is divided into a large number of smaller
problems, each of which is processed independently to produce an individual output. These
individual outputs are then combined to produce the final output.
There are two processes: one is the Mapper and the other is the Reducer.
1) The Mapper is used to process the input data. Input data is in the form of a file or directory
which resides in HDFS. The client needs to write the map-reduce program and submit the input
data. The input file is passed to the mapper line by line. The mapper processes the data and
produces output, which is called intermediate output. The output of the map is stored on the local
disk, from where it is shuffled to the reduce nodes. The number of maps is usually driven by the
total size of the inputs, that is, the total number of blocks of the input files.
2) The Reducer takes the intermediate key/value pairs produced by the Map. The Reducer has 3
primary phases: shuffle, sort and reduce.
Shuffle – Input to the reducer is the sorted output of the mappers. In this phase, the framework
fetches all the output of the mappers.
Sort – The framework groups the reducer input by keys.
Reduce – This is the second phase of processing, where the client's business logic is applied. The
output of the reducer is the final output, which is written to HDFS (a minimal reducer sketch follows).
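To make the shuffle, sort and reduce phases concrete, here is a minimal reducer sketch that sums the values grouped under each key (the class name and types are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// After the shuffle and sort phases, the framework hands the reducer one key together with all of its values.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                   // business logic: add up the values for this key
        }
        context.write(key, new IntWritable(sum)); // final output, written to HDFS
    }
}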
Need of MapReduce:
MapReduce is a data processing paradigm in itself. It was one of a kind in data processing and
has been transformative. With MapReduce we move the computation to the data, which is less
costly than moving the data to the computation.
Before the development of Hadoop MapReduce, processing huge volumes of data was very
difficult, as hundreds or thousands of processors (CPUs) were needed to handle such amounts of
data. Moreover, parallelization and distribution were also not practical with huge data sets.
MapReduce makes these things possible and easy; on top of that, it also provides I/O scheduling,
status reporting, and job monitoring.
MapReduce is a fault-tolerant programming model at the heart of the Hadoop ecosystem.
Because of all the above features, MapReduce has become a favourite of the industry. This is
also the reason that it is present in many Big Data frameworks.
To prove its importance, I would like to add an example given in the book Hadoop – The
Definitive Guide: Mailtrust, Rackspace's mail division, used Apache Hadoop for processing
email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In
their words:
“This dataset was very essential, so we schedule the Hadoop MapReduce program to run
periodically every month, and we will be using these insights to help us decide which data
centers to place new mail servers (and other resources) in as we grow.”
By bringing together thousands of gigabytes of data and providing the tools to analyze it, the
Rackspace team were able to gain insights from the data that they otherwise would never have
had, and they were able to use what they had learned to improve the service for their customers.
_____________________________________________________________________________
Rack awareness in HDFS:-
HDFS stores files across multiple nodes (DataNodes) in a cluster. To get the maximum
performance from Hadoop and to reduce the network traffic during file read/write, the NameNode
chooses DataNodes on the same rack or on nearby racks for data read/write. Rack awareness is
the concept of choosing the closer DataNode based on rack information.
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If
the network switch goes down, the whole rack becomes unavailable. A large Hadoop cluster is
deployed across multiple racks.
In a large Hadoop cluster, there are multiple racks. Each rack consists of DataNodes.
Communication between the DataNodes on the same rack is more efficient as compared to the
communication between DataNodes residing on different racks.
To reduce the network traffic during file read/write, the NameNode chooses the closest DataNode
for serving the client's read/write request. The NameNode maintains the rack id of each DataNode
to obtain this rack information. This concept of choosing the closest DataNode based on rack
information is known as Rack Awareness.
The reasons for Rack Awareness in Hadoop are:
To reduce the network traffic during file read/write, which improves the cluster
performance.
To achieve fault tolerance, even when a rack goes down (discussed later in these notes).
To achieve high availability of data, so that data is available even in unfavorable conditions.
To reduce latency, that is, to make file read/write operations complete with lower delay.
The NameNode uses a rack awareness algorithm while placing the replicas in HDFS.
On a multi-rack cluster, the NameNode maintains block replication by using built-in rack
awareness policies, which say:
Not more than one replica is placed on any one node.
Not more than two replicas are placed on the same rack.
The number of racks used for block replication should always be smaller than the
number of replicas.
For the common case where the replication factor is three, the block placement policy puts the
first replica on the local node (the node where the client/writer runs), the second replica on a
DataNode in a different rack, and the third replica on a different DataNode in the same rack as
the second replica.
Shuffling is the process of transferring the data from the mappers to the reducers. The output
data from the map tasks is sent as input to the reducers. This process is necessary; without it, the
reducers would not have any input.
In a cluster, it is always the NameNode that takes care of replication consistency. The fsck
command (for example, hdfs fsck / -files -blocks -locations) provides information about over-
and under-replicated blocks.
Under-replicated blocks:
These are the blocks that do not meet their target replication for the files they belong to. HDFS
will automatically create new replicas of under-replicated blocks until they meet the target
replication.
Consider a cluster with three nodes and replication set to three. If, at any point, one of the
DataNodes crashes, the blocks will be under-replicated. This means that a replication factor was
set, but there are not enough replicas to satisfy it. If the NameNode does not receive information
about the replicas, it will wait for a limited amount of time and then start re-replication of the
missing blocks from the available nodes.
Over-replicated blocks:
These are the blocks that exceed their target replication for the files they belong to. Usually,
over-replication is not a problem, and HDFS will automatically delete excess replicas.
Consider the case of three nodes running with a replication factor of three, where one of the
nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the
data, and then the failed node comes back with its set of blocks intact. This is an over-replication
situation, and the NameNode will delete a set of blocks from one of the nodes.
______________________________________________________________________________
The IdentityMapper is one of the pre-defined mapper classes and can be used with any key/value
pairs of data. It is a generic class and is also the default mapper class provided by Hadoop.
When no mapper class is specified in the MR driver class, the IdentityMapper class is invoked
automatically when a MapReduce job is run.
The ChainMapper is also a pre-defined mapper class; it allows using multiple mapper classes
within a single map task. The mappers are run in a chained fashion, that is, the output of the
first mapper becomes the input of the second mapper, and so on. The output of the last mapper
class is written to the intermediate files.
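A minimal sketch of chaining two mappers with the new-API ChainMapper is given below. The mapper classes UpperCaseMapper and LengthMapper are hypothetical, written only to show that the output key/value types of one mapper must match the input types of the next; a map-only job is used here for simplicity:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ChainExample {
    // First mapper: (offset, line) -> (upper-case line, original line)
    public static class UpperCaseMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString().toUpperCase()), value);
        }
    }
    // Second mapper: consumes the first mapper's output and emits (upper-case line, line length)
    public static class LengthMapper extends Mapper<Text, Text, Text, IntWritable> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new IntWritable(value.getLength()));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain mapper example");
        job.setJarByClass(ChainExample.class);
        // Output types of UpperCaseMapper (Text, Text) become the input types of LengthMapper.
        ChainMapper.addMapper(job, UpperCaseMapper.class, LongWritable.class, Text.class,
                Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, LengthMapper.class, Text.class, Text.class,
                Text.class, IntWritable.class, new Configuration(false));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);   // map-only chain: the last mapper's output is written out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}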
______________________________________________________________________________
Spilling in MapReduce:
The Mapper task processes each input record (from the RecordReader) and generates key-value
pairs. The Mapper does not store its output on HDFS, since this is temporary data and writing it
to HDFS would create unnecessary multiple copies. Instead, the Mapper writes its output into a
circular memory buffer (RAM). The size of this buffer is 100 MB by default, which we can
change by using the
mapreduce.task.io.sort.mb property.
Spilling is the process of copying the data from the memory buffer to disk. It takes place when
the content of the buffer reaches a certain threshold size. By default, a background thread starts
spilling the contents after 80% of the buffer has filled. Therefore, for a 100 MB buffer, the
spilling will start once the content of the buffer reaches a size of 80 MB.
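A sketch of tuning these two settings from the driver is shown below; the buffer-size value of 256 MB is only an example, and the property names used are the standard mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class SpillTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size (in MB) of the circular in-memory buffer that holds map output (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Fraction of the buffer at which a background thread starts spilling to disk (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        Job job = Job.getInstance(conf, "spill tuning example");
        // ... rest of the job configuration as usual ...
        job.waitForCompletion(true);
    }
}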
Data Integration:
Data integration is the practice of consolidating data from disparate sources into a single dataset
with the ultimate goal of providing users with consistent access and delivery of data across the
spectrum of subjects and structure types, and to meet the information needs of all applications
and business processes. The data integration process is one of the main components of the overall
data management process, and it is employed with increasing frequency as big data integration
and the need to share existing data continue to grow.
Data integration architects develop data integration software programs and data integration
platforms that facilitate an automated data integration process for connecting and routing data
from source systems to target systems. This can be achieved through a variety of data integration
techniques, including:
Extract, Transform and Load: copies of datasets from disparate sources are gathered
together, harmonized, and loaded into a data warehouse or database
Extract, Load and Transform: data is loaded as is into a big data system and transformed at a
later time for particular analytics uses
Change Data Capture: identifies data changes in databases in real-time and applies them to a data
warehouse or other repositories
Data Virtualization: data from different systems are virtually combined to create a unified view
rather than loading data into a new repository
Streaming Data Integration: a real time data integration method in which different streams of
data are continuously integrated and fed into analytics systems and data stores
Data integration techniques are available across a broad range of organizational levels, from fully
automated to manual methods. Typical tools and techniques for data integration include:
Manual Integration or Common User Interface: there is no unified view of the data; users
work with all the relevant information by accessing each of the source systems directly.
Application Based Integration: requires each application to implement all the integration
efforts; manageable with a small number of applications
Middleware Data Integration: transfers integration logic from an application to a new
middleware layer
Uniform Data Access: leaves data in the source systems and defines a set of views to provide a
unified view to users across the enterprise
Common Data Storage or Physical Data Integration: creates a new system in which a copy of
the data from the source system is stored and managed independently of the original system
Developers may use Structured Query Language (SQL) to code a data integration system by
hand. There are also data integration toolkits available from various IT vendors that streamline,
automate, and document the development process.
Importance of Data Integration
Enterprises that wish to remain competitive and relevant are embracing big data and all its
benefits and challenges. Data integration supports queries in these enormous datasets, benefiting
everything from business intelligence and customer data analytics to data enrichment and real
time information delivery.
One of the foremost use cases for data integration services and solutions is the management of
business and customer data. Enterprise data integration feeds integrated data into data
warehouses or virtual data integration architecture to support enterprise reporting, business
intelligence (BI data integration), and advanced analytics.
Customer data integration provides business managers and data analysts with a complete picture
of key performance indicators (KPIs), financial risks, customers, manufacturing and supply chain
operations, regulatory compliance efforts, and other aspects of business processes.
Big data analytics platforms require scalability and high performance, emphasizing the need for a
common data integration platform that supports profiling and data quality, and drives insights by
providing the user with the most complete and up-to-date view of their enterprise.
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the
job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps
us process the data using different machines. As the data is processed by multiple machines in
parallel instead of by a single machine, the time taken to process the data is reduced
tremendously.
2. Data Locality:
Instead of moving data to the processing unit, in the MapReduce framework we move the
processing unit to the data. In the traditional system, we used to bring data to the processing
unit and process it there. But as the data grew and became very large, bringing this huge amount
of data to the processing unit posed the following issues:
Moving huge data to processing is costly and deteriorates the network performance.
Processing takes time, as the data is processed by a single unit, which becomes the
bottleneck.
The master node can get over-burdened and may fail.
MapReduce overcomes these issues by bringing the processing unit to the data. The data is
distributed among multiple nodes, and each node processes the part of the data residing on it.
As a result, moving processing to the data is cost-effective, the work happens in parallel, and no
single node is over-burdened.
Modes of Hadoop
1. Standalone Mode – It is the default mode of configuration of Hadoop. It doesn't use HDFS;
instead, it uses the local file system for both input and output. It is useful for debugging and
testing.
2. Pseudo-Distributed Mode – It is also called a single-node cluster, where both the NameNode
and DataNode reside on the same machine. All the daemons run on the same machine in this
mode. It produces a fully functioning cluster on a single machine.
3. Fully Distributed Mode – Hadoop runs on multiple nodes wherein there are separate nodes
for master and slave daemons. The data is distributed among a cluster of machines providing a
production environment.
1. C:\>start-dfs.cmd
This command is used to start the HDFS daemons i.e. NameNode and DataNode.
2. C:\>start-yarn.cmd
This command is used to start the YARN daemons i.e. ResourceManager and
NodeManager.
3. C:\>start-all.cmd
This command is used to start all the hadoop daemons.
4. C:\>stop-all.cmd
This command is used to stop all the hadoop daemons.
5. C:\>hadoop fs -ls /
This command is used to list the files and directories in the HDFS root directory.
A generic class simply means that the items or functions in that class can be generalized with a
type parameter (for example, T), so that we can supply any type in place of T, such as Integer,
Character, String, Double, or any other user-defined type.
Example 1:-
package generic1;
// Generic class: T is a type parameter that can be replaced by Integer, String, etc.
class Test<T> {
    T obj;
    Test(T obj) { this.obj = obj; }
    public T getObject() { return this.obj; }
}
public class Generic1 {                                 // driver class name assumed
    public static void main(String[] args) {
        Test<Integer> ob = new Test<Integer>(15);       // sample value
        System.out.println(ob.getObject());
        Test<String> ob1 = new Test<String>("COER");    // sample value
        System.out.println(ob1.getObject());
    }
}
Example 2:-
package generic2;
// Generic class with two type parameters P and U (class and driver names assumed).
class Pair<P, U> {
    P obj1;
    U obj2;
    Pair(P obj1, U obj2) { this.obj1 = obj1; this.obj2 = obj2; }
    public void print() {
        System.out.println(obj1);
        System.out.println(obj2);
    }
}
public class Generic2 {
    public static void main(String[] args) {
        Pair<String, Integer> obj = new Pair<String, Integer>("Marks", 95);   // sample values
        obj.print();
    }
}
Programs on Map-Reduce
1. Java program for word-count
package wordcount2;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Wordcount2 {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(Wordcount2.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
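The driver above refers to inner classes named Map and Reduce that are not reproduced in the notes. A minimal sketch of what they would look like for word count is given below; it reuses the imports already listed above, and the two classes are meant to sit inside the Wordcount2 class:
// Mapper: for every word in an input line, emit the pair (word, 1).
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
// Reducer: sum all the 1s received for a word to get its total count.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}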
Input File
package mymaxmin;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
// The enclosing class and the mapper class declaration are missing from the notes; the names MyMaxMin and MaxTemperatureMapper below are assumed.
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
String date = line.substring(6, 14);
float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
if (temp_Max > 30.0) {
context.write(new Text("The Day is Hot Day :" + date),
new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 15) {
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends
Reducer<Text, Text, Text, Text> {
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {
// Iterable is required for the new-API reduce() to be invoked by the framework.
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
}
Input file
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Tempex {
// The main method and job configuration that belong here are missing from the notes; a sketch follows below.
}
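The body of the Tempex class is missing from the notes. A minimal sketch of the main method it presumably contained is given below, assuming it wires up the MaxTemperatureMapper and MaxTemperatureReducer from the mymaxmin example above (the enclosing class name MyMaxMin is itself an assumption, and Tempex must be able to see those classes):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "hot and cold days");
    job.setJarByClass(Tempex.class);
    // Mapper and Reducer from the mymaxmin example (enclosing class name assumed).
    job.setMapperClass(MyMaxMin.MaxTemperatureMapper.class);
    job.setReducerClass(MyMaxMin.MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}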
Input
Output:-