Data Analytics
NOTES ON HADOOP MAPREDUCE
HIMANSHU GUPTA
COLLEGE OF ENGINEERING ROORKEE
5/9/2022
DATA ANALYTICS (BCST-603)
NOTES UNIT-3 & 4
1. Distributes complex data: The Hadoop DistributedCache can distribute complex data like
jars and archives (a usage sketch follows this list). Archive files such as zip, tgz, and tar.gz are
un-archived at the worker nodes.
3. Consistent: Using a hashing algorithm, the cache engine always determines on which node
a particular key-value pair resides. Thus, the cache cluster is always in a single, consistent
state.
4. No Single Point of Failure: It runs as an independent unit across many nodes in the cluster,
so a failure in any one DataNode does not result in the failure of the entire cache.
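A minimal sketch of how a file and an archive might be added to the DistributedCache from the driver and read back in a mapper, using the org.apache.hadoop.mapreduce.Job API (the HDFS paths and class names here are only illustrative):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class CacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache example");
        // Illustrative HDFS paths; these files must already exist in HDFS.
        job.addCacheFile(new URI("/user/hadoop/lookup.txt"));
        job.addCacheArchive(new URI("/user/hadoop/libs.zip"));  // un-archived at the worker nodes
        // ... set mapper, reducer, input/output paths as usual ...
    }
    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws java.io.IOException, InterruptedException {
            // The cached files are available locally on every worker node running a task.
            URI[] cacheFiles = context.getCacheFiles();
        }
    }
}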
Disadvantages of DistributedCache
Object Serialisation:
Very slow – To inspect the type information at runtime, it uses reflection, which is very
slow compared to pre-compiled code.
Very bulky – Serialisation is very bulky, as it stores extra metadata such as the class name,
assembly, cluster details, and references to other instances held in member variables.
_____________________________________________________________________________
The Mapper task processes each input record (from the RecordReader) and generates key-value
pairs, and these key-value pairs are completely different from the input pair. The output of the
Mapper is also known as intermediate output and is written to the local disk. To compress mapper
output we should set conf.set("mapreduce.map.output.compress", true). Apart from setting this
property to enable compression for mapper output, we also need to consider some other factors,
such as which codec to use and what the compression type should be.
The following are the properties for configuring the same:-
mapred.map.output.compression.codec
mapred.output.compression.type
Of these two factors, the choice of the right codec is the more important one. As each codec has
some pros and cons, you need to figure out which suits your requirements. Generally, you would
want fast read/write, a good compression factor, and CPU-friendly decompression (since we have
a smaller number of reducers). Considering these factors, the Snappy codec is usually the best fit,
as it has fast read/write and a compression factor of about 3.
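As a sketch, map-output compression with the Snappy codec could be enabled in the driver as shown below (this uses the newer mapreduce.* property names that correspond to the mapred.* names listed above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
public class CompressedMapOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn on compression of the intermediate (map) output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Choose the codec used to compress the map output; Snappy gives fast read/write.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed map output example");
        // ... set mapper, reducer, input/output formats and paths as in any other driver ...
        job.waitForCompletion(true);
    }
}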
______________________________________________________________________________
The Mapper in Hadoop takes each record generated by the RecordReader as input, processes it,
and generates key-value pairs. These key-value pairs are completely different from the input pair.
The mapper output is known as intermediate output and is stored on the local disk. The Mapper
does not store its output on HDFS, as it is temporary data and storing it on HDFS would create
unnecessary multiple copies.
Before the mapper output is stored on the local disk, it is partitioned on the basis of the key and
then sorted. This partitioning ensures that all the values for each key are grouped together. The
Mapper in Hadoop only understands key-value pairs of data, so data should be converted into
key-value pairs before being passed to the mapper. Data is converted into key-value pairs by
InputSplit and RecordReader (a minimal mapper sketch is given after the formula below).
1) InputSplit- InputFormat generates the InputSplit, which is the logical representation of the data.
The MapReduce framework generates one map task for each InputSplit.
2) RecordReader- It communicates with the InputSplit and converts the data into key-value pairs.
The number of mappers depends on the total size of the input, i.e. the total number of blocks of
the input files:
Mappers = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB:
Mappers = (1000 * 1000 MB) / (100 MB) = 10,000
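As an illustration of how a mapper turns the (key, value) pair supplied by the RecordReader into a completely different intermediate pair, here is a minimal mapper sketch (the class name and the choice of output are only for illustration):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Input pair from the RecordReader: (byte offset of the line, the line of text).
// Intermediate output pair: (the line of text, its length) -- a completely different pair.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, new IntWritable(line.getLength()));
    }
}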
______________________________________________________________________________
MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a Hadoop cluster. It is the processing layer
of Hadoop.
MapReduce is a programming model designed for processing large volumes of data in parallel
by dividing the work into a set of chunks. We only need to write the business logic; the rest of the
work is taken care of by the framework. The problem is divided into a large number of smaller
problems, each of which is processed independently to produce an individual output. These
individual outputs are then combined to produce the final output.
There are two processes: one is the Mapper and the other is the Reducer.
1) The Mapper is used to process the input data. Input data is in the form of a file or directory
which resides in HDFS. The client needs to write the map-reduce program and submit the input
data. The input file is passed to the mapper line by line. The mapper processes the data and
produces output, which is called intermediate output. The output of the map is stored on the local
disk, from where it is shuffled to the reduce nodes. The number of maps is usually driven by the
total size of the inputs, that is, the total number of blocks of the input files.
2) The Reducer takes the intermediate key/value pairs produced by the Map. The Reducer has 3
primary phases: shuffle, sort and reduce.
Shuffle – Input to the reducer is the sorted output of the mappers. In this phase, the framework
fetches all the output of the mappers.
Sort – The framework groups the reducer input by keys.
Reduce – This is the second phase of processing, where the client's business logic is applied. The
output of the reducer is the final output, which is written to HDFS (a minimal reducer sketch follows).
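To make the shuffle, sort and reduce phases concrete, here is a minimal reducer sketch that sums the values grouped under each key (the class name and types are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// After the shuffle and sort phases, the framework hands the reducer one key together with all of its values.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                   // business logic: add up the values for this key
        }
        context.write(key, new IntWritable(sum)); // final output, written to HDFS
    }
}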
Need of MapReduce:
MapReduce is a data processing paradigm in itself. It was one of a kind in data processing and
has been transformative. With MapReduce we move the computation to the data, which is less
costly than moving the data to the computation.
Before the development of Hadoop MapReduce, processing huge volumes of data was very
difficult, as hundreds or thousands of processors (CPUs) were needed to handle such amounts of
data. Moreover, parallelization and distribution were also not practical with huge data sets.
MapReduce makes these things possible and easy; on top of that, it also provides I/O scheduling,
status reporting, and job monitoring.
MapReduce is a fault-tolerant programming model at the heart of the Hadoop ecosystem.
Because of all the above features, MapReduce has become a favourite of the industry. This is
also the reason that it is present in many Big Data frameworks.
To prove its importance, I would like to add an example given in the book Hadoop – The
Definitive Guide: Mailtrust, Rackspace's mail division, used Apache Hadoop for processing
email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In
their words:
“This dataset was very essential, so we schedule the Hadoop MapReduce program to run
periodically every month, and we will be using these insights to help us decide which data
centers to place new mail servers (and other resources) in as we grow.”
By bringing together thousands of gigabytes of data and providing the tools to analyze it, the
Rackspace team were able to gain insights from the data that they otherwise would never have
had, and they were able to use what they had learned to improve the service for their customers.
_____________________________________________________________________________
Rack awareness in HDFS:-
HDFS stores files across multiple nodes (DataNodes) in a cluster. To get the maximum
performance from Hadoop and to reduce the network traffic during file read/write, the NameNode
chooses DataNodes on the same rack or on nearby racks for data read/write. Rack awareness is
the concept of choosing the closer DataNode based on rack information.
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If
the network switch goes down, the whole rack becomes unavailable. A large Hadoop cluster is
deployed across multiple racks.
In a large Hadoop cluster, there are multiple racks. Each rack consists of DataNodes.
Communication between the DataNodes on the same rack is more efficient as compared to the
communication between DataNodes residing on different racks.
To reduce the network traffic during file read/write, the NameNode chooses the closest DataNode
for serving the client's read/write request. The NameNode maintains the rack id of each DataNode
to obtain this rack information. This concept of choosing the closest DataNode based on rack
information is known as Rack Awareness.
The reasons for Rack Awareness in Hadoop are:
To reduce the network traffic during file read/write, which improves the cluster
performance.
To achieve fault tolerance, even when a rack goes down (discussed later in these notes).
To achieve high availability of data, so that data is available even in unfavorable conditions.
To reduce latency, that is, to make file read/write operations complete with lower delay.
The NameNode uses a rack awareness algorithm while placing the replicas in HDFS.
On a multi-rack cluster, the NameNode maintains block replication by using built-in rack
awareness policies, which say:
Not more than one replica is placed on any one node.
Not more than two replicas are placed on the same rack.
The number of racks used for block replication should always be smaller than the
number of replicas.
For the common case where the replication factor is three, the block placement policy puts the
first replica on the local node (the node where the client/writer runs), the second replica on a
DataNode in a different rack, and the third replica on a different DataNode in the same rack as
the second replica.
Shuffling is the process of transferring the data from the mappers to the reducers. The output
data from the map tasks is sent as input to the reducers. This process is necessary; without it, the
reducers would not have any input.
In a cluster, it is always the NameNode that takes care of replication consistency. The fsck
command (for example, hdfs fsck / -files -blocks -locations) provides information about over-
and under-replicated blocks.
Under-replicated blocks:
These are the blocks that do not meet their target replication for the files they belong to. HDFS
will automatically create new replicas of under-replicated blocks until they meet the target
replication.
Consider a cluster with three nodes and replication set to three. If, at any point, one of the
DataNodes crashes, the blocks will be under-replicated. This means that a replication factor was
set, but there are not enough replicas to satisfy it. If the NameNode does not receive information
about the replicas, it will wait for a limited amount of time and then start re-replication of the
missing blocks from the available nodes.
Over-replicated blocks:
These are the blocks that exceed their target replication for the files they belong to. Usually,
over-replication is not a problem, and HDFS will automatically delete excess replicas.
Consider the case of three nodes running with a replication factor of three, where one of the
nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the
data, and then the failed node comes back with its set of blocks intact. This is an over-replication
situation, and the NameNode will delete a set of blocks from one of the nodes.
______________________________________________________________________________
The IdentityMapper is one of the pre-defined mapper classes and can be used with any key/value
pairs of data. It is a generic class and is also the default mapper class provided by Hadoop.
When no mapper class is specified in the MR driver class, the IdentityMapper class is invoked
automatically when a MapReduce job is run.
The ChainMapper is also a pre-defined mapper class; it allows using multiple mapper classes
within a single map task. The mappers are run in a chained fashion, that is, the output of the
first mapper becomes the input of the second mapper, and so on. The output of the last mapper
class is written to the intermediate files.
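A minimal sketch of chaining two mappers with the new-API ChainMapper is given below. The mapper classes UpperCaseMapper and LengthMapper are hypothetical, written only to show that the output key/value types of one mapper must match the input types of the next; a map-only job is used here for simplicity:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ChainExample {
    // First mapper: (offset, line) -> (upper-case line, original line)
    public static class UpperCaseMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString().toUpperCase()), value);
        }
    }
    // Second mapper: consumes the first mapper's output and emits (upper-case line, line length)
    public static class LengthMapper extends Mapper<Text, Text, Text, IntWritable> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new IntWritable(value.getLength()));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain mapper example");
        job.setJarByClass(ChainExample.class);
        // Output types of UpperCaseMapper (Text, Text) become the input types of LengthMapper.
        ChainMapper.addMapper(job, UpperCaseMapper.class, LongWritable.class, Text.class,
                Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, LengthMapper.class, Text.class, Text.class,
                Text.class, IntWritable.class, new Configuration(false));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);   // map-only chain: the last mapper's output is written out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}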
______________________________________________________________________________
Spilling in MapReduce:
The Mapper task processes each input record (from the RecordReader) and generates key-value
pairs. The Mapper does not store its output on HDFS, since this is temporary data and writing it
to HDFS would create unnecessary multiple copies. Instead, the Mapper writes its output into a
circular memory buffer (RAM). The size of this buffer is 100 MB by default, which we can
change by using the
mapreduce.task.io.sort.mb property.
Spilling is the process of copying the data from the memory buffer to disk. It takes place when
the content of the buffer reaches a certain threshold size. By default, a background thread starts
spilling the contents after 80% of the buffer has filled. Therefore, for a 100 MB buffer, the
spilling will start once the content of the buffer reaches a size of 80 MB.
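A sketch of tuning these two settings from the driver is shown below; the buffer-size value of 256 MB is only an example, and the property names used are the standard mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class SpillTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size (in MB) of the circular in-memory buffer that holds map output (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Fraction of the buffer at which a background thread starts spilling to disk (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        Job job = Job.getInstance(conf, "spill tuning example");
        // ... rest of the job configuration as usual ...
        job.waitForCompletion(true);
    }
}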
Data Integration:
Data integration is the practice of consolidating data from disparate sources into a single dataset
with the ultimate goal of providing users with consistent access and delivery of data across the
spectrum of subjects and structure types, and to meet the information needs of all applications
and business processes. The data integration process is one of the main components of the overall
data management process, and it is employed with increasing frequency as big data integration
and the need to share existing data continue to grow.
Data integration architects develop data integration software programs and data integration
platforms that facilitate an automated data integration process for connecting and routing data
from source systems to target systems. This can be achieved through a variety of data integration
techniques, including:
Extract, Transform and Load: copies of datasets from disparate sources are gathered
together, harmonized, and loaded into a data warehouse or database
Extract, Load and Transform: data is loaded as is into a big data system and transformed at a
later time for particular analytics uses
Change Data Capture: identifies data changes in databases in real-time and applies them to a data
warehouse or other repositories
Data Virtualization: data from different systems are virtually combined to create a unified view
rather than loading data into a new repository
Streaming Data Integration: a real time data integration method in which different streams of
data are continuously integrated and fed into analytics systems and data stores
Data integration techniques are available across a broad range of organizational levels, from fully
automated to manual methods. Typical tools and techniques for data integration include:
Manual Integration or Common User Interface: there is no unified view of the data; users
work with all the relevant information by accessing each of the source systems directly.
Application Based Integration: requires each application to implement all the integration
efforts; manageable with a small number of applications
Middleware Data Integration: transfers integration logic from an application to a new
middleware layer
Uniform Data Access: leaves data in the source systems and defines a set of views to provide a
unified view to users across the enterprise
Common Data Storage or Physical Data Integration: creates a new system in which a copy of
the data from the source system is stored and managed independently of the original system
Developers may use Structured Query Language (SQL) to code a data integration system by
hand. There are also data integration toolkits available from various IT vendors that streamline,
automate, and document the development process.
Importance of Data Integration
Enterprises that wish to remain competitive and relevant are embracing big data and all its
benefits and challenges. Data integration supports queries in these enormous datasets, benefiting
everything from business intelligence and customer data analytics to data enrichment and real
time information delivery.
One of the foremost use cases for data integration services and solutions is the management of
business and customer data. Enterprise data integration feeds integrated data into data
warehouses or virtual data integration architecture to support enterprise reporting, business
intelligence (BI data integration), and advanced analytics.
Customer data integration provides business managers and data analysts with a complete picture
of key performance indicators (KPIs), financial risks, customers, manufacturing and supply chain
operations, regulatory compliance efforts, and other aspects of business processes.
Big data analytics platforms require scalability and high performance, emphasizing the need for a
common data integration platform that supports profiling and data quality, and drives insights by
providing the user with the most complete and up-to-date view of their enterprise.
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the
job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps
us process the data using different machines. As the data is processed by multiple machines in
parallel instead of by a single machine, the time taken to process the data is reduced
tremendously.
2. Data Locality:
Instead of moving data to the processing unit, in the MapReduce framework we move the
processing unit to the data. In the traditional system, we used to bring data to the processing
unit and process it there. But as the data grew and became very large, bringing this huge amount
of data to the processing unit posed the following issues:
Moving huge data to processing is costly and deteriorates the network performance.
Processing takes time, as the data is processed by a single unit, which becomes the
bottleneck.
The master node can get over-burdened and may fail.
MapReduce overcomes these issues by bringing the processing unit to the data. The data is
distributed among multiple nodes, and each node processes the part of the data residing on it.
As a result, moving processing to the data is cost-effective, the work happens in parallel, and no
single node is over-burdened.
Modes of Hadoop
1. Standalone Mode – It is the default mode of configuration of Hadoop. It doesn't use HDFS;
instead, it uses the local file system for both input and output. It is useful for debugging and
testing.
2. Pseudo-Distributed Mode – It is also called a single-node cluster, where both the NameNode
and DataNode reside on the same machine. All the daemons run on the same machine in this
mode. It produces a fully functioning cluster on a single machine.
3. Fully Distributed Mode – Hadoop runs on multiple nodes wherein there are separate nodes
for master and slave daemons. The data is distributed among a cluster of machines providing a
production environment.
1. C:\>start-dfs.cmd
This command is used to start the HDFS daemons i.e. NameNode and DataNode.
2. C:\>start-yarn.cmd
This command is used to start the YARN daemons i.e. ResourceManager and
NodeManager.
3. C:\>start-all.cmd
This command is used to start all the hadoop daemons.
4. C:\>stop-all.cmd
This command is used to stop all the hadoop daemons.
5. C:\>hadoop fs -ls /
This command is used to list the files and directories in the HDFS root directory.
A generic class simply means that the items or functions in that class can be generalized with a
type parameter (for example, T), so that we can supply any type in place of T, such as Integer,
Character, String, Double, or any other user-defined type.
Example 1:-
package generic1;
// Generic class: T is a type parameter that can be replaced by Integer, String, etc.
class Test<T> {
    T obj;
    Test(T obj) { this.obj = obj; }
    public T getObject() { return this.obj; }
}
public class Generic1 {                                 // driver class name assumed
    public static void main(String[] args) {
        Test<Integer> ob = new Test<Integer>(15);       // sample value
        System.out.println(ob.getObject());
        Test<String> ob1 = new Test<String>("COER");    // sample value
        System.out.println(ob1.getObject());
    }
}
Example 2:-
package generic2;
// Generic class with two type parameters P and U (class and driver names assumed).
class Pair<P, U> {
    P obj1;
    U obj2;
    Pair(P obj1, U obj2) { this.obj1 = obj1; this.obj2 = obj2; }
    public void print() {
        System.out.println(obj1);
        System.out.println(obj2);
    }
}
public class Generic2 {
    public static void main(String[] args) {
        Pair<String, Integer> obj = new Pair<String, Integer>("Marks", 95);   // sample values
        obj.print();
    }
}
Programs on Map-Reduce
1. Java program for word-count
package wordcount2;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Wordcount2 {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(Wordcount2.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
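The driver above refers to inner classes named Map and Reduce that are not reproduced in the notes. A minimal sketch of what they would look like for word count is given below; it reuses the imports already listed above, and the two classes are meant to sit inside the Wordcount2 class:
// Mapper: for every word in an input line, emit the pair (word, 1).
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
// Reducer: sum all the 1s received for a word to get its total count.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}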
Input File
package mymaxmin;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
// The enclosing class and the mapper class declaration are missing from the notes; the names MyMaxMin and MaxTemperatureMapper below are assumed.
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
String date = line.substring(6, 14);
float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
if (temp_Max > 30.0) {
context.write(new Text("The Day is Hot Day :" + date),
new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 15) {
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends
Reducer<Text, Text, Text, Text> {
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {
// Iterable is required for the new-API reduce() to be invoked by the framework.
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
}
Input file
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Tempex {
// The main method and job configuration that belong here are missing from the notes; a sketch follows below.
}
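The body of the Tempex class is missing from the notes. A minimal sketch of the main method it presumably contained is given below, assuming it wires up the MaxTemperatureMapper and MaxTemperatureReducer from the mymaxmin example above (the enclosing class name MyMaxMin is itself an assumption, and Tempex must be able to see those classes):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "hot and cold days");
    job.setJarByClass(Tempex.class);
    // Mapper and Reducer from the mymaxmin example (enclosing class name assumed).
    job.setMapperClass(MyMaxMin.MaxTemperatureMapper.class);
    job.setReducerClass(MyMaxMin.MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}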
Input
Output:-