Map Reduce
As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
So, the first step is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
The output of a Mapper, or map job (key-value pairs), is the input to the Reducer. The reducer receives key-value pairs from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
A Word Count Example of MapReduce
Let us understand how MapReduce works by taking an example where we have a text file called sample.txt whose contents are as follows:
Dear Bear River
Car Car River
Deer Car Bear
Now, suppose we have to perform a word count on sample.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
The input is split and each split is handled by a separate mapper. Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example: Bear, [1,1]; Car, [1,1,1]; and so on.
Now, each Reducer counts the values present in its list of values. For example, the reducer for the key Bear receives the list of values [1,1]. It then counts the number of ones in that list and gives the final output: Bear, 2.
Finally, all the output key-value pairs are collected and written to the output file.
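To make the flow concrete, here is a minimal word-count sketch in Python that mirrors the map, shuffle/sort and reduce phases described above. It is a single-process simulation for illustration only, not the Hadoop implementation; the sample lines match the sample.txt example.

```python
# Minimal word-count sketch mirroring the map -> shuffle/sort -> reduce flow.
# Single-process simulation for illustration; not the Hadoop implementation.
from collections import defaultdict

lines = [
    "Dear Bear River",
    "Car Car River",
    "Deer Car Bear",
]

# Map phase: tokenize each line and emit (word, 1) pairs.
def mapper(line):
    for word in line.split():
        yield (word, 1)

mapped = [pair for line in lines for pair in mapper(line)]

# Shuffle/sort phase: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the list of ones for each key.
def reducer(word, counts):
    return (word, sum(counts))

for word in sorted(grouped):
    print(reducer(word, grouped[word]))  # e.g. ('Bear', 2), ('Car', 3), ...
```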
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. So, MapReduce is based on the Divide and Conquer paradigm, which helps us process the data on different machines in parallel.
2. Data Locality:
Instead of moving data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it there. But, as the data grew very large, bringing this huge amount of data to the processing unit posed the following issues:
Moving huge data to the processing unit is costly and deteriorates the network performance.
Processing takes time, as the data is processed by a single unit which becomes the bottleneck.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages:
The processing time is reduced as all the nodes are working with their part of the data in
parallel.
Every node gets a part of the data to process and therefore, there is no chance of a node
getting overburdened.
Apache Spark
Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.
Spark became a top-level project of the Apache Software Foundation in February 2014, and
version 1.0 of Apache Spark was released in May 2014. Spark version 2.0 was released in July
2016.
The technology was initially designed in 2009 by researchers at the University of California,
Berkeley as a way to speed up processing jobs in Hadoop systems.
Spark Core, the heart of the project, provides distributed task transmission, scheduling and I/O functionality. It gives programmers a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied. Spark's developers say it can run jobs 100 times faster than MapReduce when the data is processed in memory, and 10 times faster on disk.
Apache Spark can process data from a variety of data repositories, including the Hadoop
Distributed File System (HDFS), NoSQL databases and relational data stores, such as Apache
Hive. Spark supports in-memory processing to boost the performance of big data
analytics applications, but it can also perform conventional disk-based processing when data sets
are too large to fit into the available system memory.
The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users.
It aggregates data and partitions it across a server cluster, where it can then be computed and
either moved to a different data store or run through an analytic model. The user doesn't have to
define where specific files are sent or what computational resources are used to store or retrieve
files.
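As a concrete illustration of working with RDDs, here is a minimal PySpark sketch that parallelizes a small collection into an RDD, applies transformations and collects the result. The local master setting, application name and sample data are assumptions made for the example.

```python
# Minimal PySpark sketch: create an RDD, apply transformations, collect results.
# The local master, app name and sample data are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="rdd-sketch")

# Parallelize a small in-memory collection into an RDD partitioned across workers.
lines = sc.parallelize(["Dear Bear River", "Car Car River", "Deer Car Bear"])

# Word count expressed as lazy RDD transformations plus one action.
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # sum counts per word

print(counts.collect())  # the action triggers execution, e.g. [('Bear', 2), ...]
sc.stop()
```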
In addition, Spark can handle more than the batch processing applications that MapReduce is
limited to running.
Spark libraries
The Spark Core engine functions partly as an application programming interface (API) layer and
underpins a set of related tools for managing and analyzing data. Aside from the Spark Core
processing engine, the Apache Spark API environment comes packaged with some libraries of
code for use in data analytics applications. These libraries include the following:
Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language (a minimal usage sketch follows this list).
Spark Streaming -- This library enables users to build applications that analyze and present
data in real time.
MLlib -- A library of machine learning code that enables users to apply advanced statistical
operations to data in their Spark cluster and to build applications around these analyses.
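As an illustration of the Spark SQL library mentioned above, here is a minimal PySpark sketch that registers a small DataFrame as a temporary view and queries it with plain SQL. The sample rows, column names and view name are assumptions made for the example.

```python
# Minimal Spark SQL sketch: build a DataFrame and query it with plain SQL.
# The sample data, column names and view name are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-sql-sketch").getOrCreate()

# Create a small DataFrame from in-memory rows.
df = spark.createDataFrame(
    [("Bear", 2), ("Car", 3), ("River", 2)],
    ["word", "total"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("word_counts")

# Query the view exactly as one would query a table in a SQL database.
result = spark.sql("SELECT word, total FROM word_counts WHERE total >= 2 ORDER BY total DESC")
result.show()

spark.stop()
```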
Spark languages
Spark was written in Scala, which is considered the primary language for interacting with the
Spark Core engine. Out of the box, Spark also comes with API connectors for using Java and
Python. Java is not considered an optimal language for data engineering or data science, so many
users rely on Python, which is simpler and more geared toward data analysis.
There is also an R programming package that users can download and run in Spark. This enables
users to run the popular desktop data science language on larger distributed data sets in Spark
and to use it to build applications that leverage machine learning algorithms.
The wide range of Spark libraries and its ability to compute data from many different types of
data stores means Spark can be applied to many different problems in many industries. Digital
advertising companies use it to maintain databases of web activity and design campaigns tailored
to specific consumers. Financial companies use it to ingest financial data and run models to
guide investing activity. Consumer goods companies use it to aggregate customer data and
forecast trends to guide inventory decisions and spot new market opportunities.
Large enterprises that work with big data applications use Spark because of its speed and its
ability to tie together multiple types of databases and to run different kinds of analytics
applications. As of this writing, the Spark project has the largest open source community in big data, with over 1,000 contributors from over 250 organizations.
What is GraphLab?
GraphLab is a parallel framework for machine learning written in C++. It is an open source project and has been designed considering the scale, variety and complexity of real-world data. It incorporates various high-level algorithms such as Stochastic Gradient Descent (SGD), Gradient Descent and locking to deliver a high-performance experience. It helps data scientists and developers easily create and deploy applications at large scale.
But what makes it amazing? It is the presence of neat libraries for data transformation, manipulation and scalable machine learning, which include implementations for deep learning, factor machines, topic modeling, clustering and nearest neighbors. Let us look at the data structures of GraphLab:
SFrame: It is an efficient disk-based tabular data structure which is not limited by RAM. It helps to scale analysis and data processing to large data sets (terabytes in size), even on your laptop. It has a syntax similar to pandas or R data frames. Each column is an SArray, a series of elements stored on disk; this is what makes SFrames disk based.
SGraph: It is a scalable graph data structure which stores vertices and edges in SFrames. Each item is represented by a vertex in the graph, and relationships between items are represented by edges. A common illustration is a graph of James Bond characters, where each character is a vertex and each relationship is an edge (a minimal sketch of both structures follows below).
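To make the two data structures concrete, here is a minimal sketch using the GraphLab Create Python API (continued today as Turi Create). The James Bond characters and relationships are invented for illustration, and exact behavior may vary between package versions.

```python
# Minimal sketch of GraphLab Create's SFrame and SGraph data structures.
# The characters and relationships below are illustrative assumptions.
import graphlab as gl

# SFrame: a disk-backed, columnar table with a pandas-like feel.
vertices = gl.SFrame({
    "name": ["James Bond", "M", "Moneypenny", "Q"],
    "role": ["agent", "boss", "assistant", "quartermaster"],
})

edges = gl.SFrame({
    "src": ["James Bond", "James Bond", "James Bond"],
    "dst": ["M", "Moneypenny", "Q"],
    "relation": ["reports to", "works with", "is equipped by"],
})

# SGraph: vertices and edges are themselves stored in SFrames.
graph = gl.SGraph(
    vertices=vertices,
    edges=edges,
    vid_field="name",
    src_field="src",
    dst_field="dst",
)

print(graph.summary())  # e.g. {'num_vertices': 4, 'num_edges': 3}
```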
Integration with various data sources: GraphLab supports various data sources such as Amazon S3, CSV, JSON and HDFS.
Data exploration and visualization with GraphLab Canvas: GraphLab Canvas is a browser-based interactive GUI which allows you to explore tabular data, summary statistics and bi-variate plots. Using this feature, you spend less time coding for data exploration, which helps you focus more on understanding the relationships and distributions of variables.
Feature Engineering: GraphLab has an inbuilt option to create new useful features to enhance model performance.
Modeling: GraphLab allows you to perform various modeling exercises (regression, classification, clustering) in fewer lines of code. You can work on problems such as recommendation systems and churn prediction (a minimal recommender sketch follows this list).
Production automation: Data pipelines allow you to assemble reusable code tasks into jobs and then run them automatically on common execution environments (e.g. Amazon Web Services, Hadoop).
GraphLab Create SDK: Advanced users can extend the capabilities of GraphLab Create using the GraphLab Create SDK. You can define new machine learning models and programs and integrate them with the rest of the package; see the GraphLab Create SDK repository on GitHub.
License: GraphLab Create is not free to use without restriction. You can go for a 30-day free trial period or a one-year license for non-commercial use (the complete license structure is described on the GraphLab website).
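As a concrete example of the modeling point above, here is a minimal recommender sketch with GraphLab Create. The ratings table, column names and default model choice are assumptions made for illustration.

```python
# Minimal modeling sketch with GraphLab Create: a recommender in a few lines.
# The ratings data and column names are illustrative assumptions.
import graphlab as gl

ratings = gl.SFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "item_id": ["film_a", "film_b", "film_a", "film_c", "film_b"],
    "rating":  [5, 3, 4, 2, 5],
})

# Train a collaborative-filtering recommender on the ratings table.
model = gl.recommender.create(
    ratings,
    user_id="user_id",
    item_id="item_id",
    target="rating",
)

# Recommend the top 2 items for each user seen in training.
recommendations = model.recommend(k=2)
print(recommendations)
```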
Spark Streaming also benefits from native integration with Spark's advanced processing libraries (SQL, machine learning, graph processing). This unification of disparate data processing capabilities is a key reason behind Spark Streaming's rapid adoption: it makes it very easy for developers to use a single framework to satisfy all their processing needs.
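To show what that unification looks like in practice, here is a minimal PySpark Streaming sketch that reuses the same RDD-style operations (flatMap, map, reduceByKey) inside a streaming job. The socket source on localhost:9999 and the 5-second batch interval are assumptions for the example; a simple text server such as `nc -lk 9999` would feed it.

```python
# Minimal Spark Streaming sketch: the batch word-count logic reused on a live stream.
# The socket source and batch interval are illustrative assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(master="local[2]", appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # process the stream in 5-second batches

# Read lines from a plain text socket (e.g. fed by `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)

# The same transformations used on RDDs apply to each micro-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print the per-batch word counts to the console

ssc.start()
ssc.awaitTermination()
```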