Map Reduce

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

• MapReduce consists of two distinct tasks: Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
• The reducer receives key-value pairs from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using MapReduce, i.e. find the unique words and the number of occurrences of each unique word.


• First, we divide the input into three splits as shown in the figure. This distributes the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
• Now, a list of key-value pairs is created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process is the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each reducer counts the values present in its list. As shown in the figure, the reducer for the key Bear gets the list of values [1,1], counts the number of ones in that list, and gives the final output: Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file. A minimal code sketch of this flow is given below.
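
The same flow can be illustrated with a short, self-contained Python sketch. This is only a plain-Python simulation of the map, shuffle/sort and reduce phases on the example.txt contents, not actual Hadoop code:

from collections import defaultdict

# The three input splits from example.txt (one split per line, as assumed above)
lines = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

# Map phase: emit a (word, 1) pair for every token in every split
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle and sort phase: group all values that belong to the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the list of ones for each key
for key in sorted(grouped):
    print(key, sum(grouped[key]))   # e.g. Bear 2, Car 3, Dear 1, Deer 1, River 2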

Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines.

(Figure: Traditional Way vs. MapReduce Way)


2. Data Locality:

Instead of moving data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew very large, bringing this huge amount of data to the processing unit posed the following issues:

• Moving huge amounts of data to the processing unit is costly and degrades network performance.
• Processing takes time, as the data is processed by a single unit which becomes the bottleneck.
• The master node can get overburdened and may fail.

MapReduce overcomes these issues by bringing the processing unit to the data. As you can see in the figure above, the data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:

• It is very cost effective to move the processing unit to the data.
• The processing time is reduced, as all the nodes work on their part of the data in parallel.
• Every node gets only a part of the data to process, so there is no chance of a node getting overburdened.

What is Apache Spark?

Apache Spark is an open source parallel processing framework for running large-scale data
analytics applications across clustered computers. It can handle both batch and real-time
analytics and data processing workloads.

Spark became a top-level project of the Apache Software Foundation in February 2014, and version 1.0 of Apache Spark was released in May 2014. Spark version 2.0 was released in July 2016.

The technology was initially designed in 2009 by researchers at the University of California,
Berkeley as a way to speed up processing jobs in Hadoop systems.

Spark Core, the heart of the project, provides distributed task transmission, scheduling and I/O functionality, and gives programmers a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied. Spark's developers say it can run jobs 100 times faster than MapReduce when data is processed in memory, and 10 times faster on disk.

How Apache Spark works

Apache Spark can process data from a variety of data repositories, including the Hadoop
Distributed File System (HDFS), NoSQL databases and relational data stores, such as Apache
Hive. Spark supports in-memory processing to boost the performance of big data
analytics applications, but it can also perform conventional disk-based processing when data sets
are too large to fit into the available system memory.

The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users: it aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.

In addition, Spark can handle more than the batch processing applications that MapReduce is
limited to running.
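
As a rough illustration of the RDD API, here is a minimal PySpark sketch of the same word count used in the MapReduce section. It assumes a local PySpark installation; the file name example.txt refers to the sample file introduced earlier:

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (
    sc.textFile("example.txt")                # load the file as an RDD of lines
      .flatMap(lambda line: line.split())     # tokenize each line into words
      .map(lambda word: (word, 1))            # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)

print(counts.collect())                       # e.g. [('Bear', 2), ('Car', 3), ...]
sc.stop()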

Spark libraries

The Spark Core engine functions partly as an application programming interface (API) layer and
underpins a set of related tools for managing and analyzing data. Aside from the Spark Core
processing engine, the Apache Spark API environment comes packaged with some libraries of
code for use in data analytics applications. These libraries include the following:

• Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language (a short sketch follows this list).

• Spark Streaming -- This library enables users to build applications that analyze and present data in real time.

• MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.

• GraphX -- A built-in library of algorithms for graph-parallel computation.
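
As a brief, hedged example of the first library, the following PySpark sketch registers a DataFrame as a SQL view and queries it. The file people.json and its columns (name, age) are hypothetical and used only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.read.json("people.json")       # load a JSON file into a DataFrame
df.createOrReplaceTempView("people")      # expose it to SQL queries

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()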

Spark languages

Spark was written in Scala, which is considered the primary language for interacting with the
Spark Core engine. Out of the box, Spark also comes with API connectors for using Java and
Python. Java is not considered an optimal language for data engineering or data science, so many
users rely on Python, which is simpler and more geared toward data analysis.

There is also an R programming package that users can download and run in Spark. This enables
users to run the popular desktop data science language on larger distributed data sets in Spark
and to use it to build applications that leverage machine learning algorithms.

Apache Spark use cases

The wide range of Spark libraries and its ability to compute data from many different types of
data stores means Spark can be applied to many different problems in many industries. Digital
advertising companies use it to maintain databases of web activity and design campaigns tailored
to specific consumers. Financial companies use it to ingest financial data and run models to
guide investing activity. Consumer goods companies use it to aggregate customer data and
forecast trends to guide inventory decisions and spot new market opportunities.

Large enterprises that work with big data applications use Spark because of its speed and its ability to tie together multiple types of databases and run different kinds of analytics applications. As of this writing, Spark has the largest open source community in big data, with over 1,000 contributors from over 250 organizations.

What is GraphLab?

GraphLab is a new parallel framework for machine learning written in C++. It is an open source project and has been designed considering the scale, variety and complexity of real-world data. It incorporates various high-level algorithms such as Stochastic Gradient Descent (SGD), Gradient Descent and Locking to deliver a high-performance experience. It helps data scientists and developers easily create and install applications at large scale.

But what makes it amazing? It's the presence of neat libraries for data transformation, manipulation and model visualization. In addition, it comprises scalable machine learning toolkits which have (almost) everything required to improve machine learning models. The toolkits include implementations for deep learning, factorization machines, topic modeling, clustering, nearest neighbors and more.

(Figure: the complete architecture of GraphLab Create.)

What are the Benefits of using GraphLab?

There are multiple benefits of using GraphLab as described below:


• Handles large data: The data structures of GraphLab can handle large data sets, which results in scalable machine learning. Let's look at the data structures of GraphLab (a short usage sketch follows this list):

• SFrame: An efficient disk-based tabular data structure which is not limited by RAM. It helps scale analysis and data processing to large data sets (terabytes), even on your laptop. Its syntax is similar to pandas or R data frames. Each column is an SArray, a series of elements stored on disk, which is what makes SFrames disk-based. The methods to work with SFrames are discussed in the following sections.

• SGraph: A graph helps us understand networks by analyzing relationships between pairs of items. Each item is represented by a vertex in the graph, and relationships between items are represented by edges. To perform graph-oriented data analysis, GraphLab uses the SGraph object, a scalable graph data structure which stores vertices and edges in SFrames. To know more about this, please refer to this link. (Figure: a graph representation of James Bond characters.)

• Integration with various data sources: GraphLab supports various data sources like S3, ODBC, JSON, CSV, HDFS and many more.

• Data exploration and visualization with GraphLab Canvas: GraphLab Canvas is a browser-based interactive GUI which allows you to explore tabular data, summary statistics and bivariate plots. Using this feature, you spend less time coding for data exploration and can focus more on understanding the relationships and distributions of variables. This is discussed in the following sections.

• Feature engineering: GraphLab has built-in options to create new, useful features to enhance model performance. They include transformation, binning, imputation, one-hot encoding, tf-idf and more.

• Modeling: GraphLab has various toolkits to deliver easy and fast solutions for ML problems. It allows you to perform various modeling exercises (regression, classification, clustering) in fewer lines of code. You can work on problems like recommendation systems, churn prediction, sentiment analysis, image analysis and many more.

• Production automation: Data pipelines allow you to assemble reusable code tasks into jobs and then automatically run them on common execution environments (e.g. Amazon Web Services, Hadoop).

• GraphLab Create SDK: Advanced users can extend the capabilities of GraphLab Create using the GraphLab Create SDK. You can define new machine learning models/programs and integrate them with the rest of the package. See the GitHub repository here.

• License: It has usage limitations. You can go for a 30-day free trial or a one-year license for the academic edition. To extend your subscription you'll be charged (see the subscription structure here).
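
To make the SFrame and SGraph ideas above concrete, here is a hedged GraphLab Create sketch. It assumes the graphlab Python package is installed and licensed; the file bond.csv and its columns (character, film) are hypothetical and used only for illustration:

import graphlab

# SFrame: disk-backed tabular data loaded from a CSV file
sf = graphlab.SFrame.read_csv("bond.csv")
sf.show()                          # opens GraphLab Canvas for visual exploration

# SGraph: build a graph whose edges come from two columns of the SFrame
g = graphlab.SGraph()
g = g.add_edges(sf, src_field="character", dst_field="film")
print(g.summary())                 # reports the number of vertices and edges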

Apache Spark Streaming


Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports
both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows
data engineers and data scientists to process real-time data from various sources including (but not
limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems,
databases, and live dashboards. Its key abstraction is a Discretized Stream or, in short, a DStream, which
represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data
abstraction. This allows Spark Streaming to seamlessly integrate with any other Spark components like
MLlib and Spark SQL. Spark Streaming is different from other systems that either have a processing
engine designed only for streaming, or have similar batch and streaming APIs but compile internally to
different engines. Spark’s single execution engine and unified programming model for batch and
streaming lead to some unique benefits over other traditional streaming systems.
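
As a hedged illustration of the DStream API, the sketch below counts words arriving on a local TCP socket in 5-second micro-batches. It assumes a local PySpark installation; the host and port (localhost:9999, fed for example by running nc -lk 9999 in a terminal) are arbitrary choices for the example:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # DStream of incoming text lines
counts = (
    lines.flatMap(lambda line: line.split())       # tokenize each line
         .map(lambda word: (word, 1))              # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)          # sum counts within each batch
)
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()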

Four Major Aspects of Spark Streaming

• Fast recovery from failures and stragglers
• Better load balancing and resource usage
• Combining of streaming data with static datasets and interactive queries
• Native integration with advanced processing libraries (SQL, machine learning, graph processing)

This unification of disparate data processing capabilities is the key reason behind Spark Streaming’s
rapid adoption. It makes it very easy for developers to use a single framework to satisfy all their
processing needs.
