F-IoT Unit-4


ANURAG COLLEGE OF ENGINEERING

(Approved by AICTE, New Delhi & Affiliated to JNTU-HYD)


Aushapur (V), Ghatkesar (M), Medchal (Dist.), Telangana-501 301.

FUNDAMENTALS OF INTERNET OF THINGS


B.Tech - III YEAR II SEM - CSE

Presented By:

G. Kiran Kumari
Assistant Professor
Department of ECE
Anurag College of Engineering
Syllabus

UNIT – IV
Introduction to Software Defined Network
(SDN), SDN for IoT, Data Handling and
Analytics.



Conventional Network Architecture

Conventional Network Architecture
• Fig. shows the conventional network architecture built
with specialized hardware like switches, routers, etc.
• Network devices in conventional network architectures
are getting complex with the increasing number of
distributed protocols being implemented and the use of
proprietary hardware and interfaces.
• In the conventional network architecture the control plane
and data plane are coupled.
• Control plane is the part of the network that carries the
signaling and routing message traffic.
• Data plane is the part of the network that carries the
payload data traffic.
Limitations of Conventional Network Architecture
A) Complex Network Devices:
1. Conventional networks are getting increasingly complex
with more and more protocols being implemented to improve
link speeds and reliability.
2. Interoperability is limited due to the lack of standard and
open interfaces.
3. Network devices use proprietary hardware and software and
have slow product life-cycles limiting innovation.
4. The conventional networks were well suited for static traffic
patterns and had a large number of protocols designed for
specific applications.
5. For IoT applications which are deployed in cloud computing
environments, the traffic patterns are more dynamic.
6. Due to the complexity of conventional network devices,
making changes in the networks to meet the dynamic traffic
patterns has become increasingly difficult.
Limitations of Conventional Network Architecture
B) Management Overhead:
1. Conventional networks involve significant management
overhead.
2. Network managers find it increasingly difficult to
manage multiple network devices and interfaces from
multiple vendors.
3. Upgrading the network requires configuration changes in multiple devices such as switches, routers, firewalls, etc.

Limitations of Conventional Network Architecture
C) Limited Scalability:
1. The virtualization technologies used in cloud computing environments have increased the number of virtual hosts requiring network access.
2. IoT applications hosted on cloud are distributed across
multiple virtual machines that require exchange of traffic.
3. The analytics components of IoT applications run distributed
algorithms on a large number of virtual machines that require
huge amounts of data exchange between virtual machines.
4. Such computing environments require highly scalable and
easy to manage network architectures with minimal manual
configurations, which is becoming increasingly difficult with
conventional networks.

Software-Defined Networking

• Software-Defined Networking (SDN) is a networking architecture that separates the control plane from the data plane and centralizes the network controller.
• SDN attempts to create network architectures that are
simpler, inexpensive, scalable, agile and easy to manage.
• Fig 1. shows the SDN architecture and Fig 2. shows the
SDN layers in which the control and data planes are
decoupled and the network controller is centralized.

SDN Architecture

Fig 1. SDN Architecture


SDN Layers

Fig 2. SDN Layers


• Software-based SDN controllers maintain a unified view
of the network and make configuration, management and
provisioning simpler.
• The underlying infrastructure in SDN uses simple packet
forwarding hardware as opposed to specialized hardware in
conventional networks.
• The underlying network infrastructure is abstracted from the
applications.
• Network devices become simple with SDN as they do not
require implementation of a large number of protocols.
• Network devices receive instructions from the SDN controller on
how to forward the packets.
• These devices can be simpler and cost less as they can be built
from standard hardware and software components.
Key Elements of SDN

• Key elements of SDN are:


1. Centralized Network Controller
2. Programmable Open APIs
3. Standard Communication Interface (OpenFlow)

• Centralized Network Controller:

✓ With decoupled control and data planes and a centralized network controller, network administrators can rapidly configure the network.
✓ SDN applications can be deployed through
programmable open APIs.
✓ This speeds up innovation, as the network administrators no longer need to wait for the device vendors to embed new features in their proprietary hardware.

• Programmable Open APIs:

✓ The SDN architecture supports programmable open APIs for the interface between the SDN application and control layers (the Northbound interface).
✓ With these open APIs, various network services can be implemented, such as routing, quality of service (QoS), access control, etc.

• Standard Communication Interface (OpenFlow):
✓ The SDN architecture uses a standard communication interface between the control and infrastructure layers (the Southbound interface).
✓ OpenFlow, which is defined by the Open Networking Foundation (ONF), is the broadly accepted SDN protocol for the Southbound interface.
✓ With OpenFlow, the forwarding plane of the network devices can be directly accessed and manipulated.
✓ OpenFlow uses the concept of flows to identify network traffic based on pre-defined match rules.
✓ Flows can be programmed statically or dynamically by the SDN control software.
• Components of OpenFlow Switch:

• Components of OpenFlow Switch: (continued)

✓ Fig. shows the components of an OpenFlow switch, comprising one or more flow tables and a group table, which perform packet lookups and forwarding, and an OpenFlow channel to an external controller.
✓ OpenFlow protocol is implemented on both sides of the
interface between the controller and network devices.
✓ The controller manages the switch via the OpenFlow
switch protocol.
✓ The controller can add, update and delete flow entries in
flow tables.

• OpenFlow Table:

• OpenFlow Table: (continued)

✓ Fig. shows an example of an OpenFlow flow table.


✓ Each flow table contains a set of flow entries.
✓ Each flow entry consists of match fields, counters and a set of instructions to apply to matching packets.
✓ Matching starts at the first flow table and may continue to
additional flow tables of the pipeline.
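To make the matching logic concrete, the following minimal Python sketch models a flow table as an ordered list of entries with match fields, a counter and an action; it is a conceptual illustration only and does not reproduce the actual OpenFlow data structures or wire protocol.

```python
# Conceptual sketch of OpenFlow-style flow matching (not the real OpenFlow structures).

flow_table = [
    # Entries are examined in order; counters track how many packets matched.
    {"match": {"eth_type": 0x0800, "ipv4_dst": "10.0.0.5"}, "action": "output:2", "packets": 0},
    {"match": {"eth_type": 0x0806}, "action": "flood", "packets": 0},
    {"match": {}, "action": "send_to_controller", "packets": 0},   # table-miss entry
]

def lookup(packet):
    """Return the action of the first entry whose match fields all appear in the packet."""
    for entry in flow_table:
        if all(packet.get(field) == value for field, value in entry["match"].items()):
            entry["packets"] += 1
            return entry["action"]
    return "drop"

packet = {"eth_type": 0x0800, "ipv4_src": "10.0.0.9", "ipv4_dst": "10.0.0.5"}
print(lookup(packet))   # -> output:2
```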

Data Handling and Analytics



Why are Data Handling and Analytics required in IoT?
• IoT systems consist of devices such as sensors and actuators, and different communication technologies and devices such as Wi-Fi, mobile devices, 3G and 4G, etc.
• All these devices are huge producers of data.
• So, IoT is heavily data intensive; a lot of data is produced in an IoT implementation.
• The data that is produced has to be
✓ properly handled
✓ analyzed
to make sense of the data, so that things can be made much more efficient.
Data Handling
• Data Handling
✓ ensures that the data is stored properly, archived or disposed of in a safe and secure manner during and after the conclusion of a project.
✓ Includes the development of policies and procedures to
manage data that is handled electronically as well as
through non-electronic means.
• In IoT, most of the data has features which are analogous to the features of Big Data.
• Big data in IoT arises because a huge amount of data is generated by the different sensors and the other IoT devices.
• This data is big in size, and big streams of data continuously flow through, or are generated within, the network.
Data Handling continued
• E.g., a camera streams in a lot of data. The data that is collected is huge (i.e., collected over days, months and years).
• When planning or designing an IoT system, this data has to be
✓ stored
✓ analyzed
✓ disposed of (if not required).

Big Data Analytics
• Big data is defined as collections of data sets whose
volume, velocity (in terms of its temporal variation) or
variety, is so large that it is difficult to store, manage,
process and analyze the data using traditional
databases and data processing tools.
• Big data analytics involves several steps starting from
data cleansing, data munging (or wrangling), data
processing and visualization.

• Examples of big data generated by IoT systems are:
✓ Sensor data generated by IoT systems such as
weather monitoring stations.
✓ Machine sensor data collected from sensors
embedded in industrial and energy systems for
monitoring their health and detecting failures.
✓ Health and fitness data generated by IoT devices
such as wearable fitness bands.
✓ Data generated by IoT systems for location and
tracking of vehicles.
✓ Data generated by retail inventory monitoring
systems.
• Characteristics of Big data:
✓ Volume
✓ Velocity
✓ Variety

• Characteristics of Big data: continued
Volume:
• Though there is no fixed threshold for the volume of data
to be considered as big data, typically the term big data is
used for massive scale data that is difficult to store,
manage and process using traditional databases and data
processing architectures.
• The volumes of data generated by modern IT, industrial and health care systems, for example, are growing exponentially, driven by the lowering costs of data storage and processing architectures and the need to extract valuable insights from the data to improve business processes, efficiency and service to customers.
• Characteristics of Big data: continued
Velocity:
• Velocity is another important characteristic of big data
and the primary reason for exponential growth of data.
• Velocity of data refers to how fast the data is generated
and how frequently it varies.
• Modern IT, industrial and other systems are generating
data at increasingly higher speeds.
Variety:
• Variety refers to the forms of the data.
• Big data comes in different forms such as structured or
unstructured data, including text data, image, audio,
video and sensor data.
Data Analytics for IoT
• The volume, velocity and variety of data generated by data-intensive IoT systems are so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools.
• Analysis of data can be done with aggregation methods
such as computing mean, maximum, minimum,
counts, etc. or using machine learning methods such as
clustering and classification.
• Clustering is used to group similar data items together such that data items which are more similar to each other (with respect to some similarity criteria) than other data items are put in one cluster.
• Classification is used for categorizing objects into predefined categories.
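As a simple illustration of the aggregation approach, the short Python sketch below computes count, mean, minimum and maximum over a list of temperature readings; the readings are made-up sample values. A clustering method such as k-means would instead group multi-dimensional readings by similarity, and a classifier would assign each reading to one of a set of predefined categories.

```python
# Minimal sketch of aggregation-style analysis over sensor readings (sample values).
from statistics import mean

temperature_readings = [24.1, 24.3, 25.0, 26.7, 24.9, 25.4]   # assumed sample data

summary = {
    "count": len(temperature_readings),
    "mean": round(mean(temperature_readings), 2),
    "min": min(temperature_readings),
    "max": max(temperature_readings),
}
print(summary)
```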
Frameworks for Data Analysis
• Various frameworks for data analysis are:
✓ Apache Hadoop
✓ Apache Oozie
✓ Apache Spark
✓ Apache Storm

Apache Hadoop
• Apache Hadoop is an open source framework for
distributed batch processing of big data.
• MapReduce is a parallel programming model suitable for analysis of big data.
• MapReduce algorithms allow large scale computations to
be parallelized across a large cluster of servers.

MapReduce Programming Model:
• MapReduce is a widely used parallel data processing model for
processing and analysis of massive scale data.
• MapReduce model has two phases.
✓Map
✓Reduce
• MapReduce programs are written in a functional programming style to
create Map and Reduce functions.
• The input data to the map and reduce phases is in the form of key-value
pairs.
• Run-time systems for MapReduce are typically large clusters built of
commodity hardware.
• The MapReduce run-time systems take care of tasks such as partitioning
the data, scheduling of jobs and communication between nodes in the
cluster.
• This makes it easier for programmers to analyze massive scale data
without worrying about tasks such as data partitioning and scheduling.
• The above fig. shows the flow of data for a
MapReduce job.
• MapReduce programs take a set of input key-value pairs and
produce a set of output key-value pairs.
• In the Map phase, data is read from a distributed file system,
partitioned among a set of computing nodes in the cluster, and
sent to the nodes as a set of key-value pairs.
• The Map tasks process the input records independently of
each other and produce intermediate results as key-value
pairs.
• The intermediate results are stored on the local disk of the
node running the Map task.
• When all the Map tasks are completed, the Reduce Phase
begins in which the intermediate data with the same key is
aggregated.
• An optional Combine task can be used to perform data aggregation on the intermediate data of the same key for the output of the mapper before transferring the output to the Reduce task.
• MapReduce programs take advantage of locality of data and the
data processing takes place on the nodes where the data resides.
• In traditional approaches for data analysis, data is moved to the
compute nodes which results in delay in data transmission
between the nodes in a cluster.
• MapReduce programming model moves the computation to
where the data resides thus decreasing the transmission of data
and improving efficiency.
• MapReduce programming model is well suited for parallel
processing of massive scale data in which the data analysis tasks
can be accomplished by independent map and reduce
operations.
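The sketch below simulates the MapReduce key-value flow in plain Python on a toy word-count problem, showing how independent map outputs are grouped by key before being reduced; it illustrates the programming model only and does not involve Hadoop's distributed runtime.

```python
# Plain-Python simulation of the MapReduce model (toy word count, single process).
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in the input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: aggregate all values emitted for the same key.
    return word, sum(counts)

records = ["iot sensor data", "sensor data analytics", "iot analytics"]

# Map each record independently.
intermediate = [pair for record in records for pair in map_fn(record)]

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce each key group.
print(dict(reduce_fn(key, values) for key, values in groups.items()))
# {'iot': 2, 'sensor': 2, 'data': 2, 'analytics': 2}
```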
Hadoop MapReduce Job Execution:

• The above fig. shows the components of a Hadoop
cluster.
• A Hadoop cluster comprises a Master node, a backup node and a number of slave nodes.
• The master node runs the NameNode and JobTracker
processes and the slave nodes run the DataNode and
TaskTracker components of Hadoop.
• The backup node runs the Secondary NameNode process.

• The functions of the key processes of Hadoop are:
NameNode:
• NameNode keeps the directory tree of all files in the file
system, and tracks where across the cluster the file data is
kept.
• It does not store the data of these files itself.
• Client applications talk to the NameNode whenever they wish
to locate a file or when they want to add/copy/move/delete a
file.
• The NameNode responds to the successful requests by
returning a list of relevant DataNode servers where the data
lives.
• NameNode serves as both directory namespace manager and
‘inode table’ for the Hadoop DFS (Distributed File System).
• There is a single NameNode running in any DFS deployment.
Secondary NameNode:
• HDFS is not currently a high availability system.
• The NameNode is a Single Point of Failure for the HDFS
Cluster.
• When the NameNode goes down, the file system goes offline.
• An optional Secondary NameNode which is hosted on a
separate machine creates checkpoints of the namespace.

JobTracker:
• The JobTracker is the service within Hadoop that distributes
MapReduce tasks to specific nodes in the cluster, ideally the
nodes that have the data, or at least are in the same rack.

TaskTracker:
• TaskTracker is a node in a Hadoop cluster that accepts Map,
Reduce and Shuffle tasks from the JobTracker.
• Each TaskTracker has a defined number of slots which
indicate the number of tasks that it can accept.
• When the JobTracker tries to find a TaskTracker to schedule a
map or reduce task it first looks for an empty slot on the same
node that hosts the DataNode containing the data.
• If an empty slot is not found on the same node, the
JobTracker looks for an empty slot on a node in the same
rack.

DataNode:
• A DataNode stores data in an HDFS file system.
• A functional HDFS file system has more than one DataNode,
with data replicated across them.
• DataNodes connect to the NameNode on startup.
• DataNodes respond to requests from the NameNode for file
system operations.
• Client applications can talk directly to a DataNode, once the
NameNode has provided the location of the data.
• Similarly, MapReduce operations assigned to TaskTracker instances near a DataNode talk directly to the DataNode to access the files.
• TaskTracker instances can be deployed on the same servers
that host DataNode instances, so that MapReduce operations
are performed close to the data.
MapReduce Job Execution Workflow:

• The above fig. shows the job execution workflow for
Hadoop MapReduce framework.
• The job execution starts when the client applications submit jobs to the JobTracker.
• The JobTracker returns a JobID to the client application.
• The JobTracker talks to the NameNode to determine the
location of the data.
• The JobTracker locates TaskTracker nodes with available slots at or near the data.
• The TaskTrackers send out heartbeat messages to the
JobTracker, usually every few minutes, to reassure the
JobTracker that they are still alive.
• These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster new work can be delegated.
• The JobTracker submits work to the TaskTracker nodes
when they poll for tasks.
• To choose a task for a TaskTracker, the JobTracker uses the FIFO (first-in-first-out) algorithm, which is the default scheduling algorithm in Hadoop.
• In FIFO scheduling, a work queue is maintained and the JobTracker pulls the oldest job first for scheduling. There is no notion of job priority or size of the job in FIFO scheduling.
• The TaskTracker nodes are monitored using the heartbeat
signals that are sent by the TaskTrackers to JobTracker.
• The TaskTracker spawns a separate JVM process for each task
so that any task failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while
capturing the output and exit codes.
• When the process finishes, successfully or not, the
TaskTracker notifies the JobTracker.
• When a task fails the TaskTracker notifies the JobTracker
and the JobTracker decides whether to resubmit the job to
some other TaskTracker or mark that specific record as
something to avoid.
• The JobTracker can blacklist a TaskTracker as unreliable
if there are repeated task failures.
• When the job is completed, the JobTracker updates its
status.
• Client applications can poll the JobTracker for status of
the jobs.

Hadoop Cluster Setup:

• Hadoop is an open source framework written in Java and has been designed to work with commodity hardware.
• The Hadoop filesystem HDFS is highly fault-tolerant.
• The preferred operating system to host Hadoop is Linux.
• It can also be set up on Windows-like operating systems with a Cygwin environment.

Hadoop Cluster Setup: continued

The steps involved in setting up a Hadoop cluster:

• Install Java
• Install Hadoop
• Networking (configure the network)
• Configure Hadoop
• Starting and stopping the Hadoop cluster

Install Java:
• Hadoop requires Java 6 or a later version.
• Fig. shows the list of commands for installing Java 7.

Install Hadoop:
• To set up a Hadoop cluster, the Hadoop setup tarball is
downloaded and unpacked on all the nodes.

Networking:
• After unpacking the Hadoop setup package on all the nodes of the
cluster, the next step is to configure the network such that all the
nodes can connect to each other over the network.
• To make the addressing of nodes simple, assign simple host names to the nodes (such as master, slave1, slave2).
• The /etc/hosts file is edited on all nodes and IP addresses and host
names of all the nodes are added.
• Hadoop control scripts use SSH for cluster-wide operations such as
starting and stopping NameNode, DataNode, JobTracker,
TaskTracker and other daemons on the nodes in the cluster.
• For the control scripts to work, all the nodes in the cluster must be
able to connect to each other via a password-less SSH login.
• To enable this, public/private RSA key pair is generated on each
node.
• The private key is stored in the file ~/.ssh/id_rsa and the public key is stored in the file ~/.ssh/id_rsa.pub.
• The public SSH key of each node is copied to the ~/.ssh/authorized_keys file of every other node.
• This can be done by manually editing the ~/.ssh/authorized_keys file on each node or by using the ssh-copy-id command.
• The final step in setting up the networking is to save the host key fingerprints of each node to the known_hosts file of every other node.
• This is done by connecting from each node to every other node by
SSH.

Configure Hadoop:
• With the Hadoop setup package unpacked on all nodes and the networking of the nodes set up, the next step is to configure the Hadoop cluster. Hadoop is configured using the configuration files listed in the table.

Sample configuration settings for the Hadoop configuration files core-site.xml, mapred-site.xml, hdfs-site.xml, and the masters/slaves files:
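A minimal illustrative configuration, assuming the master host is named master and using commonly chosen port numbers (9000 for the HDFS NameNode and 9001 for the JobTracker), might look like the snippet below; the actual values depend on the cluster. The masters and slaves files simply list the relevant host names, one per line.

```xml
<!-- core-site.xml (illustrative sample): location of the HDFS NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml (illustrative sample): location of the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml (illustrative sample): replication factor for HDFS blocks -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```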

Starting and stopping Hadoop cluster:
• After installing and configuring Hadoop, the next step is to start the Hadoop cluster.
• The cluster is typically started with the start-dfs.sh and start-mapred.sh scripts from the Hadoop bin directory on the master node, and stopped with the corresponding stop-mapred.sh and stop-dfs.sh scripts.

• Fig. shows the Hadoop NameNode status page which provides
information about NameNode uptime, the number of live,
dead and decommissioned nodes, host and port information, safe
mode status, heap information, audit logs, garbage collection
metrics, total load, file operations and CPU usage.

• Fig. shows the MapReduce administration page which provides
host and port information, start time, tracker counts, heap
information, scheduling information, current running jobs, retried
jobs, job history log, service daemon logs, thread stacks and a
cluster utilization summary.

• Fig. shows the status page of the live data nodes of the Hadoop cluster. The status page shows two live data nodes: slave1 and slave2.

• Fig. shows the status page of the active TaskTrackers of the Hadoop cluster. The status page shows two active TaskTrackers that run on the slave1 and slave2 nodes of the cluster.

Hadoop MapReduce workflow for batch analysis
of IoT data:

• Fig. shows a Hadoop MapReduce workflow for batch
analysis of IoT data.
• Batch analysis is done to aggregate data (computing
mean, maximum, minimum, etc.) on various time scales.
• The data collector retrieves the sensor data collected in
the cloud database and creates a raw data file in a form
suitable for processing by Hadoop.

Map program

• The above Fig. shows the map program for batch
analysis of sensor data.
• For the forest fire detection example, the raw data file
consists of the raw sensor readings along with the time
stamps.
• The map program reads the data from standard input
(stdin) and splits the data into timestamp and individual
sensor readings.
• The map program emits key-value pairs where key is a
portion of the timestamp and the value is a comma
separated string of sensor readings.
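A minimal Hadoop-streaming-style map program along these lines is sketched below in Python; the exact layout of the raw data file (a timestamp followed by comma-separated temperature, humidity, light and CO readings) is an assumption made for illustration.

```python
#!/usr/bin/env python
# mapper.py - illustrative Hadoop-streaming-style map program for sensor data.
# Assumed input line format: timestamp,temperature,humidity,light,co
# e.g. 2015-06-01 10:21:05,27.5,43.1,310,1.2
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    timestamp, temperature, humidity, light, co = line.split(",")
    # Use the date-and-hour portion of the timestamp as the key so that
    # readings are grouped on an hourly time scale.
    key = timestamp[:13]
    # Emit the key and a comma-separated string of sensor readings.
    print("%s\t%s,%s,%s,%s" % (key, temperature, humidity, light, co))
```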

Reduce program

• The above Fig. shows the reduce program for batch
analysis of sensor data.
• The key-value pairs emitted by the map program are
shuffled to the reducer and grouped by the key.
• The reducer reads the key-value pairs grouped by the
same key from standard input and computes the means of
temperature, humidity, light and CO readings.
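A matching reduce program is sketched below; it assumes the tab-separated key and comma-separated value format emitted by the mapper sketch above and computes the per-key means.

```python
#!/usr/bin/env python
# reducer.py - illustrative Hadoop-streaming-style reduce program for sensor data.
# Input: lines of "key<TAB>temperature,humidity,light,co", grouped by key.
import sys

def emit(key, sums, count):
    # Print the mean of each sensor field for the finished key.
    means = ",".join("%.2f" % (total / count) for total in sums)
    print("%s\t%s" % (key, means))

current_key = None
sums = [0.0, 0.0, 0.0, 0.0]   # running totals: temperature, humidity, light, CO
count = 0

for line in sys.stdin:
    key, values = line.strip().split("\t")
    readings = [float(v) for v in values.split(",")]
    if current_key is not None and key != current_key:
        emit(current_key, sums, count)
        sums, count = [0.0, 0.0, 0.0, 0.0], 0
    current_key = key
    sums = [total + reading for total, reading in zip(sums, readings)]
    count += 1

if current_key is not None:
    emit(current_key, sums, count)
```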

Running the MapReduce program on the Hadoop cluster:
• Map and reduce programs written in this style are typically run on the cluster with the Hadoop Streaming utility, which allows any executables that read from standard input and write to standard output to be used as the mapper and reducer, by passing the mapper script, reducer script and the HDFS input and output paths to the streaming JAR.
Hadoop YARN
• Hadoop YARN is the next generation architecture of
Hadoop (version 2.x).
• In the YARN architecture, the original processing engine of Hadoop (MapReduce) has been separated from the resource management (which is now a part of YARN), as shown in the fig.
• This makes YARN effectively an operating system for
Hadoop that supports different processing engines on a
Hadoop cluster such as MapReduce for batch processing,
Apache Tez for interactive queries, Apache Storm for
stream processing, etc.

• The above Fig. shows the MapReduce job execution
workflow for next generation Hadoop MapReduce
framework (MR2).
• The next generation MapReduce architecture divides the two major functions of the JobTracker (resource management and job life-cycle management) into separate components: the Resource Manager and the Application Master.

Key components of YARN:
The key components of YARN are:
1. Resource Manager (RM)
2. Application Master (AM)
3. Node Manager (NM)
4. Containers

Key components of YARN: continued
1. Resource Manager (RM):
• The RM manages the global assignment of compute resources to applications.
• RM consists of two main services.
1. Scheduler: Scheduler is a pluggable service that
manages and enforces the resource scheduling policy in
the cluster.
2. Applications Manager (AsM): AsM manages the
running Application Masters in the cluster.
AsM is responsible for starting application masters and
for monitoring and restarting them on different nodes in
case of failures.
Key components of YARN: continued
2. Application Master (AM):
• A per-application AM manages the application's life cycle. The AM is responsible for negotiating resources from the RM and working with the Node Managers (NMs) to execute and monitor the tasks.
3. Node Manager (NM): A per-machine NM manages the user processes on that machine.
4. Containers: A container is a bundle of resources allocated by the RM (memory, CPU, network, etc.).
A container is a conceptual entity that grants an application the privilege to use a certain amount of resources on a given machine to run a component task. Each node has an NM that spawns multiple containers based on the resource allocations made by the RM.
• The above Fig. shows a YARN cluster with a Resource
Manager Node and three Node Manager nodes.
• There are as many Application Masters running as there are
applications (jobs).
• Each application’s AM manages the application tasks such as
starting, monitoring and restarting tasks in case of failures.
• Each application has multiple tasks.
• Each task runs in a separate container. (Containers in the YARN architecture are similar to task slots in Hadoop MapReduce 1.x (MR1).)
• Each container in YARN can be used for both map and reduce tasks.
• The resource allocation model of YARN is more flexible, with the introduction of resource containers improving cluster utilization. (The resource allocation model in MR1 consists of a predefined number of map slots and reduce slots. This static allocation of slots results in low cluster utilization.)
• To understand the YARN job execution workflow, we analyze the interactions between the main components of YARN.

• The above Fig. shows the interactions between a Client and
Resource Manager.
• Job execution begins with the submission of a new application
request by the client to the RM.
• The RM then responds with a unique application ID and
information about cluster resource capabilities that the client
needs in requesting resources for running the application’s AM.
• Using the information received from the RM, the client
constructs and submits an Application Submission Context
which contains information such as scheduler queue, priority and
user information.
• The Application Submission Context also contains a Container
Launch Context which contains the application’s jar, job files,
security tokens and any resource requirements.
• The client can query the RM for application reports. The client
can also "force kill" an application by sending a request to the RM.
• The above Fig. shows the interactions between a
Resource Manager and Application Master.
• Upon receiving an application submission context from a
client, the RM finds an available container meeting the
resource requirements for running the AM for the
application.
• On finding a suitable container, the RM contacts the NM
for the container to start the AM process on its node.
• When the AM is launched it registers itself with the RM.
• The registration process consists of handshaking that
conveys information such as the RPC port that the AM
will be listening on, the tracking URL for monitoring the
application’s status and progress, etc.

• The registration response from the RM contains
information for the AM that is used in calculating and
requesting any resource requests for the application’s
individual tasks.
• The AM relays heartbeat and progress information to the RM.
• The AM sends resource allocation requests to the RM that
contains a list of requested containers and may also contain a
list of released containers by the AM.
• Upon receiving the allocation request, the scheduler
component of the RM computes a list of containers that satisfy
the request and sends back an allocation response.
• Upon receiving the resource list, the AM contacts the
associated NMs for starting the containers.
• When the job finishes, the AM sends a Finish Application
message to the RM.
• The above Fig. shows the interactions between an
Application Master and Node Manager.
• Based on the resource list received from the RM, the AM
requests the hosting NM for each container to start the
container.
• The AM can request and receive a container status report
from the Node Manager.
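The end-to-end flow described in the preceding figures can be summarized by the purely conceptual Python sketch below; the class and method names are illustrative stand-ins and do not correspond to actual YARN APIs.

```python
# Purely conceptual model of the YARN job execution flow (not actual YARN APIs).

class NodeManager:
    def start_container(self, container):
        print("NM: starting task in container", container["id"])

class ResourceManager:
    def new_application(self):
        # Client asks for an application ID and cluster capability information.
        return {"app_id": "application_0001", "max_memory_mb": 8192}

    def submit_application(self, context):
        # RM finds a container for the AM and asks an NM to launch it.
        print("RM: launching ApplicationMaster for", context["app_id"])

    def allocate(self, request):
        # Scheduler returns containers that satisfy the AM's resource request.
        return [{"id": i, "node": "node%d" % i} for i in range(request["num_containers"])]

class ApplicationMaster:
    def run(self, rm, node_managers):
        # AM registers with the RM, then requests containers for its tasks.
        containers = rm.allocate({"num_containers": 2, "memory_mb": 1024})
        for container in containers:
            node_managers[container["node"]].start_container(container)
        print("AM: job finished, sending FinishApplication to RM")

rm = ResourceManager()
app = rm.new_application()
rm.submit_application({"app_id": app["app_id"], "queue": "default", "am_jar": "app.jar"})
ApplicationMaster().run(rm, {"node0": NodeManager(), "node1": NodeManager()})
```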

Setting up Hadoop YARN cluster:

The steps involved in setting up Hadoop YARN cluster are:

• Setting up hosts
• Install Java
• Download and Install Hadoop YARN
• Configuring the network
• Edit the Hadoop configuration files
• Starting and stopping the Hadoop YARN cluster

Apache Oozie
• In the previous class we have learned about the Hadoop
framework and how the MapReduce jobs can be used for
analyzing IoT data.
• Many IoT applications require more than one
MapReduce job to be chained to perform data analysis.
• This can be accomplished using Apache Oozie system.

Apache Oozie:
• Oozie is a workflow scheduler system that allows managing
Hadoop jobs.
• With Oozie, we can create workflows, which are collections of actions (jobs) arranged as Directed Acyclic Graphs (DAGs).
• Control dependencies exist between the actions in a workflow.
• Thus an action is executed only when the preceding action is completed.
• An Oozie workflow specifies a sequence of actions that need to be executed using an XML-based Process Definition Language called hPDL.
• Oozie supports various types of actions such as Hadoop
MapReduce, Hadoop file system, Pig, Java, Email, Shell,
Hive, Sqoop, SSH and custom actions.
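The control-dependency idea can be pictured with the small conceptual Python sketch below, which runs a toy workflow of actions in dependency order; it models only the DAG idea and does not use Oozie or hPDL.

```python
# Conceptual sketch of DAG-ordered action execution (models the Oozie idea, not Oozie itself).

# Each action lists the actions it depends on (a toy chain of two MapReduce jobs).
workflow = {
    "ingest": [],
    "mapreduce-1": ["ingest"],
    "mapreduce-2": ["mapreduce-1"],
    "email-report": ["mapreduce-2"],
}

def run(workflow):
    done = set()
    while len(done) < len(workflow):
        for action, dependencies in workflow.items():
            # An action is executed only when all preceding actions have completed.
            if action not in done and all(dep in done for dep in dependencies):
                print("running", action)
                done.add(action)

run(workflow)
```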
Setting up Oozie:
• Oozie requires a Hadoop installation and can be set up on either a single node or a cluster of two or more nodes.
Steps in setting up Oozie:
• Create a new user and group
• After setting up Hadoop, install the packages required for setting up Oozie.
• Download and build Oozie.
• Create OozieDB
• Start Oozie server
• Setup Oozie client

Apache Spark
• Apache Spark is an open source cluster computing framework for data analysis.
• Spark supports in-memory cluster computing and
promises to be faster than Hadoop.
• Spark supports various high-level tools for data analysis, such as Spark Streaming for streaming jobs, Spark SQL for analysis of structured data, MLlib (the machine learning library for Spark), GraphX for graph processing and Shark.
• Spark allows real-time, batch and interactive queries and
provides APIs for Scala, Java and Python languages.
Components of a Spark cluster:

• The above fig. shows the components of a Spark cluster.
• Each Spark application consists of a driver program and
is coordinated by a SparkContext object.
• Spark supports various cluster managers including
Spark’s standalone cluster manager, Apache Mesos and
Hadoop YARN.
• The cluster manager allocates resources for applications
on the worker nodes.
• The executors which are allocated on the worker nodes
run the application code as multiple tasks.
• Applications are isolated from each other and run within their own executor processes on the worker nodes.
• Spark provides a data abstraction called the resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster.
• The RDD elements can be operated on in parallel in the
cluster.
• RDDs support two types of operations
✓ Transformations
✓ Actions
• Transformations are used to create a new dataset from an
existing one.
• Actions return a value to the driver program after running a computation on the dataset.
• The Spark API allows chaining together transformations and actions.
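A small PySpark sketch of chaining transformations and actions on an RDD is shown below; the sensor values are made-up sample data and a local Spark installation is assumed.

```python
# Minimal PySpark sketch: transformations (filter, map) chained with actions (collect, count).
from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")   # assumes Spark is installed locally

readings = sc.parallelize([24.1, 26.7, 31.2, 29.8, 22.4])   # sample temperature values (Celsius)

# Transformations build new RDDs lazily; actions trigger the computation.
hot_in_fahrenheit = (readings
                     .filter(lambda c: c > 25.0)            # keep readings above 25 C
                     .map(lambda c: c * 9.0 / 5.0 + 32.0))  # convert to Fahrenheit

print(hot_in_fahrenheit.collect())
print("total readings:", readings.count())

sc.stop()
```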
Apache Storm
• Apache Storm is a framework for distributed and fault-tolerant real-time computation.
• Storm can be used for real-time processing of streams of
data.

Components of a Storm cluster:

• The above fig. shows the components of a Storm cluster.
• A Storm cluster comprises Nimbus, Supervisor and Zookeeper.
• Nimbus is similar to Hadoop's JobTracker and is responsible for distributing code around the cluster, launching workers across the cluster and monitoring computation.
• A Storm cluster has one or more Supervisor nodes on
which the worker processes run.
• Supervisor nodes communicate with Nimbus through
Zookeeper.
• Nimbus sends signals to Supervisor to start or stop
workers.
• Zookeeper is a high performance distributed
coordination service for maintaining configuration
information, naming, providing distributed
synchronization and group services.
• Zookeeper is required for coordination of the Storm
cluster.
• A computation job on the Storm cluster is called a "topology", which is a graph of computation.
• A Storm topology comprises a number of worker processes that are distributed on the cluster.
• Each worker process runs a subset of the topology.
• A topology is composed of Spouts and Bolts.
• A Spout is a source of streams (sequences of tuples), e.g. a sensor data stream.
• The streams emitted by the Spouts are processed by the
Bolts.
• Bolts subscribe to Spouts, consume the streams, process
them and emit new streams.
• A topology can consist of multiple Spouts and Bolts.

• Fig. shows a Storm topology with one Spout and three
Bolts.
• Bolts 1 and 2 subscribe to the Spout and consume the
streams emitted by the Spout.
• The outputs of Bolts 1 and 2 are consumed by Bolt-3.
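The topology in the figure can be pictured with the purely conceptual Python sketch below, where a generator stands in for the Spout and plain functions stand in for the Bolts; it illustrates only the stream-of-tuples dataflow and does not use the actual Storm API.

```python
# Conceptual sketch of a Storm-style topology (one spout, three bolts); not the Storm API.

def sensor_spout():
    """Spout: emits a stream of (sensor_id, temperature) tuples (sample values)."""
    for tup in [("s1", 24.5), ("s2", 31.0), ("s1", 29.3), ("s2", 22.8)]:
        yield tup

def bolt1(tup):
    """Bolt 1: convert the temperature to Fahrenheit."""
    sensor_id, celsius = tup
    return sensor_id, celsius * 9.0 / 5.0 + 32.0

def bolt2(tup):
    """Bolt 2: flag readings above a threshold."""
    sensor_id, celsius = tup
    return sensor_id, celsius > 30.0

def bolt3(converted, alert):
    """Bolt 3: consume the streams produced by Bolts 1 and 2."""
    print("reading:", converted, "alert:", alert)

# Bolts 1 and 2 subscribe to the spout's stream; Bolt 3 consumes their outputs.
for tup in sensor_spout():
    bolt3(bolt1(tup), bolt2(tup))
```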

Any Questions?
