
Unit 3

What is Apache Kafka

Developed as a publish-subscribe messaging system to handle mass amounts of data at LinkedIn, today,
Apache Kafka® is an open-source distributed event streaming platform used by over 80% of the Fortune 100.

Apache Kafka is a software platform based on a distributed streaming process. It is a publish-subscribe
messaging system that lets applications, servers, and processes exchange data. Apache Kafka was originally
developed at LinkedIn and was later donated to the Apache Software Foundation. Currently, it is maintained
by Confluent under the Apache Software Foundation. Apache Kafka addresses the problem of slow data
communication between a sender and a receiver.

What is a messaging system

A messaging system is a simple exchange of messages between two or more persons, devices, etc. A publish-
subscribe messaging system allows a sender to send/write messages and a receiver to read those messages. In
Apache Kafka, a sender is known as a producer, who publishes messages, and a receiver is known as
a consumer, who consumes messages by subscribing to them.

What is Streaming process

A streaming process is the processing of data across parallel, connected systems. It allows different
applications to process data in parallel, where one record is processed without waiting for the output of the
previous record. A distributed streaming platform therefore simplifies the tasks of stream processing and
parallel execution. Accordingly, a streaming platform such as Kafka has the following key capabilities:

o It processes streams of records as they occur.
o It publishes and subscribes to streams of records, similar to an enterprise messaging system.
o It stores streams of records in a fault-tolerant, durable way.

To learn and understand Apache Kafka, aspirants should know the following four core APIs:
Producer API: This API allows an application to publish a stream of records to one or more topics.
Consumer API: This API allows an application to subscribe to one or more topics and process the stream of
records produced to them.
Streams API: This API allows an application to effectively transform input streams into output streams. It
permits an application to act as a stream processor, consuming an input stream from one or more topics and
producing an output stream to one or more output topics.
Connector API: This API builds reusable producers and consumers that connect Kafka topics to existing data
systems or applications.
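As a rough illustration of the Producer and Consumer APIs, the sketch below uses the third-party kafka-python client (an assumption on our part; the same concepts apply to the official Java client). The broker address, topic name, and consumer group are placeholders.

# Minimal sketch using the kafka-python client (assumed installed via pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish records to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("website-activity", key=b"user-42", value=b"page_view:/home")
producer.flush()  # block until the broker acknowledges the record

# Consumer API: subscribe to the topic and process records.
consumer = KafkaConsumer(
    "website-activity",
    bootstrap_servers="localhost:9092",
    group_id="activity-trackers",     # consumers in a group share the topic's partitions
    auto_offset_reset="earliest",     # start from the oldest retained record
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)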
Why Apache Kafka
Apache Kafka is a software platform, and the following reasons best describe the need for it:
1. Apache Kafka is capable of handling millions of messages per second.
2. Apache Kafka works as a mediator between the source system and the target system. The source
system (producer) sends data to Apache Kafka, which decouples it from the target system
(consumer), which in turn consumes the data from Kafka.
3. Apache Kafka has extremely high performance, with latency values typically below 10 ms.
4. Apache Kafka has a resilient architecture which has resolved unusual complications in data sharing.
5. Organizations such as Netflix, Uber, Walmart, and thousands of other firms make use of Apache Kafka.
6. Apache Kafka is able to maintain fault tolerance. Sometimes a consumer successfully receives a
message delivered by the producer but fails to process it because of a backend database failure or a bug in the
consumer code. In such a situation, without Kafka the consumer would be unable to consume the message
again. Because Kafka retains messages, the consumer can read them again and reprocess the data.
Kafka Streams Architecture
The architecture of Apache Kafka is designed to handle large volumes of data streaming in real-time. It
consists of several key components:

1. Producers: These are the applications or systems that generate and send data to Kafka. Producers send data to
Kafka topics, which are logical channels for storing and distributing data.
2. Brokers: These are the servers that form the Kafka cluster. They are responsible for receiving data from
producers and storing it in topics. A Kafka cluster can have one or more brokers, each of which holds a replica
of the data stored in the topics.
3. Zookeeper: This is a separate service that is used to coordinate the Kafka brokers. It is responsible for
maintaining configuration information, detecting broker failures, and electing new leaders for partitions.
4. Consumers: These are the applications or systems that subscribe to and consume data from Kafka topics.
Consumers read data from topics, process it, and send it to other systems or applications.
5. Topics and partitions: Topics are the logical channels for storing and distributing data in Kafka. Each topic can
have one or more partitions, which are used to distribute data across multiple brokers for scalability and fault
tolerance.
6. Replication: Each partition in a topic is replicated across multiple brokers to ensure that data is not lost in case
of a broker failure.

In summary, Kafka's architecture is designed for large-scale data streaming. It is composed of producers that
send data to topics, brokers that store the data in topics, Zookeeper, which coordinates the brokers, consumers
that read data from topics, topics and partitions that distribute the data, and replication, which ensures fault
tolerance.
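As a hedged sketch of how topics, partitions, and replication are set up in practice, the snippet below uses the kafka-python admin client (an assumed third-party library); the topic name, partition count, and replication factor are illustrative only.

from kafka.admin import KafkaAdminClient, NewTopic

# Create a topic with 3 partitions, each replicated to 2 brokers,
# so that a single broker failure loses no data.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="page-views", num_partitions=3, replication_factor=2)])
admin.close()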

Basically, by building on the Kafka producer and consumer libraries and leveraging the native capabilities of
Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity, Kafka
Streams simplifies application development.
Apache Kafka Use Cases
Apache Kafka has the following use cases, which best describe when to use it:
1) Message Broker
Apache Kafka is one of the trending technologies capable of handling large amounts of similar messages or
data. This capability enables Kafka to deliver high throughput. Also, Kafka is a publish-subscribe messaging
system that makes it convenient for users to read and write data.
2) Metrics
Apache Kafka is used to monitor operational data by producing centralized feeds of that data. Operational data
means monitoring anything from technology metrics to security logs to supplier information, and so on.
3) Website Activity Tracking
It is one of the most widely used use cases of Kafka, because website activity usually creates a huge amount of
data, generating messages for each page view and user activity. Kafka also ensures that data is successfully
sent and received by both parties.
4) Event Sourcing
Apache Kafka supports the collection of huge amounts of log data. This makes it a crucial component of any
event management system, including Security Information and Event Management (SIEM). Its ability to
handle large amounts of log data makes it an excellent backend for building applications.
5) Commit Logs
Apache Kafka is used for data replication between nodes and to restore data on failed nodes. Kafka can also
act as a pseudo commit log. For example, suppose a user is tracking device data from IoT sensors and finds
that the database is not storing all of the data; the user can then replay the data from Kafka to fill in the missing
or unstored information in the database.
6) Log Aggregation
Several organizations use Apache Kafka to collect logs from various services and make them available to
multiple consumers in a standard format.
7) Kafka Stream Processing
Various popular frameworks read data from a topic, process it, and write the processed data to a new topic.
This new topic containing the processed data becomes available to users and applications such as Spark
Streaming, Storm, etc.
Apache Kafka Applications
There are following applications of Apache Kafka:

LinkedIn
In 2010, LinkedIn developed Apache Kafka. Because Kafka is a publish-subscribe messaging system, various
LinkedIn products such as LinkedIn Today and LinkedIn Newsfeed use it for message consumption.
Uber
Uber uses Kafka as a message bus to connect different parts of its ecosystem. Kafka helps match passengers
and drivers: it collects information from the rider's app as well as from the driver's app and then makes that
information available to a variety of downstream consumers.
Twitter
Because Kafka fulfilled its requirements for data replication and durability, Twitter became one of the
best-known users of Apache Kafka. Adopting Kafka led Twitter to resource savings of up to 75%, i.e., a
significant cost reduction.
Netflix
Netflix uses Kafka in its Keystone pipeline. Keystone is a unified collection, event publishing, and routing
infrastructure used for stream and batch processing. The Keystone pipeline uses two sets of Kafka clusters,
i.e., Fronting Kafka and Consumer Kafka. Fronting Kafka receives the messages from the producers. Consumer
Kafka contains topic subsets which are routed by Samza (an Apache framework) to the real-time consumers.
Kafka has thus kept costs down while providing lossless delivery through the data pipeline.

Oracle
Apache Kafka supports the Oracle Database as a Kafka consumer. It also supports Oracle for publishing
events to Kafka. Apache Kafka provides reliable and scalable data streaming, and an Oracle user can easily
retrieve data from a Kafka topic. Oracle developers are now better able to implement staged data pipelines
through the Oracle Service Bus (OSB).

Mozilla
Mozilla Firefox is a free, open-source web browser that supports Windows, Linux, macOS, and many other
operating systems. Mozilla uses Kafka as a backing data store. Kafka is expected to replace Mozilla's current
production system for collecting performance and usage data from end users for projects like Telemetry and
Test Pilot.

Advantages of Apache Kafka


The following advantages of Apache Kafka make it worthwhile:
1. Low Latency: Apache Kafka offers low latency, i.e., up to 10 milliseconds. This is because it
decouples the message from the consumer, letting the consumer consume the message at any time.
2. High Throughput: Due to its low latency, Kafka is able to handle large numbers of high-volume,
high-velocity messages. Kafka can support thousands of messages per second. Many companies,
such as Uber, use Kafka to load high volumes of data.
3. Fault tolerance: Kafka provides resistance to node/machine failure within the cluster.
4. Durability: Kafka offers replication, which persists data or messages on disk across the cluster.
This makes it durable.
5. Reduces the need for multiple integrations: All the data that producers write goes through
Kafka. Therefore, we need to create only one integration with Kafka, which automatically
integrates us with each producing and consuming system.
6. Easily accessible: As all our data gets stored in Kafka, it becomes easily accessible to anyone.
7. Distributed System: Apache Kafka has a distributed architecture, which makes it scalable.
Partitioning and replication are the two capabilities of this distributed system.
8. Real-Time handling: Apache Kafka is able to handle real-time data pipelines. Building a
real-time data pipeline involves processors, analytics, storage, etc.
9. Batch approach: Kafka also supports batch-like use cases and can work like an ETL tool
because of its data persistence capability.
10. Scalability: Kafka's ability to handle large numbers of messages simultaneously makes it a
scalable software product.

Disadvantages Of Apache Kafka


Along with the above advantages, Apache Kafka has the following limitations/disadvantages:

1. Incomplete set of monitoring tools: Apache Kafka does not ship with a complete set of
monitoring and managing tools. As a result, new startups or enterprises may hesitate to adopt Kafka.
2. Message tweaking issues: The Kafka broker uses system calls to deliver messages to the
consumer. If a message needs to be modified in flight, Kafka's performance is significantly
reduced, so it works best when messages do not need to change.
3. No wildcard topic selection: Apache Kafka does not support wildcard topic selection; it
matches only exact topic names. This makes it unable to address certain use cases.
4. Reduced performance: Brokers and consumers reduce Kafka's performance when compressing
and decompressing the data flow, which also affects throughput.
5. Clumsy behaviour: Apache Kafka often behaves a bit clumsily when the number of queues in
the Kafka cluster increases.
6. Missing message paradigms: Certain messaging paradigms, such as point-to-point queues and
request/reply, are missing in Kafka for some use cases.
Map reduce

What is MapReduce in Hadoop?


MapReduce is a software framework and programming model used for processing huge amounts of data.
MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and
mapping the data, while Reduce tasks shuffle and reduce the data. MapReduce is a programming model for
writing applications that can process Big Data in parallel on multiple nodes, and it provides analytical
capabilities for analyzing huge volumes of complex data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and
C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data
analysis using multiple machines in a cluster.

The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: the
map function and the reduce function.

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store and process data. This traditional
model is not suitable for processing huge volumes of scalable data and cannot be accommodated by standard
database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple
files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into
small parts and assigns them to many computers. Later, the results are collected in one place and integrated to
form the result dataset.

MapReduce Architecture in Big Data explained with Example


The whole process goes through four phases of execution namely, splitting, mapping, shuffling, and reducing.

Ex: Consider that you have the following input data for your MapReduce program:

Welcome to Hadoop Class


Hadoop is good
Hadoop is bad
The final output of the MapReduce task is

Bad 1
Class 1
Good 1
Hadoop 3
Is 2
To 1
Welcome 1
The data goes through the following phases of MapReduce in Big Data

Input Splits:

An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of
the input that is consumed by a single map task.

Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each
split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is
to count the number of occurrences of each word in the input splits and prepare a list in the form of
<word, frequency>.

Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records
from the Mapping phase output. In our example, the same words are grouped together along with their
respective frequencies.

Reducing: In this phase, output values from the Shuffling phase are aggregated. This phase combines values
from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences
of each word.
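The phases above can be sketched in plain Python. This is an illustrative, single-machine simulation of the split, map, shuffle, and reduce steps, not Hadoop code:

# Illustrative single-machine simulation of the MapReduce word-count phases.
from collections import defaultdict

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]   # input splits

# Map: emit <word, 1> for every word in every split.
mapped = [(word.lower(), 1) for line in lines for word in line.split()]

# Shuffle: group all values belonging to the same key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the grouped values into a single count per word.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)   # {'welcome': 1, 'to': 1, 'hadoop': 3, 'class': 1, 'is': 2, 'good': 1, 'bad': 1}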

MapReduce Architecture explained in detail

 One map task is created for each split, which then executes the map function for each record in the split.
 It is always beneficial to have multiple splits, because the time taken to process a split is small
compared to the time taken to process the whole input. When the splits are smaller, the
processing is better load-balanced, since we process the splits in parallel.
 However, it is also not desirable to have splits that are too small. When splits are too small, the
overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
 For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by
default in Hadoop 1, 128 MB in Hadoop 2 and later).
 Execution of map tasks results in output being written to a local disk on the respective node, not to
HDFS.
 The reason for choosing local disk over HDFS is to avoid the replication which takes place in the case
of an HDFS store operation.
 Map output is intermediate output which is processed by reduce tasks to produce the final output.
 Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication
would be overkill.
 In the event of node failure before the map output is consumed by the reduce task, Hadoop reruns the
map task on another node and re-creates the map output.
 The reduce task does not work on the concept of data locality. The output of every map task is fed to the
reduce task, and map output is transferred to the machine where the reduce task is running.
 On this machine, the output is merged and then passed to the user-defined reduce function.
 Unlike map output, reduce output is stored in HDFS (the first replica is stored on the local node
and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network
bandwidth, but only as much as a normal HDFS write pipeline.

How MapReduce Organizes Work?


Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)


2. Reduce tasks (Shuffling, Reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of
entities called:

1. Job Tracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode,
and there are multiple TaskTrackers which reside on DataNodes.

How Hadoop MapReduce Works

 A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
 It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different
data nodes.
 Execution of an individual task is then looked after by a task tracker, which resides on every data node
executing part of the job.
 The task tracker's responsibility is to send progress reports to the job tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker to notify it of
the current state of the system.
 Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job
tracker can reschedule it on a different task tracker.
Advantages of using it over Traditional methods:

There are numerous advantages of MapReduce but the following are the ones which are most prominent.

 Scalability: The Hadoop MapReduce algorithm is extremely scalable, mostly because of the way it
distributes and stores data; in addition, the servers running MapReduce tend to work in parallel and
can therefore serve a large number of clients much more quickly.
 Flexibility: The data stored and processed by MapReduce can easily be reused by other
applications integrated with the main system.
 Security and Authentication: Because the data is divided into chunks and is not stored on one
server, it is much easier to keep it safe from malicious activity.

 Cost-effective solution: MapReduce saves a lot of start-up costs, since only a limited amount
of memory resources is required to do the processing.

 Simple Coding Model: With MapReduce, the programmer does not have to implement parallelism,
distributed data passing, or any of the complexities that they would otherwise be faced with. This
greatly simplifies the coding task and reduces the amount of time required to create analytical routines.

 Supports Unstructured Data: Unstructured data is data that does not follow a specified format. If
20 percent of the data available to enterprises is structured data, the other 80 percent is
unstructured. Unstructured data is really most of the data that you will encounter. Until recently,
however, the technology didn't really support doing much with it except storing it or analyzing it
manually.

 Fault Tolerance: Because of its highly distributed nature, MapReduce is very fault tolerant. Typically,
the distributed file systems that MapReduce supports, along with the master process, enable
MapReduce jobs to survive hardware failures.

If a system is running entirely on a single machine, there is no communication with other machines to
go wrong, and the only points of failure are local; the machine components may fail, or the software
on the machine may exhibit a fault. In either case, there is often very little that can be done to
automatically recover from these failures. A hardware failure will typically cause the system to halt, in
which case there is no opportunity for any automatic recovery action. A software fault may be more
recoverable, but without any remedial action being taken, a restart of the software will most likely see a
repeat of the fault in the near future.
Conversely, consider a computer system that is distributed across multiple machines connected
by a local area network, such as Ethernet. Each of the machines processes data and produces a
subset of the results, working in conjunction with the others toward the end goal. Passing data and messages
between the machines is critical to the successful and efficient functioning of the system.

Because MapReduce processes simple key value pairs, it can support any type of data structure that fits into
this model.

Challenges for Hadoop MapReduce in the Enterprise


Lack of performance and scalability – Current implementations of the Hadoop MapReduce programming
model do not provide a fast, scalable distributed resource management solution fundamentally limiting the
speed with which problems can be addressed. Organizations require a distributed MapReduce solution that can
deliver competitive advantage by solving a wider range of data-intensive analytic problems faster. They also
require the ability to harness resources from clusters in remote data centers. Ideally, the MapReduce
implementation should help organizations run complex data simulations with sub-millisecond latency, high
data throughput, and thousands of map/reduce tasks completed per second depending on complexity.
Applications should be able to scale to tens of thousands of cores and hundreds of concurrent clients and/or
applications.
Lack of flexible resource management – Current implementations of the Hadoop MapReduce programming
model are not able to react quickly to real-time changes in application or user demands. Based on the volume
of tasks, the priority of the job, and time-varying resource allocation policies, MapReduce jobs should be able
to quickly grow or shrink the number of concurrently executing tasks to maximize throughput, performance
and cluster utilization while respecting resource ownership and sharing policies.
Lack of application deployment support – Current implementations of the Hadoop MapReduce
programming model do not make it easy to manage multiple application integrations on production-scale
distributed system with automated application service deployment capability. An enterprise-class solution
should have automated capabilities that include application deployment, workload policy management, tuning,
and general monitoring and administration. The environment should promote good coding practices and
version control to simplify implementation, minimize ongoing application maintenance, improve time to
market and improve code quality.
Lack of quality of service assurance – Current implementations of the Hadoop MapReduce programming
model are not optimized to take full advantage of modern multi-core servers. Ideally, the implementation
should allow for both multi-threaded and single-threaded tasks and be able to schedule them intelligently,
taking cache effectiveness and data locality into consideration. Application performance and
scalability can be further improved by optimizing the placement of tasks on multi-core systems based on the
specific nature of the MapReduce workload.
Lack of multiple data source support – Current implementations of the Hadoop MapReduce programming
model only support a single distributed file system; the most common being HDFS. A complete
implementation of the MapReduce programming model should be flexible enough to provide data access
across multiple distributed file systems. In this way, existing data does not need to be moved or translated
before it can be processed. MapReduce services need visibility to data regardless of where it resides.

Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make interaction with big data
easier. However, for those who are not acquainted with this technology, one question arises: what is big
data? Big data is a term given to data sets which can't be processed efficiently with the help of
traditional methodologies such as RDBMS. Hadoop has made its place in the industries and companies
that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a
framework that enables processing of large data sets which reside in the form of clusters. Being a
framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or suite which provides various services to solve
big data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of
the tools or solutions are used to supplement or support these major elements. All these tools work
collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form the Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
HDFS:

 HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large
data sets of structured or unstructured data across various nodes and thereby maintaining the
metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node and contains metadata (data about data), requiring comparatively
fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity
hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at the heart
of the system.
YARN:

 Yet Another Resource Negotiator (YARN), as the name implies, helps to manage resources across the
clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the system,
whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per
machine and later acknowledge the Resource Manager. The Application Manager works as an
interface between the Resource Manager and Node Managers and performs negotiations as per the
requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over
the processing logic and helps to write applications which transform big data sets into manageable
ones.
 MapReduce makes use of two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups.
Map generates a key-value-pair-based result which is later processed by the Reduce()
method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In
short, Reduce() takes the output generated by Map() as input and combines those tuples into
a smaller set of tuples.
Hive
The Hadoop ecosystem component Apache Hive is an open-source data
warehouse system for querying and analyzing large datasets stored in Hadoop
files. Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL
automatically translates SQL-like queries into MapReduce jobs which execute
on Hadoop (a sketch of such a query appears below).
The main parts of Hive are:
 Metastore – stores the metadata.
 Driver – manages the lifecycle of a HiveQL statement.
 Query compiler – compiles HiveQL into a Directed Acyclic Graph (DAG).
 Hive server – provides a Thrift interface and a JDBC/ODBC server.
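As a rough sketch of what such an SQL-like query looks like, the snippet below runs a HiveQL-style query through PySpark's SparkSession with Hive support enabled. Note the assumptions: this uses Spark's SQL engine rather than Hive's own MapReduce execution, and the table name page_views is hypothetical.

from pyspark.sql import SparkSession

# A SparkSession with Hive support can read tables registered in the Hive metastore.
spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

# A HiveQL/SQL-style summarization query over a hypothetical 'page_views' table.
result = spark.sql("SELECT country, COUNT(*) AS views FROM page_views GROUP BY country")
result.show()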

Pig
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of MapReduce
are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the
way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop ecosystem.

Apache Pig is a high-level language platform for analyzing and querying huge
datasets stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the
Pig Latin language, which is very similar to SQL. It loads the data, applies the
required filters, and dumps the data in the required format. For program
execution, Pig requires a Java runtime environment.
Features of Apache Pig:
 Extensibility – For carrying out special-purpose processing, users can create
their own functions.
 Optimization opportunities – Pig allows the system to optimize execution
automatically. This lets the user pay attention to semantics instead of
efficiency.
 Handles all kinds of data – Pig analyzes both structured and unstructured
data.
HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured
data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed
NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data
in HDFS.
Since 1970, the RDBMS has been the solution for data storage and maintenance problems. After the advent of big data,
companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a
distributed file system for storing big data, and MapReduce to process it. Hadoop excels at storing and processing huge
data of various formats, such as arbitrary, semi-structured, or even unstructured data.

Limitations of Hadoop

Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means one has to
search the entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this
point, a new solution is needed to access any point of data in a single unit of time (random access).

Hadoop random-access databases: Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some
of the databases that store huge amounts of data and access the data in a random manner.

What is HBase?

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and
is horizontally scalable. HBase is a data model similar to Google's Bigtable, designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System
(HDFS). It is part of the Hadoop ecosystem and provides random, real-time read/write access to data in the Hadoop file
system.
One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.

HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides high-latency batch processing.
HBase: It provides low-latency access to single rows from billions of records (random access).

HDFS: It provides only sequential access to data.
HBase: HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS
files for faster lookups.

Storage Mechanism in HBase


HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column
families, which are the key-value pairs. A table can have multiple column families, and each column family can have any
number of columns. Subsequent column values are stored contiguously on disk. Each cell value of the table has a
timestamp. In short, in HBase:

 Table is a collection of rows.


 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.
Given below is an example schema of table in HBase.

Rowid: 1
  Column Family 1: col1, col2, col3
  Column Family 2: col1, col2, col3
  Column Family 3: col1, col2, col3
  Column Family 4: col1, col2, col3
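As a hedged sketch of this storage model in code, the snippet below uses the third-party happybase Python client (an assumption; the table name, column-family name, and values are illustrative) to write and read cells addressed by row key, column family, and column qualifier.

import happybase

# Connect to the HBase Thrift server (host is a placeholder).
connection = happybase.Connection("localhost")
table = connection.table("users")

# Put: write cells into the 'profile' column family for row key 'row1'.
table.put(b"row1", {b"profile:name": b"Alice", b"profile:city": b"Pune"})

# Get: random read of a single row by its key.
print(table.row(b"row1"))   # {b'profile:name': b'Alice', b'profile:city': b'Pune'}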

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data.
In short, they have column families.

Row-Oriented Database: It is suitable for Online Transaction Processing (OLTP). Such databases are designed
for a small number of rows and columns.

Column-Oriented Database: It is suitable for Online Analytical Processing (OLAP). Column-oriented databases
are designed for huge tables.


HBase and RDBMS

HBase: HBase is schema-less; it doesn't have the concept of a fixed-column schema and defines only column
families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables. HBase is horizontally scalable.
RDBMS: It is thin and built for small tables. Hard to scale.

HBase: There are no transactions in HBase.
RDBMS: An RDBMS is transactional.

HBase: It has de-normalized data.
RDBMS: It will have normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Features of HBase

 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.

Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable runs on top of the Google
File System; likewise, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Components of Hbase
There are two HBase Components namely- HBase Master and RegionServer.

i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers. It:

 Maintains and monitors the Hadoop cluster.
 Performs administration (provides an interface for creating, updating, and deleting tables).
 Controls the failover.
 Handles DDL operations (as HMaster).
ii. RegionServer
It is the worker node which handles read, write, update, and delete requests from clients. A RegionServer process runs on
every node in the Hadoop cluster, on the HDFS DataNode.

Apache Mahout
Mahout is an open-source framework for creating scalable machine learning
algorithms and a data mining library. Once data is stored in Hadoop HDFS,
Mahout provides the data science tools to automatically find meaningful patterns in
those big data sets.
Algorithms of Mahout are:
 Clustering – It takes items in a particular class and organizes them into
naturally occurring groups, such that items belonging to the same group are
similar to each other.
 Collaborative filtering – It mines user behavior and makes product
recommendations (e.g., Amazon recommendations).
 Classification – It learns from existing categorizations and then assigns
unclassified items to the best category.
 Frequent pattern mining – It analyzes items in a group (e.g., items in a
shopping cart or terms in a query session) and then identifies which items
typically appear together.
Apache Spark:
 It’s a platform that handles all the process consumptive tasks like batch processing, interactive or
iterative real-time processing, graph conversions, and visualization, etc.
 It consumes in memory resources hence, thus being faster than the prior in terms of optimization.
 Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch
processing, hence both are used in most of the companies interchangeably.

Apache Zookeeper is a centralized service and a Hadoop Ecosystem


component for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. Zookeeper manages and
coordinates a large cluster of machines.
Features of Zookeeper:
 Fast – Zookeeper is fast with workloads where reads to data are more common
than writes. The ideal read/write ratio is 10:1.
 Ordered – Zookeeper maintains a record of all transactions.

Oozie
It is a workflow scheduler system for managing apache Hadoop jobs. Oozie
combines multiple jobs sequentially into one logical unit of work. Oozie framework
is fully integrated with apache Hadoop stack, YARN as an architecture center and
supports Hadoop jobs for apache MapReduce, Pig, Hive, and Sqoop.

In Oozie, users can create a Directed Acyclic Graph of workflows, which can run in
parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely
execution of thousands of workflows in a Hadoop cluster. Oozie is also very
flexible: one can easily start, stop, suspend, and rerun jobs. It is even
possible to skip a specific failed node or rerun it in Oozie.
There are two basic types of Oozie jobs:

 Oozie workflow – It is to store and run workflows composed of Hadoop jobs


e.g., MapReduce, pig, Hive.
 Oozie Coordinator – It runs workflow jobs based on predefined schedules and
availability of data.

PySpark Introduction:
What is PySpark? PySpark is a Spark library written in Python to run Python
applications using Apache Spark capabilities; using PySpark, we can run applications in parallel on a
distributed cluster (multiple nodes).

In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical processing
engine for large-scale, powerful distributed data processing and machine learning applications.

Spark is basically written in Scala; later, due to its industry adoption, its PySpark API was released
for Python using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to
dynamically interface with JVM objects; hence, to run PySpark you also need Java to be installed along
with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the machine
learning community), which comes with a lot of useful tools such as the Spyder IDE and Jupyter Notebook to
run PySpark applications.
PySpark is used a lot in the machine learning and data science communities, thanks to the
vast set of Python machine learning libraries. Spark runs operations on billions and trillions of records on
distributed clusters up to 100 times faster than traditional Python applications.

Who uses PySpark? PySpark is very well used in the data science and machine learning
community, as there are many widely used data science libraries written in Python, including NumPy and
TensorFlow. It is also used for its efficient processing of large datasets. PySpark has been used by
many organizations such as Walmart, Trivago, Sanofi, Runtastic, and many more.

Features: Following are the main features of PySpark.


PySpark Features
 In-memory computation
 Distributed processing using parallelize
 Can be used with many cluster managers (Spark, Yarn, Mesos e.t.c)
 Fault-tolerant
 Immutable
 Lazy evaluation
 Cache & persistence
 In-built optimization when using DataFrames
 Supports ANSI SQL
Advantages of PySpark
 PySpark is a general-purpose, in-memory, distributed processing engine that allows you to process
data efficiently in a distributed fashion.
 Applications running on PySpark can be up to 100x faster than traditional systems.
 You will get great benefits from using PySpark for data ingestion pipelines.
 Using PySpark, we can process data from Hadoop HDFS, AWS S3, and many other file systems.
 PySpark is also used to process real-time data using Streaming and Kafka.
 Using PySpark Streaming, you can also stream files from the file system as well as from sockets.
 PySpark natively has machine learning and graph libraries.

PySpark is a tool created by the Apache Spark community for using Python with Spark. It allows working with
RDDs (Resilient Distributed Datasets) in Python. It also offers the PySpark shell to link Python APIs with the
Spark core and initiate a Spark context. Spark is the engine used to realize cluster computing, while PySpark is
Python's library for using Spark.
PySpark is the Python library for Spark programming. PySpark allows developers to write Spark
applications using Python, and it provides a high-level API for distributed data processing that is
built on top of the Spark core engine.

PySpark provides a convenient interface for working with large datasets and distributed
computing, and it can be used for a wide range of tasks including data cleaning and
transformation, feature extraction, machine learning, and graph processing.

The main data structure in PySpark is the Resilient Distributed Dataset (RDD), which is a fault-
tolerant collection of elements that can be processed in parallel. PySpark also provides a
DataFrame API, which is built on top of RDD and provides a more convenient and higher-level
interface for data manipulation and analysis.

PySpark also provides a powerful interactive shell, which allows developers to easily run and
debug Spark applications. Additionally, PySpark provides support for popular Python libraries
such as NumPy and Pandas, which makes it easy to integrate existing Python code with Spark.

PySpark can be used with a standalone Spark cluster or with other cluster managers like Hadoop
YARN, Apache Mesos, or Kubernetes.
Business benefits of using PySpark include:
1. Speed: PySpark's in-memory data processing capabilities allow it to perform
computations much faster than traditional disk-based systems like Hadoop MapReduce.
This can lead to significant time and cost savings for businesses that need to process
large amounts of data quickly.
2. Flexibility: PySpark's support for a wide range of data sources and its ability to perform
both batch and real-time processing make it a versatile tool for a variety of big data use
cases.
3. Ease of use: PySpark's high-level API and support for popular Python libraries like
NumPy and Pandas make it easy for developers to build and maintain complex big data
applications.
4. Scalability: PySpark is designed to handle large amounts of data and can be easily
scaled out across a cluster of machines.

However, there are also some challenges that businesses may face when using PySpark, including:

1. Complexity: PySpark is a powerful tool, but it can be complex to set up and configure. It
also requires a certain level of expertise to use effectively.
2. Lack of support: PySpark is an open-source project and while it has a large community,
it may not have the same level of support as some commercial options.
3. Limited machine learning libraries: PySpark comes with built-in machine learning
libraries, but these may not be as extensive or advanced as those found in other
popular Python libraries like scikit-learn.
4. Resource requirements: PySpark's in-memory processing capabilities can require a large
amount of memory and CPU resources, which can be a challenge for businesses with
limited resources.
PySpark Architecture:
Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are called “Workers”.
When you run a Spark application, Spark Driver creates a context that is an entry point to your application, and all
operations (transformations and actions) are executed on worker nodes, and the resources are managed by Cluster
Manager.

Cluster Manager Types

As of writing this Spark with Python (PySpark) tutorial, Spark supports below cluster managers:

 Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
 Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark applications.
 Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
 Kubernetes – an open-source system for automating deployment, scaling, and management of containerized
applications.
PySpark Modules & Packages

Modules & packages

 PySpark RDD (pyspark.RDD)


 PySpark DataFrame and SQL (pyspark.sql)
 PySpark Streaming (pyspark.streaming)
 PySpark MLlib (pyspark.ml, pyspark.mllib)
 PySpark GraphFrames (GraphFrames)
 PySpark Resource (pyspark.resource) It’s new in PySpark 3.0

PySpark – What is SparkSession?

Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames.
Prior to 2.0, SparkContext used to be the entry point. This section focuses on explaining what SparkSession is,
how to create a SparkSession, and how to use the default SparkSession spark variable from the pyspark shell.

What is SparkSession

SparkSession was introduced in version 2.0. It is the entry point to underlying PySpark functionality for
programmatically creating PySpark RDDs and DataFrames. Its object, spark, is available by default in the
pyspark shell, and it can be created programmatically using SparkSession.
1. SparkSession

With Spark 2.0 a new class, SparkSession (from pyspark.sql import SparkSession), was introduced.
SparkSession is a combined class for all the different contexts we used to have prior to the 2.0 release
(SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext,
and the other contexts defined prior to 2.0.
As mentioned at the beginning, SparkSession is the entry point to PySpark, and creating a SparkSession
instance would be the first statement you write to program with RDDs, DataFrames, and Datasets.
A SparkSession is created using the SparkSession.builder builder pattern.
Though SparkContext used to be the entry point prior to 2.0, it has not been completely replaced by
SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should
also know that SparkSession internally creates a SparkConf and a SparkContext with the configuration
provided to SparkSession.

SparkSession also includes all the APIs available in different contexts –

 SparkContext,
 SQLContext,
 StreamingContext,
 HiveContext.
How many SparkSessions can you create in a PySpark application?
You can create as many SparkSession objects as you want in a PySpark application, using
either the SparkSession.builder pattern or newSession() on an existing session. Multiple Spark session
objects are needed when you want to keep PySpark tables (relational entities) logically
separated.
2. SparkSession in PySpark shell
By default, the PySpark shell provides a "spark" object, which is an instance of the SparkSession
class. We can use this object directly wherever it is required in the shell. Start your "pyspark"
shell from the $SPARK_HOME/bin folder by entering the pyspark command.
Once you are in the PySpark shell, enter the command below to get the PySpark version.

>>>spark.version
3.1.1
Similar to the PySpark shell, in most of the tools, the environment itself creates a default
SparkSession object for us to use so you don’t have to worry about creating a
SparkSession object.

3. Create SparkSession
In order to create a SparkSession programmatically (in a .py file) in PySpark, you need to use
the builder pattern SparkSession.builder, as explained below. The getOrCreate() method
returns an already existing SparkSession; if none exists, it creates a new SparkSession.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ex1").getOrCreate()

master() – If you are running it on a cluster, you need to pass your master name as an
argument to master(); usually this would be either yarn or mesos, depending on your cluster
setup.
 Use local[x] when running in Standalone mode. x should be an integer value greater than 0;
it represents how many worker threads (and hence default partitions for RDDs and
DataFrames) to use. Ideally, x should be the number of CPU cores you have.
appName() – Used to set your application name.
getOrCreate() – Returns a SparkSession object if one already exists, and creates a new
one if it does not.
Note: SparkSession object spark is by default available in the PySpark shell.
4. Create Another SparkSession
You can also create a new SparkSession using the newSession() method. This uses the same
app name and master as the existing session. The underlying SparkContext will be the same for
both sessions, as you can have only one context per PySpark application.
spark2 = spark.newSession()
print(spark2)
This always creates a new SparkSession object.

5. Get Existing SparkSession

You can get the existing SparkSession in PySpark using builder.getOrCreate(), for example:

spark3 = SparkSession.builder.getOrCreate()
print(spark3)

6. Using Spark Config


If you want to set some configurations on the SparkSession, use the config() method.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ex1") \
    .config("spark.some.config.option", "config-value").getOrCreate()

7. Create PySpark DataFrame


SparkSession also provides several methods to create Spark DataFrames and Datasets.
The example below uses the createDataFrame() method, which takes a list of data.
# Create a DataFrame from a list of tuples
df = spark.createDataFrame([("Scala", 25000), ("Spark", 35000), ("PHP", 21000)])
df.show()

# Output
#+-----+-----+
#|   _1|   _2|
#+-----+-----+
#|Scala|25000|
#|Spark|35000|
#|  PHP|21000|
#+-----+-----+
PySpark SparkContext Explained
pyspark.SparkContext is the entry point to PySpark functionality; it is used to communicate with the cluster
and to create RDDs, accumulators, and broadcast variables. This section shows how to create a PySpark
SparkContext, with examples. Note that you can create only one SparkContext per JVM; in order to create
another, you first need to stop the existing one using the stop() method.

The Spark driver program creates and uses SparkContext to connect to the cluster manager, to submit PySpark
jobs, and to know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart
of the PySpark application.

1. SparkContext in PySpark shell


By default, the PySpark shell creates and provides the sc object, which is an instance of the
SparkContext class. We can use this object directly wherever it is required, without needing to
create one.

>>> sc.appName

Similar to the PySpark shell, in most tools, notebooks, and Azure Databricks, the
environment itself creates a default SparkContext object for us to use, so you don't have
to worry about creating a PySpark context.

2. Create SparkContext in PySpark


Since PySpark 2.0, creating a SparkSession creates a SparkContext internally and exposes
it through the sparkContext variable.
At any given time, only one SparkContext instance should be active per JVM. In case you
want to create another, you should stop the existing SparkContext using stop() before
creating a new one.

# Create a SparkSession from the builder; its SparkContext is exposed as spark.sparkContext

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ex1").getOrCreate()
print(spark.sparkContext)
print("spark appName: " + spark.sparkContext.appName)

You can create any number of SparkSession objects; however, underneath all of them there will be only
one SparkContext.
3. Stop PySpark SparkContext
You can stop a SparkContext by calling its stop() method. As explained above, you can
have only one SparkContext per JVM. If you want to create another, you need to shut
the existing one down first using stop() and then create a new SparkContext.

# Stop the SparkContext using its stop() method

spark.sparkContext.stop()

When PySpark executes this statement, it logs the message INFO SparkContext:
Successfully stopped SparkContext to the console or to a log file. If you try to create
multiple SparkContexts, you will get the error below.
ValueError: Cannot run multiple SparkContexts at once
5. Create PySpark RDD
Once you have a SparkContext object, you can create a PySpark RDD in several ways,
below I have used the range() function.

# Create an RDD from a range of numbers
rdd = spark.sparkContext.range(1, 5)
print(rdd.collect())
# Output
# [1, 2, 3, 4]

6. SparkContext Commonly Used Variables


applicationId – Returns a unique ID of a PySpark application.
version – Version of PySpark cluster where your job is running.
uiWebUrl – Provides the URL of the Spark Web UI started by the SparkContext.
7. SparkContext Commonly Used Methods
accumulator(value[, accum_param]) – Creates a PySpark accumulator variable with the
specified initial value. Only the driver can read accumulator values.
broadcast(value) – Creates a read-only PySpark broadcast variable. This will be broadcast to the
entire cluster. You can broadcast a variable to a PySpark cluster only once.
emptyRDD() – Creates an empty RDD.
getOrCreate() – Creates or returns a SparkContext.
hadoopFile() – Returns an RDD of a Hadoop file.
newAPIHadoopFile() – Creates an RDD for a Hadoop file with a new API InputFormat.
sequenceFile() – Gets an RDD for a Hadoop SequenceFile with given key and value
types.
setLogLevel() – Changes the log level to debug, info, warn, error, or fatal.
textFile() – Reads a text file from HDFS, local, or any Hadoop-supported file system
and returns an RDD.
union() – Unions two RDDs.
wholeTextFiles() – Reads the text files in a folder from HDFS, local, or any Hadoop-
supported file system and returns an RDD of Tuple2. The first element of the tuple
contains the file name and the second element contains the content of the text file.
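A brief sketch of a few of these methods, assuming the spark session created earlier (the sample values are only illustrative):

# Sketch: accumulator, broadcast, and emptyRDD in action
sc = spark.sparkContext

acc = sc.accumulator(0)                                  # accumulator with initial value 0
sc.parallelize([1, 2, 3]).foreach(lambda x: acc.add(x))  # tasks add to it
print(acc.value)                                         # 6 -- only the driver reads the value

bc = sc.broadcast({"a": 1, "b": 2})                      # read-only broadcast variable
print(bc.value["a"])                                     # 1

print(sc.emptyRDD().isEmpty())                           # True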

What is PySpark SparkConf?


To run a Spark application on a local machine or a cluster, we need to set a few
configurations and parameters; this is what SparkConf helps with. Basically, it provides
the configurations needed to run a Spark application.
 Code
For PySpark, here is the code block which shows the signature of the SparkConf class:
class pyspark.SparkConf (
loadDefaults = True,
_jvm = None,
_jconf = None
)
Basically, with SparkConf() we first create a SparkConf object, which loads values
from spark.* Java system properties as well. Hence, by using the SparkConf
object, we can set different parameters, and those parameters take priority
over the system properties.
There are also setter methods, which support chaining, in the SparkConf class.
For example, we can write conf.setAppName("PySpark App").setMaster("local").
However, a SparkConf object cannot be modified by any user once it is passed to
Apache Spark.
Attributes of PySpark SparkConf
Here are the most commonly used attributes of SparkConf:

i. set(key, value) – Sets a configuration property.

ii. setMaster(value) – Sets the master URL.

iii. setAppName(value) – Sets the application name.

iv. get(key, defaultValue=None) – Gets the configuration value of a key.

v. setSparkHome(value) – Sets the Spark installation path on worker nodes.

The following code shows how to create SparkConf and SparkContext objects as part of an
application; you can also run it in the interactive pyspark shell to validate the configuration:
from pyspark import SparkConf,SparkContext

conf = SparkConf().setAppName("Spark Demo").setMaster("local")

sc = SparkContext(conf=conf)
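As a small follow-up sketch (the extra key below is only illustrative), configuration values can be read back with get(), which accepts an optional default for keys that were never set:

# Sketch: reading configuration values back from the SparkConf object
print(conf.get("spark.app.name"))                     # Spark Demo
print(conf.get("spark.master"))                       # local
print(conf.get("spark.executor.cores", "not set"))    # default returned when the key is absent
print(sc.appName)                                     # the name the SparkContext actually uses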

PySpark RDD
What is RDD (Resilient Distributed Dataset)?


RDD (Resilient Distributed Dataset) is a fundamental building block of PySpark; it is a fault-
tolerant, immutable distributed collection of objects. Immutable means once you create an RDD
you cannot change it. Each RDD is divided into logical partitions, which can be computed
on different nodes of the cluster.

In other words, RDDs are a collection of objects similar to a list in Python, with the difference that an
RDD is computed on several processes scattered across multiple physical servers (also called nodes)
in a cluster, while a Python collection lives and is processed in just one process.

Additionally, RDDs provide an abstraction over data partitioning and distribution, designed to
run computations in parallel on several nodes. While doing transformations on an RDD we don't have to
worry about the parallelism, as PySpark provides it by default.

RDDs support basic operations such as map(), filter(), and persist(), and many
more. In addition, there are pair RDD functions that operate on RDDs of key-value pairs, such
as groupByKey() and join().
Note: RDDs can have a name and a unique identifier (id).
PySpark RDD Benefits: PySpark is widely adopted in the Machine Learning and Data
Science community due to its advantages over traditional Python programming.
In-Memory Processing
PySpark loads the data from disk, processes it in memory, and keeps it in memory; this is the
main difference between PySpark and MapReduce (which is I/O intensive). Between transformations, we
can also cache/persist the RDD in memory to reuse the previous computations, as sketched below.
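A minimal caching sketch, assuming the spark session created earlier:

# Sketch: cache an RDD so later actions reuse the in-memory data
nums = spark.sparkContext.parallelize(range(1, 101)).cache()
print(nums.count())   # first action computes and caches the partitions
print(nums.sum())     # reuses the cached data instead of recomputing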
Immutability
PySpark RDDs are immutable in nature, meaning once RDDs are created you cannot modify them. When
we apply transformations on an RDD, PySpark creates a new RDD and maintains the RDD lineage.
Fault Tolerance
PySpark operates on fault-tolerant data stores such as HDFS and S3, so if any RDD operation fails, it
automatically reloads the data from other partitions. Also, when PySpark applications run on a
cluster, failed tasks are automatically retried a certain number of times (as per the
configuration) so the application finishes seamlessly.
Lazy Evaluation
PySpark does not evaluate RDD transformations as they are encountered by the driver; instead it
records all the transformations it encounters in a DAG and evaluates them all when it
sees the first RDD action, as the sketch below shows.
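A minimal sketch, assuming the spark session created earlier:

# Sketch: transformations only build the DAG; nothing runs until an action
lazy_rdd = spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * 10)  # no job triggered yet
print(lazy_rdd.collect())   # [10, 20, 30] -- the map is evaluated only now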
Partitioning
When you create an RDD from data, PySpark by default partitions the elements of the RDD. By default it
creates as many partitions as there are cores available.
PySpark RDD Limitations
PySpark RDDs are not well suited for applications that make fine-grained updates to a state store, such as
storage systems for a web application. For these applications, it is more efficient to use systems that
perform traditional update logging and data checkpointing, such as databases. The goal of RDDs is to
provide an efficient programming model for batch analytics and to leave these asynchronous
applications to such specialized systems.
Creating RDD
RDDs are created primarily in two different ways:
 parallelizing an existing collection and
 referencing a dataset in an external storage system ( HDFS , S3 and many more).
Before we look into examples, let's first initialize a SparkSession using the builder pattern method
defined in the SparkSession class. While initializing, we need to provide the master and application
name as shown below. In a real-time application, you will pass the master from spark-submit instead of
hardcoding it in the Spark application.

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("local[1]").appName("ex1").getOrCreate()

master() – If you are running it on a cluster you need to pass your master name as an argument
to master(); usually this would be either yarn (Yet Another Resource
Negotiator) or mesos, depending on your cluster setup.
 Use local[x] when running in Standalone mode. x should be an integer value
greater than 0; this represents how many partitions it should create when
using RDD, DataFrame, and Dataset. Ideally, the x value should be the number of CPU cores
you have.
appName() – Used to set your application name.
getOrCreate() – Returns a SparkSession object if one already exists, and creates a new one if
it does not exist.
Note: Creating SparkSession object, internally creates one SparkContext per JVM.
Create RDD using sparkContext.parallelize()
By using the parallelize() function of SparkContext (sparkContext.parallelize()) you can create
an RDD. This function loads an existing collection from your driver program into a parallelized RDD.
This is a basic method to create an RDD and is used when you already have data in memory, either
loaded from a file or from a database. It requires all data to be present on the driver
program prior to creating the RDD.

RDD from list


#create Rdd from parallelize
data=[1,2,3,4,5,6,7,8,9,10,11,12]
rdd=spark.sparkContext.parallelize(data)

Create RDD using sparkContext.textFile(): Using the textFile() method we can
read a text (.txt) file into an RDD.

#create Rdd from external data source


rdd = spark.sparkContext.textFile("/path/textFile.txt")
Besides text files, we can also create an RDD from CSV, JSON, and other formats.
RDD Parallelize:
When we use the parallelize(), textFile(), or wholeTextFiles() methods
of SparkContext to initiate an RDD, it automatically splits the data into partitions based on resource
availability. When you run it on a laptop, it creates as many partitions as there are cores
available on your system.
getNumPartitions() – An RDD function which returns the number of partitions our dataset is split
into.

print("initial partition count: " + str(rdd.getNumPartitions()))


# Output: initial partition count: 2

Set parallelism manually – We can also set the number of partitions manually; all we need to do is
pass the number of partitions as the second parameter to these functions, for example
sparkContext.parallelize([1,2,3,4,56,7,8,9,12,3], 10), as shown below.
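A quick sketch verifying the manual partition count:

# Sketch: pass the partition count explicitly and verify it
rdd_manual = spark.sparkContext.parallelize([1, 2, 3, 4, 56, 7, 8, 9, 12, 3], 10)
print("manual partition count: " + str(rdd_manual.getNumPartitions()))   # 10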
PySpark RDD Operations
RDD transformations – Transformations are lazy operations; instead of updating an RDD, these
operations return another RDD.
RDD actions – Operations that trigger computation and return non-RDD values to the driver program.

RDD Transformations with example


Transformations on a PySpark RDD return another RDD, and transformations are lazy, meaning they
don't execute until you call an action on the RDD. Some transformations on RDDs
are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); these return a new RDD
instead of updating the current one.
In this PySpark RDD Transformation section of the tutorial, I will explain transformations using the
word count example. The example below walks through the different RDD transformations we are going to use.

First, create an RDD by reading a text file.


rdd = spark.sparkContext.textFile("/path/textFile.txt")
flatMap – The flatMap() transformation flattens the RDD after applying the function and returns a
new RDD. In the below example, it first splits each record by space and then flattens the result.
The resulting RDD contains a single word in each record.

rdd2 = rdd.flatMap(lambda x: x.split(" "))

map – The map() transformation is used to apply any complex operation, like adding a column or
updating a column; the output of a map transformation always has the same number of
records as the input.
In our word count example, we are adding a new column with value 1 for each word; the result of the
RDD is PairRDDFunctions, which contains key-value pairs, with a word of type String as the key and 1 of
type Int as the value.
rdd3=rdd2.map(lambda x: (x,1))

reduceByKey – reduceByKey() merges the values for each key using the function specified. In
our example, it reduces the values for each word by applying the sum function. The resulting RDD
contains unique words and their count.
rdd4=rdd3.reduceByKey(lambda a,b: a+b)

sortByKey – The sortByKey() transformation is used to sort RDD elements by key. In our example,
we first convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation and then apply
sortByKey, which sorts on the integer value. Finally, collect() with print returns all words in the
RDD and their count as key-value pairs.
rdd5 = rdd4.map(lambda x: (x[1], x[0])).sortByKey()

print(rdd5.collect())

filter – The filter() transformation is used to filter the records in an RDD. In our example we are
filtering all words that contain "a".
rdd6 = rdd5.filter(lambda x: 'a' in x[1])
print(rdd6.collect())
RDD Actions with example
RDD action operations return values from an RDD to the driver program. In other words,
any RDD function that returns a non-RDD value is considered an action.
In this section of the PySpark RDD tutorial, we will continue to use our word count example and
perform some actions on it.
count() – Returns the number of records in an RDD

print("count: " + str(rdd6.count()))

first() – Returns the first record.


firstrec = rdd6.first()
print("First Record: " + str(firstrec[0]) + "," + firstrec[1])

max() – Returns max record.


datmax = rdd6.max()
print("Max Record: " + str(datmax[0]) + "," + datmax[1])

take(n) – Returns the first n records as a list.
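For instance, continuing the word-count example, take(3) returns the first three (count, word) records as a Python list:

print(rdd6.take(3))   # first three records of the word-count RDD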


collect() – Returns all data from the RDD as a list. Be careful when you use this action while
working with a huge RDD containing millions or billions of records, as you may run out of memory on the
driver.
saveAsTextFile() – Using the saveAsTextFile action, we can write the RDD to a text file.
rdd6.saveAsTextFile("/tmp/wrdcount")
PySpark - MLlib
Apache Spark offers a Machine Learning API called MLlib. PySpark has this machine learning API in Python
as well. It supports different kinds of algorithms, which are mentioned below −
 mllib.classification − The spark.mllib package supports various methods for binary classification,
multiclass classification and regression analysis. Some of the most popular algorithms in classification
are Random Forest, Naive Bayes, Decision Tree, etc.
 mllib.clustering − Clustering is an unsupervised learning problem, whereby you aim to group subsets
of entities with one another based on some notion of similarity (see the sketch after this list).
 mllib.fpm − Frequent pattern matching is mining frequent items, itemsets, subsequences or other
substructures that are usually among the first steps to analyze a large-scale dataset. This has been an
active research topic in data mining for years.
 mllib.linalg − MLlib utilities for linear algebra.
 mllib.recommendation − Collaborative filtering is commonly used for recommender systems. These
techniques aim to fill in the missing entries of a user item association matrix.
 spark.mllib − It currently supports model-based collaborative filtering, in which users and products are
described by a small set of latent factors that can be used to predict missing entries. spark.mllib uses the
Alternating Least Squares (ALS) algorithm to learn these latent factors.
 mllib.regression − Linear regression belongs to the family of regression algorithms. The goal of
regression is to find relationships and dependencies between variables. The interface for working with
linear regression models and model summaries is similar to the logistic regression case.
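As one small, hedged example of the clustering API (the data points below are made up for illustration and assume a running SparkContext sc, e.g. from the pyspark shell):

# Sketch: train a k-means model on an RDD of feature vectors
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, 2, maxIterations=10)    # fit 2 clusters
print(model.clusterCenters)                          # learned cluster centres
print(model.predict([0.5, 0.5]))                     # cluster index for a new point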
PySpark - Serializers
Serialization is used for performance tuning on Apache Spark. All data that is sent over the network, written
to disk, or persisted in memory should be serialized. Serialization plays an important role in costly
operations. PySpark supports custom serializers for performance tuning. The following two serializers are
supported by PySpark −

MarshalSerializer: Serializes objects using Python’s Marshal Serializer. This serializer is faster than
PickleSerializer, but supports fewer datatypes.
class pyspark.MarshalSerializer

PickleSerializer
Serializes objects using Python’s Pickle Serializer. This serializer supports nearly any Python object, but may
not be as fast as more specialized serializers.
class pyspark.PickleSerializer
Let us see an example on PySpark serialization. Here, we serialize the data using MarshalSerializer.
--------------------------------------serializing.py-------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "serialization app", serializer = MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
--------------------------------------serializing.py-------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit serializing.py
Output − The output of the above command is −
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
