Unit 3
Developed at LinkedIn as a publish-subscribe messaging system to handle massive amounts of data, Apache Kafka® is today an open-source distributed event streaming platform used by over 80% of the Fortune 100.
Apache Kafka is a software platform based on a distributed streaming process. It is a publish-subscribe messaging system that enables the exchange of data between applications, servers, and processes. Apache Kafka was originally developed by LinkedIn and was later donated to the Apache Software Foundation; currently it is maintained by Confluent under the Apache Software Foundation. Apache Kafka addresses the problem of slow data communication between a sender and a receiver.
A messaging system is simply the exchange of messages between two or more parties, devices, etc. A publish-subscribe messaging system allows a sender to send/write a message and a receiver to read that message. In Apache Kafka, the sender is known as a producer, who publishes messages, and the receiver is known as a consumer, who consumes messages by subscribing to them.
A streaming process is the processing of data across parallelly connected systems. It allows different applications to process data in parallel, where one record is processed without waiting for the output of the previous record. A distributed streaming platform therefore simplifies the task of stream processing and parallel execution. A streaming platform such as Kafka has the following key capabilities:
1. Publish and subscribe to streams of records, similar to a message queue.
2. Store streams of records durably and in a fault-tolerant way.
3. Process streams of records as they occur.
To learn and understand Apache Kafka, one should know the following four core APIs:
Producer API: This API allows an application to publish streams of records to one or more topics.
Consumer API: This API allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API: This API allows an application to act as a stream processor: it consumes an input stream from one or more topics and produces an output stream to one or more output topics, effectively transforming input streams into output streams.
Connector API: This API builds reusable producers and consumers that connect Kafka topics to existing data systems or applications. (A minimal Python sketch of the Producer and Consumer APIs follows below.)
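As an illustration of the Producer and Consumer APIs, the hedged sketch below uses the third-party kafka-python client; this is an assumption rather than part of the original material, and the broker address localhost:9092 and topic name unit3-demo are placeholders that require a running Kafka broker.

# Hedged sketch: requires `pip install kafka-python` and a Kafka broker on localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "unit3-demo"  # placeholder topic name

# Producer API: publish a few records to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send(TOPIC, value=f"message-{i}".encode("utf-8"))
producer.flush()   # block until all buffered records are sent
producer.close()

# Consumer API: subscribe to the topic and read the records back.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating if no message arrives for 5 seconds
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
consumer.close()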
Why Apache Kafka
The following points best describe the need for Apache Kafka.
1. Apache Kafka is capable of handling millions of messages per second.
2. Apache Kafka works as a mediator between the source system and the target system. The source system (producer) sends data to Apache Kafka, which decouples it from the target system (consumer), which in turn consumes the data from Kafka.
3. Apache Kafka has extremely high performance, with latency values below 10 ms, which makes it well suited as a messaging backbone.
4. Apache Kafka has a resilient architecture that resolves many of the usual complications of data sharing.
5. Organizations such as Netflix, Uber, Walmart, and thousands of other firms make use of Apache Kafka.
6. Apache Kafka is able to maintain fault tolerance. Sometimes a consumer successfully receives a message delivered by the producer but fails to process it, for example because of a backend database failure or a bug in the consumer code. Because Kafka retains messages, the consumer can read the message again and reprocess the data, which resolves this problem. (A minimal replay sketch in Python follows.)
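As a hedged illustration of reprocessing, the sketch below again assumes the third-party kafka-python client, a broker on localhost:9092, and the placeholder topic unit3-demo; the process() function is a hypothetical stand-in for real consumer logic.

# Hedged sketch: rewind a kafka-python consumer so already-delivered messages can be reprocessed.
from kafka import KafkaConsumer

def process(payload):
    # stand-in for real processing, e.g. writing to a backend database
    print("processing", payload)

consumer = KafkaConsumer(
    "unit3-demo",                        # placeholder topic
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,            # commit offsets only after successful processing
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
consumer.poll(timeout_ms=1000)           # trigger partition assignment
consumer.seek_to_beginning()             # rewind so earlier messages are delivered again

for record in consumer:
    try:
        process(record.value)
        consumer.commit()                # only successfully processed messages are marked done
    except Exception:
        pass                             # uncommitted offsets can be consumed and reprocessed later

consumer.close()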
Kafka Streams Architecture
The architecture of Apache Kafka is designed to handle large volumes of data streaming in real-time. It
consists of several key components:
1. Producers: These are the applications or systems that generate and send data to Kafka. Producers send data to
Kafka topics, which are logical channels for storing and distributing data.
2. Brokers: These are the servers that form the Kafka cluster. They are responsible for receiving data from producers and storing it in topics. A Kafka cluster can have one or more brokers, each of which can hold replicas of the partitions stored in the topics.
3. Zookeeper: This is a separate service that is used to coordinate the Kafka brokers. It is responsible for
maintaining configuration information, detecting broker failures, and electing new leaders for partitions.
4. Consumers: These are the applications or systems that subscribe to and consume data from Kafka topics.
Consumers read data from topics, process it, and send it to other systems or applications.
5. Topics and partitions: Topics are the logical channels for storing and distributing data in Kafka. Each topic can
have one or more partitions, which are used to distribute data across multiple brokers for scalability and fault
tolerance.
6. Replication: Each partition in a topic is replicated across multiple brokers to ensure that data is not lost in case
of a broker failure.
In summary, the Kafka architecture is designed for large-scale data streaming. It is composed of producers that send data to topics, brokers that store the data in topics, ZooKeeper, which coordinates the brokers, consumers that read data from topics, topics and partitions that distribute the data, and replication that ensures fault tolerance.
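As a hedged illustration of how topics, partitions, and the replication factor come together when a topic is created, the sketch below uses the third-party kafka-python admin client (an assumption; the broker address and topic name are placeholders).

# Hedged sketch: create a topic with 3 partitions and a replication factor of 1
# (the replication factor cannot exceed the number of brokers in the cluster).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
topic = NewTopic(name="unit3-demo", num_partitions=3, replication_factor=1)
admin.create_topics([topic])
admin.close()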
Basically, by building on the Kafka producer and consumer libraries and leveraging the native capabilities of
Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity, Kafka
Streams simplifies application development.
Apache Kafka Use Cases
Apache Kafka is best suited to the following use cases:
1) Message Broker
Apache Kafka is capable of handling large amounts of similar messages or data, which gives it high throughput. As a publish-subscribe messaging system, Kafka also lets users read and write data conveniently.
2) Metrics
Apache Kafka is used to monitor operational data by producing centralized feeds of that data. Operational data here means monitoring anything from technology metrics to security logs to supplier information, and so on.
3) Website Activity Tracking
It is one of the most widely used Kafka use cases, because website activity usually creates a huge amount of data, generating messages for each page view and user activity. Kafka also ensures that data is successfully sent and received by both parties.
4) Event Sourcing
Apache Kafka supports the collection of huge amounts of log data, which makes it a crucial component of any event management system, including Security Information and Event Management (SIEM). Handling large amounts of log data also makes it an excellent backend for building applications.
5) Commit Logs
Apache Kafka is used for data replication between nodes and to restore data on failed nodes; it can also act as a pseudo commit log. For example, suppose a user is tracking device data from IoT sensors and finds that the database is not storing all of the data. The user can then replay the data from Kafka to fill in the missing or unstored information in the database.
6) Log Aggregation
Several organizations make use of Apache Kafka to collect logs from various services and make them
available to their multiple customers in a standard format.
7) Kafka Stream Processing
Various popular frameworks read data from a topic, process it, and write the processed data to a new topic. This new topic containing the processed data then becomes available to users and applications; frameworks such as Spark Streaming and Storm are commonly used for this processing.
Apache Kafka Applications
The following organizations are notable users of Apache Kafka:
LinkedIn
LinkedIn developed Apache Kafka in 2010. Since Kafka is a publish-subscribe messaging system, various LinkedIn products such as LinkedIn Today and the LinkedIn Newsfeed use it for message consumption.
Uber
Uber uses Kafka as a message bus to connect different parts of its ecosystem. Kafka helps match passengers and drivers: it collects information from the rider's app as well as the driver's app and then makes that information available to a variety of downstream consumers.
Twitter
Because Kafka fulfilled its requirements for data replication and durability, Twitter became one of the best-known users of Apache Kafka. Adopting Kafka led Twitter to vast resource savings, up to 75%, i.e., a significant cost reduction.
Netflix
Netflix uses Kafka in its Keystone pipeline. Keystone is a unified collection, event-publishing, and routing infrastructure used for stream and batch processing. The Keystone pipeline uses two sets of Kafka clusters: Fronting Kafka and Consumer Kafka. Fronting Kafka receives messages from producers, while Consumer Kafka contains topic subsets that are routed by Samza (an Apache framework) to the real-time consumers. In this way, Kafka keeps costs under control while providing lossless delivery through the data pipeline.
Oracle
Apache Kafka supports the Oracle Database as a Kafka consumer, and Oracle can also publish events to Kafka. Apache Kafka provides reliable and scalable data streaming, and Oracle users can easily retrieve data from a Kafka topic. Oracle developers are now able to implement staged data pipelines through OSB (Oracle Service Bus).
Mozilla
Mozilla Firefox is a free, open-source web browser that supports Windows, Linux, macOS, and many other operating systems. Mozilla uses Kafka as a backing data store. Kafka is expected to replace Mozilla's current production system for collecting performance and usage data from end users for projects such as Telemetry and Test Pilot.
Limitations of Apache Kafka
1. No complete set of monitoring tools: Apache Kafka does not ship with a complete set of monitoring and management tools, so new startups or enterprises may hesitate to work with it.
2. Message tweaking issues: The Kafka broker uses system calls to deliver messages to the consumer. If a message needs to be modified in flight, Kafka's performance is significantly reduced, so it works best when messages do not need to change.
3. No wildcard topic selection: Apache Kafka does not support wildcard topic selection; it matches only exact topic names. This makes it unable to address certain use cases.
4. Reduced performance: Compressing and decompressing the data flow on brokers and consumers reduces Kafka's performance and also affects its throughput.
5. Clumsy behaviour: Apache Kafka tends to behave clumsily when the number of queues in the cluster increases.
6. Missing message paradigms: Certain messaging paradigms, such as point-to-point queues and request/reply, are missing from Kafka for some use cases.
MapReduce
Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
The input to each phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.
Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process data. The traditional model is not suitable for processing huge volumes of scalable data and cannot be accommodated by standard database servers. Moreover, a centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected in one place and integrated to form the result dataset.
Example: consider the following input text for a word-count MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The final output of the MapReduce task, after all phases, is the following list of <word, count> pairs:
Bad 1
Class 1
Good 1
Hadoop 3
Is 2
To 1
Welcome 1
The input data goes through the following phases of MapReduce:
Input Splits: The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function that produces output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits (more details about input splits are given below) and prepare a list of the form <word, frequency>.
Shuffling: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output. In our example, identical words are clubbed together along with their respective frequencies.
Reducing: In this phase, output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value per key; in short, it summarizes the complete dataset. In our example, this phase calculates the total occurrences of each word.
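A minimal pure-Python sketch of these three phases (not Hadoop itself, just an illustration of the map, shuffle, and reduce logic applied to the example input above):

from collections import defaultdict

# Example input, one "split" per line.
splits = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

# Mapping: emit a <word, 1> pair for every word in every split.
mapped = []
for split in splits:
    for word in split.lower().split():
        mapped.append((word, 1))

# Shuffling: group the pairs by key so identical words end up together.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reducing: sum the counts for each word to get the final <word, total> output.
reduced = {word: sum(counts) for word, counts in shuffled.items()}
print(sorted(reduced.items()))
# e.g. [('bad', 1), ('class', 1), ('good', 1), ('hadoop', 3), ('is', 2), ('to', 1), ('welcome', 1)]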
One map task is created for each split, and it executes the map function for each record in the split.
It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we process the splits in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time. For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default).
Executing a map task writes its output to a local disk on the respective node, not to HDFS. The reason for choosing the local disk over HDFS is to avoid the replication that takes place with an HDFS store operation.
Map output is intermediate output that is processed by reduce tasks to produce the final output. Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
In the event of a node failure before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
A reduce task does not work on the concept of data locality. The output of every map task is fed to the reduce task, and the map output is transferred to the machine where the reduce task is running. On that machine, the output is merged and then passed to the user-defined reduce function.
Unlike map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes), so writing the reduce output does consume network bandwidth.
The complete execution process is controlled by two types of entities:
1. Jobtracker: acts like a master (responsible for the complete execution of a submitted job).
2. Multiple Task Trackers: act like slaves, each performing part of the job.
For every job submitted for execution in the system, there is one Jobtracker, which resides on the Namenode, and there are multiple Tasktrackers, which reside on the Datanodes.
How Hadoop MapReduce Works
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate this activity by scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
The task tracker's responsibility is to send progress reports to the job tracker. In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.
Advantages of using it over Traditional methods:
There are numerous advantages of MapReduce, but the following are the most prominent.
Scalability: The Hadoop MapReduce algorithm is extremely scalable, mostly because of the way it distributes and stores data; in addition, the servers running MapReduce tend to work in parallel, so they are much quicker and can serve a large number of clients at once.
Flexibility: The data stored and processed by MapReduce can easily be reused by other applications embedded with the main system.
Security and Authentication: Because the data is divided into chunks and is not stored on one server, it is much easier to keep it safe from malicious activity.
Cost-effective solution: MapReduce saves a lot of start-up costs, since only a limited amount of memory is required on each node to do the processing.
Simple Coding Model: With MapReduce, the programmer does not have to implement parallelism,
distributed data passing, or any of the complexities that they would otherwise be faced with. This
greatly simplifies the coding task and reduces the amount of time required to create analytical routines.
Supports Unstructured Data: Unstructured data is data that does not follow a specified format. If 20 percent of the data available to enterprises is structured data, the other 80 percent is unstructured, so unstructured data is most of the data you will encounter. Until recently, however, the technology didn't really support doing much with it except storing it or analyzing it manually.
Fault Tolerance: Because of its highly distributed nature, MapReduce is very fault tolerant. Typically the distributed file systems that MapReduce supports, along with the master process, enable MapReduce jobs to survive hardware failures.
If a system runs entirely on a single machine, there is no communication with other machines to go wrong, and the only points of failure are local: the machine's components may fail, or the software on the machine may exhibit a fault. In either case, there is often very little that can be done to recover automatically from these failures. A hardware failure will typically cause the system to halt, in which case there is no opportunity for any automatic recovery action. A software fault may be more recoverable, but without any remedial action being taken, a restart of the software will most likely see a repeat of the fault in the near future.
Conversely, consider a computer system that is distributed across multiple machines connected by a local area network such as Ethernet. Each machine processes data and produces a subset of results, working in conjunction with the others toward a common end goal. Passing data and messages between the machines is critical to the successful and efficient functioning of the system.
Because MapReduce processes simple key value pairs, it can support any type of data structure that fits into
this model.
Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large, sensitive data sets that require efficient handling. Hadoop is a framework that enables the processing of large data sets that reside in the form of clusters. Being a framework, Hadoop is made up of several modules supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
HDFS consists of two core components:
1. Name Node
2. Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and the hardware, thus working at the heart of the system.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps manage resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Managers and performs negotiations as required by the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over to the data and helps write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates a key-value-pair-based result, which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
Hive
The Hadoop ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, query, and analysis. Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
The main parts of Hive are:
Metastore – stores the metadata.
Driver – manages the lifecycle of a HiveQL statement.
Query compiler – compiles HiveQL into a Directed Acyclic Graph (DAG).
Hive server – provides a Thrift interface and a JDBC/ODBC server.
Pig
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL, and is a platform for structuring data flows and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.
Features of Apache Pig:
Extensibility – users can create their own functions for carrying out special-purpose processing.
Optimization opportunities – Pig allows the system to optimize execution automatically. This lets the user pay attention to semantics instead of efficiency.
Handles all kinds of data – Pig analyzes both structured and unstructured data.
HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that could have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
Since 1970, the RDBMS has been the solution for data storage and maintenance problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in various formats: structured, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
Hadoop random access databases: applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are databases that store huge amounts of data and access the data in a random manner.
What is HBase?
HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. HBase has a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS) and is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store data in HDFS either directly or through HBase. Data consumers read or access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.
HDFS vs HBase:
HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing. HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data. HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Column-oriented databases store data tables as sections of columns of data rather than as rows of data; in short, they have column families. Row-oriented databases are designed for a small number of rows and columns, whereas column-oriented databases are designed for huge tables.
HBase vs RDBMS:
HBase is schema-less; it does not have the concept of a fixed column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
Apache HBase is used to provide random, real-time read/write access to big data. It hosts very large tables on top of clusters of commodity hardware. Apache HBase is a non-relational database modeled after Google's Bigtable: just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
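As a hedged illustration of HBase's random read/write access from Python, the sketch below uses the third-party happybase client over HBase's Thrift gateway; this is entirely an assumption (happybase must be installed, an HBase Thrift server must be running on localhost, and the table and column-family names are placeholders).

# Hedged sketch: random writes and reads against HBase via the happybase Thrift client.
import happybase

connection = happybase.Connection("localhost")               # assumes a local HBase Thrift server
connection.create_table("sensor_readings", {"cf": dict()})   # placeholder table with one column family

table = connection.table("sensor_readings")
table.put(b"device-42", {b"cf:temperature": b"21.5"})         # random write keyed by row key
row = table.row(b"device-42")                                 # random read of a single row
print(row[b"cf:temperature"])

connection.close()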
Applications of HBase
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
Apache Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout include:
Clustering – takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering – mines user behaviour and makes product recommendations (e.g., Amazon recommendations).
Classification – learns from existing categorizations and then assigns unclassified items to the best category.
Frequent pattern mining – analyzes items in a group (e.g., items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
Apache Spark:
Spark is a platform that handles all the process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
It uses in-memory resources and is therefore faster than the previous framework (MapReduce) in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies, interchangeably.
Oozie
It is a workflow scheduler system for managing apache Hadoop jobs. Oozie
combines multiple jobs sequentially into one logical unit of work. Oozie framework
is fully integrated with apache Hadoop stack, YARN as an architecture center and
supports Hadoop jobs for apache MapReduce, Pig, Hive, and Sqoop.
In Oozie, users can create Directed Acyclic Graph of workflow, which can run in
parallel and sequentially in Hadoop. Oozie is scalable and can manage timely
execution of thousands of workflow in a Hadoop cluster. Oozie is very much
flexible as well. One can easily start, stop, suspend and rerun jobs. It is even
possible to skip a specific failed node or rerun it in Oozie.
There are two basic types of Oozie jobs:
PySpark Introduction:
What is PySpark? PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities; using PySpark we can run applications in parallel on a distributed cluster (multiple nodes).
In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.
Spark is basically written in Scala; later, due to its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with a lot of useful tools such as the Spyder IDE and Jupyter Notebook for running PySpark applications.
PySpark is used heavily in the machine learning and data science community, thanks to Python's vast machine learning libraries. Spark runs operations on billions and trillions of records on distributed clusters, up to 100 times faster than traditional Python applications.
Who uses PySpark? PySpark is widely used in the data science and machine learning community, as there are many widely used data science libraries written in Python, including NumPy and TensorFlow. It is also used for its efficient processing of large datasets. PySpark has been used by many organizations such as Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark is a tool created by the Apache Spark community for using Python with Spark. It allows working with RDDs (Resilient Distributed Datasets) in Python. It also offers the PySpark shell, which links the Python API with the Spark core and initializes the Spark context. Spark is the engine that realizes cluster computing, while PySpark is Python's library for using Spark.
PySpark is the Python library for Spark programming. PySpark allows developers to write Spark
applications using Python, and it provides a high-level API for distributed data processing that is
built on top of the Spark core engine.
PySpark provides a convenient interface for working with large datasets and distributed
computing, and it can be used for a wide range of tasks including data cleaning and
transformation, feature extraction, machine learning, and graph processing.
The main data structure in PySpark is the Resilient Distributed Dataset (RDD), which is a fault-
tolerant collection of elements that can be processed in parallel. PySpark also provides a
DataFrame API, which is built on top of RDD and provides a more convenient and higher-level
interface for data manipulation and analysis.
PySpark also provides a powerful interactive shell, which allows developers to easily run and
debug Spark applications. Additionally, PySpark provides support for popular Python libraries
such as NumPy and Pandas, which makes it easy to integrate existing Python code with Spark.
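A small, hedged sketch of that pandas integration (the data, app name, and local master are placeholders; pandas and PySpark must both be installed):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("pandas-interop").getOrCreate()

# Existing pandas data can be turned into a distributed Spark DataFrame...
pdf = pd.DataFrame({"language": ["Scala", "Spark", "PHP"], "fee": [25000, 35000, 21000]})
sdf = spark.createDataFrame(pdf)
sdf.show()

# ...and collected back to pandas after distributed processing.
result = sdf.filter(sdf.fee > 22000).toPandas()
print(result)

spark.stop()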
PySpark can be used with a standalone Spark cluster or with other cluster managers like Hadoop
YARN, Apache Mesos, or Kubernetes.
Business benefits of using PySpark include:
1. Speed: PySpark's in-memory data processing capabilities allow it to perform
computations much faster than traditional disk-based systems like Hadoop MapReduce.
This can lead to significant time and cost savings for businesses that need to process
large amounts of data quickly.
2. Flexibility: PySpark's support for a wide range of data sources and its ability to perform
both batch and real-time processing make it a versatile tool for a variety of big data use
cases.
3. Ease of use: PySpark's high-level API and support for popular Python libraries like
NumPy and Pandas make it easy for developers to build and maintain complex big data
applications.
4. Scalability: PySpark is designed to handle large amounts of data and can be easily
scaled out across a cluster of machines.
However, there are also some challenges that businesses may face when using PySpark, including:
1. Complexity: PySpark is a powerful tool, but it can be complex to set up and configure. It
also requires a certain level of expertise to use effectively.
2. Lack of support: PySpark is an open-source project and while it has a large community,
it may not have the same level of support as some commercial options.
3. Limited machine learning libraries: PySpark comes with built-in machine learning
libraries, but these may not be as extensive or advanced as those found in other
popular Python libraries like scikit-learn.
4. Resource requirements: PySpark's in-memory processing capabilities can require a large
amount of memory and CPU resources, which can be a challenge for businesses with
limited resources.
PySpark Architecture:
Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are called “Workers”.
When you run a Spark application, Spark Driver creates a context that is an entry point to your application, and all
operations (transformations and actions) are executed on worker nodes, and the resources are managed by Cluster
Manager.
As of writing this Spark with Python (PySpark) tutorial, Spark supports the following cluster managers:
Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and PySpark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
Kubernetes – an open-source system for automating deployment, scaling, and management of containerized
applications.
PySpark Modules & Packages
Since Spark 2.0, SparkSession has become an entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext used to be the entry point. Here, the focus is on explaining what SparkSession is, how to create a SparkSession, and how to use the default SparkSession spark variable from the pyspark-shell.
What is SparkSession
SparkSession was introduced in version 2.0. It is an entry point to the underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames. Its object, spark, is available by default in the pyspark-shell, and it can be created programmatically using SparkSession.
1. SparkSession
With Spark 2.0 a new class, SparkSession (from pyspark.sql import SparkSession), was introduced. SparkSession is a combined class for all the different contexts we used to have prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
As mentioned at the beginning, SparkSession is an entry point to PySpark, and creating a SparkSession instance would be the first statement you write to program with RDDs, DataFrames, and Datasets. SparkSession is created using the SparkSession.builder builder pattern.
Although SparkContext used to be the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates SparkConfig and SparkContext with the configuration provided with SparkSession. SparkSession combines:
SparkContext,
SQLContext,
StreamingContext,
HiveContext.
How many SparkSessions can you create in a PySpark application?
You can create as many SparkSessions as you want in a PySpark application, using either SparkSession.builder() or SparkSession.newSession(). Many Spark session objects are required when you want to keep PySpark tables (relational entities) logically separated.
2. SparkSession in PySpark shell
By default, the PySpark shell provides a "spark" object, which is an instance of the SparkSession class. We can use this object directly wherever required. Start your "pyspark" shell from the $SPARK_HOME/bin folder by entering the pyspark command.
Once you are in the PySpark shell, enter the command below to get the PySpark version.
>>> spark.version
3.1.1
Similar to the PySpark shell, in most tools the environment itself creates a default SparkSession object for us to use, so you don't have to worry about creating a SparkSession object.
3. Create SparkSession
In order to create a SparkSession programmatically (in a .py file) in PySpark, you need to use the builder pattern method builder(), as explained below. The getOrCreate() method returns an already existing SparkSession; if none exists, it creates a new SparkSession.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ex1").getOrCreate()
master() – If you are running on a cluster, you need to pass your master name as an argument to master(); usually this is either yarn or mesos, depending on your cluster setup.
Use local[x] when running in standalone mode. x should be an integer value greater than 0; this represents how many partitions should be created when using RDDs, DataFrames, and Datasets. Ideally, the x value should be the number of CPU cores you have.
appName() – Used to set your application name.
getOrCreate() – Returns a SparkSession object if one already exists, and creates a new one if it does not.
Note: SparkSession object spark is by default available in the PySpark shell.
4. Create Another SparkSession
You can also create a new SparkSession using the newSession() method on an existing session. It uses the same app name and master as the existing session, and the underlying SparkContext will be the same for both sessions, as you can have only one context per PySpark application.
spark2 = spark.newSession()
print(spark2)
This always creates a new SparkSession object.
You can also get the existing SparkSession using the builder:
spark3 = SparkSession.builder.getOrCreate()
print(spark3)
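The output shown next appears to come from a small DataFrame example; a hedged sketch that would produce it (assuming the sample data shown, created without explicit column names so Spark assigns _1 and _2) is:

# Hedged sketch: create a DataFrame from tuples; default column names become _1 and _2.
data = [("Scala", 25000), ("Spark", 35000), ("PHP", 21000)]
df = spark.createDataFrame(data)
df.show()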
# Output
#+-----+-----+
#| _1| _2|
#+-----+-----+
#|Scala|25000|
#|Spark|35000|
#| PHP|21000|
#+-----+-----+
PySpark SparkContext Explained
pyspark.SparkContext is an entry point to PySpark functionality that is used to communicate with the cluster and to create RDDs, accumulators, and broadcast variables. Here, you will learn how to create a PySpark SparkContext, with examples. Note that you can create only one SparkContext per JVM; in order to create another, you first need to stop the existing one using the stop() method.
The Spark driver program creates and uses SparkContext to connect to the cluster manager to submit PySpark jobs, and to know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart of the PySpark application.
>>> sc.appName
Similar to the PySpark shell, in most tools, notebooks, and Azure Databricks, the environment itself creates a default SparkContext object for us to use, so you don't have to worry about creating a PySpark context.
You can create any number of SparkSession objects; however, underlying all of them there will be only one SparkContext.
3. Stop PySpark SparkContext
You can stop the SparkContext by calling the stop() method. As explained above, you can have only one SparkContext per JVM. If you want to create another one, you need to shut the existing one down first using the stop() method and then create a new SparkContext.
When PySpark executes this statement, it logs the message INFO SparkContext: Successfully stopped SparkContext to the console or to a log file. If you try to create multiple SparkContexts, you will get the error below.
ValueError: Cannot run multiple SparkContexts at once;
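A minimal sketch of stopping and recreating a SparkContext (the app names are placeholders; it runs in local mode):

from pyspark import SparkContext

sc = SparkContext("local", "first-app")    # only one SparkContext may exist per JVM
sc.stop()                                  # stop it before creating another

sc2 = SparkContext("local", "second-app")  # now a new SparkContext can be created
print(sc2.appName)
sc2.stop()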
5. Create PySpark RDD
Once you have a SparkContext object, you can create a PySpark RDD in several ways; below, the range() function is used.
# create RDD
rdd = spark.sparkContext.range(1, 5)
print(rdd.collect())
# output
# [1, 2, 3, 4]
v. setSparkHome(value) – In order to set the Spark installation path on worker nodes, we use it.
The following code can be used to create SparkConf and SparkContext objects as part of our applications:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("conf app").setMaster("local[2]")  # placeholder app name and master
sc = SparkContext(conf=conf)
In other words, RDDs are a collection of objects similar to a list in Python, with the difference that an RDD is computed across several processes scattered over multiple physical servers, also called nodes, in a cluster, while a Python collection lives and is processed in just one process.
Additionally, RDDs provide a data abstraction of the partitioning and distribution of the data, designed to run computations in parallel on several nodes; while doing transformations on an RDD, we don't have to worry about parallelism, as PySpark provides it by default.
The basic operations available on RDDs include map(), filter(), persist(), and many more. In addition, there are pair-RDD functions that operate on RDDs of key-value pairs, such as groupByKey() and join().
Note: RDDs can have a name and a unique identifier (id).
PySpark RDD Benefits: PySpark is widely adopted in the machine learning and data science community due to its advantages over traditional Python programming.
In-Memory Processing
PySpark loads data from disk, processes it in memory, and keeps it in memory; this is the main difference between PySpark and MapReduce (which is I/O intensive). In between transformations, we can also cache/persist the RDD in memory to reuse previous computations.
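A small sketch of caching an RDD between transformations so later actions reuse the in-memory result (the data and app name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("cache-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1, 1001))
squares = rdd.map(lambda x: x * x).cache()   # keep the transformed RDD in memory

print(squares.count())   # first action computes and caches the result
print(squares.sum())     # second action reuses the cached partitions

spark.stop()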
Immutability
PySpark RDDs are immutable in nature, meaning that once RDDs are created you cannot modify them. When we apply transformations to an RDD, PySpark creates a new RDD and maintains the RDD lineage.
Fault Tolerance
PySpark operates on fault-tolerant data stores on HDFS, S3, etc., so if any RDD operation fails, it automatically reloads the data from other partitions. Also, when PySpark applications run on a cluster, PySpark task failures are automatically retried a certain number of times (as per the configuration), and the application finishes seamlessly.
Lazy Evaluation
PySpark does not evaluate RDD transformations as the driver encounters them; instead, it keeps all the transformations it encounters (as a DAG) and evaluates them all when it sees the first RDD action.
Partitioning
When you create an RDD from data, PySpark by default partitions the elements of the RDD. By default it partitions to the number of cores available.
PySpark RDD Limitations
PySpark RDDs are not well suited to applications that make updates to a state store, such as storage systems for a web application. For these applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases. The goal of RDDs is to provide an efficient programming model for batch analytics and to leave these asynchronous applications aside.
Creating RDD
RDDs are created primarily in two different ways:
parallelizing an existing collection, and
referencing a dataset in an external storage system (HDFS, S3, and many more).
Before we look into examples, let's first initialize a SparkSession using the builder pattern method defined in the SparkSession class. While initializing, we need to provide the master and application name, as shown earlier. In a real-time application, you will pass the master from spark-submit instead of hard-coding it in the Spark application.
Note: Creating a SparkSession object internally creates one SparkContext per JVM.
Create RDD using sparkContext.parallelize()
By using the parallelize() function of SparkContext (sparkContext.parallelize()) you can create an RDD. This function loads an existing collection from your driver program into a parallelized RDD. It is a basic way to create an RDD and is used when you already have data in memory, either loaded from a file or from a database; it requires all the data to be present on the driver program prior to creating the RDD.
Set partitions manually – We can also set the number of partitions manually; all we need to do is pass the number of partitions as the second parameter, for example sparkContext.parallelize([1,2,3,4,56,7,8,9,12,3], 10).
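A short sketch of both forms, checking the resulting partition count (the data and app name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("parallelize-demo").getOrCreate()

rdd_default = spark.sparkContext.parallelize([1, 2, 3, 4, 56, 7, 8, 9, 12, 3])
rdd_manual = spark.sparkContext.parallelize([1, 2, 3, 4, 56, 7, 8, 9, 12, 3], 10)

print(rdd_default.getNumPartitions())  # follows the default parallelism (here, 2 local cores)
print(rdd_manual.getNumPartitions())   # 10, as requested in the second argument

spark.stop()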
PySpark RDD Operations
RDD transformations – Transformations are lazy operations; instead of updating an RDD, these operations return another RDD.
RDD actions – Operations that trigger computation and return non-RDD values to the driver.
map – The map() transformation is used to apply any complex operation, such as adding a column or updating a column; the output of a map transformation always has the same number of records as its input.
In our word count example, we add a new column with value 1 for each word. The result is an RDD of key-value pairs (PairRDDFunctions), with the word as a key of type String and 1 as a value of type Int. (A short setup for rdd2 is sketched below.)
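The snippets below assume rdd2 already holds the individual words of the input; a hedged setup producing it from the example text used earlier (an in-memory list rather than a file, which is an assumption) could be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("wordcount").getOrCreate()

# One element per input line, then split each line into words.
rdd = spark.sparkContext.parallelize(["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"])
rdd2 = rdd.flatMap(lambda line: line.lower().split(" "))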
rdd3=rdd2.map(lambda x: (x,1))
reduceByKey – reduceByKey() merges the values for each key with the function specified. In
our example, it reduces the word string by applying the sum function on value. The result of our RDD
contains unique words and their count.
rdd4=rdd3.reduceByKey(lambda a,b: a+b)
sortByKey – The sortByKey() transformation is used to sort RDD elements by key. In our example, we first convert RDD[(String, Int)] to RDD[(Int, String)] using a map transformation and then apply sortByKey, which sorts on the integer value. Finally, collect() returns all the words in the RDD and their counts as key-value pairs.
rdd5 = rdd4.map(lambda x: (x[1], x[0])).sortByKey()
print(rdd5.collect())
filter – The filter() transformation is used to filter the records in an RDD. In our example, we filter all words that start with 'a'.
rdd6 = rdd3.filter(lambda x: x[0].startswith('a'))
print(rdd6.collect())
RDD Actions with example
RDD action operations return values from an RDD to the driver program. In other words, any RDD function that returns a non-RDD value is considered an action.
In this section of the PySpark RDD tutorial, we continue to use our word count example and perform some actions on it.
count() – Returns the number of records in an RDD.
print("count: " + str(rdd6.count()))
MarshalSerializer: Serializes objects using Python's Marshal serializer. This serializer is faster than PickleSerializer, but supports fewer datatypes.
class pyspark.MarshalSerializer
PickleSerializer
Serializes objects using Python's Pickle serializer. This serializer supports nearly any Python object, but may not be as fast as more specialized serializers.
class pyspark.PickleSerializer
Let us see an example of PySpark serialization. Here, we serialize the data using MarshalSerializer.
--------------------------------------serializing.py-------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "serialization app", serializer = MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
--------------------------------------serializing.py-------------------------------------
Command − The command is as follows −
$SPARK_HOME/bin/spark-submit serializing.py
Output − The output of the above command is −
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]