PySpark questions



1) PySpark architecture == driver program, cluster manager, worker node, executor, task

A) Driver Program
The driver program is the process that runs the main() function of the application
and creates the SparkContext object.
The purpose of the SparkContext is to coordinate the Spark application, which runs
as an independent set of processes on a cluster.
B) Cluster Manager
The role of the cluster manager is to allocate resources across applications.
Spark is capable of running on a large number of clusters.
C) Worker Node
The worker node is a slave node.
Its role is to run the application code in the cluster.
D) Executor
An executor is a process launched for an application on a worker node.
It runs tasks and keeps data in memory or on disk across them.
It reads and writes data to external sources.
Every application has its own executors.
E) Task
A task is a unit of work that is sent to one executor (a minimal example follows below).
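A minimal sketch of how these pieces fit together, using local mode in place of a real cluster manager; the application name and partition counts are illustrative, not part of the original notes.

```python
# The driver runs main(), creates the SparkSession/SparkContext, and plans the job;
# the cluster manager (here: local mode) provides executors on worker nodes;
# each partition of the data becomes a task that runs on an executor.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("architecture-demo")   # illustrative application name
    .master("local[2]")             # local mode stands in for a real cluster manager
    .getOrCreate()
)
sc = spark.sparkContext

rdd = sc.parallelize(range(10), numSlices=2)   # 2 partitions -> 2 tasks per stage
print(rdd.map(lambda x: x * x).sum())          # 285; the action triggers the tasks

spark.stop()
```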

RDD
Spark follows the master-slave architecture. Its cluster consists of a single
master and multiple slaves.

The Spark architecture depends upon two abstractions:

Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets are groups of data items that can be stored
in memory on worker nodes. Here,

Resilient: restores the data on failure.
Distributed: the data is distributed among different nodes.
Dataset: a group of data.
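A minimal sketch of the RDD abstraction; the data values and partition count are illustrative.

```python
# An RDD is an immutable, partitioned collection distributed across worker nodes.
# Transformations are lazy: they only extend the lineage (the DAG), which Spark
# replays to restore lost partitions on failure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["a", "b", "c", "d"], numSlices=2)   # distributed over 2 partitions
upper = rdd.map(lambda s: s.upper())                      # lazy transformation
print(upper.glom().collect())                             # [['A', 'B'], ['C', 'D']] -- one list per partition
print(upper.toDebugString())                              # the lineage used for recovery

spark.stop()
```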

Cache
In Spark, caching is a mechanism for storing data in memory to speed up access to
that data.
When you cache a dataset, Spark keeps the data in memory so that it can be quickly
retrieved the next time it is needed.
Caching is especially useful when you need to perform multiple operations on the
same dataset, as it eliminates the need to read the data from disk each time.
The persist() method allows you to specify the storage level for the persisted
data, such as memory-only or disk-only storage.
What is the difference between cache and persist in Spark?
Caching and persistence are both optimization techniques in Spark, but they differ
in their approach.
cache() stores the data at the default storage level (in memory), while persist()
gives more control by letting you choose the storage level (see the sketch below).
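A minimal sketch of cache() versus persist(); the DataFrames and sizes are illustrative, and the default storage level noted in the comment is an assumption about DataFrame caching.

```python
# cache() is shorthand for persist() at the default storage level, while
# persist() lets you pick the level explicitly (memory only, disk only, ...).
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").master("local[2]").getOrCreate()

df1 = spark.range(1_000_000)                     # illustrative datasets
df2 = spark.range(1_000_000)

cached = df1.cache()                             # default level (MEMORY_AND_DISK for DataFrames)
persisted = df2.persist(StorageLevel.DISK_ONLY)  # explicit level chosen by the caller

cached.count()                                   # the first action materializes the cache
persisted.count()

cached.unpersist()                               # release the data when done
persisted.unpersist()
spark.stop()
```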
Difference between cache and broadcast
Caching is used to store and reuse RDDs/DataFrames across multiple stages of a
Spark application, and it can be used for larger datasets.
Broadcasting, on the other hand, is suitable for efficiently sharing small,
read-only data across worker nodes, reducing data transfer and improving
performance.
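A minimal sketch of a broadcast variable; the lookup dictionary and its contents are illustrative.

```python
# A small read-only lookup table is broadcast once to every executor instead of
# being shipped with each task, unlike a cached dataset which stays partitioned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

country_names = {"IN": "India", "US": "United States"}   # small, read-only lookup
bc = sc.broadcast(country_names)

codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda c: bc.value.get(c, "unknown")).collect())
# ['India', 'United States', 'India']

spark.stop()
```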

There are five distinct types of join strategies; the three most common are
described below, with a hint example after the list:

1) Broadcast Hash Join (BHJ)
When a join is performed between two DataFrames and the size of one (or both) of
them lies within the broadcast threshold limit, the smaller DataFrame is broadcast
to every executor and joined locally, avoiding a shuffle.
2) Shuffle Hash Join (SHJ)
The Shuffle Hash Join goes through the following three phases:
1. Shuffle
2. Hash Table Creation
3. Hash Join
3) Sort Merge Join (SMJ)
The Sort Merge Join is the default join selection strategy when a join is
performed between two DataFrames.
The Sort Merge Join goes through the following three phases:
1. Shuffle
2. Sort
3. Merge
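A minimal sketch of nudging the join strategy with hints; the DataFrames and column names are illustrative.

```python
# Hints let you steer Spark toward a particular join strategy; explain() shows
# which strategy the optimizer actually picked.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").master("local[2]").getOrCreate()

orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country"])
countries = spark.createDataFrame([("IN", "India"), ("US", "United States")], ["country", "name"])

# Broadcast Hash Join: explicitly broadcast the small side.
bhj = orders.join(broadcast(countries), "country")

# Sort Merge Join: the default for larger inputs; can also be requested with a hint.
smj = orders.join(countries.hint("merge"), "country")

bhj.explain()   # physical plan shows a BroadcastHashJoin
smj.explain()   # physical plan shows a SortMergeJoin

spark.stop()
```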

What is repartition in Spark?

In Apache Spark, repartition is a transformation used to redistribute data within
RDDs or DataFrames, allowing for greater control over data distribution and
improved parallelism. It triggers a full shuffle of the data across the cluster.
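A minimal sketch of repartition on a DataFrame; the row count and target partition counts are illustrative.

```python
# repartition redistributes a DataFrame across a chosen number of partitions
# (a full shuffle), e.g. to increase parallelism before an expensive step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").master("local[4]").getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())       # initial partition count

df8 = df.repartition(8)                # full shuffle into 8 partitions
print(df8.rdd.getNumPartitions())      # 8

by_key = df.repartition(8, "id")       # can also partition by one or more columns
print(by_key.rdd.getNumPartitions())   # 8

spark.stop()
```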
