PySpark questions
A) Driver Program
The driver program is the process that runs the application's main() function
and creates the SparkContext object.
The SparkContext coordinates the Spark application, which runs as an
independent set of processes on the cluster (a minimal sketch of the driver
side follows this list).
B) Cluster Manager
The cluster manager allocates resources across applications. Spark can run on
several cluster managers, including Standalone, YARN, Mesos, and Kubernetes.
C) Worker Node
A worker node is any node in the cluster that runs application code.
D) Executor
An executor is a process launched for an application on a worker node.
It runs tasks and keeps data in memory or on disk across them.
It reads data from and writes data to external sources.
Each application has its own executors.
E) Task
A unit of work that will be sent to one executor.
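To make these pieces concrete, here is a minimal sketch of the driver side,
assuming a local cluster; the app name and the numbers are hypothetical
placeholders.

from pyspark.sql import SparkSession

# The driver program starts here: building the SparkSession creates the
# SparkContext, which coordinates the application on the cluster.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("architecture-demo") \
    .getOrCreate()

sc = spark.sparkContext  # the SparkContext created by the driver

# A simple job: Spark splits the work into tasks, one per partition,
# and sends each task to an executor running on a worker node.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.map(lambda x: x * 2).sum())

spark.stop()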
caching
In Spark, caching is a mechanism for storing data in memory to speed up access to
that data.
When you cache a dataset, Spark keeps the data in memory so that it can be
quickly retrieved the next time it is needed.
Caching is especially useful when you need to perform multiple operations on
the same dataset, as it eliminates the need to read the data from disk each
time.
The persist() method allows you to specify the level of storage for the cached
data, such as memory-only or disk-only storage.
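A rough sketch of caching in practice follows; the file path and column name
are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)  # hypothetical file

df.cache()   # marks the DataFrame for caching; nothing is stored yet
df.count()   # the first action reads from disk and populates the cache

# Later actions reuse the in-memory copy instead of re-reading the file.
df.groupBy("event_type").count().show()   # "event_type" is a hypothetical column
df.select("event_type").distinct().show()

df.unpersist()  # release the cached data when it is no longer needed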
difference between cache and persist
Caching and persistence are both optimization techniques in Spark, but they
differ in flexibility.
cache() stores the data at the default storage level (MEMORY_AND_DISK for
DataFrames, MEMORY_ONLY for RDDs), while persist() allows you to choose the
storage level explicitly.
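A small sketch of the difference, using storage levels from
pyspark.StorageLevel:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(1_000_000)
df.cache()  # shorthand for persist() at the default storage level

# persist() lets you pick the level explicitly, e.g. disk-only when the
# data is too large to hold in executor memory.
df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)

# To change an existing storage level, unpersist first.
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)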
difference between cache and broadcast
Caching is used to store and reuse RDDs/DataFrames across multiple stages of a
Spark application, and it can be used for larger datasets.
Broadcasting, on the other hand, is suited to efficiently sharing small,
read-only data across worker nodes, reducing data transfer and improving
performance.
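For example, a broadcast join ships the small table to every executor so the
large table never has to be shuffled; the file names and join key below are
hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

large_df = spark.read.parquet("transactions.parquet")    # hypothetical large dataset
lookup_df = spark.read.parquet("country_codes.parquet")  # hypothetical small lookup

# broadcast() hints Spark to copy the small table to every executor, so the
# join happens locally without shuffling the large table across the network.
joined = large_df.join(broadcast(lookup_df), on="country_code", how="left")
joined.show()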