Spark A To Z
Deepa Vasanthkumar
V 2.
A

Action: An operation that triggers execution of the recorded RDD transformations and returns a result to the driver program or writes it to storage.
Adaptive Query Execution (AQE): A feature in Spark that dynamically adjusts query plans at runtime based on the actual data being processed, leading to optimized performance.
B

Broadcast variables: Variables cached on each machine, instead of shipping a copy with every task, to improve performance when tasks across stages need the same data.
C

Catalyst Optimizer: The query optimization framework in Spark SQL that applies a series of optimization rules to improve the execution plan of queries.
D

Dataset: A strongly-typed, immutable collection of objects that can be manipulated using functional transformations (map, flatMap, filter, etc.).
Delta Lake: An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads, enabling reliable data lakes with data versioning and time travel.
DAG (Directed Acyclic Graph): A representation of a sequence of computations to be performed on data. In Spark, a DAG is used to track the lineage of operations and optimize execution.
Dependency: Dependencies in Spark define the relationship between RDDs and are used to determine how data should be recomputed in case of a failure. There are two types of dependencies: narrow (e.g., map, filter) and wide (e.g., reduceByKey, join).
Dynamic Allocation: A feature in Spark that allows the number of executors to be adjusted dynamically based on the workload. It helps in optimizing resource usage and cost.
E

Event Time: The time at which an event actually occurred, carried in the data itself rather than assigned at processing time; Spark Structured Streaming supports event-time processing.
Execution Plan: A sequence of steps generated by the Spark Catalyst optimizer for executing a query. It includes the logical and physical plans that describe how Spark will execute the query.
F

FlatMap: A transformation that applies a function to each element of an RDD or DataFrame and flattens the resulting sequences into a single collection.
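A pure-Python sketch (not the actual Spark API) of the difference between map and flatMap on a small collection:

```python
# Pure-Python illustration of map vs. flatMap semantics (not Spark code).
lines = ["hello world", "spark a to z"]

# map: one output element per input element (here, a list of lists)
mapped = [line.split() for line in lines]

# flatMap: apply the function, then flatten the results into one sequence
flat_mapped = [word for line in lines for word in line.split()]

print(mapped)       # [['hello', 'world'], ['spark', 'a', 'to', 'z']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'a', 'to', 'z']
```

The same input produces nested output under map and a single flat sequence under flatMap, which is exactly the distinction between the two Spark transformations.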
Foreach: An action that applies a function to each element of the dataset. It is often used for side effects such as updating an accumulator or interacting with external storage.
G

GraphX: A component of Spark for graph processing and analysis.
H
An open-source framework for
distributed storage and processing of
Hadoop large
datasets. Spark can run on Hadoop
clusters and use Hadoop's HDFS.
DEEPA
vasanthkumar
HDFS (Hadoop Distributed File System): A distributed file system designed to store large datasets across multiple nodes. Spark can read from and write to HDFS.
Hash Partitioning: A method of dividing data into partitions based on the hash value of keys. Spark uses it to distribute data evenly across partitions for parallel processing.
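A minimal pure-Python sketch of the idea behind hash partitioning (Spark's HashPartitioner assigns a key to partition hash(key) mod numPartitions, using a non-negative modulo):

```python
# Sketch of hash partitioning: route each key to a partition by
# hash(key) mod numPartitions. Records sharing a key always land
# in the same partition, so key-based work can run partition-locally.
def partition_for(key, num_partitions):
    # Python's % already yields a non-negative result for a positive divisor
    return hash(key) % num_partitions

num_partitions = 4
records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]

partitions = {i: [] for i in range(num_partitions)}
for key, value in records:
    partitions[partition_for(key, num_partitions)].append((key, value))
```

Because the assignment depends only on the key's hash, both "apple" records are guaranteed to end up in the same partition.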
I

In-memory Computing: Storing data in memory to improve performance instead of reading from disk storage.
Interactive Query: The ability to perform ad-hoc queries on large datasets and get results quickly. Spark SQL and DataFrames are often used for interactive querying.
J

Job: A sequence of tasks submitted to the cluster for execution, generated by a Spark action.
Java Virtual Machine (JVM): The virtual machine that runs Java bytecode and other JVM-based languages. Spark runs on the JVM and uses it for executing its tasks and applications.
K

Kryo: A serialization library used by Spark for fast and efficient serialization of objects.
Key: The attribute or field used to partition or organize data in key-value pair RDDs or DataFrames. Operations like groupByKey and join are based on keys.
L

Lazy Evaluation: A technique in Spark where transformations on RDDs are not executed immediately but are recorded in a lineage graph; computation is deferred until an action triggers execution.
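A toy pure-Python sketch of lazy evaluation (not Spark's implementation): transformations only record themselves in a lineage, and an action such as collect() runs the whole chain at once.

```python
# Toy lazily-evaluated collection: map/filter record work, collect executes it.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # recorded transformations, not yet run

    def map(self, f):
        return ToyRDD(self._data, self._lineage + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):  # the "action": execute the recorded lineage now
        out = list(self._data)
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
# Nothing has been computed yet; collect() triggers execution:
print(rdd.collect())  # [6, 8]
```

Recording the lineage instead of executing eagerly is what lets Spark optimize the whole chain and recompute lost partitions after a failure.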
Livy: A REST service for Apache Spark that allows remote applications to submit Spark jobs, monitor their status, and retrieve results programmatically. It simplifies integration with Spark clusters.
M

MapReduce: A programming model for processing large datasets. Spark can execute MapReduce-style tasks much faster due to its in-memory computing capabilities.
Map: A narrow transformation that applies a function to each element of an RDD or DataFrame, returning a new RDD or DataFrame with the results.
Mesos: Apache Mesos is a cluster manager that can dynamically share resources across multiple Spark applications. Spark can run on Mesos, leveraging its resource isolation and sharing capabilities.
M
The master node in Spark is the node
that hosts the SparkContext,
Master Node coordinates the execution of tasks
across worker nodes, and manages the
overall execution of Spark jobs.
Micro-batching: A stream-processing technique used in Spark Streaming where data is processed in small, finite-sized batches. It balances low-latency processing with efficient resource utilization.
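A pure-Python sketch of the micro-batching idea: group an unbounded stream into small batches and process each batch as a unit. (Spark Streaming batches by time interval; a count-based batch is used here only to keep the sketch simple.)

```python
# Sketch of micro-batching: slice a stream into small fixed-size batches
# and process each batch as one unit.
from itertools import islice

def micro_batches(stream, batch_size):
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(7)
batch_sums = [sum(b) for b in micro_batches(events, 3)]  # per-batch processing
print(batch_sums)  # [3, 12, 6]  i.e. sums of [0,1,2], [3,4,5], [6]
```

Each batch is processed with ordinary batch logic, which is how micro-batching reuses a batch engine for streaming workloads.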
N

Node: A single machine in a Spark cluster that can be a driver or a worker.
O

Overhead: The additional resources (such as CPU and memory) consumed by Spark framework operations, beyond what is directly used by user tasks.
P

Partition: A division of data in an RDD or DataFrame that can be processed in parallel.
P
A sequence of data processing steps,
often used in machine learning
Pipeline
workflows. Spark's MLlib provides APIs
to create and manage pipelines.
DEEPA
Predicate: A function used to filter data based on a specified condition, such as in filter operations on RDDs or DataFrames.
Q

Query: A request for data retrieval and processing in Spark, often written in SQL or using the DataFrame/Dataset APIs.
R

RDD (Resilient Distributed Dataset): The fundamental data structure in Spark, representing an immutable, distributed collection of objects.
ReduceByKey: A wide transformation that merges the values of each key using an associative reduce function. It is more efficient than groupByKey because it performs partial aggregation locally before shuffling the data.
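A pure-Python sketch of why reduceByKey beats groupByKey: each partition combines its own values first (a map-side combine), so only one (key, partial) pair per key per partition would need to be shuffled, instead of every raw record.

```python
# Sketch of reduceByKey's map-side combine, using two toy "partitions".
from operator import add

partitions = [
    [("a", 1), ("b", 2), ("a", 3)],   # partition 0
    [("a", 4), ("b", 5)],             # partition 1
]

def combine_locally(partition, func):
    acc = {}
    for key, value in partition:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# 1) partial aggregation within each partition (before any shuffle)
partials = [combine_locally(p, add) for p in partitions]

# 2) "shuffle" only the small partial results and merge them by key
result = {}
for partial in partials:
    for key, value in partial.items():
        result[key] = add(result[key], value) if key in result else value

print(result)  # {'a': 8, 'b': 7}
```

Partition 0 already collapses ("a", 1) and ("a", 3) into ("a", 4) locally, so less data crosses the network than if all raw pairs were grouped first.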
S

SparkContext: The main entry point for Spark functionality, responsible for connecting to the cluster and creating RDDs.
Schema: The structure that defines the organization of data in a DataFrame or Dataset, including column names and data types.
Skew: A condition in which the distribution of data across partitions is highly uneven, causing some partitions to contain significantly more data than others.
T

Task: A unit of work that runs on a single executor and is a part of a job.
Task Scheduler: The component that coordinates the assignment of tasks to executor nodes in a cluster, optimizing task placement based on data locality and resource availability.
U

Union: A transformation in Spark that combines two RDDs or DataFrames into a single RDD or DataFrame by appending the rows of one to the other.
Unit Tests: Tests that verify individual pieces of Spark application logic in isolation, typically run against a local Spark session.
V

View: A temporary table in Spark SQL created from a DataFrame.
W

Worker Node: A node in a Spark cluster that executes tasks and returns results to the driver.
Write-ahead Log (WAL): A fault-tolerance mechanism in Spark Streaming that stores received records in a log before they are processed, so that the data can be recovered and replayed in case of failure.
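A pure-Python sketch of the write-ahead-log idea: append each record to durable storage before applying it, so that after a crash the in-memory state can be rebuilt by replaying the log. (Spark Streaming's WAL persists received data to reliable storage such as HDFS.)

```python
# Sketch of a write-ahead log: persist first, then apply; recover by replay.
log = []      # stands in for a durable, append-only log file
state = {}    # in-memory state that would be lost in a crash

def process(record):
    key, value = record
    state[key] = state.get(key, 0) + value

def receive(record):
    log.append(record)   # 1) write ahead: persist the record first
    process(record)      # 2) then apply it to in-memory state

for r in [("clicks", 1), ("clicks", 2), ("views", 5)]:
    receive(r)

# Simulate a crash that wipes in-memory state, then recover from the log.
state = {}
for r in log:
    process(r)

print(state)  # {'clicks': 3, 'views': 5}
```

Because every record hit the log before it touched the state, replaying the log reconstructs exactly the state that existed before the crash.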
X

XML: A markup language that Spark can read and write using libraries and custom parsers.
Y

YARN (Yet Another Resource Negotiator): A cluster management technology used by Spark for resource allocation and job scheduling.
Z

Zookeeper: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services; often used with Spark Streaming for managing Kafka offsets.
https://www.linkedin.com/in/deepa-vasanthkumar/
https://medium.com/@deepa.account