
Spark A to Z

Deepa Vasanthkumar

V 2.
A
Action
Operations that trigger the execution of the RDD transformations and return a result to the driver program or write it to storage.

API
A set of functions and protocols for building software and applications, allowing interaction with Apache Spark.

Apache Spark
An open-source distributed computing system for big data processing and analytics.
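For illustration, a minimal PySpark sketch (assuming a local SparkSession) showing how an action triggers the lazily built computation; the later sketches in this glossary reuse this spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("glossary-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)   # transformation only: nothing runs yet
print(doubled.count())               # action: runs the job, prints 4
print(doubled.collect())             # action: returns [2, 4, 6, 8] to the driver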

A
Adaptive Query Execution (AQE)
A feature in Spark that dynamically adjusts query plans at runtime based on the actual data being processed, leading to optimized performance.

Aggregation
A process of combining multiple values into a single value. Spark provides various aggregation functions such as sum(), avg(), count(), and custom aggregations using aggregate().
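A short DataFrame aggregation sketch, reusing the spark session from the Action example; the sales data and column names are made up for illustration:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
    ["region", "amount"])
sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.avg("amount").alias("avg_amount"),
    F.count("*").alias("rows")).show()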

B
Broadcast Variables
Variables cached on each machine, instead of shipping a copy with tasks, to improve performance when tasks across stages need the same data.
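A small sketch of a broadcast variable used as a lookup table (the data is illustrative):

lookup = {"US": "United States", "IN": "India"}            # small reference table
b_lookup = spark.sparkContext.broadcast(lookup)            # shipped to each executor once
codes = spark.sparkContext.parallelize(["US", "IN", "US"])
print(codes.map(lambda c: b_lookup.value.get(c, "unknown")).collect())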

Batch Processing
Processing of large blocks of data at once, as opposed to real-time or streaming data processing. Spark is often used for batch processing in ETL workflows.

Bucketing
A technique for partitioning data into a fixed number of buckets to optimize join operations and aggregation tasks.
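A hedged sketch of writing a bucketed table; the events DataFrame, user_id column, and table name are hypothetical, and bucketBy requires saveAsTable:

(events.write
    .bucketBy(16, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))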
B
Block
In Spark, a block is a unit of data storage. Blocks are pieces of data that are split and distributed across the cluster nodes. Each block can be processed independently in parallel.

Block Manager
The Block Manager is responsible for managing and storing blocks in Spark. It handles the storage of RDDs, shuffle data, and other intermediate data in memory or on disk.

Binning
Binning is a data preprocessing technique used to convert continuous variables into categorical variables by dividing the range of values into a series of intervals, or bins. This can help with data analysis and modeling.
B
BSP (Bulk Synchronous Parallel)
BSP is a computational model that divides computations into supersteps, with each superstep consisting of local computation, communication, and barrier synchronization. Spark's execution model can be seen as loosely following the BSP model.

Back Pressure
Back pressure is a mechanism used in Spark Streaming to handle situations where the data ingestion rate is higher than the data processing rate. It helps to control the flow of data to avoid overwhelming the system.
B
Barrier Execution Mode
Barrier execution mode in Spark allows for better coordination of stages that need to run concurrently. It is particularly useful for integrating with deep learning frameworks that require precise synchronization across all tasks.
C
Caching
The process of storing data in memory to speed up retrieval during processing.
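A minimal caching sketch:

df = spark.range(1_000_000)
df.cache()        # mark the DataFrame for in-memory storage
df.count()        # first action materializes the cache
df.count()        # later actions read from memory instead of recomputing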

Cluster Manager
Manages resources in a cluster, such as YARN, Mesos, or Spark's standalone cluster manager.

Core
The basic unit of parallelism in Spark. It is a thread running a task.
C
Catalyst Optimizer
The query optimization framework in Spark SQL that applies a series of optimization rules to improve the execution plan of queries.

Checkpointing
A process of saving the intermediate state of an RDD or DataFrame to reliable storage, which is useful for long-running computations and fault tolerance.
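A minimal RDD checkpointing sketch (the checkpoint directory is illustrative):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
squares = spark.sparkContext.parallelize(range(100)).map(lambda x: x * x)
squares.checkpoint()   # lineage is truncated once the data is saved
squares.count()        # the action triggers the checkpoint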

Columnar Storage
A data storage format that stores data in columns rather than rows, which is beneficial for analytical queries that typically access a subset of columns. Spark supports columnar formats like Parquet and ORC.


C
Coalesce
The coalesce() transformation is used to reduce the number of partitions in an RDD. It is more efficient than repartition() when reducing the number of partitions because it avoids a full shuffle.
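A short coalesce sketch:

df = spark.range(0, 1_000_000, numPartitions=8)
fewer = df.coalesce(2)                 # merge partitions without a full shuffle
print(fewer.rdd.getNumPartitions())    # 2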

Column
In Spark SQL and the DataFrame API, a column represents a single attribute or field in a dataset. Columns are used in expressions for transformations and aggregations.

Compression
Compression refers to the process of reducing the size of data to save storage space and improve I/O performance. Spark supports various compression codecs, including Snappy, Gzip, and LZ4.
D
DataFrame
A distributed collection of data organized into named columns, similar to a table in a relational database.
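A minimal DataFrame sketch (names and ages are made up; later sketches refer back to this people DataFrame):

people = spark.createDataFrame(
    [("Ada", 36), ("Linus", 54)],
    ["name", "age"])
people.select("name").where(people.age > 40).show()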

Dataset
A strongly-typed, immutable collection of objects that can be manipulated using functional transformations (map, flatMap, filter, etc.).

Driver Program
The main program that creates the SparkContext, connects to the cluster, and coordinates the execution of the tasks.
D
Delta Lake
An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads, enabling reliable data lakes with data versioning and time travel.

DAG (Directed Acyclic Graph)
A representation of a sequence of computations to be performed on data. In Spark, a DAG is used to track the lineage of operations and optimize execution.
D
Dependency
Dependencies in Spark define the relationship between RDDs and are used to determine how data should be recomputed in case of a failure. There are two types of dependencies: narrow (e.g., map, filter) and wide (e.g., reduceByKey, join).

Driver Node
The driver node is the machine where the Spark driver program runs. It coordinates the execution of tasks on the worker nodes and maintains information about the Spark application's state.
D
Dynamic Allocation
Dynamic allocation is a feature in Spark that allows the number of executors to be dynamically adjusted based on the workload. It helps in optimizing resource usage and cost.
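A hedged configuration sketch, assuming a cluster with an external shuffle service; the exact values are illustrative and these settings are normally applied when the session is first created:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.shuffle.service.enabled", "true")   # typically required for dynamic allocation
    .getOrCreate())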

Direct Stream
In Spark Streaming, a direct stream is a type of DStream (Discretized Stream) that directly pulls data from a source such as Apache Kafka, ensuring exactly-once semantics and better fault tolerance.

Dynamic Partition Pruning
Dynamic partition pruning is an optimization technique in Spark that prevents scanning of unnecessary partitions when reading data.
E
Executor
A distributed agent responsible for executing a task on a worker node and returning the results to the driver program.

ETL (Extract, Transform, Load)
A process in data warehousing and data integration that involves extracting data from source systems, transforming it to fit business needs, and loading it into a target data store.

Event Time
The time at which events actually occurred, as opposed to processing time when events are processed by the system. Spark Structured Streaming supports event-time processing.
E
Execution Plan
The execution plan is a sequence of steps generated by the Spark Catalyst optimizer for executing a query. It includes physical and logical plans that describe how Spark will execute the query.
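A quick way to inspect the plans Spark generates:

df = spark.range(10).filter("id > 5")
df.explain(True)   # prints the parsed, analyzed and optimized logical plans plus the physical plan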

Event Log
Event logs are logs that Spark generates to record events such as job start, job end, task start, and task end. These logs can be used for monitoring and debugging Spark applications.

Ephemeral Storage
Ephemeral storage refers to temporary storage that is used during the execution of a Spark application. It is not persistent and is typically used for intermediate data, such as shuffle files.
E
Edge Node
An edge node is a gateway node that sits between the user's local environment and the cluster. It is used to run client tools and host applications, such as Spark driver programs, that need to interact with the cluster.

Execution Memory
Execution memory in Spark is the memory used for performing computations and storing intermediate results. It is distinct from storage memory, which is used for caching data.

Execution Context
The execution context in Spark defines the environment in which Spark jobs run. It includes information about the cluster, resources, and configuration settings.
F
Fault Tolerance
The ability of Spark to recover from node failures and recompute lost data.

Functional Programming
A programming paradigm in Spark where functions are treated as first-class citizens and operations are performed using transformations.

FlatMap
A transformation that applies a function to each element of an RDD or DataFrame and returns a new RDD or DataFrame by flattening the results.
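A flatMap sketch splitting lines into words:

lines = spark.sparkContext.parallelize(["hello world", "spark a to z"])
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())   # ['hello', 'world', 'spark', 'a', 'to', 'z']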

F
Foreach
The foreach action is used to apply a function to each element of the dataset. It is often used for side effects such as updating an accumulator or interacting with external storage.

ForeachPartition
Similar to foreach, foreachPartition applies a function to each partition of the dataset, allowing for more efficient processing of partition-level data.

Fold
The fold action is an aggregate function in Spark that combines elements of the dataset using an associative function and a neutral "zero value" to start the aggregation. It is similar to reduce but allows for a starting value.
F
Framework
A collection of libraries and tools that provides a structured approach to building and managing applications. Spark is a big data processing framework.

Filter
A transformation that returns a new RDD or DataFrame containing only the elements that satisfy a given condition.
G
GraphX
A component of Spark for graph processing and analysis.

Gradient Descent
An optimization algorithm used in machine learning to minimize a function by iteratively moving towards the steepest descent.

GroupByKey
A transformation that groups the values of a key-value pair RDD by key. It returns a new RDD of (key, Iterable<values>) pairs.
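A groupByKey sketch on a pair RDD:

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())   # [('a', [1, 3]), ('b', [2])], order may vary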
G
Global View
A global view in Spark refers to viewing the entire dataset across all partitions, as opposed to looking at data within individual partitions.

Garbage Collection
Garbage collection in Spark refers to the automatic memory management process that reclaims memory occupied by objects that are no longer in use.
H
Hadoop
An open-source framework for distributed storage and processing of large datasets. Spark can run on Hadoop clusters and use Hadoop's HDFS.

Hive
A data warehousing solution built on top of Hadoop that allows querying and managing large datasets using SQL. Spark can use Hive for reading and writing data.
H
HDFS (Hadoop Distributed File System)
A distributed file system designed to store large datasets across multiple nodes. Spark can read from and write to HDFS.

HiveContext
A class in Spark that allows querying data using the HiveQL language. It is part of the Spark SQL module and provides compatibility with Hive.
H
Hash Partitioning
Hash partitioning is a method of dividing data into partitions based on the hash value of keys. It is used in Spark to distribute data evenly across partitions for parallel processing.

Hive Metastore
A database that stores metadata about Hive tables, such as schema and location. Spark SQL can integrate with the Hive Metastore to access this metadata.
I
In-memory Computing
Storing data in memory to improve performance instead of reading from disk storage.

Iterator
An object in Spark that allows traversing through a collection of elements one at a time.

InputFormat
A class in Hadoop (and used by Spark) that defines how input data is split and read into the system. Examples include TextInputFormat and SequenceFileInputFormat.
I
Interactive Query
Interactive query in Spark allows users to perform ad-hoc queries on large datasets and get results quickly. Spark SQL and DataFrames are often used for interactive querying.

Immutable
In Spark, RDDs (Resilient Distributed Datasets) are immutable, meaning once created, they cannot be changed. This immutability provides consistency and simplifies parallel processing.
J
Job
A sequence of tasks submitted to the cluster for execution, generated by a Spark action.

Join
A transformation that combines two RDDs or DataFrames based on a common key. Types of joins include inner join, outer join, left join, and right join.
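A DataFrame join sketch (tables and keys are illustrative):

users = spark.createDataFrame([(1, "Ada"), (2, "Linus")], ["id", "name"])
orders = spark.createDataFrame([(1, 9.99), (2, 7.25), (1, 4.50)], ["user_id", "amount"])
users.join(orders, users.id == orders.user_id, "inner").show()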

Job Scheduler
The component of Spark that handles the scheduling of jobs, dividing them into stages and tasks, and distributing them across the cluster for execution.
J
Java Virtual Machine (JVM)
The JVM is the virtual machine that runs Java bytecode and other JVM-based languages. Spark runs on the JVM and uses it for executing its tasks and applications.

JobServer
JobServer in Spark is a service that manages and runs Spark jobs. It provides an interface for submitting jobs, monitoring their execution, and managing job resources.
K
Kryo
A serialization library used by Spark for fast and efficient serialization of objects.

Kafka
A distributed streaming platform that Spark can integrate with for processing real-time data streams.

Kinesis
A real-time data streaming service provided by AWS. Spark can read data from and write data to Kinesis streams for real-time data processing.
K
Key
In Spark, a key typically refers to the attribute or field used to partition or organize data in key-value pair RDDs or DataFrames. Operations like groupByKey and join are based on keys.

Kurtosis
Kurtosis is a statistical measure of the "tailedness" of the probability distribution of a dataset. Spark MLlib provides functions for calculating kurtosis as part of statistical analysis.
L
Lazy Evaluation
A technique used in Spark where transformations on RDDs are not immediately executed but are recorded in a lineage graph for optimization.
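A sketch showing that nothing executes until an action is called:

nums = spark.sparkContext.parallelize(range(1000))
squared = nums.map(lambda x: x * x)           # recorded in the lineage, not executed
evens = squared.filter(lambda x: x % 2 == 0)  # still nothing has run
print(evens.count())                          # the action triggers the whole chain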

Lineage
A record of the transformations applied to an RDD, used for fault tolerance and recomputation.

Logical Plan
An abstract, high-level representation of a query that describes what operations need to be performed. Spark's Catalyst Optimizer generates and optimizes logical plans before converting them into physical plans for execution.
L
Livy
Livy is a REST service for Apache Spark that allows remote applications to submit Spark jobs, monitor their status, and retrieve results programmatically. It simplifies integration with Spark clusters.

Local Mode
Local mode in Spark refers to running Spark applications on a single machine using a single JVM process, without utilizing a distributed cluster. It is useful for testing and development.

Levenshtein Distance
Levenshtein distance is a metric used in Spark SQL and string manipulation for measuring the difference between two sequences. It is useful for fuzzy matching and similarity checks.
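A Levenshtein distance sketch using the built-in Spark SQL function:

from pyspark.sql import functions as F

words = spark.createDataFrame([("kitten", "sitting")], ["a", "b"])
words.select(F.levenshtein("a", "b").alias("distance")).show()   # 3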

M
MapReduce
A programming model for processing large datasets. Spark can execute MapReduce tasks much faster due to its in-memory computing capabilities.

MLlib
A machine learning library in Spark providing various algorithms and utilities for scalable machine learning.
M
Map
A narrow transformation that applies a function to each element of an RDD or DataFrame, returning a new RDD or DataFrame with the results.

MapPartitions
A transformation that applies a function to each partition of an RDD or DataFrame, rather than to each element. This can be more efficient for certain operations.
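A sketch contrasting map and mapPartitions:

nums = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)
print(nums.map(lambda x: x + 1).collect())                  # per element: [2, 3, 4, 5]
print(nums.mapPartitions(lambda it: [sum(it)]).collect())   # per partition: [3, 7]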

M
Mesos
Apache Mesos is a cluster manager that can dynamically share resources across multiple Spark applications. Spark can run on Mesos, leveraging its resource isolation and sharing capabilities.

Memory Management
Spark manages memory usage through different storage levels (e.g., MEMORY_ONLY, MEMORY_AND_DISK) and memory management policies to optimize performance and resource utilization.
M
Master Node
The master node in Spark is the node that hosts the SparkContext, coordinates the execution of tasks across worker nodes, and manages the overall execution of Spark jobs.

Micro-batching
Micro-batching is a stream processing technique used in Spark Streaming where data is processed in small, finite-sized batches. It balances low-latency processing with efficient resource utilization.
N
Node
A single machine in a Spark cluster that can be a driver or a worker.

Notebook
An interactive environment, such as Databricks Notebooks or Jupyter Notebooks, where users can write and execute Spark code, visualize data, and document their analysis.

Normalization
The process of structuring data to reduce redundancy and improve data integrity. In Spark, this often involves transforming and cleaning data before analysis.
N
Narrow Dependency
A narrow dependency (or narrow transformation) in Spark refers to transformations like map, filter, or flatMap that do not require data to be shuffled across partitions, making them more efficient.

NaN
NaN (Not a Number) is a special floating-point value used to represent undefined or unrepresentable numerical results. Spark handles NaN values in computations and transformations.

Namespace
In Spark SQL and Hive, a namespace is a logical container for tables, views, and other database objects. Namespaces help organize and manage data assets within a data catalog.
N
NodeManager
NodeManager is a component of Apache YARN (Yet Another Resource Negotiator) that runs on worker nodes in a Hadoop cluster. It manages resources (CPU, memory) and executes tasks allocated by the ResourceManager, including Spark tasks.

NumPy
NumPy is a numerical computing library for Python. Spark integrates with NumPy to leverage its capabilities for efficient array operations and numerical computations within Spark applications.
O
Operations
Functions in Spark that can be applied to RDDs, such as transformations and actions.

Optimization
Techniques used to improve the performance of Spark jobs, including query optimization and execution plan tuning.

OutputFormat
A class in Hadoop (and used by Spark) that defines how the output data is written. Examples include TextOutputFormat and SequenceFileOutputFormat.
O
Overhead
Overhead in Spark refers to the additional resources (such as CPU and memory) consumed by Spark framework operations, beyond what is directly used by user tasks.

Out-of-Core Processing
Out-of-core processing in Spark refers to the ability to process data that is larger than the available memory by spilling data to disk, enabling efficient handling of big data.
P
Partition
A division of data in an RDD or DataFrame that can be processed in parallel.

Persist
Storing an RDD in memory across operations for faster access.

PySpark
The Python API for Spark, allowing the use of Spark with Python.
P
Pipeline
A sequence of data processing steps, often used in machine learning workflows. Spark's MLlib provides APIs to create and manage pipelines.

Physical Plan
A detailed, low-level representation of how a query will be executed in Spark. It is generated by the Catalyst Optimizer from the logical plan.
P
Predicate
A predicate in Spark is a function used to filter data based on a specified condition, such as in filter operations on RDDs or DataFrames.

PairRDDFunctions
PairRDDFunctions are specialized functions in Spark's Scala API for operations on RDDs containing key-value pairs, providing methods like reduceByKey, groupByKey, and join.

Parquet
Parquet is a columnar storage format supported by Spark for efficient data storage and query processing. It offers benefits such as compression and schema evolution support.
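A Parquet read/write sketch (the path is illustrative, reusing the hypothetical people DataFrame from the DataFrame entry):

people.write.mode("overwrite").parquet("/tmp/people.parquet")
spark.read.parquet("/tmp/people.parquet").show()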

Q
Query
A request for data retrieval and processing in Spark, often written in SQL or using DataFrame/Dataset APIs.

Query Execution Plan
The series of steps Spark takes to execute a query, which includes both the logical plan and the physical plan.
R
RDD (Resilient Distributed Dataset)
The fundamental data structure in Spark, representing an immutable, distributed collection of objects.

Resource Manager
Manages the allocation of resources in a cluster for Spark jobs.

Range Partitioning
A partitioning strategy where data is divided into ranges based on a key. This can optimize performance for range queries and joins.
R
ReduceByKey
A wide transformation that merges the values of each key using an associative reduce function. It is more efficient than groupByKey because it performs partial aggregation locally before shuffling the data.
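A reduceByKey sketch:

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())   # [('a', 4), ('b', 2)], order may vary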

Repartition
A transformation that reshuffles the data in an RDD or DataFrame into a specified number of partitions. This can be used to increase or decrease the level of parallelism.

Row
A record in a DataFrame, representing a single entry with fields corresponding to the columns of the DataFrame.
R
Replication
Replication in Spark refers to the process of duplicating data or RDD partitions across nodes in a cluster to achieve fault tolerance and data locality.

Resilience
Resilience in Spark refers to its ability to recover from failures or faults automatically by using lineage information and recomputing lost data partitions.

Resource Manager
A Resource Manager, such as YARN (Yet Another Resource Negotiator), manages and allocates resources (CPU, memory) for Spark applications running in a cluster environment.
S
SparkContext
The main entry point for Spark functionality, responsible for connecting to the cluster and creating RDDs.

Spark SQL
A Spark module for working with structured data using SQL queries.
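A Spark SQL sketch, reusing the hypothetical people DataFrame from the DataFrame entry:

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()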

Streaming
Processing real-time data streams in Spark, using the Spark Streaming API.
S
Schema
The structure that defines the organization of data in a DataFrame or Dataset, including column names and data types.

Shuffle
A process of redistributing data across partitions that involves moving data between executors. It typically occurs during wide transformations like groupByKey, reduceByKey, and join.

SparkSession
The entry point for programming Spark applications with the DataFrame and Dataset API. It replaces the older SQLContext and HiveContext.
S
Stage
A set of tasks that can be executed in parallel during the execution of a Spark job. A job is divided into stages based on shuffle boundaries.

Structured Streaming
An API in Spark for stream processing that allows you to process data in real time using high-level declarative queries similar to batch processing.
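A minimal Structured Streaming sketch using the built-in rate source and a console sink, for demonstration only:

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = (stream.writeStream
    .format("console")
    .outputMode("append")
    .start())
# query.awaitTermination()   # uncomment to keep the stream running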

Salting
Salting in Apache Spark is a technique used to address data skew issues when performing certain operations, particularly joins and aggregations, on large datasets. Data skew occurs when the distribution of data across partitions is highly uneven, causing some partitions to contain significantly more data than others.
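A simplified salting sketch for a skewed aggregation; the events DataFrame and user_id column are hypothetical:

from pyspark.sql import functions as F

salted = events.withColumn("salt", (F.rand() * 8).cast("int"))               # spread hot keys over 8 sub-keys
partial = salted.groupBy("user_id", "salt").agg(F.count("*").alias("cnt"))   # partial aggregation per sub-key
result = partial.groupBy("user_id").agg(F.sum("cnt").alias("cnt"))           # final aggregation per original key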
T
Task
A unit of work that runs on a single executor and is a part of a job.

Transformation
Operations that create a new RDD from an existing one, such as map, filter, and reduceByKey.

Triggers
A mechanism in Structured Streaming that specifies when the system should process the next set of data. Examples include continuous processing and micro-batch intervals.
T
Task Scheduler
The task scheduler in Spark coordinates the assignment of tasks to executor nodes in a cluster, optimizing task placement based on data locality and resource availability.

Tuple
In Spark, a tuple is an ordered collection of elements (similar to a list or array) that can contain heterogeneous data types, commonly used to represent key-value pairs in RDDs or DataFrames.

Tungsten
Tungsten is the internal code name for a Spark project focused on optimizing memory management and execution speed by leveraging binary processing and off-heap memory.
U
UDF (User-Defined Function)
Custom functions defined by users to extend the capabilities of Spark SQL.
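A minimal UDF sketch, applied to the hypothetical people DataFrame from the DataFrame entry:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

shout = F.udf(lambda s: s.upper() if s else None, StringType())
people.select(shout(people.name).alias("name_upper")).show()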

Unpersist
Releasing the memory used by a cached RDD.

Unit Tests
Tests that validate the functionality of individual components of Spark applications. Libraries like spark-testing-base can help write unit tests for Spark applications.
U
Union
union is a transformation in Spark that combines two RDDs or DataFrames into a single RDD or DataFrame by appending the rows of one to the other.

Upstream Dependency
Upstream dependency in Spark refers to the dependencies of a task or operation on previous stages or transformations in the execution DAG (Directed Acyclic Graph).
V
View
A temporary table in Spark SQL created from a DataFrame.

Vectorization
A technique used in Spark MLlib to represent data in a format that can be efficiently processed by machine learning algorithms.
W
Worker Node
A node in a Spark cluster that executes tasks and returns results to the driver.

Wide Transformation
Transformations that require data shuffling across nodes, such as groupByKey and reduceByKey.

Window Function
A function that performs a calculation across a set of table rows related to the current row. Spark SQL supports window functions for operations like ranking, aggregations, and analytic functions.
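A window function sketch ranking rows within each region, reusing the hypothetical sales DataFrame from the Aggregation entry:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("region").orderBy(F.desc("amount"))
sales.withColumn("rank", F.row_number().over(w)).show()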

W
Write-Ahead Log (WAL)
The WAL in Spark Streaming is a mechanism for fault tolerance that writes received records to a log before they are processed and committed, so the data can be recovered in case of failures.

Widening Conversion
In Spark's type system, widening conversion refers to automatic type promotion or casting of data types to a wider or more general type to maintain compatibility and avoid data loss.
X
XML
A markup language that Spark can read and write using libraries and custom parsers.

XPath
XPath is a query language used for selecting nodes from XML documents, which can be integrated with Spark applications for data extraction and processing.
Y
YARN (Yet Another Resource Negotiator)
A cluster management technology used by Spark for resource allocation and job scheduling.

YAML (YAML Ain't Markup Language)
YAML Ain't Markup Language is a recursive acronym, emphasizing that YAML is for data rather than documents. It is a human-readable data serialization format that can be used for configuring and managing settings in Spark applications or related tools.
Z
Zookeeper
A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services, often used in Spark Streaming for managing offsets in Kafka.

Z-Order
A multidimensional clustering method used in Delta Lake to optimize data skipping and improve query performance by clustering data in a way that enhances locality for multiple columns.

Zeppelin
An open-source web-based notebook that enables interactive data analytics. Apache Zeppelin supports multiple languages and can be used with Spark for interactive data exploration and visualization.
Deepa Vasanthkumar

https://www.linkedin.com/in/deepa-vasanthkumar/

https://medium.com/@deepa.account
