Spark Material
Apache Spark's internal workings can be broken down into several components:
1. Driver: The driver is the program that creates a SparkContext and coordinates the execution of
tasks on the cluster.
2. SparkContext: The SparkContext is the entry point to Spark's functionality. It sets up the
internal services and configuration for the application.
3. DAGScheduler: The DAGScheduler is responsible for breaking down the user's program into
stages and tasks.
4. TaskScheduler: The TaskScheduler launches the tasks produced by the DAGScheduler on the
cluster's executors.
5. Executor: Executors are the worker processes that run tasks and keep their output in memory
or on disk.
6. BlockManager: The BlockManager is responsible for managing data storage and retrieval.
When an application runs, the execution flow is as follows (illustrated in the sketch after this list):
1. Job submission: Calling an action submits a job to the DAGScheduler.
2. DAG creation: The DAGScheduler creates a Directed Acyclic Graph (DAG) of stages and tasks.
3. Stage creation: The DAGScheduler breaks down the DAG into stages.
4. Task creation: The DAGScheduler breaks down each stage into tasks.
5. Task scheduling: The TaskScheduler schedules tasks on the Executor.
6. Task execution: The Executor executes tasks and stores output in memory or on disk.
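A minimal sketch of this flow, assuming a local SparkSession (the app name, dataset, and bucket column below are illustrative, not from the original material): a groupBy introduces a shuffle, so the DAGScheduler splits the job into two stages, and calling an action submits the job for scheduling and execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("dag-sketch").getOrCreate()

# A wide transformation (groupBy) forces a shuffle; the DAGScheduler cuts the
# job into stages at that shuffle boundary.
df = spark.range(0, 1000)
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()

# The physical plan shows the Exchange operator marking the stage boundary.
agg.explain()

# The action submits the job; the TaskScheduler then launches one task per
# partition on the executors.
agg.collect()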
RDDs (Resilient Distributed Datasets) and DataFrames are Spark's core data structures.
1. RDDs: RDDs are immutable collections of data split into partitions and processed in parallel.
2. DataFrames: DataFrames are distributed collections of data organized into named columns,
similar to tables in a relational database, and they benefit from the Catalyst optimizer.
Spark uses a concept called lineage to handle failures. Lineage is the sequence of
transformations applied to the data. If a task fails, Spark can recompute the data using the
lineage.
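A small sketch of lineage in practice, assuming a local SparkSession (the variable names and data are illustrative): toDebugString prints the chain of transformations Spark would replay to rebuild a lost partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext

# Each transformation records its parent, building up the lineage graph.
numbers = sc.parallelize(range(10))
doubled = numbers.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# If a partition of `evens` is lost, Spark re-runs map and filter on the
# corresponding partition of the source data.
print(evens.toDebugString().decode("utf-8"))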
Q5: What is Apache Spark, and how does it differ from Hadoop MapReduce?
Apache Spark is an open-source, distributed computing system that offers an interface for
programming entire clusters with implicit data parallelism and fault tolerance. Unlike Hadoop
MapReduce, Spark processes data in-memory, which significantly boosts performance,
especially for iterative algorithms.
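As a rough illustration of the in-memory advantage (a sketch, assuming a local SparkSession; the loop simply stands in for an iterative algorithm): caching keeps the dataset in memory so repeated passes avoid rereading or recomputing it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-sketch").getOrCreate()

# Cache the dataset so iterative passes read it from memory.
df = spark.range(0, 1_000_000).cache()

total = 0
for _ in range(5):
    # After the first action materializes the cache, later passes reuse it.
    total += df.count()
print(total)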
Q8: What are some of the optimizations Spark performs during execution?
Spark uses several optimizations, such as pipelining transformations, broadcasting variables to
avoid data shuffling, and optimizing the DAG execution plan with techniques like Tungsten and
Catalyst optimizers. These enhance both speed and resource efficiency.
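A sketch of broadcasting to avoid a shuffle, assuming a local SparkSession (the lookup and fact tables are made up for illustration): the broadcast() hint tells Catalyst to ship the small table to every executor instead of shuffling the large one.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("broadcast-sketch").getOrCreate()

# Small lookup table and a larger fact table (illustrative data).
lookup = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
facts = spark.range(0, 100_000).withColumn("id", (F.col("id") % 2) + 1)

# The broadcast hint ships `lookup` to every executor, so `facts` is not shuffled.
joined = facts.join(F.broadcast(lookup), on="id")
joined.explain()  # the plan should show a BroadcastHashJoin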
Q12: Why might you still use RDDs despite these higher-level abstractions?
RDDs offer more control over low-level operations and are necessary when working with
unstructured or semi-structured data that doesn't fit well into a DataFrame or Dataset. They're
also useful for custom transformations or when you're dealing with legacy codebases.
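A sketch of the kind of low-level control RDDs give you, assuming a local SparkSession (the log lines are made up): plain Python functions parse semi-structured text that doesn't map cleanly onto DataFrame columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Semi-structured log lines that don't fit a fixed schema.
logs = sc.parallelize([
    "INFO 2024-01-01 job started",
    "WARN 2024-01-01 slow task detected",
    "INFO 2024-01-02 job finished",
])

# Custom parsing and filtering with arbitrary Python code.
warnings = (logs
            .map(lambda line: line.split(" ", 2))
            .filter(lambda parts: parts[0] == "WARN"))
print(warnings.collect())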
RDD transformations are Spark operations that, when executed on an RDD, result in one or more
new RDDs.
Because RDDs are immutable, transformations always create a new RDD rather than updating an
existing one; this chain of transformations forms the RDD lineage, as illustrated below.
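A short sketch of that behaviour, assuming a local SparkSession (the numbers are illustrative): each transformation returns a new RDD, the source RDD is untouched, and nothing runs until an action is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("transform-sketch").getOrCreate()
sc = spark.sparkContext

base = sc.parallelize([1, 2, 3, 4, 5])

# Each transformation returns a new RDD; `base` itself is never modified.
squared = base.map(lambda x: x * x)
large = squared.filter(lambda x: x > 5)

# Transformations are lazy; the actions below trigger the actual computation.
print(base.collect())   # [1, 2, 3, 4, 5]
print(large.collect())  # [9, 16, 25]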
The snippet below computes basic descriptive statistics (range, kurtosis, skewness, coefficient of
variation, IQR, and variance) on a single numeric column. A SparkSession (here a local one named
`spark`) and pyspark.sql.functions imported as F are assumed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("stats-demo").getOrCreate()

# Sample DataFrame with one numeric column
data = [(1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]
df = spark.createDataFrame(data, ["values"])
# Calculate range
range_val = df.select(F.max("values") - F.min("values")).collect()[0][0]
print(f"Range: {range_val}")
# Kurtosis
kurtosis_val = df.select(F.kurtosis("values")).collect()[0][0]
print(f"Kurtosis: {kurtosis_val}")
# Skewness
skewness_val = df.select(F.skewness("values")).collect()[0][0]
print(f"Skewness: {skewness_val}")
# Calculate mean and standard deviation
mean_val = df.select(F.mean("values")).collect()[0][0]
stddev_val = df.select(F.stddev("values")).collect()[0][0]
# Coefficient of variation
cv = stddev_val / mean_val
print(f"Coefficient of Variation (CV): {cv}")
# Calculate IQR
q1 = df.approxQuantile("values", [0.25], 0.0)[0]
q3 = df.approxQuantile("values", [0.75], 0.0)[0]
iqr = q3 - q1
print(f"Interquartile Range (IQR): {iqr}")
# Variance
variance_val = df.select(F.variance("values")).collect()[0][0]
print(f"Variance: {variance_val}")