Spark Material

Components of Spark

Q1. Can you explain the internal working of Apache Spark?

Apache Spark's internal working can be broken down into several components:

1. Driver: The driver is the program that creates a SparkContext and coordinates the execution of
tasks on the cluster.

2. SparkContext: The SparkContext is the entry point to Spark's functionality. It sets up the
internal services and configuration for the application.

3. DAGScheduler: The DAGScheduler is responsible for breaking down the user's program into
stages and tasks.

4. TaskScheduler: The TaskScheduler is responsible for scheduling tasks on the cluster.

5. Executor: The Executor is responsible for executing tasks on the cluster.

6. ShuffleManager: The ShuffleManager is responsible for managing data shuffle between stages.

7. BlockManager: The BlockManager is responsible for managing data storage and retrieval.

Q2. Can you explain the workflow of a Spark job?

Here's the workflow:

1. Job submission: The user submits a job to the SparkContext.

2. DAG creation: The DAGScheduler creates a Directed Acyclic Graph (DAG) of stages and tasks.

3. Stage creation: The DAGScheduler breaks down the DAG into stages.

4. Task creation: The DAGScheduler breaks down each stage into tasks.

5. Task scheduling: The TaskScheduler schedules tasks on the Executor.

6. Task execution: The Executor executes tasks and stores output in memory or on disk.

7. Shuffle: The ShuffleManager manages data shuffle between stages.

8. Result retrieval: The BlockManager retrieves results from memory or disk.
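As a small, concrete illustration of this workflow (the file path and column name are made up), the groupBy below forces a shuffle, so the job is split into two stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WorkflowDemo").getOrCreate()

# Transformations only build up the DAG; nothing executes yet.
df = spark.read.csv("path/to/sales.csv", header=True, inferSchema=True)
totals = df.groupBy("region").count()   # groupBy introduces a shuffle (stage boundary)

# The action submits the job: DAG -> stages -> tasks -> executors -> results.
totals.show()
```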

Q3. Can you explain the concept of RDDs and DataFrames?

RDDs (Resilient Distributed Datasets) and DataFrames are Spark's core data structures.

1. RDDs: RDDs are immutable collections of data split into partitions and processed in parallel.

2. DataFrames: DataFrames are a higher-level abstraction built on top of RDDs, providing a structured data model.
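A minimal sketch of the difference (the data and names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddVsDf").getOrCreate()
sc = spark.sparkContext

# RDD: an unstructured, partitioned collection of Python objects.
rdd = sc.parallelize([("alice", 34), ("bob", 45)])
adults = rdd.filter(lambda row: row[1] >= 40)      # positional access

# DataFrame: the same data with named columns, optimized by Catalyst.
df = spark.createDataFrame(rdd, ["name", "age"])
adults_df = df.filter(df["age"] >= 40)             # column-name access
```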

Q4. How does Spark handle failures?

Spark uses a concept called lineage to handle failures. Lineage is the sequence of
transformations applied to the data. If a task fails, Spark can recompute the data using the
lineage.
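For example, you can print an RDD's lineage with toDebugString() (a quick sketch):

```python
rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# toDebugString() shows the chain of transformations (the lineage)
# that Spark would replay to recompute a lost partition.
print(doubled.toDebugString())
```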

Q5. What is Apache Spark, and how does it differ from Hadoop MapReduce?
Apache Spark is an open-source, distributed computing system that offers an interface for
programming entire clusters with implicit data parallelism and fault tolerance. Unlike Hadoop
MapReduce, Spark processes data in-memory, which significantly boosts performance,
especially for iterative algorithms.
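For example, caching keeps a dataset in executor memory so an iterative loop does not reread it from disk on every pass (a sketch; the path and column name are made up):

```python
df = spark.read.parquet("path/to/features.parquet")
df.cache()   # keep the data in memory across iterations

for i in range(10):
    # Each pass reuses the cached data instead of rereading it from disk,
    # which is where Spark's in-memory model outperforms MapReduce.
    stats = df.groupBy("label").count().collect()
```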

Q6. Can you explain how Spark handles fault tolerance?


Spark achieves fault tolerance using lineage information. If a partition of data is lost, Spark can
rebuild it by recomputing the lost partition using the original transformations applied to the data.
This is efficient and reduces the need for redundant data storage.

Q7. How does Spark's DAG (Directed Acyclic Graph) work?


In Spark, the DAG represents a sequence of computations to be performed on data. When an
action is called, Spark creates a DAG of stages that it executes sequentially. Each stage consists
of tasks based on the data's partitioning, optimizing the overall execution process.
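A quick way to see this in practice is explain(), which prints the plan Spark builds before anything runs; the Exchange operator in the output marks the shuffle between stages (illustrative sketch):

```python
df = spark.range(100)
plan = df.groupBy((df["id"] % 10).alias("bucket")).count()

# No job has run yet; explain() shows the plan Spark will execute,
# including the Exchange (shuffle) that separates the two stages.
plan.explain()
```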

Q8: What are some of the optimizations Spark performs during execution?
Spark uses several optimizations, such as pipelining transformations, broadcasting variables to
avoid data shuffling, and optimizing the DAG execution plan with techniques like Tungsten and
Catalyst optimizers. These enhance both speed and resource efficiency.
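One of these optimizations can also be requested explicitly: broadcasting a small DataFrame so a join avoids shuffling the large side (a minimal sketch, assuming df_large and df_small already exist and share an id column):

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the large one.
joined = df_large.join(broadcast(df_small), "id")
joined.explain()   # the plan should show a broadcast join
```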

Q9. What is an RDD in Apache Spark?


RDD, or Resilient Distributed Dataset, is Spark's core abstraction representing an immutable,
distributed collection of objects. RDDs are fault-tolerant and allow for parallel processing across
a cluster, making them the backbone of Spark's data processing.
Q10. How does a DataFrame differ from an RDD?
A DataFrame is a higher-level abstraction built on top of RDDs. It represents data in a tabular
format, similar to a table in a relational database, with named columns and rows. DataFrames
optimize operations through Spark's Catalyst optimizer, making them more efficient than RDDs
for most use cases.
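You can watch Catalyst at work by printing the plans for a simple query (illustrative):

```python
from pyspark.sql.functions import col

df = spark.range(1000)
query = df.filter(col("id") > 10).filter(col("id") < 100)

# extended=True prints the parsed, analyzed, optimized, and physical plans;
# Catalyst typically combines the two filters into a single predicate.
query.explain(extended=True)
```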

Q11. When would you use a Dataset over a DataFrame?


Datasets combine the best of both RDDs and DataFrames. They provide the benefits of a typed
API with the efficiency of DataFrames. Datasets are particularly useful when you need compile-
time type safety along with the optimization benefits of Catalyst.

Q12: Why might you still use RDDs despite these higher-level abstractions?
RDDs offer more control over low-level operations and are necessary when working with
unstructured or semi-structured data that doesn't fit well into a DataFrame or Dataset. They're
also useful for custom transformations or when you're dealing with legacy codebases.
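For instance, parsing irregular log lines is often easier at the RDD level before any schema is imposed (a sketch with a made-up path and log format):

```python
lines = spark.sparkContext.textFile("path/to/app.log")

# Custom, low-level parsing that does not yet fit a fixed schema.
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: line.split(" ", 2)))
print(errors.take(5))
```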

RDD transformations are Spark operations that, when executed on an RDD, produce one or more new RDDs.

Because RDDs are immutable, a transformation never updates an existing RDD; it always creates a new one, and this chain of RDDs forms the RDD lineage.

RDD transformations come in two types: narrow transformations and wide transformations (see the sketch below).
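A minimal sketch of the two types: map and filter are narrow (each output partition depends on one input partition), while reduceByKey is wide (it shuffles data across partitions):

```python
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow transformations: no data movement between partitions.
mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))
filtered = mapped.filter(lambda kv: kv[1] > 10)

# Wide transformation: values for the same key must be shuffled together.
summed = filtered.reduceByKey(lambda a, b: a + b)
print(summed.collect())
```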

Different ways to read data into PySpark:

1. Reading from CSV Files:


df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

2. Reading from JSON Files:


df = spark.read.json("path/to/file.json")

3. Reading from Parquet Files:


df = spark.read.parquet("path/to/file.parquet")

4. Reading from Text Files:


df = spark.read.text("path/to/file.txt")

5. Reading from Database:


You can read from databases using JDBC:
```python
df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://host:port/dbname",
    driver="org.postgresql.Driver",
    dbtable="table_name",
    user="username",
    password="password"
).load()
```

6. Reading from Hive Tables:

If the SparkSession was created with Hive support enabled:
df = spark.sql("SELECT * FROM hive_table_name")

7. Reading from ORC Files:


df = spark.read.orc("path/to/file.orc")

8. Reading from Avro Files (requires the Avro package):


df = spark.read.format("avro").load("path/to/file.avro")

9. Reading from Kafka:


To read streaming data from Kafka:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "server:port")
      .option("subscribe", "topic")
      .load())

10. Reading from Delta Lake:


If using Delta Lake:
df = spark.read.format("delta").load("path/to/delta_table")

Here are some basic transformations you'll use on a daily basis:

1. Adding a New Column: df = df.withColumn("NewColumn", col("ExistingColumn") + 1)


2. Dropping a Column: df = df.drop("ColumnName")
3. Renaming a Column: df = df.withColumnRenamed("OldColumnName", "NewColumnName")
4. Filter Null Values: df = df.filter(df["ColumnName"].isNotNull())
5. Where Clause: df = df.where(col("ColumnName") > 10)
6. Case Statement: df = df.withColumn("NewColumn", when(col("Age") >= 18,
"Adult").otherwise("Minor"))
7. Joining DataFrames: df = df1.join(df2, df1["id"] == df2["id"], "inner")
8. Default Value for a Column: df = df.withColumn("NewColumn", lit("default_value"))
9. Replacing Nulls: df = df.fillna({"ColumnName": "default_value"})
10. UDF (User Defined Function): df = df.withColumn("UpperColumn",
to_upper_udf(col("ColumnName")))
11. Handling Arrays: df = df.withColumn("ArrayColumn", array("Column1", "Column2"))
12. Explode Arrays: df = df.withColumn("ExplodedColumn", explode(col("ArrayColumn")))
13. Window Functions: df = df.withColumn("Rank", rank().over(windowSpec))
14. Selecting Specific Columns: df = df.select("Column1", "Column2")
15. Distinct Rows: df = df.select("ColumnName").distinct()
16. GroupBy and Aggregate: df = df.groupBy("Column1").agg({"Column2": "sum"})
17. Sorting Data: df = df.orderBy(col("ColumnName").desc())
18. Drop Duplicates: df = df.dropDuplicates(["Column1", "Column2"])
19. Casting Data Types: df = df.withColumn("NewColumn", col("OldColumn").cast("Integer"))
20. Pivot/Unpivot: df = df.groupBy("Category").pivot("Type").sum("Amount")
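Items 10 and 13 above assume a to_upper_udf and a windowSpec that are never defined; here is one way they might look (the column names are placeholders, not anything the list prescribes):

```python
from pyspark.sql.functions import udf, col, rank  # rank() in item 13 also comes from pyspark.sql.functions
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

# Hypothetical UDF assumed by item 10: upper-cases a string column.
to_upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())

# Hypothetical window spec assumed by item 13: rank rows within each group of Column1,
# ordered by Column2 descending.
windowSpec = Window.partitionBy("Column1").orderBy(col("Column2").desc())
```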

Statistics-based questions in PySpark:


from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Initialize Spark session
spark = SparkSession.builder.appName("Statistics").getOrCreate()

# Sample DataFrame
data = [(1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]
df = spark.createDataFrame(data, ["values"])

# Function to calculate the Gini coefficient
def gini_coefficient(values):
    sorted_values = sorted(values)
    n = len(sorted_values)
    cumulative_values = [sum(sorted_values[:i+1]) for i in range(n)]
    sum_cumulative = sum(cumulative_values)
    sum_values = sum(sorted_values)
    if sum_values == 0:
        return 0
    # Discrete Gini formula: G = (n + 1)/n - 2 * (sum of cumulative totals) / (n * total)
    gini = (n + 1) / n - 2 * (sum_cumulative / (n * sum_values))
    return gini

# Collect data and apply Gini function


values = df.select("values").rdd.flatMap(lambda x: x).collect()
gini = gini_coefficient(values)
print(f"Gini Coefficient: {gini}")

# Calculate range
range_val = df.select(F.max("values") - F.min("values")).collect()[0][0]
print(f"Range: {range_val}")

# Function to calculate the Harmonic Mean
def harmonic_mean(values):
    n = len(values)
    return n / sum(1.0 / v for v in values)

# Collect data and apply Harmonic Mean function


values = df.select("values").rdd.flatMap(lambda x: x).collect()
harmonic_mean_val = harmonic_mean(values)
print(f"Harmonic Mean: {harmonic_mean_val}")

# Kurtosis
kurtosis_val = df.select(F.kurtosis("values")).collect()[0][0]
print(f"Kurtosis: {kurtosis_val}")

# Median (50th percentile)


median_val = df.approxQuantile("values", [0.5], 0.0)[0]
print(f"Median: {median_val}")

# Skewness
skewness_val = df.select(F.skewness("values")).collect()[0][0]
print(f"Skewness: {skewness_val}")
# Calculate mean and standard deviation
mean_val = df.select(F.mean("values")).collect()[0][0]
stddev_val = df.select(F.stddev("values")).collect()[0][0]

# Add a new column with Z-scores


df_zscore = df.withColumn("z_score", (F.col("values") - mean_val) / stddev_val)
df_zscore.show()

# Coefficient of variation
cv = stddev_val / mean_val
print(f"Coefficient of Variation (CV): {cv}")

# Calculate IQR
q1 = df.approxQuantile("values", [0.25], 0.0)[0]
q3 = df.approxQuantile("values", [0.75], 0.0)[0]
iqr = q3 - q1
print(f"Interquartile Range (IQR): {iqr}")

# Variance
variance_val = df.select(F.variance("values")).collect()[0][0]
print(f"Variance: {variance_val}")
