PySpark questions



1) PySpark architecture == driver program, cluster manager, worker node, executor, task

A) Driver Program
The driver program is the process that runs the main() function of the application
and creates the SparkContext object.
The purpose of the SparkContext is to coordinate the Spark application, which runs
as an independent set of processes on a cluster.
B) Cluster Manager
The role of the cluster manager is to allocate resources across applications.
Spark is capable of running on a large number of clusters.
C) Worker Node
The worker node is a slave node.
Its role is to run the application code in the cluster.
D) Executor
An executor is a process launched for an application on a worker node.
It runs tasks and keeps data in memory or on disk across them.
It reads and writes data to external sources.
Every application has its own executors.
E) Task
A task is a unit of work that is sent to one executor (a minimal example follows below).
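A minimal sketch of how these pieces fit together, using local mode in place of a real cluster manager; the application name and partition counts are illustrative, not part of the original notes.

```python
# The driver runs main(), creates the SparkSession/SparkContext, and plans the job;
# the cluster manager (here: local mode) provides executors on worker nodes;
# each partition of the data becomes a task that runs on an executor.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("architecture-demo")   # illustrative application name
    .master("local[2]")             # local mode stands in for a real cluster manager
    .getOrCreate()
)
sc = spark.sparkContext

rdd = sc.parallelize(range(10), numSlices=2)   # 2 partitions -> 2 tasks per stage
print(rdd.map(lambda x: x * x).sum())          # 285; the action triggers the tasks

spark.stop()
```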

RDD
Spark follows the master-slave architecture. Its cluster consists of a single
master and multiple slaves.

The Spark architecture depends upon two abstractions:

Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets are groups of data items that can be stored
in memory on worker nodes. Here,

Resilient: restores the data on failure.
Distributed: the data is distributed among different nodes.
Dataset: a group of data.
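A minimal sketch of the RDD abstraction; the data values and partition count are illustrative.

```python
# An RDD is an immutable, partitioned collection distributed across worker nodes.
# Transformations are lazy: they only extend the lineage (the DAG), which Spark
# replays to restore lost partitions on failure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["a", "b", "c", "d"], numSlices=2)   # distributed over 2 partitions
upper = rdd.map(lambda s: s.upper())                      # lazy transformation
print(upper.glom().collect())                             # [['A', 'B'], ['C', 'D']] -- one list per partition
print(upper.toDebugString())                              # the lineage used for recovery

spark.stop()
```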

Cache
In Spark, caching is a mechanism for storing data in memory to speed up access to
that data.
When you cache a dataset, Spark keeps the data in memory so that it can be quickly
retrieved the next time it is needed.
Caching is especially useful when you need to perform multiple operations on the
same dataset, as it eliminates the need to read the data from disk each time.
The persist() method allows you to specify the storage level for the persisted
data, such as memory-only or disk-only storage.
What is the difference between cache and persist in Spark?
Caching and persistence are both optimization techniques in Spark, but they differ
in their approach.
cache() stores the data at the default storage level (in memory), while persist()
gives more control by letting you choose the storage level (see the sketch below).
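A minimal sketch of cache() versus persist(); the DataFrames and sizes are illustrative, and the default storage level noted in the comment is an assumption about DataFrame caching.

```python
# cache() is shorthand for persist() at the default storage level, while
# persist() lets you pick the level explicitly (memory only, disk only, ...).
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").master("local[2]").getOrCreate()

df1 = spark.range(1_000_000)                     # illustrative datasets
df2 = spark.range(1_000_000)

cached = df1.cache()                             # default level (MEMORY_AND_DISK for DataFrames)
persisted = df2.persist(StorageLevel.DISK_ONLY)  # explicit level chosen by the caller

cached.count()                                   # the first action materializes the cache
persisted.count()

cached.unpersist()                               # release the data when done
persisted.unpersist()
spark.stop()
```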
Difference between cache and broadcast
Caching is used to store and reuse RDDs/DataFrames across multiple stages of a
Spark application, and it can be used for larger datasets.
Broadcasting, on the other hand, is suitable for efficiently sharing small,
read-only data across worker nodes, reducing data transfer and improving
performance.
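A minimal sketch of a broadcast variable; the lookup dictionary and its contents are illustrative.

```python
# A small read-only lookup table is broadcast once to every executor instead of
# being shipped with each task, unlike a cached dataset which stays partitioned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

country_names = {"IN": "India", "US": "United States"}   # small, read-only lookup
bc = sc.broadcast(country_names)

codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda c: bc.value.get(c, "unknown")).collect())
# ['India', 'United States', 'India']

spark.stop()
```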

There are five distinct types of join strategies; the three most common are
described below, with a hint example after the list:

1) Broadcast Hash Join (BHJ)
When a join is performed between two DataFrames and the size of one (or both) of
them lies within the broadcast threshold limit, the smaller DataFrame is broadcast
to every executor and joined locally, avoiding a shuffle.
2) Shuffle Hash Join (SHJ)
The Shuffle Hash Join goes through the following three phases:
1. Shuffle
2. Hash Table Creation
3. Hash Join
3) Sort Merge Join (SMJ)
The Sort Merge Join is the default join selection strategy when a join is
performed between two DataFrames.
The Sort Merge Join goes through the following three phases:
1. Shuffle
2. Sort
3. Merge
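A minimal sketch of nudging the join strategy with hints; the DataFrames and column names are illustrative.

```python
# Hints let you steer Spark toward a particular join strategy; explain() shows
# which strategy the optimizer actually picked.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").master("local[2]").getOrCreate()

orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country"])
countries = spark.createDataFrame([("IN", "India"), ("US", "United States")], ["country", "name"])

# Broadcast Hash Join: explicitly broadcast the small side.
bhj = orders.join(broadcast(countries), "country")

# Sort Merge Join: the default for larger inputs; can also be requested with a hint.
smj = orders.join(countries.hint("merge"), "country")

bhj.explain()   # physical plan shows a BroadcastHashJoin
smj.explain()   # physical plan shows a SortMergeJoin

spark.stop()
```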

What is repartition in Spark?

In Apache Spark, repartition is a transformation used to redistribute data within
RDDs or DataFrames, allowing for greater control over data distribution and
improved parallelism. It triggers a full shuffle of the data across the cluster.
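A minimal sketch of repartition on a DataFrame; the row count and target partition counts are illustrative.

```python
# repartition redistributes a DataFrame across a chosen number of partitions
# (a full shuffle), e.g. to increase parallelism before an expensive step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").master("local[4]").getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())       # initial partition count

df8 = df.repartition(8)                # full shuffle into 8 partitions
print(df8.rdd.getNumPartitions())      # 8

by_key = df.repartition(8, "id")       # can also partition by one or more columns
print(by_key.rdd.getNumPartitions())   # 8

spark.stop()
```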
