Spark RDD Dataframes SQL

Apache Spark can be used with Scala via the spark-shell or with Python via PySpark. It allows processing of structured data using DataFrames and SQL. Examples show how to create RDDs from text files, transform them using Map, FlatMap, Reduce and ReduceByKey. RDDs can be converted to DataFrames for SQL queries. DataFrames allow grouping, filtering, and aggregating structured data.


Apache Spark:

Language used: Scala

To start Spark:

spark-shell    (for Scala)
pyspark        (for Python)
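Since the examples below import com.databricks.spark.xml._, the shell also needs the spark-xml package on its classpath. A hedged launch example (the package coordinates and version are assumptions; pick the build that matches your Spark/Scala version):

# start spark-shell with the spark-xml package pulled from Maven
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1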

include modules:

// create the SQLContext first, then bring in its implicits
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._

// spark-xml, for reading XML files into DataFrames
import com.databricks.spark.xml._

// schema building blocks
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}
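With those imports in place, an XML file can be read straight into a DataFrame. A minimal sketch, assuming a books.xml file on HDFS whose repeated element is <book> (the file name and row tag are assumptions, not from the original notes):

// rowTag names the XML element that becomes one DataFrame row
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("hdfs://quickstart.cloudera:8020/user/cloudera/books.xml")

df.printSchema()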

type(df.describe())                                 # PySpark: describe() itself returns a DataFrame
df.select("_id", "author", "description").show()    // show a few columns of the XML DataFrame

HDFS base path used in these examples:
hdfs://quickstart.cloudera:8020/user/cloudera/

MAP and flatMap example:

mountain@mountain:~/sbook$ cat words.txt


line1 word1
line2 word2 word1
line3 word3 word4
line4 word1

scala> val lines = sc.textFile("words.txt");


...
scala> lines.map(_.split(" ")).take(3)
res4: Array[Array[String]] = Array(Array(line1, word1), Array(line2, word2, word1), Array(line3, word3, word4))

flatMap() flattens multiple lists into one single list:

scala> lines.flatMap(_.split(" ")).take(3)


res5: Array[String] = Array(line1, word1, line2)
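Combining the two operations gives the classic word-count pattern (the same reduceByKey call is introduced in its own section below); a small sketch over the same words.txt:

// split every line into words, pair each word with 1, then sum the 1s per word
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.collect()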

REDUCE:
val rdd1 = sc.parallelize(List(20,32,45,62,8,5))
val sum = rdd1.reduce(_+_)
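// sum is 172 (20 + 32 + 45 + 62 + 8 + 5)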

ReduceByKey:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data_RDD = sc.parallelize(words)
val mapped_RDD = data_RDD.map(w => (w,1))
mapped_RDD.take(10)

val reduced_RDD = mapped_RDD.reduceByKey(_+_)


reduced_RDD.take(10)
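// reduced_RDD holds one (word, count) pair per distinct word, e.g. (two,2), (six,2), (one,1), ... (pair ordering is not guaranteed)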

FILTER:
(the snippet below only reads and splits the CSV; a filter() sketch follows right after it)

val data_RDD =
sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/temperature_2014.csv")
data_RDD.take(100)

val FL_mapped_RDD = data_RDD.flatMap(lines => lines.split(","))


FL_mapped_RDD.take(20)
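A hedged filter() example on the same file, assuming the temperature is the third comma-separated field of each line (the column layout and the threshold are assumptions, not from the original notes):

// keep only lines whose third field, read as a number, is above 30
// (assumes no header row and that the field always parses as a Double)
val hot_RDD = data_RDD.filter(line => line.split(",")(2).toDouble > 30.0)
hot_RDD.take(10)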

RDD to DATAFRAME:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)


import org.apache.spark.sql._
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// load the data into a new RDD
val ebayText = sc.textFile("/home/jovyan/work/datasets/spark-ebook/ebay.csv")

// Return the first element in this RDD


ebayText.first()

//define the schema using a case class


//class name starts with capital letter
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String,
  bidderrate: Integer, openbid: Float, price: Float, item: String, daystolive: Integer)

// create an RDD of Auction objects


val ebay = ebayText.map(_.split(",")).map(p =>
  Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p(4).toInt,
          p(5).toFloat, p(6).toFloat, p(7), p(8).toInt))

// Return the first element in this RDD


ebay.first()
// change ebay RDD of Auction objects to a DataFrame
val auction = ebay.toDF()
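To sanity-check the conversion, the schema and a few rows can be printed (a small addition, not in the original notes):

auction.printSchema()
auction.show(5)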

// How many bids per item?


auction.groupBy("auctionid", "item").count.show
auction.select("auctionid").distinct.count()
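Other aggregates follow the same groupBy pattern; a hedged sketch of the highest price seen per item (column names as defined in the Auction case class above):

import org.apache.spark.sql.functions.max
auction.groupBy("item").agg(max("price")).show()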

// Get the auctions with closing price > 100


val highprice = auction.filter("price > 100")
highprice.show()
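The same filter can also be written with a Column expression instead of a SQL string:

auction.filter(auction("price") > 100).show()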

// register the DataFrame as a temp table


auction.registerTempTable("RDD_table")

import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// How many bids per auction?
val results = sqlContext.sql("SELECT auctionid, item, count(bid) FROM RDD_table GROUP BY auctionid, item")
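The query result is itself a DataFrame and can be displayed with show(). A second hedged SQL example against the same temp table, computing the maximum price per auction (the alias maxprice is an illustrative name):

results.show()

val maxPrices = sqlContext.sql("SELECT auctionid, item, max(price) AS maxprice FROM RDD_table GROUP BY auctionid, item")
maxPrices.show()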

References:
https://mapr.com/ebooks/spark/05-processing-tabular-data-with-spark-sql.html
https://www.supergloo.com/fieldnotes/spark-sql-csv-examples-python/
http://sparktutorials.net/Opening+CSV+Files+in+Apache+Spark+-+The+Spark+Data+Sources+API+and+Spark-CSV
