Apache Spark - DataFrames and Spark SQL
Apache Spark - DataFrames and Spark SQL
Spark SQL
July, 2015
Spark SQL Graduated
from
Alpha
in 1.3
2
Spark SQL
• Part of the core distribution since 1.0 (April 2015)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
Improved
multi-version
support in 1.4
3
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety
of formats.
df = sqlContext.read \
.format("json") \
.option("samplingRatio", "0.1") \
.load("/Users/spark/data/stuff.json")
df.write \
.format("parquet") \
.mode("append") \
.partitionBy("year") \
.saveAsTable("faster-‐stuff")
4
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety
of formats.
val df = sqlContext.
read.
format("json").
option("samplingRatio", "0.1").
load("/Users/spark/data/stuff.json")
df.write.
format("parquet").
mode("append").
partitionBy("year").
saveAsTable("faster-‐stuff")
5
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety
of formats.
val df = sqlContext.
read.
format("json").
option("samplingRatio", "0.1").
load("/Users/spark/data/stuff.json")
df.write.
format("parquet"). read and write
mode("append").
partitionBy("year").
functions create
saveAsTable("faster-‐stuff") new builders for
doing I/O
6
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety
of formats.
val df = sqlContext.
read.
format("json").
option("samplingRatio", "0.1").} Builder methods
load("/Users/spark/data/stuff.json") specify:
df.write. • format
format("parquet"). • partitioning
mode("append").
partitionBy("year").
} • handling of
saveAsTable("faster-‐stuff") existing data
7
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety
of formats.
val df = sqlContext.
read.
format("json").
option("samplingRatio", "0.1").
load("/Users/spark/data/stuff.json")
load(…), save(…),
or saveAsTable(…)
df.write.
format("parquet").
finish the I/O
mode("append"). specification
partitionBy("year").
saveAsTable("faster-‐stuff")
8
ETL using Custom Data Sources
sqlContext.read
.format("com.databricks.spark.jira")
.option("url", "https://issues.apache.org/jira/rest/api/latest/search")
.option("user", "...")
.option("password", "...")
.option("query", """
|project = SPARK AND
|component = SQL AND
|(status = Open OR status = "In Progress" OR status =
"Reopened").stripMargin
.load()
.repartition(1)
.write
.format("parquet")
.saveAsTable("sparkSqlJira")
9
Write Less Code: High-Level Operations
10
Write Less Code: Compute an Average
--------------------------------------------------------------------------------
--
• Python
Using DataFrames • R
sqlContext.table("people")
.groupBy("name")
.agg("name", a vg("age"))
.collect()
12
What are DataFrames?
DataFrames are a recent addition to Spark (early 2015).
See databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-
large-scale-data-science.html
13
What are DataFrames?
DataFrames have the following features:
14
What are DataFrames?
• For new users familiar with data frames in other
programming languages, this API should make
them feel at home.
• For existing Spark users, the API will make Spark
easier to program.
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation.
15
Construct a DataFrame
16
Use DataFrames
17
DataFrames and Spark SQL
young.registerTempTable("young")
sqlContext.sql("SELECT count(*) FROM young")
18
More details, coming up
19
DataFrames and Spark SQL
DataFrames are fundamentally tied to Spark SQL.
• The DataFrames API provides a programmatic
interface—really, a domain-specific language
(DSL)—for interacting with your data.
• Spark SQL provides a SQL-like interface.
• What you can do in Spark SQL, you can do in
DataFrames
• … and vice versa.
20
What, exactly, is Spark SQL?
Spark SQL allows you to manipulate distributed
data with SQL queries. Currently, two SQL dialects
are supported.
• If you're using a Spark SQLContext, the only
supported dialect is "sql", a rich subset of SQL 92.
• If you're using a HiveContext, the default dialect
is "hiveql", corresponding to Hive's SQL dialect.
"sql" is also available, but "hiveql" is a richer
dialect.
21
Spark SQL
• You issue SQL queries through a SQLContext or
HiveContext, using the sql() method.
• The sql() method returns a DataFrame.
• You can mix DataFrame methods and SQL queries
in the same code.
• To use SQL, you must either:
• query a persisted Hive table, or
• make a table alias for a DataFrame, using
registerTempTable()
22
DataFrames
Like Spark SQL, the DataFrames API assumes that
the data has a table-like structure.
• filter • count
• select • collect
• drop • show
• intersect • head
• join • take
24
Transformations, Actions, Laziness
Actions cause the execution of the query.
25
All Actions on a DataFrame
26
27
28
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
29
30
31
DataFrames &Resilient Distributed
Datasets (RDDs)
• DataFrames are built on top of the Spark RDD* API.
• This means you can use normal RDD operations on
DataFrames.
• However, stick with the DataFrame API, wherever
possible.
• Using RDD operations will often give you back an RDD,
not a DataFrame.
• The DataFrame API is likely to be more efficient, because
it can optimize the underlying operations with Catalyst.
* We will be discussing RDDs later in the course.
32
DataFrames can be significantly faster than RDDs.
And they perform the same, regardless of language.
DataFrame SQL
DataFrame R
DataFrame Python
DataFrame Scala
RDD Python
RDD Scala
0 2 4 6 8 10
34
Plan Optimization & Execution
Cost Model
Unresolved Optimized Physical Selected
Logical Plan RDDs
Logical Plan Logical Plan Plans Physical Plan
DataFrame
Catalog
35
Plan Optimization & Execution
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= ”2015-01-01”)
filter join
36
Plan Optimization: "Intelligent" Data Sources
join
join scan
filter
(users)
scan filter scan
(users) (events)
39
https://databricks.com/blog/2015/04/13/deep-‐dive-‐into-‐spark-‐sqls-‐catalyst-‐optimizer.html
3 Fundamental transformations on
DataFrames
- mapPartitions
- New ShuffledRDD
- ZipPartitions
40
DataFrame limitations
• Catalyst does not automatically repartition DataFrames
optimally
42
Creating a DataFrame in Scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
43
Creating a DataFrame in Python
Unfortunately, getOrCreate() does not exist in
pyspark.
df = sqlContext.read.parquet("/path/to/data.parquet")
df2 = sqlContext.read.json("/path/to/data.json")
44
Creating a DataFrame in R
# The following two lines are not necessary in the sparkR shell
sc <-‐ sparkR.init(master, appName)
sqlContext <-‐ sparkRSQL.init(sc)
df <-‐ parquetFile("/path/to/data.parquet")
df2 <-‐ jsonFile("/path/to/data.json")
45
SQLContext and Hive
Our previous examples created a default Spark
SQLContext object.
47
HiveContext
If your copy of Spark has Hive support, you can create a
HiveContext easily enough:
import org.spark.sql.hive.HiveContext
sqlContext = HiveContext(sc)
48
DataFrames Have Schemas
In the previous example, we created DataFrames from
Parquet and JSON data.
• A Parquet table has a schema (column names and
types) that Spark can use. Parquet also allows Spark
to be efficient about how it pares down data.
• Spark can infer a Schema from a JSON file.
49
Data Sources supported by
DataFrames
built-in external
JDBC
{ JSON }
and more …
50
Schema Inference
What if your data file doesn’t have a schema? (e.g.,
You’re reading a CSV file or a plain text file.)
52
Schema Inference :: Scala
import sqlContext.implicits._
53
Schema Inference :: Python
• We can do the same thing in Python.
• Use a namedtuple, dict, or Row, instead of a
Python class, though.*
• Row is part of the DataFrames API
* See
spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.create
DataFrame
54
Schema Inference :: Python
from pyspark.sql import Row
rdd = sc.textFile("people.csv")
Person = Row('first_name', 'last_name', 'gender', 'age')
def line_to_person(line):
cells = line.split(",")
cells[3] = int(cells[3])
return Person(*cells)
peopleRDD = rdd.map(line_to_person)
df = peopleRDD.toDF()
# DataFrame[first_name: string, last_name: string, gender:
string, age: bigint]
55
Schema Inference :: Python
from collections import namedtuple
Person = namedtuple('Person',
['first_name', 'last_name', 'gender', 'age']
)
rdd = sc.textFile("people.csv")
def line_to_person(line):
cells = line.split(",")
return Person(cells[0], cells[1], cells[2],
int(cells[3]))
peopleRDD = rdd.map(line_to_person)
df = peopleRDD.toDF()
# DataFrame[first_name: string, last_name: string, gender: string,
age: bigint]
56
Schema Inference
57
Schema Inference :: Scala
Here’s the Scala version:
rdd = sc.textFile("people.csv")
def line_to_person(line):
cells = line.split(",")
return tuple(cells[0:3] + [int(cells[3])])
peopleRDD = rdd.map(line_to_person)
df = peopleRDD.toDF(("first_name", "last_name",
"gender", "age"))
In Python, you don’t. You can use any iterable data structure
(e.g., a list).
In Scala, you do. Tuples have fixed lengths and fixed types for
each element at compile time. For instance:
Tuple4[String,String,String,Int]
60
Hands On
In the labs area of the shard, under the
sql-‐and-‐dataframes folder, you'll find another folder
called hands-‐on.
61
Additional Input Formats
The DataFrames API can be extended to understand
additional input formats (or, input sources).
63
A brief look at spark-csv
// Scala
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
load("people.csv")
# Python
df = sqlContext.read.format("com.databricks.spark.csv").\
load("people.csv", header="true")
64
A brief look at spark-csv
65
A brief look at spark-csv
You can also declare the schema programmatically, which
allows you to specify the column types. Here’s Scala:
import org.apache.spark.sql.types._
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
schema(schema).
load("people.csv")
// df: org.apache.spark.sql.DataFrame = [firstName: string, gender: string,
age: int]
66
A brief look at spark-csv
Here’s the same thing in Python:
df = sqlContext.read.format("com.databricks.spark.csv").\
schema(schema).\
load("people.csv")
67
What can I do with a DataFrame?
68
Columns
When we say “column” here, what do we mean?
69
Columns
Input Source Data Frame Data Let's see how
Format Variable Name
DataFrame
JSON dataFrame1 [ {"first": "Amy",
"last": "Bello",
"age": 29 }, columns map
onto some
{"first": "Ravi",
"last": "Agarwal",
"age": 33 },
]
… common data
CSV dataFrame2 first,last,age
sources.
Fred,Hoover,91
Joaquin,Hernandez,24
…
70
Columns dataFrame1
column: "first"
Input Source Data Frame Data
Format Variable Name
JSON dataFrame1 [ {"first": "Amy",
"last": "Bello",
"age": 29 },
{"first": "Ravi",
"last": "Agarwal", dataFrame2
"age": 33 },
… column: "first"
]
dataFrame3
SQL Table dataFrame3 column: "first"
first last age
Joe Smith 42
Jill Jones 33
71
Columns
When we say “column” here, what do we mean?
Several things:
†In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or
by indexing (df['age']). While the former is convenient for interactive data
exploration, you should use the index form. It's future proof and won’t break with column
names that are also attributes on the DataFrame class.
‡ The $ syntax can be ambiguous, if there are multiple DataFrames in the lineage.
73
printSchema()
You can have Spark tell you what it thinks the data
schema is, by calling the printSchema() method.
(This is mostly useful in the shell.)
scala> df.printSchema()
root
|-‐-‐ firstName: string (nullable = true)
|-‐-‐ lastName: string (nullable = true)
|-‐-‐ gender: string (nullable = true)
|-‐-‐ age: integer (nullable = false)
74
printSchema()
> printSchema(df)
root
|-‐-‐ firstName: string (nullable = true)
|-‐-‐ lastName: string (nullable = true)
|-‐-‐ gender: string (nullable = true)
|-‐-‐ age: integer (nullable = false)
75
show()
You can look at the first n elements in a DataFrame with
the show() method. If not specified, n defaults to 20.
77
show()
> showDF(df)
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
|firstName|lastName|gender|age|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
| Erin| Shannon| F| 42|
| Claire| McBride| F| 23|
| Norman|Lockwood| M| 81|
| Miguel| Ruiz| M| 64|
| Rosalita| Ramirez| F| 14|
| Ally| Garcia| F| 39|
| Abigail|Cottrell| F| 75|
| José| Rivera| M| 59|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
78
cache()
79
select()
select() is like a SQL SELECT, allowing you to
limit the results to specific columns.
80
select()
The DSL also allows you create on-the-fly derived
columns.
scala> df.select($"firstName",
$"age",
$"age" > 49,
$"age" + 10).show(5)
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
|firstName|age|(age > 49)|(age + 10)|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
| Erin| 42| false| 52|
| Claire| 23| false| 33|
| Norman| 81| true| 91|
| Miguel| 64| true| 74|
| Rosalita| 14| false| 24|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
81
select()
The Python DSL is slightly different.
82
select()
The R syntax is completely different:
83
select()
And, of course, you can also use SQL. (This is the Python API,
but you issue SQL the same way in Scala and Java.)
In[1]: df.registerTempTable("names")
In[2]: sqlContext.sql("SELECT first_name, age, age > 49 FROM names").\
show(5)
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
|first_name|age| _c2|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
| Erin| 42|false|
| Claire| 23|false|
| Norman| 81| true|
| Miguel| 64| true|
| Rosalita| 14|false|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
85
filter()
The filter() method allows you to filter rows out
of your results.
86
filter()
Here’s the Python version.
87
filter()
Here’s the R version.
88
filter()
Here’s the SQL version.
89
Hands On
90
orderBy()
The orderBy() method allows you to sort the results.
91
orderBy()
It’s easy to reverse the sort order.
92
orderBy()
And, in Python:
93
orderBy()
In R:
> showDF(orderBy(
+ select(filter(df, df$age > 49), df$first_name, df$age),
+ desc(df$age), df$first_name)
+ )
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
|first_name|age|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
| Norman| 81|
| Abigail| 75|
| Miguel| 64|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+
94
orderBy()
In SQL, it's pretty normal looking:
95
groupBy()
Often used with count(), groupBy() groups data
items by a specific column value.
In [5]: df.groupBy("age").count().show()
+-‐-‐-‐+-‐-‐-‐-‐-‐+
|age|count|
+-‐-‐-‐+-‐-‐-‐-‐-‐+
| 39| 1|
| 42| 2|
| 64| 1|
| 75| 1|
| 81| 1|
| 14| 1|
| 23| 2|
+-‐-‐-‐+-‐-‐-‐-‐-‐+
96
groupBy()
R, again, is slightly different.
97
groupBy()
And SQL, of course, isn't surprising:
98
as() or alias()
as() or alias() allows you to rename a column.
It’s especially useful with generated columns.
In [7]: df.select(df['first_name'],\
df['age'],\
(df['age'] < 30).alias('young')).show(5)
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
|first_name|age|young|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
| Erin| 42|false|
| Claire| 23| true|
| Norman| 81|false|
| Miguel| 64|false|
| Rosalita| 14| true|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
100
alias()
Here's R. Only alias() is supported here.
101
as()
And, of course, SQL:
scala> sqlContext.sql("SELECT firstName, age, age < 30 AS young " +
| "FROM names")
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
|first_name|age|young|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
| Erin| 42|false|
| Claire| 23| true|
| Norman| 81|false|
| Miguel| 64|false|
| Rosalita| 14| true|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐+
102
Hands On
103
Other Useful Transformations
Method Description
limit(n) Limit the results to n rows. limit() is
not an action, like show() or the RDD
take() method. It returns another
DataFrame.
distinct() Returns a new DataFrame containing only
the unique rows from the current
DataFrame
drop(column) Returns a new DataFrame with a column
dropped. column is a name or a Column
object.
intersect(dataframe) Intersect one DataFrame with another.
join(dataframe) Join one DataFrame with another, like a
SQL join. We’ll discuss this one more in a
minute.
[
{
"firstName": "Erin",
"lastName": "Shannon",
"medium": "oil on canvas"
},
{
"firstName": "Norman",
"lastName": "Lockwood",
"medium": "metal (sculpture)"
},
…
]
105
Joins
We can load that into a second DataFrame and join
it with our first one.
In [1]: df2 = sqlContext.read.json("artists.json")
# Schema inferred as DataFrame[firstName: string, lastName: string, medium:
string]
In [2]: df.join(
df2,
df.first_name == df2.firstName and df.last_name == df2.lastName
).show()
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
|first_name|last_name|gender|age|firstName|lastName| medium|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
| Norman| Lockwood| M| 81| Norman|Lockwood|metal (sculpture)|
| Erin| Shannon| F| 42| Erin| Shannon| oil on canvas|
| Rosalita| Ramirez| F| 14| Rosalita| Ramirez| charcoal|
| Miguel| Ruiz| M| 64| Miguel| Ruiz| oil on canvas|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐+-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
106
Joins
Let’s make that a little more readable by only
selecting some of the columns.
107
explode() [
{"id": 9 09091,
108
explode()
When you load it into a DataFrame, here's what you see:
109
explode()
The schema is more interesting.
scala> df.printSchema
root
|-‐-‐ id: integer (nullable = true)
|-‐-‐ father: struct (nullable = true)
| |-‐-‐ firstName: string (nullable = true)
| |-‐-‐ middleName: string (nullable = true)
| |-‐-‐ lastName: string (nullable = true)
| |-‐-‐ gender: string (nullable = true)
| |-‐-‐ birthYear: integer (nullable = true)
|-‐-‐ mother: struct (nullable = true)
| |-‐-‐ firstName: string (nullable = true)
| |-‐-‐ middleName: string (nullable = true)
| |-‐-‐ lastName: string (nullable = true)
| |-‐-‐ gender: string (nullable = true)
| |-‐-‐ birthYear: integer (nullable = true)
|-‐-‐ children: array (nullable = true)
| |-‐-‐ element: struct (containsNull = true)
| | |-‐-‐ firstName: string (nullable = true)
| | |-‐-‐ middleName: string (nullable = true)
| | |-‐-‐ lastName: string (nullable = true)
| | |-‐-‐ gender: string (nullable = true)
| | |-‐-‐ birthYear: integer (nullable = true)
110
explode()
In that layout, the data can be difficult to manage. But, we can explode
the columns to make them easier to manage. For instance, we can turn a
single children value, an array, into multiple values, one per row:
Note what happened: A single children column value was exploded into multiple values,
one per row. The rest of the values in the original row were duplicated in the new rows.
111
explode()
The resulting DataFrame has one child per row, and it's
easier to work with:
scala> df2.select($"father.firstName".as("fatherFirstName"),
$"mother.firstName".as("motherFirstName"),
$"child.firstName".as("childFirstName"),
$"child.middleName".as("childMiddleName")).show()
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
|fatherFirstName|motherFirstName|childFirstName|childMiddleName|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
| Nicolas| Jenette| Terri| Olene|
| Nicolas| Jenette| Bobbie| Lupe|
| Nicolas| Jenette| Liana| Ophelia|
| Nicolas| Jenette| Pablo| Son|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
112
User Defined Functions
Suppose our JSON data file capitalizes the names differently
than our first data file. The obvious solution is to force all
names to lower case before joining.
113
User Defined Functions
However, this deficiency is easily remedied with a user
defined function.
In [8]: from pyspark.sql.functions import udf
In [9]: lower = udf(lambda s: s.lower())
In [10]: df.select(lower(df['firstName'])).show(5)
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
|PythonUDF#<lambda>(first_name)|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
| erin|
| claire|
| norman|
| miguel|
| rosalita|
+-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐+
114
User Defined Functions
Interestingly enough, lower() does exist in the Scala
API. So, let’s invent something that doesn’t:
scala> df.select(double($("total")))
console>:23: error: not found: value double
df.select(double($("total"))).show()
^
115
User Defined Functions
Again, it’s an easy fix.
scala> df.select(double($("total"))).show(5)
+---------------+
|scalaUDF(total)|
+---------------+
| 7065.0|
| 2604.0|
| 2003.0|
| 1939.0|
| 1746.0|
+---------------+
116
User Defined Functions
UDFs are not currently supported in R.
117
Lab
118
Writing DataFrames
• You can write DataFrames out, as well. When doing ETL,
this is a very common requirement.
• In most cases, if you can read a data format, you can
write that data format, as well.
• If you're writing to a text file format (e.g., JSON), you'll
typically get multiple output files.
119
Writing DataFrames
scala> df.write.format("json").save("/path/to/directory")
scala> df.write.format("parquet").save("/path/to/directory")
In [20]: df.write.format("json").save("/path/to/directory")
In [21]: df.write.format("parquet").save("/path/to/directory")
120
Writing DataFrames: Save modes
Save operations can optionally take a SaveMode that
specifies how to handle existing data if present.
Scala/Java Python Meaning
SaveMode.ErrorIfExists "error" If output data or table already exists,
(default) an exception is expected to be
thrown.
SaveMode.Append "append" If output data or table already exists,
append contents of the DataFrame
to existing data.
SaveMode.Overwrite "overwrite" If output data or table already exists,
replace existing data with contents
of DataFrame.
SaveMode.Ignore "ignore" If output data or table already exists,
do not write DataFrame at all.
121
Writing DataFrames: Save modes
122
Writing DataFrames: Hive
• When working with a HiveContext, you can save
a DataFrame as a persistent table, with the
saveAsTable() method.
• Unlike registerTempTable(),
saveAsTable() materializes the DataFrame (i.e.,
runs the DAG) and creates a pointer to the data in
the Hive metastore.
• Persistent tables will exist even after your Spark
program has restarted.
123
Writing Data Frames: Hive
124
Other Hive Table Operations
• To create a DataFrame from a persistent Hive table, call
the table() method on a SQLContext, passing the
table name.
125
Explain
You can dump the query plan to standard output, so
you can get an idea of how Spark will execute your
query.
126
Explain
Pass true to get a more detailed query plan.
scala> df.join(df2, lower(df("firstName")) === lower(df2("firstName"))).explain(true)
== Parsed Logical Plan ==
Join Inner, Some((Lower(firstName#1) = Lower(firstName#13)))
Relation[birthDate#0,firstName#1,gender#2,lastName#3,middleName#4,salary#5L,ssn#6]
org.apache.spark.sql.json.JSONRelation@7cbb370e
Relation[firstName#13,lastName#14,medium#15] org.apache.spark.sql.json.JSONRelation@e5203d2c
127
Spark SQL: Just a little more info
128
Example
To issue SQL against an existing DataFrame, create a temporary table,
which essentially gives the DataFrame a name that's usable within a query.
131
SQL and RDDs
132
DataFrame Advanced Tips
• It is possible to coalesce or repartition DataFrames
133
Machine Learning Integration
134
Machine Learning Integration
135
ML: Transformer
A Transformer is an algorithm which can transform
one DataFrame into another DataFrame.
136
ML: Transformer
A feature transformer might:
• take a dataset,
• read a column (e.g., text),
• convert it into a new column (e.g., feature vectors),
• append the new column to the dataset, and
• output the updated dataset.
137
ML: Transformer
A learning model might:
• take a dataset,
• read the column containing feature vectors,
• predict the label for each feature vector,
• append the labels as a new column, and
• output the updated dataset.
138
ML: Estimator
139
ML: Estimator
An Estimator abstracts the concept of any algorithm
which fits or trains on data.
141
ML: Pipeline
In machine learning, it is common to run a sequence of
algorithms to process and learn from data. A simple text
document processing workflow might include several
stages:
df = context.load("/path/to/data")
model = pipeline.fit(df)
143
ML: Scala Example
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression
val df = sqlContext.load("/path/to/data")
val model = pipeline.fit(df) 144
Lab
In Databricks, you'll find a DataFrames SQL lab
notebook.