GPU Computing With Apache Spark and Python
Stan Seibert
Siu Kwan Lam
April 5, 2016
About Continuum Analytics
• I’m going to use Anaconda throughout this presentation.
• https://www.continuum.io/downloads
Overview
1. Why Python?
2. Numba: A Python JIT compiler for the CPU and GPU
3. PySpark: Distributed computing for Python
4. Example: Image Registration
5. Tips and Tricks
6. Conclusion
WHY PYTHON?
Why is Python so popular?
• Straightforward, productive language for system administrators, programmers, scientists, analysts, and hobbyists
• Great community: lots of tutorial and reference materials
• Easy to interface with other languages
• Vast ecosystem of useful libraries
… But, isn’t Python slow?
• Pure, interpreted Python is slow.
• Secret: most scientific Python packages put the speed-critical sections of their algorithms in a compiled language (for example, FORTRAN wrapped with f2py).
Is there another way?
• Switching languages for speed in your projects can be a little clunky.
NUMBA: A PYTHON JIT COMPILER
Compiling Python
• Numba is an open-source, type-specializing compiler for Python functions
• Can translate Python syntax into machine code if all type information can be
deduced when the function is called.
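For example, a minimal illustration of type specialization (my example, not from the slides):

    from numba import jit

    @jit
    def add(a, b):
        return a + b

    add(1, 2)       # first call: compiles an int64 specialization
    add(1.0, 2.5)   # new argument types: compiles a separate float64 specialization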
Supported Platforms
• OS: OS X (10.9 and later)
• HW: CUDA-capable NVIDIA GPUs
• SW: NumPy 1.7 through 1.10
How Does Numba Work?

    @jit
    def do_math(a, b):
        …

    >>> do_math(x, y)

Compilation pipeline: Bytecode Analysis → Numba IR → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Execute! (compiled machine code is cached for reuse)
Numba on the CPU
Annotated example (the code itself is a screenshot on the slide):
• Numba decorator (nopython=True not required)
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Result: 2.7x speedup!
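A minimal sketch using the annotated features (the function name and logic are illustrative, not the slide's exact code):

    from numba import jit
    import numpy as np

    @jit  # nopython=True not required; types are inferred at call time
    def positive_sqrt(x):
        out = np.empty_like(x)        # array allocation
        n = 0
        for v in x:                   # looping over ndarray x as an iterator
            if v > 0.0:
                out[n] = np.sqrt(v)   # using numpy math functions
                n += 1
        return out[:n]

    x = np.random.normal(size=1000000)
    y = positive_sqrt(x)              # first call compiles, later calls run machine code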
CUDA Kernels in Python
• The decorator will infer the type signature when you call it
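The kernel on the slide is a screenshot; a minimal sketch of a CUDA kernel in Numba (the kernel name and body are illustrative):

    from numba import cuda

    @cuda.jit  # no explicit signature: types are inferred at the first call
    def gpu_add(a, b, out):
        i = cuda.grid(1)      # absolute index of this thread across the grid
        if i < out.size:      # guard threads that fall past the end of the data
            out[i] = a[i] + b[i]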
Calling the Kernel from Python
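The launch code is also a screenshot; a sketch reusing the gpu_add kernel from above (the grid sizing is the usual idiom, not necessarily the slide's exact numbers):

    import numpy as np

    n = 100000
    a = np.arange(n, dtype=np.float32)
    b = 2.0 * a
    out = np.empty_like(a)

    threads_per_block = 128
    blocks = (n + threads_per_block - 1) // threads_per_block   # ceiling division
    gpu_add[blocks, threads_per_block](a, b, out)  # NumPy arrays are copied to and from the GPU automatically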
Higher Level Tools: GPU Ufuncs
Decorator for creating ufunc
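A sketch of a GPU ufunc with Numba's @vectorize decorator (the function is illustrative; the slide's screenshot is not preserved):

    import numpy as np
    from numba import vectorize

    @vectorize(['float32(float32, float32)'], target='cuda')
    def gpu_mult(a, b):
        return a * b

    x = np.arange(10, dtype=np.float32)
    print(gpu_mult(x, x))   # broadcasting and other ufunc behavior come for free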
GPU Ufuncs Performance
Accelerate Library Bindings: cuFFT
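The binding shown came from Anaconda Accelerate; a sketch of a call (the module path and function name are from the 2016-era accelerate package and may differ in other versions):

    import numpy as np
    from accelerate.cuda.fft import fft_inplace   # Accelerate's cuFFT binding

    a = np.random.rand(1024).astype(np.complex64)
    fft_inplace(a)   # forward FFT computed on the GPU, result overwrites a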
PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON
What is Apache Spark?
• An API and an execution engine for distributed computing on a cluster
• Resilient: RDDs remember how they were created, so if a node goes down,
Spark can recompute the lost elements on another node
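A minimal PySpark example (my sketch; assumes a working Spark installation):

    from pyspark import SparkContext

    sc = SparkContext()                      # connect to the cluster, or run locally
    rdd = sc.parallelize(range(1000))        # distribute the data as an RDD
    squares = rdd.map(lambda x: x * x)       # a lazy, recomputable transformation
    print(squares.take(5))                   # [0, 1, 4, 9, 16]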
Computation DAGs
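Transformations only record how each RDD is derived; an action triggers execution of the accumulated DAG. A small illustration, reusing rdd from the sketch above:

    evens = rdd.filter(lambda x: x % 2 == 0)   # extends the DAG, computes nothing yet
    doubled = evens.map(lambda x: 2 * x)       # still lazy
    print(doubled.count())                     # action: the whole DAG executes now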
Python and Spark
(diagram: Spark Worker [Java] ⇄ Py4J ⇄ Python Interpreter)
Using Numba (CPU) with Spark
• Load arrays onto the Spark workers
• Apply my function to every element in the RDD and return the first element
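The slide's code is a screenshot; a sketch with the same shape (the jitted function and array sizes are illustrative, and sc is an existing SparkContext):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum_of_squares(x):
        total = 0.0
        for v in x:
            total += v * v
        return total

    arrays = [np.random.rand(1000) for _ in range(100)]
    rdd = sc.parallelize(arrays)              # load arrays onto the Spark workers
    print(rdd.map(sum_of_squares).first())    # apply the jitted function, return the first element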
Using CUDA Python with Spark
• Define the CUDA kernel on the client; compilation happens there
(pipeline: CUDA Python kernel → compile → PTX → serialize on the client → deserialize → finalize → CUDA binary on the cluster)
• The serialized PTX is reused when the cluster's CUDA architecture matches the client's
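Concretely, the kernel defined on the client travels to the workers inside the mapped function; a sketch reusing the RDD of NumPy arrays from the previous example (kernel and sizes are illustrative):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, out):
        i = cuda.grid(1)
        if i < x.size:
            out[i] = 2.0 * x[i]

    def run_on_worker(arr):
        out = np.empty_like(arr)
        threads = 128
        blocks = (arr.size + threads - 1) // threads
        scale[blocks, threads](arr, out)   # kernel is finalized for this worker's GPU
        return out

    result = rdd.map(run_on_worker).collect()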
If the cluster's architecture differs, the LLVM IR is serialized instead:
(pipeline: CUDA Python kernel → compile → LLVM IR → serialize on the client → deserialize → finalize to PTX → CUDA binary on the cluster)
EXAMPLE: IMAGE REGISTRATION
Basic Algorithm
(flowchart: image set → pairwise image registration → progress?)
Basic Algorithm
• Group similar images (unsupervised kNN clustering), run on the CPU
• Reduces the number of pairwise image registration attempts
(flowchart: image set → group similar images → register within groups → progress?)
Cross Power Spectrum
Core of the phase-correlation-based image registration algorithm
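The formula itself is not preserved on the slide; the standard normalized cross power spectrum, sketched with NumPy (the function name is mine):

    import numpy as np

    def cross_power_spectrum(a, b):
        # R = F1 * conj(F2) / |F1 * conj(F2)|
        f1 = np.fft.fft2(a)
        f2 = np.fft.fft2(b)
        cross = f1 * np.conj(f2)
        return cross / np.abs(cross)

The peak of the inverse FFT of R gives the translation between the two images.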
Cross Power Spectrum
• Same computation, with the FFTs moved to the GPU via cuFFT
Scaling on Spark
• Machines have multiple GPUs
• Each worker computes a partition at a time
• Random partitioning: rdd.repartition(num_parts)
• Registration runs per partition: rdd.mapPartitions(imgreg_func), as sketched below
(flowchart: image set → random partitioning → per-partition registration → progress?)
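A sketch of the per-partition function (imgreg_func and num_parts appear on the slide; the body is illustrative and reuses cross_power_spectrum from above; image_list is a hypothetical list of 2-D arrays):

    import numpy as np

    def imgreg_func(partition):
        images = list(partition)               # materialize this partition on the worker
        shifts = []
        for a, b in zip(images, images[1:]):   # register consecutive pairs
            r = np.fft.ifft2(cross_power_spectrum(a, b)).real
            shifts.append(np.unravel_index(np.argmax(r), r.shape))
        return shifts

    rdd = sc.parallelize(image_list).repartition(num_parts)
    shifts = rdd.mapPartitions(imgreg_func).collect()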
CUDA Multi-Process Service (MPS)
• Sharing one GPU between multiple workers can be beneficial
• Enable it with the nvidia-cuda-mps-control daemon
Scaling with Multiple GPUs and MPS
(benchmark chart not preserved; baseline configuration: 1 CPU worker with 1 Tesla K20)
Spark vs CUDA
• Spark logic is similar to CUDA host logic
• Partitions are like CUDA blocks
• mapPartitions is like a kernel launch
• Work cooperatively within a partition
• Spark network transfer overhead is analogous to PCI-express transfer overhead
TIPS AND TRICKS
Parallelism is about communication
• Be aware of the communication overhead when deciding how
to chunk your work in Spark.
➡ Start with big chunks, then move to smaller ones if you fail to
reach full utilization.
Start Small
• Easier to debug your logic on your local system!
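For example, you can develop against a local Spark master first (a common idiom, not from the slides):

    from pyspark import SparkContext

    sc = SparkContext('local[4]')   # run Spark in-process with 4 worker threads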
Be interactive with Big Data!
• Run a Jupyter notebook on the Spark + multi-GPU cluster
Amortize Setup Costs
• Make sure that one-time costs are actually done once:
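One concrete way (my example, not the slide's code): move setup into mapPartitions so it runs once per partition instead of once per element:

    import numpy as np

    def process_partition(values):
        table = np.sqrt(np.arange(256))   # stand-in for an expensive one-time setup
        for v in values:
            yield table[v % 256]          # per-element work reuses the setup

    # assumes rdd is an RDD of integers
    result = rdd.mapPartitions(process_partition).collect()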
Be(a)ware of Global State
• Beware that PySpark may (quite often!) spawn new worker processes
• Memoize expensive global setup so it is redone only when a fresh process needs it (see the sketch below)
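A sketch of the memoization pattern (the names are mine):

    _cache = {}

    def get_device_context(device_id):
        # module-level cache: survives within one worker process and is
        # recomputed automatically in a freshly spawned one
        if device_id not in _cache:
            from numba import cuda
            cuda.select_device(device_id)
            _cache[device_id] = cuda.current_context()
        return _cache[device_id]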
Spark and Multi-GPU
• Spark is not aware of GPU resources
GPU assignment by partition index
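The slide's code is a screenshot; a sketch of the idea with mapPartitionsWithIndex (the round-robin choice and the pass-through body are mine):

    from numba import cuda

    def with_gpu(index, partition):
        cuda.select_device(index % len(cuda.gpus))   # round-robin GPUs by partition index
        # ... launch CUDA kernels on the selected device here ...
        return iter(list(partition))                 # this sketch just passes the data through

    result = rdd.mapPartitionsWithIndex(with_gpu).collect()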
PySpark and Numba for GPU clusters
• Numba lets you create compiled CPU and CUDA functions right inside your Python applications.
• Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs.
• There is room for improvement in how Spark interacts with the GPU, but things do work.