GPU Computing With Apache Spark and Python
Stan Seibert
Siu Kwan Lam
April 5, 2016
About Continuum Analytics
• I’m going to use Anaconda throughout this presentation.
• https://www.continuum.io/downloads
Overview
1. Why Python?
2. Numba: A Python JIT compiler for the CPU and GPU
3. PySpark: Distributed computing for Python
4. Example: Image Registration
5. Tips and Tricks
6. Conclusion
WHY PYTHON?
Why is Python so popular?
• Straightforward, productive language for system administrators, programmers, scientists, analysts, and hobbyists
• Great community: lots of tutorial and reference materials
• Easy to interface with other languages
• Vast ecosystem of useful libraries
… But, isn’t Python slow?
• Pure, interpreted Python is slow.
• Secret: most scientific Python packages put the speed-critical sections of their algorithms in a compiled language (for example, FORTRAN wrapped with f2py).
Is there another way?
• Switching languages for speed in your projects can be a little clunky.
NUMBA: A PYTHON JIT COMPILER
Compiling Python
• Numba is an open-source, type-specializing compiler for Python functions
• Can translate Python syntax into machine code if all type information can be
deduced when the function is called.
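For example, a minimal illustration of type specialization (my example, not from the slides):

    from numba import jit

    @jit
    def add(a, b):
        return a + b

    add(1, 2)       # first call: compiles an int64 specialization
    add(1.0, 2.5)   # new argument types: compiles a separate float64 specialization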
Supported Platforms
• OS: OS X (10.9 and later)
• HW: CUDA-capable NVIDIA GPUs
• SW: NumPy 1.7 through 1.10
How Does Numba Work?

    @jit
    def do_math(a, b):
        …

    >>> do_math(x, y)

Compilation pipeline: Bytecode Analysis → Numba IR → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Execute! (compiled machine code is cached for reuse)
Numba on the CPU
Annotated example (the code itself is a screenshot on the slide):
• Numba decorator (nopython=True not required)
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Result: 2.7x speedup!
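A minimal sketch using the annotated features (the function name and logic are illustrative, not the slide's exact code):

    from numba import jit
    import numpy as np

    @jit  # nopython=True not required; types are inferred at call time
    def positive_sqrt(x):
        out = np.empty_like(x)        # array allocation
        n = 0
        for v in x:                   # looping over ndarray x as an iterator
            if v > 0.0:
                out[n] = np.sqrt(v)   # using numpy math functions
                n += 1
        return out[:n]

    x = np.random.normal(size=1000000)
    y = positive_sqrt(x)              # first call compiles, later calls run machine code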
CUDA Kernels in Python
• The decorator will infer the type signature when you call it
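The kernel on the slide is a screenshot; a minimal sketch of a CUDA kernel in Numba (the kernel name and body are illustrative):

    from numba import cuda

    @cuda.jit  # no explicit signature: types are inferred at the first call
    def gpu_add(a, b, out):
        i = cuda.grid(1)      # absolute index of this thread across the grid
        if i < out.size:      # guard threads that fall past the end of the data
            out[i] = a[i] + b[i]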
Calling the Kernel from Python
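The launch code is also a screenshot; a sketch reusing the gpu_add kernel from above (the grid sizing is the usual idiom, not necessarily the slide's exact numbers):

    import numpy as np

    n = 100000
    a = np.arange(n, dtype=np.float32)
    b = 2.0 * a
    out = np.empty_like(a)

    threads_per_block = 128
    blocks = (n + threads_per_block - 1) // threads_per_block   # ceiling division
    gpu_add[blocks, threads_per_block](a, b, out)  # NumPy arrays are copied to and from the GPU automatically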
Higher Level Tools: GPU Ufuncs
Decorator for creating ufunc
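A sketch of a GPU ufunc with Numba's @vectorize decorator (the function is illustrative; the slide's screenshot is not preserved):

    import numpy as np
    from numba import vectorize

    @vectorize(['float32(float32, float32)'], target='cuda')
    def gpu_mult(a, b):
        return a * b

    x = np.arange(10, dtype=np.float32)
    print(gpu_mult(x, x))   # broadcasting and other ufunc behavior come for free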
GPU Ufuncs Performance
Accelerate Library Bindings: cuFFT
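The binding shown came from Anaconda Accelerate; a sketch of a call (the module path and function name are from the 2016-era accelerate package and may differ in other versions):

    import numpy as np
    from accelerate.cuda.fft import fft_inplace   # Accelerate's cuFFT binding

    a = np.random.rand(1024).astype(np.complex64)
    fft_inplace(a)   # forward FFT computed on the GPU, result overwrites a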
PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON
What is Apache Spark?
• An API and an execution engine for distributed computing on a cluster
• Resilient: RDDs remember how they were created, so if a node goes down,
Spark can recompute the lost elements on another node
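A minimal PySpark example (my sketch; assumes a working Spark installation):

    from pyspark import SparkContext

    sc = SparkContext()                      # connect to the cluster, or run locally
    rdd = sc.parallelize(range(1000))        # distribute the data as an RDD
    squares = rdd.map(lambda x: x * x)       # a lazy, recomputable transformation
    print(squares.take(5))                   # [0, 1, 4, 9, 16]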
Computation DAGs
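Transformations only record how each RDD is derived; an action triggers execution of the accumulated DAG. A small illustration, reusing rdd from the sketch above:

    evens = rdd.filter(lambda x: x % 2 == 0)   # extends the DAG, computes nothing yet
    doubled = evens.map(lambda x: 2 * x)       # still lazy
    print(doubled.count())                     # action: the whole DAG executes now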
Python and Spark
(diagram: Spark Worker [Java] ⇄ Py4J ⇄ Python Interpreter)
Using Numba (CPU) with Spark
• Load arrays onto the Spark workers
• Apply my function to every element in the RDD and return the first element
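The slide's code is a screenshot; a sketch with the same shape (the jitted function and array sizes are illustrative, and sc is an existing SparkContext):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum_of_squares(x):
        total = 0.0
        for v in x:
            total += v * v
        return total

    arrays = [np.random.rand(1000) for _ in range(100)]
    rdd = sc.parallelize(arrays)              # load arrays onto the Spark workers
    print(rdd.map(sum_of_squares).first())    # apply the jitted function, return the first element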
Using CUDA Python with Spark
• Define the CUDA kernel on the client; compilation happens there
(pipeline: CUDA Python kernel → compile → PTX → serialize on the client → deserialize → finalize → CUDA binary on the cluster)
• The serialized PTX is reused when the cluster's CUDA architecture matches the client's
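Concretely, the kernel defined on the client travels to the workers inside the mapped function; a sketch reusing the RDD of NumPy arrays from the previous example (kernel and sizes are illustrative):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, out):
        i = cuda.grid(1)
        if i < x.size:
            out[i] = 2.0 * x[i]

    def run_on_worker(arr):
        out = np.empty_like(arr)
        threads = 128
        blocks = (arr.size + threads - 1) // threads
        scale[blocks, threads](arr, out)   # kernel is finalized for this worker's GPU
        return out

    result = rdd.map(run_on_worker).collect()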
If the cluster's architecture differs, the LLVM IR is serialized instead:
(pipeline: CUDA Python kernel → compile → LLVM IR → serialize on the client → deserialize → finalize to PTX → CUDA binary on the cluster)
EXAMPLE: IMAGE REGISTRATION
Basic Algorithm
(flowchart: image set → pairwise image registration → progress?)
Basic Algorithm
• Group similar images (unsupervised kNN clustering), run on the CPU
• Reduces the number of pairwise image registration attempts
(flowchart: image set → group similar images → register within groups → progress?)
Cross Power Spectrum
Core of the phase-correlation-based image registration algorithm
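The formula itself is not preserved on the slide; the standard normalized cross power spectrum, sketched with NumPy (the function name is mine):

    import numpy as np

    def cross_power_spectrum(a, b):
        # R = F1 * conj(F2) / |F1 * conj(F2)|
        f1 = np.fft.fft2(a)
        f2 = np.fft.fft2(b)
        cross = f1 * np.conj(f2)
        return cross / np.abs(cross)

The peak of the inverse FFT of R gives the translation between the two images.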
Cross Power Spectrum
• Same computation, with the FFTs moved to the GPU via cuFFT
Scaling on Spark
• Machines have multiple GPUs
• Each worker computes a partition at a time
• Random partitioning: rdd.repartition(num_parts)
• Registration runs per partition: rdd.mapPartitions(imgreg_func), as sketched below
(flowchart: image set → random partitioning → per-partition registration → progress?)
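A sketch of the per-partition function (imgreg_func and num_parts appear on the slide; the body is illustrative and reuses cross_power_spectrum from above; image_list is a hypothetical list of 2-D arrays):

    import numpy as np

    def imgreg_func(partition):
        images = list(partition)               # materialize this partition on the worker
        shifts = []
        for a, b in zip(images, images[1:]):   # register consecutive pairs
            r = np.fft.ifft2(cross_power_spectrum(a, b)).real
            shifts.append(np.unravel_index(np.argmax(r), r.shape))
        return shifts

    rdd = sc.parallelize(image_list).repartition(num_parts)
    shifts = rdd.mapPartitions(imgreg_func).collect()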
CUDA Multi-Process Service (MPS)
• Sharing one GPU between multiple workers can be beneficial
• Enable it with the nvidia-cuda-mps-control daemon
Scaling with Multiple GPUs and MPS
(benchmark chart not preserved; baseline configuration: 1 CPU worker with 1 Tesla K20)
Spark vs CUDA
• Spark logic is similar to CUDA host logic
• Partitions are like CUDA blocks
• mapPartitions is like a kernel launch
• Work cooperatively within a partition
• Spark network transfer overhead is analogous to PCI-express transfer overhead
TIPS AND TRICKS
Parallelism is about communication
• Be aware of the communication overhead when deciding how
to chunk your work in Spark.
➡ Start with big chunks, then move to smaller ones if you fail to
reach full utilization.
Start Small
• Easier to debug your logic on your local system!
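For example, you can develop against a local Spark master first (a common idiom, not from the slides):

    from pyspark import SparkContext

    sc = SparkContext('local[4]')   # run Spark in-process with 4 worker threads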
Be interactive with Big Data!
• Run a Jupyter notebook on the Spark + multi-GPU cluster
Amortize Setup Costs
• Make sure that one-time costs are actually done once:
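One concrete way (my example, not the slide's code): move setup into mapPartitions so it runs once per partition instead of once per element:

    import numpy as np

    def process_partition(values):
        table = np.sqrt(np.arange(256))   # stand-in for an expensive one-time setup
        for v in values:
            yield table[v % 256]          # per-element work reuses the setup

    # assumes rdd is an RDD of integers
    result = rdd.mapPartitions(process_partition).collect()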
Be(a)ware of Global State
• Beware that PySpark may (quite often!) spawn new worker processes
• Memoize expensive global setup so it is redone only when a fresh process needs it (see the sketch below)
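A sketch of the memoization pattern (the names are mine):

    _cache = {}

    def get_device_context(device_id):
        # module-level cache: survives within one worker process and is
        # recomputed automatically in a freshly spawned one
        if device_id not in _cache:
            from numba import cuda
            cuda.select_device(device_id)
            _cache[device_id] = cuda.current_context()
        return _cache[device_id]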
Spark and Multi-GPU
• Spark is not aware of GPU resources
GPU assignment by partition index
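The slide's code is a screenshot; a sketch of the idea with mapPartitionsWithIndex (the round-robin choice and the pass-through body are mine):

    from numba import cuda

    def with_gpu(index, partition):
        cuda.select_device(index % len(cuda.gpus))   # round-robin GPUs by partition index
        # ... launch CUDA kernels on the selected device here ...
        return iter(list(partition))                 # this sketch just passes the data through

    result = rdd.mapPartitionsWithIndex(with_gpu).collect()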
PySpark and Numba for GPU clusters
• Numba lets you create compiled CPU and CUDA functions right inside your Python applications.
• Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs.
• There is room for improvement in how Spark interacts with the GPU, but things do work.