pyMIC: A Python Offload Module for the Intel Xeon Phi™ Coprocessor
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
Copyright © 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel
Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations
that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets
and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Python in HPC
The pyMIC Offload Infrastructure
• pyMIC facts
• 3800 lines of C/C++ code;
• 1100 lines of Python code for the main API;
• LIBXSTREAM and Intel® LEO for interfacing with MPSS
High-Level Overview
• pymic [Python]: main API, including the OffloadArray container
• _pyMICimpl [C/C++]: C/C++ extension module for low-level device management and interaction with LEO
• (kernels) [C]: native kernel code loaded by the application
Example dgemm: The Host Side…

numpy (host only):

import numpy as np
am = np.matrix(a)
bm = np.matrix(b)
cm = np.matrix(c)
cm = alpha * am * bm + beta * cm

pyMIC (offloaded):

import pymic as mic
device = mic.devices[0]
stream = device.get_default_stream()
library = device.load_library("libdgemm.so")
stream.invoke(library.dgemm_kernel,
              a, b, c,
              m, n, k, alpha, beta)
stream.sync()
Example dgemm: The Host Side…

• Get a device handle (numbered from 0 to n-1) and its default stream
• Load the kernel code as an object library
• Invoke the kernel function and pass actual arguments

import pymic as mic
import numpy as np

device = mic.devices[0]
stream = device.get_default_stream()
library = device.load_library("libdgemm.so")

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.empty((m, n))

stream.invoke(library.dgemm_kernel,
              a, b, c,
              m, n, k, alpha, beta)
stream.sync()
Example dgemm: The Target Side…

• Arguments are passed as C/C++ types
• All argument passing is done with pointers to the actual data
• Invoke the (native) dgemm kernel

#include <pymic_kernel.h>
#include <mkl.h>

PYMIC_KERNEL
void dgemm_kernel(const double *A, const double *B, double *C,
                  const int64_t *m, const int64_t *n, const int64_t *k,
                  const double *alpha, const double *beta) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                *m, *n, *k, *alpha, A, *k, B, *n,
                *beta, C, *n);
}
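The kernel above is a plain BLAS call computing C = alpha·A·B + beta·C in row-major storage. A host-only NumPy sketch of the same math (no pyMIC involved), spelling out the loop nest the BLAS routine performs:

```python
import numpy as np

# Small stand-in for the dgemm kernel's math: C = alpha*A@B + beta*C.
m, n, k = 3, 4, 5
alpha, beta = 2.0, 0.5
rng = np.random.RandomState(10)
A = rng.random_sample((m, k))
B = rng.random_sample((k, n))
C = rng.random_sample((m, n))

expected = alpha * (A @ B) + beta * C

# The same result written out explicitly, as cblas_dgemm computes it:
result = C.copy()
for i in range(m):
    for j in range(n):
        acc = 0.0
        for p in range(k):
            acc += A[i, p] * B[p, j]
        result[i, j] = alpha * acc + beta * C[i, j]

assert np.allclose(result, expected)
```

Note that with row-major storage the leading dimensions passed to cblas_dgemm are the row lengths: *k for A, *n for B and C, matching the kernel above.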
High-level Data Structures

OffloadDevice
• Interaction with devices

OffloadStream
• Invocation of kernel functions
• Buffer management

OffloadArray
• numpy.ndarray container
• Simple kernels and operators (fill, +, *)
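A host-only mock can illustrate the container semantics of OffloadArray. MockOffloadArray is invented for this sketch; the real OffloadArray keeps a device-side buffer and runs these operators as kernels on the coprocessor:

```python
import numpy as np

class MockOffloadArray:
    """Host-only stand-in mimicking OffloadArray's fill/+/* semantics."""

    def __init__(self, array):
        self.array = np.asarray(array, dtype=float)

    def fill(self, value):
        # the real container runs a fill kernel on the device
        self.array[:] = value
        return self

    def __add__(self, other):
        other = other.array if isinstance(other, MockOffloadArray) else other
        return MockOffloadArray(self.array + other)

    def __mul__(self, other):
        other = other.array if isinstance(other, MockOffloadArray) else other
        return MockOffloadArray(self.array * other)

    def update_host(self):
        # the real container copies the device buffer back first
        return self.array

oa = MockOffloadArray(np.zeros(4)).fill(2.0)
ob = MockOffloadArray(np.arange(4.0))
oc = (oa + ob) * ob
```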
Optimize Offloads with High-level Containers

• Get a device handle (numbered from 0 to n-1) and its default stream
• Load the kernel code as an object library
• Use bind to create an offload buffer for host data
• Invoke the kernel function and pass the offload buffers

import pymic as mic
import numpy as np

device = mic.devices[0]
stream = device.get_default_stream()

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.zeros((m, n))

offl_a = device.bind(a)
offl_b = device.bind(b)
offl_c = device.bind(c)
The High-level Offload Protocol

[Diagram: the host process drives the target process; device.load_library(), a.update_device(), stream.invoke(…), and a.update_host() each trigger a corresponding action in the target process.]

Host side:

import pymic as mic
import numpy as np

device = mic.devices[0]
libr = device.load_library("libdgemm.so")
stream = device.get_default_stream()

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.zeros((m, n))

offl_a = stream.bind(a)
offl_b = stream.bind(b)
offl_c = stream.bind(c)

stream.invoke(libr.dgemm_kernel,
              offl_a, offl_b, offl_c,
              m, n, k,
              alpha, beta)
offl_c.update_host()
stream.sync()

Target side:

#include <pymic_kernel.h>
#include <mkl.h>

PYMIC_KERNEL
void dgemm_kernel(const double *A, const double *B, double *C,
                  const int64_t *m, const int64_t *n, const int64_t *k,
                  const double *alpha, const double *beta) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                *m, *n, *k, *alpha, A, *k, B, *n,
                *beta, C, *n);
}
Buffer Management: Buffer Creation

class OffloadStream:
    def bind(self, array, update_device=True):
        if not isinstance(array, numpy.ndarray):
            raise ValueError("only numpy.ndarray can be associated "
                             "with OffloadArray")
        # detect the order of storage for 'array'
        if array.flags.c_contiguous:
            order = "C"
        elif array.flags.f_contiguous:
            order = "F"
        else:
            raise ValueError("could not detect storage order")

class OffloadStream:
    def allocate_device_memory(self, nbytes, alignment=64, sticky=False):
        device = self._device_id
        if nbytes <= 0:
            raise ValueError('Cannot allocate negative amount of '
                             'memory: {0}'.format(nbytes))
        device_ptr = _pymic_impl_stream_allocate(device, self._stream_id,
                                                 nbytes, alignment)
        return SmartPtr(self, device, device_ptr, sticky)
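The storage-order detection in bind() relies only on NumPy's layout flags, so it can be exercised standalone (a host-only sketch, independent of pyMIC):

```python
import numpy as np

def detect_order(array):
    """Same storage-order check bind() performs via numpy's flags."""
    if array.flags.c_contiguous:
        return "C"
    elif array.flags.f_contiguous:
        return "F"
    raise ValueError("could not detect storage order")

c_arr = np.zeros((4, 4), order="C")    # row-major
f_arr = np.zeros((4, 4), order="F")    # column-major
view = np.zeros((8, 8))[::2, ::2]      # strided view: neither layout

orders = (detect_order(c_arr), detect_order(f_arr))

try:
    detect_order(view)
    raised = False
except ValueError:
    raised = True
```

The strided-view case is exactly why bind() must reject such arrays: a non-contiguous buffer cannot be transferred with a single flat copy.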
Buffer Management: Data Transfer

class OffloadArray:
    def update_device(self):
        host_ptr = self.array.ctypes.get_data()
        s = self.stream
        s.transfer_host2device(host_ptr,
                               self._device_ptr,
                               self._nbytes)
        return None

    def update_host(self):
        host_ptr = self.array.ctypes.get_data()
        s = self.stream
        s.transfer_device2host(self._device_ptr,
                               host_ptr,
                               self._nbytes)
        return self

void buffer_copy_to_target(int device,
                           libxstream_stream *stream,
                           unsigned char *src,
                           unsigned char *dst,
                           size_t size,
                           size_t offset_host,
                           size_t offset_device) {
    unsigned char *src_offs = src + offset_host;
    unsigned char *dst_offs = dst + offset_device;
    libxstream_memcpy_h2d(src_offs, dst_offs, size, stream);
}
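update_device() and update_host() ultimately perform a flat copy of _nbytes between the host array and the device buffer. A host-only emulation with ctypes.memmove standing in for the h2d/d2h transfer (pyMIC itself is not needed for this sketch):

```python
import ctypes
import numpy as np

src = np.arange(8.0)        # plays the role of the host array
dst = np.empty_like(src)    # plays the role of the device buffer

# same size computation pyMIC uses for a transfer
nbytes = src.dtype.itemsize * src.size

# .ctypes.data yields the raw address, like .ctypes.get_data() above
ctypes.memmove(dst.ctypes.data, src.ctypes.data, nbytes)
```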
The Low-level Offload Protocol

[Diagram: the host process drives the target process; high-level calls such as offl_c.update_host() and stream.sync() are implemented by the low-level operations stream.transfer_device2host(…) and stream.deallocate_device_memory(…).]
Using the Low-level API

import pymic
import numpy

device = pymic.devices[0]
stream = device.get_default_stream()

a = numpy.arange(0.0, 32.0)
b = numpy.empty_like(a)

# determine size of the array in bytes and get pointer
nbytes = a.dtype.itemsize * a.size
ptr_a = a.ctypes.data
ptr_b = b.ctypes.data

# allocate buffer spaces in the target
dev_ptr_1 = stream.allocate_device_memory(nbytes)
dev_ptr_2 = stream.allocate_device_memory(nbytes)

# transfer a into the first buffer and shuffle a bit
stream.transfer_host2device(ptr_a, dev_ptr_1, nbytes/2,
                            offset_host=0, offset_device=nbytes/2)
stream.transfer_host2device(ptr_a, dev_ptr_1, nbytes/2,
                            offset_host=nbytes/2, offset_device=0)

# do some more shuffling on the target
for i in xrange(0, 4):
    stream.transfer_device2device(dev_ptr_1, dev_ptr_2, nbytes/4,
                                  off_dev_src=i*(nbytes/4),
                                  off_dev_dst=(3-i)*(nbytes/4))

# transfer data back into 'b' array and shuffle even more
for i in xrange(0, 4):
    stream.transfer_device2host(dev_ptr_2, ptr_b, nbytes/4,
                                offset_device=i*(nbytes/4),
                                offset_host=(3-i)*(nbytes/4))
stream.sync()
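The net effect of the three shuffle phases can be verified with a host-only NumPy simulation (plain arrays stand in for the two device buffers, and element offsets replace byte offsets):

```python
import numpy as np

a = np.arange(0.0, 32.0)
b = np.empty_like(a)
n = a.size
q = n // 4

buf1 = np.empty_like(a)   # stand-in for dev_ptr_1
buf2 = np.empty_like(a)   # stand-in for dev_ptr_2

# phase 1: the two halves of 'a' land swapped in buf1
buf1[n // 2:] = a[:n // 2]
buf1[:n // 2] = a[n // 2:]

# phase 2: quarters of buf1 land in reverse order in buf2
for i in range(4):
    buf2[(3 - i) * q:(4 - i) * q] = buf1[i * q:(i + 1) * q]

# phase 3: quarters of buf2 land in reverse order in b
for i in range(4):
    b[(3 - i) * q:(4 - i) * q] = buf2[i * q:(i + 1) * q]

# the two quarter-reversals cancel, so only the half-swap survives
assert np.array_equal(b, np.roll(a, n // 2))
```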
Example: Singular Value Decomposition
• Decompose matrix:
M = U × Σ × Vᵀ
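The decomposition itself can be sketched with NumPy's own LAPACK-backed SVD (this is just the math being offloaded, not pyMIC code):

```python
import numpy as np

rng = np.random.RandomState(10)
M = rng.random_sample((6, 4))

# M = U * Sigma * V^T; full_matrices=False gives the thin decomposition
U, s, Vt = np.linalg.svd(M, full_matrices=False)

reconstructed = U @ np.diag(s) @ Vt
```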
Example: Singular Value Decomposition

[Figure: distribution of transfer operations in the SVD example, broken down into bind, copyin, and copyout; x-axis: data transferred [bytes], ranging from 8 bytes to 2 GB; y-axis: 0 to 4000.]
System configuration: Intel S2600GZ server with two Intel Xeon E5-2697 v2 12-core processors at 2.7 GHz (64 GB DDR3 at 1867 MHz), Red Hat Enterprise Linux 6.5 (kernel version 2.6.32-358.6.2) and Intel C600 IOH, one Intel Xeon Phi 7120P coprocessor (C0 stepping, GDDR5 at 3.6 GT/sec, driver v3.3-1, flash image/micro OS 2.1.02.0390), and Intel Composer XE 14.0.3.174. For more complete information visit http://www.intel.com/performance.
Performance: dgemm

[Figure: dgemm performance for matrix sizes 128 to 8064; series: MKL, Numpy (MKL), pyMIC (kernel only), pyMIC (incl. transfers); y-axis: 0 to 700 GFLOPS; x-axis: matrix size.]
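For reference, GFLOPS figures like those in the chart are conventionally derived from the 2·m·n·k flop count of dgemm divided by wall time. A host-only sketch of that bookkeeping (the chart's own numbers were measured on the coprocessor configuration listed below):

```python
import time
import numpy as np

def dgemm_gflops(m, n, k, seconds):
    # dgemm performs 2*m*n*k floating-point operations (multiply + add)
    return 2.0 * m * n * k / seconds / 1e9

m = n = k = 512
a = np.random.random((m, k))
b = np.random.random((k, n))

t0 = time.perf_counter()
c = a @ b                      # BLAS-backed matrix multiply on the host
elapsed = time.perf_counter() - t0

gflops = dgemm_gflops(m, n, k, max(elapsed, 1e-9))
```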
Summary & Future Work
• pyMIC
• A slim, easy-to-use offload interface for Python
• Native kernels on the target devices
• Almost negligible extra overhead for Python integration