pyMIC: A Python Offload Module for the Intel Xeon Phi™ Coprocessor
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
Copyright © 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel
Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations
that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets
and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Python in HPC
The pyMIC Offload Infrastructure
• pyMIC facts
• 3800 lines of C/C++ code;
• 1100 lines of Python code for the main API;
• LIBXSTREAM and Intel® LEO for interfacing with MPSS
High-Level Overview
• pymic [Python]: main API, including the OffloadArray container
• _pyMICimpl [C/C++]: C/C++ extension module for low-level device management and interaction with LEO
• (kernels) [C]: native kernel code loaded by the application
Example dgemm: The Host Side…

numpy (host only):

import numpy as np
am = np.matrix(a)
bm = np.matrix(b)
cm = np.matrix(c)
cm = alpha * am * bm + beta * cm

pyMIC (offloaded):

import pymic as mic
device = mic.devices[0]
stream = device.get_default_stream()
library = device.load_library("libdgemm.so")
stream.invoke(library.dgemm_kernel,
              a, b, c,
              m, n, k, alpha, beta)
stream.sync()
Example dgemm: The Host Side…

• Get a device handle (numbered from 0 to n-1) and its default stream
• Load the kernel code as an object library
• Invoke the kernel function and pass actual arguments

import pymic as mic
import numpy as np

device = mic.devices[0]
stream = device.get_default_stream()
library = device.load_library("libdgemm.so")

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.empty((m, n))

stream.invoke(library.dgemm_kernel,
              a, b, c,
              m, n, k, alpha, beta)
stream.sync()
Example dgemm: The Target Side…

• Arguments are passed as C/C++ types
• All argument passing is done with pointers to the actual data
• Invoke the (native) dgemm kernel

#include <pymic_kernel.h>
#include <mkl.h>

PYMIC_KERNEL
void dgemm_kernel(const double *A, const double *B, double *C,
                  const int64_t *m, const int64_t *n, const int64_t *k,
                  const double *alpha, const double *beta) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                *m, *n, *k, *alpha, A, *k, B, *n,
                *beta, C, *n);
}
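The kernel above is a plain BLAS call computing C = alpha·A·B + beta·C in row-major storage. A host-only NumPy sketch of the same math (no pyMIC involved), spelling out the loop nest the BLAS routine performs:

```python
import numpy as np

# Small stand-in for the dgemm kernel's math: C = alpha*A@B + beta*C.
m, n, k = 3, 4, 5
alpha, beta = 2.0, 0.5
rng = np.random.RandomState(10)
A = rng.random_sample((m, k))
B = rng.random_sample((k, n))
C = rng.random_sample((m, n))

expected = alpha * (A @ B) + beta * C

# The same result written out explicitly, as cblas_dgemm computes it:
result = C.copy()
for i in range(m):
    for j in range(n):
        acc = 0.0
        for p in range(k):
            acc += A[i, p] * B[p, j]
        result[i, j] = alpha * acc + beta * C[i, j]

assert np.allclose(result, expected)
```

Note that with row-major storage the leading dimensions passed to cblas_dgemm are the row lengths: *k for A, *n for B and C, matching the kernel above.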
High-level Data Structures

OffloadDevice
• Interaction with devices

OffloadStream
• Invocation of kernel functions
• Buffer management

OffloadArray
• numpy.ndarray container
• Simple kernels and operators (fill, +, *)
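A host-only mock can illustrate the container semantics of OffloadArray. MockOffloadArray is invented for this sketch; the real OffloadArray keeps a device-side buffer and runs these operators as kernels on the coprocessor:

```python
import numpy as np

class MockOffloadArray:
    """Host-only stand-in mimicking OffloadArray's fill/+/* semantics."""

    def __init__(self, array):
        self.array = np.asarray(array, dtype=float)

    def fill(self, value):
        # the real container runs a fill kernel on the device
        self.array[:] = value
        return self

    def __add__(self, other):
        other = other.array if isinstance(other, MockOffloadArray) else other
        return MockOffloadArray(self.array + other)

    def __mul__(self, other):
        other = other.array if isinstance(other, MockOffloadArray) else other
        return MockOffloadArray(self.array * other)

    def update_host(self):
        # the real container copies the device buffer back first
        return self.array

oa = MockOffloadArray(np.zeros(4)).fill(2.0)
ob = MockOffloadArray(np.arange(4.0))
oc = (oa + ob) * ob
```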
Optimize Offloads with High-level Containers

• Get a device handle (numbered from 0 to n-1) and its default stream
• Load the kernel code as an object library
• Use bind to create an offload buffer for host data
• Invoke the kernel function and pass the offload buffers

import pymic as mic
import numpy as np

device = mic.devices[0]
stream = device.get_default_stream()

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.zeros((m, n))

offl_a = device.bind(a)
offl_b = device.bind(b)
offl_c = device.bind(c)
The High-level Offload Protocol

[Diagram: the host process drives the target process; device.load_library(), a.update_device(), stream.invoke(…), and a.update_host() each trigger a corresponding action in the target process.]

Host side:

import pymic as mic
import numpy as np

device = mic.devices[0]
libr = device.load_library("libdgemm.so")
stream = device.get_default_stream()

m, n, k = 4096, 4096, 4096
alpha = 1.0
beta = 0.0
np.random.seed(10)
a = np.random.random(m*k).reshape((m, k))
b = np.random.random(k*n).reshape((k, n))
c = np.zeros((m, n))

offl_a = stream.bind(a)
offl_b = stream.bind(b)
offl_c = stream.bind(c)

stream.invoke(libr.dgemm_kernel,
              offl_a, offl_b, offl_c,
              m, n, k,
              alpha, beta)
offl_c.update_host()
stream.sync()

Target side:

#include <pymic_kernel.h>
#include <mkl.h>

PYMIC_KERNEL
void dgemm_kernel(const double *A, const double *B, double *C,
                  const int64_t *m, const int64_t *n, const int64_t *k,
                  const double *alpha, const double *beta) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                *m, *n, *k, *alpha, A, *k, B, *n,
                *beta, C, *n);
}
Buffer Management: Buffer Creation

class OffloadStream:
    def bind(self, array, update_device=True):
        if not isinstance(array, numpy.ndarray):
            raise ValueError("only numpy.ndarray can be associated "
                             "with OffloadArray")
        # detect the order of storage for 'array'
        if array.flags.c_contiguous:
            order = "C"
        elif array.flags.f_contiguous:
            order = "F"
        else:
            raise ValueError("could not detect storage order")

class OffloadStream:
    def allocate_device_memory(self, nbytes, alignment=64, sticky=False):
        device = self._device_id
        if nbytes <= 0:
            raise ValueError('Cannot allocate negative amount of '
                             'memory: {0}'.format(nbytes))
        device_ptr = _pymic_impl_stream_allocate(device, self._stream_id,
                                                 nbytes, alignment)
        return SmartPtr(self, device, device_ptr, sticky)
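The storage-order detection in bind() relies only on NumPy's layout flags, so it can be exercised standalone (a host-only sketch, independent of pyMIC):

```python
import numpy as np

def detect_order(array):
    """Same storage-order check bind() performs via numpy's flags."""
    if array.flags.c_contiguous:
        return "C"
    elif array.flags.f_contiguous:
        return "F"
    raise ValueError("could not detect storage order")

c_arr = np.zeros((4, 4), order="C")    # row-major
f_arr = np.zeros((4, 4), order="F")    # column-major
view = np.zeros((8, 8))[::2, ::2]      # strided view: neither layout

orders = (detect_order(c_arr), detect_order(f_arr))

try:
    detect_order(view)
    raised = False
except ValueError:
    raised = True
```

The strided-view case is exactly why bind() must reject such arrays: a non-contiguous buffer cannot be transferred with a single flat copy.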
Buffer Management: Data Transfer

class OffloadArray:
    def update_device(self):
        host_ptr = self.array.ctypes.get_data()
        s = self.stream
        s.transfer_host2device(host_ptr,
                               self._device_ptr,
                               self._nbytes)
        return None

    def update_host(self):
        host_ptr = self.array.ctypes.get_data()
        s = self.stream
        s.transfer_device2host(self._device_ptr,
                               host_ptr,
                               self._nbytes)
        return self

void buffer_copy_to_target(int device,
                           libxstream_stream *stream,
                           unsigned char *src,
                           unsigned char *dst,
                           size_t size,
                           size_t offset_host,
                           size_t offset_device) {
    unsigned char *src_offs = src + offset_host;
    unsigned char *dst_offs = dst + offset_device;
    libxstream_memcpy_h2d(src_offs, dst_offs, size, stream);
}
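update_device() and update_host() ultimately perform a flat copy of _nbytes between the host array and the device buffer. A host-only emulation with ctypes.memmove standing in for the h2d/d2h transfer (pyMIC itself is not needed for this sketch):

```python
import ctypes
import numpy as np

src = np.arange(8.0)        # plays the role of the host array
dst = np.empty_like(src)    # plays the role of the device buffer

# same size computation pyMIC uses for a transfer
nbytes = src.dtype.itemsize * src.size

# .ctypes.data yields the raw address, like .ctypes.get_data() above
ctypes.memmove(dst.ctypes.data, src.ctypes.data, nbytes)
```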
The Low-level Offload Protocol

[Diagram: the host process drives the target process; high-level calls such as offl_c.update_host() and stream.sync() are implemented by the low-level operations stream.transfer_device2host(…) and stream.deallocate_device_memory(…).]
Using the Low-level API

import pymic
import numpy

device = pymic.devices[0]
stream = device.get_default_stream()

a = numpy.arange(0.0, 32.0)
b = numpy.empty_like(a)

# determine size of the array in bytes and get pointer
nbytes = a.dtype.itemsize * a.size
ptr_a = a.ctypes.data
ptr_b = b.ctypes.data

# allocate buffer spaces in the target
dev_ptr_1 = stream.allocate_device_memory(nbytes)
dev_ptr_2 = stream.allocate_device_memory(nbytes)

# transfer a into the first buffer and shuffle a bit
stream.transfer_host2device(ptr_a, dev_ptr_1, nbytes/2,
                            offset_host=0, offset_device=nbytes/2)
stream.transfer_host2device(ptr_a, dev_ptr_1, nbytes/2,
                            offset_host=nbytes/2, offset_device=0)

# do some more shuffling on the target
for i in xrange(0, 4):
    stream.transfer_device2device(dev_ptr_1, dev_ptr_2, nbytes/4,
                                  off_dev_src=i*(nbytes/4),
                                  off_dev_dst=(3-i)*(nbytes/4))

# transfer data back into 'b' array and shuffle even more
for i in xrange(0, 4):
    stream.transfer_device2host(dev_ptr_2, ptr_b, nbytes/4,
                                offset_device=i*(nbytes/4),
                                offset_host=(3-i)*(nbytes/4))
stream.sync()
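The net effect of the three shuffle phases can be verified with a host-only NumPy simulation (plain arrays stand in for the two device buffers, and element offsets replace byte offsets):

```python
import numpy as np

a = np.arange(0.0, 32.0)
b = np.empty_like(a)
n = a.size
q = n // 4

buf1 = np.empty_like(a)   # stand-in for dev_ptr_1
buf2 = np.empty_like(a)   # stand-in for dev_ptr_2

# phase 1: the two halves of 'a' land swapped in buf1
buf1[n // 2:] = a[:n // 2]
buf1[:n // 2] = a[n // 2:]

# phase 2: quarters of buf1 land in reverse order in buf2
for i in range(4):
    buf2[(3 - i) * q:(4 - i) * q] = buf1[i * q:(i + 1) * q]

# phase 3: quarters of buf2 land in reverse order in b
for i in range(4):
    b[(3 - i) * q:(4 - i) * q] = buf2[i * q:(i + 1) * q]

# the two quarter-reversals cancel, so only the half-swap survives
assert np.array_equal(b, np.roll(a, n // 2))
```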
Example: Singular Value Decomposition
• Decompose matrix:
M = U × Σ × Vᵀ
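The decomposition itself can be sketched with NumPy's own LAPACK-backed SVD (this is just the math being offloaded, not pyMIC code):

```python
import numpy as np

rng = np.random.RandomState(10)
M = rng.random_sample((6, 4))

# M = U * Sigma * V^T; full_matrices=False gives the thin decomposition
U, s, Vt = np.linalg.svd(M, full_matrices=False)

reconstructed = U @ np.diag(s) @ Vt
```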
Example: Singular Value Decomposition

[Figure: distribution of transfer operations in the SVD example, broken down into bind, copyin, and copyout; x-axis: data transferred [bytes], ranging from 8 bytes to 2 GB; y-axis: 0 to 4000.]
System configuration: Intel S2600GZ server with two Intel Xeon E5-2697 v2 12-core processors at 2.7 GHz (64 GB DDR3 at 1867 MHz), Red Hat Enterprise Linux 6.5 (kernel version 2.6.32-358.6.2) and Intel C600 IOH, one Intel Xeon Phi 7120P coprocessor (C0 stepping, GDDR5 at 3.6 GT/sec, driver v3.3-1, flash image/micro OS 2.1.02.0390), and Intel Composer XE 14.0.3.174. For more complete information visit http://www.intel.com/performance.
Performance: dgemm

[Figure: dgemm performance for matrix sizes 128 to 8064; series: MKL, Numpy (MKL), pyMIC (kernel only), pyMIC (incl. transfers); y-axis: 0 to 700 GFLOPS; x-axis: matrix size.]
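For reference, GFLOPS figures like those in the chart are conventionally derived from the 2·m·n·k flop count of dgemm divided by wall time. A host-only sketch of that bookkeeping (the chart's own numbers were measured on the coprocessor configuration listed below):

```python
import time
import numpy as np

def dgemm_gflops(m, n, k, seconds):
    # dgemm performs 2*m*n*k floating-point operations (multiply + add)
    return 2.0 * m * n * k / seconds / 1e9

m = n = k = 512
a = np.random.random((m, k))
b = np.random.random((k, n))

t0 = time.perf_counter()
c = a @ b                      # BLAS-backed matrix multiply on the host
elapsed = time.perf_counter() - t0

gflops = dgemm_gflops(m, n, k, max(elapsed, 1e-9))
```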
Summary & Future Work
• pyMIC
• A slim, easy-to-use offload interface for Python
• Native kernels on the target devices
• Almost negligible extra overhead for Python integration