cs179 2016 Lec13
Moving data is slow
So far we’ve only considered performance when the data is
already on the GPU
Moving data is important
Intelligently moving data lets us process data sets larger than
GPU global memory (~6 GB)
VRAM
Max VRAM on a GPU (Nvidia Maxwell): 12 GB
Max RAM on a CPU (Xeon E7): 1536 GB
VRAM is soldered directly onto the graphics card, which lets the memory
modules sit close to the GPU; this speeds up transfers between VRAM and
the GPU and increases graphics performance. Additionally, the sockets and
circuitry needed for expandable memory would likely drive up GPU prices.
Matrix transpose: another look
nvprof output shows the copies dominate; the kernel is only ~3% of total time:
Time(%)  Time      Calls  Avg       Name
49.35%   29.581ms  1      29.581ms  [CUDA memcpy DtoH]
47.48%   28.462ms  1      28.462ms  [CUDA memcpy HtoD]
 3.17%   1.9000ms  1      1.9000ms  naiveTransposeKernel
[Diagram: serial vs. pipelined execution. Serial: HD copy, kernel, DH copy
repeat one after another (HD 0, kernel 0, DH 0, HD 1, kernel 1, DH 1, ...).
Pipelined: while kernel i runs, the HD copy for input i+1 and the DH copy
for output i-1 proceed concurrently.]
Turning dreams into reality
What do we need to make the dream happen?
● hardware that can run 2 transfers and 1 kernel in parallel
● 2 input buffers
● 2 output buffers (the buffers are easy and up to the programmer)
Latency hiding checklist
Hardware:
● maximum of 4, 16, or 32 concurrent kernels
(depending on hardware) on CC >= 2.0
● 1 device→host copy engine
● 1 host→device copy engine
(2 copy engines exist only on newer hardware; some hardware
has a single copy engine shared between both directions)
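You can query what a given device supports at runtime. A minimal sketch using
cudaGetDeviceProperties (the concurrentKernels and asyncEngineCount fields are
part of the CUDA runtime API):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
printf("concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
// asyncEngineCount: 0 = no copy/compute overlap, 1 = one copy engine,
// 2 = independent host->device and device->host engines
printf("copy engines: %d\n", prop.asyncEngineCount);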
Asynchrony
An asynchronous function returns as soon as it is called.
cudaMemcpyAsync
Convenient asynchronous memcpy! Takes the same arguments as
normal cudaMemcpy (plus an optional stream):

while (1) {
    cudaMemcpyAsync(d_in, h_in, in_size, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_in, d_out);
    cudaMemcpyAsync(h_out, d_out, out_size, cudaMemcpyDeviceToHost);
}
The null / default stream
When no stream is specified, an operation starts only after all
other GPU operations have finished.
CPU code can run concurrently with the default stream.
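A minimal sketch of that overlap (do_cpu_work is a hypothetical host function):

kernel<<<grid, block>>>(d_in, d_out);  // launch is asynchronous, returns immediately
do_cpu_work();                         // CPU work runs while the kernel executes
cudaDeviceSynchronize();               // block the CPU until all GPU work finishes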
Stream example
cudaStream_t s[2];
cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);

for (int i = 0; i < 2; i++) {
    kernel<<<grid, block, shmem, s[i]>>>(d_outs[i], d_ins[i]);
    cudaMemcpyAsync(h_outs[i], d_outs[i], size,
                    cudaMemcpyDeviceToHost, s[i]);
}

for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(s[i]);
    cudaStreamDestroy(s[i]);
}

The kernels (and copies) in different streams can run in parallel!
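Adding the host→device copies to this pattern gives the pipelined execution
from the diagram earlier. A sketch under assumed names: pinned host buffers
h_in/h_out and device buffers d_in/d_out (two of each, size bytes apiece) and
a kernel called process:

cudaStream_t s[2];
cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
for (int chunk = 0; chunk < n_chunks; chunk++) {
    int i = chunk % 2;            // alternate between the two buffer pairs
    cudaStreamSynchronize(s[i]);  // wait until buffer pair i is free to reuse
    // consume h_out[i] and refill h_in[i] with the next chunk here...
    cudaMemcpyAsync(d_in[i], h_in[i], size, cudaMemcpyHostToDevice, s[i]);
    process<<<grid, block, 0, s[i]>>>(d_in[i], d_out[i]);
    cudaMemcpyAsync(h_out[i], d_out[i], size, cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();          // drain any remaining work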
CUDA events
Streams synchronize operations on the GPU (and cudaStreamSynchronize
can synchronize the CPU with a stream). Events provide finer-grained
synchronization points within a stream, and are commonly used for
timing GPU work, as in the macros below.
Events example
Timing macros (these assume cudaEvent_t start, stop are declared in the
enclosing scope):
#define START_TIMER() { \
gpuErrChk(cudaEventCreate(&start)); \
gpuErrChk(cudaEventCreate(&stop)); \
gpuErrChk(cudaEventRecord(start)); \
}
#define STOP_RECORD_TIMER(name) { \
gpuErrChk(cudaEventRecord(stop)); \
gpuErrChk(cudaEventSynchronize(stop)); \
gpuErrChk(cudaEventElapsedTime(&name, start, stop)); \
gpuErrChk(cudaEventDestroy(start)); \
gpuErrChk(cudaEventDestroy(stop)); \
}
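A minimal usage sketch (gpuErrChk, grid, block, and the kernel are assumed
to be defined elsewhere):

cudaEvent_t start;
cudaEvent_t stop;
float kernel_ms = -1;
START_TIMER();
kernel<<<grid, block>>>(d_in, d_out);
STOP_RECORD_TIMER(kernel_ms);   // kernel_ms now holds the elapsed time in ms
printf("kernel: %f ms\n", kernel_ms);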
Events methods
cudaEventRecord: records that an event has occurred. Recording
happens not at the time of the call, but after all preceding
operations on the GPU (in the event's stream) have finished.
CPU/GPU communication
Virtual Memory
Could give a week of lectures on virtual memory…
What does virtual memory give us?
● Each process can act like it is the only process running: the same
virtual address in different processes can point to different physical
addresses (and values).
● Each process can use more than the total system memory: pages of data
are stored on disk when there is no room in physical memory.
● The operating system can move pages around physical memory and disk
as needed.
Unified Virtual Addressing
On a 64-bit OS with a GPU of CC >= 2.0, CPU and GPU pointers share one
virtual address space, with each device's allocations occupying a
disjoint range. This makes it possible to figure out at runtime which
memory an address lives in.
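A sketch of that runtime lookup with cudaPointerGetAttributes (memoryType
is the field name in CUDA toolkits of this era; newer toolkits renamed it
to type):

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, ptr);         // ptr may be a host or device pointer
if (attr.memoryType == cudaMemoryTypeDevice)
    printf("ptr points to device memory\n");
else
    printf("ptr points to host memory\n");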
Unified Virtual Addressing/Memory
You can think of unified memory as "smart pinned memory": the driver is
allowed to cache the memory on the host or on any GPU.
Allocate with cudaMallocManaged; free with cudaFree.
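A minimal managed-memory sketch (N, grid, block, and the kernel are placeholders):

float *data;
cudaMallocManaged(&data, N * sizeof(float));  // one pointer, valid on CPU and GPU
for (int i = 0; i < N; i++)
    data[i] = i;                              // initialize on the host
kernel<<<grid, block>>>(data);                // driver migrates the pages to the GPU
cudaDeviceSynchronize();                      // must finish before the host touches data again
cudaFree(data);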
Virtual memory and the GPU
To move data from the CPU to the GPU, the GPU must access data on
the host, and the GPU is given a virtual address.
There are 2 options:
(1) For each word, have the CPU look up the physical address, then
perform the copy. Physical memory may be discontiguous. Slow!
(2) Tell the OS to keep a page at a fixed physical location ("pinning"),
and have the GPU directly access the host's physical memory
(direct memory access, a.k.a. DMA). Fast!
Memcpy
cudaMemcpy(Async) works in three steps:
1. Pin a host buffer in the driver.
2. Copy data from the user array into the pinned buffer.
3. Copy data from the pinned buffer to the GPU.
Commands are communicated to the GPU through a circular buffer:
the host writes commands and the device reads them.

Pinned host memory
Advantages:
(1) You can dereference pointers to pinned host buffers on the device!
(But this generates lots of PCI-Express (PCI-E) traffic.)
(2) cudaMemcpy is considerably faster when copying to/from pinned
host memory.
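A sketch of allocating pinned memory for fast copies (d_buf, size, and
stream are placeholders):

float *h_buf;
cudaMallocHost(&h_buf, size);        // page-locked host allocation
// ... fill h_buf ...
cudaMemcpyAsync(d_buf, h_buf, size,
                cudaMemcpyHostToDevice, stream);  // truly overlaps only with pinned memory
// ...
cudaFreeHost(h_buf);                 // pinned memory must be freed with cudaFreeHost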
Pinned host memory use cases
● self-referential data structures that are not easy to copy
(such as a linked list)
● deliver output as soon as possible (rather than waiting
for kernel completion and memcpy)
Disadvantages of pinning
Pinned pages limit the OS's freedom in memory management.
cudaMallocHost will fail (because no memory is available)
long before malloc will.