Graphics Processing Unit (GPU) Architecture and Programming: TU/e 5kk73 Zhenyu Ye Henk Corporaal 2011-11-15
System Architecture
GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)
ref: http://top500.org
ref: http://www.green500.org
Looks like the previous example, but SSE instructions execute on 4 ALUs.
Let's start with two important differences:
1. GPUs use threads instead of vectors.
2. GPUs have explicit "shared memory" spaces.
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);        // define threads: 2 blocks of 4 threads

__global__ void kernelF(A){         // all threads run the same kernel
    i = blockIdx.x;                 // each thread block has its id
    j = threadIdx.x;                // each thread has its id
    A[i][j]++;                      // each thread has a different i and j
}
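For completeness, a minimal host-side sketch of how such a kernel could be allocated and launched; the flat-pointer parameter, the dim3 launch syntax, and the printf check are additions for illustration, not from the slides:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernelF(int *A, int width){
    int i = blockIdx.x;                  // block id selects the row
    int j = threadIdx.x;                 // thread id selects the column
    A[i * width + j]++;                  // each thread increments one element
}

int main(){
    int h_A[2][4] = {0};
    int *d_A;
    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    kernelF<<<dim3(2,1), dim3(4,1)>>>(d_A, 4);   // 2 blocks of 4 threads

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    printf("%d\n", h_A[1][3]);                   // prints 1
    return 0;
}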
Example with shared memory: a 3x3 window average over a 16x16 tile.

kernelF<<<(1,1),(16,16)>>>(A);      // one block of 16x16 threads

__global__ void kernelF(A){
    __shared__ smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];           // load to smem
    __syncthreads();                // threads wait at barrier
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
                ...
              + smem[i+1][j+1] ) / 9;
}

Some threads finish the load earlier than others. Without the barrier, each thread would start the window operation as soon as it has loaded its own data element, reading neighbor elements that may not have been written to shared memory yet. With the barrier, every thread waits until all threads have hit it, so the whole tile is in shared memory before the window operation begins.
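A fully written-out variant might look like the sketch below; the name kernelF_full, the flat-pointer parameter, and the border handling (the outer ring of the tile is left untouched) are assumptions, since the slide does not spell them out:

__global__ void kernelF_full(int *A){
    __shared__ int smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i * 16 + j];                  // load the tile into shared memory
    __syncthreads();                             // wait until the whole tile is loaded
    if (i > 0 && i < 15 && j > 0 && j < 15) {    // assumption: skip the border elements
        A[i * 16 + j] = ( smem[i-1][j-1] + smem[i-1][j] + smem[i-1][j+1]
                        + smem[i  ][j-1] + smem[i  ][j  ] + smem[i  ][j+1]
                        + smem[i+1][j-1] + smem[i+1][j  ] + smem[i+1][j+1] ) / 9;
    }
}

It would be launched as kernelF_full<<<dim3(1,1), dim3(16,16)>>>(d_A); with a single 16x16 thread block, matching the launch configuration above.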
Vector SIMD can also have shared memory, for example the Cell architecture. Q: What are the fundamental differences between the SIMT and vector SIMD programming models?
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm0, [ &A[i][j] ]       // load 4 elements
    movups [ &B[i][j] ], xmm0       // copy A into buffer B
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm1, [ &B[i-1][j-1] ]   // start accumulating the neighbors
    addps  xmm1, [ &B[i-1][j] ]
    ...
    divps  xmm1, 9                  // divide the 4 sums by 9
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups [ &A[i][j] ], xmm1       // store the result back to A
}}
The CUDA SIMT version is kernelF as shown above:

__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j];               // load to smem
A[i][j] = ( ... + smem[i+1][j+1] ) / 9;
Programmers convert data level parallelism (DLP) into thread level parallelism (TLP).
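To make that conversion concrete, here is a small sketch; the names scale_cpu/scale_gpu and the launch configuration are illustrative, not from the slides. The loop iterations that express DLP on the CPU become one thread each on the GPU:

// CPU version: the loop over the data expresses data-level parallelism.
void scale_cpu(float *y, const float *x, float a, int n){
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

// GPU version: the same work expressed as thread-level parallelism;
// every thread handles the element selected by its global index.
__global__ void scale_gpu(float *y, const float *x, float a, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, partially filled block
        y[i] = a * x[i];
}
// launch, e.g.: scale_gpu<<<(n + 255) / 256, 256>>>(d_y, d_x, a, n);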
Example of an Implementation
Note: NVIDIA may use a more complicated implementation.
Example
Program:
Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution: warp 0 executes the add, warp 1 the sub.

The walkthrough steps through the pipeline stages in order:
Read Src Op -> Buffer Src Op -> Read Src Op (read source operands: r2 for warp 0, r5 for warp 1) -> Buffer Src Op -> Execute -> Execute -> Write back (write back: r0 for warp 0, r3 for warp 1).
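Purely as an illustration of the walkthrough above (a toy model, not the hardware design), the following snippet replays the stage sequence for the two warps:

#include <stdio.h>

/* Toy model only: replay the walkthrough, with warp 0 running the add
   and warp 1 running the sub through the same sequence of stages. */
int main(void){
    const char *stage[] = { "Read Src Op", "Buffer Src Op", "Read Src Op",
                            "Buffer Src Op", "Execute", "Execute", "Write back" };
    const char *inst[2] = { "0x0004: add r0, r1, r2",     /* warp 0 */
                            "0x0008: sub r3, r4, r5" };   /* warp 1 */
    for (int s = 0; s < 7; s++)
        for (int w = 0; w < 2; w++)
            printf("%-13s  warp %d  %s\n", stage[s], w, inst[w]);
    return 0;
}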
The threads of a warp are grouped together to execute the same instruction. A warp of 32 threads can be executed on 16 (or 8) PEs in 2 (or 4) cycles by time-multiplexing.
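The arithmetic behind that claim, as a small sketch (assuming one thread issues per PE per cycle):

#include <stdio.h>

/* Cycles to issue one warp when it is time-multiplexed over a smaller
   number of PEs, assuming one thread per PE per cycle. */
static int issue_cycles(int warp_size, int num_pes){
    return (warp_size + num_pes - 1) / num_pes;   /* round up */
}

int main(void){
    printf("32 threads on 16 PEs: %d cycles\n", issue_cycles(32, 16));  /* 2 */
    printf("32 threads on  8 PEs: %d cycles\n", issue_cycles(32, 8));   /* 4 */
    return 0;
}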
Summary
References
NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 2008. http://dx.doi.org/10.1109/MM.2008.31
Understanding Throughput-Oriented Architectures, Communications of the ACM, 2010. http://dx.doi.org/10.1145/1839676.1839694
GPUs and the Future of Parallel Computing, IEEE Micro, 2011. http://dx.doi.org/10.1109/MM.2011.89
An extended list of learning materials is on the assignment website: http://sites.google.com/site/5kk73gpu2011/materials