Graphics Processing Unit (GPU) Architecture and Programming: TU/e 5kk73 Zhenyu Ye Henk Corporaal 2011-11-15
System Architecture
GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)
ref: http://top500.org
ref: http://www.green500.org
Looks like the previous example, but SSE instructions execute on 4 ALUs.
Let's start with two important differences:
1. GPUs use threads instead of vectors.
2. GPUs have explicit "shared memory" spaces.
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);        // define threads: 2 blocks of 4 threads

__global__ void kernelF(A){         // all threads run the same kernel
    i = blockIdx.x;                 // each thread block has its id
    j = threadIdx.x;                // each thread has its id
    A[i][j]++;                      // each thread has a different i and j
}
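For completeness, a minimal host-side sketch of how such a kernel could be allocated and launched; the flat-pointer parameter, the dim3 launch syntax, and the printf check are additions for illustration, not from the slides:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernelF(int *A, int width){
    int i = blockIdx.x;                  // block id selects the row
    int j = threadIdx.x;                 // thread id selects the column
    A[i * width + j]++;                  // each thread increments one element
}

int main(){
    int h_A[2][4] = {0};
    int *d_A;
    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    kernelF<<<dim3(2,1), dim3(4,1)>>>(d_A, 4);   // 2 blocks of 4 threads

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    printf("%d\n", h_A[1][3]);                   // prints 1
    return 0;
}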
Example with shared memory: a 3x3 window average over a 16x16 tile.

kernelF<<<(1,1),(16,16)>>>(A);      // one block of 16x16 threads

__global__ void kernelF(A){
    __shared__ smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];           // load to smem
    __syncthreads();                // threads wait at barrier
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
                ...
              + smem[i+1][j+1] ) / 9;
}

Some threads finish the load earlier than others. Without the barrier, each thread would start the window operation as soon as it has loaded its own data element, reading neighbor elements that may not have been written to shared memory yet. With the barrier, every thread waits until all threads have hit it, so the whole tile is in shared memory before the window operation begins.
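A fully written-out variant might look like the sketch below; the name kernelF_full, the flat-pointer parameter, and the border handling (the outer ring of the tile is left untouched) are assumptions, since the slide does not spell them out:

__global__ void kernelF_full(int *A){
    __shared__ int smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i * 16 + j];                  // load the tile into shared memory
    __syncthreads();                             // wait until the whole tile is loaded
    if (i > 0 && i < 15 && j > 0 && j < 15) {    // assumption: skip the border elements
        A[i * 16 + j] = ( smem[i-1][j-1] + smem[i-1][j] + smem[i-1][j+1]
                        + smem[i  ][j-1] + smem[i  ][j  ] + smem[i  ][j+1]
                        + smem[i+1][j-1] + smem[i+1][j  ] + smem[i+1][j+1] ) / 9;
    }
}

It would be launched as kernelF_full<<<dim3(1,1), dim3(16,16)>>>(d_A); with a single 16x16 thread block, matching the launch configuration above.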
Vector SIMD can also have shared memory, for example the Cell architecture. Q: What are the fundamental differences between the SIMT and vector SIMD programming models?
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm0, [ &A[i][j] ]       // load 4 elements
    movups [ &B[i][j] ], xmm0       // copy A into buffer B
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm1, [ &B[i-1][j-1] ]   // start accumulating the neighbors
    addps  xmm1, [ &B[i-1][j] ]
    ...
    divps  xmm1, 9                  // divide the 4 sums by 9
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups [ &A[i][j] ], xmm1       // store the result back to A
}}
The CUDA SIMT version is kernelF as shown above:

__shared__ smem[16][16];
i = threadIdx.y;
j = threadIdx.x;
smem[i][j] = A[i][j];               // load to smem
A[i][j] = ( ... + smem[i+1][j+1] ) / 9;
Programmers convert data level parallelism (DLP) into thread level parallelism (TLP).
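To make that conversion concrete, here is a small sketch; the names scale_cpu/scale_gpu and the launch configuration are illustrative, not from the slides. The loop iterations that express DLP on the CPU become one thread each on the GPU:

// CPU version: the loop over the data expresses data-level parallelism.
void scale_cpu(float *y, const float *x, float a, int n){
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

// GPU version: the same work expressed as thread-level parallelism;
// every thread handles the element selected by its global index.
__global__ void scale_gpu(float *y, const float *x, float a, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, partially filled block
        y[i] = a * x[i];
}
// launch, e.g.: scale_gpu<<<(n + 255) / 256, 256>>>(d_y, d_x, a, n);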
Example of an Implementation
Note: NVIDIA may use a more complicated implementation.
Example
Program:
Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution: warp 0 executes the add, warp 1 the sub.

The walkthrough steps through the pipeline stages in order:
Read Src Op -> Buffer Src Op -> Read Src Op (read source operands: r2 for warp 0, r5 for warp 1) -> Buffer Src Op -> Execute -> Execute -> Write back (write back: r0 for warp 0, r3 for warp 1).
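Purely as an illustration of the walkthrough above (a toy model, not the hardware design), the following snippet replays the stage sequence for the two warps:

#include <stdio.h>

/* Toy model only: replay the walkthrough, with warp 0 running the add
   and warp 1 running the sub through the same sequence of stages. */
int main(void){
    const char *stage[] = { "Read Src Op", "Buffer Src Op", "Read Src Op",
                            "Buffer Src Op", "Execute", "Execute", "Write back" };
    const char *inst[2] = { "0x0004: add r0, r1, r2",     /* warp 0 */
                            "0x0008: sub r3, r4, r5" };   /* warp 1 */
    for (int s = 0; s < 7; s++)
        for (int w = 0; w < 2; w++)
            printf("%-13s  warp %d  %s\n", stage[s], w, inst[w]);
    return 0;
}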
The threads of a warp are grouped together to execute the same instruction. A warp of 32 threads can be executed on 16 (or 8) PEs in 2 (or 4) cycles by time-multiplexing.
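The arithmetic behind that claim, as a small sketch (assuming one thread issues per PE per cycle):

#include <stdio.h>

/* Cycles to issue one warp when it is time-multiplexed over a smaller
   number of PEs, assuming one thread per PE per cycle. */
static int issue_cycles(int warp_size, int num_pes){
    return (warp_size + num_pes - 1) / num_pes;   /* round up */
}

int main(void){
    printf("32 threads on 16 PEs: %d cycles\n", issue_cycles(32, 16));  /* 2 */
    printf("32 threads on  8 PEs: %d cycles\n", issue_cycles(32, 8));   /* 4 */
    return 0;
}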
Summary
References
NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 2008. http://dx.doi.org/10.1109/MM.2008.31
Understanding Throughput-Oriented Architectures, Communications of the ACM, 2010. http://dx.doi.org/10.1145/1839676.1839694
GPUs and the Future of Parallel Computing, IEEE Micro, 2011. http://dx.doi.org/10.1109/MM.2011.89
An extended list of learning materials is on the assignment website: http://sites.google.com/site/5kk73gpu2011/materials