
OpenMP Tutorial

Seung-Jai Min
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN

ECE 563 Programming Parallel Machines

Parallel Programming Standards


Thread Libraries
  - Win32 API / POSIX threads
Compiler Directives  <-- OUR FOCUS
  - OpenMP (shared-memory programming)
Message Passing Libraries
  - MPI (distributed-memory programming)


Shared Memory Parallel Programming in the Multi-Core Era


Desktop and Laptop
  - 2, 4, 8 cores and ?

A single node in distributed-memory clusters
  - Steele cluster node: 2 x 8 (16) cores

Shared-memory hardware accelerators
  - Cell processors: 1 PPE and 8 SPEs
  - NVIDIA Quadro GPUs: 128 processing units


OpenMP: Some syntax details to get us started


Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]]

For Fortran, the directives take one of the forms:


C$OMP construct [clause [clause]]
!$OMP construct [clause [clause]]
*$OMP construct [clause [clause]]

Include files
#include <omp.h>
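
A minimal sketch of the C form above (the function, array a, and names n and tmp are made up for illustration; compile with an OpenMP flag such as -fopenmp):

  #include <omp.h>

  void scale(double *a, int n) {
      double tmp;
      /* construct = parallel for; clauses = private(tmp), schedule(static) */
      #pragma omp parallel for private(tmp) schedule(static)
      for (int i = 0; i < n; i++) {
          tmp  = 2.0 * a[i];
          a[i] = tmp;
      }
  }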

How is OpenMP typically used?


OpenMP is usually used to parallelize loops:
Find your most time-consuming loops and split them up between threads.

Sequential Program

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

Parallel Program

  #include <omp.h>

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }


How is OpenMP typically used?


(Cont.)

Single Program Multiple Data (SPMD)


Parallel Program

  #include <omp.h>

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }


How is OpenMP typically used?


(Cont.)

Single Program Multiple Data (SPMD)


Each thread executes the same program on its own quarter of the iteration space.
With 4 threads and N = 1000:

  Thread 0: lb = 0,   ub = 250
  Thread 1: lb = 250, ub = 500
  Thread 2: lb = 500, ub = 750
  Thread 3: lb = 750, ub = 1000

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    lb = ...;  ub = ...;           /* per-thread bounds as in the table above */
    for (i=lb; i<ub; i++) {
      A[i] = B[i] + k*C[i];
    }
  }
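A hedged sketch (not part of the slide) of how each SPMD copy could compute its own lb and ub from its thread ID; the even 250-iteration chunks assume 4 threads and N divisible by the team size:

  #include <omp.h>

  void spmd_loop(double *A, double *B, double *C, double k, int N) {
      #pragma omp parallel
      {
          int nthreads = omp_get_num_threads();
          int id       = omp_get_thread_num();
          int chunk    = N / nthreads;             /* 1000 / 4 = 250                   */
          int lb = id * chunk;                     /* thread 0: 0, thread 1: 250, ...  */
          int ub = (id == nthreads - 1) ? N : lb + chunk;
          for (int i = lb; i < ub; i++)
              A[i] = B[i] + k * C[i];
      }
  }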

OpenMP Fork-and-Join model


  printf("program begin\n");            /* Serial   */
  N = 1000;
  #pragma omp parallel for              /* Parallel */
  for (i=0; i<N; i++)
    A[i] = B[i] + C[i];
  M = 500;                              /* Serial   */
  #pragma omp parallel for              /* Parallel */
  for (j=0; j<M; j++)
    p[j] = q[j] * r[j];
  printf("program done\n");             /* Serial   */
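
Each of the two parallel-for pragmas above forks a team and joins it again. One common variation (a sketch, not from the slide) keeps a single parallel region alive and places both loops inside it as for worksharing constructs, so threads are forked only once:

  #pragma omp parallel
  {
      #pragma omp for
      for (i = 0; i < N; i++)
          A[i] = B[i] + C[i];

      #pragma omp for              /* implicit barrier separates the two loops */
      for (j = 0; j < M; j++)
          p[j] = q[j] * r[j];
  }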


OpenMP Constructs
1. Parallel Regions
   #pragma omp parallel

2. Worksharing
   #pragma omp for, #pragma omp sections

3. Data Environment
   #pragma omp parallel shared/private (...)

4. Synchronization
   #pragma omp barrier

5. Runtime functions / environment variables
   int my_thread_id = omp_get_thread_num();
   omp_set_num_threads(8);
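
A small sketch exercising the runtime calls in item 5 (the team size of 8 mirrors the slide and is otherwise arbitrary):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      omp_set_num_threads(8);                       /* request a team of 8 threads */
      #pragma omp parallel
      {
          int my_thread_id = omp_get_thread_num();  /* 0 .. team size - 1          */
          if (my_thread_id == 0)
              printf("team size = %d\n", omp_get_num_threads());
      }
      return 0;
  }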

OpenMP: Structured blocks


Most OpenMP constructs apply to structured blocks.
Structured block: one point of entry at the top and one point of exit at the bottom. The only branches allowed are STOP statements in Fortran and exit() in C/C++.


OpenMP: Structured blocks


A Structured Block
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto more;
  }
  printf("All done\n");

Not A Structured Block


  if (count == 1) goto more;
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto done;
  }
done:
  if (!really_done()) goto more;


Structured Block Boundaries


In C/C++: a block is a single statement or a group of statements between braces { }.

  #pragma omp parallel
  {
    id = omp_get_thread_num();
    A[id] = big_compute(id);
  }

  #pragma omp for
  for (I=0; I<N; I++) {
    res[I] = big_calc(I);
    A[I] = B[I] + res[I];
  }


Structured Block Boundaries


In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.
C$OMP PARALLEL
10    W(id) = garbage(id)
      res(id) = W(id)**2
      if (res(id)) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do I = 1, N
        res(I) = bigComp(I)
      end do
C$OMP END PARALLEL DO


OpenMP Parallel Regions


  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    pooh(ID, A);
  }
  printf("all done\n");

Execution picture: omp_set_num_threads(4) requests four threads; the parallel region
forks them, and they call pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A) concurrently.

A single copy of A is shared between all threads.

Implicit barrier: threads wait here for all threads to finish before proceeding to
printf("all done\n").


The OpenMP API: Combined parallel work-share


OpenMP shortcut: Put the parallel and the work-share on the same line
  int i;
  double res[MAX];
  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

is the same OpenMP program as

  int i;
  double res[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }



Shared Memory Model


[Figure: five threads, each with its own private memory, all attached to a common shared memory.]

Data can be shared or private
Shared data is accessible by all threads
Private data can be accessed only by the thread that owns it
Data transfer is transparent to the programmer
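
A short sketch (not from the slide) of the distinction: total lives in shared memory and is visible to every thread, while x is declared private, so each thread works on its own copy:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double total = 0.0;               /* shared: one copy seen by all threads        */
      double x;                         /* listed private below: one copy per thread   */
      #pragma omp parallel private(x)
      {
          x = (double) omp_get_thread_num();
          #pragma omp critical          /* updates to shared data must be synchronized */
          total += x;
      }
      printf("total = %f\n", total);    /* 0 + 1 + ... + (team size - 1)               */
      return 0;
  }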


Default storage attributes


Shared memory programming model:
  Variables are shared by default

Distributed memory programming model:
  All variables are private


Default storage attributes


Global variables are SHARED among threads
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file-scope variables, static variables

But not everything is shared...
  - Stack variables in sub-programs called from parallel regions are PRIVATE
  - Automatic variables within a statement block are PRIVATE
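
A small sketch of these defaults (the names counter, local, and work are made up): the file-scope counter is shared by every thread, while local, a stack variable inside the called function, is private to each calling thread:

  #include <omp.h>

  int counter = 0;                 /* file scope: SHARED by default                   */

  void work(int id) {
      int local = id * id;         /* stack variable in a function called from the    */
                                   /* parallel region: PRIVATE to the calling thread  */
      #pragma omp critical         /* the shared counter still needs synchronization  */
      counter += local;
  }

  int main(void) {
      #pragma omp parallel
      work(omp_get_thread_num());
      return 0;
  }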


Data Environment
  int A[100][100];             /* (global) SHARED */

  int main()
  {
    int ii, jj;                /* PRIVATE */
    int B[100][100];           /* SHARED  */
    #pragma omp parallel private(jj)
    {
      int kk = 1;              /* PRIVATE */
      #pragma omp for
      for (ii=0; ii<100; ii++)
        for (jj=0; jj<100; jj++)
          A[ii][jj] = foo(B[ii][jj]);
    }
  }

  int foo(int x)
  {
    int count = 0;             /* PRIVATE */
    return x * count;
  }

Work Sharing Construct


Loop Construct
  #pragma omp for [clause[[,] clause] ...] new-line
    for-loops

  where clause is one of the following:
    private / firstprivate / lastprivate (list)
    reduction (operator: list)
    schedule (kind[, chunk_size])
    collapse (n)
    ordered
    nowait
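
A hedged sketch combining a few of these clauses (firstprivate, lastprivate, schedule, nowait); the array names and the value of N are made up for illustration:

  #include <omp.h>
  #define N 1000

  void scale_array(double *a, const double *b) {
      double scale = 2.0, last = 0.0;
      #pragma omp parallel
      {
          #pragma omp for firstprivate(scale) lastprivate(last) \
                          schedule(static, 64) nowait
          for (int i = 0; i < N; i++) {
              a[i] = scale * b[i];   /* each thread starts with its own copy of scale     */
              last = a[i];           /* after the loop, last holds the value from i = N-1 */
          }
          /* nowait: threads skip the implied barrier at the end of the for construct */
      }
      (void) last;                   /* silence unused-variable warnings in this sketch */
  }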

Schedule
  for (i=0; i<1100; i++)
    A[i] = ...;

  #pragma omp parallel for schedule(static, 250)      /* or schedule(static) */

    Chunks of 250 iterations: 250 | 250 | 250 | 250 | 100,
    assigned round-robin to p0 p1 p2 p3 p0.
    With plain schedule(static) the 1100 iterations are split evenly:
    275 | 275 | 275 | 275 to p0 p1 p2 p3.

  #pragma omp parallel for schedule(dynamic, 200)

    Chunks of 200 iterations: 200 | 200 | 200 | 200 | 200 | 100,
    handed to whichever thread is idle, e.g. p3 p0 p2 p3 p1 p0.

  #pragma omp parallel for schedule(guided, 100)

    Chunk sizes start large and shrink toward the chunk_size of 100:
    137 | 120 | 105 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 38,
    assigned dynamically, e.g. p0 p3 p0 p1 p2 p3 p0 p1 p2 p3 p0.

  #pragma omp parallel for schedule(auto)

    The scheduling decision is delegated to the compiler and/or runtime system.
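
A sketch (assuming OpenMP 3.0 or later) of deferring the choice: schedule(runtime) reads the kind and chunk size from the OMP_SCHEDULE environment variable or from omp_set_schedule(), so the options above can be compared without recompiling:

  #include <omp.h>
  #define N 1100

  void fill(double *A) {
      /* e.g.  export OMP_SCHEDULE="dynamic,200"  before running */
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < N; i++)
          A[i] = (double) i;
  }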



Critical Construct
  sum = 0;
  #pragma omp parallel private(lsum)
  {
    lsum = 0;
    #pragma omp for
    for (i=0; i<N; i++) {
      lsum = lsum + A[i];
    }
    #pragma omp critical
    {
      sum += lsum;      /* threads wait their turn; only one thread at a time
                           executes the critical section */
    }
  }

Reduction Clause
  sum = 0;             /* shared variable */
  #pragma omp parallel for reduction(+:sum)
  for (i=0; i<N; i++) {
    sum = sum + A[i];
  }
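
For comparison, a fragment (reusing A, N, and i from the slide) that accumulates two results at once; the max reduction operator for C/C++ assumes OpenMP 3.1 or later:

  double sum = 0.0, biggest = A[0];
  #pragma omp parallel for reduction(+:sum) reduction(max:biggest)
  for (i = 0; i < N; i++) {
      sum = sum + A[i];
      if (A[i] > biggest) biggest = A[i];   /* per-thread maxima are combined at the end */
  }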


Performance Evaluation
How do we measure performance? (or how do we remove noise?)
  #define N 24000

  for (k=0; k<10; k++) {
    #pragma omp parallel for private(i, j)
    for (i=1; i<N-1; i++)
      for (j=1; j<N-1; j++)
        a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
  }
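
One way (a sketch, not from the slide) to reduce the noise: time each of the 10 repetitions with omp_get_wtime() and report the fastest one, so thread start-up cost and cold caches are filtered out:

  double best = 1.0e30;
  for (k = 0; k < 10; k++) {
      double t0 = omp_get_wtime();
      #pragma omp parallel for private(i, j)
      for (i = 1; i < N-1; i++)
          for (j = 1; j < N-1; j++)
              a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
      double t1 = omp_get_wtime();
      if (t1 - t0 < best)
          best = t1 - t0;            /* keep the fastest repetition */
  }
  printf("best time = %f s\n", best);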

Performance Issues

What if you see a speedup saturation?

  [Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); the curve flattens as CPUs are added.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;

Performance Issues
What if you see a speedup saturation?

  [Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); again the curve saturates.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = b[i][j];


Loop Scheduling
Any guideline for a chunk size?
  #define N <big-number>

  chunk = ???;
  #pragma omp parallel for schedule(static, chunk)
  for (i=2; i<N-2; i++)
    a[i] = ( b[i-2] + b[i-1] + b[i]
           + b[i+1] + b[i+2] ) / 5.0;

Performance Issues
Load imbalance: triangular access pattern
  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=i; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;
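
Because the inner loop shrinks as i grows, a plain static partition gives the first threads much more work than the last ones. A sketch of one common remedy (the chunk size of 32 is arbitrary): let the runtime hand out small chunks of outer iterations as threads become free, or interleave them with schedule(static, 1):

  #define N 12000
  #pragma omp parallel for private(j) schedule(dynamic, 32)
  for (i = 1; i < N-1; i++)
      for (j = i; j < N-1; j++)
          a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                   + b[i-1][j] + b[i+1][j]) / 5.0;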


Summary
OpenMP has advantages
  - Incremental parallelization
  - Compared to MPI:
    - No data partitioning
    - No communication scheduling


Resources

http://www.openmp.org
http://openmp.org/wp/resources

