
OpenMP Tutorial

Seung-Jai Min
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN

ECE 563 Programming Parallel Machines

Parallel Programming Standards


Thread Libraries
  - Win32 API / POSIX threads
Compiler Directives  <-- OUR FOCUS
  - OpenMP (shared-memory programming)
Message Passing Libraries
  - MPI (distributed-memory programming)


Shared Memory Parallel Programming in the Multi-Core Era


Desktop and Laptop
  - 2, 4, 8 cores and ?

A single node in distributed-memory clusters
  - Steele cluster node: 2 x 8 (16) cores

Shared-memory hardware accelerators
  - Cell processors: 1 PPE and 8 SPEs
  - NVIDIA Quadro GPUs: 128 processing units


OpenMP: Some syntax details to get us started


Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]]

For Fortran, the directives take one of the forms:


C$OMP construct [clause [clause]]
!$OMP construct [clause [clause]]
*$OMP construct [clause [clause]]

Include files
#include <omp.h>
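
A minimal sketch of the C form above (the function, array a, and names n and tmp are made up for illustration; compile with an OpenMP flag such as -fopenmp):

  #include <omp.h>

  void scale(double *a, int n) {
      double tmp;
      /* construct = parallel for; clauses = private(tmp), schedule(static) */
      #pragma omp parallel for private(tmp) schedule(static)
      for (int i = 0; i < n; i++) {
          tmp  = 2.0 * a[i];
          a[i] = tmp;
      }
  }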

How is OpenMP typically used?


OpenMP is usually used to parallelize loops:
Find your most time-consuming loops and split them up between threads.

Sequential Program

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

Parallel Program

  #include <omp.h>

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }


How is OpenMP typically used?


(Cont.)

Single Program Multiple Data (SPMD)


Parallel Program

  #include <omp.h>

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }


How is OpenMP typically used?


(Cont.)

Single Program Multiple Data (SPMD)


Each thread executes the same program on its own quarter of the iteration space.
With 4 threads and N = 1000:

  Thread 0: lb = 0,   ub = 250
  Thread 1: lb = 250, ub = 500
  Thread 2: lb = 500, ub = 750
  Thread 3: lb = 750, ub = 1000

  void main()
  {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    lb = ...;  ub = ...;           /* per-thread bounds as in the table above */
    for (i=lb; i<ub; i++) {
      A[i] = B[i] + k*C[i];
    }
  }
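A hedged sketch (not part of the slide) of how each SPMD copy could compute its own lb and ub from its thread ID; the even 250-iteration chunks assume 4 threads and N divisible by the team size:

  #include <omp.h>

  void spmd_loop(double *A, double *B, double *C, double k, int N) {
      #pragma omp parallel
      {
          int nthreads = omp_get_num_threads();
          int id       = omp_get_thread_num();
          int chunk    = N / nthreads;             /* 1000 / 4 = 250                   */
          int lb = id * chunk;                     /* thread 0: 0, thread 1: 250, ...  */
          int ub = (id == nthreads - 1) ? N : lb + chunk;
          for (int i = lb; i < ub; i++)
              A[i] = B[i] + k * C[i];
      }
  }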

OpenMP Fork-and-Join model


  printf("program begin\n");            /* Serial   */
  N = 1000;
  #pragma omp parallel for              /* Parallel */
  for (i=0; i<N; i++)
    A[i] = B[i] + C[i];
  M = 500;                              /* Serial   */
  #pragma omp parallel for              /* Parallel */
  for (j=0; j<M; j++)
    p[j] = q[j] * r[j];
  printf("program done\n");             /* Serial   */
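
Each of the two parallel-for pragmas above forks a team and joins it again. One common variation (a sketch, not from the slide) keeps a single parallel region alive and places both loops inside it as for worksharing constructs, so threads are forked only once:

  #pragma omp parallel
  {
      #pragma omp for
      for (i = 0; i < N; i++)
          A[i] = B[i] + C[i];

      #pragma omp for              /* implicit barrier separates the two loops */
      for (j = 0; j < M; j++)
          p[j] = q[j] * r[j];
  }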


OpenMP Constructs
1. Parallel Regions
   #pragma omp parallel

2. Worksharing
   #pragma omp for, #pragma omp sections

3. Data Environment
   #pragma omp parallel shared/private (...)

4. Synchronization
   #pragma omp barrier

5. Runtime functions / environment variables
   int my_thread_id = omp_get_thread_num();
   omp_set_num_threads(8);
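
A small sketch exercising the runtime calls in item 5 (the team size of 8 mirrors the slide and is otherwise arbitrary):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      omp_set_num_threads(8);                       /* request a team of 8 threads */
      #pragma omp parallel
      {
          int my_thread_id = omp_get_thread_num();  /* 0 .. team size - 1          */
          if (my_thread_id == 0)
              printf("team size = %d\n", omp_get_num_threads());
      }
      return 0;
  }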

OpenMP: Structured blocks


Most OpenMP constructs apply to structured blocks.
Structured block: one point of entry at the top and one point of exit at the bottom. The only branches allowed are STOP statements in Fortran and exit() in C/C++.


OpenMP: Structured blocks


A Structured Block
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto more;
  }
  printf("All done\n");

Not A Structured Block


  if (count == 1) goto more;
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto done;
  }
done:
  if (!really_done()) goto more;


Structured Block Boundaries


In C/C++: a block is a single statement or a group of statements between braces { }.

  #pragma omp parallel
  {
    id = omp_get_thread_num();
    A[id] = big_compute(id);
  }

  #pragma omp for
  for (I=0; I<N; I++) {
    res[I] = big_calc(I);
    A[I] = B[I] + res[I];
  }


Structured Block Boundaries


In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.
C$OMP PARALLEL
10    W(id) = garbage(id)
      res(id) = W(id)**2
      if (res(id)) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do I = 1, N
        res(I) = bigComp(I)
      end do
C$OMP END PARALLEL DO


OpenMP Parallel Regions


  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    pooh(ID, A);
  }
  printf("all done\n");

Execution picture: omp_set_num_threads(4) requests four threads; the parallel region
forks them, and they call pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A) concurrently.

A single copy of A is shared between all threads.

Implicit barrier: threads wait here for all threads to finish before proceeding to
printf("all done\n").


The OpenMP API: Combined parallel work-share


OpenMP shortcut: Put the parallel and the work-share on the same line
  int i;
  double res[MAX];
  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

is the same OpenMP program as

  int i;
  double res[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }



Shared Memory Model


[Figure: five threads, each with its own private memory, all attached to a common shared memory.]

Data can be shared or private
Shared data is accessible by all threads
Private data can be accessed only by the thread that owns it
Data transfer is transparent to the programmer
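
A short sketch (not from the slide) of the distinction: total lives in shared memory and is visible to every thread, while x is declared private, so each thread works on its own copy:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double total = 0.0;               /* shared: one copy seen by all threads        */
      double x;                         /* listed private below: one copy per thread   */
      #pragma omp parallel private(x)
      {
          x = (double) omp_get_thread_num();
          #pragma omp critical          /* updates to shared data must be synchronized */
          total += x;
      }
      printf("total = %f\n", total);    /* 0 + 1 + ... + (team size - 1)               */
      return 0;
  }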


Default storage attributes


Shared memory programming model:
  Variables are shared by default

Distributed memory programming model:
  All variables are private


Default storage attributes


Global variables are SHARED among threads
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file-scope variables, static variables

But not everything is shared...
  - Stack variables in sub-programs called from parallel regions are PRIVATE
  - Automatic variables within a statement block are PRIVATE
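
A small sketch of these defaults (the names counter, local, and work are made up): the file-scope counter is shared by every thread, while local, a stack variable inside the called function, is private to each calling thread:

  #include <omp.h>

  int counter = 0;                 /* file scope: SHARED by default                   */

  void work(int id) {
      int local = id * id;         /* stack variable in a function called from the    */
                                   /* parallel region: PRIVATE to the calling thread  */
      #pragma omp critical         /* the shared counter still needs synchronization  */
      counter += local;
  }

  int main(void) {
      #pragma omp parallel
      work(omp_get_thread_num());
      return 0;
  }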


Data Environment
  int A[100][100];             /* (global) SHARED */

  int main()
  {
    int ii, jj;                /* PRIVATE */
    int B[100][100];           /* SHARED  */
    #pragma omp parallel private(jj)
    {
      int kk = 1;              /* PRIVATE */
      #pragma omp for
      for (ii=0; ii<100; ii++)
        for (jj=0; jj<100; jj++)
          A[ii][jj] = foo(B[ii][jj]);
    }
  }

  int foo(int x)
  {
    int count = 0;             /* PRIVATE */
    return x * count;
  }

Work Sharing Construct


Loop Construct
  #pragma omp for [clause[[,] clause] ...] new-line
    for-loops

  where clause is one of the following:
    private / firstprivate / lastprivate (list)
    reduction (operator: list)
    schedule (kind[, chunk_size])
    collapse (n)
    ordered
    nowait
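
A hedged sketch combining a few of these clauses (firstprivate, lastprivate, schedule, nowait); the array names and the value of N are made up for illustration:

  #include <omp.h>
  #define N 1000

  void scale_array(double *a, const double *b) {
      double scale = 2.0, last = 0.0;
      #pragma omp parallel
      {
          #pragma omp for firstprivate(scale) lastprivate(last) \
                          schedule(static, 64) nowait
          for (int i = 0; i < N; i++) {
              a[i] = scale * b[i];   /* each thread starts with its own copy of scale     */
              last = a[i];           /* after the loop, last holds the value from i = N-1 */
          }
          /* nowait: threads skip the implied barrier at the end of the for construct */
      }
      (void) last;                   /* silence unused-variable warnings in this sketch */
  }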

Schedule
  for (i=0; i<1100; i++)
    A[i] = ...;

  #pragma omp parallel for schedule(static, 250)      /* or schedule(static) */

    Chunks of 250 iterations: 250 | 250 | 250 | 250 | 100,
    assigned round-robin to p0 p1 p2 p3 p0.
    With plain schedule(static) the 1100 iterations are split evenly:
    275 | 275 | 275 | 275 to p0 p1 p2 p3.

  #pragma omp parallel for schedule(dynamic, 200)

    Chunks of 200 iterations: 200 | 200 | 200 | 200 | 200 | 100,
    handed to whichever thread is idle, e.g. p3 p0 p2 p3 p1 p0.

  #pragma omp parallel for schedule(guided, 100)

    Chunk sizes start large and shrink toward the chunk_size of 100:
    137 | 120 | 105 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 38,
    assigned dynamically, e.g. p0 p3 p0 p1 p2 p3 p0 p1 p2 p3 p0.

  #pragma omp parallel for schedule(auto)

    The scheduling decision is delegated to the compiler and/or runtime system.
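
A sketch (assuming OpenMP 3.0 or later) of deferring the choice: schedule(runtime) reads the kind and chunk size from the OMP_SCHEDULE environment variable or from omp_set_schedule(), so the options above can be compared without recompiling:

  #include <omp.h>
  #define N 1100

  void fill(double *A) {
      /* e.g.  export OMP_SCHEDULE="dynamic,200"  before running */
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < N; i++)
          A[i] = (double) i;
  }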



Critical Construct
  sum = 0;
  #pragma omp parallel private(lsum)
  {
    lsum = 0;
    #pragma omp for
    for (i=0; i<N; i++) {
      lsum = lsum + A[i];
    }
    #pragma omp critical
    {
      sum += lsum;      /* threads wait their turn; only one thread at a time
                           executes the critical section */
    }
  }

Reduction Clause
  sum = 0;             /* shared variable */
  #pragma omp parallel for reduction(+:sum)
  for (i=0; i<N; i++) {
    sum = sum + A[i];
  }
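
For comparison, a fragment (reusing A, N, and i from the slide) that accumulates two results at once; the max reduction operator for C/C++ assumes OpenMP 3.1 or later:

  double sum = 0.0, biggest = A[0];
  #pragma omp parallel for reduction(+:sum) reduction(max:biggest)
  for (i = 0; i < N; i++) {
      sum = sum + A[i];
      if (A[i] > biggest) biggest = A[i];   /* per-thread maxima are combined at the end */
  }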


Performance Evaluation
How do we measure performance? (or how do we remove noise?)
  #define N 24000

  for (k=0; k<10; k++) {
    #pragma omp parallel for private(i, j)
    for (i=1; i<N-1; i++)
      for (j=1; j<N-1; j++)
        a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
  }
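
One way (a sketch, not from the slide) to reduce the noise: time each of the 10 repetitions with omp_get_wtime() and report the fastest one, so thread start-up cost and cold caches are filtered out:

  double best = 1.0e30;
  for (k = 0; k < 10; k++) {
      double t0 = omp_get_wtime();
      #pragma omp parallel for private(i, j)
      for (i = 1; i < N-1; i++)
          for (j = 1; j < N-1; j++)
              a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
      double t1 = omp_get_wtime();
      if (t1 - t0 < best)
          best = t1 - t0;            /* keep the fastest repetition */
  }
  printf("best time = %f s\n", best);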

Performance Issues

What if you see a speedup saturation?

  [Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); the curve flattens as CPUs are added.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;

Performance Issues
What if you see a speedup saturation?

  [Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); again the curve saturates.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = b[i][j];


Loop Scheduling
Any guideline for a chunk size?
  #define N <big-number>

  chunk = ???;
  #pragma omp parallel for schedule(static, chunk)
  for (i=2; i<N-2; i++)
    a[i] = ( b[i-2] + b[i-1] + b[i]
           + b[i+1] + b[i+2] ) / 5.0;

Performance Issues
Load imbalance: triangular access pattern
  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=i; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;
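
Because the inner loop shrinks as i grows, a plain static partition gives the first threads much more work than the last ones. A sketch of one common remedy (the chunk size of 32 is arbitrary): let the runtime hand out small chunks of outer iterations as threads become free, or interleave them with schedule(static, 1):

  #define N 12000
  #pragma omp parallel for private(j) schedule(dynamic, 32)
  for (i = 1; i < N-1; i++)
      for (j = i; j < N-1; j++)
          a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                   + b[i-1][j] + b[i+1][j]) / 5.0;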


Summary
OpenMP has advantages
  - Incremental parallelization
  - Compared to MPI:
    - No data partitioning
    - No communication scheduling


Resources

http://www.openmp.org
http://openmp.org/wp/resources

