OpenMP Tutorial: Seung-Jai Min
Seung-Jai Min
([email protected]) School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN
Include files

    #include <omp.h>
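A minimal sketch of using the header, assuming a compiler with OpenMP support (e.g. gcc -fopenmp); the `parallel_width` helper name is mine, not from the slides:

```c
/* Guarding the include with _OPENMP keeps the file compilable
   even when OpenMP is disabled. */
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: the maximum number of threads a parallel
   region may use, or 1 when compiled without OpenMP. */
int parallel_width(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;   /* serial fallback */
#endif
}
```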
ECE 563 Programming Parallel Machines 4
OpenMP Constructs

1. Parallel Regions: #pragma omp parallel
2. Worksharing: #pragma omp for, #pragma omp sections
3. Data Environment: #pragma omp parallel shared(...) / private(...)
4. Synchronization: #pragma omp barrier
5.
Each thread executes its own call: pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A). Then:

    printf("all done\n");

Implicit barrier: threads wait here for all threads to finish before proceeding.
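The example above can be sketched as runnable C. pooh() is the slides' placeholder work routine, implemented here as a trivial stand-in, and the thread count of 4 matches the four calls shown:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Trivial stand-in for the slides' pooh(): each thread writes its own slot. */
void pooh(int id, double *a) {
    a[id] = (double)id;
}

void run_region(double a[4]) {
    #pragma omp parallel num_threads(4)
    {
#ifdef _OPENMP
        int id = omp_get_thread_num();
#else
        int id = 0;   /* serial fallback: only "thread 0" runs */
#endif
        pooh(id, a);
    }   /* implicit barrier: all threads finish before we proceed */
    printf("all done\n");
}
```

Each thread calls pooh() with its own id, so the four calls from the slide run concurrently; the printf only executes after the barrier.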
[Figure: five threads, each with its own private data, all accessing a common Shared Memory]

Data can be shared or private. Shared data is accessible by all threads. Private data can be accessed only by the thread that owns it. Data transfer is transparent to the programmer.
Data Environment
    int A[100][100];          /* (Global) SHARED */

    int main() {
       int ii, jj;            /* PRIVATE */
       int B[100][100];       /* SHARED */
       #pragma omp parallel private(jj)
       {
          int kk = 1;         /* PRIVATE */
          #pragma omp for
          for (ii = 0; ii < 100; ii++)
             for (jj = 0; jj < 100; jj++)
                A[ii][jj] = foo(B[ii][jj]);
       }
    }
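A runnable version of the slide's sketch, under the assumption that foo() simply transforms its input (defined here as a stand-in) and that the loop bound matches the array dimension of 100:

```c
#define N 100

int A[N][N];   /* global => SHARED */
int B[N][N];   /* SHARED; each thread's copy of jj is PRIVATE */

/* Hypothetical stand-in for the slide's foo() */
int foo(int x) { return x + 1; }

void transform(void) {
    int ii, jj;
    #pragma omp parallel private(jj)
    {
        /* ii, as the worksharing loop index, is made private automatically */
        #pragma omp for
        for (ii = 0; ii < N; ii++)
            for (jj = 0; jj < N; jj++)
                A[ii][jj] = foo(B[ii][jj]);
    }
}
```

Without private(jj), all threads would race on a single shared jj and corrupt each other's inner loop.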
Schedule
    #pragma omp parallel for schedule(static, 250)   /* or schedule(static) */
    for (i = 0; i < 1100; i++)
        A[i] = ...;
With schedule(static, 250), the 1100 iterations are split into chunks of 250, 250, 250, 250, and 100, dealt round-robin to threads p0, p1, p2, p3, and back to p0. With schedule(static) and four threads, each thread gets one contiguous chunk of 275 iterations: p0, p1, p2, p3.

[Figure: chunk-to-thread assignments for the two schedules]
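The round-robin mapping can be computed directly. This small helper (my name, not the slides') returns which thread owns iteration i under schedule(static, chunk):

```c
/* Owner of iteration i under schedule(static, chunk) with nthreads threads:
   chunks are dealt out round-robin, so the chunk number (i / chunk)
   wraps around the thread count. */
int static_owner(int i, int chunk, int nthreads) {
    return (i / chunk) % nthreads;
}
```

For 1100 iterations, chunk 250, and 4 threads, iterations 0-249 belong to p0, 250-499 to p1, and the final 100 iterations (1000-1099) wrap back to p0, matching the figure.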
Critical Construct
    sum = 0;
    #pragma omp parallel private(lsum)
    {
       lsum = 0;
       #pragma omp for
       for (i = 0; i < N; i++) {
          lsum = lsum + A[i];
       }
       #pragma omp critical
       {
          sum += lsum;
       }
    }

Threads wait their turn; only one thread at a time executes the critical section.
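The pattern above can be packaged as a function. A sketch, with lsum declared inside the parallel region so it is automatically private (equivalent to the slide's private(lsum) clause):

```c
/* Sum an array using per-thread partial sums merged in a critical section. */
double sum_with_critical(const double *a, int n) {
    double sum = 0.0;
    int i;
    #pragma omp parallel
    {
        double lsum = 0.0;     /* declared inside the region => private */
        #pragma omp for
        for (i = 0; i < n; i++)
            lsum = lsum + a[i];
        #pragma omp critical   /* one thread at a time updates sum */
        sum += lsum;
    }
    return sum;
}
```

Accumulating into lsum first means the critical section runs only once per thread, not once per element.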
Reduction Clause
    sum = 0;    /* shared variable */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
       sum = sum + A[i];
    }
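The same sum, sketched as a function using the reduction clause, which lets the runtime create and combine the per-thread copies of sum:

```c
/* Sum an array with reduction(+:sum): each thread gets a private copy
   initialized to 0, and the copies are added into the shared sum
   at the end of the loop. */
double sum_with_reduction(const double *a, int n) {
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum = sum + a[i];
    return sum;
}
```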
Performance Evaluation
How do we measure performance? (or how do we remove noise?)
    #define N 24000

    for (k = 0; k < 10; k++) {
       #pragma omp parallel for private(i, j)
       for (i = 1; i < N-1; i++)
          for (j = 1; j < N-1; j++)
             a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
    }
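One common way to time such a loop nest, sketched with omp_get_wtime() and a clock()-based serial fallback (the wall_seconds name is mine, not from the slides):

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. omp_get_wtime() measures elapsed (wall)
   time, which is what speedup is computed from; clock() is a serial
   fallback and measures CPU time instead. */
double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}
```

Taking wall_seconds() before and after the k-loop and dividing by the 10 repetitions averages out per-run noise, which is the point of repeating the loop.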
Performance Issues: Speedup
What if you see a speedup saturation?
[Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8)]
    #define N 12000
    #pragma omp parallel for private(j)
    for (i = 1; i < N-1; i++)
       for (j = 1; j < N-1; j++)
          a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                   + b[i-1][j] + b[i+1][j]) / 5.0;
Performance Issues
Speedup
    #define N 12000
    #pragma omp parallel for private(j)
    for (i = 1; i < N-1; i++)
       for (j = 1; j < N-1; j++)
          a[i][j] = b[i][j];
Loop Scheduling
Any guideline for a chunk size?
    #define N <big-number>
    chunk = ???;
    #pragma omp parallel for schedule(static, chunk)
    for (i = 2; i < N-2; i++)
       a[i] = (b[i-2] + b[i-1] + b[i] + b[i+1] + b[i+2]) / 5.0;
Performance Issues
Load imbalance: triangular access pattern
    #define N 12000
    #pragma omp parallel for private(j)
    for (i = 1; i < N-1; i++)
       for (j = i; j < N-1; j++)
          a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                   + b[i-1][j] + b[i+1][j]) / 5.0;
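One common remedy for this imbalance (not spelled out on the slide) is a dynamic schedule, so threads grab fresh rows as they finish; a sketch on a smaller M=64 grid:

```c
#define M 64

/* Triangular loop nest: row i does (M-1-i) inner iterations, so early
   rows carry far more work. schedule(dynamic) hands out rows on demand,
   rebalancing the load across threads. */
void smooth_triangular(double a[M][M], double b[M][M]) {
    int i, j;
    #pragma omp parallel for private(j) schedule(dynamic)
    for (i = 1; i < M-1; i++)
        for (j = i; j < M-1; j++)
            a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                     + b[i-1][j] + b[i+1][j]) / 5.0;
}
```

With the default static schedule, the thread assigned the first rows would do most of the work while the others sit idle at the implicit barrier.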
Summary
OpenMP has advantages:
- Incremental parallelization
- Compared to MPI: no data partitioning, no communication scheduling
Resources
http://www.openmp.org
http://openmp.org/wp/resources