04 CPD OpenMP Sync Tasks


More on OpenMP

Parallel and Distributed Computing

José Monteiro

Thursday, 2 March 2023


Outline

• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Parallel and Distributed Computing 2


Shared-Memory Systems

• Uniform Memory Access (UMA) architecture, also known as Symmetric Shared-Memory Multiprocessors (SMP)

[Diagram: four cores, each with its own cache, connected to a shared main memory and I/O.]
Parallel and Distributed Computing 3


Explicit Synchronization

A barrier can be explicitly inserted within the parallel code

/* some multi-threaded code */


#pragma omp barrier


/* remainder of multi-threaded code */

[Diagram: barrier region over time — tasks 0..n reach the barrier at different moments and sit idle until the last one arrives, after which all proceed.]

Parallel and Distributed Computing 4


Explicit Synchronization

#pragma omp parallel
{
  /* All threads execute this. */
  SomeCode();

  #pragma omp barrier

  /* All threads execute this, but not before
   * all threads have finished executing SomeCode(). */
  SomeMoreCode();
}

Parallel and Distributed Computing 5


Explicit Synchronization

• Critical region, similar to mutexes in threads

- A thread waits at the beginning of a critical region until no other thread is executing a critical region with the same name

- All unnamed critical directives map to the same unspecified name

#pragma omp critical [(name)]
{ … }
[Diagram: critical region over time — only one task executes the critical region at a time; tasks that arrive while it is occupied sit idle until it becomes free.]

Parallel and Distributed Computing 6


Example of critical Clause

int cnt = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp critical
      cnt++;
    } /* endif */
    a[i] += b[i] * (i+1);
  } /* omp end for */
} /* omp end parallel */

Parallel and Distributed Computing 7


Explicit Synchronization with atomic Clause

#pragma omp atomic
<statement>

• Atomic operation for reading/writing

– Guarantees that the reading and writing of one memory location is atomic

– Applies only to the statement immediately following it, which needs to be a simple operation of the form: x <binop>= expr

• The critical directive implements mutual exclusion in terms of the execution of a region of code

– All threads arriving at a critical region are blocked

• The atomic directive implements mutual exclusion in terms of the access to data

– atomic only blocks threads operating on the same memory location

– requires hardware support

Parallel and Distributed Computing 8


Example of atomic Clause

int accum = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp atomic
      accum += b[i] * (i+1);
    } /* endif */
  } /* end for */
} /* omp end parallel */

Parallel and Distributed Computing 9


Explicit Locks

• The critical directive implicitly uses locks

– OpenMP does allow the explicit manipulation of locks

void omp_init_lock(omp_lock_t *lock);
void omp_destroy_lock(omp_lock_t *lock);
void omp_set_lock(omp_lock_t *lock);
void omp_unset_lock(omp_lock_t *lock);
int omp_test_lock(omp_lock_t *lock);

• Less clean and more error prone than critical

• But more flexible

- In terms of the entry and exit points of exclusive regions

- Allows finer-grained conditions for mutual exclusion (see the omp_test_lock() sketch after the next example)

Parallel and Distributed Computing 10


Example of Using Locks

int cnt = 0;
omp_lock_t lck_a;          /* the lock object itself, not a pointer */

omp_init_lock(&lck_a);
...
#pragma omp parallel for
for(i = 0; i < 20; i++) {
  if(b[i] == 0) {
    omp_set_lock(&lck_a);
    cnt++;
    omp_unset_lock(&lck_a);
  } /* endif */
  a[i] += b[i] * (i+1);
} /* end for */
...
omp_destroy_lock(&lck_a);

Parallel and Distributed Computing 11
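
The omp_test_lock() routine is what makes explicit locks more flexible than critical: a thread can try to acquire the lock and, if it is busy, do other useful work instead of blocking. A minimal, self-contained sketch of this pattern (not from the slides; the "other work" here is just a counter):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int shared_counter = 0;   /* protected by lck */
  int other_work = 0;       /* work done while the lock was busy */
  omp_lock_t lck;
  omp_init_lock(&lck);

  #pragma omp parallel
  {
    int done = 0;
    while (!done) {
      if (omp_test_lock(&lck)) {   /* nonzero means the lock was acquired */
        shared_counter++;          /* protected update */
        omp_unset_lock(&lck);
        done = 1;
      } else {
        #pragma omp atomic
        other_work++;              /* make progress instead of blocking */
      }
    }
  }

  omp_destroy_lock(&lck);
  printf("counter = %d, other work = %d\n", shared_counter, other_work);
  return 0;
}
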


Single Processor Region

• Slightly different problem: how to have a single thread execute a region of the parallel section?

#pragma omp single
{ ... }

– Ideally suited for I/O or initialization

– Use master instead of single to guarantee that the master thread (thread 0) is the one that executes the region (note that master, unlike single, has no implied barrier at the end)
[Diagram: single-processor region over time — one task executes the region while the remaining tasks sit idle until it completes.]

Parallel and Distributed Computing 12


Example of Single Processor Region

#pragma omp parallel
{
  #pragma omp single
  printf("Beginning work1.\n");
  work1();
  #pragma omp single
  printf("Finishing work1.\n");
  #pragma omp single nowait
  printf("Finished work1 and beginning work2.\n");
  work2();
}

Parallel and Distributed Computing 13


Single Processor Region

What happens if you have:

<some code in a parallel region>


#pragma omp single
i = complexFunction();
<rest of code of the parallel region>

• i shared: unambiguous result


• i private: changes the private copy of the thread that executes the single (probably not the desired outcome); use copyprivate to update all copies

#pragma omp single copyprivate(i)
i = complexFunction();

Parallel and Distributed Computing 14
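
A self-contained sketch of the copyprivate pattern above; complexFunction() is replaced here by a placeholder value, since the slides do not define it:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    int i;   /* declared inside the region, hence private to each thread */

    /* One thread computes the value; copyprivate broadcasts its private
     * copy of i to the private copies of all other threads in the team. */
    #pragma omp single copyprivate(i)
    i = 42;  /* placeholder for complexFunction() */

    printf("Thread %d sees i = %d\n", omp_get_thread_num(), i);
  }
  return 0;
}
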


2nd Exercise with single

void func(int id) {
  printf("Thread %d in func!\n", id);
  #pragma omp for
  for(int i = 0; i < 4; i++)
    printf("Tid: %d\ti = %d\n", id, i);
}

int main() {
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    printf("Thread %d alive!\n", tid);
    #pragma omp master
    {
      printf("Thread %d in single!\n", tid);
      func(tid);
    }
  }
}
Parallel and Distributed Computing 15

2nd Exercise with single

Answer: DEADLOCK! The #pragma omp for inside func() is a worksharing construct with an implicit barrier at its end, but only the thread executing the master block encounters it. That thread waits at the for's barrier while the other thread waits at the barrier at the end of the parallel region, so neither can ever proceed.

Parallel and Distributed Computing 15
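
One way to repair the exercise, assuming the intent is simply that the iterations be shared among the team, is to drop the master construct so that every thread reaches the orphaned for (a sketch, not the official solution from the course):

#include <stdio.h>
#include <omp.h>

void func(int id) {
  printf("Thread %d in func!\n", id);
  #pragma omp for                /* now encountered by all threads of the team */
  for (int i = 0; i < 4; i++)
    printf("Tid: %d\ti = %d\n", id, i);
}

int main(void) {
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    printf("Thread %d alive!\n", tid);
    func(tid);                   /* called by every thread, so the for can distribute iterations */
  }
  return 0;
}
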
Conditional Parallelism 


• Oftentimes, parallelism is only useful if the problem size is large enough

– For regions with low computational effort, the overhead of parallelization exceeds the benefit

#pragma omp parallel if( expression )

#pragma omp parallel sections if( expression )

#pragma omp parallel for if( expression )

• Execute in parallel if expression evaluates to true, otherwise execute sequentially.

Parallel and Distributed Computing 16


Example of Conditional Parallelism

for(i = 0; i < n; i++)
  #pragma omp parallel for private(j,k) if(n-i > 100)
  for(j = i + 1; j < n; j++)
    for(k = i + 1; k < n; k++)
      a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];

Parallel and Distributed Computing 17


Reduction Clause

How to parallelize the computation of an inner product?

for(i = 0; i < n; i++)
  result += a[i] * b[i];

#pragma omp parallel for reduction(op:list)

– op is a binary operator (+, *, -, &, ^, |, &&, ||)

– list is a list of shared variables

– Actions

• A private copy of each list variable is created for each thread

• At the end of the reduction, the reduction operator is applied to all private copies of the variable, and the result is written to the global shared variable

Parallel and Distributed Computing 18


Reduction Example

main() {
  int i, n = 100;
  float a[100], b[100], result = 0.0;

  #pragma omp parallel
  {
    #pragma omp for
    for(i = 0; i < n; i++) { // initialize vectors a and b
      a[i] = i * 1.0;
      b[i] = i * 2.0;
    }

    #pragma omp for reduction(+:result)
    for(i = 0; i < n; i++) // compute internal product
      result = result + (a[i] * b[i]);
  }
  printf("Final result = %f\n", result);
}

Parallel and Distributed Computing 19
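
For intuition, the reduction clause behaves roughly like keeping a private partial sum per thread and combining the partial sums at the end, for example inside a critical region. A sketch of that manual pattern (illustration only; the clause itself is the preferred form):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i, n = 100;
  float a[100], b[100], result = 0.0f;

  for (i = 0; i < n; i++) { a[i] = i * 1.0f; b[i] = i * 2.0f; }

  #pragma omp parallel
  {
    float partial = 0.0f;          /* private partial sum */

    #pragma omp for
    for (i = 0; i < n; i++)
      partial += a[i] * b[i];

    #pragma omp critical           /* combine the private copies */
    result += partial;
  }
  printf("Final result = %f\n", result);
  return 0;
}
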


Load Balancing

• With irregular workloads, care must be taken in distributing the work over the threads

– Example: multiplication of two matrices C = A × B, where the A matrix is upper-triangular (all elements below the diagonal are 0)

#pragma omp parallel for private(j,k)
for(i = 0; i < n; i++)
  for(j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for(k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

Parallel and Distributed Computing 20


The schedule Clause

Different options for work distribution among threads

schedule(static | dynamic | guided [, chunk])
schedule(auto | runtime)

• static [,chunk]
- Iterations are divided into blocks of size chunk, and these blocks are assigned to the threads in a round-robin fashion

- In the absence of chunk, the iterations are divided into P blocks of approximately N/P iterations each, one per thread, for a loop of length N and P threads

- Example: loop of length N=8 and P=2 threads:

              TID 0        TID 1
  No chunk    1-4          5-8
  chunk = 2   1-2, 5-6     3-4, 7-8

Parallel and Distributed Computing 21


The schedule Clause

• dynamic [,chunk]
- A block of chunk iterations is assigned to each thread (defaults to 1 if chunk is not specified)

- When a thread finishes its block, it starts on the next available block

- Each block contains chunk iterations, except for the last block to be distributed, which may have fewer iterations

• guided [,chunk]
- Same dynamic assignment as dynamic, but threads are assigned blocks of decreasing size

- The size of each block is proportional to the number of unassigned iterations divided by the number of threads, decreasing to chunk (minimum size, 1 if not specified)

Parallel and Distributed Computing 22


The schedule Clause

Example assignment of iteration chunks to threads T0–T3 under each policy (with guided, successive chunks shrink in size):

Static:   T0 T1 T2 T3 T0 T1 T2 T3 T0 T1 T2 T3
Dynamic:  T2 T1 T0 T3 T0 T3 T2 T3 T1 T3 T0 T2
Guided:   T2 T0 T3 T1 T0 T2 T1 T3 T2 T0 T1 T2

Parallel and Distributed Computing 23


The schedule Clause

• auto
- The scheduling decision is delegated to the compiler and/or the runtime system

• runtime
- The iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE

Parallel and Distributed Computing 24
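
A small sketch of how schedule(runtime) and OMP_SCHEDULE fit together; the schedule value shown in the comment is just an example:

#include <stdio.h>

int main(void) {
  /* Chosen at run time, e.g.:  export OMP_SCHEDULE="dynamic,4" */
  double sum = 0.0;

  #pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (int i = 0; i < 1000; i++)
    sum += i * 0.5;

  printf("sum = %f\n", sum);
  return 0;
}
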


Scheduling Options

• Static vs. dynamic scheduling

– Static has lower overhead than dynamic

– Dynamic adapts better to higher workload imbalance

• Chunk size

– Larger chunks reduce overhead and may increase the cache hit rate

– Smaller chunks allow finer balancing of the workload

Parallel and Distributed Computing 25
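
Applying these options to the upper-triangular multiplication from the load-balancing slide, one plausible choice is a dynamic schedule, so that threads which finish the cheap (late) rows pick up more work. A sketch, with an arbitrary chunk size of 8:

/* Same loop nest as the load-balancing example, now with a dynamic schedule.
 * Rows cost different amounts of work because a[i][k] == 0 for k < i. */
void triangular_matmul(int n, double a[n][n], double b[n][n], double c[n][n]) {
  int i, j, k;
  #pragma omp parallel for private(j, k) schedule(dynamic, 8)
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      c[i][j] = 0.0;
      for (k = i; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];
    }
}
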


Collapsing Loops

#pragma omp parallel for private(j,k) collapse(2)
for(i = 0; i < n; i++)
  for(j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for(k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

• The collapse(n) clause collapses the n nested loops into a single iteration space

• Combining it with dynamic scheduling may lead to better load balancing

Parallel and Distributed Computing 26


Vector Processing in OpenMP

#pragma omp parallel for private(j)
for(i = 0; i < n; i++)
  #pragma omp simd
  for(j = 0; j < n; j++)
    M[i][j] = A[i][j] + B[i][j];

[Diagram: a single SIMD instruction adds the four lanes A[i][j..j+3] and B[i][j..j+3] and stores the results into M[i][j..j+3].]

• Even when all threads are in use, vector processing can extract additional parallelism at the instruction level

• Recent compilers already vectorize automatically, but may be too conservative

Parallel and Distributed Computing 27
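
The thread-level and SIMD directives can also be merged into a combined construct (assuming an OpenMP 4.0+ compiler); a brief sketch on the same matrix addition:

/* Combined construct: distributes iterations across threads and vectorizes them. */
void matrix_add(int n, float M[n][n], float A[n][n], float B[n][n]) {
  #pragma omp parallel for simd collapse(2)
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      M[i][j] = A[i][j] + B[i][j];
}
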


Nested Parallelism

• Parallel regions can be nested

- Support is implementation dependent

[Diagram: the master thread forks a team; each thread of that team forks and later joins its own nested team, before the outer team itself joins.]

• Must be enabled with the OMP_NESTED environment variable or the omp_set_nested() routine

- If a parallel directive is encountered within another parallel directive, a new team of threads is created

- The new team contains only one thread, unless nested parallelism is enabled
Parallel and Distributed Computing 28
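
A minimal sketch of enabling nested parallelism and observing the extra teams (omp_set_nested() is the interface named on the slide; newer OpenMP versions prefer omp_set_max_active_levels()):

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_nested(1);                     /* enable nested parallelism */

  #pragma omp parallel num_threads(2)    /* outer team: 2 threads */
  {
    int outer = omp_get_thread_num();

    #pragma omp parallel num_threads(3)  /* each outer thread forks an inner team of 3 */
    printf("outer %d, inner %d\n", outer, omp_get_thread_num());
  }
  return 0;
}
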
Nested Parallelism

• Set the number of threads per level

– Environment variable: OMP_NUM_THREADS (e.g., 4,3,2)

– Runtime routine: omp_set_num_threads() inside a parallel region

– Clause: add a num_threads() clause to a parallel directive

• Set/get the maximum number of OpenMP threads available to the program

– Environment variable: OMP_THREAD_LIMIT

– Runtime routine: omp_get_thread_limit()

Parallel and Distributed Computing 29


Nested Parallelism

• Set/get the maximum number of nested active parallel regions:

– Environment variable: OMP_MAX_ACTIVE_LEVELS

– Runtime routines: omp_set_max_active_levels(), omp_get_max_active_levels()

• Library routines to determine:

– Depth of nesting: omp_get_level()

– IDs of parent/grandparent/etc. threads: omp_get_ancestor_thread_num(level)

– Team sizes of parent/grandparent/etc. teams: omp_get_team_size(level)

Parallel and Distributed Computing 30
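
A short sketch exercising these routines inside two nested levels (output order will vary between runs; older runtimes may additionally require omp_set_nested(1)):

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_max_active_levels(2);          /* allow two active levels of parallelism */

  #pragma omp parallel num_threads(2)
  {
    #pragma omp parallel num_threads(2)
    {
      int level = omp_get_level();                     /* nesting depth: 2 here */
      printf("level %d, tid %d, parent tid %d, parent team size %d\n",
             level,
             omp_get_thread_num(),
             omp_get_ancestor_thread_num(level - 1),   /* ID of the parent thread */
             omp_get_team_size(level - 1));            /* size of the parent team */
    }
  }
  return 0;
}
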


Review

• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Parallel and Distributed Computing 31


Next Class

• Task directive

• Performance considerations

• Debugging

Parallel and Distributed Computing 32
