More on OpenMP

Parallel and Distributed Computing

José Monteiro

Thursday, 2 March 2023


• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Shared-Memory Systems

• Uniform Memory Access (UMA) architecture, also known as
  Symmetric Shared-Memory Multiprocessors (SMP)

[Figure: four cores, each with its own cache]


Explicit Synchronization

A barrier can be explicitly inserted within the parallel code

/* some multi-threaded code */

#pragma omp barrier

/* remainder of multi-threaded code */

[Figure: timeline of tasks 0 to n meeting at the barrier region]

Explicit Synchronization

#pragma omp parallel
{
    /* All threads execute this. */
    SomeCode();

    #pragma omp barrier

    /* All threads execute this, but not before
     * all threads have finished executing SomeCode().
     */
}
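A minimal sketch of a case where the explicit barrier is actually needed. The function name `reverse_squares` and the two-phase layout are illustrative assumptions, not part of the slides. Phase 1 uses `nowait`, which removes the implicit barrier at the end of the loop, so the explicit barrier is what guarantees that every element of the scratch buffer is written before any thread reads it.

```c
#include <stdlib.h>

/* Writes out[i] = in[n-1-i]^2. Returns 0 on success, -1 on allocation failure.
 * Hypothetical helper used only to illustrate "#pragma omp barrier". */
int reverse_squares(const int *in, int *out, int n)
{
    if (n <= 0)
        return 0;

    int *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    #pragma omp parallel
    {
        /* Phase 1: nowait drops the implicit barrier at the end of this loop. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            tmp[i] = in[i] * in[i];

        /* Without this barrier a thread could read tmp[n-1-i]
         * before the thread that owns that element has written it. */
        #pragma omp barrier

        /* Phase 2: reads elements that other threads may have produced. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            out[i] = tmp[n - 1 - i];
    }

    free(tmp);
    return 0;
}
```

Dropping both the `nowait` and the explicit barrier would be equivalent here, since a worksharing loop ends with an implicit barrier by default.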

Explicit Synchronization

• Critical region, similar to mutexes in threads

- A thread waits at the beginning of a critical region until no
  other thread is executing a critical region with the same name

- All unnamed critical directives map to the same unspecified name

#pragma omp critical [(<name>)]

{ … }

[Figure: timeline of tasks 0 to n; a task stays idle while another task is inside the critical region]

Example of critical Directive

int cnt = 0;
#pragma omp parallel
{
    #pragma omp for
    for(i = 0; i < 20; i++) {
        if(b[i] == 0) {
            #pragma omp critical
            cnt++;
        } /* endif */
        a[i] += b[i] * (i+1);
    } /* omp end for */
} /* omp end parallel */
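The naming rule for critical regions can be sketched as follows. The function and the region names `zero_count` and `negative_count` are hypothetical. Because the two regions have different names, a thread updating one counter does not block a thread updating the other; with two unnamed regions they would share a single lock.

```c
/* Counts zeros and negative values in b[0..n-1].
 * Illustrative sketch of named critical regions. */
void count_zeros_and_negatives(const int *b, int n, int *zeros, int *negatives)
{
    int z = 0;
    int neg = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (b[i] == 0) {
            /* Only excludes other threads inside "zero_count". */
            #pragma omp critical(zero_count)
            z++;
        } else if (b[i] < 0) {
            /* Independent of "zero_count": different name, different lock. */
            #pragma omp critical(negative_count)
            neg++;
        }
    }

    *zeros = z;
    *negatives = neg;
}
```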

Explicit Synchronization with atomic Directive

#pragma omp atomic


• Atomic operation for reading/writing

– Guarantees that reading and writing of one memory location is atomic

– Applies only to the statement immediately following it, which needs to
  be a simple operation of the form: x <binop>= expr

• The critical directive implements mutual exclusion in terms of the
  execution of a region of the code

– All threads arriving at a critical region are blocked while another
  thread is inside it

• The atomic directive implements mutual exclusion in terms of the access
  to data

– atomic only blocks threads operating on the same memory location

– Typically implemented with hardware atomic instructions

Example of atomic Directive

int accum = 0;
#pragma omp parallel
{
    #pragma omp for
    for(i = 0; i < 20; i++) {
        if(b[i] == 0) {
            #pragma omp atomic
            accum += b[i] * (i+1);
        } /* endif */
    } /* end for */
} /* omp end parallel */
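A short sketch of why atomic is described as exclusion on data rather than on code. The `histogram` function is an illustrative assumption. Threads incrementing different bins do not wait for each other, whereas a critical region around the same statement would serialize every increment.

```c
/* Builds a histogram of data[i] % nbins.
 * Assumes data[i] >= 0 and nbins > 0 (illustrative sketch). */
void histogram(const int *data, int n, int *bins, int nbins)
{
    for (int k = 0; k < nbins; k++)
        bins[k] = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int k = data[i] % nbins;

        /* Protects only the update of bins[k]; updates to
         * other bins can proceed at the same time. */
        #pragma omp atomic
        bins[k]++;
    }
}
```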

Explicit Locks

• The critical directive implicitly uses locks

– OpenMP does allow the explicit manipulation of locks

void omp_init_lock(omp_lock_t *lock);

void omp_destroy_lock(omp_lock_t *lock);
void omp_set_lock(omp_lock_t *lock);
void omp_unset_lock(omp_lock_t *lock);
int omp_test_lock(omp_lock_t *lock);

• Less clean, more error prone than critical

• But more flexible

- In terms of enter and exit points of exclusive regions

- Allows finer granularity of conditions for mutual exclusion

Example of Using Locks

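int cnt = 0;
omp_lock_t lck_a;

omp_init_lock(&lck_a);
#pragma omp parallel for
for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
        omp_set_lock(&lck_a);
        cnt++;
        omp_unset_lock(&lck_a);
    } /* endif */
    a[i] += b[i] * (i+1);
} /* end for */
omp_destroy_lock(&lck_a);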
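The "finer granularity" point can be illustrated with one lock per bucket, so that only threads touching the same bucket exclude each other. This is a sketch under assumptions: `best_t`, `bucket_max` and `NBUCKETS` are hypothetical names, values are assumed non-negative, and the serial stubs in the `#else` branch exist only so the sketch still builds when the compiler is run without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback stubs (assumption: used only when OpenMP is disabled). */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
#endif

#define NBUCKETS 4

typedef struct {
    int max;    /* largest value seen in this bucket */
    int index;  /* position of that value in v[] */
} best_t;

/* Element i belongs to bucket i % NBUCKETS. Both fields of a bucket must be
 * updated together, which a single atomic statement cannot express. */
void bucket_max(const int *v, int n, best_t best[NBUCKETS])
{
    omp_lock_t lock[NBUCKETS];

    for (int k = 0; k < NBUCKETS; k++) {
        best[k].max = -1;
        best[k].index = -1;
        omp_init_lock(&lock[k]);
    }

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int k = i % NBUCKETS;

        omp_set_lock(&lock[k]);       /* blocks only threads using bucket k */
        if (v[i] > best[k].max) {
            best[k].max = v[i];
            best[k].index = i;
        }
        omp_unset_lock(&lock[k]);
    }

    for (int k = 0; k < NBUCKETS; k++)
        omp_destroy_lock(&lock[k]);
}
```

With equal values in one bucket, the recorded index depends on which thread arrives first, so the result is only deterministic when the values in each bucket are distinct.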

Single Processor Region

• Slightly different problem: how to have a single thread execute
  a region of the parallel section?

#pragma omp single

{ ... }

– Ideally suited for I/O or initialization

– Use master instead of single to guarantee that the master thread
  (thread 0) is the one that executes the single processor region
  (unlike single, master has no implied barrier at the end)

[Figure: timeline of tasks 0 to n around a single processor region]

Example of Single Processor Region

#pragma omp parallel
{
    #pragma omp single
    printf("Beginning work1.\n");
    work1();
    #pragma omp single
    printf("Finishing work1.\n");
    #pragma omp single nowait
    printf("Finished work1 and beginning work2.\n");
    work2();
}

Single Processor Region

What happens if you have:

<some code in a parallel region>

#pragma omp single
i = complexFunction();
<rest of code of the parallel region>

• i shared: unambiguous result

• i private: changes only the private copy of the thread that executes
  the single (probably not the desired outcome)

• Use copyprivate to update all copies:

#pragma omp single copyprivate(i)

i = complexFunction();
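A minimal sketch of the copyprivate case. Here `complex_function` is a hypothetical stand-in for the slide's `complexFunction()`, and `fill_with_offset` is an illustrative name. The variable is declared inside the parallel region, so each thread has its own copy; copyprivate broadcasts the value computed by the one thread that runs the single region to all the other copies before they leave its implicit barrier.

```c
/* Hypothetical stand-in for the slide's complexFunction(). */
static int complex_function(void)
{
    return 42;
}

/* Writes out[k] = value + k, where value is computed once by one thread. */
void fill_with_offset(int *out, int n)
{
    #pragma omp parallel
    {
        int value;  /* private: declared inside the parallel region */

        /* One thread computes; copyprivate copies the result
         * into every other thread's private "value". */
        #pragma omp single copyprivate(value)
        value = complex_function();

        #pragma omp for
        for (int k = 0; k < n; k++)
            out[k] = value + k;
    }
}
```

Without copyprivate, every thread except the one that executed the single region would read an uninitialized `value`.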

2nd Exercise with single

void func(int id) {
    printf("Thread %d in func!\n", id);
    #pragma omp for
    for(int i = 0; i < 4; i++)
        printf("Tid: %d\ti = %d\n", id, i);
}

int main() {
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        printf("Thread %d alive!\n", tid);
        #pragma omp master
        printf("Thread %d in single!\n", tid);

Conditional Parallelism 

• Oftentimes, parallelism is only useful if the problem size
  is large enough

– For regions with low computational effort, the overhead of
  parallelization exceeds the benefit

#pragma omp parallel if( expression )

#pragma omp parallel sections if( expression )

#pragma omp parallel for if( expression )

• Execute in parallel if expression evaluates to true,
  otherwise execute sequentially.

Example of Conditional Parallelism

for(i = 0; i < n; i++)

#pragma omp parallel for private (j,k) if(n-i > 100)
for(j = i + 1; j < n; j++)
for(k = i + 1; k < n; k++)
a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
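The same idea in its simplest form, as a sketch: the loop is parallelized only when the array is large enough to amortize the cost of starting the team. The cut-off `PARALLEL_THRESHOLD` is a hypothetical value chosen for illustration; a real threshold has to be found by measurement on the target machine.

```c
/* Hypothetical cut-off below which the loop runs sequentially. */
#define PARALLEL_THRESHOLD 10000

/* Multiplies every element of x[0..n-1] by factor. */
void scale(double *x, int n, double factor)
{
    /* For n <= PARALLEL_THRESHOLD the region is executed by one thread. */
    #pragma omp parallel for if(n > PARALLEL_THRESHOLD)
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}
```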

Reduction Clause

How to parallelize the computation of an inner product?

for(i = 0; i < n; i++)
    result += a[i] * b[i];

#pragma omp parallel for reduction(op:list)

– op is a binary operator (+, *, -, &, ^, |, &&, ||)

– list is a list of shared variables

– Actions

• A private copy of each list variable is created for each thread
  and initialized to the identity value of op

• At the end of the reduction, the reduction operator is applied to all
  private copies of the variable, and the result is written to the global
  shared variable

Reduction Example

#include <stdio.h>

int main() {
    int i, n = 100;
    float a[100], b[100], result = 0.0;

    #pragma omp parallel
    {
        #pragma omp for
        for(i = 0; i < n; i++) { // initialize vectors a and b
            a[i] = i * 1.0;
            b[i] = i * 2.0;
        }

        #pragma omp for reduction(+:result)
        for(i = 0; i < n; i++) // compute inner product
            result = result + (a[i] * b[i]);
    }
    printf("Final result = %f\n",result);
    return 0;
}
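To make the "Actions" of the reduction clause concrete, the sketch below places the clause next to a hand-written version that performs the same steps explicitly: a private partial sum per thread, followed by a combination into the shared variable. Both function names are illustrative assumptions; the manual version is one possible way to express what the clause does, not a description of how any particular runtime implements it.

```c
/* Inner product using the reduction clause. */
double dot_reduction(const double *a, const double *b, int n)
{
    double result = 0.0;

    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++)
        result += a[i] * b[i];

    return result;
}

/* Same computation with the reduction steps written out by hand. */
double dot_manual(const double *a, const double *b, int n)
{
    double result = 0.0;

    #pragma omp parallel
    {
        double partial = 0.0;  /* private copy, starts at the identity of + */

        #pragma omp for
        for (int i = 0; i < n; i++)
            partial += a[i] * b[i];

        /* Combine the private copies into the shared variable. */
        #pragma omp critical
        result += partial;
    }

    return result;
}
```

With floating-point data the order in which partial sums are combined can change the last bits of the result; the small integer-valued inputs used for checking avoid that.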

Load Balancing

• With irregular workloads, care must be taken in distributing
  the work over the threads

– Example: Multiplication of two matrices C = A × B, where the A
  matrix is upper-triangular (all elements below the diagonal are 0)

#pragma omp parallel for private(j,k)
for(i = 0; i < n; i++)
    for(j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for(k = i; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

The schedule Clause

Different options for work distribution among threads

schedule (static | dynamic | guided [,chunk])

schedule (auto | runtime )

• static [,chunk]
- Iterations are divided into blocks of size chunk, and these blocks
  are assigned to the threads in a round-robin fashion

- In the absence of chunk, each thread executes one block of approximately
  N/P iterations for a loop of length N and P threads

- Example: loop of length N=8 and P=2 threads:

  TID          0           1
  No chunk     1-4         5-8
  chunk = 2    1-2, 5-6    3-4, 7-8
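The round-robin rule for static scheduling with an explicit chunk can be written as a one-line formula. The helper `static_owner` is a hypothetical name introduced for illustration, and it numbers iterations from 0, whereas the table above numbers them from 1.

```c
/* Thread that executes 0-based iteration i under schedule(static, chunk)
 * with nthreads threads: block number i / chunk, dealt out round-robin. */
int static_owner(int i, int chunk, int nthreads)
{
    return (i / chunk) % nthreads;
}
```

For chunk = 2 and two threads this reproduces the last row of the table: iterations 1-2 and 5-6 (0-based 0-1 and 4-5) go to thread 0, and iterations 3-4 and 7-8 go to thread 1.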

The schedule Clause

• dynamic [,chunk]
- A block of chunk iterations is assigned to each thread
  (chunk defaults to 1 if not specified)

- When a thread finishes its block, it starts on the next unassigned block

- Each block contains chunk iterations, except for the last block to
  be distributed, which may have fewer iterations

• guided [,chunk]
- Same on-demand assignment as dynamic, but threads are assigned
  blocks of decreasing size

- The size of each block is proportional to the number of
  unassigned iterations divided by the number of threads,
  decreasing to chunk (minimum size, or 1 if not defined)

The schedule Clause

[Figure: assignment of iteration blocks to threads T0 to T3 under three schedules; the first row follows a regular round-robin pattern (T0 T1 T2 T3 repeated), the other two rows are irregular]

The schedule Clause

• auto
- The decision regarding scheduling is delegated to the compiler and/or
  runtime system

• runtime
- Iteration scheduling scheme is set at runtime through an environment variable

Scheduling Options

• Static vs. dynamic scheduling

– Static has lower overhead than dynamic

– Dynamic adapts better to higher workload imbalance

• Chunks

– Larger chunks reduce overhead and may increase cache hit rate

– Small chunks allow finer balancing of workload
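A sketch of the trade-off on a triangular workload, where row i costs n - i multiplications, so early rows are far more expensive than late ones. The function `upper_matvec`, the dense row-major storage and the chunk size of 4 are illustrative assumptions; the chunk size in particular would need tuning.

```c
/* y = A x for an upper-triangular n x n matrix A stored densely, row-major.
 * Only the entries on and above the diagonal are read. */
void upper_matvec(const double *a, const double *x, double *y, int n)
{
    /* Dynamic scheduling hands out blocks of 4 rows on demand, so a thread
     * that receives cheap rows simply comes back for more work. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = i; k < n; k++)
            sum += a[i * n + k] * x[k];
        y[i] = sum;
    }
}
```

Replacing the clause with `schedule(runtime)` would allow different schedules to be compared without recompiling, by setting the standard `OMP_SCHEDULE` environment variable before each run.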

Collapsing Loops

#pragma omp parallel for private(j,k) collapse(2)
for(i = 0; i < n; i++)
    for(j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for(k = i; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

• The collapse(n) clause combines the n nested loops that follow into a
  single iteration space

• Combining with a dynamic scheduling may lead to a better load balance
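A small sketch of when collapsing helps: if the outer loop has fewer iterations than there are threads, parallelizing it alone leaves threads idle. The function `fill_table` is an illustrative assumption. With `collapse(2)` the rows × cols pairs (i, j) form one iteration space that is divided among the threads.

```c
/* Fills a rows x cols table (row-major) with t[i][j] = i * j. */
void fill_table(int *t, int rows, int cols)
{
    /* Without collapse(2), at most "rows" threads would receive work.
     * The two loops must be perfectly nested for collapse to apply. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            t[i * cols + j] = i * j;
}
```

For example, with rows = 2 and 8 threads, the uncollapsed loop keeps only two threads busy, while the collapsed loop distributes 2 × cols iterations.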


Vector Processing in OpenMP

#pragma omp parallel for private(j)
for(i = 0; i < n; i++)
    #pragma omp simd
    for(j = 0; j < n; j++)
        M[i][j] = A[i][j] + B[i][j];

    A[i][j]   A[i][j+1]   A[i][j+2]   A[i][j+3]
  + B[i][j]   B[i][j+1]   B[i][j+2]   B[i][j+3]
  = M[i][j]   M[i][j+1]   M[i][j+2]   M[i][j+3]

• Even when all threads are in use, vector processing can extract
  more parallelism at the instruction level
• Recent compilers already vectorize automatically, but may be too conservative
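Two further uses of the simd construct, sketched under the assumption of a compiler that supports OpenMP 4.0 or later. The function names are illustrative. The combined construct `parallel for simd` splits a one-dimensional loop across threads and asks for each thread's share to be vectorized; `simd reduction` vectorizes an accumulation inside a single thread.

```c
/* m[i] = a[i] + b[i]: iterations are divided among threads,
 * and each thread's portion is vectorized. */
void vec_add(const float *a, const float *b, float *m, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        m[i] = a[i] + b[i];
}

/* Sum of squares computed by one thread using vector lanes;
 * the reduction clause gives each lane its own partial sum. */
float sum_of_squares(const float *a, int n)
{
    float sum = 0.0f;

    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * a[i];

    return sum;
}
```

The pragma asserts that the iterations are safe to execute in vector lanes, so it should only be used on loops without dependences between iterations, other than a declared reduction.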

Nested Parallelism

• Parallel regions can be nested

- Support is implementation dependent

[Figure: master thread forking and joining nested teams of threads]

• Must be enabled with the OMP_NESTED environment variable or
  the omp_set_nested() routine.

- If a parallel directive is encountered within another parallel directive,
  a new team of threads is created

- The new team contains only one thread, unless nested parallelism is enabled

Nested Parallelism

• Set number of threads per level

– Environment variable: OMP_NUM_THREADS (e.g., 4,3,2)

– Runtime routine: omp_set_num_threads() inside a parallel region

– Clause: add num_threads() clause to a parallel directive

• Set/get the maximum number of OpenMP threads available to the program

– Environment variable: OMP_THREAD_LIMIT

– Runtime routine: omp_get_thread_limit()

Nested Parallelism

• Set/get the maximum number of nested active parallel regions:

– Environment variable: OMP_MAX_ACTIVE_LEVELS

– Runtime routines: omp_set_max_active_levels(),


• Library routines to determine:

– Depth of nesting: 


– IDs of parent/grandparent/etc threads:


– Team sizes of parent/grandparent/etc teams:
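The routine names for the three items above are missing from the slide. The sketch below uses routines from the standard OpenMP runtime library that answer these questions (`omp_get_level()`, `omp_get_ancestor_thread_num()`, `omp_get_team_size()`); choosing them here is an assumption and they are not necessarily the ones the slide listed. The function name `nested_ids_consistent` is hypothetical. It opens two nested parallel regions and checks that the values reported inside the inner one agree with each other.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if the nesting queries are mutually consistent, 0 otherwise.
 * When compiled without OpenMP there is nothing to check and it returns 1. */
int nested_ids_consistent(void)
{
    int errors = 0;

#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* allow two active nested levels */

    #pragma omp parallel num_threads(2) reduction(+:errors)
    {
        int outer_id = omp_get_thread_num();

        #pragma omp parallel num_threads(2) reduction(+:errors)
        {
            /* Depth of nesting: number of enclosing parallel regions. */
            int level = omp_get_level();
            if (level != 2)
                errors++;

            /* ID of the parent thread (the ancestor at level 1). */
            if (omp_get_ancestor_thread_num(1) != outer_id)
                errors++;

            /* At the current level, the ancestor is the thread itself. */
            if (omp_get_ancestor_thread_num(level) != omp_get_thread_num())
                errors++;

            /* Team size at the current level is the size of this team. */
            if (omp_get_team_size(level) != omp_get_num_threads())
                errors++;
        }
    }
#endif

    return errors == 0;
}
```

The number of threads actually granted to each team may be smaller than requested, which is why the checks compare the queries with each other instead of with fixed team sizes.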


• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Next Class

• Task directive

• Performance considerations

• Debugging

