04 CPD OpenMP Sync Tasks


More on OpenMP

Parallel and Distributed Computing

José Monteiro

Thursday, 2 March 2023


Outline

• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Parallel and Distributed Computing 2


Shared-Memory Systems

• Uniform Memory Access (UMA) architecture, also known as Symmetric Shared-Memory Multiprocessors (SMP)

[Diagram: four cores, each with its own cache, connected to a shared main memory and I/O.]
Parallel and Distributed Computing 3


Explicit Synchronization

A barrier can be explicitly inserted within the parallel code

/* some multi-threaded code */


#pragma omp barrier


/* remainder of multi-threaded code */

[Diagram: barrier region over time — tasks 0..n reach the barrier at different moments and sit idle until the last one arrives, after which all proceed.]

Parallel and Distributed Computing 4


Explicit Synchronization

#pragma omp parallel
{
  /* All threads execute this. */
  SomeCode();

  #pragma omp barrier

  /* All threads execute this, but not before
   * all threads have finished executing SomeCode(). */
  SomeMoreCode();
}

Parallel and Distributed Computing 5


Explicit Synchronization

• Critical region, similar to mutexes in threads

- A thread waits at the beginning of a critical region until no other thread is executing a critical region with the same name

- All unnamed critical directives map to the same unspecified name

#pragma omp critical [(name)]
{ … }
[Diagram: critical region over time — only one task executes the critical region at a time; tasks that arrive while it is occupied sit idle until it becomes free.]

Parallel and Distributed Computing 6


Example of critical Clause

int cnt = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp critical
      cnt++;
    } /* endif */
    a[i] += b[i] * (i+1);
  } /* omp end for */
} /* omp end parallel */

Parallel and Distributed Computing 7


Explicit Synchronization with atomic Clause

#pragma omp atomic
<statement>

• Atomic operation for reading/writing

– Guarantees that the reading and writing of one memory location is atomic

– Applies only to the statement immediately following it, which needs to be a simple operation of the form: x <binop>= expr

• The critical directive implements mutual exclusion in terms of the execution of a region of code

– All threads arriving at a critical region are blocked

• The atomic directive implements mutual exclusion in terms of the access to data

– atomic only blocks threads operating on the same memory location

– requires hardware support

Parallel and Distributed Computing 8


Example of atomic Clause

int accum = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp atomic
      accum += b[i] * (i+1);
    } /* endif */
  } /* end for */
} /* omp end parallel */

Parallel and Distributed Computing 9


Explicit Locks

• The critical directive implicitly uses locks

– OpenMP does allow the explicit manipulation of locks

void omp_init_lock(omp_lock_t *lock);
void omp_destroy_lock(omp_lock_t *lock);
void omp_set_lock(omp_lock_t *lock);
void omp_unset_lock(omp_lock_t *lock);
int omp_test_lock(omp_lock_t *lock);

• Less clean and more error prone than critical

• But more flexible

- In terms of the entry and exit points of exclusive regions

- Allows finer-grained conditions for mutual exclusion (see the omp_test_lock() sketch after the next example)

Parallel and Distributed Computing 10


Example of Using Locks

int cnt = 0;
omp_lock_t lck_a;          /* the lock object itself, not a pointer */

omp_init_lock(&lck_a);
...
#pragma omp parallel for
for(i = 0; i < 20; i++) {
  if(b[i] == 0) {
    omp_set_lock(&lck_a);
    cnt++;
    omp_unset_lock(&lck_a);
  } /* endif */
  a[i] += b[i] * (i+1);
} /* end for */
...
omp_destroy_lock(&lck_a);

Parallel and Distributed Computing 11
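
The omp_test_lock() routine is what makes explicit locks more flexible than critical: a thread can try to acquire the lock and, if it is busy, do other useful work instead of blocking. A minimal, self-contained sketch of this pattern (not from the slides; the "other work" here is just a counter):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int shared_counter = 0;   /* protected by lck */
  int other_work = 0;       /* work done while the lock was busy */
  omp_lock_t lck;
  omp_init_lock(&lck);

  #pragma omp parallel
  {
    int done = 0;
    while (!done) {
      if (omp_test_lock(&lck)) {   /* nonzero means the lock was acquired */
        shared_counter++;          /* protected update */
        omp_unset_lock(&lck);
        done = 1;
      } else {
        #pragma omp atomic
        other_work++;              /* make progress instead of blocking */
      }
    }
  }

  omp_destroy_lock(&lck);
  printf("counter = %d, other work = %d\n", shared_counter, other_work);
  return 0;
}
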


Single Processor Region

• Slightly different problem: how to have a single thread execute a region of the parallel section?

#pragma omp single
{ ... }

– Ideally suited for I/O or initialization

– Use master instead of single to guarantee that the master thread (thread 0) is the one that executes the region (note that master, unlike single, has no implied barrier at the end)
[Diagram: single-processor region over time — one task executes the region while the remaining tasks sit idle until it completes.]

Parallel and Distributed Computing 12


Example of Single Processor Region

#pragma omp parallel
{
  #pragma omp single
  printf("Beginning work1.\n");
  work1();
  #pragma omp single
  printf("Finishing work1.\n");
  #pragma omp single nowait
  printf("Finished work1 and beginning work2.\n");
  work2();
}

Parallel and Distributed Computing 13


Single Processor Region

What happens if you have:

<some code in a parallel region>


#pragma omp single
i = complexFunction();
<rest of code of the parallel region>

• i shared: unambiguous result


• i private: changes the private copy of the thread that executes the single (probably not the desired outcome); use copyprivate to update all copies

#pragma omp single copyprivate(i)
i = complexFunction();

Parallel and Distributed Computing 14
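
A self-contained sketch of the copyprivate pattern above; complexFunction() is replaced here by a placeholder value, since the slides do not define it:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    int i;   /* declared inside the region, hence private to each thread */

    /* One thread computes the value; copyprivate broadcasts its private
     * copy of i to the private copies of all other threads in the team. */
    #pragma omp single copyprivate(i)
    i = 42;  /* placeholder for complexFunction() */

    printf("Thread %d sees i = %d\n", omp_get_thread_num(), i);
  }
  return 0;
}
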


2nd Exercise with single

void func(int id) {
  printf("Thread %d in func!\n", id);
  #pragma omp for
  for(int i = 0; i < 4; i++)
    printf("Tid: %d\ti = %d\n", id, i);
}

int main() {
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    printf("Thread %d alive!\n", tid);
    #pragma omp master
    {
      printf("Thread %d in single!\n", tid);
      func(tid);
    }
  }
}
Parallel and Distributed Computing 15

2nd Exercise with single

Answer: DEADLOCK! The #pragma omp for inside func() is a worksharing construct with an implicit barrier at its end, but only the thread executing the master block encounters it. That thread waits at the for's barrier while the other thread waits at the barrier at the end of the parallel region, so neither can ever proceed.

Parallel and Distributed Computing 15
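
One way to repair the exercise, assuming the intent is simply that the iterations be shared among the team, is to drop the master construct so that every thread reaches the orphaned for (a sketch, not the official solution from the course):

#include <stdio.h>
#include <omp.h>

void func(int id) {
  printf("Thread %d in func!\n", id);
  #pragma omp for                /* now encountered by all threads of the team */
  for (int i = 0; i < 4; i++)
    printf("Tid: %d\ti = %d\n", id, i);
}

int main(void) {
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    printf("Thread %d alive!\n", tid);
    func(tid);                   /* called by every thread, so the for can distribute iterations */
  }
  return 0;
}
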
Conditional Parallelism 


• Oftentimes, parallelism is only useful if the problem size is large enough

– For regions with low computational effort, the overhead of parallelization exceeds the benefit

#pragma omp parallel if( expression )

#pragma omp parallel sections if( expression )

#pragma omp parallel for if( expression )

• Execute in parallel if expression evaluates to true, otherwise execute sequentially.

Parallel and Distributed Computing 16


Example of Conditional Parallelism

for(i = 0; i < n; i++)
  #pragma omp parallel for private(j,k) if(n-i > 100)
  for(j = i + 1; j < n; j++)
    for(k = i + 1; k < n; k++)
      a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];

Parallel and Distributed Computing 17


Reduction Clause

How to parallelize the computation of an inner product?

for(i = 0; i < n; i++)
  result += a[i] * b[i];

#pragma omp parallel for reduction(op:list)

– op is a binary operator (+, *, -, &, ^, |, &&, ||)

– list is a list of shared variables

– Actions

• A private copy of each list variable is created for each thread

• At the end of the reduction, the reduction operator is applied to all private copies of the variable, and the result is written to the global shared variable

Parallel and Distributed Computing 18


Reduction Example

main() {
  int i, n = 100;
  float a[100], b[100], result = 0.0;

  #pragma omp parallel
  {
    #pragma omp for
    for(i = 0; i < n; i++) { // initialize vectors a and b
      a[i] = i * 1.0;
      b[i] = i * 2.0;
    }

    #pragma omp for reduction(+:result)
    for(i = 0; i < n; i++) // compute internal product
      result = result + (a[i] * b[i]);
  }
  printf("Final result = %f\n", result);
}

Parallel and Distributed Computing 19
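
For intuition, the reduction clause behaves roughly like keeping a private partial sum per thread and combining the partial sums at the end, for example inside a critical region. A sketch of that manual pattern (illustration only; the clause itself is the preferred form):

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i, n = 100;
  float a[100], b[100], result = 0.0f;

  for (i = 0; i < n; i++) { a[i] = i * 1.0f; b[i] = i * 2.0f; }

  #pragma omp parallel
  {
    float partial = 0.0f;          /* private partial sum */

    #pragma omp for
    for (i = 0; i < n; i++)
      partial += a[i] * b[i];

    #pragma omp critical           /* combine the private copies */
    result += partial;
  }
  printf("Final result = %f\n", result);
  return 0;
}
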


Load Balancing

• With irregular workloads, care must be taken in distributing the work over the threads

– Example: multiplication of two matrices C = A × B, where the A matrix is upper-triangular (all elements below the diagonal are 0)

#pragma omp parallel for private(j,k)
for(i = 0; i < n; i++)
  for(j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for(k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

Parallel and Distributed Computing 20


The schedule Clause

Different options for work distribution among threads

schedule(static | dynamic | guided [, chunk])
schedule(auto | runtime)

• static [,chunk]
- Iterations are divided into blocks of size chunk, and these blocks are assigned to the threads in a round-robin fashion

- In the absence of chunk, the iterations are divided into P blocks of approximately N/P iterations each, one per thread, for a loop of length N and P threads

- Example: loop of length N=8 and P=2 threads:

              TID 0        TID 1
  No chunk    1-4          5-8
  chunk = 2   1-2, 5-6     3-4, 7-8

Parallel and Distributed Computing 21


The schedule Clause

• dynamic [,chunk]
- A block of chunk iterations is assigned to each thread (defaults to 1 if chunk is not specified)

- When a thread finishes its block, it starts on the next available block

- Each block contains chunk iterations, except for the last block to be distributed, which may have fewer iterations

• guided [,chunk]
- Same dynamic assignment as dynamic, but threads are assigned blocks of decreasing size

- The size of each block is proportional to the number of unassigned iterations divided by the number of threads, decreasing to chunk (minimum size, 1 if not specified)

Parallel and Distributed Computing 22


The schedule Clause

Example assignment of iteration chunks to threads T0–T3 under each policy (with guided, successive chunks shrink in size):

Static:   T0 T1 T2 T3 T0 T1 T2 T3 T0 T1 T2 T3
Dynamic:  T2 T1 T0 T3 T0 T3 T2 T3 T1 T3 T0 T2
Guided:   T2 T0 T3 T1 T0 T2 T1 T3 T2 T0 T1 T2

Parallel and Distributed Computing 23


The schedule Clause

• auto
- The scheduling decision is delegated to the compiler and/or the runtime system

• runtime
- The iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE

Parallel and Distributed Computing 24
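
A small sketch of how schedule(runtime) and OMP_SCHEDULE fit together; the schedule value shown in the comment is just an example:

#include <stdio.h>

int main(void) {
  /* Chosen at run time, e.g.:  export OMP_SCHEDULE="dynamic,4" */
  double sum = 0.0;

  #pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (int i = 0; i < 1000; i++)
    sum += i * 0.5;

  printf("sum = %f\n", sum);
  return 0;
}
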


Scheduling Options

• Static vs. dynamic scheduling

– Static has lower overhead than dynamic

– Dynamic adapts better to higher workload imbalance

• Chunk size

– Larger chunks reduce overhead and may increase the cache hit rate

– Smaller chunks allow finer balancing of the workload

Parallel and Distributed Computing 25
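
Applying these options to the upper-triangular multiplication from the load-balancing slide, one plausible choice is a dynamic schedule, so that threads which finish the cheap (late) rows pick up more work. A sketch, with an arbitrary chunk size of 8:

/* Same loop nest as the load-balancing example, now with a dynamic schedule.
 * Rows cost different amounts of work because a[i][k] == 0 for k < i. */
void triangular_matmul(int n, double a[n][n], double b[n][n], double c[n][n]) {
  int i, j, k;
  #pragma omp parallel for private(j, k) schedule(dynamic, 8)
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      c[i][j] = 0.0;
      for (k = i; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];
    }
}
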


Collapsing Loops

#pragma omp parallel for private(j,k) collapse(2)
for(i = 0; i < n; i++)
  for(j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for(k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

• The collapse(n) clause collapses the n nested loops into a single iteration space

• Combining it with dynamic scheduling may lead to better load balancing

Parallel and Distributed Computing 26


Vector Processing in OpenMP

#pragma omp parallel for private(j)
for(i = 0; i < n; i++)
  #pragma omp simd
  for(j = 0; j < n; j++)
    M[i][j] = A[i][j] + B[i][j];

[Diagram: a single SIMD instruction adds the four lanes A[i][j..j+3] and B[i][j..j+3] and stores the results into M[i][j..j+3].]

• Even when all threads are in use, vector processing can extract additional parallelism at the instruction level

• Recent compilers already vectorize automatically, but may be too conservative

Parallel and Distributed Computing 27
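
The thread-level and SIMD directives can also be merged into a combined construct (assuming an OpenMP 4.0+ compiler); a brief sketch on the same matrix addition:

/* Combined construct: distributes iterations across threads and vectorizes them. */
void matrix_add(int n, float M[n][n], float A[n][n], float B[n][n]) {
  #pragma omp parallel for simd collapse(2)
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      M[i][j] = A[i][j] + B[i][j];
}
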


Nested Parallelism

• Parallel regions can be nested

- Support is implementation dependent

[Diagram: the master thread forks a team; each thread of that team forks and later joins its own nested team, before the outer team itself joins.]

• Must be enabled with the OMP_NESTED environment variable or the omp_set_nested() routine

- If a parallel directive is encountered within another parallel directive, a new team of threads is created

- The new team contains only one thread, unless nested parallelism is enabled
Parallel and Distributed Computing 28
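
A minimal sketch of enabling nested parallelism and observing the extra teams (omp_set_nested() is the interface named on the slide; newer OpenMP versions prefer omp_set_max_active_levels()):

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_nested(1);                     /* enable nested parallelism */

  #pragma omp parallel num_threads(2)    /* outer team: 2 threads */
  {
    int outer = omp_get_thread_num();

    #pragma omp parallel num_threads(3)  /* each outer thread forks an inner team of 3 */
    printf("outer %d, inner %d\n", outer, omp_get_thread_num());
  }
  return 0;
}
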
Nested Parallelism

• Set the number of threads per level

– Environment variable: OMP_NUM_THREADS (e.g., 4,3,2)

– Runtime routine: omp_set_num_threads() inside a parallel region

– Clause: add a num_threads() clause to a parallel directive

• Set/get the maximum number of OpenMP threads available to the program

– Environment variable: OMP_THREAD_LIMIT

– Runtime routine: omp_get_thread_limit()

Parallel and Distributed Computing 29


Nested Parallelism

• Set/get the maximum number of nested active parallel regions:

– Environment variable: OMP_MAX_ACTIVE_LEVELS

– Runtime routines: omp_set_max_active_levels(), omp_get_max_active_levels()

• Library routines to determine:

– Depth of nesting: omp_get_level()

– IDs of parent/grandparent/etc. threads: omp_get_ancestor_thread_num(level)

– Team sizes of parent/grandparent/etc. teams: omp_get_team_size(level)

Parallel and Distributed Computing 30
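
A short sketch exercising these routines inside two nested levels (output order will vary between runs; older runtimes may additionally require omp_set_nested(1)):

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_max_active_levels(2);          /* allow two active levels of parallelism */

  #pragma omp parallel num_threads(2)
  {
    #pragma omp parallel num_threads(2)
    {
      int level = omp_get_level();                     /* nesting depth: 2 here */
      printf("level %d, tid %d, parent tid %d, parent team size %d\n",
             level,
             omp_get_thread_num(),
             omp_get_ancestor_thread_num(level - 1),   /* ID of the parent thread */
             omp_get_team_size(level - 1));            /* size of the parent team */
    }
  }
  return 0;
}
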


Review

• Synchronization

• Conditional Parallelism

• Reduction Clause

• Scheduling Options

• Nested Parallelism

Parallel and Distributed Computing 31


Next Class

• Task directive

• Performance considerations

• Debugging

Parallel and Distributed Computing 32
