04 CPD OpenMP Sync Tasks
José Monteiro
• Synchronization
• Conditional Parallelism
• Reduction Clause
• Scheduling Options
• Nested Parallelism
[Figure: barrier region — timelines for Task 0 through Task n; tasks reach the barrier at different times and sit idle until the last task (Task n) arrives, after which all proceed together.]
int cnt = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp critical
      cnt++;
    } /* endif */
    a[i] += b[i] * (i+1);
  } /* omp end for */
} /* omp end parallel */
int accum = 0;
#pragma omp parallel
{
  #pragma omp for
  for(i = 0; i < 20; i++) {
    if(b[i] == 0) {
      #pragma omp atomic
      accum++;
    } /* endif */
  } /* end for */
} /* omp end parallel */
int cnt = 0;
omp_lock_t lck_a;
omp_init_lock(&lck_a);
...
#pragma omp parallel for
for(i = 0; i < 20; i++) {
  if(b[i] == 0) {
    omp_set_lock(&lck_a);
    cnt++;
    omp_unset_lock(&lck_a);
  } /* endif */
  a[i] += b[i] * (i+1);
} /* end for */
...
omp_destroy_lock(&lck_a);
2nd Exercise with single
int main() {
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    printf("Thread %d alive!\n", tid);
    #pragma omp master
    {
      printf("Thread %d in single!\n", tid);
      func(tid);
    }
  }
}

DEADLOCK!
Parallel and Distributed Computing 15
Conditional Parallelism
main() {
int i, n = 100;
float a[100], b[100], result = 0.0;
• static [,chunk]
- Iterations are divided into blocks of size chunk, and these blocks
are assigned to the threads in a round-robin fashion
- Example: 8 iterations, 2 threads

TID          0            1
No chunk     1-4          5-8
chunk = 2    1-2, 5-6     3-4, 7-8
• dynamic [,chunk]
- A block of size chunk iterations is assigned to each thread
(defaults to 1, if chunk not specified)
- Each block contains chunk iterations, except for the last block to
be distributed, which may have fewer iterations
• guided [,chunk]
- Same behavior as dynamic, but threads are assigned blocks of
decreasing size
Static:
T0 T1 T2 T3 T0 T1 T2 T3 T0 T1 T2 T3
Dynamic:
T2 T1 T0 T3 T0 T3 T2 T3 T1 T3 T0 T2
Guided:
T2 T0 T3 T1 T0 T2 T1 T3 T2 T0 T1 T2
• auto
- The decision regarding scheduling is delegated to compiler and/or
runtime system
• runtime
- Iteration scheduling scheme is set at run time through the
environment variable OMP_SCHEDULE
• Static scheduling
• Chunks
• Even when all threads are busy, vector processing can extract further
parallelism at the instruction level
• Recent compilers already vectorize automatically, but may be too
conservative
[Figure: nested parallelism — the master thread forks a team, and each thread of that team can in turn fork a team of its own.]
– Depth of nesting:
omp_get_level()
• Synchronization
• Conditional Parallelism
• Reduction Clause
• Scheduling Options
• Nested Parallelism
• Task directive
• Performance considerations
• Debugging