Day 2.1: Advanced OpenMP
Lukas Einkemmer
Department of Mathematics
University of Innsbruck
WRONG!
bool wait = false;
#pragma omp parallel for
for(int i=0;i<n;i++) {
    // busy wait (race: every thread reads and writes the shared flag)
    while(wait)
        ;
    wait = true;
    // do some work
    wait = false;
}
WRONG! One thread sets wait = true, does its work, and resets wait = false. Without a flush, however, the other thread may never observe these updates: its busy wait degenerates into while(true). The same can happen to both threads, and the program hangs.
OpenMP memory model
while(wait)
    ;
compiles to
.L4:
    jmp .L4
The compiler may assume that no other thread modifies wait, keep it in a register, and turn the busy wait into an unconditional infinite loop.
[Figure: the flag can simultaneously hold different values in a CPU register, the cache, and main memory]
A flush is implied at
- barrier
- beginning and end of critical
- beginning and end of a parallel region
- end of a worksharing construct (for, do, sections, single, workshare)
- immediately before and after a task scheduling point

No flush is implied at
- beginning of a worksharing construct (for, do, sections, single, workshare)
- beginning and end of master
WRONG!
#pragma omp parallel
{
    #pragma omp for reduction(+:s) nowait
    for(int i=0;i<n;i++)
        s += v[i];
    int id = omp_get_thread_num();
    a[id] = f(s, id); // nowait removed the barrier: s may be incomplete here
}
WRONG!
#pragma omp parallel
{
    time_t t;
    time(&t);
    tm* ptm = gmtime(&t); // gmtime returns a pointer to a shared internal object
}
From http://www.cplusplus.com/reference/ctime/gmtime/:
"A pointer to a tm structure with its members filled with the values that correspond to the UTC time representation of timer. The returned value points to an internal object whose validity or value may be altered by any subsequent call to gmtime or localtime."
Library functions
[Figure: the values of array a are replicated in the per-core caches; main memory holds the shared copy]
Time step
u_{ij}^{n+1} = u_{ij}^{n} + \frac{\Delta t}{\Delta x^2}\left(u_{i+1,j}^{n} - 2u_{ij}^{n} + u_{i-1,j}^{n}\right) + \frac{\Delta t}{\Delta y^2}\left(u_{i,j+1}^{n} - 2u_{ij}^{n} + u_{i,j-1}^{n}\right)
Heat equation

Goals:
- Parallelization of a more realistic application.
- Understand the performance of parallel programs.

Tasks:
- Interchange nested loops.
- Investigate performance as a function of the problem size.

Expected results:
- No speedup for 80 × 80.
- Significant speedup for 250 × 250.
- Super-linear speedup for 1000 × 1000.
  The problem no longer fits into the cache of a single core.
- By increasing the number of cores the amount of available cache increases.
The simd directive is used to tell the compiler that the loop
iterations are independent.
#pragma omp simd
for(int i=0;i<n;i++)
a[i] += b[i];
WRONG!
#pragma omp simd
for(int i=5;i<n;i++)
a[i] = a[i-5]*b[i];
Correct: safelen(4) asserts that at most 4 iterations may execute concurrently, which is safe here because a[i] depends on a[i-5] (dependence distance 5).
#pragma omp simd safelen(4)
for(int i=5;i<n;i++)
a[i] = a[i-5]*b[i];
No vectorization (AoS):
#pragma omp simd
for(int i=0;i<n;i++)
    v_aos[i].density = f(v_aos[i].density);

Vectorization (SoA):
#pragma omp simd
for(int i=0;i<n;i++)
    v_soa.density[i] = f(v_soa.density[i]);

[Figure: memory access pattern, strided for AoS vs contiguous for SoA]
Excerpt from /proc/cpuinfo:
processor   : 0      processor   : 4
physical id : 0      physical id : 0
core id     : 0      core id     : 0
Processors 0 and 4 report the same physical id and core id: they are the two hyperthreads of one core.
[Figure: two-socket system; each socket is attached to its own memory via a memory bus]
Place partition:
OMP_PLACES = threads or cores or sockets
Threads can freely migrate within a place.

Placement options:
OMP_PROC_BIND = spread or close or master
- close: place threads as close together as possible.
- spread: place threads as far apart as possible.
- master: place threads on the same place as the master thread.
Thread placement
Place all threads on the same NUMA node, one thread per
core.
OMP_NUM_THREADS=4
OMP_PLACES=cores
OMP_PROC_BIND=close
[Figures: T0–T3 are bound to the four cores of Socket 0, one thread per core; Socket 1 stays idle]
OMP_NUM_THREADS=16
OMP_PLACES=threads
OMP_PROC_BIND=close
[Figure: T0–T7 fill the eight hyperthreads of Socket 0; T8–T15 fill those of Socket 1]
Thread placement
OMP_NUM_THREADS=8
OMP_PLACES=threads
OMP_PROC_BIND=spread
[Figure: T0–T3 on the cores of Socket 0, T4–T7 on the cores of Socket 1, one thread per core]
OMP_NUM_THREADS=2,4,2
OMP_PLACES=threads
OMP_PROC_BIND=spread,spread,close
The code
#pragma omp parallel // creates one thread/socket
#pragma omp parallel // creates one thread/core
#pragma omp parallel // creates one thread/hyperthread
//code
creates a total of 16 threads.
The taskloop directive
Remember tasks
struct node {
node *left, *right;
};
void traverse(node* p) {
if(p->left)
#pragma omp task
traverse(p->left); // this is created as a task
if(p->right)
#pragma omp task
traverse(p->right); // this is created as a task
process(p);
}
int main() {
node tree;
#pragma omp parallel // create a team of threads
{
#pragma omp single
traverse(&tree); // one thread starts the traversal; the created tasks run in parallel
}
}
Taskloop
Taskloop works like a parallel for loop and is used like a task
construct.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for(int i=0;i<n;i++)
a[i] = b[i] + i;
{
    // MPI communication
    for(int i=0;i<n_b;i++)
        a[i] = ...;
}
[Figure: sections 1 and 2 and the loop chunks are scheduled concurrently as tasks]
Goal:
- Usage of the taskloop construct.