LAMMPS Overdrive
http://sites.google.com/site/akohlmey/
PairLJOMP
- derived from PairLJ and ThrOMP
- replaces ::compute() with threaded version
- gets access to ThrData instance from FixOMP

ThrData
- per-thread accumulators
- one instance per thread

FixOMP
- regularly called during MD loop
- manages ThrData instances
- determines when to reduce forces
- toggles thread-related features
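Expressed as a purely schematic C++ sketch, the relationship between these classes might look as follows; the member names and simplified signatures are illustrative assumptions, not the actual LAMMPS headers:

#include <vector>

struct ThrData {                      // per-thread accumulators
  std::vector<double> f;              // private force buffer for one thread
  double evdwl = 0.0;                 // per-thread energy accumulator
};

struct FixOMP {                       // called regularly during the MD loop
  std::vector<ThrData> thr;           // one ThrData instance per thread
  ThrData &get_thr(int tid) { return thr[tid]; }  // hand out per-thread data
  bool reduce_now = false;            // toggled when forces must be reduced
};

struct PairLJ {                       // plain (non-threaded) pair style
  virtual void compute(int eflag, int vflag) { /* serial force loop */ }
  virtual ~PairLJ() {}
};

struct ThrOMP {                       // shared helpers for all /omp styles
  FixOMP *fix = nullptr;              // access point to the ThrData instances
};

// The /omp variant keeps the PairLJ interface but replaces compute()
// with a version that runs the force loop on OpenMP threads.
struct PairLJOMP : PairLJ, ThrOMP {
  void compute(int eflag, int vflag) override {
    // each thread accumulates into fix->get_thr(tid).f; FixOMP reduces later
  }
};

The key point is that only the pair style's compute() is replaced; FixOMP and ThrData supply and manage the per-thread buffers, so the rest of the code stays unchanged.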
Naive OpenMP LJ Kernel

#if defined(_OPENMP)
#pragma omp parallel for default(shared) \
  private(i) reduction(+:epot)
#endif
/* each thread will work on different values of "i" */
for (i = 0; i < (sys->natoms - 1); ++i) {
    double rx1 = sys->rx[i];
    double ry1 = sys->ry[i];
    double rz1 = sys->rz[i];
    [...]
    /* the "critical" directive lets only one thread at a time execute this block */
#if defined(_OPENMP)
#pragma omp critical
#endif
    {
        sys->fx[i] += rx * ffac;
        sys->fy[i] += ry * ffac;
        sys->fz[i] += rz * ffac;
        sys->fx[j] -= rx * ffac;
        sys->fy[j] -= ry * ffac;
        sys->fz[j] -= rz * ffac;
    }
}

Race condition: "i" will be unique for each thread, but not "j"; some "j" may be an "i" of another thread => multiple threads update the same memory location.

Timings (108 atoms): serial: 4.0s, 1 thread: 4.2s, 2 threads: 7.1s, 4 threads: 7.7s, 8 threads: 8.6s
Alternatives to “omp critical”
● Use omp atomic to protect each force addition (see the sketch after this list)
  => requires hardware support (modern x86)
  1Thr: 6.3s, 2Thr: 5.0s, 4Thr: 4.4s, 8Thr: 4.2s
  => faster than omp critical with multiple threads, but still slower than the serial code (4.0s)
● Don't use Newton's 3rd law
  => no race condition
  1Thr: 6.5s, 2Thr: 3.7s, 4Thr: 2.3s, 8Thr: 2.1s
  => better scaling, but 2 threads ~= serial speed
  => this is what is done on GPUs (many threads)
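As a rough illustration of the first alternative, here is a minimal sketch of the per-pair force update protected with omp atomic; the helper name and raw-array arguments are illustrative assumptions, following the same force-update pattern as the kernel above:

/* Sketch: every scalar update is protected individually instead of
   wrapping all six updates into one critical section. */
static void add_pair_force(double *fx, double *fy, double *fz,
                           int i, int j,
                           double rx, double ry, double rz, double ffac)
{
    /* "i" of one thread can be a "j" of another, so all six updates race */
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fx[i] += rx * ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fy[i] += ry * ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fz[i] += rz * ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fx[j] -= rx * ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fy[j] -= ry * ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
    fz[j] -= rz * ffac;
}

Because every update becomes its own atomic operation there is no large serialized section, which is why this scales better than omp critical, but the atomic instructions still cost more than plain additions, as the timings above show.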
“MPI-like” Approach with OpenMP

#if defined(_OPENMP)
#pragma omp parallel reduction(+:epot)
#endif
{
    double *fx, *fy, *fz;
#if defined(_OPENMP)
    int tid = omp_get_thread_num();   /* the thread number is like an MPI rank */
#else
    int tid = 0;
#endif
    /* sys->fx holds storage for one full fx array for each thread
       => the race condition is avoided */
    fx = sys->fx + (tid * sys->natoms); azzero(fx, sys->natoms);
    fy = sys->fy + (tid * sys->natoms); azzero(fy, sys->natoms);
    fz = sys->fz + (tid * sys->natoms); azzero(fz, sys->natoms);

    for (int i = 0; i < (sys->natoms - 1); i += sys->nthreads) {
        int ii = i + tid;
        if (ii >= (sys->natoms - 1)) break;
        rx1 = sys->rx[ii];
        ry1 = sys->ry[ii];
        rz1 = sys->rz[ii];
        [...]
MPI-like Approach with OpenMP (2)
● We need to write our own reduction:

    /* make certain all threads are done computing forces */
#if defined(_OPENMP)
#pragma omp barrier
#endif
    /* use the threads to parallelize the reduction */
    i = 1 + (sys->natoms / sys->nthreads);
    fromidx = tid * i;
    toidx = fromidx + i;
    if (toidx > sys->natoms) toidx = sys->natoms;
    for (i = 1; i < sys->nthreads; ++i) {
        int offs = i * sys->natoms;
        for (int j = fromidx; j < toidx; ++j) {
            sys->fx[j] += sys->fx[offs + j];
            sys->fy[j] += sys->fy[offs + j];
            sys->fz[j] += sys->fz[offs + j];
        }
    }
OpenMP Timings Comparison
● omp critical timings
1Thr: 4.2s, 2Thr: 7.1s, 4Thr: 7.7s, 8Thr: 8.6s
● omp atomic timings
1Thr: 6.3s, 2Thr: 5.0s, 4Thr: 4.4s, 8Thr: 4.2s
● omp parallel region (MPI-like) timings
1Thr: 4.0s, 2Thr: 2.5s, 4Thr: 2.2s, 8Thr: 2.5s
● No Newton's 3rd law timings
1Thr: 6.5s, 2Thr: 3.7s, 4Thr: 2.3s, 8Thr: 2.1s
=> the omp parallel variant is best for few threads; the no-Newton's-3rd-law variant is better for more threads
=> the cost of the force reduction grows with the number of threads
2x Intel Xeon 2.66 GHz (Harpertown) w/ DDR InfiniBand
CHARMM force field (lj/charmm/coul/long + pppm), 32,000 atoms
[Chart: time for 1000 MD steps (s) vs. number of nodes (1-32), comparing 8 MPI tasks/node with 2 MPI + 4 OpenMP/node; percentage annotations range from 80% down to 10%.]
Running Big
● Vesicle fusion study: impact of lipid ratio in a binary mixture
● cg/cmm/coul/long
● Experimental size => 4M CG-beads for one vesicle plus solvent
● 30,000,000 CPU-hour INCITE project
Strong Scaling (Cray XT5)
1 vesicle CG system / 3,862,854 CG-beads
[Plot: time per MD step (s) vs. number of nodes, comparing 12 MPI / 1 OpenMP, 6 MPI / 2 OpenMP, 4 MPI / 3 OpenMP, and 2 MPI / 6 OpenMP per node; times span roughly 0.36 s down to 0.04 s.]
Strong Scaling (2) (Cray XT5)
8 vesicle CG system / 30,902,832 CG-beads
[Plot: time per MD step (s) vs. number of nodes for the same MPI/OpenMP combinations; times span roughly 0.61 s down to 0.1 s.]
The Curse of the k-Space (1)
Rhodopsin benchmark, 860k atoms, 64 nodes, Cray XT5
[Stacked bar chart: time in seconds, broken down into Pair, Bond, Kspace, Comm, Neighbor, and Other, at 128-768 PEs, comparing 1 MPI/node + OpenMP, 2 MPI + 6 OpenMP/node, 4 MPI/node, 6 MPI/node, and 12 MPI/node.]
The Curse of the k-Space (2)
Rhodopsin benchmark, 860k atoms, 128 nodes, Cray XT5
[Stacked bar chart: same breakdown and MPI/OpenMP combinations as above, at 256-1536 PEs.]
The Curse of the k-Space (3)
Rhodopsin benchmark, 860k atoms, 512 nodes, Cray XT5
[Stacked bar chart: same breakdown and MPI/OpenMP combinations as above, at 1024-6144 PEs.]
Additional Improvements
● OpenMP threading added to charge density
accumulation and force application in PPPM
● Force reduction only done on last /omp style
● Integration style verlet/split, contributed by the Voth group, which runs k-space on a separate partition (compatible with the OpenMP version of PPPM)
● Added threading to selected fixes, such as charge equilibration for the COMB many-body potential
● Added threading to fix nve/sphere integrator
Current GPU Support in LAMMPS
● Multiple developments from different groups
● Converged to two efforts with two philosophies
● GPU package (minimalistic)
● pair styles, neighbor lists and k-space (optional):
● Download coordinates, retrieve forces
● Run asynchronously to bonded (and k-space)
● USER-CUDA package (see next talk)
● Replace all classes that touch atom data
● Data transfer between host and GPU as needed
Special Features of “GPU” Package
● Can be compiled for CUDA or OpenCL thanks to the “Geryon” preprocessor macros
● Can attach multiple MPI tasks to one GPU for
improved GPU utilization (up to 4x over-
subscription on “Fermi”, up to 15x on “Kepler”)
● Uses a “fix” to manage GPUs and compute
kernel dispatch, “styles” dispatch kernels
asynchronously, “fix” then retrieves the forces
after all other force computations are completed
● Tuned for good scaling with fewer atoms/GPU
1x GPU Performance in LAMMPS
Bulk water, LJ + long-range electrostatics; 5,376 water (5000 steps) and 21,504 water (1000 steps); PPPM on GPU
[Bar chart: time in seconds for an 8-core 2.8 GHz host vs. GeForce GTX 480, Tesla C2050, and ATI FirePro V8800, each in single (sp), mixed (mp), and double (dp) precision.]
Multiple GPUs per Node
Bulk water, LJ + long-range electrostatics; 5,376 water (5000 steps) and 21,504 water (1000 steps); PPPM on GPU
[Bar chart: time in seconds for an 8-core 2.8 GHz host vs. 1x, 2x, and 4x Tesla C2050 and 1x, 2x, and 4x ATI FirePro V8800, all in mixed precision (mp).]
Comments on GPU Acceleration
● Mixed precision (force computation in single precision, force accumulation in double precision) is a good compromise: little overhead and good accuracy for forces, somewhat less so for stress/pressure (see the sketch after this list)
● GPU acceleration is larger for models that require more computation in the force kernel
● Acceleration drops as the number of atoms per GPU decreases => limited strong scaling on “big iron”
● The amount of acceleration depends on both the host and the GPU
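As an illustration of the first point, here is a minimal sketch of the mixed-precision idea for a single LJ pair; all names are illustrative assumptions, and this is not code taken from the GPU package:

/* Sketch: the pairwise force factor is evaluated in single precision,
   while the per-atom force is accumulated in double precision. */
static void lj_pair_mixed(float rsq, float sigma6, float epsilon,
                          float dx, float dy, float dz,
                          double *fx, double *fy, double *fz)
{
    /* cheap single-precision arithmetic for the per-pair term */
    float r2inv = 1.0f / rsq;
    float r6inv = r2inv * r2inv * r2inv;
    float ffac  = 48.0f * epsilon * sigma6 * r6inv
                  * (sigma6 * r6inv - 0.5f) * r2inv;

    /* double-precision accumulation limits round-off over many pairs */
    *fx += (double)(dx * ffac);
    *fy += (double)(dy * ffac);
    *fz += (double)(dz * ffac);
}

Accumulating in double keeps the summed forces accurate even though each pair term carries only single precision, which matches the "little overhead, good accuracy for forces" observation above.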
Installation of USER-OMP and GPU
● USER-OMP package:
● make yes-user-omp to install sources
● Add -fopenmp (GNU) or -openmp (Intel) to CC and
LINK definitions in your makefile to enable OpenMP
● Compilation without OpenMP => similar to OPT
● GPU package:
● Compile library in lib/gpu for CUDA or OpenCL
● make yes-gpu to install style sources which are
wrappers for GPU library
● Tweak lib/gpu/Makefile.lammps.??? as needed
Using Accelerated Code
● All accelerated styles are optional and need to
be activated in the input or from command line
● Naming convention: lj/cut -> lj/cut/omp or lj/cut/gpu
● From the command line: -sf omp or -sf gpu
● Inside the input script: suffix omp or suffix gpu, plus suffix on or suffix off
● Use the package omp or package gpu command to adjust acceleration settings and GPU selection
● The -sf command-line flag implies default package settings
Conclusions and Outlook: OpenMP
● OpenMP+MPI is almost always a win, especially
with large node counts (=> capability computing)
● USER-OMP also contains serial optimizations and is thus useful even when compiled without OpenMP
● Minimal changes to LAMMPS core code
● USER-OMP is only a transitional implementation, since it is efficient only for a small number of threads
● A longer-term solution also needs to consider vectorization, and will thus be more GPU-like and benefit from a different data layout (see next talk)