LAMMPS Overdrive


Accelerating classical MD for multi-core CPUs and GPUs


Dr. Axel Kohlmeyer
Associate Dean for Scientific Computing
College of Science and Technology
Temple University, Philadelphia

http://sites.google.com/site/akohlmey/

[email protected]

LAMMPS Users and Developers Workshop and Symposium, March 24th-28th, 2014
Standard LAMMPS Parallelization
● MPI based (MPI emulator for serial execution)
● Uses domain decomposition with 1 domain per MPI task (= processor); each MPI task looks after the atoms in its domain (see the sketch below)
● Atoms move from MPI task to MPI task as they move through the system
● Assumes the same amount of work (force computations) in each domain
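For illustration, a minimal MPI sketch of the "one domain per MPI task" idea (generic example code, not LAMMPS source; the real code also handles atom migration, ghost atoms, and communication):

#include <mpi.h>
#include <stdio.h>

/* build a 3d processor grid and derive each task's sub-domain
   from its grid coordinates */
int main(int argc, char **argv)
{
    int nprocs, me, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, coords[3];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);            /* factor tasks into a 3d grid */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &me);
    MPI_Cart_coords(grid, me, 3, coords);

    /* each task owns the box fraction [coords[d]/dims[d], (coords[d]+1)/dims[d])
       in every direction d and computes forces only for atoms in that domain;
       atoms that leave the domain are handed over to the neighboring task */
    printf("task %d owns domain (%d,%d,%d) of a %dx%dx%d grid\n",
           me, coords[0], coords[1], coords[2], dims[0], dims[1], dims[2]);

    MPI_Finalize();
    return 0;
}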
2
Why Bother Adding OpenMP?
1. Why not do it?
a) LAMMPS is already very parallel
b) Even more run-time settings to optimize
c) OpenMP is often less effective than MPI (for MD)
2. Why do it anyway?
a) On multi-core machines (Cray XT5) LAMMPS can run faster with MPI when some CPU cores are idle
b) Parallelization over particles, not domains
c) PPPM has scaling limitations; at high node counts it would be better to run it only on a subset of tasks
3
OpenMP Parallelization
● OpenMP is directive based
=> well-written code works with or without OpenMP enabled (see the example below)
● OpenMP can be added incrementally
● OpenMP only works in shared memory
=> multi-core processors are now ubiquitous
● OpenMP hides the calls to a threads library
=> less flexible, more overhead, but less effort
● Caution: need to worry about race conditions,
memory corruption, false sharing, Amdahl's law
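A minimal illustration of the directive-based approach (generic example, not LAMMPS code): compiled with OpenMP enabled (e.g. -fopenmp) the loop runs threaded; compiled without it, the pragma is simply ignored and the identical code runs serially.

#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    /* ignored by compilers without OpenMP support */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; ++i)
        sum += 1.0 / (double)(i + 1);

    printf("harmonic sum = %f\n", sum);
    return 0;
}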
4
How to add OpenMP to LAMMPS
LAMMPS is very modular: just add new classes derived from the non-threaded implementation
● Pairwise interactions (consume most time)
● i,j nested loop over neighbors can be parallelized
● each thread processes different “i” atoms (see the sketch after this list)
● Neighbor list build (binning still serial)
● i,j nested loop over atoms and neighboring bins
● Dihedrals and other bonded interactions
● Replace selected function(s) in derived class
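One possible sketch of threading a neighbor-list pair loop over “i” atoms (illustrative only: a full neighbor list is assumed so that each thread writes only the forces of its own “i” atoms and no race occurs; the actual USER-OMP code instead uses half lists with per-thread force arrays, as shown on the following slides; names mimic LAMMPS conventions):

/* Lennard-Jones forces in reduced units (epsilon = sigma = 1),
   full neighbor list, threads split the outer "i" loop */
void lj_forces_omp(int nlocal, const int *numneigh, int *const *firstneigh,
                   const double (*x)[3], double (*f)[3], double cutsq)
{
#if defined(_OPENMP)
#pragma omp parallel for default(shared) schedule(static)
#endif
    for (int i = 0; i < nlocal; ++i) {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (int jj = 0; jj < numneigh[i]; ++jj) {
            const int j = firstneigh[i][jj];
            const double delx = x[i][0] - x[j][0];
            const double dely = x[i][1] - x[j][1];
            const double delz = x[i][2] - x[j][2];
            const double rsq  = delx*delx + dely*dely + delz*delz;
            if (rsq < cutsq) {
                const double r2inv = 1.0 / rsq;
                const double r6inv = r2inv * r2inv * r2inv;
                const double fpair = 48.0 * r6inv * (r6inv - 0.5) * r2inv;
                fx += delx * fpair;
                fy += dely * fpair;
                fz += delz * fpair;
            }
        }
        /* only this thread touches atom i => no data race */
        f[i][0] += fx;  f[i][1] += fy;  f[i][2] += fz;
    }
}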
5
Threading Class Relations
● PairLJ: serial implementation, all non-threaded code
● ThrOMP: thread-safe utility functions, reduction of the per-thread forces
● PairLJOMP: derived from PairLJ and ThrOMP; replaces ::compute() with a threaded version; gets access to its ThrData instance from FixOMP
● ThrData: per-thread accumulators, one instance per thread
● FixOMP: regularly called during the MD loop; determines when to reduce forces; manages the ThrData instances; toggles thread-related features
(a simplified C++ sketch of this pattern follows below)
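A rough C++ sketch of the multiple-inheritance pattern (class members and signatures are simplified placeholders, not the real LAMMPS headers):

// ThrData: per-thread accumulators, one instance per thread
class ThrData {
public:
    double **f;                       // this thread's private force array
};

// PairLJ: stands in for the serial pair style
class PairLJ {
public:
    virtual ~PairLJ() {}
    virtual void compute(int eflag, int vflag) { /* serial i,j loop (omitted) */ }
};

// ThrOMP: thread-safe helpers shared by all /omp styles
class ThrOMP {
public:
    void reduce_thr(double **f_global, ThrData *thr) { /* sum thr->f into f_global (omitted) */ }
};

// PairLJOMP: threaded variant, derived from both base classes
class PairLJOMP : public PairLJ, public ThrOMP {
public:
    void compute(int eflag, int vflag) override {
        // open an OpenMP parallel region; each thread obtains its ThrData
        // instance (from FixOMP in the real code), fills its private force
        // array, and reduce_thr() finally folds the per-thread forces together
    }
};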
6
Naive OpenMP LJ Kernel
#if defined(_OPENMP)
#pragma omp parallel for default(shared) \
    private(i) reduction(+:epot)
#endif
    for(i=0; i < (sys->natoms)-1; ++i) {
        double rx1=sys->rx[i];
        double ry1=sys->ry[i];
        double rz1=sys->rz[i];
        [...]
#if defined(_OPENMP)
#pragma omp critical
#endif
                {
                    sys->fx[i] += rx*ffac;
                    sys->fy[i] += ry*ffac;
                    sys->fz[i] += rz*ffac;
                    sys->fx[j] -= rx*ffac;
                    sys->fy[j] -= ry*ffac;
                    sys->fz[j] -= rz*ffac;
                }

● Each thread works on different values of “i”
● Race condition: “i” is unique to each thread, but “j” is not, and some “j” may be the “i” of another thread => multiple threads update the same location
● The “critical” directive lets only one thread at a time execute the protected block
● Timings (108 atoms): serial: 4.0s, 1 thread: 4.2s, 2 threads: 7.1s, 4 threads: 7.7s, 8 threads: 8.6s
7
Alternatives to “omp critical”
● Use omp atomic to protect each force addition
=> requires hardware support (modern x86)
1Thr: 6.3s, 2Thr: 5.0s, 4Thr: 4.4s, 8Thr: 4.2s
=> faster than omp critical for multiple threads,
but still slower than the serial code (4.0s)
● Don't use Newton's 3rd law
=> no race condition
1Thr: 6.5s, 2Thr: 3.7s, 4Thr: 2.3s, 8Thr: 2.1s
=> better scaling, but 2 threads ~= serial speed
=> this is what is done on GPUs (many threads)
(sketches of both alternatives follow below)
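For concreteness, hedged sketches of both alternatives, written against the same illustrative sys-> data layout as the kernel on the previous slide (not taken verbatim from the tutorial code):

/* Alternative 1: "omp atomic" protects each individual update
   (repeat likewise for the y and z components) */
#if defined(_OPENMP)
#pragma omp atomic
#endif
        sys->fx[i] += rx*ffac;
#if defined(_OPENMP)
#pragma omp atomic
#endif
        sys->fx[j] -= rx*ffac;

/* Alternative 2: no Newton's 3rd law. Loop over ALL j for every i and
   accumulate only the force on i, so no other thread ever writes to
   atom i; each pair is now computed twice. */
#if defined(_OPENMP)
#pragma omp parallel for default(shared) private(i,j)
#endif
    for (i=0; i < sys->natoms; ++i) {
        double fxi=0.0, fyi=0.0, fzi=0.0;
        for (j=0; j < sys->natoms; ++j) {
            if (i == j) continue;
            /* [...] compute rx, ry, rz and ffac for the pair (i,j) as before */
            fxi += rx*ffac; fyi += ry*ffac; fzi += rz*ffac;
        }
        sys->fx[i] += fxi; sys->fy[i] += fyi; sys->fz[i] += fzi;
    }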
8
“MPI-like” Approach with OpenMP
#if defined(_OPENMP)
#pragma omp parallel reduction(+:epot)
#endif
    {  double *fx, *fy, *fz;
#if defined(_OPENMP)
        int tid=omp_get_thread_num();
#else
        int tid=0;
#endif
        fx=sys->fx + (tid*sys->natoms); azzero(fx,sys->natoms);
        fy=sys->fy + (tid*sys->natoms); azzero(fy,sys->natoms);
        fz=sys->fz + (tid*sys->natoms); azzero(fz,sys->natoms);
        for(int i=0; i < (sys->natoms -1); i += sys->nthreads) {
            int ii = i + tid;
            if (ii >= (sys->natoms -1)) break;
            rx1=sys->rx[ii];
            ry1=sys->ry[ii];
            rz1=sys->rz[ii];
            [...]

● The thread number is used like an MPI rank
● sys->fx holds storage for one full fx array per thread => the race condition is avoided
9
MPI-like Approach with OpenMP (2)
● We need to write our own reduction:
#if defined(_OPENMP)
#pragma omp barrier
#endif
    i = 1 + (sys->natoms / sys->nthreads);
    fromidx = tid * i;
    toidx = fromidx + i;
    if (toidx > sys->natoms) toidx = sys->natoms;

    for (i=1; i < sys->nthreads; ++i) {
        int offs = i*sys->natoms;
        for (int j=fromidx; j < toidx; ++j) {
            sys->fx[j] += sys->fx[offs+j];
            sys->fy[j] += sys->fy[offs+j];
            sys->fz[j] += sys->fz[offs+j];
        }
    }

● The barrier makes certain that all threads are done computing forces
● The threads are also used to parallelize the reduction: each thread sums up a different chunk [fromidx,toidx) of the force arrays
10
OpenMP Timings Comparison
● omp critical timings
1Thr: 4.2s, 2Thr: 7.1s, 4Thr: 7.7s, 8Thr: 8.6s
● omp atomic timings
1Thr: 6.3s, 2Thr: 5.0s, 4Thr: 4.4s, 8Thr: 4.2s
● omp parallel region (MPI-like) timings
1Thr: 4.0s, 2Thr: 2.5s, 4Thr: 2.2s, 8Thr: 2.5s

● no Newton's 3rd law timings
1Thr: 6.5s, 2Thr: 3.7s, 4Thr: 2.3s, 8Thr: 2.1s
=> the omp parallel (MPI-like) variant is best for few threads; the no-Newton's-3rd-law variant is better for more threads
=> the cost of the force reduction grows with the number of threads
11
[Figure: scaling benchmark on 2x Intel Xeon 2.66 GHz (Harpertown) nodes with DDR InfiniBand; CHARMM force field (lj/charmm/coul/long + PPPM), 32,000 atoms. Top panel: time for 1000 MD steps (s) vs. number of nodes (1-32). Bottom panel: parallel efficiency (0-80%) vs. number of nodes for 8 MPI/node, 4 MPI + 2 OpenMP/node, 4 MPI/node, 2 MPI + 4 OpenMP/node, and 1 MPI + 8 OpenMP/node.]
12
13
Running Big
● Vesicle fusion study: impact of the lipid ratio in a binary mixture
● cg/cmm/coul/long
● Experimental size => 4M CG-beads for 1 vesicle and solvent
● 30,000,000 CPU-hour INCITE project

14
Strong Scaling (Cray XT5)
1 Vesicle CG System / 3,862,854 CG-Beads
[Figure: time per MD step (s) vs. number of nodes (27 to 1878), comparing 12 MPI / 1 OpenMP, 6 MPI / 2 OpenMP, 4 MPI / 3 OpenMP, and 2 MPI / 6 OpenMP per node.]
15
Strong Scaling (2) (Cray XT5)
8 Vesicle CG System / 30,902,832 CG-Beads
[Figure: time per MD step (s) vs. number of nodes (256 to 6231), comparing 12 MPI / 1 OpenMP, 6 MPI / 2 OpenMP, 4 MPI / 3 OpenMP, and 2 MPI / 6 OpenMP per node.]
16
The Curse of the k-Space (1)
Rhodopsin Benchmark, 860k Atoms, 64 Nodes, Cray XT5
[Figure: stacked timing breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other) in seconds vs. # PE (128, 256, 384, 768) for 1, 2, 4, and 6 MPI tasks per node combined with OpenMP, compared to 12 MPI tasks per node.]
17
The Curse of the k-Space (2)
Rhodopsin Benchmark, 860k Atoms, 128 Nodes, Cray XT5
[Figure: stacked timing breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other) in seconds vs. # PE (256, 512, 768, 1536) for 1, 2, 4, and 6 MPI tasks per node combined with OpenMP, compared to 12 MPI tasks per node.]
18
The Curse of the k-Space (3)
Rhodopsin Benchmark, 860k Atoms, 512 Nodes, Cray XT5
[Figure: stacked timing breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other) in seconds vs. # PE (1024, 2048, 3072, 6144) for 1, 2, 4, and 6 MPI tasks per node combined with OpenMP, compared to 12 MPI tasks per node.]
19
Additional Improvements
● OpenMP threading added to the charge density accumulation and the force application in PPPM
● Force reduction is only done for the last /omp style
● Integration style verlet/split contributed by the Voth group, which runs k-space on a separate partition (compatible with the OpenMP version of PPPM)
● Added threading to selected fixes, e.g. the charge equilibration for the COMB many-body potential
● Added threading to the fix nve/sphere integrator
20
Current GPU Support in LAMMPS
● Multiple developments from different groups
● Converged to two efforts with two philosophies
● GPU package (minimalistic)
● pair styles, neighbor lists, and k-space (optional)
● download coordinates, retrieve forces
● run asynchronously to bonded (and k-space) computations
● USER-CUDA package (see next talk)
● replace all classes that touch atom data
● data transfer between host and GPU as needed
21
Special Features of “GPU” Package
● Can be compiled for CUDA or OpenCL thanks to the “Geryon” preprocessor macros
● Can attach multiple MPI tasks to one GPU for improved GPU utilization (up to 4x over-subscription on “Fermi”, up to 15x on “Kepler”)
● Uses a “fix” to manage GPUs and compute kernel dispatch; “styles” dispatch kernels asynchronously, and the “fix” then retrieves the forces after all other force computations are completed
● Tuned for good scaling with fewer atoms per GPU
22
1x GPU Performance in LAMMPS
Bulk Water, LJ + long-range electrostatics (PPPM on GPU); 5,376 waters / 5000 steps and 21,504 waters / 1000 steps
[Figure: time in seconds for 8 CPU cores (2.8 GHz) compared to a single GeForce GTX 480, Tesla C2050, and ATI FirePro V8800 in single (sp), mixed (mp), and double (dp) precision.]
23
Multiple GPUs per Node
Bulk Water, LJ + long-range electrostatics (PPPM on GPU); 5,376 waters / 5000 steps and 21,504 waters / 1000 steps
[Figure: time in seconds for 8 CPU cores (2.8 GHz) compared to 1, 2, and 4 Tesla C2050 and FirePro V8800 GPUs in mixed precision.]
24
Comments on GPU Acceleration
● Mixed precision (force computation in single, force accumulation in double precision) is a good compromise: little overhead, good accuracy on forces, stress/pressure less so (see the sketch below)
● GPU acceleration is larger for models that require more computation in the force kernel
● Acceleration drops with a lower number of atoms per GPU => limited strong scaling on “big iron”
● The amount of acceleration depends on both the host and the GPU
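As an illustration of the mixed-precision idea (a generic sketch, not the GPU package's actual kernels): the expensive per-pair arithmetic runs in single precision, while the per-atom force accumulator stays in double precision.

/* pair forces in float, accumulation on atom i in double */
void lj_force_mixed(int i, int nneigh, const int *neigh,
                    const float (*x)[3], double (*f)[3], float cutsq)
{
    double fx = 0.0, fy = 0.0, fz = 0.0;            /* double accumulators */
    for (int jj = 0; jj < nneigh; ++jj) {
        const int j = neigh[jj];
        const float dx = x[i][0] - x[j][0];
        const float dy = x[i][1] - x[j][1];
        const float dz = x[i][2] - x[j][2];
        const float rsq = dx*dx + dy*dy + dz*dz;
        if (rsq < cutsq) {
            const float r2i = 1.0f / rsq;
            const float r6i = r2i * r2i * r2i;
            const float fpair = 48.0f * r6i * (r6i - 0.5f) * r2i;   /* single */
            fx += (double)(dx * fpair);             /* accumulate in double */
            fy += (double)(dy * fpair);
            fz += (double)(dz * fpair);
        }
    }
    f[i][0] += fx;  f[i][1] += fy;  f[i][2] += fz;
}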

25
Installation of USER-OMP and GPU
● USER-OMP package:
● make yes-user-omp to install the sources
● add -fopenmp (GNU) or -openmp (Intel) to the CC and LINK definitions in your makefile to enable OpenMP
● compilation without OpenMP => similar to the OPT package
● GPU package:
● compile the library in lib/gpu for CUDA or OpenCL
● make yes-gpu to install the style sources, which are wrappers for the GPU library
● tweak lib/gpu/Makefile.lammps.??? as needed
26
Using Accelerated Code
● All accelerated styles are optional and need to be activated in the input or from the command line
● Naming convention: lj/cut -> lj/cut/omp or lj/cut/gpu
● From the command line: -sf omp or -sf gpu
● Inside a script: suffix omp or suffix gpu, plus suffix on or suffix off
● Use the package omp or package gpu command to adjust acceleration settings and the selection of GPUs
● The -sf command line flag implies default package settings (see the example below)
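A hedged usage example of how these pieces fit together (the executable name, thread count, and input file are placeholders; check the documentation of your LAMMPS version for the exact package options):

# command line: activate the /omp variants of all supported styles
env OMP_NUM_THREADS=4 mpirun -np 4 ./lmp_openmpi -sf omp -in in.melt

# or, equivalently, inside the input script:
package omp 4            # 4 OpenMP threads per MPI task
suffix omp               # use /omp styles where available
pair_style lj/cut 2.5    # transparently becomes lj/cut/omp
# suffix off / suffix on can bracket styles that should stay unaccelerated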
27
Conclusions and Outlook: OpenMP
● OpenMP+MPI is almost always a win, especially at large node counts (=> capability computing)
● USER-OMP also contains serial optimizations and is thus useful even without OpenMP compiled in
● Minimal changes to the LAMMPS core code
● USER-OMP is only a transitional implementation, since it is efficient only for a small number of threads
● A longer-term solution also needs to consider vectorization, and thus has to be more GPU-like and will benefit from a different data layout (see next talk)
28
