"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc - "Next
Section"_Section_howto.html :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

5. Accelerating LAMMPS performance :h3

This section describes various methods for improving LAMMPS
performance for different classes of problems running on different
kinds of machines.

5.1 "Measuring performance"_#acc_1
5.2 "General strategies"_#acc_2
5.3 "Packages with optimized styles"_#acc_3
5.4 "OPT package"_#acc_4
5.5 "USER-OMP package"_#acc_5
5.6 "GPU package"_#acc_6
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)

The "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS
web site gives performance results for the various accelerator
packages discussed in this section for several of the standard LAMMPS
benchmarks, as a function of problem size and number of compute nodes,
on different hardware platforms.

:line
:line

5.1 Measuring performance :h4,link(acc_1)

Before trying to make your simulation run faster, you should
understand how it currently performs and where the bottlenecks are.

The best way to do this is to run your system (actual number of
atoms) for a modest number of timesteps (say 100, or a few 100 at
most) on several different processor counts, including a single
processor if possible. Do this for an equilibrated version of your
system, so that the 100-step timings are representative of a much
longer run. There is typically no need to run for 1000s of timesteps
to get accurate timings; you can simply extrapolate from short runs.

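For example, a minimal sketch of such a scaling test, assuming a
(hypothetical) input script named in.myscript that ends with a short
"run 100" command, is:

lmp_machine -in in.myscript
mpirun -np 4 lmp_machine -in in.myscript
mpirun -np 16 lmp_machine -in in.myscript :pre
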
For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. "This
section"_Section_start.html#start_8 of the manual has an overview.

Running on one (or a few processors) should give a good estimate of
the serial performance and what portions of the timestep are taking
the most time. Running the same problem on a few different processor
counts should give an estimate of parallel scalability. I.e. if the
simulation runs 16x faster on 16 processors, it's 100% parallel
efficient; if it runs 8x faster on 16 processors, it's 50% efficient.

The most important data to look at in the timing info is the timing
breakdown and relative percentages. For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time. If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below. Comparing how the percentages change as
you increase the processor count gives you a sense of how different
operations within the timestep are scaling. Note that if you are
running with a Kspace solver, there is additional output on the
breakdown of the Kspace time. For PPPM, this includes the fraction
spent on FFTs, which can be communication intensive.

Other important details in the timing info are the histograms of atom
counts and neighbor counts. If these vary widely across processors,
you have a load-imbalance issue. This often results in inaccurate
relative timing data, because processors have to wait when
communication occurs for other processors to catch up. Thus the
reported times for "Communication" or "Other" may be higher than they
really are, due to load-imbalance. If this is an issue, you can
uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile
LAMMPS, to obtain synchronized timings.

:line

5.2 General strategies :h4,link(acc_2)

NOTE: this section 5.2 is still a work in progress

Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
bottlenecks in the current performance, so let the timing data you
generate be your guide. It is hard, if not impossible, to predict how
much difference these options will make, since it is a function of
problem size, number of processors used, and your machine. There is
no substitute for identifying performance bottlenecks, and trying out
various options.

rRESPA
2-FFT PPPM
Staggered PPPM
single vs double PPPM
partial charge PPPM
verlet/split
processor mapping via processors numa command
load-balancing: balance and fix balance (see the sketch below)
processor command for layout
OMP when lots of cores :ul

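As a sketch of the load-balancing item above (illustrative values
only; see the "balance"_balance.html and "fix balance"_fix_balance.html
doc pages for the authoritative syntax), an input script might
rebalance once before a run and then periodically during it:

balance 1.1 shift xyz 10 1.1
fix 2 all balance 1000 1.1 shift xyz 10 1.1 :pre
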
2-FFT PPPM, also called {analytic differentiation} or {ad} PPPM, uses
2 FFTs instead of the 4 FFTs used by the default {ik differentiation}
PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to
achieve the same accuracy as 4-FFT PPPM. For problems where the FFT
cost is the performance bottleneck (typically large problems running
on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.

Staggered PPPM performs calculations using two different meshes, one
shifted slightly with respect to the other. This can reduce force
aliasing errors and increase the accuracy of the method, but also
doubles the amount of work required. For high relative accuracy, using
staggered PPPM allows one to halve the mesh size in each dimension as
compared to regular PPPM, which can give around a 4x speedup in the
kspace time. However, for low relative accuracy, using staggered PPPM
gives little benefit and can be up to 2x slower in the kspace
time. For example, the rhodopsin benchmark was run on a single
processor, and results for kspace time vs. relative accuracy for the
different methods are shown in the figure below. For this system,
staggered PPPM (using ik differentiation) becomes useful for relative
accuracies of slightly greater than 1e-5 and above.

:c,image(JPG/rhodo_staggered.jpg)

IMPORTANT NOTE: Using staggered PPPM may not give the same increase in
accuracy of energy and pressure as it does in forces, so some caution
must be used if energy and/or pressure are quantities of interest,
such as when using a barostat.

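As a hedged illustration of how the 2-FFT variant is typically
selected (check the "kspace_style"_kspace_style.html and
"kspace_modify"_kspace_modify.html doc pages for the authoritative
syntax, including how staggered PPPM is enabled), an input script
might contain:

kspace_style pppm 1.0e-4 # default 4-FFT ik-differentiated PPPM
kspace_modify diff ad # switch to 2-FFT analytic-differentiation PPPM :pre
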
:line

5.3 Packages with optimized styles :h4,link(acc_3)

Accelerated versions of various "pair_style"_pair_style.html,
"fixes"_fix.html, "computes"_compute.html, and other commands have
been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.

All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such accelerator packages in LAMMPS, either as
standard or user packages:

"USER-CUDA"_#acc_7 : for NVIDIA GPUs
"GPU"_#acc_6 : for NVIDIA GPUs as well as OpenCL support
"USER-INTEL"_#acc_9 : for Intel CPUs and Intel Xeon Phi
"KOKKOS"_#acc_8 : for GPUs, Intel Xeon Phi, and OpenMP threading
"USER-OMP"_#acc_5 : for OpenMP threading
"OPT"_#acc_4 : generic CPU optimizations :tb(s=:)

Any accelerated style has the same name as the corresponding standard
style, except that a suffix is appended. Otherwise, the syntax for
the command that specifies the style is identical, their functionality
is the same, and the numerical results they produce should also be the
same, except for precision and round-off effects.

For example, all of these styles are variants of the basic
Lennard-Jones "pair_style lj/cut"_pair_lj.html:

"pair_style lj/cut/cuda"_pair_lj.html
"pair_style lj/cut/gpu"_pair_lj.html
"pair_style lj/cut/intel"_pair_lj.html
"pair_style lj/cut/kk"_pair_lj.html
"pair_style lj/cut/omp"_pair_lj.html
"pair_style lj/cut/opt"_pair_lj.html :ul

Assuming LAMMPS was built with the appropriate package, these styles
can be invoked by specifying them explicitly in your input script. Or
the "-suffix command-line switch"_Section_start.html#start_7 can be
used to automatically invoke the accelerated versions, without
changing the input script. Use of the "suffix"_suffix.html command
allows a suffix to be set explicitly and to be turned off and back on
at various points within an input script.

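For example, here is a sketch of how the "suffix"_suffix.html command
can toggle accelerated styles within one input script (the styles
shown are only illustrative):

suffix omp
pair_style lj/cut 2.5 # becomes pair_style lj/cut/omp
suffix off
bond_style harmonic # plain un-accelerated version
suffix on :pre
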
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.

Here is a brief summary of what the various packages provide. Details
are in individual sections below.

Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up on a GPU depends on a variety of factors, as discussed
below. :ulb,l

Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration. :l

Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below. :l

Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair style to be run in multi-threaded mode using OpenMP. This can
be useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processors, or when many MPI tasks would
overload the available bandwidth for communication. :l

Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU. :l,ule

The following sections explain:

what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package
how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance
restrictions :ul

The final section compares and contrasts the USER-CUDA, GPU, and
KOKKOS packages, since they all enable use of NVIDIA GPUs.

:line

5.4 OPT package :h4,link(acc_4)

The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

Include the package and build LAMMPS.

make yes-opt
make machine :pre

No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.

[Running with the OPT package:]

You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:

pair_style lj/cut/opt 2.5 :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.

lmp_machine -sf opt -in in.script
mpirun -np 4 lmp_machine -sf opt -in in.script :pre

[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.

[Guidelines for best performance:]

None. Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.

:line

5.5 USER-OMP package :h4,link(acc_5)

The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.

[Required hardware/software:]

Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

Include the package and build LAMMPS.

cd lammps/src
make yes-user-omp
make machine :pre

Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.

[Running with the USER-OMP package:]

There are 3 issues (a,b,c) to address:

(a) Specify how many threads per MPI task to use

Note that the product of MPI tasks * threads/task should not exceed
the physical number of cores, otherwise performance will suffer.

By default LAMMPS uses 1 thread per MPI task. If the environment
variable OMP_NUM_THREADS is set to a valid value, this value is used.
You can set this environment variable when you launch LAMMPS, e.g.

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre

or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.

Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes. The
2nd line above is for MPICH; the 3rd line with -x is for OpenMPI.
Check your MPI documentation for additional details.

You can also set the number of threads per MPI task via the "package
omp"_package.html command, which will override any OMP_NUM_THREADS
setting.

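For example, this line near the top of an input script requests 4
threads per MPI task (a minimal sketch; see the "package"_package.html
doc page for the full syntax and additional options):

package omp 4 :pre
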
(b) Enable the USER-OMP package

This can be done in one of two ways. Use a "package omp"_package.html
command near the top of your input script.

Or use the "-sf omp" "command-line switch"_Section_start.html#start_7,
which will automatically invoke the command "package omp
*"_package.html.

(c) Use OMP-accelerated styles

This can be done by explicitly adding an "omp" suffix to any supported
style in your input script:

pair_style lj/cut/omp 2.5
fix 1 all nve/omp :pre

Or you can run with the "-sf omp" "command-line
switch"_Section_start.html#start_7, which will automatically append
"omp" to styles that support it.

lmp_machine -sf omp -in in.script
mpirun -np 4 lmp_machine -sf omp -in in.script :pre

Using the "suffix omp" command in your input script does the same
thing.

[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread per MPI
task, versus running standard LAMMPS with its standard
(un-accelerated) styles (in serial or all-MPI parallelization with 1
task/core). This is because many of the USER-OMP styles contain
similar optimizations to those used in the OPT package, as described
above.

With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.

A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following
circumstances:

Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the KSpace
calculation (see the "kspace_style"_kspace_style.html command) becomes
the performance-limiting factor. Using multi-threading allows fewer
MPI tasks to be invoked and can speed-up the long-range solver, while
increasing overall performance by parallelizing the pairwise and
bonded calculations via OpenMP. Likewise, additional speedup can
sometimes be achieved by increasing the length of the Coulombic cutoff
and thus reducing the work done by the long-range solver. Using the
"run_style verlet/split"_run_style.html command, which is compatible
with the USER-OMP package, is an alternative way to reduce the number
of MPI tasks assigned to the KSpace calculation (see the sketch after
this list). :l,ule

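As a sketch of the "run_style verlet/split" point above (illustrative
processor counts; see the "run_style"_run_style.html doc page and the
"-partition command-line switch"_Section_start.html#start_7 for the
authoritative details), a run could be split into a 16-task partition
for the pairwise/bonded work and a 2-task partition for KSpace:

mpirun -np 18 lmp_machine -partition 16 2 -sf omp -in in.script :pre

with the input script containing:

run_style verlet/split :pre
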
Other performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

It is usually most efficient to restrict threading to a single
socket, i.e. use one or more MPI tasks per socket. :l

Several current MPI implementations by default use a processor affinity
setting that restricts each MPI task to a single CPU core. Using
multi-threading in this mode will force the threads to share that core
and thus is likely to be counterproductive. Instead, binding MPI
tasks to a (multi-core) socket should solve this issue, as sketched
below. :l,ule

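For example, with OpenMPI the socket binding and thread count might be
set as follows (a sketch only; the exact flags depend on your MPI
implementation and version, so check its documentation):

mpirun -np 4 --bind-to socket -x OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script :pre
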
[Restrictions:]

None.

:line

5.6 GPU package :h4,link(acc_6)

The GPU package was developed by Mike Brown at ORNL and his
collaborators, particularly Trung Nguyen (ORNL). It provides GPU
versions of many pair styles, including the 3-body Stillinger-Weber
pair style, and of "kspace_style pppm"_kspace_style.html for
long-range Coulombics. It has the following general features:

The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled to many cores of one or more
multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l

Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU(s) and GPU every timestep. :l

Neighbor lists can be constructed on the CPU or on the GPU. :l

The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU. :l

Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU. :l

It allows for GPU computations to be performed in single or double
precision, or in mixed-mode precision, where pairwise forces are
computed in single precision, but accumulated into double-precision
force vectors. :l

LAMMPS-specific code is in the GPU package. It makes calls to a
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware. :l,ule

[Required hardware/software:]

To use this package, you currently need to have an NVIDIA GPU and
install the NVIDIA Cuda software on your system:

Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
Go to http://www.nvidia.com/object/cuda_get.html
Install a driver and toolkit appropriate for your system (SDK is not necessary)
Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul

[Building LAMMPS with the GPU package:]

This requires two steps (a,b): build the GPU library, then build
LAMMPS.

(a) Build the GPU library

The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in
lib/gpu) appropriate for your system. You should pay special
attention to 3 settings in this makefile.

CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
CUDA_ARCH = needs to be appropriate to your GPUs
CUDA_PREC = precision (double, mixed, single) you desire :ul

See lib/gpu/Makefile.linux.double for examples of the ARCH settings
for different GPU choices, e.g. Fermi vs Kepler. It also lists the
possible precision settings:

CUDA_PREC = -D_SINGLE_SINGLE # Single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE # Double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE # Accumulation of forces, etc, in double :pre

The last setting is the mixed mode referred to above. Note that your
GPU must support double precision to use either the 2nd or 3rd of
these settings.

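For example, a Kepler-generation card would typically use a setting
along these lines (an illustrative sketch; check your chosen
Makefile.machine and the NVIDIA documentation for the value that
matches your GPU):

CUDA_ARCH = -arch=sm_35 :pre
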
To build the library, type:

make -f Makefile.machine :pre

If successful, it will produce the files libgpu.a and Makefile.lammps.

The latter file has 3 settings that need to be appropriate for the
paths and settings for the CUDA system software on your machine.
Makefile.lammps is a copy of the file specified by the EXTRAMAKE
setting in Makefile.machine. You can change EXTRAMAKE or create your
own Makefile.lammps.machine if needed.

Note that to change the precision of the GPU library, you need to
re-build the entire library. Do a "clean" first, e.g. "make -f
Makefile.linux clean", followed by the make command above.

(b) Build LAMMPS

cd lammps/src
make yes-gpu
make machine :pre

Note that if you change the GPU library precision (discussed above),
you also need to re-install the GPU package and re-build LAMMPS, so
that all affected files are re-compiled and linked to the new GPU
library.

[Running with the GPU package:]

The examples/gpu and bench/GPU directories have scripts that can be
run with the GPU package, as well as detailed instructions on how to
run them.

To run with the GPU package, there are 3 basic issues (a,b,c) to
address:

(a) Use one or more MPI tasks per GPU

The total number of MPI tasks used by LAMMPS (one or multiple per
compute node) is set in the usual manner via the mpirun or mpiexec
commands, and is independent of the GPU package.

When using the GPU package, you cannot assign more than one physical
GPU to a single MPI task. However, multiple MPI tasks can share the
same GPU, and in many cases it will be more efficient to run this way.

The default is to have all MPI tasks on a compute node use a single
GPU. To use multiple GPUs per node, be sure to create one or more MPI
tasks per GPU, and use the first/last settings in the "package
gpu"_package.html command to include all the GPU IDs on the node.
E.g. first = 0, last = 1, for 2 GPUs. On a node with 8 CPU cores
and 2 GPUs, this would specify that each GPU is shared by 4 MPI tasks.

(b) Enable the GPU package

This can be done in one of two ways. Use a "package gpu"_package.html
command near the top of your input script.

Or use the "-sf gpu" "command-line switch"_Section_start.html#start_7,
which will automatically invoke the command "package gpu force/neigh 0
0 1"_package.html. Note that this specifies use of a single GPU (per
node), so you must specify the package command in your input script
explicitly if you want to use multiple GPUs per node.

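For example, following the first/last settings described above, a
sketch of a "package gpu" command that shares 2 GPUs per node among
the MPI tasks on that node is:

package gpu force/neigh 0 1 1 :pre
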
(c) Use GPU-accelerated styles

This can be done by explicitly adding a "gpu" suffix to any supported
style in your input script:

pair_style lj/cut/gpu 2.5 :pre

Or you can run with the "-sf gpu" "command-line
switch"_Section_start.html#start_7, which will automatically append
"gpu" to styles that support it.

lmp_machine -sf gpu -in in.script
mpirun -np 4 lmp_machine -sf gpu -in in.script :pre

Using the "suffix gpu" command in your input script does the same
thing.

IMPORTANT NOTE: The input script must also use the
"newton"_newton.html command with a pairwise setting of {off},
since {on} is the default.

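A minimal sketch of such a setting in the input script is:

newton off :pre

This turns off both the pairwise and bonded settings; see the
"newton"_newton.html doc page for the two-argument form that changes
only the pairwise flag.
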
[Speed-ups to expect:]

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the GPU package on various
hardware, including the Titan HPC platform at ORNL.

You should also experiment with how many MPI tasks per GPU to use to
give the best performance for your problem and machine. This is also
a function of the problem size and the pair style being used.
Likewise, you should experiment with the precision setting for the GPU
library to see if single or mixed precision will give accurate
results, since they will typically be faster.

[Guidelines for best performance:]

Using multiple MPI tasks per GPU will often give the best performance,
as allowed by most multi-core CPU/GPU configurations. :ulb,l

If the number of particles per MPI task is small (e.g. 100s of
particles), it can be more efficient to run with fewer MPI tasks per
GPU, even if you do not use all the cores on the compute node. :l

The "package gpu"_package.html command has several options for tuning
performance. Neighbor lists can be built on the GPU or CPU. Force
calculations can be dynamically balanced across the CPU cores and
GPUs. GPU-specific settings can be made which can be optimized
for different hardware. See the "package"_package.html command
doc page for details. :l

As described by the "package gpu"_package.html command, GPU
accelerated pair styles can perform computations asynchronously with
CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with "bond"_bond_style.html,
"angle"_angle_style.html, "dihedral"_dihedral_style.html,
"improper"_improper_style.html, and "long-range"_kspace_style.html
calculations will not be included in the "Pair" time. :l

When the {mode} setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time. An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc) is output only with the LAMMPS
screen output (not in the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations. :l

The output section "GPU Time Info (average)" reports "Max Mem / Proc".
This is the maximum memory used at one time on the GPU for data
storage by a single MPI process. :l,ule

[Restrictions:]

None.

:line

5.7 USER-CUDA package :h4,link(acc_7)

The USER-CUDA package was developed by Christian Trott (Sandia) while
at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions
of many pair styles, many fixes, a few computes, and of long-range
Coulombics via the PPPM command. It has the following general
features:

The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU. :ulb,l

The speed-up advantage of this approach is typically better when the
number of atoms per GPU is large. :l

Data will stay on the GPU until a timestep where a non-USER-CUDA fix
or compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently. :l

Neighbor lists are constructed on the GPU. :l

The package only supports use of a single MPI task, running on a
single CPU (core), assigned to each GPU. :l,ule

[Required hardware/software:]

To use this package, you need to have an NVIDIA GPU and
install the NVIDIA Cuda software on your system:

Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
help you to find out the Compute Capability of your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
corresponding GPU drivers. The Nvidia Cuda SDK is not required, but
we recommend it also be installed. You can then make sure its sample
projects can be compiled without problems.

[Building LAMMPS with the USER-CUDA package:]

This requires two steps (a,b): build the USER-CUDA library, then build
LAMMPS.

(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda. If your {CUDA} toolkit
is not installed in the default system directory {/usr/local/cuda},
edit the file {lib/cuda/Makefile.common} accordingly.

To set options for the library build, type "make OPTIONS", where
{OPTIONS} are one or more of the following. The settings will be
written to the {lib/cuda/Makefile.defaults} and used when
the library is built.

{precision=N} to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
{arch=M} to set GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
{prec_timer=0/1} to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  this is usually only useful for Mac machines
{dbg=0/1} to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
{cufft=1} for use of the CUDA FFT library
  0 = no CUFFT support (default)
  in the future other CUDA-enabled FFT libraries might be supported :pre

To build the library, simply type:

make :pre

If successful, it will produce the files libcuda.a and Makefile.lammps.

Note that if you change any of the options (like precision), you need
to re-build the entire library. Do a "make clean" first, followed by
"make".

(b) Build LAMMPS

cd lammps/src
make yes-user-cuda
make machine :pre

Note that if you change the USER-CUDA library precision (discussed
above), you also need to re-install the USER-CUDA package and re-build
LAMMPS, so that all affected files are re-compiled and linked to the
new USER-CUDA library.

[Running with the USER-CUDA package:]

The bench/CUDA directory has scripts that can be run with the
USER-CUDA package, as well as detailed instructions on how to run
them.

To run with the USER-CUDA package, there are 3 basic issues (a,b,c) to
address:

(a) Use one MPI task per GPU

This is a requirement of the USER-CUDA package, i.e. you cannot
use multiple MPI tasks per physical GPU. So if you are running
on nodes with 1 or 2 GPUs, use the mpirun or mpiexec command
to specify 1 or 2 MPI tasks per node.

If the nodes have more than 1 GPU, you must use the "package
cuda"_package.html command near the top of your input script to
specify that more than 1 GPU will be used (the default = 1).

(b) Enable the USER-CUDA package

The "-c on" or "-cuda on" "command-line
switch"_Section_start.html#start_7 must be used when launching LAMMPS.

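For example, a typical launch combining this switch with the suffix
switch described in part (c) looks like this (a sketch, using the same
placeholder script names as the other examples in this section):

mpirun -np 2 lmp_machine -c on -sf cuda -in in.script :pre
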
(c) Use USER-CUDA-accelerated styles

This can be done by explicitly adding a "cuda" suffix to any supported
style in your input script:

pair_style lj/cut/cuda 2.5 :pre

Or you can run with the "-sf cuda" "command-line
switch"_Section_start.html#start_7, which will automatically append
"cuda" to styles that support it.

lmp_machine -sf cuda -in in.script
mpirun -np 4 lmp_machine -sf cuda -in in.script :pre

Using the "suffix cuda" command in your input script does the same
thing.

[Speed-ups to expect:]

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the USER-CUDA package on different
hardware.

[Guidelines for best performance:]

The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of 1000s. :ulb,l

As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation will run. :l,ule

[Restrictions:]

None.

:line

5.8 KOKKOS package :h4,link(acc_8)

The KOKKOS package was developed primarily by Christian Trott
(Sandia) with contributions of various styles by others, including
Sikandar Mashayak (UIUC). The underlying Kokkos library was written
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in lib/kokkos.

The Kokkos library is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a
templated C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as a GPU, Intel Phi, or many-core
chip.

The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of basic data structures like 2d and
3d arrays and allow the transparent utilization of special hardware
load and store operations. Such data structures are used in LAMMPS to
store atom coordinates or forces or neighbor lists. The layout is
chosen to optimize performance on different platforms. Again this
functionality is hidden from the developer, and does not affect how
the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine (one on a desktop machine or 1000s on a supercomputer).

All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.

Kokkos provides support for two different modes of execution per MPI
task. This means that computational tasks (pairwise interactions,
neighbor list builds, time integration, etc) can be parallelized for
one or the other of the two modes. The first mode is called the
"host" and is one or more threads running on one or more physical CPUs
(within the node). Currently, both multi-core CPUs and an Intel Phi
processor (running in native mode) are supported. The second mode is
called the "device" and is an accelerator chip of some kind.
Currently only an NVIDIA GPU is supported. If your compute node does
not have a GPU, then there is only one mode of execution, i.e. the
host and device are the same.

[Required hardware/software:]

The KOKKOS package can be used to build and run
LAMMPS on the following kinds of hardware configurations:

CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
Phi: on one or more Intel Phi coprocessors (per node)
GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul

Intel Xeon Phi coprocessors are supported in "native" mode only.

Only NVIDIA GPUs are currently supported.

IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
you must have Kepler generation GPUs (or later). The Kokkos library
exploits texture cache options not supported by Tesla generation GPUs
(or older).

To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
installed on your system. See the discussion above for the USER-CUDA
and GPU packages for details of how to check and do this.

[Building LAMMPS with the KOKKOS package:]

Unlike other acceleration packages discussed in this section, the
Kokkos library in lib/kokkos does not have to be pre-built before
building LAMMPS itself. Instead, options for the Kokkos library are
specified at compile time, when LAMMPS itself is built. This can be
done in one of two ways, as discussed below.

Here are examples of how to build LAMMPS for the different compute-node
configurations listed above.

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make g++ OMP=yes :pre

Intel Xeon Phi:

cd lammps/src
make yes-kokkos
make g++ OMP=yes MIC=yes :pre

CPUs and GPUs:

cd lammps/src
make yes-kokkos
make cuda CUDA=yes :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
make command line which requires a GNU-compatible make command. Try
|
|
|
|
"gmake" if your system's standard make complains.
|
|
|
|
|
|
|
|
IMPORTANT NOTE: If you build using make line variables and re-build
|
|
|
|
LAMMPS twice with different KOKKOS options and the *same* target,
|
|
|
|
e.g. g++ in the first two examples above, then you *must* perform a
|
|
|
|
"make clean-all" or "make clean-machine" before each build. This is
|
|
|
|
to force all the KOKKOS-dependent files to be re-compiled with the new
|
|
|
|
options.
|
|
|
|
|
|
|
|
You can also hardwire these variables in the specified machine
|
|
|
|
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
|
|
|
|
with a line like:
|
|
|
|
|
|
|
|
MIC = yes :pre
|
|
|
|
|
|
|
|
Note that if you build LAMMPS multiple times in this manner, using
|
|
|
|
different KOKKOS options (defined in different machine makefiles), you
|
|
|
|
do not have to worry about doing a "clean" in between. This is
|
|
|
|
because the targets will be different.

IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different
machine makefile, in this case src/MAKE/Makefile.cuda, which is
included in the LAMMPS distribution. To build the KOKKOS package for
a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must
have a CCFLAGS -arch setting that is appropriate for your NVIDIA
hardware and installed software. Typical values for -arch are given
in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well
as other settings that must be included in the machine makefile, if
you create your own.

There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in the machine makefile in the src/MAKE directory. See "Section
2.3.4"_Section_start.html#start_3_4 of the manual for details.

IMPORTANT NOTE: Currently, there are no precision options with the
KOKKOS package. All compilation and computation is performed in
double precision.

[Running with the KOKKOS package:]

The examples/kokkos and bench/KOKKOS directories have scripts that can
be run with the KOKKOS package, as well as detailed instructions on
how to run them.

There are 3 issues (a,b,c) to address:

(a) Launching LAMMPS in different KOKKOS modes

Here are examples of how to run LAMMPS for the different compute-node
configurations listed above.

Note that the -np setting for the mpirun command in these examples is
for runs on a single node. To scale these examples up to run on a
system with N compute nodes, simply multiply the -np setting by N.

CPU-only, dual hex-core CPUs:

mpirun -np 12 lmp_g++ -in in.lj                    # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj       # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj   # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj    # two MPI tasks, 6 threads/task :pre

Intel Phi with 61 cores (240 total usable cores, with 4x hardware threading):

mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj  # 12*20 = 240
mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj :pre

Dual hex-core CPUs and a single GPU:

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj       # one MPI task, 6 threads on CPU :pre

Dual 8-core CPUs and 2 GPUs:

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # two MPI tasks, 8 threads per CPU :pre

(b) Enable the KOKKOS package

As illustrated above, the "-k on" or "-kokkos on" "command-line
switch"_Section_start.html#start_7 must be used when launching LAMMPS.

As documented "here"_Section_start.html#start_7, the command-line
switch allows for several options. Commonly used ones, as illustrated
above, are:

-k on t Nt : specifies how many threads per MPI task to use within a
compute node. For good performance, the product of MPI tasks *
threads/task should not exceed the number of physical cores on a CPU
or Intel Phi (including hardware threading, e.g. 240). :ulb,l

-k on g Ng : specifies how many GPUs per compute node are available.
The default is 1, so this should be specified if you have 2 or more
GPUs per compute node. :l,ule

(c) Use KOKKOS-accelerated styles

This can be done by explicitly adding a "kk" suffix to any supported
style in your input script:

pair_style lj/cut/kk 2.5 :pre

Or you can run with the "-sf kk" "command-line
switch"_Section_start.html#start_7, which will automatically append
"kk" to styles that support it.

lmp_machine -sf kk -in in.script
mpirun -np 4 lmp_machine -sf kk -in in.script :pre

Using the "suffix kk" command in your input script does the same
thing.
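
For example, a minimal input-script fragment (the pair style and its
numeric argument here are just placeholders) could switch on the
KOKKOS variants of all subsequent supported styles with:

suffix kk
pair_style lj/cut 2.5 :pre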

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However, the difference between all 3 is small
(less than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l

When running on GPUs, KOKKOS currently out-performs the
USER-CUDA and GPU packages. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l,ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command.
See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes, for nodes with dual
hex-core CPUs and no GPU, are shown above.

When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, Intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes
option, as discussed in "Section 2.3.4"_Section_start.html#start_3_4
of the manual.
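
For example, a combined launch line for the dual hex-core CPU case
above (a sketch, assuming a bash shell and the lmp_g++ executable from
the build examples) might look like:

export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj :pre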

[Running on GPUs:]

Ensure the -arch setting in the machine makefile you are using,
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
(see "this section"_Section_start.html#start_3_4 of the manual for
details).

The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.

Use the "-kokkos command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.

Examples of mpirun commands that follow these rules, for nodes with
dual hex-core CPUs and one or two GPUs, are shown above.

When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.
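
If your script prints thermodynamic output frequently, one simple way
to limit such copies (an illustration, not a requirement) is to reduce
that frequency in the input script:

thermo 100    # print thermo output only every 100 steps :pre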

You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package. We plan to support this in the future, similar to the
GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and
MIC=yes. The latter ensures code is correctly compiled for the Intel
Phi. The OMP setting means OpenMP will be used for parallelization on
the Phi, which is currently the best option within Kokkos. In the
future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.
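
As one more illustration of these two rules for a 61-core Phi
(assuming the lmp_g++ executable from the build examples above):

mpirun -np 4 lmp_g++ -k on t 60 -sf kk -in in.lj   # 4*60 = 240, and 60 is a multiple of 4 :pre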

Examples of mpirun commands that follow these rules, for Intel Phi
nodes with 61 cores, are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute
node should equal the number of GPUs per compute node. In the future
Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models. Specifically, Kokkos requires
extensive C++ support from the kernel language. This is expected to
change in the future.

:line

5.9 USER-INTEL package :h4,link(acc_9)

The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)
Xeon Phi(TM) coprocessors. Additionally, it supports running
simulations in single, mixed, or double precision with vectorization,
even if a coprocessor is not present, i.e. on an Intel(R) CPU. The
same C++ code is used for both cases. When offloading to a
coprocessor, the routine is run twice, once with an offload flag.

The USER-INTEL package can be used in tandem with the USER-OMP
package. This is useful when offloading pair style computations to
coprocessors, so that other styles not supported by the USER-INTEL
package, e.g. bond, angle, dihedral, improper, and long-range
electrostatics, can be run simultaneously in threaded mode on CPU
cores. Since fewer MPI tasks than CPU cores will typically be invoked
when running with coprocessors, this enables the extra cores to be
utilized for useful computation.

If LAMMPS is built with both the USER-INTEL and USER-OMP packages
installed, this mode of operation is made easier to use, because the
"-suffix intel" "command-line switch"_Section_start.html#start_7 or
the "suffix intel"_suffix.html command will both set a second-choice
suffix to "omp" so that styles from the USER-OMP package will be used
if available, after first testing if a style from the USER-INTEL
package is available.

[Required hardware/software:]

To use the offload option, you must have one or more Intel(R) Xeon
Phi(TM) coprocessors.

Optimizations for vectorization have only been tested with the
Intel(R) compiler. Use of other compilers may not result in
vectorization, or may give poor performance.

Use of an Intel C++ compiler is recommended, but not required. The
compiler must support the OpenMP interface.

[Building LAMMPS with the USER-INTEL package:]

Include the package and build LAMMPS:

cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine :pre

If the USER-OMP package is also installed, you can use styles from
both packages, as described below.

The low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables, which is
{-openmp} for Intel compilers. You also need to add
-DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.

If you are compiling on the same architecture that will be used for
the runs, adding the flag {-xHost} to CCFLAGS will enable
vectorization with the Intel(R) compiler.

In order to build with support for an Intel(R) coprocessor, the flag
{-offload} should be added to the LINKFLAGS line and the flag
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
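
Put together, the relevant makefile lines might look roughly like the
following sketch (the compiler wrapper name and optimization flags are
assumptions; compare with the bundled makefiles discussed next):

CC =        mpiicpc
CCFLAGS =   -O3 -openmp -restrict -xHost -DLAMMPS_MEMALIGN=64 -DLMP_INTEL_OFFLOAD
LINK =      mpiicpc
LINKFLAGS = -O3 -openmp -offload :pre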

Note that the machine makefiles Makefile.intel and
Makefile.intel_offload are included in the src/MAKE directory with
options that perform well with the Intel(R) compiler. The latter file
has support for offload to coprocessors; the former does not.

If using an Intel compiler, it is recommended that Intel(R) Compiler
2013 SP1 update 1 be used. Newer versions have some performance
issues that are being addressed. If using Intel(R) MPI, version 5 or
higher is recommended.

[Running with the USER-INTEL package:]

The examples/intel directory has scripts that can be run with the
USER-INTEL package, as well as detailed instructions on how to run
them.

Note that the total number of MPI tasks used by LAMMPS (one or
multiple per compute node) is set in the usual manner via the mpirun
or mpiexec commands, and is independent of the USER-INTEL package.

To run with the USER-INTEL package, there are 3 basic issues (a,b,c)
to address:

(a) Specify how many threads per MPI task to use on the CPU.

Whether using the USER-INTEL package to offload computations to
Intel(R) Xeon Phi(TM) coprocessors or not, work performed on the CPU
can be multi-threaded via the USER-OMP package, assuming the USER-OMP
package was also installed when LAMMPS was built.

In this case, the instructions above for the USER-OMP package, in its
"Running with the USER-OMP package" sub-section, apply here as well.

You can specify the number of threads per MPI task via the
OMP_NUM_THREADS environment variable or the "package omp"_package.html
command. The product of MPI tasks * threads/task should not exceed
the physical number of cores on the CPU (per node), otherwise
performance will suffer.
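
For example, on a node with 16 physical CPU cores (a hypothetical
count), one reasonable launch would be:

export OMP_NUM_THREADS=4
mpirun -np 4 lmp_machine -sf intel -in in.script :pre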

Note that the threads per MPI task setting is completely independent
of the number of threads used on the coprocessor. Only the "package
intel"_package.html command can be used to control thread counts on
the coprocessor.

(b) Enable the USER-INTEL package

This can be done in one of two ways. Use a "package intel"_package.html
command near the top of your input script.

Or use the "-sf intel" "command-line
switch"_Section_start.html#start_7, which will automatically invoke
the command "package intel * mixed balance -1 offload_cards 1
offload_tpc 4 offload_threads 240". Note that this specifies mixed
precision and use of a single Xeon Phi(TM) coprocessor (per node), so
you must specify the package command in your input script explicitly
if you want a different precision or to use multiple Phi coprocessors
per node. Also note that the balance and offload keywords are ignored
if you did not build LAMMPS with offload support for a coprocessor, as
described above.
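
For example, a script that wants double precision and two coprocessors
per node might instead begin with an explicit command along these
lines (the keyword values are placeholders; see the "package
intel"_package.html doc page for the authoritative syntax):

package intel * double balance -1 offload_cards 2 :pre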

(c) Use USER-INTEL-accelerated styles

This can be done by explicitly adding an "intel" suffix to any
supported style in your input script:

pair_style lj/cut/intel 2.5 :pre

Or you can run with the "-sf intel" "command-line
switch"_Section_start.html#start_7, which will automatically append
"intel" to styles that support it.

lmp_machine -sf intel -in in.script
mpirun -np 4 lmp_machine -sf intel -in in.script :pre

Using the "suffix intel" command in your input script does the same
thing.

IMPORTANT NOTE: Using an "intel" suffix in any of the above modes
actually invokes two suffixes, "intel" and "omp". "Intel" is tried
first, and if the style does not support it, "omp" is tried next. If
neither is supported, the default non-suffix style is used.

[Speed-ups to expect:]

If LAMMPS was not built with coprocessor support when including the
USER-INTEL package, then accelerated styles will run on the CPU using
vectorization optimizations and the specified precision. This may
give a substantial speed-up for a pair style, particularly if mixed or
single precision is used.

If LAMMPS was built with coprocessor support, the pair styles will run
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
performance of a Xeon Phi versus a multi-core CPU is a function of
your hardware, which pair style is used, the number of
atoms/coprocessor, and the precision used on the coprocessor (double,
single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the USER-INTEL package on different
hardware.

[Guidelines for best performance on an Intel(R) Xeon Phi(TM)
coprocessor:]

The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the "package intel"_package.html command. :ulb,l

If desired, only a fraction of the pair style computation can be
offloaded to the coprocessors. This is accomplished by setting a
balance fraction in the "package intel"_package.html command (see the
sketch after this list). A balance of 0 runs all calculations on the
CPU. A balance of 1 runs all calculations on the coprocessor. A
balance of 0.5 runs half of the calculations on the coprocessor.
Setting the balance to -1 (the default) will enable dynamic load
balancing that continuously adjusts the fraction of offloaded work
throughout the simulation. This option typically produces results
within 5 to 10 percent of the optimal fixed balance. :l

When using offload with CPU hyperthreading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks. :l

If you have multiple coprocessors on each compute node, the
{offload_cards} keyword can be specified with the "package
intel"_package.html command. :l

If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs. :l

If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
coprocessor, a diagnostic line is printed to the screen (not to the
log file), during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for "atom
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. :l

For simulations with long-range electrostatics or bond, angle,
dihedral, improper calculations, computation and data transfer to the
coprocessor will run concurrently with computations and MPI
communications for these calculations on the host CPU. The USER-INTEL
package has two modes for deciding which atoms will be handled by the
coprocessor. This choice is controlled with the "offload_ghost"
keyword of the "package intel"_package.html command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the coprocessor when the "newton"_newton.html
setting is "on". The default is dependent on the style being used,
however, better performance may be achieved by setting this option
explicitly. :l,ule
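
As a sketch of the balance setting mentioned in the list above (the
"mixed" precision value is simply the default quoted earlier; consult
the "package intel"_package.html doc page for the exact syntax), a
fixed 50/50 split between CPU and coprocessor would look like:

package intel * mixed balance 0.5 :pre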

[Restrictions:]

When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
that require skip lists for neighbor builds cannot be offloaded.
Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one
intel-accelerated style may be used with hybrid styles.
"Special_bonds"_special_bonds.html exclusion lists are not currently
supported with offload; however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command; only the "pair" option is
supported.

:line

5.10 Comparison of GPU, USER-CUDA, and KOKKOS packages :h4,link(acc_10)

All 3 of these packages accelerate a LAMMPS calculation using NVIDIA
hardware, but they do it in different ways.

NOTE: this section still needs to be re-worked with additional KOKKOS
information.

As a consequence, for a particular simulation on specific hardware,
one package may be faster than the others. We give guidelines below,
but the best way to determine which package is faster for your input
script is to try them on your machine. See the benchmarking
section below for examples where this has been done.

[Guidelines for using each package optimally:]

The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not allow this; you can only use one CPU per GPU. :ulb,l

The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep. The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized). Hence, if you
can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package. :l

The GPU package is often faster than the USER-CUDA package, if the
number of atoms per GPU is "small". The crossover point, in terms of
atoms/GPU, at which the USER-CUDA package becomes faster depends
strongly on the pair style. For example, for a simple Lennard Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller. :l

Both packages compute bonded interactions (bonds, angles, etc.) on the
CPU. This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster. :l

When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
and the GPU. Hence its performance is affected if full 16 PCIe lanes
are not available for each GPU. In HPC environments this can be the
case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots. :l,ule

[Differences between the two packages:]

The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations. :ulb,l

The USER-CUDA package does not support acceleration for minimization. :l

The USER-CUDA package does not support hybrid pair styles. :l

The USER-CUDA package can order atoms in the neighbor list differently
from run to run resulting in a different order for force accumulation. :l

The USER-CUDA package has a limit on the number of atom types that can be
used in a simulation. :l

The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists or a triclinic simulation box. :l

The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not a problem since typical runs are computation-limited
rather than memory-limited. :l,ule

[Examples:]

The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.

lammps/examples/gpu = GPU package files
lammps/examples/USER/cuda = USER-CUDA package files :ul

These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.