<HTML>
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_howto.html">Next
Section</A>
</CENTER>

<HR>
<H3>5. Accelerating LAMMPS performance
</H3>
<P>This section describes various methods for improving LAMMPS
performance for different classes of problems running on different
kinds of machines.
</P>
<P>There are two thrusts to the discussion that follows.  The
first is using code options that implement alternate algorithms
that can speed up a simulation.  The second is to use one
of the several accelerator packages provided with LAMMPS that
contain code optimized for certain kinds of hardware, including
multi-core CPUs, GPUs, and Intel Xeon Phi coprocessors.
</P>
<UL><LI>5.1 <A HREF = "#acc_1">Measuring performance</A>

<LI>5.2 <A HREF = "#acc_2">Algorithms and code options to boost performance</A>

<LI>5.3 <A HREF = "#acc_3">Accelerator packages with optimized styles</A>

<UL><LI> 5.3.1 <A HREF = "accelerate_cuda.html">USER-CUDA package</A>

<LI> 5.3.2 <A HREF = "accelerate_gpu.html">GPU package</A>

<LI> 5.3.3 <A HREF = "accelerate_intel.html">USER-INTEL package</A>

<LI> 5.3.4 <A HREF = "accelerate_kokkos.html">KOKKOS package</A>

<LI> 5.3.5 <A HREF = "accelerate_omp.html">USER-OMP package</A>

<LI> 5.3.6 <A HREF = "accelerate_opt.html">OPT package</A>
</UL>
<LI>5.4 <A HREF = "#acc_4">Comparison of various accelerator packages</A>
</UL>
<P>The <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the LAMMPS
web site gives performance results for the various accelerator
packages discussed in Section 5.3, for several of the standard LAMMPS
benchmark problems, as a function of problem size and number of
compute nodes, on different hardware platforms.
</P>
<HR>

<H4><A NAME = "acc_1"></A>5.1 Measuring performance
</H4>
<P>Before trying to make your simulation run faster, you should
understand how it currently performs and where the bottlenecks are.
</P>
<P>The best way to do this is to run your system (the actual number of
atoms) for a modest number of timesteps (say 100 steps) on several
different processor counts, including a single processor if possible.
Do this for an equilibrated version of your system, so that the
100-step timings are representative of a much longer run.  There is
typically no need to run for 1000s of timesteps to get accurate
timings; you can simply extrapolate from short runs.
</P>
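<P>As an illustration, a set of short timing runs might look like the
following, assuming an MPI launcher named mpirun and the generic
executable and input-script names (lmp_machine, in.script) used later
in this section:
</P>
<PRE>lmp_machine < in.script                  # 1 processor
mpirun -np 4 lmp_machine < in.script     # 4 processors
mpirun -np 16 lmp_machine < in.script    # 16 processors
</PRE>
<P>Each run uses the same input, so the per-timestep timings can be
compared directly across processor counts.
</P>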
<P>For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run.  <A HREF = "Section_start.html#start_8">This
section</A> of the manual has an overview.
</P>
<P>Running on one (or a few) processors should give a good estimate of
the serial performance and of which portions of the timestep are taking
the most time.  Running the same problem on a few different processor
counts should give an estimate of parallel scalability.  I.e. if the
simulation runs 16x faster on 16 processors, it is 100% parallel
efficient; if it runs 8x faster on 16 processors, it is 50% efficient.
</P>
<P>The most important data to look at in the timing info is the timing
breakdown and relative percentages.  For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time.  If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below.  Comparing how the percentages change as
you increase the processor count gives you a sense of how different
operations within the timestep are scaling.  Note that if you are
running with a Kspace solver, there is additional output on the
breakdown of the Kspace time.  For PPPM, this includes the fraction
spent on FFTs, which can be communication intensive.
</P>
<P>Another important detail in the timing info is the histograms of
atom counts and neighbor counts.  If these vary widely across
processors, you have a load-imbalance issue.  This often results in
inaccurate relative timing data, because processors have to wait when
communication occurs for other processors to catch up.  Thus the
reported times for "Communication" or "Other" may be higher than they
really are, due to load-imbalance.  If this is an issue, you can
uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile
LAMMPS, to obtain synchronized timings.
</P>
<HR>

<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this section 5.2 is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
bottlenecks in the current performance, so let the timing data you
generate be your guide.  It is hard, if not impossible, to predict how
much difference these options will make, since it is a function of
problem size, number of processors used, and your machine.  There is
no substitute for identifying performance bottlenecks, and trying out
various options.
</P>
<UL><LI>rRESPA
<LI>2-FFT PPPM
<LI>Staggered PPPM
<LI>single vs double PPPM
<LI>partial charge PPPM
<LI>verlet/split run style
<LI>processors command for proc layout and numa layout
<LI>load-balancing: balance and fix balance
</UL>
<P>2-FFT PPPM, also called <I>analytic differentiation</I> or <I>ad</I> PPPM, uses
2 FFTs instead of the 4 FFTs used by the default <I>ik differentiation</I>
PPPM.  However, 2-FFT PPPM also requires a slightly larger mesh size to
achieve the same accuracy as 4-FFT PPPM.  For problems where the FFT
cost is the performance bottleneck (typically large problems running
on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
</P>
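<P>Switching PPPM to analytic (2-FFT) differentiation is a one-line
change in the input script, via the <A HREF = "kspace_modify.html">kspace_modify</A>
command.  A minimal sketch (the accuracy value is only a placeholder):
</P>
<PRE>kspace_style  pppm 1.0e-4     # default ik (4-FFT) differentiation
kspace_modify diff ad         # switch to analytic (2-FFT) differentiation
</PRE>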
<P>Staggered PPPM performs calculations using two different meshes, one
shifted slightly with respect to the other.  This can reduce force
aliasing errors and increase the accuracy of the method, but also
doubles the amount of work required.  For high relative accuracy, using
staggered PPPM allows one to halve the mesh size in each dimension as
compared to regular PPPM, which can give around a 4x speedup in the
kspace time.  However, for low relative accuracy, using staggered PPPM
gives little benefit and can be up to 2x slower in the kspace
time.  For example, the rhodopsin benchmark was run on a single
processor, and results for kspace time vs. relative accuracy for the
different methods are shown in the figure below.  For this system,
staggered PPPM (using ik differentiation) becomes useful at relative
accuracies of slightly greater than 1e-5 and above.
</P>
<CENTER><IMG SRC = "JPG/rhodo_staggered.jpg">
</CENTER>
<P>IMPORTANT NOTE: Using staggered PPPM may not give the same increase in
accuracy of energy and pressure as it does in forces, so some caution
must be used if energy and/or pressure are quantities of interest,
such as when using a barostat.
</P>
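<P>Staggered PPPM is selected via the kspace style itself.  A minimal
sketch, assuming the pppm/stagger style provided by this version of
LAMMPS (the accuracy value is only a placeholder):
</P>
<PRE>kspace_style pppm/stagger 1.0e-5    # staggered-mesh PPPM at the requested relative accuracy
</PRE>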
<HR>

<H4><A NAME = "acc_3"></A>5.3 Packages with optimized styles
</H4>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions.  Some require appropriate hardware
to be present on your system, e.g. GPUs or Intel Xeon Phi
coprocessors.
</P>
<P>All of these commands are in packages provided with LAMMPS.  An
overview of packages is given in <A HREF = "Section_packages.html">Section
packages</A>.  These are the accelerator packages
currently in LAMMPS, either as standard or user packages:
</P>
<DIV ALIGN=center><TABLE BORDER=1 >
<TR><TD ><A HREF = "accelerate_cuda.html">USER-CUDA</A> </TD><TD > for NVIDIA GPUs</TD></TR>
<TR><TD ><A HREF = "accelerate_gpu.html">GPU</A> </TD><TD > for NVIDIA GPUs as well as OpenCL support</TD></TR>
<TR><TD ><A HREF = "accelerate_intel.html">USER-INTEL</A> </TD><TD > for Intel CPUs and Intel Xeon Phi</TD></TR>
<TR><TD ><A HREF = "accelerate_kokkos.html">KOKKOS</A> </TD><TD > for GPUs, Intel Xeon Phi, and OpenMP threading</TD></TR>
<TR><TD ><A HREF = "accelerate_omp.html">USER-OMP</A> </TD><TD > for OpenMP threading</TD></TR>
<TR><TD ><A HREF = "accelerate_opt.html">OPT</A> </TD><TD > generic CPU optimizations
</TD></TR></TABLE></DIV>
<P>Any accelerated style has the same name as the corresponding standard
style, except that a suffix is appended.  Otherwise, the syntax for
the command that uses the style is identical, its functionality is
the same, and the numerical results it produces should also be the
same, except for precision and round-off effects.
</P>
<P>For example, all of these styles are accelerated variants of the
Lennard-Jones <A HREF = "pair_lj.html">pair_style lj/cut</A>:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/intel</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/kk</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/omp</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
</UL>
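<P>In an input script, an accelerated variant can be requested either by
naming it explicitly or by keeping the standard style name and letting
a <A HREF = "suffix.html">suffix</A> command append it.  A minimal sketch (the cutoff
and coefficients are only placeholders):
</P>
<PRE>pair_style lj/cut/omp 2.5     # name the accelerated variant explicitly
pair_coeff * * 1.0 1.0
</PRE>
<PRE>suffix     omp                # or: let the suffix command append /omp automatically
pair_style lj/cut 2.5
pair_coeff * * 1.0 1.0
</PRE>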
<P>To see which accelerated styles are currently available, see
<A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the manual.  The
doc pages for individual commands (e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or
<A HREF = "fix_nve.html">fix nve</A>) also list any accelerated variants available
for that style.
</P>
<P>To use an accelerator package in LAMMPS, and one or more of the styles
it provides, follow these general steps.  Details vary from package to
package and are explained in the individual accelerator doc pages,
listed above:
</P>
<DIV ALIGN=center><TABLE BORDER=1 >
<TR><TD >build the accelerator library </TD><TD > only for USER-CUDA and GPU packages </TD></TR>
<TR><TD >install the accelerator package </TD><TD > make yes-opt, make yes-user-intel, etc </TD></TR>
<TR><TD >add compile/link flags to Makefile.machine </TD><TD > in src/MAKE, <br> only for USER-INTEL, KOKKOS, USER-OMP packages </TD></TR>
<TR><TD >re-build LAMMPS </TD><TD > make machine </TD></TR>
<TR><TD >run a LAMMPS simulation </TD><TD > lmp_machine < in.script </TD></TR>
<TR><TD >enable the accelerator package </TD><TD > via "-c on" and "-k on" <A HREF = "Section_start.html#start_7">command-line switches</A>, <br> only for USER-CUDA and KOKKOS packages </TD></TR>
<TR><TD >set any needed options for the package </TD><TD > via "-pk" <A HREF = "Section_start.html#start_7">command-line switch</A> or <A HREF = "package.html">package</A> command, <br> only if defaults need to be changed </TD></TR>
<TR><TD >use accelerated styles in your input script </TD><TD > via "-sf" <A HREF = "Section_start.html#start_7">command-line switch</A> or <A HREF = "suffix.html">suffix</A> command
</TD></TR></TABLE></DIV>
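<P>As a concrete illustration of these steps for one package, here is a
minimal sketch for the USER-OMP package, assuming the standard
src/MAKE/Makefile.mpi machine makefile and a generic input script (the
thread count is only a placeholder):
</P>
<PRE>cd lammps/src
make yes-user-omp                         # install the package
# add OpenMP compile/link flags (e.g. -fopenmp) to MAKE/Makefile.mpi
make mpi                                  # re-build LAMMPS
lmp_mpi -sf omp -pk omp 4 < in.script     # run with omp-suffixed styles, 4 threads per MPI task
</PRE>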
<P>The first 4 steps can be done as a single command, using the
src/Make.py tool.  The Make.py tool is discussed in <A HREF = "Section_start.html#start_4">Section
2.4</A> of the manual, and its use is
illustrated in the individual accelerator sections.  Typically these
steps only need to be done once, to create an executable that uses one
or more accelerator packages.
</P>
<P>The last 4 steps can all be done from the command-line when LAMMPS is
launched, without changing your input script, as illustrated in the
individual accelerator sections.  Or you can add
<A HREF = "package.html">package</A> and <A HREF = "suffix.html">suffix</A> commands to your input
script.
</P>
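<P>For example, the command-line switches shown above for the USER-OMP
package have the same effect as adding two lines near the top of the
input script; a minimal sketch (the thread count is again a
placeholder):
</P>
<PRE>package omp 4     # same effect as the "-pk omp 4" switch
suffix  omp       # same effect as the "-sf omp" switch
</PRE>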
<P>IMPORTANT NOTE: With a few exceptions, you can build a single LAMMPS
executable with all its accelerator packages installed.  Note that the
USER-INTEL and KOKKOS packages require you to choose one of their
options when building, i.e. CPU or Phi for USER-INTEL, and OpenMP,
Cuda, or Phi for KOKKOS.  Here are the exceptions; you cannot build a
single executable with:
</P>
<UL><LI>both the USER-INTEL Phi and KOKKOS Phi options
<LI>the USER-INTEL Phi or KOKKOS Phi option, and either the USER-CUDA or GPU packages
</UL>
<P>See the examples/accelerate/README and make.list files for sample
Make.py commands that build LAMMPS with any or all of the accelerator
packages.  As an example, here is a command that builds with all the
GPU related packages installed (USER-CUDA, GPU, KOKKOS with Cuda),
including settings to build the needed auxiliary USER-CUDA and GPU
libraries for Kepler GPUs:
</P>
<PRE>Make.py -j 16 -p omp gpu cuda kokkos -cc nvcc wrap=mpi -cuda mode=double arch=35 -gpu mode=double arch=35 \
        -kokkos cuda arch=35 lib-all file mpi
</PRE>
<P>The examples/accelerate directory also has input scripts that can be
used with all of the accelerator packages.  See its README file for
details.
</P>
<P>Likewise, the bench directory has FERMI, KEPLER, and PHI
sub-directories with Make.py commands and input scripts for using all
the accelerator packages on various machines.  See the README files in
those directories.
</P>
<P>As mentioned above, the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark
page</A> of the LAMMPS web site gives
performance results for the various accelerator packages for several
of the standard LAMMPS benchmark problems, as a function of problem
size and number of compute nodes, on different hardware platforms.
</P>
<P>Here is a brief summary of what the various packages provide.  Details
are in the individual accelerator sections.
</P>
<UL><LI>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs.  The speed-up on a GPU
depends on a variety of factors, discussed in the accelerator
sections.

<LI>Styles with an "intel" suffix are part of the USER-INTEL
package.  These styles support vectorized single and mixed precision
calculations, in addition to full double precision.  In extreme cases,
this can provide speedups over 3.5x on CPUs.  The package also
supports acceleration in "offload" mode to Intel(R) Xeon Phi(TM)
coprocessors.  This can result in additional speedup over 2x depending
on the hardware configuration.

<LI>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP on multicore CPUs, on an NVIDIA GPU, or on an Intel
Xeon Phi in "native" mode.  The speed-up depends on a variety of
factors, as discussed on the KOKKOS accelerator page.

<LI>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair style to be run in multi-threaded mode using OpenMP.  This can
be useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processors, or when many MPI tasks would
overload the available bandwidth for communication.  A hybrid launch
of this kind is sketched after this list.

<LI>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25% on a
CPU.
</UL>
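<P>A minimal sketch of such a hybrid launch, assuming a 16-core node, an
MPI launcher named mpirun, and the generic executable name used above
(the 4x4 task/thread split is only a placeholder):
</P>
<PRE>mpirun -np 4 lmp_machine -sf omp -pk omp 4 < in.script    # 4 MPI tasks x 4 OpenMP threads per node
</PRE>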
<P>The individual accelerator package doc pages explain:
</P>
<UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package
<LI>how to run with the accelerated package either via command-line switches or modifying the input script
<LI>speed-ups to expect
<LI>guidelines for best performance
<LI>restrictions
</UL>
<HR>

<H4><A NAME = "acc_4"></A>5.4 Comparison of various accelerator packages
</H4>
<P>NOTE: this section still needs to be re-worked with additional KOKKOS
and USER-INTEL information.
</P>
<P>This section compares and contrasts the various accelerator
options, since there are multiple ways to perform OpenMP threading,
run on GPUs, and run on Intel Xeon Phi coprocessors.
</P>
<P>The USER-CUDA, GPU, and KOKKOS packages can all accelerate a LAMMPS
calculation using NVIDIA hardware, but they do it in different ways.
</P>
<P>As a consequence, for a particular simulation on specific hardware,
one package may be faster than another.  We give guidelines below,
but the best way to determine which package is faster for your input
script is to try each of them on your machine.  See the benchmarking
section below for examples where this has been done.
</P>
<P><B>Guidelines for using each package optimally:</B>
</P>
<UL><LI>The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode (a launch of
this kind is sketched after this list).  The USER-CUDA
package does not allow this; you can only use one CPU per GPU.

<LI>The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep.  The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized).  Hence, if you
can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package.

<LI>The GPU package is often faster than the USER-CUDA package, if the
number of atoms per GPU is "small".  The crossover point, in terms of
atoms/GPU, at which the USER-CUDA package becomes faster depends
strongly on the pair style.  For example, for a simple Lennard-Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU.  When performing double precision calculations the
crossover point can be significantly smaller.

<LI>Both packages compute bonded interactions (bonds, angles, etc) on the
CPU.  This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep.  If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster.

<LI>When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
and the GPU.  Hence its performance is affected if the full 16 PCIe lanes
are not available for each GPU.  In HPC environments this can be the
case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot.  Also many multi-GPU mainboards do not provide
the full 16 lanes to each of the PCIe 2.0 16x slots.
</UL>
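<P>A minimal sketch of assigning multiple MPI tasks to each GPU with the
GPU package, assuming the "-pk gpu" switch syntax of this version of
LAMMPS (the task and GPU counts are only placeholders for what your
node actually provides):
</P>
<PRE>mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 < in.script   # 12 MPI tasks sharing 2 GPUs per node
</PRE>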
<P><B>Differences between the two packages:</B>
</P>
<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
calculations.  The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations.

<LI>The USER-CUDA package does not support acceleration for minimization.

<LI>The USER-CUDA package does not support hybrid pair styles.

<LI>The USER-CUDA package can order atoms in the neighbor list differently
from run to run, resulting in a different order for force accumulation.

<LI>The USER-CUDA package has a limit on the number of atom types that can be
used in a simulation.

<LI>The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists or a triclinic simulation box.

<LI>The GPU package uses more GPU memory than the USER-CUDA package.  This
is generally not a problem since typical runs are computation-limited
rather than memory-limited.
</UL>
<P><B>Examples:</B>
</P>
<P>The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.
</P>
<UL><LI>lammps/examples/gpu = GPU package files
<LI>lammps/examples/USER/cuda = USER-CUDA package files
</UL>
<P>These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.
</P>
</HTML>