git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-08-27 20:52:54 +00:00
parent dc5ad107ad
commit 444053fa6c
2 changed files with 463 additions and 235 deletions


@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR> 5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR> 5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR> 5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR> 5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>
<HR> <HR>
@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.
<H4><A NAME = "acc_2"></A>5.2 General strategies <H4><A NAME = "acc_2"></A>5.2 General strategies
</H4> </H4>
<P>NOTE: this sub-section is still a work in progress <P>NOTE: this section is still a work in progress
</P> </P>
<P>Here is a list of general ideas for improving simulation performance. <P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate standard non-accelerated versions, if you have the appropriate
hardware on your system. hardware on your system.
</P> </P>
<P>All of these accelerated styles are provided in optional <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
<P>The accelerated styles have the same name as the standard styles, <P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the command is identical, their functionality is the same, and the
@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script. within your input script.
</P> </P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU <P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs. packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as The speed-up on a GPU depends on a variety of factors, as discussed
discussed below. below.
</P> </P>
<P>Styles with an "intel" suffix are part of the USER-INTEL <P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases, calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors. supports acceleration with offload to Intel(R) Xeon Phi(TM)
This can result in additional speedup over 2x depending on the coprocessors. This can result in additional speedup over 2x depending
hardware configuration. on the hardware configuration.
</P> </P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be <P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
on a variety of factors, as discussed below. The speed-up depends on a variety of factors, as discussed below.
</P> </P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow <P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication. overload the available bandwidth for communication.
</P> </P>
<P>Styles with an "opt" suffix are part of the OPT package and typically <P>Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%. speed-up the pairwise calculations of your simulation by 5-25% on a
</P> CPU.
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each indvidual style
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
list any accelerated variants available for that style.
</P> </P>
<P>The following sections explain: <P>The following sections explain:
</P> </P>
<UL><LI>what hardware and software the accelerated styles require <UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package in place <LI>how to build LAMMPS with the accelerated package
<LI>what changes (if any) are needed in your input scripts <LI>how to run an input script with the accelerated package
<LI>speed-ups to expect
<LI>guidelines for best performance <LI>guidelines for best performance
<LI>speed-ups you can expect <LI>restrictions
</UL> </UL>
<P>The final section compares and contrasts the GPU and USER-CUDA <P>The final section compares and contrasts the GPU, USER-CUDA, and
packages, since they are both designed to use NVIDIA hardware. KOKKOS packages, since they all allow for use of NVIDIA GPUs.
</P> </P>
<HR> <HR>
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code. due to if tests and other conditional code.
</P> </P>
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P> </P>
<PRE>lmp_machine -sf opt < in.script <PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script mpirun -np 4 lmp_machine -sf opt < in.script
</PRE> </PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P> </P>
<HR> <HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4> </H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple <P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles, University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B> <P><B>Building LAMMPS with the USER-OMP package:</B>
</P> </P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P> </P>
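<P>For example, with the GNU or Intel compilers, the relevant lines of a
hypothetical src/MAKE/Makefile.machine might look as follows (a sketch
only; the other flags shown depend on your compiler and existing
makefile settings):
</P>
<PRE>CCFLAGS = -g -O3 -fopenmp
LINKFLAGS = -g -O3 -fopenmp
</PRE>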
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix 1 all nve/omp
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
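<P>As a concrete illustration (assuming a hypothetical 16-core node),
you could use 4 MPI tasks with 4 OpenMP threads each, so that 4 x 4 =
16 matches the physical core count. With an MPI launcher that forwards
the environment variable (see the variants below), the command would
look like this:
</P>
<PRE>env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
</PRE>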
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE> </PRE>
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P> </P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
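<P>For example, a simple benchmark sweep on a hypothetical 16-core node
(using the MPICH-style launch syntax shown above) could compare the
following splittings of the 16 cores; pick the combination with the
lowest reported "Loop time":
</P>
<PRE>env OMP_NUM_THREADS=1 mpirun -np 16 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=8 mpirun -np 2 lmp_machine -sf omp -in in.script
</PRE>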
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances: circumstances:
</P> </P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores.
<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.
<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.
<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be used and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver; see the sketch after this list.
</UL> </UL>
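<P>As a sketch of the last point (hypothetical cutoff and accuracy
values; the styles shown are the USER-OMP variants of standard
long-range styles), the relevant input script lines might look as
follows, run with fewer MPI tasks and several OpenMP threads per task:
</P>
<PRE>pair_style lj/cut/coul/long/omp 12.0   # longer real-space cutoff shifts work away from KSpace
kspace_style pppm/omp 1.0e-4
</PRE>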
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P> </P>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P> </P>
<HR> <HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package <H4><A NAME = "acc_6"></A>5.6 GPU package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the GPU package:</B>
<B>Running with the GPU package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The GPU package was developed by Mike Brown at ORNL and his <P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles, collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range including the 3-body Stillinger-Weber pair style, and for long-range
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package <H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-CUDA package:</B>
<B>Running with the USER-CUDA package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-CUDA package was developed by Christian Trott at U Technology <P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via styles, many fixes, a few computes, and for long-range Coulombics via
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package <H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the KOKKOS package:</B>
<B>Running with the KOKKOS package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles <P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos. library, which is included with LAMMPS in lib/kokkos.
@ -975,6 +1083,12 @@ LAMMPS.
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package <H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-INTEL package:</B>
<B>Running with the USER-INTEL package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-INTEL package was developed by Mike Brown at Intel <P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R) offloading neighbor list and non-bonded force calculations to Intel(R)


@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7 5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8 5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9 5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of GPU and USER-CUDA packages"_#acc_10 :all(b) 5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)
:line :line
:line :line
@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.
5.2 General strategies :h4,link(acc_2) 5.2 General strategies :h4,link(acc_2)
NOTE: this sub-section is still a work in progress NOTE: this section is still a work in progress
Here is a list of general ideas for improving simulation performance. Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate standard non-accelerated versions, if you have the appropriate
hardware on your system. hardware on your system.
All of these accelerated styles are provided in optional "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:
USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul
The accelerated styles have the same name as the standard styles, The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the command is identical, their functionality is the same, and the
@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script. within your input script.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.
Here is a brief summary of what the various packages provide. Details
are in individual sections below.
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs. packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as The speed-up on a GPU depends on a variety of factors, as discussed
discussed below. below.
Styles with an "intel" suffix are part of the USER-INTEL Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases, calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors. supports acceleration with offload to Intel(R) Xeon Phi(TM)
This can result in additional speedup over 2x depending on the coprocessors. This can result in additional speedup over 2x depending
hardware configuration. on the hardware configuration.
Styles with a "kk" suffix are part of the KOKKOS package, and can be Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
on a variety of factors, as discussed below. The speed-up depends on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication. overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%. speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each indvidual style
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
list any accelerated variants available for that style.
The following sections explain: The following sections explain:
what hardware and software the accelerated styles require what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package in place how to build LAMMPS with the accelerated package
what changes (if any) are needed in your input scripts how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance guidelines for best performance
speed-ups you can expect :ul restrictions :ul
The final section compares and contrasts the GPU and USER-CUDA The final section compares and contrasts the GPU, USER-CUDA, and
packages, since they are both designed to use NVIDIA hardware. KOKKOS packages, since they all allow for use of NVIDIA GPUs.
:line :line
@ -218,22 +232,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code. due to if tests and other conditional code.
[Required hardware/software:]
None.
[Building LAMMPS with the OPT package:]
Include the package and build LAMMPS.
make yes-opt
make machine :pre
No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
[Running with the OPT package:]
You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:
pair_style lj/cut/opt 2.5 :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf opt < in.script lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre mpirun -np 4 lmp_machine -sf opt < in.script :pre
[Speed-ups to expect:]
You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
[Guidelines for best performance:]
None. Just try out an OPT pair style to see how it performs.
[Restrictions:]
None.
:line :line
@ -241,118 +280,175 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles, University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
[Required hardware/software:]
Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:] [Building LAMMPS with the USER-OMP package:]
Include the package and build LAMMPS.
make yes-user-omp
make machine :pre
Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
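For example, with the GNU or Intel compilers, the relevant lines of a
hypothetical src/MAKE/Makefile.machine might look as follows (a sketch
only; the other flags shown depend on your compiler and existing
makefile settings):
CCFLAGS = -g -O3 -fopenmp
LINKFLAGS = -g -O3 -fopenmp :pre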
[Running with the USER-OMP package:]
You can explicitly add an "omp" suffix to any supported style in your
input script:
pair_style lj/cut/omp 2.5
fix 1 all nve/omp :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre
You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
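As a concrete illustration (assuming a hypothetical 16-core node),
you could use 4 MPI tasks with 4 OpenMP threads each, so that 4 x 4 =
16 matches the physical core count. With an MPI launcher that forwards
the environment variable (see the variants below), the command would
look like this:
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script :pre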
a) You can set an environment variable, either in your shell
or its start-up script:
setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre
This value will apply to all subsequent runs you perform.
b) You can set the same environment variable when you launch LAMMPS:
env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
All three examples use a total of 4 CPU cores.
Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
c) Use the "package omp"_package.html command near the top of your
script:
package omp 4 :pre
[Speed-ups to expect:]
Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
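For example, a simple benchmark sweep on a hypothetical 16-core node
(using the MPICH-style launch syntax shown above) could compare the
following splittings of the 16 cores; pick the combination with the
lowest reported "Loop time":
env OMP_NUM_THREADS=1 mpirun -np 16 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=8 mpirun -np 2 lmp_machine -sf omp -in in.script :pre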
A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
[Guidelines for best performance:]
For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l
The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l
A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the "kspace
style"_kspace_style.html can become the the performance-limiting
factor. Using multi-threading allows less MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise additional speedup can be sometimes be achived by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver. :l,ule
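As a sketch of the last point (hypothetical cutoff and accuracy
values; the styles shown are the USER-OMP variants of standard
long-range styles), the relevant input script lines might look as
follows, run with fewer MPI tasks and several OpenMP threads per task:
pair_style lj/cut/coul/long/omp 12.0   # longer real-space cutoff shifts work away from KSpace
kspace_style pppm/omp 1.0e-4 :pre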
Other performance tips are as follows:
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l
Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule
[Restrictions:]
None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.
[Parallel efficiency and performance tips:]
In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the {omp} version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
Using multi-threading is most effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes. :l
The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
"processors"_processors.html keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space. :l
Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part. :l,ule
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
A description of the multi-threading strategy and some performance
examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
:line :line
5.6 GPU package :h4,link(acc_6) 5.6 GPU package :h4,link(acc_6)
[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The GPU package was developed by Mike Brown at ORNL and his The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles, collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range including the 3-body Stillinger-Weber pair style, and for long-range
@ -542,6 +638,12 @@ of problem size and number of compute nodes.
5.7 USER-CUDA package :h4,link(acc_7) 5.7 USER-CUDA package :h4,link(acc_7)
[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-CUDA package was developed by Christian Trott at U Technology The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via styles, many fixes, a few computes, and for long-range Coulombics via
@ -679,6 +781,12 @@ occurs, the faster your simulation will run.
5.8 KOKKOS package :h4,link(acc_8) 5.8 KOKKOS package :h4,link(acc_8)
[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The KOKKOS package contains versions of pair, fix, and atom styles The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos. library, which is included with LAMMPS in lib/kokkos.
@ -971,6 +1079,12 @@ LAMMPS.
5.9 USER-INTEL package :h4,link(acc_9) 5.9 USER-INTEL package :h4,link(acc_9)
[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-INTEL package was developed by Mike Brown at Intel The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R) offloading neighbor list and non-bonded force calculations to Intel(R)