git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-08-27 20:52:54 +00:00
parent dc5ad107ad
commit 444053fa6c
2 changed files with 463 additions and 235 deletions


@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR>
5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>
<HR>
@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.
<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this sub-section is still a work in progress
<P>NOTE: this section is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
</P>
<P>All of these commands are in <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
</P>
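<P>For example (a minimal sketch, assuming you launch LAMMPS with "-sf
omp" and have the USER-OMP package installed), the <A HREF = "suffix.html">suffix</A>
command can bracket a portion of the script that should use the plain,
unaccelerated styles:
</P>
<PRE>suffix off
pair_style lj/cut 2.5     # created as the plain lj/cut style
suffix on
fix 1 all nve             # again honors the -sf command-line suffix
</PRE>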
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
</P>
<P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
</P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
</P>
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each individual style
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
list any accelerated variants available for that style.
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to build LAMMPS with the accelerated package in place
<LI>what changes (if any) are needed in your input scripts
<UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package
<LI>how to run an input script with the accelerated package
<LI>speed-ups to expect
<LI>guidelines for best performance
<LI>speed-ups you can expect
<LI>restrictions
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA hardware.
<P>The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
</P>
<HR>
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles, you can run it
as follows:
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P>
<HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
<I>-fopenmp</I>. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P>
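<P>A minimal sketch of the relevant Makefile.machine lines (the -g and
-O2 settings are placeholders; keep whatever flags your makefile
already uses and simply append the OpenMP flag):
</P>
<PRE>CCFLAGS =   -g -O2 -fopenmp
LINKFLAGS = -g -O2 -fopenmp
</PRE>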
<PRE>env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix nve/omp
</PRE>
<P>The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the <A HREF = "package.html">package
omp</A> command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<P><B>Restrictions:</B>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
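<P>For example (a sketch assuming a node with 8 physical cores and an MPI
that forwards the environment, as discussed below), both of these launch
lines keep MPI tasks * threads/task equal to 8:
</P>
<PRE>env OMP_NUM_THREADS=4 mpirun -np 2 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 4 lmp_machine -sf omp -in in.script
</PRE>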
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<P><B>Parallel efficiency and performance tips:</B>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the <I>omp</I> version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<P>Using multi-threading is most effective under the following
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
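<P>If you prefer to keep the entire setting in the input script (a
sketch, not required if you use the -sf switch), the <A HREF = "suffix.html">suffix</A>
command can select the omp variants as well:
</P>
<PRE>package omp 4
suffix omp
pair_style lj/cut 2.5     # now resolves to lj/cut/omp
</PRE>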
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
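<P>One way to do such a scan (a hypothetical bash sketch, assuming a
16-core node and an executable named lmp_machine) is to sweep the
thread count while keeping tasks * threads constant:
</P>
<PRE>for t in 1 2 4 8; do
  env OMP_NUM_THREADS=$t mpirun -np $((16/t)) lmp_machine -sf omp -in in.script
done
</PRE>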
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
offers an additional speedup by utilizing the otherwise idle CPU
cores.
<LI>The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes.
<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.
<LI>The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
<A HREF = "processors.html">processors</A> keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space.
<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.
<LI>Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part.
<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be used and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP (see the launch sketch after this list). Likewise, additional
speedup can sometimes be achieved by increasing the length of the
Coulombic cutoff and thus reducing the
work done by the long-range solver.
</UL>
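<P>As a sketch of the capability-mode scenario in the last bullet above
(assuming OpenMPI and 64 nodes with 8 cores each; the -npernode
placement option and the counts are assumptions), one MPI task per node
handles the long-range solver while OpenMP threads fill the remaining
cores:
</P>
<PRE>mpirun -np 64 -npernode 1 -x OMP_NUM_THREADS=8 lmp_machine -sf omp -in in.script
</PRE>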
<P>The best parallel efficiency from <I>omp</I> styles is typically achieved
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
i.e. socket or die.
<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P>
<P>Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
</P>
<P>A description of the multi-threading strategy and some performance
examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P>
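<P>For reference (a sketch of the supported usage), a two-level rRESPA
setup that keeps the pairwise interactions at the outermost level looks
like this:
</P>
<PRE>run_style respa 2 8 bond 1 pair 2
</PRE>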
<HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the GPU package:</B>
<B>Running with the GPU package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-CUDA package:</B>
<B>Running with the USER-CUDA package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the KOKKOS package:</B>
<B>Running with the KOKKOS package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -975,6 +1083,12 @@ LAMMPS.
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-INTEL package:</B>
<B>Running with the USER-INTEL package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)


@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of GPU and USER-CUDA packages"_#acc_10 :all(b)
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)
:line
:line
@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.
5.2 General strategies :h4,link(acc_2)
NOTE: this sub-section is still a work in progress
NOTE: this section is still a work in progress
Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:
USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul
The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.
Here is a brief summary of what the various packages provide. Details
are in individual sections below.
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each individual style
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
list any accelerated variants available for that style.
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
The following sections explain:
what hardware and software the accelerated styles require
how to build LAMMPS with the accelerated package in place
what changes (if any) are needed in your input scripts
what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package
how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance
speed-ups you can expect :ul
restrictions :ul
The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA hardware.
The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
:line
@ -218,22 +232,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
[Required hardware/software:]
None.
[Building LAMMPS with the OPT package:]
Include the package and build LAMMPS.
make yes-opt
make machine :pre
If your input script uses one of the OPT pair styles, you can run it
as follows:
No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
[Running with the OPT package:]
You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:
pair_style lj/cut/opt 2.5 :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre
You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
[Speed-ups to expect:]
You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
[Guidelines for best performance:]
None. Just try out an OPT pair style to see how it performs.
[Restrictions:]
None.
:line
@ -241,118 +280,175 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
[Required hardware/software:]
Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:]
The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
{-fopenmp}. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
Include the package and build LAMMPS.
make yes-user-omp
make machine :pre
If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
[Running with the USER-OMP package:]
You can explicitly add an "omp" suffix to any supported style in your
input script:
pair_style lj/cut/omp 2.5
fix nve/omp :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre
You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
a) You can set an environment variable, either in your shell
or its start-up script:
setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre
This value will apply to all subsequent runs you perform.
b) You can set the same environment variable when you launch LAMMPS:
env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the "package
omp"_package.html command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
All three examples use a total of 4 CPU cores.
Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
c) Use the "package omp"_package.html command near the top of your
script:
package omp 4 :pre
[Speed-ups to expect:]
Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
[Guidelines for best performance:]
For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l
The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l
A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the "kspace
style"_kspace_style.html can become the the performance-limiting
factor. Using multi-threading allows less MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise additional speedup can be sometimes be achived by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver. :l,ule
Other performance tips are as follows:
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l
Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule
[Restrictions:]
None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.
[Parallel efficiency and performance tips:]
In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the {omp} version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
Using multi-threading is most effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes. :l
The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
"processors"_processors.html keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space. :l
Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part. :l,ule
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
A description of the multi-threading strategy and some performance
examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.
:line
5.6 GPU package :h4,link(acc_6)
[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -542,6 +638,12 @@ of problem size and number of compute nodes.
5.7 USER-CUDA package :h4,link(acc_7)
[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -679,6 +781,12 @@ occurs, the faster your simulation will run.
5.8 KOKKOS package :h4,link(acc_8)
[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -971,6 +1079,12 @@ LAMMPS.
5.9 USER-INTEL package :h4,link(acc_9)
[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)