forked from lijiext/lammps
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa
This commit is contained in: parent dc5ad107ad, commit 444053fa6c
@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>

<HR>

@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.

<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this section is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
</P>
<P>All of these commands are in <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
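As a sketch of the suffix naming convention the documentation describes, the same pair style can be requested in its standard or accelerated form (the style names and cutoff below are illustrative; the suffixed variants assume the corresponding package was compiled in):

```
pair_style lj/cut 2.5        # standard CPU style
pair_style lj/cut/opt 2.5    # same style, OPT-accelerated variant
pair_style lj/cut/omp 2.5    # same style, USER-OMP-accelerated variant
```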
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
</P>
<P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
</P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
</P>
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25% on a
CPU.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package
<LI>how to run an input script with the accelerated package
<LI>speed-ups to expect
<LI>guidelines for best performance
<LI>restrictions
</UL>
<P>The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
</P>
<HR>
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P>
<HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P>
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix nve/omp
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"omp" to styles that support it.
</P>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
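The tasks-times-threads budget above can be checked with a small shell sketch before launching a run (the task and thread counts are hypothetical placeholders; `getconf _NPROCESSORS_ONLN` is assumed available, as on typical Linux and macOS systems):

```shell
# Hypothetical layout: 2 MPI tasks, 2 OpenMP threads per task.
TASKS=2
THREADS=2
TOTAL=$((TASKS * THREADS))
echo "cores requested: $TOTAL"
# Compare against the machine's core count before launching LAMMPS;
# oversubscribing physical cores usually degrades USER-OMP performance.
CORES=$(getconf _NPROCESSORS_ONLN)
if [ "$TOTAL" -le "$CORES" ]; then
  echo "layout fits on this node"
else
  echo "oversubscribed: reduce tasks or threads per task"
fi
```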
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
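The `env` prefix in the examples above simply seeds the variable into the launched process's environment, which the OpenMP runtime inside LAMMPS then reads at startup. A minimal sketch of that mechanism, needing no LAMMPS or MPI install:

```shell
# The child process sees the value set via the env prefix, exactly as
# lmp_machine (and its OpenMP runtime) would when launched this way.
env OMP_NUM_THREADS=4 sh -c 'echo "threads requested: $OMP_NUM_THREADS"'
# prints: threads requested: 4
```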
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
mpirun variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
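As a sketch of where option (c) sits in an actual input deck (the surrounding commands and cutoff are illustrative placeholders, not from the original text):

```
# thread count is set near the top, before styles are defined
package omp 4
units lj
atom_style atomic
pair_style lj/cut/omp 2.5
```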
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores.

<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.

<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
the <A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.

<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be invoked and
can speed up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver.
</UL>
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.

<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P>
<HR>

<H4><A NAME = "acc_6"></A>5.6 GPU package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the GPU package:</B>
<B>Running with the GPU package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
|
||||||
|
|
||||||
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-CUDA package:</B>
<B>Running with the USER-CUDA package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
|
||||||
|
|
||||||
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the KOKKOS package:</B>
<B>Running with the KOKKOS package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -975,6 +1083,12 @@ LAMMPS.
|
||||||
|
|
||||||
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-INTEL package:</B>
<B>Running with the USER-INTEL package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)

@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)

:line
:line

@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.

5.2 General strategies :h4,link(acc_2)

NOTE: this section is still a work in progress

Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.

All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:

USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul

The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.

To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.

Here is a brief summary of what the various packages provide. Details
are in individual sections below.

Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up on a GPU depends on a variety of factors, as discussed
below.

Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.

Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.

Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can

@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
|
||||||
overload the available bandwidth for communication.
|
overload the available bandwidth for communication.
|
||||||
|
|
||||||
Styles with an "opt" suffix are part of the OPT package and typically
|
Styles with an "opt" suffix are part of the OPT package and typically
|
||||||
speed-up the pairwise calculations of your simulation by 5-25%.
|
speed-up the pairwise calculations of your simulation by 5-25% on a
|
||||||
|
CPU.
|
||||||
To see what styles are currently available in each of the accelerated
|
|
||||||
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
|
|
||||||
manual. A list of accelerated styles is included in the pair, fix,
|
|
||||||
compute, and kspace sections. The doc page for each indvidual style
|
|
||||||
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
|
|
||||||
list any accelerated variants available for that style.
|
|
||||||
|
|
||||||
The following sections explain:
|
The following sections explain:
|
||||||
|
|
||||||
what hardware and software the accelerated styles require
|
what hardware and software the accelerated package requires
|
||||||
how to build LAMMPS with the accelerated package in place
|
how to build LAMMPS with the accelerated package
|
||||||
what changes (if any) are needed in your input scripts
|
how to run an input script with the accelerated package
|
||||||
|
speed-ups to expect
|
||||||
guidelines for best performance
|
guidelines for best performance
|
||||||
speed-ups you can expect :ul
|
restrictions :ul
|
||||||
|
|
||||||
The final section compares and contrasts the GPU and USER-CUDA
|
The final section compares and contrasts the GPU, USER-CUDA, and
|
||||||
packages, since they are both designed to use NVIDIA hardware.
|
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
|
||||||
|
|
||||||
:line
|
:line
|
||||||
|
|
||||||
|
@ -218,22 +232,47 @@ Technologies).  It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

Include the package and build LAMMPS:

make yes-opt
make machine :pre

No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.

[Running with the OPT package:]

You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:

pair_style lj/cut/opt 2.5 :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to all styles that support it:

lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre

[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end
of a run.  On most machines and for reasonable problem sizes, it will
be a 5 to 20% savings.

[Guidelines for best performance:]

None.  Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.

:line

@ -241,118 +280,175 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple
University.  It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles.  The package currently uses the
OpenMP interface for multi-threading.

[Required hardware/software:]

Your compiler must support the OpenMP interface.  You should have one
or more multi-core CPUs, so that multiple threads can be launched by
each MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

Include the package and build LAMMPS:

make yes-user-omp
make machine :pre

Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables.  For GNU and
Intel compilers, this flag is {-fopenmp}.  Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.

[Running with the USER-OMP package:]

You can explicitly add an "omp" suffix to any supported style in your
input script:

pair_style lj/cut/omp 2.5
fix 1 all nve/omp :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"omp" to all styles that support it:

lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre

You must also specify how many threads to use per MPI task.  There are
several ways to do this.  Note that the default for this setting in
the OpenMP environment is 1 thread/task, which may give poor
performance.  Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
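The tasks-times-threads rule above can be checked with simple shell
arithmetic before launching a run; the core and task counts used here
are assumptions for illustration only:

```shell
# Assumed node layout (illustration only): choose threads/task so that
# MPI tasks * threads/task does not exceed the physical core count.
cores=16          # physical cores per node (assumed)
tasks=4           # MPI tasks to run on the node
threads=$(( cores / tasks ))
export OMP_NUM_THREADS=$threads
echo "tasks=$tasks threads=$OMP_NUM_THREADS total=$(( tasks * threads ))"
```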

a) You can set an environment variable, either in your shell
or its start-up script:

setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre

This value will apply to all subsequent runs you perform.

b) You can set the same environment variable when you launch LAMMPS:

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre

All three examples use a total of 4 CPU cores.

Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes.  The {env}
variant above works with MPICH; the {-x} variant is for OpenMPI.
Check the documentation of your MPI installation for additional
details.

c) Use the "package omp"_package.html command near the top of your
script:

package omp 4 :pre
optimal ratio of MPI to OpenMP can vary a lot and should always be
|
|
||||||
confirmed through some benchmark runs for the current system and on
|
[Speed-ups to expect:]
|
||||||
the current machine.
|
|
||||||
|
Depending on which styles are accelerated, you should look for a
|
||||||
|
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
|
||||||
|
time" values printed at the end of a run.
|
||||||
|
|
||||||
|
You may see a small performance advantage (5 to 20%) when running a
|
||||||
|
USER-OMP style (in serial or parallel) with a single thread/MPI task,
|
||||||
|
versus running standard LAMMPS with its un-accelerated styles (in
|
||||||
|
serial or all-MPI parallelization with 1 task/core). This is because
|
||||||
|
many of the USER-OMP styles contain similar optimizations to those
|
||||||
|
used in the OPT package, as described above.
|
||||||
|
|
||||||
|
With multiple threads/task, the optimal choice of MPI tasks/node and
|
||||||
|
OpenMP threads/task can vary a lot and should always be tested via
|
||||||
|
benchmark runs for a specific simulation running on a specific
|
||||||
|
machine, paying attention to guidelines discussed in the next
|
||||||
|
sub-section.
|
||||||
|
|
||||||
|
A description of the multi-threading strategy used in the UESR-OMP
|
||||||
|
package and some performance examples are "presented
|
||||||
|
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
|
||||||
|
|
||||||
|
[Guidelines for best performance;]
|
||||||
|
|
||||||
|
For many problems on current generation CPUs, running the USER-OMP
|
||||||
|
package with a single thread/task is faster than running with multiple
|
||||||
|
threads/task. This is because the MPI parallelization in LAMMPS is
|
||||||
|
often more efficient than multi-threading as implemented in the
|
||||||
|
USER-OMP package. The parallel efficiency (in a threaded sense) also
|
||||||
|
varies for different USER-OMP styles.
|
||||||
|
|
||||||
|
Using multiple threads/task can be more effective under the following
|
||||||
|
circumstances:
|
||||||
|
|
||||||
|
Individual compute nodes have a significant number of CPU cores but
|
||||||
|
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
|
||||||
|
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
|
||||||
|
MPI task per CPU core will result in significant performance
|
||||||
|
degradation, so that running with 4 or even only 2 MPI tasks per node
|
||||||
|
is faster. Running in hybrid MPI+OpenMP mode will reduce the
|
||||||
|
inter-node communication bandwidth contention in the same way, but
|
||||||
|
offers an additional speedup by utilizing the otherwise idle CPU
|
||||||
|
cores. :ulb,l
|
||||||
|
|
||||||
|
The interconnect used for MPI communication does not provide
|
||||||
|
sufficient bandwidth for a large number of MPI tasks per node. For
|
||||||
|
example, this applies to running over gigabit ethernet or on Cray XT4
|
||||||
|
or XT5 series supercomputers. As in the aforementioned case, this
|
||||||
|
effect worsens when using an increasing number of nodes. :l
|
||||||
|
|
||||||
|
The system has a spatially inhomogeneous particle density which does
|
||||||
|
not map well to the "domain decomposition scheme"_processors.html or
|
||||||
|
"load-balancing"_balance.html options that LAMMPS provides. This is
|
||||||
|
because multi-threading achives parallelism over the number of
|
||||||
|
particles, not via their distribution in space. :l
|
||||||
|
|
||||||
|
A machine is being used in "capability mode", i.e. near the point
|
||||||
|
where MPI parallelism is maxed out. For example, this can happen when
|
||||||
|
using the "PPPM solver"_kspace_style.html for long-range
|
||||||
|
electrostatics on large numbers of nodes. The scaling of the "kspace
|
||||||
|
style"_kspace_style.html can become the the performance-limiting
|
||||||
|
factor. Using multi-threading allows less MPI tasks to be invoked and
|
||||||
|
can speed-up the long-range solver, while increasing overall
|
||||||
|
performance by parallelizing the pairwise and bonded calculations via
|
||||||
|
OpenMP. Likewise additional speedup can be sometimes be achived by
|
||||||
|
increasing the length of the Coulombic cutoff and thus reducing the
|
||||||
|
work done by the long-range solver. :l,ule
|
||||||
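Putting these circumstances together, a hybrid MPI+OpenMP launch might
look like the following sketch; the 16-core node and the lmp_machine
binary name are assumptions for illustration:

```
# assumed: one 16-core node; run 4 MPI tasks x 4 OpenMP threads/task
# instead of 16 MPI tasks, to relieve communication and KSpace pressure
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
```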

Other performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule

[Restrictions:]

None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.

:line

5.6 GPU package :h4,link(acc_6)

[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The GPU package was developed by Mike Brown at ORNL and his
collaborators.  It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -542,6 +638,12 @@ of problem size and number of compute nodes.

5.7 USER-CUDA package :h4,link(acc_7)

[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany.  It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -679,6 +781,12 @@ occurs, the faster your simulation will run.

5.8 KOKKOS package :h4,link(acc_8)

[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -971,6 +1079,12 @@ LAMMPS.

5.9 USER-INTEL package :h4,link(acc_9)

[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The USER-INTEL package was developed by Mike Brown at Intel
Corporation.  It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)