git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa
This commit is contained in:
parent dc5ad107ad
commit 444053fa6c
@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR>
5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>

<HR>
@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.

<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this sub-section is still a work in progress
<P>NOTE: this section is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
</P>
<P>All of these commands are in <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the

@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
</P>
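<P>For example, here is a minimal sketch of both mechanisms, assuming an
input whose styles have "omp" variants (any other suffix works the
same way):
</P>
<PRE>suffix omp              # same effect as the -sf omp command-line switch
pair_style lj/cut 2.5   # runs as lj/cut/omp
suffix off
fix 1 all nve           # plain, non-accelerated fix nve
suffix on
</PRE>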
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
</P>
<P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
</P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can

@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
</P>
<P>Styles with an "opt" suffix are part of the OPT package and typically
|
||||
speed-up the pairwise calculations of your simulation by 5-25%.
|
||||
</P>
|
||||
<P>To see what styles are currently available in each of the accelerated
|
||||
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
|
||||
manual. A list of accelerated styles is included in the pair, fix,
|
||||
compute, and kspace sections. The doc page for each indvidual style
|
||||
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
|
||||
list any accelerated variants available for that style.
|
||||
speed-up the pairwise calculations of your simulation by 5-25% on a
|
||||
CPU.
|
||||
</P>
|
||||
<P>The following sections explain:
|
||||
</P>
|
||||
<UL><LI>what hardware and software the accelerated styles require
|
||||
<LI>how to build LAMMPS with the accelerated package in place
|
||||
<LI>what changes (if any) are needed in your input scripts
|
||||
<UL><LI>what hardware and software the accelerated package requires
|
||||
<LI>how to build LAMMPS with the accelerated package
|
||||
<LI>how to run an input script with the accelerated package
|
||||
<LI>speed-ups to expect
|
||||
<LI>guidelines for best performance
|
||||
<LI>speed-ups you can expect
|
||||
<LI>restrictions
|
||||
</UL>
|
||||
<P>The final section compares and contrasts the GPU and USER-CUDA
|
||||
packages, since they are both designed to use NVIDIA hardware.
|
||||
<P>The final section compares and contrasts the GPU, USER-CUDA, and
|
||||
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
|
||||
</P>
|
||||
<HR>
|
||||
|
||||
|
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles, you can run it
as follows:
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P>

<HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
<I>-fopenmp</I>. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P>
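<P>For example, the relevant lines of a low-level makefile might look as
follows (illustrative flags only; keep whatever other options your
Makefile.machine already sets):
</P>
<PRE>CCFLAGS =   -O2 -fopenmp
LINKFLAGS = -O2 -fopenmp
</PRE>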
<PRE>env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix nve/omp
</PRE>
<P>The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively, are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the <A HREF = "package.html">package
omp</A> command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"omp" to styles that support it.
</P>
<P><B>Restrictions:</B>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
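<P>For example, on a node with 12 physical cores, combinations such as 12
MPI tasks x 1 thread, 4 MPI tasks x 3 threads, or 2 MPI tasks x 6
threads all satisfy this rule (illustrative counts, not
recommendations); the ways to request a thread count are described
below.
</P>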
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<P><B>Parallel efficiency and performance tips:</B>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the <I>omp</I> version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<P>Using multi-threading is most effective under the following
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
NOTE: which mpirun is for OpenMPI or MPICH?
</PRE>
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
offers an additional speedup by utilizing the otherwise idle CPU
cores.

<LI>The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes.
<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.

<LI>The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
<A HREF = "processors.html">processors</A> keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space.
<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.

<LI>Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part.
<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver (see the example below this list).
</UL>
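<P>As a rough illustration of the last ("capability mode") item above,
the following input fragment shifts work from the long-range solver to
the pairwise part by lengthening the Coulombic cutoff, while threading
the per-node work (the cutoff, accuracy, and thread count are
placeholder values, not recommendations):
</P>
<PRE>package omp 2                     # 2 OpenMP threads per MPI task
kspace_style pppm 1.0e-4          # long-range solver
pair_style lj/cut/coul/long 12.0  # longer cutoff = more pairwise work, less kspace work
</PRE>
<P>Running this with fewer MPI tasks per node and the -sf omp switch, and
comparing the "Pair time" and "KSpace time" breakdown against an
all-MPI run, shows whether the trade-off pays off for your system.
</P>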
<P>The best parallel efficiency from <I>omp</I> styles is typically achieved
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
i.e. socket or die.

<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P>
<P>Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
</P>
<P>A description of the multi-threading strategy and some performance
examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P>

<HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package
|
||||
</H4>
|
||||
<P><B>Required hardware/software:</B>
|
||||
<B>Building LAMMPS with the OPT package:</B>
|
||||
<B>Running with the OPT package;</B>
|
||||
<B>Guidelines for best performance;</B>
|
||||
<B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The GPU package was developed by Mike Brown at ORNL and his
|
||||
collaborators. It provides GPU versions of several pair styles,
|
||||
including the 3-body Stillinger-Weber pair style, and for long-range
|
||||
|
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
|
|||
|
||||
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
|
||||
</H4>
|
||||
<P><B>Required hardware/software:</B>
|
||||
<B>Building LAMMPS with the OPT package:</B>
|
||||
<B>Running with the OPT package;</B>
|
||||
<B>Guidelines for best performance;</B>
|
||||
<B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The USER-CUDA package was developed by Christian Trott at U Technology
|
||||
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
|
||||
styles, many fixes, a few computes, and for long-range Coulombics via
|
||||
|
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
|
|||
|
||||
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package
|
||||
</H4>
|
||||
<P><B>Required hardware/software:</B>
|
||||
<B>Building LAMMPS with the OPT package:</B>
|
||||
<B>Running with the OPT package;</B>
|
||||
<B>Guidelines for best performance;</B>
|
||||
<B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The KOKKOS package contains versions of pair, fix, and atom styles
|
||||
that use data structures and methods and macros provided by the Kokkos
|
||||
library, which is included with LAMMPS in lib/kokkos.
|
||||
|
@ -975,6 +1083,12 @@ LAMMPS.
|
|||
|
||||
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
|
||||
</H4>
|
||||
<P><B>Required hardware/software:</B>
|
||||
<B>Building LAMMPS with the OPT package:</B>
|
||||
<B>Running with the OPT package;</B>
|
||||
<B>Guidelines for best performance;</B>
|
||||
<B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The USER-INTEL package was developed by Mike Brown at Intel
|
||||
Corporation. It provides a capability to accelerate simulations by
|
||||
offloading neighbor list and non-bonded force calculations to Intel(R)
|
||||
|
|
|
@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of GPU and USER-CUDA packages"_#acc_10 :all(b)
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)

:line
:line
@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.

5.2 General strategies :h4,link(acc_2)

NOTE: this sub-section is still a work in progress
NOTE: this section is still a work in progress

Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.

All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:

USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul
The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the

@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
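For example, here is a minimal sketch of both mechanisms, assuming an
input whose styles have "omp" variants (any other suffix works the
same way):

suffix omp              # same effect as the -sf omp command-line switch
pair_style lj/cut 2.5   # runs as lj/cut/omp
suffix off
fix 1 all nve           # plain, non-accelerated fix nve
suffix on :pre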
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.

Here is a brief summary of what the various packages provide. Details
are in individual sections below.

Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.

Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.

Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.

Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can

@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically
|
||||
speed-up the pairwise calculations of your simulation by 5-25%.
|
||||
|
||||
To see what styles are currently available in each of the accelerated
|
||||
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
|
||||
manual. A list of accelerated styles is included in the pair, fix,
|
||||
compute, and kspace sections. The doc page for each indvidual style
|
||||
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
|
||||
list any accelerated variants available for that style.
|
||||
speed-up the pairwise calculations of your simulation by 5-25% on a
|
||||
CPU.
|
||||
|
||||
The following sections explain:
|
||||
|
||||
what hardware and software the accelerated styles require
|
||||
how to build LAMMPS with the accelerated package in place
|
||||
what changes (if any) are needed in your input scripts
|
||||
what hardware and software the accelerated package requires
|
||||
how to build LAMMPS with the accelerated package
|
||||
how to run an input script with the accelerated package
|
||||
speed-ups to expect
|
||||
guidelines for best performance
|
||||
speed-ups you can expect :ul
|
||||
restrictions :ul
|
||||
|
||||
The final section compares and contrasts the GPU and USER-CUDA
|
||||
packages, since they are both designed to use NVIDIA hardware.
|
||||
The final section compares and contrasts the GPU, USER-CUDA, and
|
||||
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
|
||||
|
||||
:line
|
||||
|
||||
|
@ -218,22 +232,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.

The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

Include the package and build LAMMPS.

make yes-opt
make machine :pre

If your input script uses one of the OPT pair styles, you can run it
as follows:
No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.

[Running with the OPT package:]

You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:

pair_style lj/cut/opt 2.5 :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.

lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre

You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.

[Guidelines for best performance:]

None. Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.

:line
@ -241,118 +280,175 @@ to 20% savings.

The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.

[Required hardware/software:]

Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
{-fopenmp}. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
Include the package and build LAMMPS.

make yes-user-omp
make machine :pre

If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
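For example, the relevant lines of a low-level makefile might look as
follows (illustrative flags only; keep whatever other options your
Makefile.machine already sets):

CCFLAGS =   -O2 -fopenmp
LINKFLAGS = -O2 -fopenmp :pre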
env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
[Running with the USER-OMP package:]

You can explicitly add an "omp" suffix to any supported style in your
input script:

pair_style lj/cut/omp 2.5
fix nve/omp :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"omp" to styles that support it.

lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre

You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
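For example, on a node with 12 physical cores, combinations such as 12
MPI tasks x 1 thread, 4 MPI tasks x 3 threads, or 2 MPI tasks x 6
threads all satisfy this rule (illustrative counts, not
recommendations); the ways to request a thread count are described
below.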
a) You can set an environment variable, either in your shell
or its start-up script:

setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre

This value will apply to all subsequent runs you perform.

b) You can set the same environment variable when you launch LAMMPS:

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
NOTE: which mpirun is for OpenMPI or MPICH? :pre

The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively, are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the "package
omp"_package.html command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
All three examples use a total of 4 CPU cores.

Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.

c) Use the "package omp"_package.html command near the top of your
script:

package omp 4 :pre

[Speed-ups to expect:]
Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.

With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.

A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the "kspace
style"_kspace_style.html can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver (see the example below this list). :l,ule
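As a rough illustration of the last ("capability mode") item above,
the following input fragment shifts work from the long-range solver to
the pairwise part by lengthening the Coulombic cutoff, while threading
the per-node work (the cutoff, accuracy, and thread count are
placeholder values, not recommendations):

package omp 2                     # 2 OpenMP threads per MPI task
kspace_style pppm 1.0e-4          # long-range solver
pair_style lj/cut/coul/long 12.0  # longer cutoff = more pairwise work, less kspace work :pre

Running this with fewer MPI tasks per node and the -sf omp switch, and
comparing the "Pair time" and "KSpace time" breakdown against an
all-MPI run, shows whether the trade-off pays off for your system.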
Other performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule

[Restrictions:]

None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.

[Parallel efficiency and performance tips:]

In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the {omp} version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.

Using multi-threading is most effective under the following
circumstances:

Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes. :l

The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
"processors"_processors.html keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space. :l

Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part. :l,ule

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.

Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.

A description of the multi-threading strategy and some performance
examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.

:line
5.6 GPU package :h4,link(acc_6)

[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range

@ -542,6 +638,12 @@ of problem size and number of compute nodes.

5.7 USER-CUDA package :h4,link(acc_7)

[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via

@ -679,6 +781,12 @@ occurs, the faster your simulation will run.

5.8 KOKKOS package :h4,link(acc_8)

[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.

@ -971,6 +1079,12 @@ LAMMPS.

5.9 USER-INTEL package :h4,link(acc_9)

[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]

The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)