git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-08-27 20:52:54 +00:00
parent dc5ad107ad
commit 444053fa6c
2 changed files with 463 additions and 235 deletions


@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR> 5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR> 5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR> 5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR> 5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>
<HR> <HR>
@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.
<H4><A NAME = "acc_2"></A>5.2 General strategies <H4><A NAME = "acc_2"></A>5.2 General strategies
</H4> </H4>
<P>NOTE: this sub-section is still a work in progress <P>NOTE: this section is still a work in progress
</P> </P>
<P>Here is a list of general ideas for improving simulation performance. <P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate standard non-accelerated versions, if you have the appropriate
hardware on your system. hardware on your system.
</P> </P>
<P>All of these accelerated styles are provided in optional <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
<P>The accelerated styles have the same name as the standard styles, <P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the command is identical, their functionality is the same, and the
@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script. within your input script.
</P> </P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU <P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs. packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as The speed-up on a GPU depends on a variety of factors, as discussed
discussed below. below.
</P> </P>
<P>Styles with an "intel" suffix are part of the USER-INTEL <P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases, calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors. supports acceleration with offload to Intel(R) Xeon Phi(TM)
This can result in additional speedup over 2x depending on the coprocessors. This can result in additional speedup over 2x depending
hardware configuration. on the hardware configuration.
</P> </P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be <P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
on a variety of factors, as discussed below. The speed-up depends on a variety of factors, as discussed below.
</P> </P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow <P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication. overload the available bandwidth for communication.
</P> </P>
<P>Styles with an "opt" suffix are part of the OPT package and typically <P>Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%. speed-up the pairwise calculations of your simulation by 5-25% on a
</P> CPU.
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each indvidual style
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
list any accelerated variants available for that style.
</P> </P>
<P>The following sections explain: <P>The following sections explain:
</P> </P>
<UL><LI>what hardware and software the accelerated styles require <UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package in place <LI>how to build LAMMPS with the accelerated package
<LI>what changes (if any) are needed in your input scripts <LI>how to run an input script with the accelerated package
<LI>speed-ups to expect
<LI>guidelines for best performance <LI>guidelines for best performance
<LI>speed-ups you can expect <LI>restrictions
</UL> </UL>
<P>The final section compares and contrasts the GPU and USER-CUDA <P>The final section compares and contrasts the GPU, USER-CUDA, and
packages, since they are both designed to use NVIDIA hardware. KOKKOS packages, since they all allow for use of NVIDIA GPUs.
</P> </P>
<HR> <HR>
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code. due to if tests and other conditional code.
</P> </P>
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P> </P>
<PRE>lmp_machine -sf opt < in.script <PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script mpirun -np 4 lmp_machine -sf opt < in.script
</PRE> </PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P> </P>
<HR> <HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4> </H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple <P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles, University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B> <P><B>Building LAMMPS with the USER-OMP package:</B>
</P> </P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P> </P>
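<P>For example, with the GNU or Intel compilers, the relevant lines of a
hypothetical src/MAKE/Makefile.machine might look as follows (a sketch
only; the other flags shown depend on your compiler and existing
makefile settings):
</P>
<PRE>CCFLAGS = -g -O3 -fopenmp
LINKFLAGS = -g -O3 -fopenmp
</PRE>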
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix 1 all nve/omp
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
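<P>As a concrete illustration (assuming a hypothetical 16-core node),
you could use 4 MPI tasks with 4 OpenMP threads each, so that 4 x 4 =
16 matches the physical core count. With an MPI launcher that forwards
the environment variable (see the variants below), the command would
look like this:
</P>
<PRE>env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
</PRE>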
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE> </PRE>
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P> </P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
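<P>For example, a simple benchmark sweep on a hypothetical 16-core node
(using the MPICH-style launch syntax shown above) could compare the
following splittings of the 16 cores; pick the combination with the
lowest reported "Loop time":
</P>
<PRE>env OMP_NUM_THREADS=1 mpirun -np 16 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=8 mpirun -np 2 lmp_machine -sf omp -in in.script
</PRE>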
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances: circumstances:
</P> </P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores.
<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.
<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.
<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be used and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver; see the sketch after this list.
</UL> </UL>
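<P>As a sketch of the last point (hypothetical cutoff and accuracy
values; the styles shown are the USER-OMP variants of standard
long-range styles), the relevant input script lines might look as
follows, run with fewer MPI tasks and several OpenMP threads per task:
</P>
<PRE>pair_style lj/cut/coul/long/omp 12.0   # longer real-space cutoff shifts work away from KSpace
kspace_style pppm/omp 1.0e-4
</PRE>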
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P> </P>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P> </P>
<HR> <HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package <H4><A NAME = "acc_6"></A>5.6 GPU package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the GPU package:</B>
<B>Running with the GPU package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The GPU package was developed by Mike Brown at ORNL and his <P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles, collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range including the 3-body Stillinger-Weber pair style, and for long-range
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package <H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-CUDA package:</B>
<B>Running with the USER-CUDA package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-CUDA package was developed by Christian Trott at U Technology <P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via styles, many fixes, a few computes, and for long-range Coulombics via
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package <H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the KOKKOS package:</B>
<B>Running with the KOKKOS package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles <P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos. library, which is included with LAMMPS in lib/kokkos.
@ -975,6 +1083,12 @@ LAMMPS.
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package <H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4> </H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-INTEL package:</B>
<B>Running with the USER-INTEL package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-INTEL package was developed by Mike Brown at Intel <P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R) offloading neighbor list and non-bonded force calculations to Intel(R)


@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7 5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8 5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9 5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of GPU and USER-CUDA packages"_#acc_10 :all(b) 5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)
:line :line
:line :line
@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.
5.2 General strategies :h4,link(acc_2) 5.2 General strategies :h4,link(acc_2)
NOTE: this sub-section is still a work in progress NOTE: this section is still a work in progress
Here is a list of general ideas for improving simulation performance. Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate standard non-accelerated versions, if you have the appropriate
hardware on your system. hardware on your system.
All of these accelerated styles are provided in optional "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:
USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul
The accelerated styles have the same name as the standard styles, The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the command is identical, their functionality is the same, and the
@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script. within your input script.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.
Here is a brief summary of what the various packages provide. Details
are in individual sections below.
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs. packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as The speed-up on a GPU depends on a variety of factors, as discussed
discussed below. below.
Styles with an "intel" suffix are part of the USER-INTEL Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases, calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors. supports acceleration with offload to Intel(R) Xeon Phi(TM)
This can result in additional speedup over 2x depending on the coprocessors. This can result in additional speedup over 2x depending
hardware configuration. on the hardware configuration.
Styles with a "kk" suffix are part of the KOKKOS package, and can be Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
on a variety of factors, as discussed below. The speed-up depends on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication. overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%. speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each indvidual style
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
list any accelerated variants available for that style.
The following sections explain: The following sections explain:
what hardware and software the accelerated styles require what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package in place how to build LAMMPS with the accelerated package
what changes (if any) are needed in your input scripts how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance guidelines for best performance
speed-ups you can expect :ul restrictions :ul
The final section compares and contrasts the GPU and USER-CUDA The final section compares and contrasts the GPU, USER-CUDA, and
packages, since they are both designed to use NVIDIA hardware. KOKKOS packages, since they all allow for use of NVIDIA GPUs.
:line :line
@ -218,22 +232,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code. due to if tests and other conditional code.
[Required hardware/software:]
None.
[Building LAMMPS with the OPT package:]
Include the package and build LAMMPS.
make yes-opt
make machine :pre
No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
[Running with the OPT package:]
You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:
pair_style lj/cut/opt 2.5 :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf opt < in.script lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre mpirun -np 4 lmp_machine -sf opt < in.script :pre
[Speed-ups to expect:]
You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
[Guidelines for best performance:]
None. Just try out an OPT pair style to see how it performs.
[Restrictions:]
None.
:line :line
@ -241,118 +280,175 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles, University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
[Required hardware/software:]
Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:] [Building LAMMPS with the USER-OMP package:]
Include the package and build LAMMPS.
make yes-user-omp
make machine :pre
Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
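For example, with the GNU or Intel compilers, the relevant lines of a
hypothetical src/MAKE/Makefile.machine might look as follows (a sketch
only; the other flags shown depend on your compiler and existing
makefile settings):
CCFLAGS = -g -O3 -fopenmp
LINKFLAGS = -g -O3 -fopenmp :pre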
[Running with the USER-OMP package:]
You can explicitly add an "omp" suffix to any supported style in your
input script:
pair_style lj/cut/omp 2.5
fix 1 all nve/omp :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre
You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
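As a concrete illustration (assuming a hypothetical 16-core node),
you could use 4 MPI tasks with 4 OpenMP threads each, so that 4 x 4 =
16 matches the physical core count. With an MPI launcher that forwards
the environment variable (see the variants below), the command would
look like this:
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script :pre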
a) You can set an environment variable, either in your shell
or its start-up script:
setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre
This value will apply to all subsequent runs you perform.
b) You can set the same environment variable when you launch LAMMPS:
env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
All three examples use a total of 4 CPU cores.
Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
c) Use the "package omp"_package.html command near the top of your
script:
package omp 4 :pre
[Speed-ups to expect:]
Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
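For example, a simple benchmark sweep on a hypothetical 16-core node
(using the MPICH-style launch syntax shown above) could compare the
following splittings of the 16 cores; pick the combination with the
lowest reported "Loop time":
env OMP_NUM_THREADS=1 mpirun -np 16 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=8 mpirun -np 2 lmp_machine -sf omp -in in.script :pre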
A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
[Guidelines for best performance:]
For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l
The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l
A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the "kspace
style"_kspace_style.html can become the the performance-limiting
factor. Using multi-threading allows less MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise additional speedup can be sometimes be achived by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver. :l,ule
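As a sketch of the last point (hypothetical cutoff and accuracy
values; the styles shown are the USER-OMP variants of standard
long-range styles), the relevant input script lines might look as
follows, run with fewer MPI tasks and several OpenMP threads per task:
pair_style lj/cut/coul/long/omp 12.0   # longer real-space cutoff shifts work away from KSpace
kspace_style pppm/omp 1.0e-4 :pre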
Other performance tips are as follows:
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l
Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule
[Restrictions:]
None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.
[Parallel efficiency and performance tips:]
In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the {omp} version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
Using multi-threading is most effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes. :l
The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
"processors"_processors.html keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space. :l
Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part. :l,ule
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
A description of the multi-threading strategy and some performance
examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
:line :line
5.6 GPU package :h4,link(acc_6) 5.6 GPU package :h4,link(acc_6)
[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The GPU package was developed by Mike Brown at ORNL and his The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles, collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range including the 3-body Stillinger-Weber pair style, and for long-range
@ -542,6 +638,12 @@ of problem size and number of compute nodes.
5.7 USER-CUDA package :h4,link(acc_7) 5.7 USER-CUDA package :h4,link(acc_7)
[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-CUDA package was developed by Christian Trott at U Technology The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via styles, many fixes, a few computes, and for long-range Coulombics via
@ -679,6 +781,12 @@ occurs, the faster your simulation will run.
5.8 KOKKOS package :h4,link(acc_8) 5.8 KOKKOS package :h4,link(acc_8)
[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The KOKKOS package contains versions of pair, fix, and atom styles The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos. library, which is included with LAMMPS in lib/kokkos.
@ -971,6 +1079,12 @@ LAMMPS.
5.9 USER-INTEL package :h4,link(acc_9) 5.9 USER-INTEL package :h4,link(acc_9)
[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-INTEL package was developed by Mike Brown at Intel The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R) offloading neighbor list and non-bonded force calculations to Intel(R)