git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12374 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-08-27 20:52:54 +00:00
parent dc5ad107ad
commit 444053fa6c
2 changed files with 463 additions and 235 deletions


@ -26,7 +26,7 @@ kinds of machines.
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR>
5.10 <A HREF = "#acc_10">Comparison of USER-CUDA, GPU, and KOKKOS packages</A> <BR>
<HR>
@ -82,7 +82,7 @@ LAMMPS, to obtain synchronized timings.
<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this sub-section is still a work in progress
<P>NOTE: this section is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -142,6 +142,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
</P>
<P>All of these commands are in <A HREF = "Section_packages.html">packages</A>.
Currently, there are 6 such packages in LAMMPS:
</P>
<UL><LI>USER-CUDA: for NVIDIA GPUs
<LI>GPU: for NVIDIA GPUs as well as OpenCL support
<LI>USER-INTEL: for Intel CPUs and Intel Xeon Phi
<LI>KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
<LI>USER-OMP: for OpenMP threading
<LI>OPT: generic CPU optimizations
</UL>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -167,22 +177,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
</P>
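<P>For example (a minimal sketch, assuming you launch LAMMPS with "-sf
omp" and have the USER-OMP package installed), the <A HREF = "suffix.html">suffix</A>
command can bracket a portion of the script that should use the plain,
unaccelerated styles:
</P>
<PRE>suffix off
pair_style lj/cut 2.5     # created as the plain lj/cut style
suffix on
fix 1 all nve             # again honors the -sf command-line suffix
</PRE>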
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. The doc page for each individual style (e.g. <A HREF = "pair_lj.html">pair
lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) also lists any
accelerated variants available for that style.
</P>
<P>Here is a brief summary of what the various packages provide. Details
are in individual sections below.
</P>
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
</P>
<P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
</P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -192,25 +211,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
</P>
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each individual style
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
list any accelerated variants available for that style.
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to build LAMMPS with the accelerated package in place
<LI>what changes (if any) are needed in your input scripts
<UL><LI>what hardware and software the accelerated package requires
<LI>how to build LAMMPS with the accelerated package
<LI>how to run an input script with the accelerated package
<LI>speed-ups to expect
<LI>guidelines for best performance
<LI>speed-ups you can expect
<LI>restrictions
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA hardware.
<P>The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
</P>
<HR>
@ -222,22 +236,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles, you can run it
as follows:
<P>No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
</P>
<P><B>Running with the OPT package:</B>
</P>
<P>You can explicitly add an "opt" suffix to the
<A HREF = "pair_style.html">pair_style</A> command in your input script:
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P>
<HR>
@ -245,118 +284,175 @@ to 20% savings.
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
<I>-fopenmp</I>. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
<P>Include the package and build LAMMPS.
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
<P>Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is <I>-fopenmp</I>. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
</P>
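<P>A minimal sketch of the relevant Makefile.machine lines (the -g and
-O2 settings are placeholders; keep whatever flags your makefile
already uses and simply append the OpenMP flag):
</P>
<PRE>CCFLAGS =   -g -O2 -fopenmp
LINKFLAGS = -g -O2 -fopenmp
</PRE>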
<PRE>env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
<P><B>Running with the USER-OMP package:</B>
</P>
<P>You can explicitly add an "omp" suffix to any supported style in your
input script:
</P>
<PRE>pair_style lj/cut/omp 2.5
fix nve/omp
</PRE>
<P>The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the <A HREF = "package.html">package
omp</A> command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
<P>Or you can run with the -sf <A HREF = "Section_start.html#start_7">command-line
switch</A>, which will automatically append
"opt" to styles that support it.
</P>
<P><B>Restrictions:</B>
<PRE>lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script
</PRE>
<P>You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
</P>
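<P>For example (a sketch assuming a node with 8 physical cores and an MPI
that forwards the environment, as discussed below), both of these launch
lines keep MPI tasks * threads/task equal to 8:
</P>
<PRE>env OMP_NUM_THREADS=4 mpirun -np 2 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 4 lmp_machine -sf omp -in in.script
</PRE>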
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.
<P>a) You can set an environment variable, either in your shell
or its start-up script:
</P>
<P><B>Parallel efficiency and performance tips:</B>
<PRE>setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash)
</PRE>
<P>This value will apply to all subsequent runs you perform.
</P>
<P>In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the <I>omp</I> version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
<P>b) You can set the same environment variable when you launch LAMMPS:
</P>
<P>Using multi-threading is most effective under the following
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
<P>All three examples use a total of 4 CPU cores.
</P>
<P>Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
</P>
<P>c) Use the <A HREF = "package.html">package omp</A> command near the top of your
script:
</P>
<PRE>package omp 4
</PRE>
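<P>If you prefer to keep the entire setting in the input script (a
sketch, not required if you use the -sf switch), the <A HREF = "suffix.html">suffix</A>
command can select the omp variants as well:
</P>
<PRE>package omp 4
suffix omp
pair_style lj/cut 2.5     # now resolves to lj/cut/omp
</PRE>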
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
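<P>One way to do such a scan (a hypothetical bash sketch, assuming a
16-core node and an executable named lmp_machine) is to sweep the
thread count while keeping tasks * threads constant:
</P>
<PRE>for t in 1 2 4 8; do
  env OMP_NUM_THREADS=$t mpirun -np $((16/t)) lmp_machine -sf omp -in in.script
done
</PRE>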
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
offers an additional speedup by utilizing the otherwise idle CPU
cores.
<LI>The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes.
<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.
<LI>The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
<A HREF = "processors.html">processors</A> keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space.
<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.
<LI>Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part.
<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the <A HREF = "kspace_style.html">kspace
style</A> can become the performance-limiting
factor. Using multi-threading allows fewer MPI tasks to be used and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP (see the launch sketch after this list). Likewise, additional
speedup can sometimes be achieved by increasing the length of the
Coulombic cutoff and thus reducing the
work done by the long-range solver.
</UL>
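<P>As a sketch of the capability-mode scenario in the last bullet above
(assuming OpenMPI and 64 nodes with 8 cores each; the -npernode
placement option and the counts are assumptions), one MPI task per node
handles the long-range solver while OpenMP threads fill the remaining
cores:
</P>
<PRE>mpirun -np 64 -npernode 1 -x OMP_NUM_THREADS=8 lmp_machine -sf omp -in in.script
</PRE>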
<P>The best parallel efficiency from <I>omp</I> styles is typically achieved
<P>Other performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
i.e. socket or die.
<LI>Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading.
</UL>
<P><B>Restrictions:</B>
</P>
<P>Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
</P>
<P>A description of the multi-threading strategy and some performance
examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for <A HREF = "run_style.html">rRESPA integration</A>.
Only the rRESPA "pair" option is supported.
</P>
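<P>For reference (a sketch of the supported usage), a two-level rRESPA
setup that keeps the pairwise interactions at the outermost level looks
like this:
</P>
<PRE>run_style respa 2 8 bond 1 pair 2
</PRE>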
<HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the GPU package:</B>
<B>Running with the GPU package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -546,6 +642,12 @@ of problem size and number of compute nodes.
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-CUDA package:</B>
<B>Running with the USER-CUDA package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -683,6 +785,12 @@ occurs, the faster your simulation will run.
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the KOKKOS package:</B>
<B>Running with the KOKKOS package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -975,6 +1083,12 @@ LAMMPS.
<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4>
<P><B>Required hardware/software:</B>
<B>Building LAMMPS with the USER-INTEL package:</B>
<B>Running with the USER-INTEL package:</B>
<B>Guidelines for best performance:</B>
<B>Speed-ups to expect:</B>
</P>
<P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)


@ -23,7 +23,7 @@ kinds of machines.
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of GPU and USER-CUDA packages"_#acc_10 :all(b)
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)
:line
:line
@ -78,7 +78,7 @@ LAMMPS, to obtain synchronized timings.
5.2 General strategies :h4,link(acc_2)
NOTE: this sub-section is still a work in progress
NOTE: this section is still a work in progress
Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
@ -138,6 +138,16 @@ been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:
USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul
The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
@ -163,22 +173,31 @@ automatically, without changing your input script. The
to turn off and back on the command-line switch setting, both from
within your input script.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.
Here is a brief summary of what the various packages provide. Details
are in individual sections below.
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
The speed-up on a GPU depends on a variety of factors, as discussed
below.
Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors. This can result in additional speedup over 2x depending
on the hardware configuration.
Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP. This can
@ -188,25 +207,20 @@ are run on fewer MPI processors or when the many MPI tasks would
overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%.
To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each individual style
(e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also
list any accelerated variants available for that style.
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.
The following sections explain:
what hardware and software the accelerated styles require
how to build LAMMPS with the accelerated package in place
what changes (if any) are needed in your input scripts
what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package
how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance
speed-ups you can expect :ul
restrictions :ul
The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA hardware.
The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.
:line
@ -218,22 +232,47 @@ Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
[Required hardware/software:]
None.
[Building LAMMPS with the OPT package:]
Include the package and build LAMMPS.
make yes-opt
make machine :pre
If your input script uses one of the OPT pair styles, you can run it
as follows:
No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.
[Running with the OPT package:]
You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:
pair_style lj/cut/opt 2.5 :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre
You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
[Speed-ups to expect:]
You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
[Guidelines for best performance:]
None. Just try out an OPT pair style to see how it performs.
[Restrictions:]
None.
:line
@ -241,118 +280,175 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
[Required hardware/software:]
Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:]
The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine specific makefile to add the flag to
enable OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU compilers and Intel compilers, this flag is called
{-fopenmp}. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
Include the package and build LAMMPS.
make yes-user-omp
make machine :pre
If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables. For GNU and
Intel compilers, this flag is {-fopenmp}. Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.
env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
[Running with the USER-OMP package:]
You can explicitly add an "omp" suffix to any supported style in your
input script:
pair_style lj/cut/omp 2.5
fix nve/omp :pre
Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.
lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre
You must also specify how many threads to use per MPI task. There are
several ways to do this. Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance. Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.
a) You can set an environment variable, either in your shell
or its start-up script:
setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre
This value will apply to all subsequent runs you perform.
b) You can set the same environment variable when you launch LAMMPS:
env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre
The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one for MPICH and OpenMPI,
respectively are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the "package
omp"_package.html command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
All three examples use a total of 4 CPU cores.
Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes. The first
variant above is for MPICH, the second is for OpenMPI. Check the
documentation of your MPI installation for additional details.
c) Use the "package omp"_package.html command near the top of your
script:
package omp 4 :pre
[Speed-ups to expect:]
Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core). This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.
With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
[Guidelines for best performance:]
For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l
The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l
A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the "kspace
style"_kspace_style.html can become the the performance-limiting
factor. Using multi-threading allows less MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP. Likewise additional speedup can be sometimes be achived by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver. :l,ule
Other performance tips are as follows:
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l
Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule
[Restrictions:]
None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration, only the "pair"
option is supported.
[Parallel efficiency and performance tips:]
In most simple cases the MPI parallelization in LAMMPS is more
efficient than multi-threading implemented in the USER-OMP package.
Also the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the {omp} version
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in serial speedup.
Using multi-threading is most effective under the following
circumstances:
Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per nodes
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers and additional speedup from utilizing the otherwise idle CPU
cores. :ulb,l
The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. Same as in the aforementioned case this
effect worsens with using an increasing number of nodes. :l
The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be to some degree alleviated through using the
"processors"_processors.html keyword, multi-threading provides a
parallelism that parallelizes over the number of particles not their
distribution in space. :l
Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using as
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance limiting factor and using
multi-threaded styles allows to operate the kspace style at the limit
of scaling and then increase performance parallelizing the real space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achived by increasing the real-space coulomb cutoff and thus
reducing the work in the kspace part. :l,ule
The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
A description of the multi-threading strategy and some performance
examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.
:line
5.6 GPU package :h4,link(acc_6)
[Required hardware/software:]
[Building LAMMPS with the GPU package:]
[Running with the GPU package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
@ -542,6 +638,12 @@ of problem size and number of compute nodes.
5.7 USER-CUDA package :h4,link(acc_7)
[Required hardware/software:]
[Building LAMMPS with the USER-CUDA package:]
[Running with the USER-CUDA package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
@ -679,6 +781,12 @@ occurs, the faster your simulation will run.
5.8 KOKKOS package :h4,link(acc_8)
[Required hardware/software:]
[Building LAMMPS with the KOKKOS package:]
[Running with the KOKKOS package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
@ -971,6 +1079,12 @@ LAMMPS.
5.9 USER-INTEL package :h4,link(acc_9)
[Required hardware/software:]
[Building LAMMPS with the USER-INTEL package:]
[Running with the USER-INTEL package:]
[Guidelines for best performance:]
[Speed-ups to expect:]
The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)