git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa
parent f864979cdd
commit e8780fc49d
|
@ -0,0 +1,208 @@
|
|||
<HTML>
|
||||
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
|
||||
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
|
||||
</CENTER>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<HR>
|
||||
|
||||
<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
|
||||
</P>
|
||||
<H4>5.3.1 USER-CUDA package
|
||||
</H4>
|
||||
<P>The USER-CUDA package was developed by Christian Trott (Sandia) while
|
||||
at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions
|
||||
of many pair styles, many fixes, a few computes, and of long-range
|
||||
Coulombics via the PPPM command. It has the following general
|
||||
features:
|
||||
</P>
|
||||
<UL><LI>The package is designed to allow an entire LAMMPS calculation, for
|
||||
many timesteps, to run entirely on the GPU (except for inter-processor
|
||||
MPI communication), so that atom-based data (e.g. coordinates, forces)
|
||||
do not have to move back-and-forth between the CPU and GPU.
|
||||
|
||||
<LI>The speed-up advantage of this approach is typically better when the
|
||||
number of atoms per GPU is large
|
||||
|
||||
<LI>Data will stay on the GPU until a timestep where a non-USER-CUDA fix
|
||||
or compute is invoked. Whenever a non-GPU operation occurs (fix,
|
||||
compute, output), data automatically moves back to the CPU as needed.
|
||||
This may incur a performance penalty, but should otherwise work
|
||||
transparently.
|
||||
|
||||
<LI>Neighbor lists are constructed on the GPU.
|
||||
|
||||
<LI>The package only supports use of a single MPI task, running on a
|
||||
single CPU (core), assigned to each GPU.
|
||||
</UL>
|
||||
<P>Here is a quick overview of how to use the USER-CUDA package:
|
||||
</P>
|
||||
<UL><LI>build the library in lib/cuda for your GPU hardware with desired precision
|
||||
<LI>include the USER-CUDA package and build LAMMPS
|
||||
<LI>use the mpirun command to specify 1 MPI task per GPU (on each node)
|
||||
<LI>enable the USER-CUDA package via the "-c on" command-line switch
|
||||
<LI>specify the # of GPUs per node
|
||||
<LI>use USER-CUDA styles in your input script
|
||||
</UL>
|
||||
<P>The latter two steps can be done using the "-pk cuda" and "-sf cuda"
|
||||
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
|
||||
the effect of the "-pk" or "-sf" switches can be duplicated by adding
|
||||
the <A HREF = "package.html">package cuda</A> or <A HREF = "suffix.html">suffix cuda</A> commands
|
||||
respectively to your input script.
|
||||
</P>
|
||||
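<P>For example, a minimal input-script equivalent (one GPU per node is
an illustrative choice; the "-c on" command-line switch is still
required) would be to put these two commands near the top of the
script:
</P>
<PRE>package cuda 1
suffix cuda
</PRE>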
<P><B>Required hardware/software:</B>
|
||||
</P>
|
||||
<P>To use this package, you need to have one or more NVIDIA GPUs and
|
||||
install the NVIDIA Cuda software on your system:
|
||||
</P>
|
||||
<P>Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
|
||||
help you to find out the Compute Capability of your card:
|
||||
</P>
|
||||
<P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
|
||||
</P>
|
||||
<P>Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
|
||||
corresponding GPU drivers. The Nvidia Cuda SDK is not required, but
|
||||
we recommend it also be installed. You can then make sure its sample
|
||||
projects can be compiled without problems.
|
||||
</P>
|
||||
<P><B>Building LAMMPS with the USER-CUDA package:</B>
|
||||
</P>
|
||||
<P>This requires two steps (a,b): build the USER-CUDA library, then build
|
||||
LAMMPS with the USER-CUDA package.
|
||||
</P>
|
||||
<P>(a) Build the USER-CUDA library
|
||||
</P>
|
||||
<P>The USER-CUDA library is in lammps/lib/cuda. If your <I>CUDA</I> toolkit
|
||||
is not installed in the default system directory <I>/usr/local/cuda</I>, edit
|
||||
the file <I>lib/cuda/Makefile.common</I> accordingly.
|
||||
</P>
|
||||
<P>To set options for the library build, type "make OPTIONS", where
|
||||
<I>OPTIONS</I> are one or more of the following. The settings will be
|
||||
written to the <I>lib/cuda/Makefile.defaults</I> and used when
|
||||
the library is built.
|
||||
</P>
|
||||
<PRE><I>precision=N</I> to set the precision level
|
||||
N = 1 for single precision (default)
|
||||
N = 2 for double precision
|
||||
N = 3 for positions in double precision
|
||||
N = 4 for positions and velocities in double precision
|
||||
<I>arch=M</I> to set GPU compute capability
|
||||
M = 35 for Kepler GPUs
|
||||
M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
|
||||
M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
|
||||
M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
|
||||
<I>prec_timer=0/1</I> to use hi-precision timers
|
||||
0 = do not use them (default)
|
||||
1 = use them
|
||||
this is usually only useful for Mac machines
|
||||
<I>dbg=0/1</I> to activate debug mode
|
||||
0 = no debug mode (default)
|
||||
1 = yes debug mode
|
||||
this is only useful for developers
|
||||
<I>cufft=0/1</I> for use of the CUDA FFT library
|
||||
0 = no CUFFT support (default)
|
||||
in the future other CUDA-enabled FFT libraries might be supported
|
||||
</PRE>
|
||||
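<P>For example, to select double precision and a Kepler GPU (values
chosen purely for illustration), the build command would presumably be:
</P>
<PRE>make precision=2 arch=35
</PRE>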
<P>To build the library, simply type:
|
||||
</P>
|
||||
<PRE>make
|
||||
</PRE>
|
||||
<P>If successful, it will produce the files libcuda.a and Makefile.lammps.
|
||||
</P>
|
||||
<P>Note that if you change any of the options (like precision), you need
|
||||
to re-build the entire library. Do a "make clean" first, followed by
|
||||
"make".
|
||||
</P>
|
||||
<P>(b) Build LAMMPS with the USER-CUDA package
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-user-cuda
|
||||
make machine
|
||||
</PRE>
|
||||
<P>No additional compile/link flags are needed in your Makefile.machine
|
||||
in src/MAKE.
|
||||
</P>
|
||||
<P>Note that if you change the USER-CUDA library precision (discussed
|
||||
above) and rebuild the USER-CUDA library, then you also need to
|
||||
re-install the USER-CUDA package and re-build LAMMPS, so that all
|
||||
affected files are re-compiled and linked to the new USER-CUDA
|
||||
library.
|
||||
</P>
|
||||
<P><B>Run with the USER-CUDA package from the command line:</B>
|
||||
</P>
|
||||
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
</P>
|
||||
<P>When using the USER-CUDA package, you must use exactly one MPI task
|
||||
per physical GPU.
|
||||
</P>
|
||||
<P>You must use the "-c on" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> to enable the USER-CUDA package.
|
||||
</P>
|
||||
<P>Use the "-sf cuda" <A HREF = "Section_start.html#start_7">command-line switch</A>,
|
||||
which will automatically append "cuda" to styles that support it. Use
|
||||
the "-pk cuda Ng" <A HREF = "Section_start.html#start_7">command-line switch</A> to
|
||||
set Ng = # of GPUs per node.
|
||||
</P>
|
||||
<PRE>lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU
|
||||
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
|
||||
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes
|
||||
</PRE>
|
||||
<P>The "-pk" switch must be used (unless the <A HREF = "package.html">package cuda</A>
|
||||
command is used in the input script) to set the number of GPUs/node to
|
||||
use. It also allows for setting of additional options. Its syntax is
|
||||
the same as the "package cuda" command. See the
|
||||
<A HREF = "package.html">package</A> command doc page for details.
|
||||
</P>
|
||||
<P><B>Or run with the USER-CUDA package by editing an input script:</B>
|
||||
</P>
|
||||
<P>The discussion above for the mpirun/mpiexec command and the requirement
|
||||
of one MPI task per GPU is the same.
|
||||
</P>
|
||||
<P>You must still use the "-c on" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> to enable the USER-CUDA package.
|
||||
</P>
|
||||
<P>Use the <A HREF = "suffix.html">suffix cuda</A> command, or you can explicitly add a
|
||||
"cuda" suffix to individual styles in your input script, e.g.
|
||||
</P>
|
||||
<PRE>pair_style lj/cut/cuda 2.5
|
||||
</PRE>
|
||||
<P>You must use the <A HREF = "package.html">package cuda</A> command to set the
|
||||
number of GPUs/node, unless the "-pk" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> was used. The command also
|
||||
allows for setting of additional options.
|
||||
</P>
|
||||
<P><B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The performance of a GPU versus a multi-core CPU is a function of your
|
||||
hardware, which pair style is used, the number of atoms/GPU, and the
|
||||
precision used on the GPU (double, single, mixed).
|
||||
</P>
|
||||
<P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
|
||||
LAMMPS web site for performance of the USER-CUDA package on different
|
||||
hardware.
|
||||
</P>
|
||||
<P><B>Guidelines for best performance:</B>
|
||||
</P>
|
||||
<UL><LI>The USER-CUDA package offers more speed-up relative to CPU performance
|
||||
when the number of atoms per GPU is large, e.g. on the order of tens
|
||||
or hundreds of thousands.
|
||||
|
||||
<LI>As noted above, this package will continue to run a simulation
|
||||
entirely on the GPU(s) (except for inter-processor MPI communication),
|
||||
for multiple timesteps, until a CPU calculation is required, either by
|
||||
a fix or compute that is non-GPU-ized, or until output is performed
|
||||
(thermo or dump snapshot or restart file). The less often this
|
||||
occurs, the faster your simulation will run.
|
||||
</UL>
|
||||
<P><B>Restrictions:</B>
|
||||
</P>
|
||||
<P>None.
|
||||
</P>
|
||||
</HTML>
|
|
@ -0,0 +1,203 @@
|
|||
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
||||
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
||||
|
||||
:link(lws,http://lammps.sandia.gov)
|
||||
:link(ld,Manual.html)
|
||||
:link(lc,Section_commands.html#comm)
|
||||
|
||||
:line
|
||||
|
||||
"Return to Section accelerate overview"_Section_accelerate.html
|
||||
|
||||
5.3.1 USER-CUDA package :h4
|
||||
|
||||
The USER-CUDA package was developed by Christian Trott (Sandia) while
|
||||
at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions
|
||||
of many pair styles, many fixes, a few computes, and of long-range
|
||||
Coulombics via the PPPM command. It has the following general
|
||||
features:
|
||||
|
||||
The package is designed to allow an entire LAMMPS calculation, for
|
||||
many timesteps, to run entirely on the GPU (except for inter-processor
|
||||
MPI communication), so that atom-based data (e.g. coordinates, forces)
|
||||
do not have to move back-and-forth between the CPU and GPU. :ulb,l
|
||||
|
||||
The speed-up advantage of this approach is typically better when the
|
||||
number of atoms per GPU is large :l
|
||||
|
||||
Data will stay on the GPU until a timestep where a non-USER-CUDA fix
|
||||
or compute is invoked. Whenever a non-GPU operation occurs (fix,
|
||||
compute, output), data automatically moves back to the CPU as needed.
|
||||
This may incur a performance penalty, but should otherwise work
|
||||
transparently. :l
|
||||
|
||||
Neighbor lists are constructed on the GPU. :l
|
||||
|
||||
The package only supports use of a single MPI task, running on a
|
||||
single CPU (core), assigned to each GPU. :l,ule
|
||||
|
||||
Here is a quick overview of how to use the USER-CUDA package:
|
||||
|
||||
build the library in lib/cuda for your GPU hardware with desired precision
|
||||
include the USER-CUDA package and build LAMMPS
|
||||
use the mpirun command to specify 1 MPI task per GPU (on each node)
|
||||
enable the USER-CUDA package via the "-c on" command-line switch
|
||||
specify the # of GPUs per node
|
||||
use USER-CUDA styles in your input script :ul
|
||||
|
||||
The latter two steps can be done using the "-pk cuda" and "-sf cuda"
|
||||
"command-line switches"_Section_start.html#start_7 respectively. Or
|
||||
the effect of the "-pk" or "-sf" switches can be duplicated by adding
|
||||
the "package cuda"_package.html or "suffix cuda"_suffix.html commands
|
||||
respectively to your input script.
|
||||
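For example, a minimal input-script equivalent (one GPU per node is
an illustrative choice; the "-c on" command-line switch is still
required) would be to put these two commands near the top of the
script:

package cuda 1
suffix cuda :pre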
|
||||
[Required hardware/software:]
|
||||
|
||||
To use this package, you need to have one or more NVIDIA GPUs and
|
||||
install the NVIDIA Cuda software on your system:
|
||||
|
||||
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
|
||||
help you to find out the Compute Capability of your card:
|
||||
|
||||
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
|
||||
|
||||
Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
|
||||
corresponding GPU drivers. The Nvidia Cuda SDK is not required, but
|
||||
we recommend it also be installed. You can then make sure its sample
|
||||
projects can be compiled without problems.
|
||||
|
||||
[Building LAMMPS with the USER-CUDA package:]
|
||||
|
||||
This requires two steps (a,b): build the USER-CUDA library, then build
|
||||
LAMMPS with the USER-CUDA package.
|
||||
|
||||
(a) Build the USER-CUDA library
|
||||
|
||||
The USER-CUDA library is in lammps/lib/cuda. If your {CUDA} toolkit
|
||||
is not installed in the default system directory {/usr/local/cuda}, edit
|
||||
the file {lib/cuda/Makefile.common} accordingly.
|
||||
|
||||
To set options for the library build, type "make OPTIONS", where
|
||||
{OPTIONS} are one or more of the following. The settings will be
|
||||
written to the {lib/cuda/Makefile.defaults} and used when
|
||||
the library is built.
|
||||
|
||||
{precision=N} to set the precision level
|
||||
N = 1 for single precision (default)
|
||||
N = 2 for double precision
|
||||
N = 3 for positions in double precision
|
||||
N = 4 for positions and velocities in double precision
|
||||
{arch=M} to set GPU compute capability
|
||||
M = 35 for Kepler GPUs
|
||||
M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
|
||||
M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
|
||||
M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
|
||||
{prec_timer=0/1} to use hi-precision timers
|
||||
0 = do not use them (default)
|
||||
1 = use them
|
||||
this is usually only useful for Mac machines
|
||||
{dbg=0/1} to activate debug mode
|
||||
0 = no debug mode (default)
|
||||
1 = yes debug mode
|
||||
this is only useful for developers
|
||||
{cufft=0/1} for use of the CUDA FFT library
|
||||
0 = no CUFFT support (default)
|
||||
in the future other CUDA-enabled FFT libraries might be supported :pre
|
||||
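For example, to select double precision and a Kepler GPU (values
chosen purely for illustration), the build command would presumably be:

make precision=2 arch=35 :pre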
|
||||
To build the library, simply type:
|
||||
|
||||
make :pre
|
||||
|
||||
If successful, it will produce the files libcuda.a and Makefile.lammps.
|
||||
|
||||
Note that if you change any of the options (like precision), you need
|
||||
to re-build the entire library. Do a "make clean" first, followed by
|
||||
"make".
|
||||
|
||||
(b) Build LAMMPS with the USER-CUDA package
|
||||
|
||||
cd lammps/src
|
||||
make yes-user-cuda
|
||||
make machine :pre
|
||||
|
||||
No additional compile/link flags are needed in your Makefile.machine
|
||||
in src/MAKE.
|
||||
|
||||
Note that if you change the USER-CUDA library precision (discussed
|
||||
above) and rebuild the USER-CUDA library, then you also need to
|
||||
re-install the USER-CUDA package and re-build LAMMPS, so that all
|
||||
affected files are re-compiled and linked to the new USER-CUDA
|
||||
library.
|
||||
|
||||
[Run with the USER-CUDA package from the command line:]
|
||||
|
||||
The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
|
||||
When using the USER-CUDA package, you must use exactly one MPI task
|
||||
per physical GPU.
|
||||
|
||||
You must use the "-c on" "command-line
|
||||
switch"_Section_start.html#start_7 to enable the USER-CUDA package.
|
||||
|
||||
Use the "-sf cuda" "command-line switch"_Section_start.html#start_7,
|
||||
which will automatically append "cuda" to styles that support it. Use
|
||||
the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to
|
||||
set Ng = # of GPUs per node.
|
||||
|
||||
lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU
|
||||
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
|
||||
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes :pre
|
||||
|
||||
The "-pk" switch must be used (unless the "package cuda"_package.html
|
||||
command is used in the input script) to set the number of GPUs/node to
|
||||
use. It also allows for setting of additional options. Its syntax is
|
||||
the same as the "package cuda" command. See the
|
||||
"package"_package.html command doc page for details.
|
||||
|
||||
[Or run with the USER-CUDA package by editing an input script:]
|
||||
|
||||
The discussion above for the mpirun/mpiexec command and the requirement
|
||||
of one MPI task per GPU is the same.
|
||||
|
||||
You must still use the "-c on" "command-line
|
||||
switch"_Section_start.html#start_7 to enable the USER-CUDA package.
|
||||
|
||||
Use the "suffix cuda"_suffix.html command, or you can explicitly add a
|
||||
"cuda" suffix to individual styles in your input script, e.g.
|
||||
|
||||
pair_style lj/cut/cuda 2.5 :pre
|
||||
|
||||
You must use the "package cuda"_package.html command to set the
|
||||
number of GPUs/node, unless the "-pk" "command-line
|
||||
switch"_Section_start.html#start_7 was used. The command also
|
||||
allows for setting of additional options.
|
||||
|
||||
[Speed-ups to expect:]
|
||||
|
||||
The performance of a GPU versus a multi-core CPU is a function of your
|
||||
hardware, which pair style is used, the number of atoms/GPU, and the
|
||||
precision used on the GPU (double, single, mixed).
|
||||
|
||||
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
||||
LAMMPS web site for performance of the USER-CUDA package on different
|
||||
hardware.
|
||||
|
||||
[Guidelines for best performance:]
|
||||
|
||||
The USER-CUDA package offers more speed-up relative to CPU performance
|
||||
when the number of atoms per GPU is large, e.g. on the order of tens
|
||||
or hundreds of thousands. :ulb,l
|
||||
|
||||
As noted above, this package will continue to run a simulation
|
||||
entirely on the GPU(s) (except for inter-processor MPI communication),
|
||||
for multiple timesteps, until a CPU calculation is required, either by
|
||||
a fix or compute that is non-GPU-ized, or until output is performed
|
||||
(thermo or dump snapshot or restart file). The less often this
|
||||
occurs, the faster your simulation will run. :l,ule
|
||||
|
||||
[Restrictions:]
|
||||
|
||||
None.
|
|
@ -0,0 +1,247 @@
|
|||
<HTML>
|
||||
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
|
||||
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
|
||||
</CENTER>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<HR>
|
||||
|
||||
<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
|
||||
</P>
|
||||
<H4>5.3.2 GPU package
|
||||
</H4>
|
||||
<P>The GPU package was developed by Mike Brown at ORNL and his
|
||||
collaborators, particularly Trung Nguyen (ORNL). It provides GPU
|
||||
versions of many pair styles, including the 3-body Stillinger-Weber
|
||||
pair style, and of <A HREF = "kspace_style.html">kspace_style pppm</A> for
|
||||
long-range Coulombics. It has the following general features:
|
||||
</P>
|
||||
<UL><LI>It is designed to exploit common GPU hardware configurations where one
|
||||
or more GPUs are coupled to many cores of one or more multi-core CPUs,
|
||||
e.g. within a node of a parallel machine.
|
||||
|
||||
<LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
|
||||
between the CPU(s) and GPU every timestep.
|
||||
|
||||
<LI>Neighbor lists can be built on the CPU or on the GPU
|
||||
|
||||
<LI>The charge assignment and force interpolation portions of PPPM can be
|
||||
run on the GPU. The FFT portion, which requires MPI communication
|
||||
between processors, runs on the CPU.
|
||||
|
||||
<LI>Asynchronous force computations can be performed simultaneously on the
|
||||
CPU(s) and GPU.
|
||||
|
||||
<LI>It allows for GPU computations to be performed in single or double
|
||||
precision, or in mixed-mode precision, where pairwise forces are
|
||||
computed in single precision, but accumulated into double-precision
|
||||
force vectors.
|
||||
|
||||
<LI>LAMMPS-specific code is in the GPU package. It makes calls to a
|
||||
generic GPU library in the lib/gpu directory. This library provides
|
||||
NVIDIA support as well as more general OpenCL support, so that the
|
||||
same functionality can eventually be supported on a variety of GPU
|
||||
hardware.
|
||||
</UL>
|
||||
<P>Here is a quick overview of how to use the GPU package:
|
||||
</P>
|
||||
<UL><LI>build the library in lib/gpu for your GPU hardware with desired precision
|
||||
<LI>include the GPU package and build LAMMPS
|
||||
<LI>use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
|
||||
<LI>specify the # of GPUs per node
|
||||
<LI>use GPU styles in your input script
|
||||
</UL>
|
||||
<P>The latter two steps can be done using the "-pk gpu" and "-sf gpu"
|
||||
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
|
||||
the effect of the "-pk" or "-sf" switches can be duplicated by adding
|
||||
the <A HREF = "package.html">package gpu</A> or <A HREF = "suffix.html">suffix gpu</A> commands
|
||||
respectively to your input script.
|
||||
</P>
|
||||
<P><B>Required hardware/software:</B>
|
||||
</P>
|
||||
<P>To use this package, you currently need to have an NVIDIA GPU and
|
||||
install the NVIDIA Cuda software on your system:
|
||||
</P>
|
||||
<UL><LI>Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
|
||||
<LI>Go to http://www.nvidia.com/object/cuda_get.html
|
||||
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
|
||||
<LI>Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties, as sketched after this list
|
||||
</UL>
|
||||
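<P>As a sketch of that last step (it can only be run after the library
build described below has produced the nvc_get_devices executable):
</P>
<PRE>cd lammps/lib/gpu
./nvc_get_devices
</PRE>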
<P><B>Building LAMMPS with the GPU package:</B>
|
||||
</P>
|
||||
<P>This requires two steps (a,b): build the GPU library, then build
|
||||
LAMMPS with the GPU package.
|
||||
</P>
|
||||
<P>(a) Build the GPU library
|
||||
</P>
|
||||
<P>The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in
|
||||
lib/gpu) appropriate for your system. You should pay special
|
||||
attention to 3 settings in this makefile.
|
||||
</P>
|
||||
<UL><LI>CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
|
||||
<LI>CUDA_ARCH = needs to be appropriate to your GPUs
|
||||
<LI>CUDA_PREC = precision (double, mixed, single) you desire
|
||||
</UL>
|
||||
<P>See lib/gpu/Makefile.linux.double for examples of the ARCH settings
|
||||
for different GPU choices, e.g. Fermi vs Kepler. It also lists the
|
||||
possible precision settings:
|
||||
</P>
|
||||
<PRE>CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations
|
||||
CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations
|
||||
CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double
|
||||
</PRE>
|
||||
<P>The last setting is the mixed mode referred to above. Note that your
|
||||
GPU must support double precision to use either the 2nd or 3rd of
|
||||
these settings.
|
||||
</P>
|
||||
<P>To build the library, type:
|
||||
</P>
|
||||
<PRE>make -f Makefile.machine
|
||||
</PRE>
|
||||
<P>If successful, it will produce the files libgpu.a and Makefile.lammps.
|
||||
</P>
|
||||
<P>The latter file has 3 settings that need to be appropriate for the
|
||||
paths and settings for the CUDA system software on your machine.
|
||||
Makefile.lammps is a copy of the file specified by the EXTRAMAKE
|
||||
setting in Makefile.machine. You can change EXTRAMAKE or create your
|
||||
own Makefile.lammps.machine if needed.
|
||||
</P>
|
||||
<P>Note that to change the precision of the GPU library, you need to
|
||||
re-build the entire library. Do a "clean" first, e.g. "make -f
|
||||
Makefile.linux clean", followed by the make command above.
|
||||
</P>
|
||||
<P>(b) Build LAMMPS with the GPU package
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-gpu
|
||||
make machine
|
||||
</PRE>
|
||||
<P>No additional compile/link flags are needed in your Makefile.machine
|
||||
in src/MAKE.
|
||||
</P>
|
||||
<P>Note that if you change the GPU library precision (discussed above)
|
||||
and rebuild the GPU library, then you also need to re-install the GPU
|
||||
package and re-build LAMMPS, so that all affected files are
|
||||
re-compiled and linked to the new GPU library.
|
||||
</P>
|
||||
<P><B>Run with the GPU package from the command line:</B>
|
||||
</P>
|
||||
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
</P>
|
||||
<P>When using the GPU package, you cannot assign more than one GPU to a
|
||||
single MPI task. However multiple MPI tasks can share the same GPU,
|
||||
and in many cases it will be more efficient to run this way. Likewise
|
||||
it may be more efficient to use fewer MPI tasks/node than the available
|
||||
# of CPU cores. Assignment of multiple MPI tasks to a GPU will happen
|
||||
automatically if you create more MPI tasks/node than there are
|
||||
GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
|
||||
shared by 4 MPI tasks.
|
||||
</P>
|
||||
<P>Use the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>,
|
||||
which will automatically append "gpu" to styles that support it. Use
|
||||
the "-pk gpu Ng" <A HREF = "Section_start.html#start_7">command-line switch</A> to
|
||||
set Ng = # of GPUs/node to use.
|
||||
</P>
|
||||
<PRE>lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU
|
||||
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
|
||||
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes
|
||||
</PRE>
|
||||
<P>Note that if the "-sf gpu" switch is used, it also issues a default
|
||||
<A HREF = "package.html">package gpu 1</A> command, which sets the number of
|
||||
GPUs/node to use to 1.
|
||||
</P>
|
||||
<P>Using the "-pk" switch explicitly allows for direct setting of the
|
||||
number of GPUs/node to use and additional options. Its syntax is the
|
||||
same as the "package gpu" command. See the
|
||||
<A HREF = "package.html">package</A> command doc page for details, including the
|
||||
default values used for all its options if it is not specified.
|
||||
</P>
|
||||
<P><B>Or run with the GPU package by editing an input script:</B>
|
||||
</P>
|
||||
<P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
||||
and use of multiple MPI tasks/GPU is the same.
|
||||
</P>
|
||||
<P>Use the <A HREF = "suffix.html">suffix gpu</A> command, or you can explicitly add a
|
||||
"gpu" suffix to individual styles in your input script, e.g.
|
||||
</P>
|
||||
<PRE>pair_style lj/cut/gpu 2.5
|
||||
</PRE>
|
||||
<P>You must also use the <A HREF = "package.html">package gpu</A> command to enable the
|
||||
GPU package, unless the "-sf gpu" or "-pk gpu" <A HREF = "Section_start.html#start_7">command-line
|
||||
switches</A> were used. It specifies the
|
||||
number of GPUs/node to use, as well as other options.
|
||||
</P>
|
||||
<P>IMPORTANT NOTE: The input script must also use a newton pairwise
|
||||
setting of <I>off</I> in order to use GPU package pair styles. This can be
|
||||
set via the <A HREF = "package.html">package gpu</A> or <A HREF = "newton.html">newton</A>
|
||||
commands.
|
||||
</P>
|
||||
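<P>Putting these pieces together, a minimal sketch of the corresponding
input-script lines (one GPU per node is an illustrative choice) would
be:
</P>
<PRE>package gpu 1
newton off
suffix gpu
</PRE>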
<P><B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The performance of a GPU versus a multi-core CPU is a function of your
|
||||
hardware, which pair style is used, the number of atoms/GPU, and the
|
||||
precision used on the GPU (double, single, mixed).
|
||||
</P>
|
||||
<P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
|
||||
LAMMPS web site for performance of the GPU package on various
|
||||
hardware, including the Titan HPC platform at ORNL.
|
||||
</P>
|
||||
<P>You should also experiment with how many MPI tasks per GPU to use to
|
||||
give the best performance for your problem and machine. This is also
|
||||
a function of the problem size and the pair style being used.
|
||||
Likewise, you should experiment with the precision setting for the GPU
|
||||
library to see if single or mixed precision will give accurate
|
||||
results, since they will typically be faster.
|
||||
</P>
|
||||
<P><B>Guidelines for best performance:</B>
|
||||
</P>
|
||||
<UL><LI>Using multiple MPI tasks per GPU will often give the best performance,
|
||||
as allowed by most multi-core CPU/GPU configurations.
|
||||
|
||||
<LI>If the number of particles per MPI task is small (e.g. 100s of
|
||||
particles), it can be more efficient to run with fewer MPI tasks per
|
||||
GPU, even if you do not use all the cores on the compute node.
|
||||
|
||||
<LI>The <A HREF = "package.html">package gpu</A> command has several options for tuning
|
||||
performance. Neighbor lists can be built on the GPU or CPU. Force
|
||||
calculations can be dynamically balanced across the CPU cores and
|
||||
GPUs. GPU-specific settings can be made which can be optimized
|
||||
for different hardware. See the <A HREF = "package.html">package</A> command
|
||||
doc page for details.
|
||||
|
||||
<LI>As described by the <A HREF = "package.html">package gpu</A> command, GPU
|
||||
accelerated pair styles can perform computations asynchronously with
|
||||
CPU computations. The "Pair" time reported by LAMMPS will be the
|
||||
maximum of the time required to complete the CPU pair style
|
||||
computations and the time required to complete the GPU pair style
|
||||
computations. Any time spent for GPU-enabled pair styles for
|
||||
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
|
||||
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
|
||||
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
|
||||
calculations will not be included in the "Pair" time.
|
||||
|
||||
<LI>When the <I>mode</I> setting for the package gpu command is force/neigh,
|
||||
the time for neighbor list calculations on the GPU will be added into
|
||||
the "Pair" time, not the "Neigh" time. An additional breakdown of the
|
||||
times required for various tasks on the GPU (data copy, neighbor
|
||||
calculations, force computations, etc) is output only with the LAMMPS
|
||||
screen output (not in the log file) at the end of each run. These
|
||||
timings represent total time spent on the GPU for each routine,
|
||||
regardless of asynchronous CPU calculations.
|
||||
|
||||
<LI>The output section "GPU Time Info (average)" reports "Max Mem / Proc".
|
||||
This is the maximum memory used at one time on the GPU for data
|
||||
storage by a single MPI process.
|
||||
</UL>
|
||||
<P><B>Restrictions:</B>
|
||||
</P>
|
||||
<P>None.
|
||||
</P>
|
||||
</HTML>
|
|
@ -0,0 +1,242 @@
|
|||
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
||||
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
||||
|
||||
:link(lws,http://lammps.sandia.gov)
|
||||
:link(ld,Manual.html)
|
||||
:link(lc,Section_commands.html#comm)
|
||||
|
||||
:line
|
||||
|
||||
"Return to Section accelerate overview"_Section_accelerate.html
|
||||
|
||||
5.3.2 GPU package :h4
|
||||
|
||||
The GPU package was developed by Mike Brown at ORNL and his
|
||||
collaborators, particularly Trung Nguyen (ORNL). It provides GPU
|
||||
versions of many pair styles, including the 3-body Stillinger-Weber
|
||||
pair style, and of "kspace_style pppm"_kspace_style.html for
|
||||
long-range Coulombics. It has the following general features:
|
||||
|
||||
It is designed to exploit common GPU hardware configurations where one
|
||||
or more GPUs are coupled to many cores of one or more multi-core CPUs,
|
||||
e.g. within a node of a parallel machine. :ulb,l
|
||||
|
||||
Atom-based data (e.g. coordinates, forces) moves back-and-forth
|
||||
between the CPU(s) and GPU every timestep. :l
|
||||
|
||||
Neighbor lists can be built on the CPU or on the GPU :l
|
||||
|
||||
The charge assignment and force interpolation portions of PPPM can be
|
||||
run on the GPU. The FFT portion, which requires MPI communication
|
||||
between processors, runs on the CPU. :l
|
||||
|
||||
Asynchronous force computations can be performed simultaneously on the
|
||||
CPU(s) and GPU. :l
|
||||
|
||||
It allows for GPU computations to be performed in single or double
|
||||
precision, or in mixed-mode precision, where pairwise forces are
|
||||
computed in single precision, but accumulated into double-precision
|
||||
force vectors. :l
|
||||
|
||||
LAMMPS-specific code is in the GPU package. It makes calls to a
|
||||
generic GPU library in the lib/gpu directory. This library provides
|
||||
NVIDIA support as well as more general OpenCL support, so that the
|
||||
same functionality can eventually be supported on a variety of GPU
|
||||
hardware. :l,ule
|
||||
|
||||
Here is a quick overview of how to use the GPU package:
|
||||
|
||||
build the library in lib/gpu for your GPU hardware with desired precision
|
||||
include the GPU package and build LAMMPS
|
||||
use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
|
||||
specify the # of GPUs per node
|
||||
use GPU styles in your input script :ul
|
||||
|
||||
The latter two steps can be done using the "-pk gpu" and "-sf gpu"
|
||||
"command-line switches"_Section_start.html#start_7 respectively. Or
|
||||
the effect of the "-pk" or "-sf" switches can be duplicated by adding
|
||||
the "package gpu"_package.html or "suffix gpu"_suffix.html commands
|
||||
respectively to your input script.
|
||||
|
||||
[Required hardware/software:]
|
||||
|
||||
To use this package, you currently need to have an NVIDIA GPU and
|
||||
install the NVIDIA Cuda software on your system:
|
||||
|
||||
Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
|
||||
Go to http://www.nvidia.com/object/cuda_get.html
|
||||
Install a driver and toolkit appropriate for your system (SDK is not necessary)
|
||||
Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties, as sketched after this list :ul
|
||||
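As a sketch of that last step (it can only be run after the library
build described below has produced the nvc_get_devices executable):

cd lammps/lib/gpu
./nvc_get_devices :pre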
|
||||
[Building LAMMPS with the GPU package:]
|
||||
|
||||
This requires two steps (a,b): build the GPU library, then build
|
||||
LAMMPS with the GPU package.
|
||||
|
||||
(a) Build the GPU library
|
||||
|
||||
The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in
|
||||
lib/gpu) appropriate for your system. You should pay special
|
||||
attention to 3 settings in this makefile.
|
||||
|
||||
CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
|
||||
CUDA_ARCH = needs to be appropriate to your GPUs
|
||||
CUDA_PREC = precision (double, mixed, single) you desire :ul
|
||||
|
||||
See lib/gpu/Makefile.linux.double for examples of the ARCH settings
|
||||
for different GPU choices, e.g. Fermi vs Kepler. It also lists the
|
||||
possible precision settings:
|
||||
|
||||
CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations
|
||||
CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations
|
||||
CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double :pre
|
||||
|
||||
The last setting is the mixed mode referred to above. Note that your
|
||||
GPU must support double precision to use either the 2nd or 3rd of
|
||||
these settings.
|
||||
|
||||
To build the library, type:
|
||||
|
||||
make -f Makefile.machine :pre
|
||||
|
||||
If successful, it will produce the files libgpu.a and Makefile.lammps.
|
||||
|
||||
The latter file has 3 settings that need to be appropriate for the
|
||||
paths and settings for the CUDA system software on your machine.
|
||||
Makefile.lammps is a copy of the file specified by the EXTRAMAKE
|
||||
setting in Makefile.machine. You can change EXTRAMAKE or create your
|
||||
own Makefile.lammps.machine if needed.
|
||||
|
||||
Note that to change the precision of the GPU library, you need to
|
||||
re-build the entire library. Do a "clean" first, e.g. "make -f
|
||||
Makefile.linux clean", followed by the make command above.
|
||||
|
||||
(b) Build LAMMPS with the GPU package
|
||||
|
||||
cd lammps/src
|
||||
make yes-gpu
|
||||
make machine :pre
|
||||
|
||||
No additional compile/link flags are needed in your Makefile.machine
|
||||
in src/MAKE.
|
||||
|
||||
Note that if you change the GPU library precision (discussed above)
|
||||
and rebuild the GPU library, then you also need to re-install the GPU
|
||||
package and re-build LAMMPS, so that all affected files are
|
||||
re-compiled and linked to the new GPU library.
|
||||
|
||||
[Run with the GPU package from the command line:]
|
||||
|
||||
The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
|
||||
When using the GPU package, you cannot assign more than one GPU to a
|
||||
single MPI task. However multiple MPI tasks can share the same GPU,
|
||||
and in many cases it will be more efficient to run this way. Likewise
|
||||
it may be more efficient to use fewer MPI tasks/node than the available
|
||||
# of CPU cores. Assignment of multiple MPI tasks to a GPU will happen
|
||||
automatically if you create more MPI tasks/node than there are
|
||||
GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
|
||||
shared by 4 MPI tasks.
|
||||
|
||||
Use the "-sf gpu" "command-line switch"_Section_start.html#start_7,
|
||||
which will automatically append "gpu" to styles that support it. Use
|
||||
the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to
|
||||
set Ng = # of GPUs/node to use.
|
||||
|
||||
lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU
|
||||
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
|
||||
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes :pre
|
||||
|
||||
Note that if the "-sf gpu" switch is used, it also issues a default
|
||||
"package gpu 1"_package.html command, which sets the number of
|
||||
GPUs/node to use to 1.
|
||||
|
||||
Using the "-pk" switch explicitly allows for direct setting of the
|
||||
number of GPUs/node to use and additional options. Its syntax is the
|
||||
same as the "package gpu" command. See the
|
||||
"package"_package.html command doc page for details, including the
|
||||
default values used for all its options if it is not specified.
|
||||
|
||||
[Or run with the GPU package by editing an input script:]
|
||||
|
||||
The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
||||
and use of multiple MPI tasks/GPU is the same.
|
||||
|
||||
Use the "suffix gpu"_suffix.html command, or you can explicitly add an
|
||||
"gpu" suffix to individual styles in your input script, e.g.
|
||||
|
||||
pair_style lj/cut/gpu 2.5 :pre
|
||||
|
||||
You must also use the "package gpu"_package.html command to enable the
|
||||
GPU package, unless the "-sf gpu" or "-pk gpu" "command-line
|
||||
switches"_Section_start.html#start_7 were used. It specifies the
|
||||
number of GPUs/node to use, as well as other options.
|
||||
|
||||
IMPORTANT NOTE: The input script must also use a newton pairwise
|
||||
setting of {off} in order to use GPU package pair styles. This can be
|
||||
set via the "package gpu"_package.html or "newton"_newton.html
|
||||
commands.
|
||||
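Putting these pieces together, a minimal sketch of the corresponding
input-script lines (one GPU per node is an illustrative choice) would
be:

package gpu 1
newton off
suffix gpu :pre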
|
||||
[Speed-ups to expect:]
|
||||
|
||||
The performance of a GPU versus a multi-core CPU is a function of your
|
||||
hardware, which pair style is used, the number of atoms/GPU, and the
|
||||
precision used on the GPU (double, single, mixed).
|
||||
|
||||
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
||||
LAMMPS web site for performance of the GPU package on various
|
||||
hardware, including the Titan HPC platform at ORNL.
|
||||
|
||||
You should also experiment with how many MPI tasks per GPU to use to
|
||||
give the best performance for your problem and machine. This is also
|
||||
a function of the problem size and the pair style being used.
|
||||
Likewise, you should experiment with the precision setting for the GPU
|
||||
library to see if single or mixed precision will give accurate
|
||||
results, since they will typically be faster.
|
||||
|
||||
[Guidelines for best performance:]
|
||||
|
||||
Using multiple MPI tasks per GPU will often give the best performance,
|
||||
as allowed by most multi-core CPU/GPU configurations. :ulb,l
|
||||
|
||||
If the number of particles per MPI task is small (e.g. 100s of
|
||||
particles), it can be more efficient to run with fewer MPI tasks per
|
||||
GPU, even if you do not use all the cores on the compute node. :l
|
||||
|
||||
The "package gpu"_package.html command has several options for tuning
|
||||
performance. Neighbor lists can be built on the GPU or CPU. Force
|
||||
calculations can be dynamically balanced across the CPU cores and
|
||||
GPUs. GPU-specific settings can be made which can be optimized
|
||||
for different hardware. See the "package"_package.html command
|
||||
doc page for details. :l
|
||||
|
||||
As described by the "package gpu"_package.html command, GPU
|
||||
accelerated pair styles can perform computations asynchronously with
|
||||
CPU computations. The "Pair" time reported by LAMMPS will be the
|
||||
maximum of the time required to complete the CPU pair style
|
||||
computations and the time required to complete the GPU pair style
|
||||
computations. Any time spent for GPU-enabled pair styles for
|
||||
computations that run simultaneously with "bond"_bond_style.html,
|
||||
"angle"_angle_style.html, "dihedral"_dihedral_style.html,
|
||||
"improper"_improper_style.html, and "long-range"_kspace_style.html
|
||||
calculations will not be included in the "Pair" time. :l
|
||||
|
||||
When the {mode} setting for the package gpu command is force/neigh,
|
||||
the time for neighbor list calculations on the GPU will be added into
|
||||
the "Pair" time, not the "Neigh" time. An additional breakdown of the
|
||||
times required for various tasks on the GPU (data copy, neighbor
|
||||
calculations, force computations, etc) is output only with the LAMMPS
|
||||
screen output (not in the log file) at the end of each run. These
|
||||
timings represent total time spent on the GPU for each routine,
|
||||
regardless of asynchronous CPU calculations. :l
|
||||
|
||||
The output section "GPU Time Info (average)" reports "Max Mem / Proc".
|
||||
This is the maximum memory used at one time on the GPU for data
|
||||
storage by a single MPI process. :l,ule
|
||||
|
||||
[Restrictions:]
|
||||
|
||||
None.
|
|
@ -0,0 +1,304 @@
|
|||
<HTML>
|
||||
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
|
||||
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
|
||||
</CENTER>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<HR>
|
||||
|
||||
<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
|
||||
</P>
|
||||
<H4>5.3.3 USER-INTEL package
|
||||
</H4>
|
||||
<P>The USER-INTEL package was developed by Mike Brown at Intel
|
||||
Corporation. It provides a capability to accelerate simulations by
|
||||
offloading neighbor list and non-bonded force calculations to Intel(R)
|
||||
Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
|
||||
Additionally, it supports running simulations in single, mixed, or
|
||||
double precision with vectorization, even if a coprocessor is not
|
||||
present, i.e. on an Intel(R) CPU. The same C++ code is used for both
|
||||
cases. When offloading to a coprocessor, the routine is run twice,
|
||||
once with an offload flag.
|
||||
</P>
|
||||
<P>The USER-INTEL package can be used in tandem with the USER-OMP
|
||||
package. This is useful when offloading pair style computations to
|
||||
coprocessors, so that other styles not supported by the USER-INTEL
|
||||
package, e.g. bond, angle, dihedral, improper, and long-range
|
||||
electrostatics, can be run simultaneously in threaded mode on CPU
|
||||
cores. Since fewer MPI tasks than CPU cores will typically be invoked
|
||||
when running with coprocessors, this enables the extra cores to be
|
||||
utilized for useful computation.
|
||||
</P>
|
||||
<P>If LAMMPS is built with both the USER-INTEL and USER-OMP packages
|
||||
installed, this mode of operation is made easier to use, because the
|
||||
"-suffix intel" <A HREF = "Section_start.html#start_7">command-line switch</A> or
|
||||
the <A HREF = "suffix.html">suffix intel</A> command will both set a second-choice
|
||||
suffix to "omp" so that styles from the USER-OMP package will be used
|
||||
if available, after first testing if a style from the USER-INTEL
|
||||
package is available.
|
||||
</P>
|
||||
<P>Here is a quick overview of how to use the USER-INTEL package
|
||||
for CPU acceleration:
|
||||
</P>
|
||||
<UL><LI>specify these CCFLAGS in your Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, and -xHost
|
||||
<LI>specify -fopenmp with LINKFLAGS in your Makefile.machine
|
||||
<LI>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
|
||||
<LI>if using the USER-OMP package, specify how many threads per MPI task to use
|
||||
<LI>use USER-INTEL styles in your input script
|
||||
</UL>
|
||||
<P>Using the USER-INTEL package to offload work to the Intel(R)
|
||||
Xeon Phi(TM) coprocessor is the same except for these additional
|
||||
steps:
|
||||
</P>
|
||||
<UL><LI>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
|
||||
<LI>add the flag -offload to LINKFLAGS in your Makefile.machine
|
||||
<LI>specify how many threads per coprocessor to use
|
||||
</UL>
|
||||
<P>The latter two steps in the first case and the last step in the
|
||||
coprocessor case can be done using the "-pk omp" and "-sf intel" and
|
||||
"-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
|
||||
respectively. Or the effect of the "-pk" or "-sf" switches can be
|
||||
duplicated by adding the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix
|
||||
intel</A> or <A HREF = "package.html">package intel</A> commands
|
||||
respectively to your input script.
|
||||
</P>
|
||||
<P><B>Required hardware/software:</B>
|
||||
</P>
|
||||
<P>To use the offload option, you must have one or more Intel(R) Xeon
|
||||
Phi(TM) coprocessors.
|
||||
</P>
|
||||
<P>Optimizations for vectorization have only been tested with the
|
||||
Intel(R) compiler. Use of other compilers may not result in
|
||||
vectorization, or may give poor performance.
|
||||
</P>
|
||||
<P>Use of an Intel C++ compiler is recommended, but not required. The
|
||||
compiler must support the OpenMP interface.
|
||||
</P>
|
||||
<P><B>Building LAMMPS with the USER-INTEL package:</B>
|
||||
</P>
|
||||
<P>Include the package(s) and build LAMMPS:
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-user-intel
|
||||
make yes-user-omp (if desired)
|
||||
make machine
|
||||
</PRE>
|
||||
<P>If the USER-OMP package is also installed, you can use styles from
|
||||
both packages, as described below.
|
||||
</P>
|
||||
<P>The low-level src/MAKE/Makefile.machine needs a flag for OpenMP support
|
||||
in both the CCFLAGS and LINKFLAGS variables, which is <I>-openmp</I> for
|
||||
Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
|
||||
-restrict to CCFLAGS.
|
||||
</P>
|
||||
<P>If you are compiling on the same architecture that will be used for
|
||||
the runs, adding the flag <I>-xHost</I> to CCFLAGS will enable
|
||||
vectorization with the Intel(R) compiler.
|
||||
</P>
|
||||
<P>In order to build with support for an Intel(R) coprocessor, the flag
|
||||
<I>-offload</I> should be added to the LINKFLAGS line and the flag
|
||||
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
|
||||
</P>
|
||||
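<P>A sketch of the two affected lines in a hypothetical Makefile.machine
for the Intel(R) compiler, with coprocessor offload enabled (the -O2
optimization level is illustrative; drop -DLMP_INTEL_OFFLOAD and
-offload for a CPU-only build):
</P>
<PRE>CCFLAGS =   -O2 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
LINKFLAGS = -O2 -openmp -offload
</PRE>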
<P>Note that the machine makefiles Makefile.intel and
|
||||
Makefile.intel_offload are included in the src/MAKE directory with
|
||||
options that perform well with the Intel(R) compiler. The latter file
|
||||
has support for offload to coprocessors; the former does not.
|
||||
</P>
|
||||
<P>If using an Intel compiler, it is recommended that Intel(R) Compiler
|
||||
2013 SP1 update 1 be used. Newer versions have some performance
|
||||
issues that are being addressed. If using Intel(R) MPI, version 5 or
|
||||
higher is recommended.
|
||||
</P>
|
||||
<P><B>Running with the USER-INTEL package from the command line:</B>
|
||||
</P>
|
||||
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
</P>
|
||||
<P>If LAMMPS was also built with the USER-OMP package, you need to choose
|
||||
how many OpenMP threads per MPI task will be used by the USER-OMP
|
||||
package. Note that the product of MPI tasks * OpenMP threads/task
|
||||
should not exceed the physical number of cores (on a node), otherwise
|
||||
performance will suffer.
|
||||
</P>
|
||||
<P>If LAMMPS was built with coprocessor support for the USER-INTEL
|
||||
package, you need to specify the number of coprocessors/node and the
|
||||
number of threads to use on the coprocessor per MPI task. Note that
|
||||
coprocessor threads (which run on the coprocessor) are totally
|
||||
independent from OpenMP threads (which run on the CPU). The product
|
||||
of MPI tasks * coprocessor threads/task should not exceed the maximum
|
||||
number of threads the coprocessor is designed to run, otherwise
|
||||
performance will suffer. This value is 240 for current generation
|
||||
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
|
||||
threads/core value can be set to a smaller value if desired by an
|
||||
option on the <A HREF = "package.html">package intel</A> command, in which case the
|
||||
maximum number of threads is also reduced.
|
||||
</P>
|
||||
<P>Use the "-sf intel" <A HREF = "Section_start.html#start_7">command-line switch</A>,
|
||||
which will automatically append "intel" to styles that support it. If
|
||||
a style does not support it, an "omp" suffix is tried next. Use the
|
||||
"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A>, to set
|
||||
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
|
||||
the USER-OMP package. Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> to set Nphi = # of Xeon Phi(TM)
|
||||
coprocessors/node, if LAMMPS was built with coprocessor support.
|
||||
</P>
|
||||
<PRE>CPU-only without USER-OMP (but using Intel vectorization on CPU):
|
||||
lmp_machine -sf intel -in in.script # 1 MPI task
|
||||
mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
|
||||
</PRE>
|
||||
<PRE>CPU-only with USER-OMP (and Intel vectorization on CPU):
|
||||
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
|
||||
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
|
||||
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes
|
||||
</PRE>
|
||||
<PRE>CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
|
||||
lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
|
||||
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
|
||||
# each MPI task uses 60 threads on 1 coprocessor
|
||||
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
|
||||
# each MPI task uses 120 threads on one of 2 coprocessors
|
||||
</PRE>
|
||||
<P>Note that if the "-sf intel" switch is used, it also issues two
|
||||
default commands: <A HREF = "package.html">package omp 0</A> and <A HREF = "package.html">package intel
|
||||
1</A>. These set the number of OpenMP threads per
|
||||
MPI task via the OMP_NUM_THREADS environment variable, and the number
|
||||
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
|
||||
LAMMPS was not built with the USER-OMP package. The latter is ignored
|
||||
if LAMMPS was not built with coprocessor support, except for its
|
||||
optional precision setting.
|
||||
</P>
|
||||
<P>Using the "-pk omp" switch explicitly allows for direct setting of the
|
||||
number of OpenMP threads per MPI task, and additional options. Using
|
||||
the "-pk intel" switch explicitly allows for direct setting of the
|
||||
number of coprocessors/node, and additional options. The syntax for
|
||||
these two switches is the same as the <A HREF = "package.html">package omp</A> and
|
||||
<A HREF = "package.html">package intel</A> commands. See the <A HREF = "package.html">package</A>
|
||||
command doc page for details, including the default values used for
|
||||
all its options if these switches are not specified, and how to set
|
||||
the number of OpenMP threads via the OMP_NUM_THREADS environment
|
||||
variable if desired.
|
||||
</P>
|
||||
<P><B>Or run with the USER-INTEL package by editing an input script:</B>
|
||||
</P>
|
||||
<P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
||||
OpenMP threads per MPI task, and coprocessor threads per MPI task is
|
||||
the same.
|
||||
</P>
|
||||
<P>Use the <A HREF = "suffix.html">suffix intel</A> command, or you can explicitly add an
|
||||
"intel" suffix to individual styles in your input script, e.g.
|
||||
</P>
|
||||
<PRE>pair_style lj/cut/intel 2.5
|
||||
</PRE>
|
||||
<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
|
||||
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
|
||||
intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line switches</A>
|
||||
were used. It specifies how many OpenMP threads per MPI task to use,
|
||||
as well as other options. Its doc page explains how to set the number
|
||||
of threads via an environment variable if desired.
|
||||
</P>
|
||||
<P>You must also use the <A HREF = "package.html">package intel</A> command to enable
|
||||
coprocessor support within the USER-INTEL package (assuming LAMMPS was
|
||||
built with coprocessor support) unless the "-sf intel" or "-pk intel"
|
||||
<A HREF = "Section_start.html#start_7">command-line switches</A> were used. It
|
||||
specifies how many coprocessors/node to use, as well as other
|
||||
coprocessor options.
|
||||
</P>
|
||||
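<P>A minimal sketch of the corresponding input-script lines (4 OpenMP
threads per MPI task and 1 coprocessor per node are illustrative
values, and assume LAMMPS was built with both USER-OMP and coprocessor
support):
</P>
<PRE>package omp 4
package intel 1
suffix intel
</PRE>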
<P><B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>If LAMMPS was not built with coprocessor support when including the
|
||||
USER-INTEL package, then accelerated styles will run on the CPU using
|
||||
vectorization optimizations and the specified precision. This may
|
||||
give a substantial speed-up for a pair style, particularly if mixed or
|
||||
single precision is used.
|
||||
</P>
|
||||
<P>If LAMMPS was built with coprocessor support, the pair styles will run
|
||||
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
|
||||
performance of a Xeon Phi versus a multi-core CPU is a function of
|
||||
your hardware, which pair style is used, the number of
|
||||
atoms/coprocessor, and the precision used on the coprocessor (double,
|
||||
single, mixed).
|
||||
</P>
|
||||
<P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
|
||||
LAMMPS web site for performance of the USER-INTEL package on different
|
||||
hardware.
|
||||
</P>
|
||||
<P><B>Guidelines for best performance on an Intel(R) Xeon Phi(TM)
|
||||
coprocessor:</B>
|
||||
</P>
|
||||
<UL><LI>The default for the <A HREF = "package.html">package intel</A> command is to have
|
||||
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
|
||||
coprocessor. In general, running with a large number of MPI tasks on
|
||||
each node will perform best with offload. Each MPI task will
|
||||
automatically get affinity to a subset of the hardware threads
|
||||
available on the coprocessor. For example, if your card has 61 cores,
|
||||
with 60 cores available for offload and 4 hardware threads per core
|
||||
(240 total threads), running with 24 MPI tasks per node will cause
|
||||
each MPI task to use a subset of 10 threads on the coprocessor. Fine
|
||||
tuning of the number of threads to use per MPI task or the number of
|
||||
threads to use per core can be accomplished with keyword settings of
|
||||
the <A HREF = "package.html">package intel</A> command.
|
||||
|
||||
<LI>If desired, only a fraction of the pair style computation can be
|
||||
offloaded to the coprocessors. This is accomplished by using the
|
||||
<I>balance</I> keyword in the <A HREF = "package.html">package intel</A> command. A
|
||||
balance of 0 runs all calculations on the CPU. A balance of 1 runs
|
||||
all calculations on the coprocessor. A balance of 0.5 runs half of
|
||||
the calculations on the coprocessor. Setting the balance to -1 (the
|
||||
default) will enable dynamic load balancing that continuously adjusts
|
||||
the fraction of offloaded work throughout the simulation. This option
|
||||
typically produces results within 5 to 10 percent of the optimal fixed
|
||||
balance.
|
||||
|
||||
<LI>When using offload with CPU hyperthreading disabled, it may help
|
||||
performance to use fewer MPI tasks and OpenMP threads than available
|
||||
cores. This is due to the fact that additional threads are generated
|
||||
internally to handle the asynchronous offload tasks.
|
||||
|
||||
<LI>If running short benchmark runs with dynamic load balancing, adding a
|
||||
short warm-up run (10-20 steps) will allow the load-balancer to find a
|
||||
near-optimal setting that will carry over to additional runs.
|
||||
|
||||
<LI>If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
|
||||
coprocessor, a diagnostic line is printed to the screen (not to the
|
||||
log file), during the setup phase of a run, indicating that offload
|
||||
mode is being used and indicating the number of coprocessor threads
|
||||
per MPI task. Additionally, an offload timing summary is printed at
|
||||
the end of each run. When offloading, the frequency for <A HREF = "atom_modify.html">atom
|
||||
sorting</A> is changed to 1 so that the per-atom data is
|
||||
effectively sorted at every rebuild of the neighbor lists.
|
||||
|
||||
<LI>For simulations with long-range electrostatics or bond, angle,
|
||||
dihedral, improper calculations, computation and data transfer to the
|
||||
coprocessor will run concurrently with computations and MPI
|
||||
communications for these calculations on the host CPU. The USER-INTEL
|
||||
package has two modes for deciding which atoms will be handled by the
|
||||
coprocessor. This choice is controlled with the <I>ghost</I> keyword of
|
||||
the <A HREF = "package.html">package intel</A> command. When set to 0, ghost atoms
|
||||
(atoms at the borders between MPI tasks) are not offloaded to the
|
||||
card. This allows for overlap of MPI communication of forces with
|
||||
computation on the coprocessor when the <A HREF = "newton.html">newton</A> setting
|
||||
is "on". The default is dependent on the style being used, however,
|
||||
better performance may be achieved by setting this option
|
||||
explicitly.
|
||||
</UL>
|
||||
<P><B>Restrictions:</B>
|
||||
</P>
|
||||
<P>When offloading to a coprocessor, <A HREF = "pair_hybrid.html">hybrid</A> styles
|
||||
that require skip lists for neighbor builds cannot be offloaded.
|
||||
Using <A HREF = "pair_hybrid.html">hybrid/overlay</A> is allowed. Only one intel
|
||||
accelerated style may be used with hybrid styles.
|
||||
<A HREF = "special_bonds.html">Special_bonds</A> exclusion lists are not currently
|
||||
supported with offload; however, the same effect can often be
|
||||
accomplished by setting cutoffs for excluded atom types to 0. None of
|
||||
the pair styles in the USER-INTEL package currently support the
|
||||
"inner", "middle", "outer" options for rRESPA integration via the
|
||||
<A HREF = "run_style.html">run_style respa</A> command; only the "pair" option is
|
||||
supported.
|
||||
</P>
|
||||
</HTML>
|
|
@ -0,0 +1,299 @@
|
|||
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
||||
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
||||
|
||||
:link(lws,http://lammps.sandia.gov)
|
||||
:link(ld,Manual.html)
|
||||
:link(lc,Section_commands.html#comm)
|
||||
|
||||
:line
|
||||
|
||||
"Return to Section accelerate overview"_Section_accelerate.html
|
||||
|
||||
5.3.3 USER-INTEL package :h4
|
||||
|
||||
The USER-INTEL package was developed by Mike Brown at Intel
|
||||
Corporation. It provides a capability to accelerate simulations by
|
||||
offloading neighbor list and non-bonded force calculations to Intel(R)
|
||||
Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
|
||||
Additionally, it supports running simulations in single, mixed, or
|
||||
double precision with vectorization, even if a coprocessor is not
|
||||
present, i.e. on an Intel(R) CPU. The same C++ code is used for both
|
||||
cases. When offloading to a coprocessor, the routine is run twice,
|
||||
once with an offload flag.
|
||||
|
||||
The USER-INTEL package can be used in tandem with the USER-OMP
|
||||
package. This is useful when offloading pair style computations to
|
||||
coprocessors, so that other styles not supported by the USER-INTEL
|
||||
package, e.g. bond, angle, dihedral, improper, and long-range
|
||||
electrostatics, can be run simultaneously in threaded mode on CPU
|
||||
cores. Since fewer MPI tasks than CPU cores will typically be invoked
|
||||
when running with coprocessors, this enables the extra cores to be
|
||||
utilized for useful computation.
|
||||
|
||||
If LAMMPS is built with both the USER-INTEL and USER-OMP packages
|
||||
installed, this mode of operation is made easier to use, because the
|
||||
"-suffix intel" "command-line switch"_Section_start.html#start_7 or
|
||||
the "suffix intel"_suffix.html command will both set a second-choice
|
||||
suffix to "omp" so that styles from the USER-OMP package will be used
|
||||
if available, after first testing if a style from the USER-INTEL
|
||||
package is available.
|
||||
|
||||
Here is a quick overview of how to use the USER-INTEL package
|
||||
for CPU acceleration:
|
||||
|
||||
specify these CCFLAGS in your Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, and -xHost
|
||||
specify -fopenmp with LINKFLAGS in your Makefile.machine
|
||||
include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
|
||||
if using the USER-OMP package, specify how many threads per MPI task to use
|
||||
use USER-INTEL styles in your input script :ul
|
||||
|
||||
Using the USER-INTEL package to offload work to the Intel(R)
|
||||
Xeon Phi(TM) coprocessor is the same except for these additional
|
||||
steps:
|
||||
|
||||
add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
|
||||
add the flag -offload to LINKFLAGS in your Makefile.machine
|
||||
specify how many threads per coprocessor to use :ul
|
||||
|
||||
The latter two steps in the first case and the last step in the
|
||||
coprocessor case can be done using the "-pk omp" and "-sf intel" and
|
||||
"-pk intel" "command-line switches"_Section_start.html#start_7
|
||||
respectively. Or the effect of the "-pk" or "-sf" switches can be
|
||||
duplicated by adding the "package omp"_package.html or "suffix
|
||||
intel"_suffix.html or "package intel"_package.html commands
|
||||
respectively to your input script.
|
||||
|
||||
[Required hardware/software:]
|
||||
|
||||
To use the offload option, you must have one or more Intel(R) Xeon
|
||||
Phi(TM) coprocessors.
|
||||
|
||||
Optimizations for vectorization have only been tested with the
|
||||
Intel(R) compiler. Use of other compilers may not result in
|
||||
vectorization or give poor performance.
|
||||
|
||||
Use of an Intel C++ compiler is recommended, but not required. The
|
||||
compiler must support the OpenMP interface.
|
||||
|
||||
[Building LAMMPS with the USER-INTEL package:]
|
||||
|
||||
Include the package(s) and build LAMMPS:
|
||||
|
||||
cd lammps/src
|
||||
make yes-user-intel
|
||||
make yes-user-omp (if desired)
|
||||
make machine :pre
|
||||
|
||||
If the USER-OMP package is also installed, you can use styles from
|
||||
both packages, as described below.
|
||||
|
||||
The low-level src/MAKE/Makefile.machine needs a flag for OpenMP support
|
||||
in both the CCFLAGS and LINKFLAGS variables, which is {-openmp} for
|
||||
Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
|
||||
-restrict to CCFLAGS.
|
||||
|
||||
If you are compiling on the same architecture that will be used for
|
||||
the runs, adding the flag {-xHost} to CCFLAGS will enable
|
||||
vectorization with the Intel(R) compiler.
|
||||
|
||||
In order to build with support for an Intel(R) coprocessor, the flag
|
||||
{-offload} should be added to the LINKFLAGS line and the flag
|
||||
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
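Taken together, the relevant fragment of a Makefile.machine for CPU
vectorization plus coprocessor offload could look roughly like the lines
below.  The compiler wrapper name and the -O3 optimization level are
assumptions for illustration; only the OpenMP, alignment, restrict, -xHost,
and offload flags come from the discussion above.

# illustrative fragment of src/MAKE/Makefile.machine (Intel compiler + offload)
CC =        mpiicpc
CCFLAGS =   -O3 -openmp -restrict -xHost -DLAMMPS_MEMALIGN=64 -DLMP_INTEL_OFFLOAD
LINK =      mpiicpc
LINKFLAGS = -O3 -openmp -offload :pre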
|
||||
|
||||
Note that the machine makefiles Makefile.intel and
|
||||
Makefile.intel_offload are included in the src/MAKE directory with
|
||||
options that perform well with the Intel(R) compiler. The latter file
|
||||
has support for offload to coprocessors; the former does not.
|
||||
|
||||
If using an Intel compiler, it is recommended that Intel(R) Compiler
|
||||
2013 SP1 update 1 be used. Newer versions have some performance
|
||||
issues that are being addressed. If using Intel(R) MPI, version 5 or
|
||||
higher is recommended.
|
||||
|
||||
[Running with the USER-INTEL package from the command line:]
|
||||
|
||||
The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
|
||||
If LAMMPS was also built with the USER-OMP package, you need to choose
|
||||
how many OpenMP threads per MPI task will be used by the USER-OMP
|
||||
package. Note that the product of MPI tasks * OpenMP threads/task
|
||||
should not exceed the physical number of cores (on a node), otherwise
|
||||
performance will suffer.
|
||||
|
||||
If LAMMPS was built with coprocessor support for the USER-INTEL
|
||||
package, you need to specify the number of coprocessor/node and the
|
||||
number of threads to use on the coprocessor per MPI task. Note that
|
||||
coprocessor threads (which run on the coprocessor) are totally
|
||||
independent from OpenMP threads (which run on the CPU). The product
|
||||
of MPI tasks * coprocessor threads/task should not exceed the maximum
|
||||
number of threads the coprocessor is designed to run, otherwise
|
||||
performance will suffer. This value is 240 for current generation
|
||||
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
|
||||
threads/core value can be set to a smaller value if desired by an
|
||||
option on the "package intel"_package.html command, in which case the
|
||||
maximum number of threads is also reduced.
|
||||
|
||||
Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
|
||||
which will automatically append "intel" to styles that support it. If
|
||||
a style does not support it, a "omp" suffix is tried next. Use the
|
||||
"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
|
||||
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
|
||||
the USER-OMP package. Use the "-pk intel Nphi" "command-line
|
||||
switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
|
||||
coprocessors/node, if LAMMPS was built with coprocessor support.
|
||||
|
||||
CPU-only without USER-OMP (but using Intel vectorization on CPU):
|
||||
lmp_machine -sf intel -in in.script # 1 MPI task
|
||||
mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre
|
||||
|
||||
CPU-only with USER-OMP (and Intel vectorization on CPU):
|
||||
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
|
||||
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
|
||||
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes :pre
|
||||
|
||||
CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
|
||||
lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
|
||||
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
|
||||
# each MPI task uses 60 threads on 1 coprocessor
|
||||
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
|
||||
# each MPI task uses 120 threads on one of 2 coprocessors :pre
|
||||
|
||||
Note that if the "-sf intel" switch is used, it also issues two
|
||||
default commands: "package omp 0"_package.html and "package intel
|
||||
1"_package.html command. These set the number of OpenMP threads per
|
||||
MPI task via the OMP_NUM_THREADS environment variable, and the number
|
||||
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
|
||||
LAMMPS was not built with the USER-OMP package. The latter is ignored
|
||||
if LAMMPS was not built with coprocessor support, except for its
|
||||
optional precision setting.
|
||||
|
||||
Using the "-pk omp" switch explicitly allows for direct setting of the
|
||||
number of OpenMP threads per MPI task, and additional options. Using
|
||||
the "-pk intel" switch explicitly allows for direct setting of the
|
||||
number of coprocessors/node, and additional options. The syntax for
|
||||
these two switches is the same as the "package omp"_package.html and
|
||||
"package intel"_package.html commands. See the "package"_package.html
|
||||
command doc page for details, including the default values used for
|
||||
all its options if these switches are not specified, and how to set
|
||||
the number of OpenMP threads via the OMP_NUM_THREADS environment
|
||||
variable if desired.
|
||||
|
||||
[Or run with the USER-INTEL package by editing an input script:]
|
||||
|
||||
The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
||||
OpenMP threads per MPI task, and coprocessor threads per MPI task is
|
||||
the same.
|
||||
|
||||
Use the "suffix intel"_suffix.html command, or you can explicitly add an
|
||||
"intel" suffix to individual styles in your input script, e.g.
|
||||
|
||||
pair_style lj/cut/intel 2.5 :pre
|
||||
|
||||
You must also use the "package omp"_package.html command to enable the
|
||||
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
|
||||
intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
|
||||
were used. It specifies how many OpenMP threads per MPI task to use,
|
||||
as well as other options. Its doc page explains how to set the number
|
||||
of threads via an environment variable if desired.
|
||||
|
||||
You must also use the "package intel"_package.html command to enable
|
||||
coprocessor support within the USER-INTEL package (assuming LAMMPS was
|
||||
built with coprocessor support) unless the "-sf intel" or "-pk intel"
|
||||
"command-line switches"_Section_start.html#start_7 were used. It
|
||||
specifies how many coprocessors/node to use, as well as other
|
||||
coprocessor options.
|
||||
|
||||
[Speed-ups to expect:]
|
||||
|
||||
If LAMMPS was not built with coprocessor support when including the
|
||||
USER-INTEL package, then accelerated styles will run on the CPU using
|
||||
vectorization optimizations and the specified precision. This may
|
||||
give a substantial speed-up for a pair style, particularly if mixed or
|
||||
single precision is used.
|
||||
|
||||
If LAMMPS was built with coprocessor support, the pair styles will run
|
||||
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
|
||||
performance of a Xeon Phi versus a multi-core CPU is a function of
|
||||
your hardware, which pair style is used, the number of
|
||||
atoms/coprocessor, and the precision used on the coprocessor (double,
|
||||
single, mixed).
|
||||
|
||||
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
||||
LAMMPS web site for performance of the USER-INTEL package on different
|
||||
hardware.
|
||||
|
||||
[Guidelines for best performance on an Intel(R) Xeon Phi(TM)
|
||||
coprocessor:]
|
||||
|
||||
The default for the "package intel"_package.html command is to have
|
||||
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
|
||||
coprocessor. In general, running with a large number of MPI tasks on
|
||||
each node will perform best with offload. Each MPI task will
|
||||
automatically get affinity to a subset of the hardware threads
|
||||
available on the coprocessor. For example, if your card has 61 cores,
|
||||
with 60 cores available for offload and 4 hardware threads per core
|
||||
(240 total threads), running with 24 MPI tasks per node will cause
|
||||
each MPI task to use a subset of 10 threads on the coprocessor. Fine
|
||||
tuning of the number of threads to use per MPI task or the number of
|
||||
threads to use per core can be accomplished with keyword settings of
|
||||
the "package intel"_package.html command. :ulb,l
|
||||
|
||||
If desired, only a fraction of the pair style computation can be
|
||||
offloaded to the coprocessors. This is accomplished by using the
|
||||
{balance} keyword in the "package intel"_package.html command. A
|
||||
balance of 0 runs all calculations on the CPU. A balance of 1 runs
|
||||
all calculations on the coprocessor. A balance of 0.5 runs half of
|
||||
the calculations on the coprocessor. Setting the balance to -1 (the
|
||||
default) will enable dynamic load balancing that continuously adjusts
|
||||
the fraction of offloaded work throughout the simulation. This option
|
||||
typically produces results within 5 to 10 percent of the optimal fixed
|
||||
balance. :l
|
||||
|
||||
When using offload with CPU hyperthreading disabled, it may help
|
||||
performance to use fewer MPI tasks and OpenMP threads than available
|
||||
cores. This is due to the fact that additional threads are generated
|
||||
internally to handle the asynchronous offload tasks. :l
|
||||
|
||||
If running short benchmark runs with dynamic load balancing, adding a
|
||||
short warm-up run (10-20 steps) will allow the load-balancer to find a
|
||||
near-optimal setting that will carry over to additional runs. :l
|
||||
|
||||
If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
|
||||
coprocessor, a diagnostic line is printed to the screen (not to the
|
||||
log file), during the setup phase of a run, indicating that offload
|
||||
mode is being used and indicating the number of coprocessor threads
|
||||
per MPI task. Additionally, an offload timing summary is printed at
|
||||
the end of each run. When offloading, the frequency for "atom
|
||||
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
|
||||
effectively sorted at every rebuild of the neighbor lists. :l
|
||||
|
||||
For simulations with long-range electrostatics or bond, angle,
|
||||
dihedral, improper calculations, computation and data transfer to the
|
||||
coprocessor will run concurrently with computations and MPI
|
||||
communications for these calculations on the host CPU. The USER-INTEL
|
||||
package has two modes for deciding which atoms will be handled by the
|
||||
coprocessor. This choice is controlled with the {ghost} keyword of
|
||||
the "package intel"_package.html command. When set to 0, ghost atoms
|
||||
(atoms at the borders between MPI tasks) are not offloaded to the
|
||||
card. This allows for overlap of MPI communication of forces with
|
||||
computation on the coprocessor when the "newton"_newton.html setting
|
||||
is "on". The default is dependent on the style being used, however,
|
||||
better performance may be achieved by setting this option
|
||||
explicitly. :l,ule
|
||||
|
||||
[Restrictions:]
|
||||
|
||||
When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
|
||||
that require skip lists for neighbor builds cannot be offloaded.
|
||||
Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel
|
||||
accelerated style may be used with hybrid styles.
|
||||
"Special_bonds"_special_bonds.html exclusion lists are not currently
|
||||
supported with offload; however, the same effect can often be
|
||||
accomplished by setting cutoffs for excluded atom types to 0. None of
|
||||
the pair styles in the USER-INTEL package currently support the
|
||||
"inner", "middle", "outer" options for rRESPA integration via the
|
||||
"run_style respa"_run_style.html command; only the "pair" option is
|
||||
supported.
|
|
@ -0,0 +1,426 @@
|
|||
<HTML>
|
||||
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
|
||||
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
|
||||
</CENTER>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<HR>
|
||||
|
||||
<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
|
||||
</P>
|
||||
<H4>5.3.4 KOKKOS package
|
||||
</H4>
|
||||
<P>The KOKKOS package was developed primarily by Christian Trott
|
||||
(Sandia) with contributions of various styles by others, including
|
||||
Sikandar Mashayak (UIUC). The underlying Kokkos library was written
|
||||
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
|
||||
Sandia).
|
||||
</P>
|
||||
<P>The KOKKOS package contains versions of pair, fix, and atom styles
|
||||
that use data structures and macros provided by the Kokkos library,
|
||||
which is included with LAMMPS in lib/kokkos.
|
||||
</P>
|
||||
<P>The Kokkos library is part of
|
||||
<A HREF = "http://trilinos.sandia.gov/packages/kokkos">Trilinos</A> and is a
|
||||
templated C++ library that provides two key abstractions for an
|
||||
application like LAMMPS. First, it allows a single implementation of
|
||||
an application kernel (e.g. a pair style) to run efficiently on
|
||||
different kinds of hardware, such as a GPU, Intel Phi, or many-core
|
||||
chip.
|
||||
</P>
|
||||
<P>The Kokkos library also provides data abstractions to adjust (at
|
||||
compile time) the memory layout of basic data structures like 2d and
|
||||
3d arrays and allow the transparent utilization of special hardware
|
||||
load and store operations. Such data structures are used in LAMMPS to
|
||||
store atom coordinates or forces or neighbor lists. The layout is
|
||||
chosen to optimize performance on different platforms. Again this
|
||||
functionality is hidden from the developer, and does not affect how
|
||||
the kernel is coded.
|
||||
</P>
|
||||
<P>These abstractions are set at build time, when LAMMPS is compiled with
|
||||
the KOKKOS package installed. This is done by selecting a "host" and
|
||||
"device" to build for, compatible with the compute nodes in your
|
||||
machine (one on a desktop machine or 1000s on a supercomputer).
|
||||
</P>
|
||||
<P>All Kokkos operations occur within the context of an individual MPI
|
||||
task running on a single node of the machine. The total number of MPI
|
||||
tasks used by LAMMPS (one or multiple per compute node) is set in the
|
||||
usual manner via the mpirun or mpiexec commands, and is independent of
|
||||
Kokkos.
|
||||
</P>
|
||||
<P>Kokkos provides support for two different modes of execution per MPI
|
||||
task. This means that computational tasks (pairwise interactions,
|
||||
neighbor list builds, time integration, etc) can be parallelized for
|
||||
one or the other of the two modes. The first mode is called the
|
||||
"host" and is one or more threads running on one or more physical CPUs
|
||||
(within the node). Currently, both multi-core CPUs and an Intel Phi
|
||||
processor (running in native mode, not offload mode like the
|
||||
USER-INTEL package) are supported. The second mode is called the
|
||||
"device" and is an accelerator chip of some kind. Currently only an
|
||||
NVIDIA GPU is supported. If your compute node does not have a GPU,
|
||||
then there is only one mode of execution, i.e. the host and device are
|
||||
the same.
|
||||
</P>
|
||||
<P>Here is a quick overview of how to use the KOKKOS package
|
||||
for GPU acceleration:
|
||||
</P>
|
||||
<UL><LI>specify variables and settings in your Makefile.machine that enable GPU, Phi, or OpenMP support
|
||||
<LI>include the KOKKOS package and build LAMMPS
|
||||
<LI>enable the KOKKOS package and its hardware options via the "-k on" command-line switch
|
||||
<LI>use KOKKOS styles in your input script
|
||||
</UL>
|
||||
<P>The latter two steps can be done using the "-k on", "-pk kokkos" and
|
||||
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switches</A>
|
||||
respectively. Or the effect of the "-pk" or "-sf" switches can be
|
||||
duplicated by adding the <A HREF = "package.html">package kokkos</A> or <A HREF = "suffix.html">suffix
|
||||
kk</A> commands respectively to your input script.
|
||||
</P>
|
||||
<P><B>Required hardware/software:</B>
|
||||
</P>
|
||||
<P>The KOKKOS package can be used to build and run LAMMPS on the
|
||||
following kinds of hardware:
|
||||
</P>
|
||||
<UL><LI>CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
|
||||
<LI>CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
|
||||
<LI>Phi: on one or more Intel Phi coprocessors (per node)
|
||||
<LI>GPU: on the GPUs of a node with additional OpenMP threading on the CPUs
|
||||
</UL>
|
||||
<P>Note that Intel Xeon Phi coprocessors are supported in "native" mode,
|
||||
not "offload" mode like the USER-INTEL package supports.
|
||||
</P>
|
||||
<P>Only NVIDIA GPUs are currently supported.
|
||||
</P>
|
||||
<P>IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
|
||||
you must have Kepler generation GPUs (or later). The Kokkos library
|
||||
exploits texture cache options not supported by Tesla generation GPUs
|
||||
(or older).
|
||||
</P>
|
||||
<P>To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
|
||||
installed on your system. See the discussion above for the USER-CUDA
|
||||
and GPU packages for details of how to check and do this.
|
||||
</P>
|
||||
<P><B>Building LAMMPS with the KOKKOS package:</B>
|
||||
</P>
|
||||
<P>Unlike other acceleration packages discussed in this section, the
|
||||
Kokkos library in lib/kokkos does not have to be pre-built before
|
||||
building LAMMPS itself. Instead, options for the Kokkos library are
|
||||
specified at compile time, when LAMMPS itself is built. This can be
|
||||
done in one of two ways, as discussed below.
|
||||
</P>
|
||||
<P>Here are examples of how to build LAMMPS for the different compute-node
|
||||
configurations listed above.
|
||||
</P>
|
||||
<P>CPU-only (run all-MPI or with OpenMP threading):
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-kokkos
|
||||
make g++ OMP=yes
|
||||
</PRE>
|
||||
<P>Intel Xeon Phi:
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-kokkos
|
||||
make g++ OMP=yes MIC=yes
|
||||
</PRE>
|
||||
<P>CPUs and GPUs:
|
||||
</P>
|
||||
<PRE>cd lammps/src
|
||||
make yes-kokkos
|
||||
make cuda CUDA=yes
|
||||
</PRE>
|
||||
<P>These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
|
||||
make command line which requires a GNU-compatible make command. Try
|
||||
"gmake" if your system's standard make complains.
|
||||
</P>
|
||||
<P>IMPORTANT NOTE: If you build using make line variables and re-build
|
||||
LAMMPS twice with different KOKKOS options and the *same* target,
|
||||
e.g. g++ in the first two examples above, then you *must* perform a
|
||||
"make clean-all" or "make clean-machine" before each build. This is
|
||||
to force all the KOKKOS-dependent files to be re-compiled with the new
|
||||
options.
|
||||
</P>
|
||||
<P>You can also hardwire these make variables in the specified machine
|
||||
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
|
||||
with a line like:
|
||||
</P>
|
||||
<PRE>MIC = yes
|
||||
</PRE>
|
||||
<P>Note that if you build LAMMPS multiple times in this manner, using
|
||||
different KOKKOS options (defined in different machine makefiles), you
|
||||
do not have to worry about doing a "clean" in between. This is
|
||||
because the targets will be different.
|
||||
</P>
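<P>For instance, a hypothetical src/MAKE/Makefile.phi could hardwire both
settings, so that a plain "make phi" builds a separate Phi executable with
no "clean" needed relative to builds of other targets:
</P>
<PRE>OMP = yes
MIC = yes
</PRE>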
|
||||
<P>IMPORTANT NOTE: The 3rd example above for a GPU uses a different
|
||||
machine makefile, in this case src/MAKE/Makefile.cuda, which is
|
||||
included in the LAMMPS distribution. To build the KOKKOS package for
|
||||
a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must
|
||||
have a CCFLAGS -arch setting that is appropriate for your NVIDIA
|
||||
hardware and installed software. Typical values for -arch are given
|
||||
in <A HREF = "Section_start.html#start_3_4">Section 2.3.4</A> of the manual, as well
|
||||
as other settings that must be included in the machine makefile, if
|
||||
you create your own.
|
||||
</P>
|
||||
<P>There are other allowed options when building with the KOKKOS package.
|
||||
As above, they can be set either as variables on the make command line
|
||||
or in the machine makefile in the src/MAKE directory. See <A HREF = "Section_start.html#start_3_4">Section
|
||||
2.3.4</A> of the manual for details.
|
||||
</P>
|
||||
<P>IMPORTANT NOTE: Currently, there are no precision options with the
|
||||
KOKKOS package. All compilation and computation is performed in
|
||||
double precision.
|
||||
</P>
|
||||
<P><B>Run with the KOKKOS package from the command line:</B>
|
||||
</P>
|
||||
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
</P>
|
||||
<P>When using KOKKOS built with host=OMP, you need to choose how many
|
||||
OpenMP threads per MPI task will be used (via the "-k" command-line
|
||||
switch discussed below). Note that the product of MPI tasks * OpenMP
|
||||
threads/task should not exceed the physical number of cores (on a
|
||||
node), otherwise performance will suffer.
|
||||
</P>
|
||||
<P>When using the KOKKOS package built with device=CUDA, you must use
|
||||
exactly one MPI task per physical GPU.
|
||||
</P>
|
||||
<P>When using the KOKKOS package built with host=MIC for Intel Xeon Phi
|
||||
coprocessor support, you need to ensure there are one or more MPI tasks
|
||||
per coprocessor, and choose the number of coprocessor threads to use
|
||||
per MPI task (via the "-k" command-line switch discussed below). The
|
||||
product of MPI tasks * coprocessor threads/task should not exceed the
|
||||
maximum number of threads the coprocessor is designed to run,
|
||||
otherwise performance will suffer. This value is 240 for current
|
||||
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
|
||||
threads/core. Note that with the KOKKOS package you do not need to
|
||||
specify how many Phi coprocessors there are per node; each
|
||||
coprocessor is simply treated as running some number of MPI tasks.
|
||||
</P>
|
||||
<P>You must use the "-k on" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> to enable the KOKKOS package. It
|
||||
takes additional arguments for hardware settings appropriate to your
|
||||
system. Those arguments are <A HREF = "Section_start.html#start_7">documented
|
||||
here</A>. The two most commonly used arguments
|
||||
are:
|
||||
</P>
|
||||
<PRE>-k on t Nt
|
||||
-k on g Ng
|
||||
</PRE>
|
||||
<P>The "t Nt" option applies to host=OMP (even if device=CUDA) and
|
||||
host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
|
||||
task to use within a node. For host=MIC, it specifies how many Xeon Phi
|
||||
threads per MPI task to use within a node. The default is Nt = 1.
|
||||
Note that for host=OMP this is effectively MPI-only mode which may be
|
||||
fine. But for host=MIC you will typically end up using far less than
|
||||
all the 240 available threads, which could give very poor performance.
|
||||
</P>
|
||||
<P>The "g Ng" option applies to device=CUDA. It specifies how many GPUs
|
||||
per compute node to use. The default is 1, so this only needs to be
|
||||
specified if you have 2 or more GPUs per compute node.
|
||||
</P>
|
||||
<P>The "-k on" switch also issues a default <A HREF = "package.html">package kokkos neigh full
|
||||
comm host</A> command which sets various KOKKOS options to
|
||||
default values, as discussed on the <A HREF = "package.html">package</A> command doc
|
||||
page.
|
||||
</P>
|
||||
<P>Use the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A>,
|
||||
which will automatically append "kk" to styles that support it. Use
|
||||
the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line switch</A> if
|
||||
you wish to override any of the default values set by the <A HREF = "package.html">package
|
||||
kokkos</A> command invoked by the "-k on" switch.
|
||||
</P>
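<P>As a sketch of the form of that switch (shown here with the same values
as the default command quoted above; substitute whichever settings the
<A HREF = "package.html">package</A> doc page describes):
</P>
<PRE>mpirun -np 12 lmp_g++ -k on -sf kk -pk kokkos neigh full comm host -in in.lj
</PRE>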
|
||||
<PRE>host=OMP, dual hex-core nodes (12 threads/node):
|
||||
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
|
||||
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
|
||||
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
|
||||
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
|
||||
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes
|
||||
</PRE>
|
||||
<PRE>host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
|
||||
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
|
||||
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
|
||||
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
|
||||
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis
|
||||
</PRE>
|
||||
<PRE>host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
|
||||
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
|
||||
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes
|
||||
</PRE>
|
||||
<PRE>host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
|
||||
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
|
||||
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes
|
||||
</PRE>
|
||||
<P><B>Or run with the KOKKOS package by editing an input script:</B>
|
||||
</P>
|
||||
<P>The discussion above for the mpirun/mpiexec command and setting
|
||||
appropriate thread and GPU values for host=OMP or host=MIC or
|
||||
device=CUDA are the same.
|
||||
</P>
|
||||
<P>You must still use the "-k on" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A> to enable the KOKKOS package, and
|
||||
specify its additional arguments for hardware options appropriate to
|
||||
your system, as documented above.
|
||||
</P>
|
||||
<P>Use the <A HREF = "suffix.html">suffix kk</A> command, or you can explicitly add a
|
||||
"kk" suffix to individual styles in your input script, e.g.
|
||||
</P>
|
||||
<PRE>pair_style lj/cut/kk 2.5
|
||||
</PRE>
|
||||
<P>You only need to use the <A HREF = "package.html">package kokkos</A> command if you
|
||||
wish to change any of its option defaults.
|
||||
</P>
|
||||
<P><B>Speed-ups to expect:</B>
|
||||
</P>
|
||||
<P>The performance of KOKKOS running in different modes is a function of
|
||||
your hardware, which KOKKOS-enabled styles are used, and the problem
|
||||
size.
|
||||
</P>
|
||||
<P>Generally speaking, the following rules of thumb apply:
|
||||
</P>
|
||||
<UL><LI>When running on CPUs only, with a single thread per MPI task,
|
||||
performance of a KOKKOS style is somewhere between the standard
|
||||
(un-accelerated) styles (MPI-only mode), and those provided by the
|
||||
USER-OMP package. However, the difference between all 3 is small (less
|
||||
than 20%).
|
||||
|
||||
<LI>When running on CPUs only, with multiple threads per MPI task,
|
||||
performance of a KOKKOS style is a bit slower than the USER-OMP
|
||||
package.
|
||||
|
||||
<LI>When running on GPUs, KOKKOS is typically faster than the USER-CUDA
|
||||
and GPU packages.
|
||||
|
||||
<LI>When running on Intel Xeon Phi, KOKKOS is not as fast as
|
||||
the USER-INTEL package, which is optimized for that hardware.
|
||||
</UL>
|
||||
<P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
|
||||
LAMMPS web site for performance of the KOKKOS package on different
|
||||
hardware.
|
||||
</P>
|
||||
<P><B>Guidelines for best performance:</B>
|
||||
</P>
|
||||
<P>Here are guidelines for using the KOKKOS package on the different
|
||||
hardware configurations listed above.
|
||||
</P>
|
||||
<P>Many of the guidelines use the <A HREF = "package.html">package kokkos</A> command.
|
||||
See its doc page for details and default settings. Experimenting with
|
||||
its options can provide a speed-up for specific calculations.
|
||||
</P>
|
||||
<P><B>Running on a multi-core CPU:</B>
|
||||
</P>
|
||||
<P>If N is the number of physical cores/node, then the number of MPI
|
||||
tasks/node * number of threads/task should not exceed N, and should
|
||||
typically equal N. Note that the default threads/task is 1, as set by
|
||||
the "t" keyword of the "-k" <A HREF = "Section_start.html#start_7">command-line
|
||||
switch</A>. If you do not change this, no
|
||||
additional parallelism (beyond MPI) will be invoked on the host
|
||||
CPU(s).
|
||||
</P>
|
||||
<P>You can compare the performance running in different modes:
|
||||
</P>
|
||||
<UL><LI>run with 1 MPI task/node and N threads/task
|
||||
<LI>run with N MPI tasks/node and 1 thread/task
|
||||
<LI>run with settings in between these extremes
|
||||
</UL>
|
||||
<P>Examples of mpirun commands in these modes are shown above.
|
||||
</P>
|
||||
<P>When using KOKKOS to perform multi-threading, it is important for
|
||||
performance to bind both MPI tasks to physical cores, and threads to
|
||||
physical cores, so they do not migrate during a simulation.
|
||||
</P>
|
||||
<P>If you are not certain MPI tasks are being bound (check the defaults
|
||||
for your MPI installation), binding can be forced with these flags:
|
||||
</P>
|
||||
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
|
||||
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
|
||||
</PRE>
|
||||
<P>For binding threads with the KOKKOS OMP option, use thread affinity
|
||||
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
|
||||
later, intel 12 or later) setting the environment variable
|
||||
OMP_PROC_BIND=true should be sufficient. For binding threads with the
|
||||
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option, as
|
||||
discussed in <A HREF = "Sections_start.html#start_3_4">Section 2.3.4</A> of the
|
||||
manual.
|
||||
</P>
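<P>Putting the two together, an illustrative fully-bound OpenMPI launch
(the task and thread counts are examples only) would be:
</P>
<PRE>export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 6 -sf kk -in in.lj
</PRE>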
|
||||
<P><B>Running on GPUs:</B>
|
||||
</P>
|
||||
<P>Ensure the -arch setting in the machine makefile you are using,
|
||||
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
|
||||
(see <A HREF = "Section_start.html#start_3_4">this section</A> of the manual for
|
||||
details).
|
||||
</P>
|
||||
<P>The -np setting of the mpirun command should set the number of MPI
|
||||
tasks/node to be equal to the # of physical GPUs on the node.
|
||||
</P>
|
||||
<P>Use the "-k" <A HREF = "Section_commands.html#start_7">command-line switch</A> to
|
||||
specify the number of GPUs per node, and the number of threads per MPI
|
||||
task. As above for multi-core CPUs (and no GPU), if N is the number
|
||||
of physical cores/node, then the number of MPI tasks/node * number of
|
||||
threads/task should not exceed N. With one GPU (and one MPI task) it
|
||||
may be faster to use fewer than all the available cores, by setting
|
||||
threads/task to a smaller value. This is because using all the cores
|
||||
on a dual-socket node will incur extra cost to copy memory from the
|
||||
2nd socket to the GPU.
|
||||
</P>
|
||||
<P>Examples of mpirun commands that follow these rules are shown above.
|
||||
</P>
|
||||
<P>IMPORTANT NOTE: When using a GPU, you will achieve the best
|
||||
performance if your input script does not use any fix or compute
|
||||
styles which are not yet Kokkos-enabled. This allows data to stay on
|
||||
the GPU for multiple timesteps, without being copied back to the host
|
||||
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
|
||||
<A HREF = "thermo_style.html">thermo</A> or <A HREF = "dump.html">dump</A> output will cause data
|
||||
to be copied back to the CPU.
|
||||
</P>
|
||||
<P>You cannot yet assign multiple MPI tasks to the same GPU with the
|
||||
KOKKOS package. We plan to support this in the future, similar to the
|
||||
GPU package in LAMMPS.
|
||||
</P>
|
||||
<P>You cannot yet use both the host (multi-threaded) and device (GPU)
|
||||
together to compute pairwise interactions with the KOKKOS package. We
|
||||
hope to support this in the future, similar to the GPU package in
|
||||
LAMMPS.
|
||||
</P>
|
||||
<P><B>Running on an Intel Phi:</B>
|
||||
</P>
|
||||
<P>Kokkos only uses Intel Phi processors in their "native" mode, i.e.
|
||||
not hosted by a CPU.
|
||||
</P>
|
||||
<P>As illustrated above, build LAMMPS with OMP=yes (the default) and
|
||||
MIC=yes. The latter ensures code is correctly compiled for the Intel
|
||||
Phi. The OMP setting means OpenMP will be used for parallelization on
|
||||
the Phi, which is currently the best option within Kokkos. In the
|
||||
future, other options may be added.
|
||||
</P>
|
||||
<P>Current-generation Intel Phi chips have either 61 or 57 cores. One
|
||||
core should be excluded for running the OS, leaving 60 or 56 cores.
|
||||
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
|
||||
N = 224 (4*56) cores to run on.
|
||||
</P>
|
||||
<P>The -np setting of the mpirun command sets the number of MPI
|
||||
tasks/node. The "-k on t Nt" command-line switch sets the number of
|
||||
threads/task as Nt. The product of these 2 values should be N, i.e.
|
||||
240 or 224. Also, the number of threads/task should be a multiple of
|
||||
4 so that logical threads from more than one MPI task do not run on
|
||||
the same physical core.
|
||||
</P>
|
||||
<P>Examples of mpirun commands that follow these rules are shown above.
|
||||
</P>
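<P>For a 57-core card, where 56 cores and N = 224 threads are usable,
analogous (illustrative) commands would be:
</P>
<PRE>mpirun -np 56 lmp_g++ -k on t 4 -sf kk -in in.lj     # 56 MPI tasks * 4 threads = 224
mpirun -np 14 lmp_g++ -k on t 16 -sf kk -in in.lj    # 14 MPI tasks * 16 threads = 224
</PRE>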
|
||||
<P><B>Restrictions:</B>
|
||||
</P>
|
||||
<P>As noted above, if using GPUs, the number of MPI tasks per compute
|
||||
node should equal the number of GPUs per compute node. In the
|
||||
future Kokkos will support assigning multiple MPI tasks to a single
|
||||
GPU.
|
||||
</P>
|
||||
<P>Currently Kokkos does not support AMD GPUs due to limits in the
|
||||
available backend programming models. Specifically, Kokkos requires
|
||||
extensive C++ support from the Kernel language. This is expected to
|
||||
change in the future.
|
||||
</P>
|
||||
</HTML>
|
|
@ -0,0 +1,422 @@
|
|||
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
||||
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
||||
|
||||
:link(lws,http://lammps.sandia.gov)
|
||||
:link(ld,Manual.html)
|
||||
:link(lc,Section_commands.html#comm)
|
||||
|
||||
:line
|
||||
|
||||
"Return to Section accelerate overview"_Section_accelerate.html
|
||||
|
||||
5.3.4 KOKKOS package :h4
|
||||
|
||||
The KOKKOS package was developed primarily by Christian Trott
|
||||
(Sandia) with contributions of various styles by others, including
|
||||
Sikandar Mashayak (UIUC). The underlying Kokkos library was written
|
||||
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
|
||||
Sandia).
|
||||
|
||||
The KOKKOS package contains versions of pair, fix, and atom styles
|
||||
that use data structures and macros provided by the Kokkos library,
|
||||
which is included with LAMMPS in lib/kokkos.
|
||||
|
||||
The Kokkos library is part of
|
||||
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a
|
||||
templated C++ library that provides two key abstractions for an
|
||||
application like LAMMPS. First, it allows a single implementation of
|
||||
an application kernel (e.g. a pair style) to run efficiently on
|
||||
different kinds of hardware, such as a GPU, Intel Phi, or many-core
|
||||
chip.
|
||||
|
||||
The Kokkos library also provides data abstractions to adjust (at
|
||||
compile time) the memory layout of basic data structures like 2d and
|
||||
3d arrays and allow the transparent utilization of special hardware
|
||||
load and store operations. Such data structures are used in LAMMPS to
|
||||
store atom coordinates or forces or neighbor lists. The layout is
|
||||
chosen to optimize performance on different platforms. Again this
|
||||
functionality is hidden from the developer, and does not affect how
|
||||
the kernel is coded.
|
||||
|
||||
These abstractions are set at build time, when LAMMPS is compiled with
|
||||
the KOKKOS package installed. This is done by selecting a "host" and
|
||||
"device" to build for, compatible with the compute nodes in your
|
||||
machine (one on a desktop machine or 1000s on a supercomputer).
|
||||
|
||||
All Kokkos operations occur within the context of an individual MPI
|
||||
task running on a single node of the machine. The total number of MPI
|
||||
tasks used by LAMMPS (one or multiple per compute node) is set in the
|
||||
usual manner via the mpirun or mpiexec commands, and is independent of
|
||||
Kokkos.
|
||||
|
||||
Kokkos provides support for two different modes of execution per MPI
|
||||
task. This means that computational tasks (pairwise interactions,
|
||||
neighbor list builds, time integration, etc) can be parallelized for
|
||||
one or the other of the two modes. The first mode is called the
|
||||
"host" and is one or more threads running on one or more physical CPUs
|
||||
(within the node). Currently, both multi-core CPUs and an Intel Phi
|
||||
processor (running in native mode, not offload mode like the
|
||||
USER-INTEL package) are supported. The second mode is called the
|
||||
"device" and is an accelerator chip of some kind. Currently only an
|
||||
NVIDIA GPU is supported. If your compute node does not have a GPU,
|
||||
then there is only one mode of execution, i.e. the host and device are
|
||||
the same.
|
||||
|
||||
Here is a quick overview of how to use the KOKKOS package
|
||||
for GPU acceleration:
|
||||
|
||||
specify variables and settings in your Makefile.machine that enable GPU, Phi, or OpenMP support
|
||||
include the KOKKOS package and build LAMMPS
|
||||
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
|
||||
use KOKKOS styles in your input script :ul
|
||||
|
||||
The latter two steps can be done using the "-k on", "-pk kokkos" and
|
||||
"-sf kk" "command-line switches"_Section_start.html#start_7
|
||||
respectively. Or the effect of the "-pk" or "-sf" switches can be
|
||||
duplicated by adding the "package kokkos"_package.html or "suffix
|
||||
kk"_suffix.html commands respectively to your input script.
|
||||
|
||||
[Required hardware/software:]
|
||||
|
||||
The KOKKOS package can be used to build and run LAMMPS on the
|
||||
following kinds of hardware:
|
||||
|
||||
CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
|
||||
CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
|
||||
Phi: on one or more Intel Phi coprocessors (per node)
|
||||
GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul
|
||||
|
||||
Note that Intel Xeon Phi coprocessors are supported in "native" mode,
|
||||
not "offload" mode like the USER-INTEL package supports.
|
||||
|
||||
Only NVIDIA GPUs are currently supported.
|
||||
|
||||
IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
|
||||
you must have Kepler generation GPUs (or later). The Kokkos library
|
||||
exploits texture cache options not supported by Tesla generation GPUs
|
||||
(or older).
|
||||
|
||||
To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
|
||||
installed on your system. See the discussion above for the USER-CUDA
|
||||
and GPU packages for details of how to check and do this.
|
||||
|
||||
[Building LAMMPS with the KOKKOS package:]
|
||||
|
||||
Unlike other acceleration packages discussed in this section, the
|
||||
Kokkos library in lib/kokkos does not have to be pre-built before
|
||||
building LAMMPS itself. Instead, options for the Kokkos library are
|
||||
specified at compile time, when LAMMPS itself is built. This can be
|
||||
done in one of two ways, as discussed below.
|
||||
|
||||
Here are examples of how to build LAMMPS for the different compute-node
|
||||
configurations listed above.
|
||||
|
||||
CPU-only (run all-MPI or with OpenMP threading):
|
||||
|
||||
cd lammps/src
|
||||
make yes-kokkos
|
||||
make g++ OMP=yes :pre
|
||||
|
||||
Intel Xeon Phi:
|
||||
|
||||
cd lammps/src
|
||||
make yes-kokkos
|
||||
make g++ OMP=yes MIC=yes :pre
|
||||
|
||||
CPUs and GPUs:
|
||||
|
||||
cd lammps/src
|
||||
make yes-kokkos
|
||||
make cuda CUDA=yes :pre
|
||||
|
||||
These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
|
||||
make command line which requires a GNU-compatible make command. Try
|
||||
"gmake" if your system's standard make complains.
|
||||
|
||||
IMPORTANT NOTE: If you build using make line variables and re-build
|
||||
LAMMPS twice with different KOKKOS options and the *same* target,
|
||||
e.g. g++ in the first two examples above, then you *must* perform a
|
||||
"make clean-all" or "make clean-machine" before each build. This is
|
||||
to force all the KOKKOS-dependent files to be re-compiled with the new
|
||||
options.
|
||||
|
||||
You can also hardwire these make variables in the specified machine
|
||||
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
|
||||
with a line like:
|
||||
|
||||
MIC = yes :pre
|
||||
|
||||
Note that if you build LAMMPS multiple times in this manner, using
|
||||
different KOKKOS options (defined in different machine makefiles), you
|
||||
do not have to worry about doing a "clean" in between. This is
|
||||
because the targets will be different.
|
||||
|
||||
IMPORTANT NOTE: The 3rd example above for a GPU uses a different
|
||||
machine makefile, in this case src/MAKE/Makefile.cuda, which is
|
||||
included in the LAMMPS distribution. To build the KOKKOS package for
|
||||
a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must
|
||||
have a CCFLAGS -arch setting that is appropriate for your NVIDIA
|
||||
hardware and installed software. Typical values for -arch are given
|
||||
in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well
|
||||
as other settings that must be included in the machine makefile, if
|
||||
you create your own.
|
||||
|
||||
There are other allowed options when building with the KOKKOS package.
|
||||
As above, they can be set either as variables on the make command line
|
||||
or in the machine makefile in the src/MAKE directory. See "Section
|
||||
2.3.4"_Section_start.html#start_3_4 of the manual for details.
|
||||
|
||||
IMPORTANT NOTE: Currently, there are no precision options with the
|
||||
KOKKOS package. All compilation and computation is performed in
|
||||
double precision.
|
||||
|
||||
[Run with the KOKKOS package from the command line:]
|
||||
|
||||
The mpirun or mpiexec command sets the total number of MPI tasks used
|
||||
by LAMMPS (one or multiple per compute node) and the number of MPI
|
||||
tasks used per node. E.g. the mpirun command does this via its -np
|
||||
and -ppn switches.
|
||||
|
||||
When using KOKKOS built with host=OMP, you need to choose how many
|
||||
OpenMP threads per MPI task will be used (via the "-k" command-line
|
||||
switch discussed below). Note that the product of MPI tasks * OpenMP
|
||||
threads/task should not exceed the physical number of cores (on a
|
||||
node), otherwise performance will suffer.
|
||||
|
||||
When using the KOKKOS package built with device=CUDA, you must use
|
||||
exactly one MPI task per physical GPU.
|
||||
|
||||
When using the KOKKOS package built with host=MIC for Intel Xeon Phi
|
||||
coprocessor support, you need to ensure there are one or more MPI tasks
|
||||
per coprocessor, and choose the number of coprocessor threads to use
|
||||
per MPI task (via the "-k" command-line switch discussed below). The
|
||||
product of MPI tasks * coprocessor threads/task should not exceed the
|
||||
maximum number of threads the coprocessor is designed to run,
|
||||
otherwise performance will suffer. This value is 240 for current
|
||||
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
|
||||
threads/core. Note that with the KOKKOS package you do not need to
|
||||
specify how many Phi coprocessors there are per node; each
|
||||
coprocessor is simply treated as running some number of MPI tasks.
|
||||
|
||||
You must use the "-k on" "command-line
|
||||
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
|
||||
takes additional arguments for hardware settings appropriate to your
|
||||
system. Those arguments are "documented
|
||||
here"_Section_start.html#start_7. The two most commonly used arguments
|
||||
are:
|
||||
|
||||
-k on t Nt
|
||||
-k on g Ng :pre
|
||||
|
||||
The "t Nt" option applies to host=OMP (even if device=CUDA) and
|
||||
host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
|
||||
task to use within a node. For host=MIC, it specifies how many Xeon Phi
|
||||
threads per MPI task to use within a node. The default is Nt = 1.
|
||||
Note that for host=OMP this is effectively MPI-only mode which may be
|
||||
fine. But for host=MIC you will typically end up using far less than
|
||||
all the 240 available threads, which could give very poor performance.
|
||||
|
||||
The "g Ng" option applies to device=CUDA. It specifies how many GPUs
|
||||
per compute node to use. The default is 1, so this only needs to be
|
||||
specified if you have 2 or more GPUs per compute node.
|
||||
|
||||
The "-k on" switch also issues a default "package kokkos neigh full
|
||||
comm host"_package.html command which sets various KOKKOS options to
|
||||
default values, as discussed on the "package"_package.html command doc
|
||||
page.
|
||||
|
||||
Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
|
||||
which will automatically append "kk" to styles that support it. Use
|
||||
the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
|
||||
you wish to override any of the default values set by the "package
|
||||
kokkos"_package.html command invoked by the "-k on" switch.
|
||||
|
||||
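For example, the package options can also be spelled out explicitly on
the command line via "-pk" (a sketch; the values shown simply restate
the documented defaults, see the "package"_package.html doc page for
the full list of keywords):

mpirun -np 2 lmp_g++ -k on t 6 -sf kk -pk kokkos neigh full comm host -in in.lj :pre
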
host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis :pre

host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes :pre

host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes :pre

[Or run with the KOKKOS package by editing an input script:]

The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA is the same.

You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.

Use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults.

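For example, a minimal input script fragment might look like this (a
sketch; the package command just restates the default settings shown
above and can be omitted or changed as needed):

package kokkos neigh full comm host
suffix kk
pair_style lj/cut 2.5 :pre
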
[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However, the difference between all 3 is small (less
than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l

When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l,ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command.
See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the "-k" "command-line
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes
option, as discussed in "Section 2.3.4"_Section_start.html#start_3_4
of the manual.

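For example, combining MPI task binding and OpenMP thread binding
might look like this (a sketch; the exact flags depend on your MPI
installation):

export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj :pre
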
[Running on GPUs:]

Ensure the -arch setting in the machine makefile you are using,
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
(see "this section"_Section_start.html#start_3_4 of the manual for
details).

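For example, the setting is typically an nvcc flag of the following
form (a sketch only; the variable name and its placement in your
Makefile.cuda may differ, and sm_35 is an assumed compute capability
that must match your actual GPU):

CCFLAGS = -O3 -arch=sm_35 :pre
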
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package. We plan to support this in the future, similar to the
GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and
MIC=yes. The latter ensures code is correctly compiled for the Intel
Phi. The OMP setting means OpenMP will be used for parallelization on
the Phi, which is currently the best option within Kokkos. In the
future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.

Examples of mpirun commands that follow these rules are shown above.

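As one additional sketch (not among the examples above), 60 MPI tasks
with 4 threads each also satisfies both rules on a 61-core Phi:

mpirun -np 60 lmp_g++ -k on t 4 -sf kk -in in.lj # 60 MPI tasks on 1 Phi, 60*4 = 240 :pre
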
[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute
node should equal the number of GPUs per compute node. In the
future Kokkos will support assigning multiple MPI tasks to a single
GPU.

Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models. Specifically, Kokkos requires
extensive C++ support from the kernel language. This is expected to
change in the future.

@ -0,0 +1,197 @@
<HTML>
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>





<HR>

<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
</P>
<H4>5.3.5 USER-OMP package
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.
</P>
<P>Here is a quick overview of how to use the USER-OMP package:
</P>
<UL><LI>use the -fopenmp flag for compiling and linking in your Makefile.machine
<LI>include the USER-OMP package and build LAMMPS
<LI>use the mpirun command to set the number of MPI tasks/node
<LI>specify how many threads per MPI task to use
<LI>use USER-OMP styles in your input script
</UL>
<P>The latter two steps can be done using the "-pk omp" and "-sf omp"
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix omp</A> commands
respectively to your input script.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>Include the package and build LAMMPS:
</P>
<PRE>cd lammps/src
make yes-user-omp
make machine
</PRE>
<P>Your src/MAKE/Makefile.machine needs a flag for OpenMP support in both
the CCFLAGS and LINKFLAGS variables. For GNU and Intel compilers,
this flag is "-fopenmp". Without this flag the USER-OMP styles will
still be compiled and work, but will not support multi-threading.
</P>
<P><B>Run with the USER-OMP package from the command line:</B>
</P>
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command does this via its -np
and -ppn switches.
</P>
<P>You need to choose how many threads per MPI task will be used by the
USER-OMP package. Note that the product of MPI tasks * threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.
</P>
<P>Use the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A>,
which will automatically append "omp" to styles that support it. Use
the "-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A> to
set Nt = # of OpenMP threads per MPI task to use.
</P>
<PRE>lmp_machine -sf omp -pk omp 16 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script # ditto on 8 16-core nodes
</PRE>
<P>Note that if the "-sf omp" switch is used, it also issues a default
<A HREF = "package.html">package omp 0</A> command, which sets the number of threads
per MPI task via the OMP_NUM_THREADS environment variable.
</P>
<P>Using the "-pk" switch explicitly allows for direct setting of the
number of threads and additional options. Its syntax is the same as
the "package omp" command. See the <A HREF = "package.html">package</A> command doc
page for details, including the default values used for all its
options if it is not specified, and how to set the number of threads
via the OMP_NUM_THREADS environment variable if desired.
</P>
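<P>For example, a hypothetical way to set the thread count via the
environment instead of the "-pk" switch (a sketch; adjust the count
for your machine):
</P>
<PRE>export OMP_NUM_THREADS=4
mpirun -np 4 lmp_machine -sf omp -in in.script # 4 MPI tasks, each with 4 threads
</PRE>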
<P><B>Or run with the USER-OMP package by editing an input script:</B>
</P>
<P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
and threads/MPI task is the same.
</P>
<P>Use the <A HREF = "suffix.html">suffix omp</A> command, or you can explicitly add an
"omp" suffix to individual styles in your input script, e.g.
</P>
<PRE>pair_style lj/cut/omp 2.5
</PRE>
<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
USER-OMP package, unless the "-sf omp" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line
switches</A> were used. It specifies how many
threads per MPI task to use, as well as other options. Its doc page
explains how to set the number of threads via an environment variable
if desired.
</P>
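<P>For example, a minimal input script fragment might look like this (a
sketch, assuming 4 threads per MPI task is appropriate for your node):
</P>
<PRE>package omp 4
suffix omp
pair_style lj/cut 2.5
</PRE>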
<P><B>Speed-ups to expect:</B>
</P>
<P>Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.
</P>
<P>You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread per MPI
task, versus running standard LAMMPS with its standard
(un-accelerated) styles (in serial or all-MPI parallelization with 1
task/core). This is because many of the USER-OMP styles contain
similar optimizations to those used in the OPT package, as described
above.
</P>
<P>With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
</P>
<P>A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.
</P>
<P>Using multiple threads/task can be more effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad-core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores.

<LI>The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes.

<LI>The system has a spatially inhomogeneous particle density which does
not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space.

<LI>A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
electrostatics on large numbers of nodes. The scaling of the KSpace
calculation (see the <A HREF = "kspace_style.html">kspace_style</A> command) becomes
the performance-limiting factor. Using multi-threading allows fewer
MPI tasks to be invoked and can speed-up the long-range solver, while
increasing overall performance by parallelizing the pairwise and
bonded calculations via OpenMP. Likewise, additional speedup can
sometimes be achieved by increasing the length of the Coulombic cutoff
and thus reducing the work done by the long-range solver. Using the
<A HREF = "run_style.html">run_style verlet/split</A> command, which is compatible
with the USER-OMP package, is an alternative way to reduce the number
of MPI tasks assigned to the KSpace calculation.
</UL>
<P>Additional performance tips are as follows:
</P>
<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.

<LI>It is usually most efficient to restrict threading to a single
socket, i.e. use one or more MPI tasks per socket.

<LI>Several current MPI implementations by default use a processor affinity
setting that restricts each MPI task to a single CPU core. Using
multi-threading in this mode will force the threads to share that core
and thus is likely to be counterproductive. Instead, binding MPI
tasks to a (multi-core) socket should solve this issue (see the sketch
after this list).
</UL>
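<P>For example, binding MPI tasks to sockets while multi-threading might
look like this (a sketch; the exact flags depend on your MPI
installation):
</P>
<PRE>export OMP_NUM_THREADS=8
mpirun -np 2 -bind-to socket -map-by socket lmp_machine -sf omp -in in.script
</PRE>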
<P><B>Restrictions:</B>
</P>
<P>None.
</P>
</HTML>
@ -0,0 +1,192 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.5 USER-OMP package :h4

The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles. The package currently
uses the OpenMP interface for multi-threading.

Here is a quick overview of how to use the USER-OMP package:

use the -fopenmp flag for compiling and linking in your Makefile.machine
include the USER-OMP package and build LAMMPS
use the mpirun command to set the number of MPI tasks/node
specify how many threads per MPI task to use
use USER-OMP styles in your input script :ul

The latter two steps can be done using the "-pk omp" and "-sf omp"
"command-line switches"_Section_start.html#start_7 respectively. Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package omp"_package.html or "suffix omp"_suffix.html commands
respectively to your input script.

[Required hardware/software:]

Your compiler must support the OpenMP interface. You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

Include the package and build LAMMPS:

cd lammps/src
make yes-user-omp
make machine :pre

Your src/MAKE/Makefile.machine needs a flag for OpenMP support in both
the CCFLAGS and LINKFLAGS variables. For GNU and Intel compilers,
this flag is "-fopenmp". Without this flag the USER-OMP styles will
still be compiled and work, but will not support multi-threading.

[Run with the USER-OMP package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command does this via its -np
and -ppn switches.

You need to choose how many threads per MPI task will be used by the
USER-OMP package. Note that the product of MPI tasks * threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.

Use the "-sf omp" "command-line switch"_Section_start.html#start_7,
which will automatically append "omp" to styles that support it. Use
the "-pk omp Nt" "command-line switch"_Section_start.html#start_7 to
set Nt = # of OpenMP threads per MPI task to use.

lmp_machine -sf omp -pk omp 16 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre

Note that if the "-sf omp" switch is used, it also issues a default
"package omp 0"_package.html command, which sets the number of threads
per MPI task via the OMP_NUM_THREADS environment variable.

Using the "-pk" switch explicitly allows for direct setting of the
number of threads and additional options. Its syntax is the same as
the "package omp" command. See the "package"_package.html command doc
page for details, including the default values used for all its
options if it is not specified, and how to set the number of threads
via the OMP_NUM_THREADS environment variable if desired.

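For example, a hypothetical way to set the thread count via the
environment instead of the "-pk" switch (a sketch; adjust the count
for your machine):

export OMP_NUM_THREADS=4
mpirun -np 4 lmp_machine -sf omp -in in.script # 4 MPI tasks, each with 4 threads :pre
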
[Or run with the USER-OMP package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node,
and threads/MPI task is the same.

Use the "suffix omp"_suffix.html command, or you can explicitly add an
"omp" suffix to individual styles in your input script, e.g.

pair_style lj/cut/omp 2.5 :pre

You must also use the "package omp"_package.html command to enable the
USER-OMP package, unless the "-sf omp" or "-pk omp" "command-line
switches"_Section_start.html#start_7 were used. It specifies how many
threads per MPI task to use, as well as other options. Its doc page
explains how to set the number of threads via an environment variable
if desired.

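For example, a minimal input script fragment might look like this (a
sketch, assuming 4 threads per MPI task is appropriate for your node):

package omp 4
suffix omp
pair_style lj/cut 2.5 :pre
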
[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread per MPI
task, versus running standard LAMMPS with its standard
(un-accelerated) styles (in serial or all-MPI parallelization with 1
task/core). This is because many of the USER-OMP styles contain
similar optimizations to those used in the OPT package, as described
above.

With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.

A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task. This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package. The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following
circumstances:

Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad-core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node. For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers. As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides. This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out. For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes. The scaling of the KSpace
calculation (see the "kspace_style"_kspace_style.html command) becomes
the performance-limiting factor. Using multi-threading allows fewer
MPI tasks to be invoked and can speed-up the long-range solver, while
increasing overall performance by parallelizing the pairwise and
bonded calculations via OpenMP. Likewise, additional speedup can
sometimes be achieved by increasing the length of the Coulombic cutoff
and thus reducing the work done by the long-range solver. Using the
"run_style verlet/split"_run_style.html command, which is compatible
with the USER-OMP package, is an alternative way to reduce the number
of MPI tasks assigned to the KSpace calculation. :l,ule

Additional performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

It is usually most efficient to restrict threading to a single
socket, i.e. use one or more MPI tasks per socket. :l

Several current MPI implementations by default use a processor affinity
setting that restricts each MPI task to a single CPU core. Using
multi-threading in this mode will force the threads to share that core
and thus is likely to be counterproductive. Instead, binding MPI
tasks to a (multi-core) socket should solve this issue (see the sketch
after this list). :l,ule

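For example, binding MPI tasks to sockets while multi-threading might
look like this (a sketch; the exact flags depend on your MPI
installation):

export OMP_NUM_THREADS=8
mpirun -np 2 -bind-to socket -map-by socket lmp_machine -sf omp -in in.script :pre
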
[Restrictions:]

None.
@ -0,0 +1,77 @@
<HTML>
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>





<HR>

<P><A HREF = "Section_accelerate.html">Return to Section accelerate</A>
</P>
<H4>5.3.6 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>Here is a quick overview of how to use the OPT package:
</P>
<UL><LI>include the OPT package and build LAMMPS
<LI>use OPT pair styles in your input script
</UL>
<P>The last step can be done using the "-sf opt" <A HREF = "Section_start.html#start_7">command-line
switch</A>. Or the effect of the "-sf" switch
can be duplicated by adding a <A HREF = "suffix.html">suffix opt</A> command to your
input script.
</P>
<P><B>Required hardware/software:</B>
</P>
<P>None.
</P>
<P><B>Building LAMMPS with the OPT package:</B>
</P>
<P>Include the package and build LAMMPS:
</P>
<PRE>cd lammps/src
make yes-opt
make machine
</PRE>
<P>No additional compile/link flags are needed in your Makefile.machine
in src/MAKE.
</P>
<P><B>Run with the OPT package from the command line:</B>
</P>
<P>Use the "-sf opt" <A HREF = "Section_start.html#start_7">command-line switch</A>,
which will automatically append "opt" to styles that support it.
</P>
<PRE>lmp_machine -sf opt -in in.script
mpirun -np 4 lmp_machine -sf opt -in in.script
</PRE>
<P><B>Or run with the OPT package by editing an input script:</B>
</P>
<P>Use the <A HREF = "suffix.html">suffix opt</A> command, or you can explicitly add an
"opt" suffix to individual styles in your input script, e.g.
</P>
<PRE>pair_style lj/cut/opt 2.5
</PRE>
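<P>Or, as a sketch, use the suffix command once and leave the style names
unchanged:
</P>
<PRE>suffix opt
pair_style lj/cut 2.5
</PRE>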
<P><B>Speed-ups to expect:</B>
</P>
<P>You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>None. Just try out an OPT pair style to see how it performs.
</P>
<P><B>Restrictions:</B>
</P>
<P>None.
</P>
</HTML>
@ -0,0 +1,72 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate"_Section_accelerate.html

5.3.6 OPT package :h4

The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.

Here is a quick overview of how to use the OPT package:

include the OPT package and build LAMMPS
use OPT pair styles in your input script :ul

The last step can be done using the "-sf opt" "command-line
switch"_Section_start.html#start_7. Or the effect of the "-sf" switch
can be duplicated by adding a "suffix opt"_suffix.html command to your
input script.

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

Include the package and build LAMMPS:

cd lammps/src
make yes-opt
make machine :pre

No additional compile/link flags are needed in your Makefile.machine
in src/MAKE.

[Run with the OPT package from the command line:]

Use the "-sf opt" "command-line switch"_Section_start.html#start_7,
which will automatically append "opt" to styles that support it.

lmp_machine -sf opt -in in.script
mpirun -np 4 lmp_machine -sf opt -in in.script :pre

[Or run with the OPT package by editing an input script:]

Use the "suffix opt"_suffix.html command, or you can explicitly add an
"opt" suffix to individual styles in your input script, e.g.

pair_style lj/cut/opt 2.5 :pre

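Or, as a sketch, use the suffix command once and leave the style names
unchanged:

suffix opt
pair_style lj/cut 2.5 :pre
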
[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end
of a run. On most machines for reasonable problem sizes, it will be a
5 to 20% savings.

[Guidelines for best performance:]

None. Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.
@ -101,9 +101,16 @@ package intel * mixed balance -1
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.
</P>
<P>Talk about command line switches
<P>If allows calling multiple times, all options set to their
defaults, whether specified or not.
</P>
<P>When does it have to be invoked
<P>Talk about command line switch -pk as alternate option.
</P>
<P>Which packages require it to be invoked, only CUDA
this is b/c can only be invoked once
vs optional: all others? and allow multiple invokes
</P>
<P>Must be invoked early in script, before simulation box is defined.
</P>
<P>To use the accelerated GPU and USER-OMP styles, the use of the package
command is required. However, as described in the "Defaults" section
@ -120,7 +127,8 @@ need to use the package command if you want to change the defaults.
more details about using these various packages for accelerating
LAMMPS calculations.
</P>
<P>Package GPU always sets newton pair off. Not so for USER-CUDA>
<P>Package GPU always sets newton pair off. Not so for USER-CUDA
add newton options to GPU, CUDA, KOKKOS.
</P>
<HR>

@ -95,9 +95,16 @@ This command invokes package-specific settings. Currently the
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.

Talk about command line switches
If allows calling multiple times, all options set to their
defaults, whether specified or not.

When does it have to be invoked
Talk about command line switch -pk as alternate option.

Which packages require it to be invoked, only CUDA
this is b/c can only be invoked once
vs optional: all others? and allow multiple invokes

Must be invoked early in script, before simulation box is defined.

To use the accelerated GPU and USER-OMP styles, the use of the package
command is required. However, as described in the "Defaults" section
@ -114,7 +121,8 @@ See "Section_accelerate"_Section_accelerate.html of the manual for
more details about using these various packages for accelerating
LAMMPS calculations.

Package GPU always sets newton pair off. Not so for USER-CUDA>
Package GPU always sets newton pair off. Not so for USER-CUDA
add newton options to GPU, CUDA, KOKKOS.

:line
