From 2bb69773d3bf7e342ce4747281027684fad34ecf Mon Sep 17 00:00:00 2001
From: Stan Moore <stamoor@sandia.gov>
Date: Mon, 8 Apr 2019 13:07:29 -0600
Subject: [PATCH] Update Kokkos docs

---
 doc/src/Build_extras.txt |   5 +-
 doc/src/Speed_kokkos.txt |  61 +++++++--------
 doc/src/package.txt      | 164 ++++++++++++++++++++-------------------
 3 files changed, 115 insertions(+), 115 deletions(-)

diff --git a/doc/src/Build_extras.txt b/doc/src/Build_extras.txt
index 17d18243f2..3b9da2db39 100644
--- a/doc/src/Build_extras.txt
+++ b/doc/src/Build_extras.txt
@@ -247,7 +247,10 @@ Maxwell50 = NVIDIA Maxwell generation CC 5.0
 Maxwell52 = NVIDIA Maxwell generation CC 5.2
 Maxwell53 = NVIDIA Maxwell generation CC 5.3
 Pascal60 = NVIDIA Pascal generation CC 6.0
-Pascal61 = NVIDIA Pascal generation CC 6.1 :ul
+Pascal61 = NVIDIA Pascal generation CC 6.1
+Volta70 = NVIDIA Volta generation CC 7.0
+Volta72 = NVIDIA Volta generation CC 7.2
+Turing75 = NVIDIA Turing generation CC 7.5 :ul
 
 [CMake build]:
 
diff --git a/doc/src/Speed_kokkos.txt b/doc/src/Speed_kokkos.txt
index d04f8ac6f1..1750d2400f 100644
--- a/doc/src/Speed_kokkos.txt
+++ b/doc/src/Speed_kokkos.txt
@@ -111,16 +111,10 @@ Makefile.kokkos_mpi_only) will give better performance than the OpenMP
 back end (i.e. Makefile.kokkos_omp) because some of the overhead to make
 the code thread-safe is removed.
 
-NOTE: The default for the "package kokkos"_package.html command is to
-use "full" neighbor lists and set the Newton flag to "off" for both
-pairwise and bonded interactions. However, when running on CPUs, it
-will typically be faster to use "half" neighbor lists and set the
-Newton flag to "on", just as is the case for non-accelerated pair
-styles. It can also be faster to use non-threaded communication.  Use
-the "-pk kokkos" "command-line switch"_Run_options.html to change the
-default "package kokkos"_package.html options. See its doc page for
-details and default settings. Experimenting with its options can
-provide a speed-up for specific calculations. For example:
+NOTE: Use the "-pk kokkos" "command-line switch"_Run_options.html to 
+change the default "package kokkos"_package.html options. See its doc 
+page for details and default settings. Experimenting with its options 
+can provide a speed-up for specific calculations. For example: 
 
 mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj       # Newton on, Half neighbor list, non-threaded comm :pre
 
@@ -190,19 +184,18 @@ tasks/node. The "-k on t Nt" command-line switch sets the number of
 threads/task as Nt. The product of these two values should be N, i.e.
 256 or 264.
 
-NOTE: The default for the "package kokkos"_package.html command is to
-use "full" neighbor lists and set the Newton flag to "off" for both
-pairwise and bonded interactions. When running on KNL, this will
-typically be best for pair-wise potentials. For many-body potentials,
-using "half" neighbor lists and setting the Newton flag to "on" may be
-faster. It can also be faster to use non-threaded communication.  Use
-the "-pk kokkos" "command-line switch"_Run_options.html to change the
-default "package kokkos"_package.html options. See its doc page for
-details and default settings. Experimenting with its options can
-provide a speed-up for specific calculations. For example:
+NOTE: The default for the "package kokkos"_package.html command when 
+running on KNL is to use "half" neighbor lists and set the Newton flag 
+to "on" for both pairwise and bonded interactions. This will typically 
+be best for many-body potentials. For simpler pair-wise potentials, it 
+may be faster to use a "full" neighbor list and set the Newton flag to "off". 
+Use the "-pk kokkos" "command-line switch"_Run_options.html to change 
+the default "package kokkos"_package.html options. See its doc page for 
+details and default settings. Experimenting with its options can provide 
+a speed-up for specific calculations. For example: 
 
-mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj      #  Newton off, full neighbor list, non-threaded comm
-mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax      # Newton on, half neighbor list, non-threaded comm :pre
+mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.lj      #  Newton on, half neighbor list, threaded comm
+mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton off neigh full comm no -in in.reax      # Newton off, full neighbor list, non-threaded comm :pre
 
 NOTE: MPI tasks and threads should be bound to cores as described
 above for CPUs.
@@ -236,18 +229,18 @@ one or more nodes, each with two GPUs:
 mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj          # 1 node,   2 MPI tasks/node, 2 GPUs/node
 mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj  # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
 
-NOTE: The default for the "package kokkos"_package.html command is to
-use "full" neighbor lists and set the Newton flag to "off" for both
-pairwise and bonded interactions, along with threaded communication.
-When running on Maxwell or Kepler GPUs, this will typically be
-best. For Pascal GPUs, using "half" neighbor lists and setting the
-Newton flag to "on" may be faster. For many pair styles, setting the
-neighbor binsize equal to twice the CPU default value will give speedup,
-which is the default when running on GPUs.
-Use the "-pk kokkos" "command-line switch"_Run_options.html to change
-the default "package kokkos"_package.html options. See its doc page
-for details and default settings. Experimenting with its options can
-provide a speed-up for specific calculations. For example:
+NOTE: The default for the "package kokkos"_package.html command when 
+running on GPUs is to use "full" neighbor lists and set the Newton flag 
+to "off" for both pairwise and bonded interactions, along with threaded 
+communication. When running on Maxwell or Kepler GPUs, this will 
+typically be best. For Pascal GPUs, using "half" neighbor lists and 
+setting the Newton flag to "on" may be faster. For many pair styles, 
+setting the neighbor binsize equal to twice the CPU default value will 
+give a speedup, which is the default when running on GPUs. Use the "-pk 
+kokkos" "command-line switch"_Run_options.html to change the default 
+"package kokkos"_package.html options. See its doc page for details and 
+default settings. Experimenting with its options can provide a speed-up 
+for specific calculations. For example: 
 
 mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj      # Newton on, half neighbor list, set binsize = neighbor ghost cutoff :pre
 
diff --git a/doc/src/package.txt b/doc/src/package.txt
index 89cfd03f5f..94ee93a743 100644
--- a/doc/src/package.txt
+++ b/doc/src/package.txt
@@ -64,7 +64,7 @@ args = arguments specific to the style :l
       {no_affinity} values = none
   {kokkos} args = keyword value ...
     zero or more keyword/value pairs may be appended
-    keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse}
+    keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
       {neigh} value = {full} or {half}
         full = full neighbor list
         half = half neighbor list built in thread-safe manner
@@ -72,7 +72,7 @@ args = arguments specific to the style :l
         full = full neighbor list
         half = half neighbor list built in thread-safe manner
       {newton} = {off} or {on}
-        off = set Newton pairwise and bonded flags off (default)
+        off = set Newton pairwise and bonded flags off
         on = set Newton pairwise and bonded flags on
       {binsize} value = size
         size = bin size for neighbor list construction (distance units)
@@ -422,33 +422,35 @@ processes/threads used for LAMMPS.
 
 :line
 
-The {kokkos} style invokes settings associated with the use of the
-KOKKOS package.
+The {kokkos} style invokes settings associated with the use of the 
+KOKKOS package. 
 
-All of the settings are optional keyword/value pairs.  Each has a
-default value as listed below.
+All of the settings are optional keyword/value pairs. Each has a default 
+value as listed below. 
 
-The {neigh} keyword determines how neighbor lists are built.  A value
-of {half} uses a thread-safe variant of half-neighbor lists,
-the same as used by most pair styles in LAMMPS.
+The {neigh} keyword determines how neighbor lists are built. A value of 
+{half} uses a thread-safe variant of half-neighbor lists, the same as 
+used by most pair styles in LAMMPS. This is the default when running on 
+CPUs (i.e. CUDA backend not enabled). 
 
-A value of {full} uses a full neighbor lists and is the default.  This
-performs twice as much computation as the {half} option, however that
-is often a win because it is thread-safe and doesn't require atomic
-operations in the calculation of pair forces.  For that reason, {full}
-is the default setting.  However, when running in MPI-only mode with 1
-thread per MPI task, {half} neighbor lists will typically be faster,
-just as it is for non-accelerated pair styles. Similarly, the {neigh/qeq}
-keyword determines how neighbor lists are built for "fix qeq/reax/kk"_fix_qeq_reax.html.
-If not explicitly set, the value of {neigh/qeq} will match {neigh}.
+A value of {full} uses a full neighbor list and is the default when 
+running on GPUs. This performs twice as much computation as the {half} 
+option, however that is often a win because it is thread-safe and 
+doesn't require atomic operations in the calculation of pair forces. For 
+that reason, {full} is the default setting for GPUs. However, when 
+running on CPUs, a {half} neighbor list is the default because it is 
+often faster, just as it is for non-accelerated pair styles. Similarly, 
+the {neigh/qeq} keyword determines how neighbor lists are built for "fix 
+qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of 
+{neigh/qeq} will match {neigh}. 
 
-The {newton} keyword sets the Newton flags for pairwise and bonded
-interactions to {off} or {on}, the same as the "newton"_newton.html
-command allows.  The default is {off} because this will almost always
-give better performance for the KOKKOS package.  This means more
-computation is done, but less communication.  However, when running in
-MPI-only mode with 1 thread per MPI task, a value of {on} will
-typically be faster, just as it is for non-accelerated pair styles.
+The {newton} keyword sets the Newton flags for pairwise and bonded 
+interactions to {off} or {on}, the same as the "newton"_newton.html 
+command allows. The default for GPUs is {off} because this will almost 
+always give better performance for the KOKKOS package. This means more 
+computation is done, but less communication. However, when running on 
+CPUs a value of {on} is the default since it can often be faster, just 
+as it is for non-accelerated pair styles. 
 
 The {binsize} keyword sets the size of bins used to bin atoms in 
 neighbor list builds. The same value can be set by the "neigh_modify 
@@ -465,58 +467,58 @@ because the GPU is faster at performing pairwise interactions, then this
 rule of thumb may give too large a binsize and the default should be 
 overridden with a smaller value. 
 
-The {comm} and {comm/exchange} and {comm/forward} and {comm/reverse} keywords determine
-whether the host or device performs the packing and unpacking of data
-when communicating per-atom data between processors.  "Exchange"
-communication happens only on timesteps that neighbor lists are
-rebuilt.  The data is only for atoms that migrate to new processors.
-"Forward" communication happens every timestep. "Reverse" communication
-happens every timestep if the {newton} option is on.  The data is for atom
-coordinates and any other atom properties that needs to be updated for
-ghost atoms owned by each processor.
+The {comm} and {comm/exchange} and {comm/forward} and {comm/reverse} 
+keywords determine whether the host or device performs the packing and 
+unpacking of data when communicating per-atom data between processors. 
+"Exchange" communication happens only on timesteps that neighbor lists 
+are rebuilt. The data is only for atoms that migrate to new processors. 
+"Forward" communication happens every timestep. "Reverse" communication 
+happens every timestep if the {newton} option is on. The data is for 
+atom coordinates and any other atom properties that need to be updated 
+for ghost atoms owned by each processor. 
 
-The {comm} keyword is simply a short-cut to set the same value
-for both the {comm/exchange} and {comm/forward} and {comm/reverse} keywords.
+The {comm} keyword is simply a short-cut to set the same value for all of 
+the {comm/exchange} and {comm/forward} and {comm/reverse} keywords. 
 
-The value options for all 3 keywords are {no} or {host} or {device}.
-A value of {no} means to use the standard non-KOKKOS method of
-packing/unpacking data for the communication.  A value of {host} means
-to use the host, typically a multi-core CPU, and perform the
-packing/unpacking in parallel with threads.  A value of {device}
-means to use the device, typically a GPU, to perform the
-packing/unpacking operation.
+The value options for all 3 keywords are {no} or {host} or {device}. A 
+value of {no} means to use the standard non-KOKKOS method of 
+packing/unpacking data for the communication. A value of {host} means to 
+use the host, typically a multi-core CPU, and perform the 
+packing/unpacking in parallel with threads. A value of {device} means to 
+use the device, typically a GPU, to perform the packing/unpacking 
+operation. 
 
-The optimal choice for these keywords depends on the input script and
-the hardware used.  The {no} value is useful for verifying that the
-Kokkos-based {host} and {device} values are working correctly.
-It may also be the fastest choice when using Kokkos styles in
-MPI-only mode (i.e. with a thread count of 1).
+The optimal choice for these keywords depends on the input script and 
+the hardware used. The {no} value is useful for verifying that the 
+Kokkos-based {host} and {device} values are working correctly. It is the 
+default when running on CPUs since it is usually the fastest. 
 
-When running on CPUs or Xeon Phi, the {host} and {device} values work
-identically.  When using GPUs, the {device} value will typically be
-optimal if all of your styles used in your input script are supported
-by the KOKKOS package.  In this case data can stay on the GPU for many
-timesteps without being moved between the host and GPU, if you use the
-{device} value.  This requires that your MPI is able to access GPU
-memory directly.  Currently that is true for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.  If your script uses
-styles (e.g. fixes) which are not yet supported by the KOKKOS package,
-then data has to be move between the host and device anyway, so it is
-typically faster to let the host handle communication, by using the
-{host} value.  Using {host} instead of {no} will enable use of
-multiple threads to pack/unpack communicated data.
+When running on CPUs or Xeon Phi, the {host} and {device} values work 
+identically. When using GPUs, the {device} value is the default since it 
+will typically be optimal if all of the styles used in your input 
+script are supported by the KOKKOS package. In this case data can stay 
+on the GPU for many timesteps without being moved between the host and 
+GPU, if you use the {device} value. This requires that your MPI is able 
+to access GPU memory directly. Currently that is true for OpenMPI 1.8 
+(or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your 
+script uses styles (e.g. fixes) which are not yet supported by the 
+KOKKOS package, then data has to be moved between the host and device 
+anyway, so it is typically faster to let the host handle communication, 
+by using the {host} value. Using {host} instead of {no} will enable use 
+of multiple threads to pack/unpack communicated data. 
 
-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
-this keyword is set to {on}, buffers in GPU memory are passed directly
-through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However GPU-direct is not supported on all
-systems, which can lead to segmentation faults and would require
-using a value of {off}. If LAMMPS can safely detect that GPU-direct is
-not available (currently only possible with OpenMPI v2.0.0 or later),
-then the {gpu/direct} keyword is automatically set to {off} by default.
-When the {gpu/direct} keyword is set to {off} while any of the {comm}
-keywords are set to {device}, the value for these {comm} keywords will
-be automatically changed to {host}.
+The {gpu/direct} keyword chooses whether GPU-direct will be used. When 
+this keyword is set to {on}, buffers in GPU memory are passed directly 
+through MPI send/receive calls. This reduces overhead of first copying 
+the data to the host CPU. However, GPU-direct is not supported on all 
+systems, which can lead to segmentation faults and would require using a 
+value of {off}. If LAMMPS can safely detect that GPU-direct is not 
+available (currently only possible with OpenMPI v2.0.0 or later), then 
+the {gpu/direct} keyword is automatically set to {off} by default. When 
+the {gpu/direct} keyword is set to {off} while any of the {comm} 
+keywords are set to {device}, the value for these {comm} keywords will 
+be automatically changed to {host}. This setting has no effect if not 
+running on GPUs.
 
 :line
 
@@ -623,14 +625,16 @@ not used, you must invoke the package intel command in your input
 script or or via the "-pk intel" "command-line
 switch"_Run_options.html.
 
-For the KOKKOS package, the option defaults neigh = full, neigh/qeq = 
-full, newton = off, binsize for CPUs = 0.0, binsize for GPUs = 2x LAMMPS 
-default value, and comm = device, gpu/direct = on. When LAMMPS can 
-safely detect, that GPU-direct is not available, the default value of 
-gpu/direct becomes "off". These settings are made automatically by the 
-required "-k on" "command-line switch"_Run_options.html. You can change 
-them by using the package kokkos command in your input script or via the 
-"-pk kokkos command-line switch"_Run_options.html. 
+For the KOKKOS package, the option defaults for GPUs are neigh = full, 
+neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default 
+value, comm = device, gpu/direct = on. When LAMMPS can safely detect 
+that GPU-direct is not available, the default value of gpu/direct 
+becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
+half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. These 
+settings are made automatically by the required "-k on" "command-line 
+switch"_Run_options.html. You can change them by using the package 
+kokkos command in your input script or via the "-pk kokkos" "command-line 
+switch"_Run_options.html.
 
 For the OMP package, the default is Nthreads = 0 and the option
 defaults are neigh = yes.  These settings are made automatically if