Merge pull request #748 from stanmoore1/kk_docs

Update Kokkos docs
Steve Plimpton 2018-01-08 09:15:36 -07:00 committed by GitHub
commit cc9b6118b8
1 changed file with 340 additions and 375 deletions

@ -11,336 +11,346 @@
5.3.3 KOKKOS package :h5
Kokkos is a templated C++ library that provides abstractions to allow
a single implementation of an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as GPUs, Intel Xeon Phis, or many-core
CPUs. Kokkos maps the C++ kernel onto different backend languages such as CUDA, OpenMP, or Pthreads.
The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of data structures like 2d and
3d arrays to optimize performance on different hardware. For more information on Kokkos, see
"Github"_https://github.com/kokkos/kokkos. Kokkos is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos library was written primarily by Carter Edwards,
Christian Trott, and Dan Sunderland (all Sandia).
The LAMMPS KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in /lib/kokkos. The KOKKOS package was developed primarily by Christian Trott (Sandia)
and Stan Moore (Sandia) with contributions of various styles by others, including Sikandar
Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez (Sandia). For more
information on developing with Kokkos abstractions, see the Kokkos
programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.
Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP (threading
for many-core CPUs and Intel Phi), and CUDA (for NVIDIA GPUs). You choose the mode at build time to
produce an executable compatible with specific hardware.
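As a minimal sketch, the mode corresponds to the KOKKOS_DEVICES
setting in the machine Makefile (the provided Makefiles in
/src/MAKE/OPTIONS/ already set a suitable value):
KOKKOS_DEVICES=Serial        # MPI-only, no threading
KOKKOS_DEVICES=OpenMP        # OpenMP threading on many-core CPUs or Intel Phi
KOKKOS_DEVICES=Cuda,OpenMP   # CUDA for NVIDIA GPUs, plus OpenMP on the host :pre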
[Building LAMMPS with the KOKKOS package:]
NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.
The recommended method of building the KOKKOS package is to start with the provided Kokkos
Makefiles in /src/MAKE/OPTIONS/. You may need to modify the KOKKOS_ARCH variable in the Makefile
to match your specific hardware. For example:
for Sandy Bridge CPUs, set KOKKOS_ARCH=SNB
for Broadwell CPUs, set KOKKOS_ARCH=BDW
for K80 GPUs, set KOKKOS_ARCH=Kepler37
for P100 GPUs and Power8 CPUs, set KOKKOS_ARCH=Pascal60,Power8 :ul
See the [Advanced Kokkos Options] section below for a listing of all KOKKOS_ARCH options.
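As a sketch, assuming a Broadwell CPU machine, the modification
amounts to changing a single line in the chosen Makefile:
KOKKOS_ARCH = BDW :pre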
[Compile for CPU-only (MPI only, no threading):]
use a C++11 compatible compiler and set KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only as described above. Then do the
following:
cd lammps/src
make yes-kokkos
make kokkos_mpi_only :pre
[Compile for CPU-only (MPI plus OpenMP threading):]
NOTE: To build with Kokkos support for OpenMP threading, your compiler must support the
OpenMP interface. You should have one or more multi-core CPUs so that
multiple threads can be launched by each MPI task running on a CPU.
use a C++11 compatible compiler and set KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_omp as described above. Then do the
following:
cd lammps/src
make yes-kokkos
make kokkos_omp :pre
[Compile for Intel KNL Xeon Phi (Intel Compiler, OpenMPI):]
use a C++11 compatible compiler and do the following:
cd lammps/src
make yes-kokkos
make kokkos_phi :pre
[Compile for CPUs and GPUs (with OpenMPI or MPICH):]
NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software
version 7.5 or later must be installed on your system. See the
discussion for the "GPU"_accelerate_gpu.html package for details of
how to check and do this.
use a C++11 compatible compiler and set KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both GPU and CPU as described
above. Then do the following:
cd lammps/src
make yes-kokkos
make kokkos_cuda_mpi :pre
[Alternative Methods of Compiling:]
Alternatively, the KOKKOS package can be built by specifying Kokkos variables
on the make command line. For example:
make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB # set the KOKKOS_DEVICES and KOKKOS_ARCH variable explicitly
make kokkos_cuda_mpi KOKKOS_ARCH=Pascal60,Power8 # set the KOKKOS_ARCH variable explicitly :pre
Setting the KOKKOS_DEVICES and KOKKOS_ARCH variables on the
make command line requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.
NOTE: If you build using make line variables and re-build LAMMPS twice
with different KOKKOS options and the *same* target, then you *must* perform a "make clean-all"
or "make clean-machine" before each build. This is to force all the
KOKKOS-dependent files to be re-compiled with the new options.
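For example, a rebuild that switches the same "mpi" target from
OpenMP to Serial might look like this (a sketch; adjust KOKKOS_ARCH
for your hardware):
make clean-all
make mpi KOKKOS_DEVICES=Serial KOKKOS_ARCH=SNB :pre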
[Running LAMMPS with the KOKKOS package:]
All Kokkos operations occur within the
context of an individual MPI task running on a single node of the
machine. The total number of MPI tasks used by LAMMPS (one or
multiple per compute node) is set in the usual manner via the mpirun
or mpiexec commands, and is independent of Kokkos. E.g. the mpirun
command in OpenMPI does this via its
-np and -npernode switches. Ditto for MPICH via -np and -ppn.
[Running on a multi-core CPU:]
Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no multi-threading
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command.
You must use the "-k on" "command-line
switch"_Section_start.html#start_6 to enable the KOKKOS package. It
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. For OpenMP use:
-k on t Nt :pre
The "t Nt" option specifies how many OpenMP threads per MPI
task to use with a node. The default is Nt = 1, which is MPI-only mode.
Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer. If hyperthreading is enabled, then
the product of MPI tasks * OpenMP threads/task should not exceed the
physical number of cores * hardware threads.
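For example, assuming a hypothetical node with 16 physical cores and
2 hardware threads per core, each of these respects the limit:
mpirun -np 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj   # 4*4 = 16 = physical cores
mpirun -np 16 lmp_kokkos_omp -k on t 2 -sf kk -in in.lj  # 16*2 = 32 = cores * hardware threads :pre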
The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.
Use the "-sf kk" "command-line switch"_Section_start.html#start_6,
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if
you wish to change any of the default "package kokkos"_package.html
optionns set by the "-k on" "command-line
switch"_Section_start.html#start_6.
The "-sf kk" "command-line switch"_Section_start.html#start_7
will automatically append the "/kk" suffix to styles that support it.
In this manner no modification to the input script is needed. Alternatively,
one can run with the KOKKOS package by editing the input script as described below.
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. However, when running on CPUs, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre
If the "newton"_newton.html command is used in the input
script, it can also override the Newton flag defaults.
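For example, adding this line to the input script turns Newton's 3rd
law on for both pairwise and bonded interactions, overriding the
KOKKOS default of "off":
newton on :pre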
[Core and Thread Affinity:]
When using multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre
For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best performance
with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads.
For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option
as described below.
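A sketch of a job-script fragment that applies these settings before
launching LAMMPS (assuming OpenMP 4.0 or later):
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre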
[Running on Knight's Landing (KNL) Intel Xeon Phi:]
Here is a quick overview of how to use the KOKKOS package
for the Intel Knight's Landing (KNL) Xeon Phi:
KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
are reserved for the OS, and only 64 or 66 cores are used. Each core
has 4 hyperthreads, so there are effectively N = 256 (4*64) or
N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
otherwise performance will suffer. Note that with the KOKKOS package you do not need to
specify how many KNLs there are per node; each
KNL is simply treated as running some number of MPI tasks.
Examples of mpirun commands that follow these rules are shown below.
Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 64 MPI tasks/node, 4 threads/task
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 66 MPI tasks/node, 4 threads/task
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 1 node, 32 MPI tasks/node, 8 threads/task
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre
The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these two values should be N, i.e.
256 or 264.
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. When running on KNL, this
will typically be best for pair-wise potentials. For manybody potentials,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre
NOTE: MPI tasks and threads should be bound to cores as described above for CPUs.
NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knight's Corner (KNC), your
system must be configured to use them in "native" mode, not "offload"
mode like the USER-INTEL package supports.
[Running on GPUs:]
Use the "-k" "command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node. Typically the -np setting
of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
You can assign multiple MPI tasks to the same GPU with the
KOKKOS package, but this is usually only faster if significant portions
of the input script have not been ported to use Kokkos. Using CUDA MPS
is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node should not exceed N.
-k on g Ng :pre
Here are examples of how to use the KOKKOS package for GPUs,
assuming one or more nodes, each with two GPUs:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
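As noted above, CUDA MPS is recommended when multiple MPI tasks share
a GPU. A minimal sketch (check the MPS documentation for your system):
nvidia-cuda-mps-control -d                                       # start the MPS daemon
mpirun -np 4 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj  # 2 MPI tasks per GPU
echo quit | nvidia-cuda-mps-control                              # stop the daemon after the run :pre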
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions, along with threaded communication.
When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize
equal to the ghost atom cutoff will give speedup.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj # Set binsize = neighbor ghost cutoff
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre
NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later). The Kokkos library exploits
texture cache options not supported by Tesla generation GPUs (or
older).
NOTE: When using a GPU, you will achieve the best performance if your
input script does not use fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU incurring a performance penalty.
NOTE: To get an accurate timing breakdown between time spent in pair,
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
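For example, to collect an accurate timing breakdown from a short
benchmark run (a sketch; do not leave the variable set for production
runs):
export CUDA_LAUNCH_BLOCKING=1
mpirun -np 1 lmp_kokkos_cuda_openmpi -k on g 1 -sf kk -in in.lj
unset CUDA_LAUNCH_BLOCKING :pre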
[Run with the KOKKOS package by editing an input script:]
Alternatively, the effect of the "-sf" or "-pk" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands to your input script.
The discussion above for building LAMMPS with the KOKKOS package, the
mpirun/mpiexec command, and setting appropriate thread values applies
here as well.
You must still use the "-k on" "command-line
switch"_Section_start.html#start_6 to enable the KOKKOS package, and
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.
Use the "suffix kk"_suffix.html command, or you can explicitly add a
You can use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5 :pre
You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_6.
"command-line switch"_Section_start.html#start_7.
[Using OpenMP threading and CUDA together (experimental):]
With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
KOKKOS_DEVICES=Cuda,OpenMP :pre
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using "-sf kk" in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead.
Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng":
-k on t Nt g Ng :pre
For example, the command to run with 1 GPU and 8 OpenMP threads is then:
mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
Conversely, if "-sf kk/host" is used in the command line and then the
"/kk" or "/kk/device" suffix is added to a specific style in your input script,
then only that specific style will run on the GPU while everything else will
run on the CPU in OpenMP mode. Note that the execution of the CPU and GPU
styles will NOT overlap, except for a special case:
A kspace style and/or molecular topology (bonds, angles, etc.) running on
the host CPU can overlap with a pair style running on the GPU. First compile
with “--default-stream per-thread” added to CCFLAGS in the Kokkos CUDA Makefile.
Then explicitly use the “/kk/host” suffix for kspace and bonds, angles, etc.
in the input file and the "kk" suffix (equal to "kk/device") on the command line.
Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
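A sketch of such an input-script fragment, assuming a PPPM kspace
style and harmonic bonds (illustrative choices only):
kspace_style pppm/kk/host 1.0e-4
bond_style harmonic/kk/host :pre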
[Speed-ups to expect:]
@ -353,7 +363,7 @@ Generally speaking, the following rules of thumb apply:
When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However, the difference between all 3 is small (less
than 20%). :ulb,l
When running on CPUs only, with multiple threads per MPI task,
@ -363,7 +373,7 @@ package. :l
When running large number of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l
When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l
:ule
@ -371,123 +381,78 @@ See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
[Advanced Kokkos options:]
There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
/lib/kokkos/Makefile.kokkos file.
KOKKOS_DEVICES, values = {Serial}, {OpenMP}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {Pascal60}, {Pascal61}, {ARMv80}, {ARMv81}, {ARMv8-ThunderX}, {BGQ}, {Power7}, {Power8}, {Power9}, {KNL}, {BDW}, {SKX}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
KOKKOS_DEVICES sets the parallelization method used for Kokkos code
(within LAMMPS). KOKKOS_DEVICES=Serial means that no threading will be used.
KOKKOS_DEVICES=OpenMP means that OpenMP threading will be
used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.
KOKKOS_ARCH enables compiler switches needed when compiling for
specific hardware:
ARMv80 = ARMv8.0 Compatible CPU
ARMv81 = ARMv8.1 Compatible CPU
ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
SNB = Intel Sandy/Ivy Bridge CPUs
HSW = Intel Haswell CPUs
BDW = Intel Broadwell Xeon E-class CPUs
SKX = Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
KNC = Intel Knights Corner Xeon Phi
KNL = Intel Knights Landing Xeon Phi
Kepler30 = NVIDIA Kepler generation CC 3.0
Kepler32 = NVIDIA Kepler generation CC 3.2
Kepler35 = NVIDIA Kepler generation CC 3.5
Kepler37 = NVIDIA Kepler generation CC 3.7
Maxwell50 = NVIDIA Maxwell generation CC 5.0
Maxwell52 = NVIDIA Maxwell generation CC 5.2
Maxwell53 = NVIDIA Maxwell generation CC 5.3
Pascal60 = NVIDIA Pascal generation CC 6.0
Pascal61 = NVIDIA Pascal generation CC 6.1
BGQ = IBM Blue Gene/Q CPUs
Power8 = IBM POWER8 CPUs
Power9 = IBM POWER9 CPUs :ul
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given in "Section 5.3"_Section_accelerate.html#acc_3.
KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.
KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.
KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when building LAMMPS.
KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS package must be compiled
with the {enable_lambda} option when using GPUs.
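For example, as a sketch, the option can be set explicitly on the
make command line:
make kokkos_cuda_mpi KOKKOS_CUDA_OPTIONS=enable_lambda :pre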
[Restrictions:]
Currently, there are no precision options with the KOKKOS
package. All compilation and computation is performed in double
precision.