Update Kokkos docs
commit 8a93f63de9 (parent d029cb9002)

@@ -11,336 +11,344 @@

5.3.3 KOKKOS package :h5

Kokkos is a templated C++ library that provides abstractions to allow
a single implementation of an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as GPUs, Intel Xeon Phis, or many-core
CPUs. Kokkos maps the C++ kernel onto different backend languages such as CUDA, OpenMP, or Pthreads.
The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of data structures like 2d and
3d arrays to optimize performance on different hardware. For more information on Kokkos, see
"Github"_https://github.com/kokkos/kokkos. Kokkos is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos library was written primarily by Carter Edwards,
Christian Trott, and Dan Sunderland (all Sandia).

The LAMMPS KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in /lib/kokkos. The KOKKOS package was developed primarily by Christian Trott (Sandia)
and Stan Moore (Sandia) with contributions of various styles by others, including Sikandar
Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez (Sandia). For more information on developing
with Kokkos abstractions, see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.

Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP (threading
for many-core CPUs and Intel Phi), and CUDA (for NVIDIA GPUs). You choose the mode at build time to
produce an executable compatible with specific hardware.

[Building LAMMPS with the KOKKOS package:]

NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.
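
A quick way to check whether your toolchain qualifies is to print the
compiler version, e.g. (whichever compiler your Makefile uses):

g++ --version
icpc --version
clang++ --version :pre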

The recommended method of building the KOKKOS package is to start with the provided Kokkos
Makefiles in /src/MAKE/OPTIONS/. You may need to modify the KOKKOS_ARCH variable in the Makefile
to match your specific hardware. For example:

for Sandy Bridge CPUs, set KOKKOS_ARCH=SNB
for Broadwell CPUs, set KOKKOS_ARCH=BDW
for K80 GPUs, set KOKKOS_ARCH=Kepler37
for P100 GPUs and Power8 CPUs, set KOKKOS_ARCH=Pascal60,Power8 :ul

See the [Advanced Kokkos Options] section below for a listing of all KOKKOS_ARCH options.
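
For instance, for a hypothetical Broadwell-based machine, the edited line in
the chosen Makefile would simply read:

KOKKOS_ARCH = BDW :pre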

[Compile for CPU-only (MPI only, no threading):]

use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only as described above, then:

cd lammps/src
make yes-kokkos
make kokkos_mpi_only :pre

[Compile for CPU-only (MPI plus OpenMP threading):]

NOTE: To build with Kokkos support for OpenMP threading, your compiler must support the
OpenMP interface. You should have one or more multi-core CPUs so that
multiple threads can be launched by each MPI task running on a CPU.

use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_omp as described above, then:

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

[Compile for Intel KNL Xeon Phi (Intel Compiler, OpenMPI):]

use a C++11 compatible compiler and do the following:

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

[Compile for CPUs and GPUs (with OpenMPI or MPICH):]

NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software
version 7.5 or later must be installed on your system. See the
discussion for the "GPU"_accelerate_gpu.html package for details of
how to check and do this.

use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both GPU and CPU as
described above, then:

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpi :pre

[Alternative Methods of Compiling:]

Alternatively, the KOKKOS package can be built by specifying Kokkos variables
on the make command line. For example:

make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB    # set the KOKKOS_DEVICES and KOKKOS_ARCH variables explicitly
make kokkos_cuda_mpi KOKKOS_ARCH=Pascal60,Power8  # set the KOKKOS_ARCH variable explicitly :pre

Setting the KOKKOS_DEVICES and KOKKOS_ARCH variables on the
make command line requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.

NOTE: If you build using make line variables and re-build LAMMPS twice
with different KOKKOS options and the *same* target, then you *must* perform a "make clean-all"
or "make clean-machine" before each build. This is to force all the
KOKKOS-dependent files to be re-compiled with the new options.
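
As a sketch of the safe re-build sequence this note implies, when re-using
the same mpi target with two different KOKKOS_DEVICES settings:

make clean-all
make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB
make clean-all
make mpi KOKKOS_DEVICES=Serial KOKKOS_ARCH=SNB :pre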

[Running LAMMPS with the KOKKOS package:]

All Kokkos operations occur within the
context of an individual MPI task running on a single node of the
machine. The total number of MPI tasks used by LAMMPS (one or
multiple per compute node) is set in the usual manner via the mpirun
or mpiexec commands, and is independent of Kokkos. E.g. the mpirun
command in OpenMPI does this via its
-np and -npernode switches. Ditto for MPICH via -np and -ppn.

[Running on a multi-core CPU:]

Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj        # 1 node, 16 MPI tasks/node, no multi-threading
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj  # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj          # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj  # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command.
You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. For OpenMP use:

-k on t Nt :pre
The "t Nt" option specifies how many OpenMP threads per MPI
|
||||||
|
task to use with a node. The default is Nt = 1, which is MPI-only mode.
|
||||||
|
Note that the product of MPI tasks * OpenMP
|
||||||
|
threads/task should not exceed the physical number of cores (on a
|
||||||
|
node), otherwise performance will suffer. If hyperthreading is enabled, then
|
||||||
|
the product of MPI tasks * OpenMP threads/task should not exceed the
|
||||||
|
physical number of cores * hardware threads.
|
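
For example, on a hypothetical 16-core node with 2 hardware threads per
core, the following run stays within the 32 hardware-thread limit:

mpirun -np 8 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj  # 8 tasks * 4 threads = 32 = 16 cores * 2 hardware threads :pre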

The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.

The "-sf kk" "command-line switch"_Section_start.html#start_7
will automatically append the "/kk" suffix to styles that support it.
In this manner no modification to the input script is needed. Alternatively,
one can run with the KOKKOS package by editing the input script as described below.

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. However, when running on CPUs, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj  # Newton on, half neighbor list, non-threaded comm :pre

If the "newton"_newton.html command is used in the input
script, it can also override the Newton flag defaults.

[Core and Thread Affinity:]

When using multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre

For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best performance
with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads.
For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option
as described below.
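
As a concrete sketch (assuming a bash-like shell), the OpenMP 4.0
recommendation above translates to:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 2 --bind-to socket --map-by socket ./lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre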

[Running on Knight's Landing (KNL) Intel Xeon Phi:]

Here is a quick overview of how to use the KOKKOS package
for the Intel Knight's Landing (KNL) Xeon Phi:

KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
are reserved for the OS, and only 64 or 66 cores are used. Each core
has 4 hyperthreads, so there are effectively N = 256 (4*64) or
N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
otherwise performance will suffer. Note that with the KOKKOS package you do not need to
specify how many KNLs there are per node; each
KNL is simply treated as running some number of MPI tasks.

Examples of mpirun commands that follow these rules are shown below.

Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):

mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj           # 1 node, 64 MPI tasks/node, 4 threads/task
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj           # 1 node, 66 MPI tasks/node, 4 threads/task
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj           # 1 node, 32 MPI tasks/node, 8 threads/task
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj  # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre

The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these two values should be N, i.e.
256 or 264.

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. When running on KNL, this
will typically be best for pair-wise potentials. For manybody potentials,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj                         # Newton off, full neighbor list, non-threaded comm
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax  # Newton on, half neighbor list, non-threaded comm :pre

NOTE: MPI tasks and threads should be bound to cores as described above for CPUs.

NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knight's Corner (KNC), your
system must be configured to use them in "native" mode, not "offload"
mode like the USER-INTEL package supports.

[Running on GPUs:]

Use the "-k" "command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node. Typically the -np setting
of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
You can assign multiple MPI tasks to the same GPU with the
KOKKOS package, but this is usually only faster if significant portions
of the input script have not been ported to use Kokkos. Using CUDA MPS
is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node should not exceed N.

-k on g Ng :pre

Here are examples of how to use the KOKKOS package for GPUs,
assuming one or more nodes, each with two GPUs:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj          # 1 node,   2 MPI tasks/node, 2 GPUs/node
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj  # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions, along with threaded communication.
When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize
equal to the ghost atom cutoff will give speedup.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj                       # set binsize = neighbor ghost cutoff
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj  # Newton on, half neighbor list, set binsize = neighbor ghost cutoff :pre

NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later). The Kokkos library exploits
texture cache options not supported by Tesla generation GPUs (or
older).

NOTE: When using a GPU, you will achieve the best performance if your
input script does not use fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU, incurring a performance penalty.

NOTE: To get an accurate timing breakdown between time spent in pair,
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
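
A sketch of such a (non-production) timing run, again assuming a bash-like
shell:

export CUDA_LAUNCH_BLOCKING=1
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj :pre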

[Run with the KOKKOS package by editing an input script:]

Alternatively, the effect of the "-sf" or "-pk" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands to your input script.

The discussion above for building LAMMPS with the KOKKOS package, the
mpirun/mpiexec command, and setting appropriate thread values is the same.

You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.

You can use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_7.
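
For example, a minimal input script fragment (the pair style and cutoff are
only illustrative) that duplicates the effect of running with
"-sf kk -pk kokkos newton on neigh half" is:

package kokkos newton on neigh half
suffix kk
pair_style lj/cut 2.5 :pre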

[Using OpenMP threading and CUDA together (experimental):]

With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi:

KOKKOS_DEVICES=Cuda,OpenMP :pre

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using "-sf kk" in the command line gives the default CUDA version everywhere.
However, if the "/kk/host" suffix is added to a specific style in the input
script, the Kokkos OpenMP (CPU) version of that specific style will be used instead.
Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng":

-k on t Nt g Ng :pre

For example, the command to run with 1 GPU and 8 OpenMP threads is then:

mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre

Conversely, if "-sf kk/host" is used in the command line and then the
"/kk" or "/kk/device" suffix is added to a specific style in your input script,
then only that specific style will run on the GPU while everything else will
run on the CPU in OpenMP mode. Note that the execution of the CPU and GPU
styles will NOT overlap, except for a special case:

A kspace style and/or molecular topology (bonds, angles, etc.) running on
the host CPU can overlap with a pair style running on the GPU. First compile
with "--default-stream per-thread" added to CCFLAGS in the Kokkos CUDA Makefile.
Then explicitly use the "/kk/host" suffix for kspace and bonds, angles, etc.
in the input file and the "kk" suffix (equal to "kk/device") on the command line.
Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
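
A sketch of that special case in an input script (the specific styles are
only illustrative), with kspace and bonded terms pinned to the host while
the pair style uses the device:

kspace_style pppm/kk/host 1.0e-4
bond_style harmonic/kk/host
pair_style lj/cut/coul/long/kk 10.0 :pre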

[Speed-ups to expect:]

@@ -353,7 +361,7 @@ Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However the difference between all 3 is small (less
than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task,

@@ -363,7 +371,7 @@ package. :l

When running a large number of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l

When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l
:ule

@@ -371,123 +379,78 @@ See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the

LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Advanced Kokkos options:]

There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
/lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {Serial}, {OpenMP}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {Pascal60}, {Pascal61}, {ARMv80}, {ARMv81}, {ARMv8-ThunderX}, {BGQ}, {Power7}, {Power8}, {Power9}, {KNL}, {BDW}, {SKX}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
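
For example (a sketch; these variables can be combined freely on one make
line), a Pthreads debugging build with hwloc thread binding might be
requested as:

make mpi KOKKOS_DEVICES=Pthreads KOKKOS_USE_TPLS=hwloc KOKKOS_DEBUG=yes :pre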

KOKKOS_DEVICES sets the parallelization method used for Kokkos code
(within LAMMPS). KOKKOS_DEVICES=Serial means that no threading will be used.
KOKKOS_DEVICES=OpenMP means that OpenMP threading will be
used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

KOKKOS_ARCH enables compiler switches needed when compiling for
specific hardware:

ARMv80 = ARMv8.0 Compatible CPU
ARMv81 = ARMv8.1 Compatible CPU
ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
SNB = Intel Sandy/Ivy Bridge CPUs
HSW = Intel Haswell CPUs
BDW = Intel Broadwell Xeon E-class CPUs
SKX = Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
KNC = Intel Knights Corner Xeon Phi
KNL = Intel Knights Landing Xeon Phi
Kepler30 = NVIDIA Kepler generation CC 3.0
Kepler32 = NVIDIA Kepler generation CC 3.2
Kepler35 = NVIDIA Kepler generation CC 3.5
Kepler37 = NVIDIA Kepler generation CC 3.7
Maxwell50 = NVIDIA Maxwell generation CC 5.0
Maxwell52 = NVIDIA Maxwell generation CC 5.2
Maxwell53 = NVIDIA Maxwell generation CC 5.3
Pascal60 = NVIDIA Pascal generation CC 6.0
Pascal61 = NVIDIA Pascal generation CC 6.1
BGQ = IBM Blue Gene/Q CPUs
Power8 = IBM POWER8 CPUs
Power9 = IBM POWER9 CPUs :ul

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
necessary for KOKKOS_DEVICES=OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given in "Section 5.3"_Section_accelerate.html#acc_3.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.

KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when building LAMMPS.

KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS package must be compiled
with the {enable_lambda} option when using GPUs.
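
Since {enable_lambda} is the default, it only needs to be given explicitly
if it was overridden, e.g. (a sketch):

make kokkos_cuda_mpi KOKKOS_CUDA_OPTIONS=enable_lambda :pre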

[Restrictions:]

Currently, there are no precision options with the KOKKOS
package. All compilation and computation is performed in double
precision.