"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.4 KOKKOS package :h4

The KOKKOS package was developed primarily by Christian Trott
(Sandia) with contributions of various styles by others, including
Sikandar Mashayak (UIUC). The underlying Kokkos library was written
primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in lib/kokkos.

The Kokkos library is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a
templated C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as a GPU, Intel Phi, or many-core
chip.

The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of basic data structures like 2d and
3d arrays and allow the transparent utilization of special hardware
load and store operations. Such data structures are used in LAMMPS to
store atom coordinates or forces or neighbor lists. The layout is
chosen to optimize performance on different platforms. Again this
functionality is hidden from the developer, and does not affect how
the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine (one on a desktop machine or 1000s on a supercomputer).

All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.

Kokkos provides support for two different modes of execution per MPI
task. This means that computational tasks (pairwise interactions,
neighbor list builds, time integration, etc) can be parallelized for
one or the other of the two modes. The first mode is called the
"host" and is one or more threads running on one or more physical CPUs
(within the node). Currently, both multi-core CPUs and an Intel Phi
processor (running in native mode, not offload mode like the
USER-INTEL package) are supported. The second mode is called the
"device" and is an accelerator chip of some kind. Currently only an
NVIDIA GPU is supported via Cuda. If your compute node does not have
a GPU, then there is only one mode of execution, i.e. the host and
device are the same.

When using the KOKKOS package, you must choose at build time whether
you are building for OpenMP, GPU, or for using the Xeon Phi in native
mode.

Here is a quick overview of how to use the KOKKOS package:

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
use KOKKOS styles in your input script :ul

The latter two steps can be done using the "-k on", "-pk kokkos" and
"-sf kk" "command-line switches"_Section_start.html#start_7
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands respectively to your input script.

[Required hardware/software:]

The KOKKOS package can be used to build and run LAMMPS on the
following kinds of hardware:

CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
Phi: on one or more Intel Phi coprocessors (per node)
GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul

Note that Intel Xeon Phi coprocessors are supported in "native" mode,
not "offload" mode like the USER-INTEL package supports.

Only NVIDIA GPUs are currently supported.

IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
you must have Kepler generation GPUs (or later). The Kokkos library
exploits texture cache options not supported by Tesla generation GPUs
(or older).

To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
installed on your system. See the discussion above for the USER-CUDA
and GPU packages for details of how to check and do this.

[Building LAMMPS with the KOKKOS package:]

You must choose at build time whether to build for OpenMP, Cuda, or
Phi.

You can do any of these in one line, using the src/Make.py script,
described in "Section 2.4"_Section_start.html#start_4 of the manual.
Type "Make.py -h" for help. If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and
lmp_kokkos_phi. The OMP and PHI options use src/MAKE/Makefile.mpi as
the starting Makefile.machine. The CUDA option uses
src/MAKE/OPTIONS/Makefile.cuda since the NVIDIA nvcc compiler is
required.

Make.py -p kokkos -kokkos omp -o kokkos_omp file mpi
Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda file kokkos_cuda
Make.py -p kokkos -kokkos phi -o kokkos_phi file mpi :pre

Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make g++ OMP=yes :pre

Intel Xeon Phi:

cd lammps/src
make yes-kokkos
make g++ OMP=yes MIC=yes :pre

CPUs and GPUs:

cd lammps/src
make yes-kokkos
make cuda CUDA=yes :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
make command line, which requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.

IMPORTANT NOTE: If you build using make command-line variables and
re-build LAMMPS twice with different KOKKOS options and the *same*
target, e.g. g++ in the first two examples above, then you *must*
perform a "make clean-all" or "make clean-machine" before each build.
This forces all the KOKKOS-dependent files to be re-compiled with the
new options.

You can also hardwire these make variables in the specified machine
makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
with a line like:

MIC = yes :pre

Note that if you build LAMMPS multiple times in this manner, using
different KOKKOS options (defined in different machine makefiles), you
do not have to worry about doing a "clean" in between. This is
because the targets will be different.

IMPORTANT NOTE: The 3rd example above for a GPU uses a different
machine makefile, in this case src/MAKE/Makefile.cuda, which is
included in the LAMMPS distribution. To build the KOKKOS package for
a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must
have a CCFLAGS -arch setting that is appropriate for your NVIDIA
hardware and installed software. Typical values for -arch are given
in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well
as other settings that must be included in the machine makefile, if
you create your own.

IMPORTANT NOTE: Currently, there are no precision options with the
KOKKOS package. All compilation and computation is performed in
double precision.

There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
those discussed above. Each takes a value of {yes} or {no}. The
default value is listed, which is set in the
lib/kokkos/Makefile.lammps file.

OMP, default = {yes}
CUDA, default = {no}
HWLOC, default = {no}
AVX, default = {no}
MIC, default = {no}
LIBRT, default = {no}
DEBUG, default = {no} :ul
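
For example, the corresponding lines hardwired in a Makefile.machine
might look as follows; this is only a sketch and the particular
combination of values is arbitrary:

OMP = yes
AVX = yes
CUDA = no :pre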

OMP sets the parallelization method used for Kokkos code (within
LAMMPS) that runs on the host. OMP=yes means that OpenMP will be
used. OMP=no means that pthreads will be used.

CUDA sets the parallelization method used for Kokkos code (within
LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running
CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting
will be used for the device as well as the host.

If CUDA=yes, then the low-level Makefile in the src/MAKE directory
must use "nvcc" as its compiler, via its CC setting. For best
performance its CCFLAGS setting should use -O3 and have an -arch
setting that matches the compute capability of your NVIDIA hardware
and software installation, e.g. -arch=sm_20. Generally Fermi
generation GPUs are sm_20, while Kepler generation GPUs are sm_30 or
sm_35 and Maxwell cards are sm_50. A complete list can be found on
"wikipedia"_http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. You can
also use the deviceQuery tool that comes with the CUDA samples. Note
the minimum required compute capability is 2.0, but this will give
significantly reduced performance compared to Kepler generation GPUs
with compute capability 3.x. For the LINK setting, "nvcc" should not
be used; instead use g++ or another compiler suitable for linking C++
applications. Often you will want to use your MPI compiler wrapper
for this setting (e.g. mpicxx). Finally, the low-level Makefile must
also have a "Compilation rule" for creating *.o files from *.cu files.
See src/Makefile.cuda for an example of a low-level Makefile with all
of these settings.
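
As a minimal sketch of the relevant fragment of such a low-level
Makefile (the -arch value assumes a Kepler card; consult the
Makefile.cuda example mentioned above for the actual settings):

CC =       nvcc
CCFLAGS =  -O3 -arch=sm_35
LINK =     mpicxx
%.o : %.cu
	$(CC) $(CCFLAGS) -c $< -o $@ :pre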

HWLOC binds threads to hardware cores, so they do not migrate during a
simulation. HWLOC=yes should always be used if running with OMP=no
for pthreads. It is not necessary for OMP=yes for OpenMP, because
OpenMP provides alternative methods via environment variables for
binding threads to hardware cores. More info on binding threads to
cores is given in "this section"_Section_accelerate.html#acc_8.

AVX enables Intel advanced vector extensions when compiling for an
Intel-compatible chip. AVX=yes should only be set if your host
hardware supports AVX. If it does not support it, this will cause a
run-time crash.

MIC enables compiler switches needed when compiling for an Intel Phi
processor.

LIBRT enables use of a more accurate timer mechanism on most Unix
platforms. This library is not available on all platforms.

DEBUG is only useful when developing a Kokkos-enabled style within
LAMMPS. DEBUG=yes enables printing of run-time debugging information
that can be useful. It also enables runtime bounds checking on Kokkos
data structures.

[Run with the KOKKOS package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to ensure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.

You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. The two most commonly used
options are:

-k on t Nt g Ng :pre

The "t Nt" option applies to host=OMP (even if device=CUDA) and
host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
task to use within a node. For host=MIC, it specifies how many Xeon
Phi threads per MPI task to use within a node. The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine. But for host=MIC you will typically end up using far fewer than
all the 240 available threads, which could give very poor performance.

The "g Ng" option applies to device=CUDA. It specifies how many GPUs
per compute node to use. The default is 1, so this only needs to be
specified if you have 2 or more GPUs per compute node.

The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.

Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
you wish to change any of the default "package kokkos"_package.html
options set by the "-k on" "command-line
switch"_Section_start.html#start_7.

host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj                           # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj              # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj          # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj           # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj   # ditto on 16 nodes :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre

host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre

host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre

Note that the default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. This typically gives fastest
performance. If the "newton"_newton.html command is used in the input
script, it can override the Newton flag defaults.

However, when running in MPI-only mode with 1 thread per MPI task, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. You can do this with the "-pk" "command-line
switch"_Section_start.html#start_7.

[Or run with the KOKKOS package by editing an input script:]

The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA is the same.

You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.

Use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_7.

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However, the difference between all 3 is small (less
than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l

When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l,ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command.
See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the "-k" "command-line
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, Intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes
option, as discussed in "Section 2.3.4"_Section_start.html#start_3_4
of the manual.
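
For example, with a bash-like shell the variable can be exported
before launching LAMMPS; whether it propagates to all nodes depends on
your MPI launcher, and the run command is a placeholder:

export OMP_PROC_BIND=true
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj :pre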

[Running on GPUs:]

Ensure the -arch setting in the machine makefile you are using,
e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
(see "this section"_Section_start.html#start_3_4 of the manual for
details).

The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package. We plan to support this in the future, similar to the
GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and
MIC=yes. The latter ensures code is correctly compiled for the Intel
Phi. The OMP setting means OpenMP will be used for parallelization on
the Phi, which is currently the best option within Kokkos. In the
future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.

Examples of mpirun commands that follow these rules are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute
node should equal the number of GPUs per compute node. In the
future Kokkos will support assigning multiple MPI tasks to a single
GPU.

Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models. Specifically, Kokkos requires
extensive C++ support from the kernel language. This is expected to
change in the future.

Kokkos must be built with a C++11 compatible compiler, e.g. gcc 4.7.2
or later.