Doc tweak

Stan Moore 2019-04-09 15:17:40 -06:00
parent 618547b72e
commit 073f003470
2 changed files with 30 additions and 27 deletions


@@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@@ -207,19 +207,21 @@ supports.
[Running on GPUs:]

Use the "-k" "command-line switch"_Run_options.html to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
portions of the input script have not been ported to use Kokkos. In this
case, also packing/unpacking communication buffers on the host may give
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS
is recommended in this scenario.

Using a CUDA-aware MPI library with support for GPU-direct is highly
recommended. GPU-direct use can be avoided by using
"-pk kokkos gpu/direct no"_package.html. As above for multi-core CPUs
(and no GPU), if N is the number of physical cores/node, then the number
of MPI tasks/node should not exceed N.

-k on g Ng :pre
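As a purely illustrative sketch (not part of the original doc text), a
run on a node with 2 GPUs and 2 MPI tasks might be launched as follows,
where the executable name lmp_kokkos_cuda_mpi and the input file in.lj
are placeholders for your own build and script:

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj :pre

If your MPI library is not CUDA-aware, the same command line can be
extended with the "-pk kokkos gpu/direct" setting mentioned above.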


@@ -490,10 +490,10 @@ are rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. "Reverse" communication
happens every timestep if the {newton} option is on. The data is for
atom coordinates and any other atom properties that need to be updated
for ghost atoms owned by each processor.

The {comm} keyword is simply a short-cut to set the same value for the
{comm/exchange}, {comm/forward}, and {comm/reverse} keywords.

The value options for all 3 keywords are {no} or {host} or {device}. A
value of {no} means to use the standard non-KOKKOS method of
@@ -501,26 +501,26 @@ packing/unpacking data for the communication. A value of {host} means to
use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of {device} means to
use the device, typically a GPU, to perform the packing/unpacking
operation.

The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that the
Kokkos-based {host} and {device} values are working correctly. It is the
default when running on CPUs since it is usually the fastest.

When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of the styles used in your input
script are supported by the KOKKOS package. In this case data can stay
on the GPU for many timesteps without being moved between the host and
GPU, if you use the {device} value. If your script uses styles (e.g.
fixes) which are not yet supported by the KOKKOS package, then data has
to be moved between the host and device anyway, so it is typically
faster to let the host handle communication by using the {host} value.
Using {host} instead of {no} will enable use of multiple threads to
pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give speedup
since it reduces the number of CUDA kernel launches.
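For illustration only (not text from the original doc), these keywords
could be set explicitly in the input script via the
"package"_package.html command, e.g. keeping forward and reverse
communication on the device while letting the host pack/unpack exchange
buffers:

package kokkos comm/forward device comm/reverse device comm/exchange host :pre

The same settings can also be given on the command line with the
-pk kokkos switch.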

The {gpu/direct} keyword chooses whether GPU-direct will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly
@@ -533,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will
be automatically changed to {host}. This setting has no effect if not
running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI.
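As a hedged example (not from the original doc), GPU-direct could be
turned off explicitly in the input script when the MPI library is known
not to be CUDA-aware:

package kokkos gpu/direct off :pre

Per the paragraph above, any {comm} keywords set to {device} will then
be changed to {host} automatically.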
:line