commit 85d9a319d7 (parent 0fc38af7a3)
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12048 f3b2605a-c512-4ea7-a41b-209d697bcdaa
@@ -683,19 +683,20 @@ occurs, the faster your simulation will run.
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
</P>
<P>Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
<P><A HREF = "http://trilinos.sandia.gov/packages/kokkos">Kokkos</A> is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
</P>
<P>Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.
</P>
<P>CT NOTE: Pointer to Kokkos web page???
<P>Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
</P>
<P>These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@@ -727,9 +728,11 @@ i.e. the host and device are the same.
<P>IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.
</P>
<P>CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models (in particular, relatively extensive C++
support is required for the Kernel language). This is expected to
change in the future.
</P>
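<P>As a hypothetical illustration (the executable and input file names
are placeholders; see the -k <A HREF = "Section_start.html#start_7">command-line switch</A> for the exact
syntax), a node with 2 GPUs would be launched with 2 MPI tasks per
node, one per GPU:
</P>
<PRE>mpirun -np 2 lmp_cuda -k on gpus 2 -sf kk -in in.lj
</PRE>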
<P>Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@@ -857,8 +860,8 @@ communication can provide a speed-up for specific calculations.
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
</P>
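<P>As a sketch (hypothetical node size and executable name), a 16-core
node could run 2 MPI tasks per node with 8 threads per task, so that
tasks/node * threads/task = 16:
</P>
<PRE>mpirun -np 2 lmp_omp -k on t 8 -sf kk -in in.lj
</PRE>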
<P>You can compare the performance running in different modes:
@@ -878,9 +881,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
<A HREF = "Section_start.html#start_3_4">this section</A> of the manual for details.
</P>
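<P>The -arch value names the CUDA compute capability of your card; for a
Kepler-class GPU the compiler-flags line in src/MAKE/Makefile.cuda
would look something like the following sketch (the variable name and
sm value are assumptions; check your GPU's actual compute capability):
</P>
<PRE>CCFLAGS = -O3 -arch=sm_35
</PRE>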
<P>The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
</P>
<P>Use the <A HREF = "Section_commands.html#start_7">-kokkos command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
@@ -936,9 +938,19 @@ will be added later.
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions, do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.
</P>
<P>Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
</PRE>
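<P>The OpenMP binding and thread-count variables can be exported as part
of the same launch; for example with OpenMPI (a sketch, assuming 8
threads per MPI task):
</P>
<PRE>mpirun -np 2 -x OMP_PROC_BIND=true -x OMP_NUM_THREADS=8 -bind-to socket -map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj
</PRE>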
<P>When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
@@ -956,8 +968,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
</P>
<P>CT NOTE: other performance tips??
</P>
<HR>

<HR>
@@ -679,19 +679,20 @@ The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.

Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
"Kokkos"_http://trilinos.sandia.gov/packages/kokkos is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).

Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.

CT NOTE: Pointer to Kokkos web page???
Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@@ -723,9 +724,11 @@ i.e. the host and device are the same.
IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.

CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models (in particular, relatively extensive C++
support is required for the Kernel language). This is expected to
change in the future.

Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@@ -853,8 +856,8 @@ If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
switch"_Section_start.html#start_7. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:
@@ -874,9 +877,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
"this section"_Section_start.html#start_3_4 of the manual for details.

The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.

Use the "-kokkos command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
@@ -932,8 +934,18 @@ When using threads (OpenMP or pthreads), it is important for
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions, do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.

Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
@@ -952,8 +964,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.

CT NOTE: other performance tips??

:line
:line
@@ -1259,13 +1259,13 @@ Ng = 1 and Ns is not set.
<P>Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables
</P>
<PRE>SLURM_LOCALID (???) CT NOTE: what MPI is this for?
<PRE>SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI)
</PRE>
<P>which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
<P>which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.
</P>
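<P>A quick way to see what your MPI flavor provides is to echo the
variable from each rank; for example with OpenMPI (a sketch):
</P>
<PRE>mpirun -np 2 sh -c 'echo rank $OMPI_COMM_WORLD_LOCAL_RANK'
</PRE>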
<PRE>threads Nt
</PRE>
@@ -1274,13 +1274,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.
</P>
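<P>For example (a sketch, assuming a 16-core node and a placeholder
executable name), 4 MPI tasks per node with Nt = 4 threads each uses
all 16 cores:
</P>
<PRE>mpirun -np 4 lmp_g++ -k on threads 4 -sf kk -in in.lj
</PRE>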
<PRE>numa Nm
</PRE>
<P>CT NOTE: what does numa set, and why use it?
</P>
<P>Explain. The default is Nm = 1.
<P>This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic which is slow.
</P>
<PRE>-log file
</PRE>
@@ -1253,13 +1253,13 @@ Ng = 1 and Ns is not set.
Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables

SLURM_LOCALID (???) CT NOTE: what MPI is this for?
SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre

which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.

threads Nt :pre
@@ -1268,13 +1268,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.

numa Nm :pre

CT NOTE: what does numa set, and why use it?

Explain. The default is Nm = 1.
This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic which is slow.

-log file :pre
@@ -216,26 +216,24 @@ device type can be specified when building LAMMPS with the GPU library.
</P>
<HR>

<P>The <I>kk</I> style invokes options associated with the use of the
<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>
<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of <I>half/thread</I> uses a threadsafe variant
of the half-neighbor list. It should be used instead of <I>half</I> when
running with threads on a CPU. A value of <I>full</I> uses a
pair styles in LAMMPS. A value of <I>half/thread</I> uses a threadsafe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option, however that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of <I>full/cluster</I> is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the cluster's particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. It also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>
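<P>For example (a sketch using the keywords described above; see the
syntax summary at the top of this page for the exact style name), a
threaded CPU run could request the threadsafe half-neighbor list:
</P>
<PRE>package kokkos neigh half/thread comm/exchange host comm/forward host
</PRE>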
<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
@@ -254,26 +252,23 @@ packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
<P>CT NOTE: please read this paragraph, to make sure it is correct:
</P>
<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values
should work identically. When using GPUs, the <I>device</I> value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the <I>device</I> value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be moved between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the <I>host</I> value. Using <I>host</I> instead of
<I>no</I> will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable use of
multiple threads to pack/unpack communicated data.
</P>
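<P>As an illustrative sketch, a GPU run in which every style in the
input script is Kokkos-enabled, and whose MPI library can access GPU
memory directly, could keep the packing/unpacking on the device:
</P>
<PRE>package kokkos neigh full comm/exchange device comm/forward device
</PRE>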
<HR>

@@ -354,13 +349,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
</P>
<P>CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.
</P>
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
<P>The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used or
not.
</P>
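<P>For example (a sketch; the executable name is a placeholder), these
default package settings apply when the suffix switch is given on the
command line without an explicit package command in the input script:
</P>
<PRE>mpirun -np 4 lmp_kokkos -k on t 1 -sf kk -in in.lj
</PRE>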
<P>If the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package omp *" were invoked, to
@@ -210,26 +210,24 @@ device type can be specified when building LAMMPS with the GPU library.

:line

The {kk} style invokes options associated with the use of the
The {kokkos} style invokes options associated with the use of the
KOKKOS package.

The {neigh} keyword determines what kinds of neighbor lists are built.
A value of {half} uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of {half/thread} uses a threadsafe variant
of the half-neighbor list. It should be used instead of {half} when
running with threads on a CPU. A value of {full} uses a
pair styles in LAMMPS. A value of {half/thread} uses a threadsafe
variant of the half-neighbor list. It should be used instead of
{half} when running with threads on a CPU. A value of {full} uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the {half} option, however that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of {full/cluster} is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the cluster's particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. It also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of {full/cluster} is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.

The {comm/exchange} and {comm/forward} keywords determine whether the
host or device performs the packing and unpacking of data when
@@ -248,26 +246,23 @@ packing/unpacking in parallel with threads. A value of {device} means
to use the device, typically a GPU, to perform the packing/unpacking
operation.

CT NOTE: please read this paragraph, to make sure it is correct:

The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the {host} and {device} values
should work identically. When using GPUs, the {device} value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the {device} value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be moved between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the {host} value. Using {host} instead of
{no} will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
{device} value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
{host} value. Using {host} instead of {no} will enable use of
multiple threads to pack/unpack communicated data.

:line
@@ -348,13 +343,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.

CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.

The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
or not.
The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" "command-line switch"_Section_start.html#start_7 is used or
not.

If the "-sf omp" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package omp *" were invoked, to