git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12048 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-05-29 23:07:14 +00:00
parent 0fc38af7a3
commit 85d9a319d7
6 changed files with 154 additions and 136 deletions

View File

@ -683,19 +683,20 @@ occurs, the faster your simulation will run.
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
</P>
<P>Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
<P><A HREF = "http://trilinos.sandia.gov/packages/kokkos">Kokkos</A> is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
</P>
<P>Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.
</P>
<P>CT NOTE: Pointer to Kokkos web page???
<P>Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
</P>
<P>These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@ -727,9 +728,11 @@ i.e. the host and device are the same.
<P>IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.
</P>
<P>CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models; in particular, relatively extensive C++
support is required for the kernel language. This is expected to
change in the future.
</P>
<P>Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@ -857,8 +860,8 @@ communication can provide a speed-up for specific calculations.
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
</P>
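<P>For example, on a node with 16 physical cores, a sketch of a run with
2 MPI tasks and 8 threads per task would be (the executable and input
file names here are placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_kokkos_omp -k on t 8 -sf kk -in in.lj
</PRE>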
<P>You can compare the performance running in different modes:
@ -878,9 +881,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
<A HREF = "Section_start.html#start_3_4">this section</A> of the manual for details.
</P>
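<P>For example, for a GPU with compute capability 3.5 (Kepler), the
-arch setting would look like the following, shown here appended to the
compiler flags; the exact variable name in src/MAKE/Makefile.cuda may
differ in your version:
</P>
<PRE>CCFLAGS = -O3 -arch=sm_35
</PRE>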
<P>The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
</P>
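<P>For example, on nodes with 2 physical GPUs, a sketch of the launch
would be the following, using the -kokkos switch described next to
request both GPUs (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj
</PRE>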
<P>Use the <A HREF = "Section_commands.html#start_7">-kokkos command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
@ -936,9 +938,19 @@ will be added later.
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.
</P>
<P>Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
</PRE>
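<P>For example, on a dual-socket node with 8 cores per socket, a sketch
combining the OpenMP binding variable with socket binding of the MPI
tasks would be (the executable, thread count, and input file are
illustrative):
</P>
<PRE>export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj
</PRE>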
<P>When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
@ -956,8 +968,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
</P>
<P>CT NOTE: other performance tips??
</P>
<HR>
<HR>

View File

@ -679,19 +679,20 @@ The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
"Kokkos"_http://trilinos.sandia.gov/packages/kokkos is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.
CT NOTE: Pointer to Kokkos web page???
Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@ -723,9 +724,11 @@ i.e. the host and device are the same.
IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.
CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models; in particular, relatively extensive C++
support is required for the kernel language. This is expected to
change in the future.
Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@ -853,8 +856,8 @@ If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
switch"_Section_start.html#start_7. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
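For example, on a node with 16 physical cores, a sketch of a run with
2 MPI tasks and 8 threads per task would be (the executable and input
file names here are placeholders):
mpirun -np 2 ./lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre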
You can compare the performance running in different modes:
@ -874,9 +877,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
"this section"_Section_start.html#start_3_4 of the manual for details.
The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
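For example, on nodes with 2 physical GPUs, a sketch of the launch
would be the following, using the -kokkos switch described next to
request both GPUs (the executable and input file names are
placeholders):
mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj :pre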
Use the "-kokkos command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
@ -932,8 +934,18 @@ When using threads (OpenMP or pthreads), it is important for
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.
Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:
OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
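For example, on a dual-socket node with 8 cores per socket, a sketch
combining the OpenMP binding variable with socket binding of the MPI
tasks would be (the executable, thread count, and input file are
illustrative):
export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj :pre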
When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
@ -952,8 +964,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
CT NOTE: other performance tips??
:line
:line

View File

@ -1259,13 +1259,13 @@ Ng = 1 and Ns is not set.
<P>Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables
</P>
<PRE>SLURM_LOCALID (???) CT NOTE: what MPI is this for?
<PRE>SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI)
</PRE>
<P>which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
<P>which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.
</P>
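<P>For example, with OpenMPI and 2 GPUs per node, the assignment works
out as sketched below (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj
# rank with OMPI_COMM_WORLD_LOCAL_RANK = 0 is assigned GPU 0
# rank with OMPI_COMM_WORLD_LOCAL_RANK = 1 is assigned GPU 1
</PRE>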
<PRE>threads Nt
</PRE>
@ -1274,13 +1274,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.
</P>
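<P>For example, on a 16-core node running 4 MPI tasks per node, a sketch
would use Nt = 4 (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 4 ./lmp_kokkos_omp -k on threads 4 -sf kk -in in.lj
</PRE>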
<PRE>numa Nm
</PRE>
<P>CT NOTE: what does numa set, and why use it?
</P>
<P>Explain. The default is Nm = 1.
<P>This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used, the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic, which is slow.
</P>
<PRE>-log file
</PRE>

View File

@ -1253,13 +1253,13 @@ Ng = 1 and Ns is not set.
Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables
SLURM_LOCALID (???) CT NOTE: what MPI is this for?
SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre
which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.
threads Nt :pre
@ -1268,13 +1268,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.
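For example, on a 16-core node running 4 MPI tasks per node, a sketch
would use Nt = 4 (the executable and input file names are
placeholders):
mpirun -np 4 ./lmp_kokkos_omp -k on threads 4 -sf kk -in in.lj :pre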
numa Nm :pre
CT NOTE: what does numa set, and why use it?
Explain. The default is Nm = 1.
This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used, the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic, which is slow.
-log file :pre

View File

@ -216,26 +216,24 @@ device type can be specified when building LAMMPS with the GPU library.
</P>
<HR>
<P>The <I>kk</I> style invokes options associated with the use of the
<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>
<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of <I>half/thread</I> uses a threadsafe variant
of the half-neighbor list. It should be used instead of <I>half</I> when
running with threads on a CPU. A value of <I>full</I> uses a
pair styles in LAMMPS. A value of <I>half/thread</I> uses a threadsafe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option; however, that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of <I>full/cluster</I> is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the clusters particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. If also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>
<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
@ -254,26 +252,23 @@ packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
<P>CT NOTE: please read this paragraph, to make sure it is correct:
</P>
<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values
should work identically. When using GPUs, the <I>device</I> value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the <I>device</I> value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be move between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the <I>host</I> value. Using <I>host</I> instead of
<I>no</I> will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable use of
multiple threads to pack/unpack communicated data.
</P>
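<P>For example, for a GPU run in which all styles in the input script are
Kokkos-enabled, a hypothetical input-script setting (written with the
style name as quoted in the defaults below) would be:
</P>
<PRE>package kk neigh full comm/exchange device comm/forward device
</PRE>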
<HR>
@ -354,13 +349,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
</P>
<P>CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.
</P>
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
<P>The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used or
not.
</P>
<P>If the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package omp *" were invoked, to

View File

@ -210,26 +210,24 @@ device type can be specified when building LAMMPS with the GPU library.
:line
The {kk} style invokes options associated with the use of the
The {kokkos} style invokes options associated with the use of the
KOKKOS package.
The {neigh} keyword determines what kinds of neighbor lists are built.
A value of {half} uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of {half/thread} uses a threadsafe variant
of the half-neighbor list. It should be used instead of {half} when
running with threads on a CPU. A value of {full} uses a
pair styles in LAMMPS. A value of {half/thread} uses a threadsafe
variant of the half-neighbor list. It should be used instead of
{half} when running with threads on a CPU. A value of {full} uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the {half} option; however, that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of {full/cluster} is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the clusters particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. If also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of {full/cluster} is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
The {comm/exchange} and {comm/forward} keywords determine whether the
host or device performs the packing and unpacking of data when
@ -248,26 +246,23 @@ packing/unpacking in parallel with threads. A value of {device} means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
CT NOTE: please read this paragraph, to make sure it is correct:
The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the {host} and {device} values
should work identically. When using GPUs, the {device} value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the {device} value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be move between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the {host} value. Using {host} instead of
{no} will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
{device} value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
{host} value. Using {host} instead of {no} will enable use of
multiple threads to pack/unpack communicated data.
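For example, for a GPU run in which all styles in the input script are
Kokkos-enabled, a hypothetical input-script setting (written with the
style name as quoted in the defaults below) would be:
package kk neigh full comm/exchange device comm/forward device :pre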
:line
@ -348,13 +343,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.
The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
or not.
The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" "command-line switch"_Section_start.html#start_7 is used or
not.
If the "-sf omp" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package omp *" were invoked, to