git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12451 f3b2605a-c512-4ea7-a41b-209d697bcdaa

sjplimp 2014-09-09 16:05:17 +00:00
parent 95d3f975f5
commit 3a18e667d4
4 changed files with 190 additions and 206 deletions


@ -264,8 +264,9 @@ due to if tests and other conditional code.
<LI>use OPT pair styles in your input script
</UL>
<P>The last step can be done using the "-sf opt" <A HREF = "Section_start.html#start_7">command-line
switch</A>. Or it can be done by adding a
<A HREF = "suffix.html">suffix opt</A> command to your input script.
switch</A>. Or the effect of the "-sf" switch
can be duplicated by adding a <A HREF = "suffix.html">suffix opt</A> command to your
input script.
</P>
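<P>For example (using the generic binary name lmp_g++ and input script
in.lj that appear elsewhere in this section), either of the following
enables the OPT styles:
</P>
<PRE>lmp_g++ -sf opt -in in.lj                 # suffix applied via the command-line switch
mpirun -np 4 lmp_g++ -sf opt -in in.lj    # same thing, run in parallel
</PRE>
<P>Adding a "suffix opt" line near the top of in.lj has the same effect.
</P>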
<P><B>Required hardware/software:</B>
</P>
@ -331,8 +332,9 @@ uses the OpenMP interface for multi-threading.
</UL>
<P>The latter two steps can be done using the "-pk omp" and "-sf omp"
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
either step can be done by adding the <A HREF = "package.html">package omp</A> or
<A HREF = "suffix.html">suffix omp</A> commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix omp</A> commands
respectively to your input script.
</P>
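<P>As an illustration (assuming the "-pk omp" switch takes the OpenMP
thread count as its first argument, as the <A HREF = "package.html">package omp</A> command does),
a run with 4 MPI tasks and 4 threads per task might look like:
</P>
<PRE>mpirun -np 4 lmp_g++ -sf omp -pk omp 4 -in in.lj    # 4 MPI tasks x 4 OpenMP threads/task
</PRE>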
<P><B>Required hardware/software:</B>
</P>
@ -541,8 +543,9 @@ hardware.
</UL>
<P>The latter two steps can be done using the "-pk gpu" and "-sf gpu"
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
either step can be done by adding the <A HREF = "package.html">package gpu</A> or
<A HREF = "suffix.html">suffix gpu</A> commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the <A HREF = "package.html">package gpu</A> or <A HREF = "suffix.html">suffix gpu</A> commands
respectively to your input script.
</P>
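<P>For instance (assuming the GPU count per node is the first argument
to the "-pk gpu" switch), 12 MPI tasks per node could share one or two
GPUs:
</P>
<PRE>mpirun -np 12 lmp_g++ -sf gpu -pk gpu 1 -in in.lj    # 12 MPI tasks share 1 GPU per node
mpirun -np 12 lmp_g++ -sf gpu -pk gpu 2 -in in.lj    # 12 MPI tasks share 2 GPUs per node
</PRE>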
<P><B>Required hardware/software:</B>
</P>
@ -767,8 +770,9 @@ single CPU (core), assigned to each GPU.
</UL>
<P>The latter two steps can be done using the "-pk cuda" and "-sf cuda"
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
either step can be done by adding the <A HREF = "package.html">package cuda</A> or
<A HREF = "suffix.html">suffix cuda</A> commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the <A HREF = "package.html">package cuda</A> or <A HREF = "suffix.html">suffix cuda</A> commands
respectively to your input script.
</P>
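<P>For instance (the USER-CUDA package itself is enabled with the "-c on"
switch, as noted later in this section), a node with 2 GPUs could be
run with one MPI task per GPU:
</P>
<PRE>mpirun -np 2 lmp_cuda -c on -sf cuda -in in.lj             # 2 MPI tasks, one per GPU (default 2 GPUs/node)
mpirun -np 1 lmp_cuda -c on -sf cuda -pk cuda 1 -in in.lj  # 1 MPI task driving 1 GPU/node
</PRE>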
<P><B>Required hardware/software:</B>
</P>
@ -894,7 +898,8 @@ sets the number of GPUs/node to use to 2.
<PRE>pair_style lj/cut/cuda 2.5
</PRE>
<P>You only need to use the <A HREF = "package.html">package cuda</A> command if you
wish to change the number of GPUs/node to use or its other options.
wish to change the number of GPUs/node to use or its other option
defaults.
</P>
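<P>For example (a minimal sketch, following the "package cuda 2" default
mentioned above), to use one GPU per node instead of two, an input
script could begin with:
</P>
<PRE>package cuda 1
pair_style lj/cut/cuda 2.5
</PRE>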
<P><B>Speed-ups to expect:</B>
</P>
@ -988,22 +993,22 @@ for GPU acceleration:
</UL>
<P>The latter two steps can be done using the "-k on", "-pk kokkos" and
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switches</A>
respectively. Or either the steps can be done by adding the <A HREF = "package.html">package
kokkod</A> or <A HREF = "suffix.html">suffix kk</A> commands respectively
to your input script.
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the <A HREF = "package.html">package kokkos</A> or <A HREF = "suffix.html">suffix
kk</A> commands respectively to your input script.
</P>
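<P>For example, with KOKKOS enabled via the "-k on" switch on the command
line, the input-script equivalent of the "-sf kk" switch is a suffix
command near the top of the script:
</P>
<PRE>suffix kk
pair_style lj/cut 2.5    # run as lj/cut/kk because of the suffix
</PRE>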
<P><B>Required hardware/software:</B>
</P>
<P>The KOKKOS package can be used to build and run
LAMMPS on the following kinds of hardware configurations:
<P>The KOKKOS package can be used to build and run LAMMPS on the
following kinds of hardware:
</P>
<UL><LI>CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
<LI>CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
<LI>Phi: on one or more Intel Phi coprocessors (per node)
<LI>GPU: on the GPUs of a node with additional OpenMP threading on the CPUs
</UL>
<P>Intel Xeon Phi coprocessors are supported in "native" mode only, not
"offload" mode.
<P>Note that Intel Xeon Phi coprocessors are supported only in "native"
mode, not in the "offload" mode that the USER-INTEL package supports.
</P>
<P>Only NVIDIA GPUs are currently supported.
</P>
@ -1094,31 +1099,32 @@ tasks used per node. E.g. the mpirun command does this via its -np
and -ppn switches.
</P>
<P>When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used. Note that the product of
MPI tasks * OpenMP threads/task should not exceed the physical number
of cores (on a node), otherwise performance will suffer.
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
</P>
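<P>For example, on a dual hex-core (12-core) node, either of these
fills the node exactly:
</P>
<PRE>mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj    # 1 task x 12 threads/task = 12 cores
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj     # 2 tasks x 6 threads/task = 12 cores
</PRE>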
<P>When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.
</P>
<P>When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to insure there is one or more MPI tasks
per coprocessor and choose the number of threads to use on a
coproessor per MPI task. The product of MPI tasks * coprocessor
threads/task should not exceed the maximum number of threads the
coproprocessor is designed to run, otherwise performance will suffer.
This value is 240 for current generation Xeon Phi(TM) chips, which is
60 physical cores * 4 threads/core.
</P>
<P>NOTE: does not matter how many Phi per node, only concenred
with MPI tasks
coprocessor support you need to ensure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.
</P>
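<P>For instance, any MPI task / thread combination whose product is 240
keeps the coprocessor fully occupied:
</P>
<PRE>mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj    # 12 tasks x 20 threads/task = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj     # 30 tasks x 8 threads/task = 240
</PRE>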
<P>You must use the "-k on" <A HREF = "Section_start.html#start_7">command-line
switch</A> to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are documented
<A HREF = "Section_start.html#start_7">here</A>. The two commonly used ones are as
follows:
system. Those arguments are <A HREF = "Section_start.html#start_7">documented
here</A>. The two most commonly used arguments
are:
</P>
<PRE>-k on t Nt
-k on g Ng
@ -1128,69 +1134,63 @@ host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
task to use within a node. For host=MIC, it specifies how many Xeon Phi
threads per MPI task to use within a node. The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine. But for host=MIC this may run 240 MPI tasks on the coprocessor,
which could give very poor perforamnce.
fine. But for host=MIC you will typically end up using far fewer than
the 240 available threads, which could give very poor performance.
</P>
<P>The "g Ng" option applies to device=CUDA. It specifies how many GPUs
per compute node to use. The default is 1, so this only needs to be
specified if you have 2 or more GPUs per compute node.
</P>
<P>This also issues a default <A HREF = "package.html">package cuda 2</A> command which
sets the number of GPUs/node to use to 2.
</P>
<P>The "-k on" switch also issues a default <A HREF = "package.html">package kk neigh full
comm/exchange host comm/forward host</A> command which sets
some KOKKOS options to default values, discussed on the
<A HREF = "package.html">package</A> command doc page.
<P>The "-k on" switch also issues a default <A HREF = "package.html">package kokkos neigh full
comm host</A> command which sets various KOKKOS options to
default values, as discussed on the <A HREF = "package.html">package</A> command doc
page.
</P>
<P>Use the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A>,
which will automatically append "kokkos" to styles that support it.
Use the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line switch</A>
if you wish to override any of the default values set by the <A HREF = "package.html">package
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line switch</A> if
you wish to override any of the default values set by the <A HREF = "package.html">package
kokkos</A> command invoked by the "-k on" switch.
</P>
<P>host=OMP, dual hex-core nodes (12 threads/node):
</P>
<PRE>mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
<PRE>host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes
</PRE>
<P>host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis
</P>
<PRE>mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12*20 = 240
mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj
<PRE>host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes
</PRE>
<P>host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
</P>
<PRE>mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
</PRE>
<P>host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
</P>
<P>Dual 8-core CPUs and 2 GPUs:
</P>
<PRE>mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
<PRE>host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes
</PRE>
<P><B>Or run with the KOKKOS package by editing an input script:</B>
</P>
<P>The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA are the same.
</P>
<P>of one MPI task per GPU is the same.
<P>You must still use the "-k on" <A HREF = "Section_start.html#start_7">command-line
switch</A> to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.
</P>
<P>You must still use the "-c on" <A HREF = "Section_start.html#start_7">command-line
switch</A> to enable the USER-CUDA package.
This also issues a default <A HREF = "pacakge.html">package cuda 2</A> command which
sets the number of GPUs/node to use to 2.
<P>Use the <A HREF = "suffix.html">suffix kk</A> command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
</P>
<P>Use the <A HREF = "suffix.html">suffix cuda</A> command, or you can explicitly add a
"cuda" suffix to individual styles in your input script, e.g.
</P>
<PRE>pair_style lj/cut/cuda 2.5
<PRE>pair_style lj/cut/kk 2.5
</PRE>
<P>You only need to use the <A HREF = "package.html">package cuda</A> command if you
wish to change the number of GPUs/node to use or its other options.
<P>You only need to use the <A HREF = "package.html">package kokkos</A> command if you
wish to change any of its option defaults.
</P>
<P><B>Speed-ups to expect:</B>
</P>
@ -1210,8 +1210,8 @@ than 20%).
performance of a KOKKOS style is a bit slower than the USER-OMP
package.
<LI>When running on GPUs, KOKKOS currently out-performs the
USER-CUDA and GPU packages.
<LI>When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages.
<LI>When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware.
@ -1222,8 +1222,8 @@ hardware.
</P>
<P><B>Guidelines for best performance:</B>
</P>
<P>Here are guidline for using the KOKKOS package on the different hardware
configurations listed above.
<P>Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.
</P>
<P>Many of the guidelines use the <A HREF = "package.html">package kokkos</A> command.
See its doc page for details and default settings. Experimenting with
@ -1234,7 +1234,7 @@ its options can provide a speed-up for specific calculations.
<P>If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k <A HREF = "Section_start.html#start_7">command-line
the "t" keyword of the "-k" <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
@ -1245,15 +1245,14 @@ CPU(s).
<LI>run with N MPI tasks/node and 1 thread/task
<LI>run with settings in between these extremes
</UL>
<P>Examples of mpirun commands in these modes, for nodes with dual
hex-core CPUs and no GPU, are shown above.
<P>Examples of mpirun commands in these modes are shown above.
</P>
<P>When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
</P>
<P>If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), it can be forced with these flags:
for your MPI installation), binding can be forced with these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
@ -1276,7 +1275,7 @@ details).
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
</P>
<P>Use the <A HREF = "Section_commands.html#start_7">-kokkos command-line switch</A> to
<P>Use the "-k" <A HREF = "Section_commands.html#start_7">command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
@ -1286,14 +1285,13 @@ threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.
</P>
<P>Examples of mpirun commands that follow these rules, for nodes with
dual hex-core CPUs and one or two GPUs, are shown above.
<P>Examples of mpirun commands that follow these rules are shown above.
</P>
<P>When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
<P>IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
<A HREF = "thermo_style.html">thermo</A> or <A HREF = "dump.html">dump</A> output will cause data
to be copied back to the CPU.
</P>
@ -1329,8 +1327,7 @@ threads/task as Nt. The product of these 2 values should be N, i.e.
4 so that logical threads from more than one MPI task do not run on
the same physical core.
</P>
<P>Examples of mpirun commands that follow these rules, for Intel Phi
nodes with 61 cores, are shown above.
<P>Examples of mpirun commands that follow these rules are shown above.
</P>
<P><B>Restrictions:</B>
</P>
@ -1395,8 +1392,8 @@ steps:
<P>The latter two steps in the first case and the last step in the
coprocessor case can be done using the "-pk omp" and "-sf intel" and
"-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
respectively. Or any of the 3 steps can be done by adding the
<A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix cuda</A> or <A HREF = "package.html">package
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the <A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix
intel</A> commands respectively to your input script.
</P>
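<P>For example (using a hypothetical binary name lmp_intel for an
executable built with the USER-INTEL package):
</P>
<PRE>mpirun -np 8 lmp_intel -sf intel -in in.lj              # append the "intel" suffix to supported styles
mpirun -np 8 lmp_intel -sf intel -pk omp 2 -in in.lj    # additionally use 2 OpenMP threads per MPI task
</PRE>
<P>Putting a "suffix intel" command in the input script has the same
effect as the "-sf intel" switch.
</P>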
<P><B>Required hardware/software:</B>
@ -1514,7 +1511,7 @@ all its options if these switches are not specified, and how to set
the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.
</P>
<P><B>Or run with the USER-OMP package by editing an input script:</B>
<P><B>Or run with the USER-INTEL package by editing an input script:</B>
</P>
<P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
OpenMP threads per MPI task, and coprocessor threads per MPI task is


@ -258,8 +258,9 @@ include the OPT package and build LAMMPS
use OPT pair styles in your input script :ul
The last step can be done using the "-sf opt" "command-line
switch"_Section_start.html#start_7. Or it can be done by adding a
"suffix opt"_suffix.html command to your input script.
switch"_Section_start.html#start_7. Or the effect of the "-sf" switch
can be duplicated by adding a "suffix opt"_suffix.html command to your
input script.
[Required hardware/software:]
@ -325,8 +326,9 @@ use USER-OMP styles in your input script :ul
The latter two steps can be done using the "-pk omp" and "-sf omp"
"command-line switches"_Section_start.html#start_7 respectively. Or
either step can be done by adding the "package omp"_package.html or
"suffix omp"_suffix.html commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package omp"_package.html or "suffix omp"_suffix.html commands
respectively to your input script.
[Required hardware/software:]
@ -535,8 +537,9 @@ use GPU styles in your input script :ul
The latter two steps can be done using the "-pk gpu" and "-sf gpu"
"command-line switches"_Section_start.html#start_7 respectively. Or
either step can be done by adding the "package gpu"_package.html or
"suffix gpu"_suffix.html commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package gpu"_package.html or "suffix gpu"_suffix.html commands
respectively to your input script.
[Required hardware/software:]
@ -761,8 +764,9 @@ use USER-CUDA styles in your input script :ul
The latter two steps can be done using the "-pk cuda" and "-sf cuda"
"command-line switches"_Section_start.html#start_7 respectively. Or
either step can be done by adding the "package cuda"_package.html or
"suffix cuda"_suffix.html commands respectively to your input script.
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package cuda"_package.html or "suffix cuda"_suffix.html commands
respectively to your input script.
[Required hardware/software:]
@ -888,7 +892,8 @@ Use the "suffix cuda"_suffix.html command, or you can explicitly add a
pair_style lj/cut/cuda 2.5 :pre
You only need to use the "package cuda"_package.html command if you
wish to change the number of GPUs/node to use or its other options.
wish to change the number of GPUs/node to use or its other option
defaults.
[Speed-ups to expect:]
@ -982,22 +987,22 @@ use KOKKOS styles in your input script :ul
The latter two steps can be done using the "-k on", "-pk kokkos" and
"-sf kk" "command-line switches"_Section_start.html#start_7
respectively. Or either the steps can be done by adding the "package
kokkod"_package.html or "suffix kk"_suffix.html commands respectively
to your input script.
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands respectively to your input script.
[Required hardware/software:]
The KOKKOS package can be used to build and run
LAMMPS on the following kinds of hardware configurations:
The KOKKOS package can be used to build and run LAMMPS on the
following kinds of hardware:
CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
Phi: on one or more Intel Phi coprocessors (per node)
GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul
Intel Xeon Phi coprocessors are supported in "native" mode only, not
"offload" mode.
Note that Intel Xeon Phi coprocessors are supported only in "native"
mode, not in the "offload" mode that the USER-INTEL package supports.
Only NVIDIA GPUs are currently supported.
@ -1088,33 +1093,32 @@ tasks used per node. E.g. the mpirun command does this via its -np
and -ppn switches.
When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used. Note that the product of
MPI tasks * OpenMP threads/task should not exceed the physical number
of cores (on a node), otherwise performance will suffer.
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.
When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to insure there is one or more MPI tasks
per coprocessor and choose the number of threads to use on a
coproessor per MPI task. The product of MPI tasks * coprocessor
threads/task should not exceed the maximum number of threads the
coproprocessor is designed to run, otherwise performance will suffer.
This value is 240 for current generation Xeon Phi(TM) chips, which is
60 physical cores * 4 threads/core.
NOTE: does not matter how many Phi per node, only concenred
with MPI tasks
coprocessor support you need to ensure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.
You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are documented
"here"_Section_start.html#start_7. The two commonly used ones are as
follows:
system. Those arguments are "documented
here"_Section_start.html#start_7. The two most commonly used arguments
are:
-k on t Nt
-k on g Ng :pre
@ -1124,78 +1128,64 @@ host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI
task to use within a node. For host=MIC, it specifies how many Xeon Phi
threads per MPI task to use within a node. The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine. But for host=MIC this may run 240 MPI tasks on the coprocessor,
which could give very poor perforamnce.
fine. But for host=MIC you will typically end up using far fewer than
the 240 available threads, which could give very poor performance.
The "g Ng" option applies to device=CUDA. It specifies how many GPUs
per compute node to use. The default is 1, so this only needs to be
specified if you have 2 or more GPUs per compute node.
This also issues a default "package cuda 2"_package.html command which
sets the number of GPUs/node to use to 2.
The "-k on" switch also issues a default "package kk neigh full
comm/exchange host comm/forward host"_package.html command which sets
some KOKKOS options to default values, discussed on the
"package"_package.html command doc page.
The "-k on" switch also issues a default "package kokkos neigh full
comm host"_package.html command which sets various KOKKOS options to
default values, as discussed on the "package"_package.html command doc
page.
Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
which will automatically append "kokkos" to styles that support it.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7
if you wish to override any of the default values set by the "package
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
you wish to override any of the default values set by the "package
kokkos"_package.html command invoked by the "-k on" switch.
host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task :pre
mpirun -np 12 lmp_g++ -in in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj # ditto on 16 nodes :pre
host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12*20 = 240
mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj :pre
host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU :pre
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes :pre
host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
Dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU :pre
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes :pre
[Or run with the KOKKOS package by editing an input script:]
The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA are the same.
of one MPI task per GPU is the same.
You must still use the "-c on" "command-line
switch"_Section_start.html#start_7 to enable the USER-CUDA package.
This also issues a default "package cuda 2"_pacakge.html command which
sets the number of GPUs/node to use to 2.
Use the "suffix cuda"_suffix.html command, or you can explicitly add a
"cuda" suffix to individual styles in your input script, e.g.
pair_style lj/cut/cuda 2.5 :pre
You only need to use the "package cuda"_package.html command if you
wish to change the number of GPUs/node to use or its other options.
You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.
Use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5 :pre
You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults.
[Speed-ups to expect:]
@ -1215,8 +1205,8 @@ When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l
When running on GPUs, KOKKOS currently out-performs the
USER-CUDA and GPU packages. :l
When running on GPUs, KOKKOS is typically faster than the USER-CUDA
and GPU packages. :l
When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l,ule
@ -1227,8 +1217,8 @@ hardware.
[Guidelines for best performance:]
Here are guidline for using the KOKKOS package on the different hardware
configurations listed above.
Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.
Many of the guidelines use the "package kokkos"_package.html command.
See its doc page for details and default settings. Experimenting with
@ -1239,7 +1229,7 @@ its options can provide a speed-up for specific calculations.
If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
the "t" keyword of the "-k" "command-line
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
@ -1250,15 +1240,14 @@ run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul
Examples of mpirun commands in these modes, for nodes with dual
hex-core CPUs and no GPU, are shown above.
Examples of mpirun commands in these modes are shown above.
When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), it can be forced with these flags:
for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
@ -1281,7 +1270,7 @@ details).
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
Use the "-kokkos command-line switch"_Section_commands.html#start_7 to
Use the "-k" "command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
@ -1291,14 +1280,13 @@ threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.
Examples of mpirun commands that follow these rules, for nodes with
dual hex-core CPUs and one or two GPUs, are shown above.
Examples of mpirun commands that follow these rules are shown above.
When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
IMPORTANT NOTE: When using a GPU, you will achieve the best
performance if your input script does not use any fix or compute
styles which are not yet Kokkos-enabled. This allows data to stay on
the GPU for multiple timesteps, without being copied back to the host
CPU. Invoking a non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.
@ -1334,8 +1322,7 @@ threads/task as Nt. The product of these 2 values should be N, i.e.
4 so that logical threads from more than one MPI task do not run on
the same physical core.
Examples of mpirun commands that follow these rules, for Intel Phi
nodes with 61 cores, are shown above.
Examples of mpirun commands that follow these rules are shown above.
[Restrictions:]
@ -1400,9 +1387,9 @@ specify how many threads per coprocessor to use :ul
The latter two steps in the first case and the last step in the
coprocessor case can be done using the "-pk omp" and "-sf intel" and
"-pk intel" "command-line switches"_Section_start.html#start_7
respectively. Or any of the 3 steps can be done by adding the
"package intel"_package.html or "suffix cuda"_suffix.html or "package
intel"_package.html commands respectively to your input script.
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package intel"_package.html or "suffix
intel"_suffix.html commands respectively to your input script.
[Required hardware/software:]
@ -1519,7 +1506,7 @@ all its options if these switches are not specified, and how to set
the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.
[Or run with the USER-OMP package by editing an input script:]
[Or run with the USER-INTEL package by editing an input script:]
The discussion above for the mpirun/mpiexec command, MPI tasks/node,
OpenMP threads per MPI task, and coprocessor threads per MPI task is


@ -449,10 +449,10 @@ The <I>offload_ghost</I> default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.
</P>
<P>The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used or
not.
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
</P>
<P>If the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package omp *" were invoked, to


@ -451,10 +451,10 @@ The {offload_ghost} default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.
The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" "command-line switch"_Section_start.html#start_7 is used or
not.
The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
or not.
If the "-sf omp" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package omp *" were invoked, to