git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12508 f3b2605a-c512-4ea7-a41b-209d697bcdaa
parent 16864ce4e3
commit d0b6d228c7
@@ -137,7 +137,7 @@ library.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
@@ -134,7 +134,7 @@ library.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
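Note: the two launchers named in the hunks above differ only in their per-node switch. A minimal sketch, with the lmp_machine binary name and task counts chosen purely for illustration:

    mpirun -np 16 -ppn 8 lmp_machine -in in.script        # MPICH: 16 MPI tasks total, 8 per node
    mpirun -np 16 -npernode 8 lmp_machine -in in.script   # OpenMPI: same task layout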
@@ -133,7 +133,7 @@ re-compiled and linked to the new GPU library.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using the GPU package, you cannot assign more than one GPU to a
 single MPI task. However multiple MPI tasks can share the same GPU,
@@ -130,7 +130,7 @@ re-compiled and linked to the new GPU library.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using the GPU package, you cannot assign more than one GPU to a
 single MPI task. However multiple MPI tasks can share the same GPU,
@@ -28,10 +28,10 @@ once with an offload flag.
 package. This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
-electrostatics, can be run simultaneously in threaded mode on CPU
+electrostatics, can run simultaneously in threaded mode on the CPU
 cores. Since fewer MPI tasks than CPU cores will typically be invoked
-when running with coprocessors, this enables the extra cores to be
-utilized for useful computation.
+when running with coprocessors, this enables the extra CPU cores to be
+used for useful computation.
 </P>
 <P>If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
@@ -42,13 +42,13 @@ if available, after first testing if a style from the USER-INTEL
 package is available.
 </P>
 <P>Here is a quick overview of how to use the USER-INTEL package
-for CPU acceleration:
+for CPU-only acceleration:
 </P>
-<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
-<LI>specify -fopenmp with LINKFLAGS in your Makefile.machine
+<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
+<LI>specify -openmp with LINKFLAGS in your Makefile.machine
 <LI>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
-<LI>if using the USER-OMP package, specify how many threads per MPI task to use
-<LI>use USER-INTEL styles in your input script
+<LI>specify how many OpenMP threads per MPI task to use
+<LI>use USER-INTEL and (optionally) USER-OMP styles in your input script
 </UL>
 <P>Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
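Note: collected into one place, the build settings listed above would land in a machine makefile roughly as follows. This is a sketch: only the four CCFLAGS entries and the -openmp LINKFLAGS come from the text; the compiler name and -O3 level are assumed (the stock Makefile.intel mentioned in a later hunk ships ready-made settings):

    CC =        mpiicpc
    CCFLAGS =   -O3 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost
    LINK =      mpiicpc
    LINKFLAGS = -O3 -openmp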
@@ -56,15 +56,14 @@ steps:
 </P>
 <UL><LI>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
 <LI>add the flag -offload to LINKFLAGS in your Makefile.machine
-<LI>specify how many threads per coprocessor to use
+<LI>specify how many coprocessor threads per MPI task to use
 </UL>
 <P>The latter two steps in the first case and the last step in the
-coprocessor case can be done using the "-pk omp" and "-sf intel" and
-"-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
-respectively. Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix
-intel</A> or <A HREF = "package.html">package intel</A> commands
-respectively to your input script.
+coprocessor case can be done using the "-pk intel" and "-sf intel"
+<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
+the effect of the "-pk" or "-sf" switches can be duplicated by adding
+the <A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix intel</A>
+commands respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
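Note: for the coprocessor build, the same hypothetical fragment grows by exactly the two flags the list names:

    CCFLAGS =   -O3 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
    LINKFLAGS = -O3 -openmp -offload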
@@ -99,9 +98,9 @@ Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
 the runs, adding the flag <I>-xHost</I> to CCFLAGS will enable
 vectorization with the Intel(R) compiler.
 </P>
-<P>In order to build with support for an Intel(R) coprocessor, the flag
-<I>-offload</I> should be added to the LINKFLAGS line and the flag
--DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
+<P>In order to build with support for an Intel(R) Xeon Phi(TM)
+coprocessor, the flag <I>-offload</I> should be added to the LINKFLAGS line
+and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
 </P>
 <P>Note that the machine makefiles Makefile.intel and
 Makefile.intel_offload are included in the src/MAKE directory with
@@ -118,71 +117,77 @@ higher is recommended.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
-<P>If LAMMPS was also built with the USER-OMP package, you need to choose
-how many OpenMP threads per MPI task will be used by the USER-OMP
-package. Note that the product of MPI tasks * OpenMP threads/task
-should not exceed the physical number of cores (on a node), otherwise
-performance will suffer.
+<P>If you plan to compute (any portion of) pairwise interactions using
+USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
+you need to choose how many OpenMP threads per MPI task to use. Note
+that the product of MPI tasks * OpenMP threads/task should not exceed
+the physical number of cores (on a node), otherwise performance will
+suffer.
 </P>
 <P>If LAMMPS was built with coprocessor support for the USER-INTEL
-package, you need to specify the number of coprocessor/node and the
-number of threads to use on the coprocessor per MPI task. Note that
+package, you also need to specify the number of coprocessor/node and
+the number of coprocessor threads per MPI task to use. Note that
 coprocessor threads (which run on the coprocessor) are totally
-independent from OpenMP threads (which run on the CPU). The product
-of MPI tasks * coprocessor threads/task should not exceed the maximum
-number of threads the coproprocessor is designed to run, otherwise
-performance will suffer. This value is 240 for current generation
-Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
-threads/core value can be set to a smaller value if desired by an
-option on the <A HREF = "package.html">package intel</A> command, in which case the
-maximum number of threads is also reduced.
+independent from OpenMP threads (which run on the CPU). The default
+values for the settings that affect coprocessor threads are typically
+fine, as discussed below.
 </P>
 <P>Use the "-sf intel" <A HREF = "Section_start.html#start_7">command-line switch</A>,
 which will automatically append "intel" to styles that support it. If
-a style does not support it, a "omp" suffix is tried next. Use the
-"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A>, to set
-Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
-the USER-OMP package. Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
+a style does not support it, an "omp" suffix is tried next. OpenMP
+threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
+"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switches</A>, which
+set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
+is only allowed if LAMMPS was also built with the USER-OMP package.
+</P>
+<P>Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to set Nphi = # of Xeon Phi(TM)
-coprocessors/node, if LAMMPS was built with coprocessor support.
+coprocessors/node, if LAMMPS was built with coprocessor support. All
+the available coprocessor threads on each Phi will be divided among
+MPI tasks, unless the <I>tptask</I> option of the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A> is used to limit the coprocessor
+threads per MPI task. See the <A HREF = "package.html">package intel</A> command
+for details.
 </P>
 <PRE>CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
 </PRE>
 <PRE>CPU-only with USER-OMP (and Intel vectorization on CPU):
-lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
-mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes
+lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
+mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
+mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes
 </PRE>
-<PRE>CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
-lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
-# each MPI task uses 60 threads on 1 coprocessor
-mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
-# each MPI task uses 120 threads on one of 2 coprocessors
+<PRE>CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
+lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
+lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
+mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
+mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
+mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task
 </PRE>
-<P>Note that if the "-sf intel" switch is used, it also issues two
-default commands: <A HREF = "package.html">package omp 0</A> and <A HREF = "package.html">package intel
-1</A> command. These set the number of OpenMP threads per
-MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
-LAMMPS was not built with the USER-OMP package. The latter is ignored
-is LAMMPS was not built with coprocessor support, except for its
-optional precision setting.
+<P>Note that if the "-sf intel" switch is used, it also invokes two
+default commands: <A HREF = "package.html">package intel 1</A>, followed by <A HREF = "package.html">package
+omp 0</A>. These both set the number of OpenMP threads per
+MPI task via the OMP_NUM_THREADS environment variable. The first
+command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
+the precision mode to "mixed", as one of its option defaults). The
+latter command is not invoked if LAMMPS was not built with the
+USER-OMP package. The Nphi = 1 value for the first command is ignored
+if LAMMPS was not built with coprocessor support.
 </P>
-<P>Using the "-pk omp" switch explicitly allows for direct setting of the
-number of OpenMP threads per MPI task, and additional options. Using
-the "-pk intel" switch explicitly allows for direct setting of the
-number of coprocessors/node, and additional options. The syntax for
-these two switches is the same as the <A HREF = "package.html">package omp</A> and
-<A HREF = "package.html">package intel</A> commands. See the <A HREF = "package.html">package</A>
-command doc page for details, including the default values used for
-all its options if these switches are not specified, and how to set
-the number of OpenMP threads via the OMP_NUM_THREADS environment
-variable if desired.
+<P>Using the "-pk intel" or "-pk omp" switches explicitly allows for
+direct setting of the number of OpenMP threads per MPI task, and
+additional options for either of the USER-INTEL or USER-OMP packages.
+In particular, the "-pk intel" switch sets the number of
+coprocessors/node and can limit the number of coprocessor threads per
+MPI task. The syntax for these two switches is the same as the
+<A HREF = "package.html">package omp</A> and <A HREF = "package.html">package intel</A> commands.
+See the <A HREF = "package.html">package</A> command doc page for details, including
+the default values used for all its options if these switches are not
+specified, and how to set the number of OpenMP threads via the
+OMP_NUM_THREADS environment variable if desired.
 </P>
 <P><B>Or run with the USER-INTEL package by editing an input script:</B>
 </P>
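Note: per the new text, Nt = 0 (the default) defers to the OpenMP environment, so the thread counts in the examples above can also come from OMP_NUM_THREADS; a sketch with illustrative counts:

    export OMP_NUM_THREADS=4
    mpirun -np 4 lmp_machine -sf intel -in in.script   # 4 MPI tasks, 4 OpenMP threads per task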
@@ -195,19 +200,20 @@ the same.
 </P>
 <PRE>pair_style lj/cut/intel 2.5
 </PRE>
-<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
-USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
-intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line switches</A>
-were used. It specifies how many OpenMP threads per MPI task to use,
-as well as other options. Its doc page explains how to set the number
-of threads via an environment variable if desired.
+<P>You must also use the <A HREF = "package.html">package intel</A> command, unless the
+"-sf intel" or "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switches</A> were used. It specifies how many
+coprocessors/node to use, as well as other OpenMP threading and
+coprocessor options. Its doc page explains how to set the number of
+OpenMP threads via an environment variable if desired.
 </P>
-<P>You must also use the <A HREF = "package.html">package intel</A> command to enable
-coprocessor support within the USER-INTEL package (assuming LAMMPS was
-built with coprocessor support) unless the "-sf intel" or "-pk intel"
-<A HREF = "Section_start.html#start_7">command-line switches</A> were used. It
-specifies how many coprocessors/node to use, as well as other
-coprocessor options.
+<P>If LAMMPS was also built with the USER-OMP package, you must also use
+the <A HREF = "package.html">package omp</A> command to enable that package, unless
+the "-sf intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line
+switches</A> were used. It specifies how many
+OpenMP threads per MPI task to use, as well as other options. Its doc
+page explains how to set the number of OpenMP threads via an
+environment variable if desired.
 </P>
 <P><B>Speed-ups to expect:</B>
 </P>
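Note: the switch-driven setup in this hunk can equivalently be written into the input script itself; a minimal sketch built only from commands this page names (counts illustrative):

    package intel 1 omp 4     # 1 coprocessor/node, 4 OpenMP threads per MPI task
    package omp 4             # only if built with USER-OMP; keep Nthreads consistent
    suffix intel              # same effect as the "-sf intel" switch
    pair_style lj/cut 2.5     # resolves to lj/cut/intel via the suffix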
@@ -25,10 +25,10 @@ The USER-INTEL package can be used in tandem with the USER-OMP
 package. This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
-electrostatics, can be run simultaneously in threaded mode on CPU
+electrostatics, can run simultaneously in threaded mode on the CPU
 cores. Since fewer MPI tasks than CPU cores will typically be invoked
-when running with coprocessors, this enables the extra cores to be
-utilized for useful computation.
+when running with coprocessors, this enables the extra CPU cores to be
+used for useful computation.
 
 If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
@@ -39,13 +39,13 @@ if available, after first testing if a style from the USER-INTEL
 package is available.
 
 Here is a quick overview of how to use the USER-INTEL package
-for CPU acceleration:
+for CPU-only acceleration:
 
-specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
-specify -fopenmp with LINKFLAGS in your Makefile.machine
+specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
+specify -openmp with LINKFLAGS in your Makefile.machine
 include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
-if using the USER-OMP package, specify how many threads per MPI task to use
-use USER-INTEL styles in your input script :ul
+specify how many OpenMP threads per MPI task to use
+use USER-INTEL and (optionally) USER-OMP styles in your input script :ul
 
 Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
@@ -53,15 +53,14 @@ steps:
 
 add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
 add the flag -offload to LINKFLAGS in your Makefile.machine
-specify how many threads per coprocessor to use :ul
+specify how many coprocessor threads per MPI task to use :ul
 
 The latter two steps in the first case and the last step in the
-coprocessor case can be done using the "-pk omp" and "-sf intel" and
-"-pk intel" "command-line switches"_Section_start.html#start_7
-respectively. Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the "package omp"_package.html or "suffix
-intel"_suffix.html or "package intel"_package.html commands
-respectively to your input script.
+coprocessor case can be done using the "-pk intel" and "-sf intel"
+"command-line switches"_Section_start.html#start_7 respectively. Or
+the effect of the "-pk" or "-sf" switches can be duplicated by adding
+the "package intel"_package.html or "suffix intel"_suffix.html
+commands respectively to your input script.
 
 [Required hardware/software:]
 
@@ -96,9 +95,9 @@ If you are compiling on the same architecture that will be used for
 the runs, adding the flag {-xHost} to CCFLAGS will enable
 vectorization with the Intel(R) compiler.
 
-In order to build with support for an Intel(R) coprocessor, the flag
-{-offload} should be added to the LINKFLAGS line and the flag
--DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
+In order to build with support for an Intel(R) Xeon Phi(TM)
+coprocessor, the flag {-offload} should be added to the LINKFLAGS line
+and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
 
 Note that the machine makefiles Makefile.intel and
 Makefile.intel_offload are included in the src/MAKE directory with
@@ -115,71 +114,77 @@ higher is recommended.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
-If LAMMPS was also built with the USER-OMP package, you need to choose
-how many OpenMP threads per MPI task will be used by the USER-OMP
-package. Note that the product of MPI tasks * OpenMP threads/task
-should not exceed the physical number of cores (on a node), otherwise
-performance will suffer.
+If you plan to compute (any portion of) pairwise interactions using
+USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
+you need to choose how many OpenMP threads per MPI task to use. Note
+that the product of MPI tasks * OpenMP threads/task should not exceed
+the physical number of cores (on a node), otherwise performance will
+suffer.
 
 If LAMMPS was built with coprocessor support for the USER-INTEL
-package, you need to specify the number of coprocessor/node and the
-number of threads to use on the coprocessor per MPI task. Note that
+package, you also need to specify the number of coprocessor/node and
+the number of coprocessor threads per MPI task to use. Note that
 coprocessor threads (which run on the coprocessor) are totally
-independent from OpenMP threads (which run on the CPU). The product
-of MPI tasks * coprocessor threads/task should not exceed the maximum
-number of threads the coproprocessor is designed to run, otherwise
-performance will suffer. This value is 240 for current generation
-Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
-threads/core value can be set to a smaller value if desired by an
-option on the "package intel"_package.html command, in which case the
-maximum number of threads is also reduced.
+independent from OpenMP threads (which run on the CPU). The default
+values for the settings that affect coprocessor threads are typically
+fine, as discussed below.
 
 Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
 which will automatically append "intel" to styles that support it. If
-a style does not support it, a "omp" suffix is tried next. Use the
-"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
-Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
-the USER-OMP package. Use the "-pk intel Nphi" "command-line
+a style does not support it, an "omp" suffix is tried next. OpenMP
+threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
+"-pk omp Nt" "command-line switches"_Section_start.html#start_7, which
+set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
+is only allowed if LAMMPS was also built with the USER-OMP package.
+
+Use the "-pk intel Nphi" "command-line
 switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
-coprocessors/node, if LAMMPS was built with coprocessor support.
+coprocessors/node, if LAMMPS was built with coprocessor support. All
+the available coprocessor threads on each Phi will be divided among
+MPI tasks, unless the {tptask} option of the "-pk intel" "command-line
+switch"_Section_start.html#start_7 is used to limit the coprocessor
+threads per MPI task. See the "package intel"_package.html command
+for details.
 
 CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre
 
 CPU-only with USER-OMP (and Intel vectorization on CPU):
-lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
-mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes :pre
+lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
+mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
+mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre
 
-CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
-lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
-# each MPI task uses 60 threads on 1 coprocessor
-mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
-# each MPI task uses 120 threads on one of 2 coprocessors :pre
+CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
+lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
+lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
+mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
+mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
+mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task :pre
 
-Note that if the "-sf intel" switch is used, it also issues two
-default commands: "package omp 0"_package.html and "package intel
-1"_package.html command. These set the number of OpenMP threads per
-MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
-LAMMPS was not built with the USER-OMP package. The latter is ignored
-is LAMMPS was not built with coprocessor support, except for its
-optional precision setting.
+Note that if the "-sf intel" switch is used, it also invokes two
+default commands: "package intel 1"_package.html, followed by "package
+omp 0"_package.html. These both set the number of OpenMP threads per
+MPI task via the OMP_NUM_THREADS environment variable. The first
+command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
+the precision mode to "mixed", as one of its option defaults). The
+latter command is not invoked if LAMMPS was not built with the
+USER-OMP package. The Nphi = 1 value for the first command is ignored
+if LAMMPS was not built with coprocessor support.
 
-Using the "-pk omp" switch explicitly allows for direct setting of the
-number of OpenMP threads per MPI task, and additional options. Using
-the "-pk intel" switch explicitly allows for direct setting of the
-number of coprocessors/node, and additional options. The syntax for
-these two switches is the same as the "package omp"_package.html and
-"package intel"_package.html commands. See the "package"_package.html
-command doc page for details, including the default values used for
-all its options if these switches are not specified, and how to set
-the number of OpenMP threads via the OMP_NUM_THREADS environment
-variable if desired.
+Using the "-pk intel" or "-pk omp" switches explicitly allows for
+direct setting of the number of OpenMP threads per MPI task, and
+additional options for either of the USER-INTEL or USER-OMP packages.
+In particular, the "-pk intel" switch sets the number of
+coprocessors/node and can limit the number of coprocessor threads per
+MPI task. The syntax for these two switches is the same as the
+"package omp"_package.html and "package intel"_package.html commands.
+See the "package"_package.html command doc page for details, including
+the default values used for all its options if these switches are not
+specified, and how to set the number of OpenMP threads via the
+OMP_NUM_THREADS environment variable if desired.
 
 [Or run with the USER-INTEL package by editing an input script:]
 
@@ -192,19 +197,20 @@ Use the "suffix intel"_suffix.html command, or you can explicitly add an
 
 pair_style lj/cut/intel 2.5 :pre
 
-You must also use the "package omp"_package.html command to enable the
-USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
-intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
-were used. It specifies how many OpenMP threads per MPI task to use,
-as well as other options. Its doc page explains how to set the number
-of threads via an environment variable if desired.
+You must also use the "package intel"_package.html command, unless the
+"-sf intel" or "-pk intel" "command-line
+switches"_Section_start.html#start_7 were used. It specifies how many
+coprocessors/node to use, as well as other OpenMP threading and
+coprocessor options. Its doc page explains how to set the number of
+OpenMP threads via an environment variable if desired.
 
-You must also use the "package intel"_package.html command to enable
-coprocessor support within the USER-INTEL package (assuming LAMMPS was
-built with coprocessor support) unless the "-sf intel" or "-pk intel"
-"command-line switches"_Section_start.html#start_7 were used. It
-specifies how many coprocessors/node to use, as well as other
-coprocessor options.
+If LAMMPS was also built with the USER-OMP package, you must also use
+the "package omp"_package.html command to enable that package, unless
+the "-sf intel" or "-pk omp" "command-line
+switches"_Section_start.html#start_7 were used. It specifies how many
+OpenMP threads per MPI task to use, as well as other options. Its doc
+page explains how to set the number of OpenMP threads via an
+environment variable if desired.
 
 [Speed-ups to expect:]
 
@@ -178,7 +178,7 @@ double precision.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
@@ -175,7 +175,7 @@ double precision.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
@@ -57,7 +57,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>You need to choose how many threads per MPI task will be used by the
 USER-OMP package. Note that the product of MPI tasks * threads/task
@@ -54,7 +54,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 You need to choose how many threads per MPI task will be used by the
 USER-OMP package. Note that the product of MPI tasks * threads/task
doc/package.html (137 lines changed)
@@ -59,20 +59,22 @@
 <I>intel</I> args = NPhi keyword value ...
 Nphi = # of coprocessors per node
 zero or more keyword/value pairs may be appended
-keywords = <I>prec</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
-<I>prec</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
+keywords = <I>omp</I> or <I>mode</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
+<I>omp</I> value = Nthreads
+Nthreads = number of OpenMP threads to use on CPU (default = 0)
+<I>mode</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
 single = perform force calculations in single precision
 mixed = perform force calculations in mixed precision
 double = perform force calculations in double precision
-<I>balance</I> value = split
-split = fraction of work to offload to coprocessor, -1 for dynamic
-<I>ghost</I> value = <I>yes</I> or <I>no</I>
-yes = include ghost atoms for offload
-no = do not include ghost atoms for offload
-<I>tpc</I> value = Ntpc
-Ntpc = number of threads to use on each physical core of coprocessor
-<I>tptask</I> value = Ntptask
-Ntptask = max number of threads to use on coprocessor for each MPI task
+<I>balance</I> value = split
+split = fraction of work to offload to coprocessor, -1 for dynamic
+<I>ghost</I> value = <I>yes</I> or <I>no</I>
+yes = include ghost atoms for offload
+no = do not include ghost atoms for offload
+<I>tpc</I> value = Ntpc
+Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
+<I>tptask</I> value = Ntptask
+Ntptask = max number of coprocessor threads per MPI task (default = 240)
 <I>kokkos</I> args = keyword value ...
 zero or more keyword/value pairs may be appended
 keywords = <I>neigh</I> or <I>newton</I> or <I>binsize</I> or <I>comm</I> or <I>comm/exchange</I> or <I>comm/forward</I>
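Note: read against the argument list above, a command exercising every new keyword would look like this sketch (all values illustrative):

    package intel 2 omp 8 mode mixed balance 0.5 ghost no tpc 4 tptask 120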
@@ -114,7 +116,8 @@ package cuda 1 test 3948
 package kokkos neigh half/thread comm device
 package omp 0 neigh no
 package omp 4
-package intel * mixed balance -1
+package intel 1
+package intel 2 omp 4 mode mixed balance 0.5
 </PRE>
 <P><B>Description:</B>
 </P>
@@ -324,18 +327,56 @@ lib/gpu/Makefile that is used.
 <HR>
 
 <P>The <I>intel</I> style invokes settings associated with the use of the
-USER-INTEL package. All of its settings, except the <I>prec</I> keyword,
-are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
-when building with the USER-INTEL package. All of its settings,
-including the <I>prec</I> keyword are applicable if LAMMPS was built with
-coprocessor support.
+USER-INTEL package. All of its settings, except the <I>omp</I> and <I>mode</I>
+keywords, are ignored if LAMMPS was not built with Xeon Phi
+coprocessor support. All of its settings, including the <I>omp</I> and
+<I>mode</I> keywords, are applicable if LAMMPS was built with coprocessor
+support.
 </P>
 <P>The <I>Nphi</I> argument sets the number of coprocessors per node.
+This can be set to any value, including 0, if LAMMPS was not
+built with coprocessor support.
 </P>
 <P>Optional keyword/value pairs can also be specified. Each has a
 default value as listed below.
 </P>
-<P>The <I>prec</I> keyword argument determines the precision mode to use for
+<P>The <I>omp</I> keyword determines the number of OpenMP threads allocated
+for each MPI task when any portion of the interactions computed by a
+USER-INTEL pair style are run on the CPU. This can be the case even
+if LAMMPS was built with coprocessor support; see the <I>balance</I>
+keyword discussion below. If you are running with fewer MPI tasks/node
+than there are CPUs, it can be advantageous to use OpenMP threading on
+the CPUs.
+</P>
+<P>IMPORTANT NOTE: The <I>omp</I> keyword has nothing to do with coprocessor
+threads on the Xeon Phi; see the <I>tpc</I> and <I>tptask</I> keywords below for
+a discussion of coprocessor threads.
+</P>
+<P>The <I>Nthread</I> value for the <I>omp</I> keyword sets the number of OpenMP
+threads allocated for each MPI task. Setting <I>Nthread</I> = 0 (the
+default) instructs LAMMPS to use whatever value is the default for the
+given OpenMP environment. This is usually determined via the
+<I>OMP_NUM_THREADS</I> environment variable or the compiler runtime, which
+is usually a value of 1.
+</P>
+<P>For more details, including examples of how to set the OMP_NUM_THREADS
+environment variable, see the discussion of the <I>Nthreads</I> setting on
+this doc page for the "package omp" command. Nthreads is a required
+argument for the USER-OMP package. Its meaning is exactly the same
+for the USER-INTEL package.
+</P>
+<P>IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
+USER-OMP packages, be aware that both packages allow setting of the
+<I>Nthreads</I> value via their package commands, but there is only a
+single global <I>Nthreads</I> value used by OpenMP. Thus if both package
+commands are invoked, you should ensure the two values are consistent.
+If they are not, the last one invoked will take precedence, for both
+packages. Also note that if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A> is used, it invokes a "package
+intel" command, followed by a "package omp" command, both with a
+setting of <I>Nthreads</I> = 0.
+</P>
+<P>The <I>mode</I> keyword determines the precision mode to use for
 computing pair style forces, either on the CPU or on the coprocessor,
 when using a USER-INTEL supported <A HREF = "pair_style.html">pair style</A>. It
 can take a value of <I>single</I>, <I>mixed</I> which is the default, or
@@ -347,12 +388,12 @@ quantities. <I>Double</I> means double precision is used for the entire
 force calculation.
 </P>
 <P>The <I>balance</I> keyword sets the fraction of <A HREF = "pair_style.html">pair
-style</A> work offloaded to the coprocessor style for
-split values between 0.0 and 1.0 inclusive. While this fraction of
-work is running on the coprocessor, other calculations will run on the
-host, including neighbor and pair calculations that are not offloaded,
-angle, bond, dihedral, kspace, and some MPI communications. If
-<I>split</I> is set to -1, the fraction of work is dynamically adjusted
+style</A> work offloaded to the coprocessor for split
+values between 0.0 and 1.0 inclusive. While this fraction of work is
+running on the coprocessor, other calculations will run on the host,
+including neighbor and pair calculations that are not offloaded, as
+well as angle, bond, dihedral, kspace, and some MPI communications.
+If <I>split</I> is set to -1, the fraction of work is dynamically adjusted
 automatically throughout the run. This typically gives performance
 within 5 to 10 percent of the optimal fixed fraction.
 </P>
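Note: in input-script form, the two balance modes just described would read (0.75 is an arbitrary fixed split; -1 comes from the text):

    package intel 1 balance 0.75   # offload a fixed 75% of pair style work
    package intel 1 balance -1     # adjust the offload fraction dynamically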
@@ -362,21 +403,28 @@ and force calculations. When the value = "no", ghost atoms are not
 offloaded. This option can reduce the amount of data transfer with
 the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
-setting is "on". When the value = "ues", ghost atoms are offloaded.
+setting is "on". When the value = "yes", ghost atoms are offloaded.
 In some cases this can provide better performance, especially if the
 <I>balance</I> fraction is high.
 </P>
-<P>The <I>tpc</I> keyword sets the maximum # of threads <I>Ntpc</I> that will
-run on each physical core of the coprocessor. The default value is
-set to 4, which is the number of hardware threads per core supported
-by the current generation Xeon Phi chips.
+<P>The <I>tpc</I> keyword sets the max # of coprocessor threads <I>Ntpc</I> that
+will run on each core of the coprocessor. The default value = 4,
+which is the number of hardware threads per core supported by the
+current generation Xeon Phi chips.
 </P>
-<P>The <I>tptask</I> keyword sets the maximum # of threads (Ntptask</I> that will
-be used on the coprocessor for each MPI task. This, along with the
-<I>tpc</I> keyword setting, are the only methods for changing the number of
-threads used on the coprocessor. The default value is set to 240 =
-60*4, which is the maximum # of threads supported by an entire current
-generation Xeon Phi chip.
+<P>The <I>tptask</I> keyword sets the max # of coprocessor threads <I>Ntptask</I>
+assigned to each MPI task. The default value = 240, which is the
+total # of threads an entire current generation Xeon Phi chip can run
+(240 = 60 cores * 4 threads/core). This means each MPI task assigned
+to the Phi will have enough threads for the chip to run the max allowed,
+even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
+the Phi, each will run with 30 threads. If you wish to limit the
+number of threads per MPI task, set <I>tptask</I> to a smaller value.
+E.g. for <I>tptask</I> = 16, if 8 MPI tasks are assigned, each will run
+with 16 threads, for a total of 128.
+</P>
+<P>Note that the default settings for <I>tpc</I> and <I>tptask</I> are fine for
+most problems, regardless of how many MPI tasks you assign to a Phi.
 </P>
 <HR>
 
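Note: a worked instance of the arithmetic above; with 8 MPI tasks per Phi the default already yields 240/8 = 30 threads each, and the same cap can be made explicit:

    package intel 1 tptask 30   # at most 30 coprocessor threads per MPI task (8 tasks * 30 = 240)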
@@ -581,15 +629,16 @@ must invoke the package gpu command in your input script or via the
 "-pk gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>.
 </P>
 <P>For the USER-INTEL package, the default is Nphi = 1 and the option
-defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
-that all of these settings, except "prec", are ignored if LAMMPS was
-not built with Xeon Phi coprocessor support. The default ghost option
-is determined by the pair style being used. This value is output to
-the screen in the offload report at the end of each run. These
-settings are made automatically if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
-switch</A> is used. If it is not used, you
-must invoke the package intel command in your input script or or via
-the "-pk intel" <A HREF = "Section_start.html#start_7">command-line switch</A>.
+defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
+240. The default ghost option is determined by the pair style being
+used. This value is output to the screen in the offload report at the
+end of each run. Note that all of these settings, except "omp" and
+"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
+support. These settings are made automatically if the "-sf intel"
+<A HREF = "Section_start.html#start_7">command-line switch</A> is used. If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.
 </P>
 <P>For the KOKKOS package, the option defaults neigh = full, newton =
 off, binsize = 0.0, and comm = host. These settings are made
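Note: if the stated defaults are spelled out, these two launches should be equivalent (a sketch; the second line omits the implicit "package omp 0" that "-sf intel" also issues when USER-OMP is installed):

    lmp_machine -sf intel -in in.script
    lmp_machine -sf intel -pk intel 1 omp 0 mode mixed balance -1 tpc 4 tptask 240 -in in.script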
doc/package.txt (142 lines changed)
@@ -54,20 +54,22 @@ args = arguments specific to the style :l
 {intel} args = NPhi keyword value ...
 Nphi = # of coprocessors per node
 zero or more keyword/value pairs may be appended
-keywords = {prec} or {balance} or {ghost} or {tpc} or {tptask}
-{prec} value = {single} or {mixed} or {double}
+keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask}
+{omp} value = Nthreads
+Nthreads = number of OpenMP threads to use on CPU (default = 0)
+{mode} value = {single} or {mixed} or {double}
 single = perform force calculations in single precision
 mixed = perform force calculations in mixed precision
 double = perform force calculations in double precision
-{balance} value = split
-split = fraction of work to offload to coprocessor, -1 for dynamic
-{ghost} value = {yes} or {no}
-yes = include ghost atoms for offload
-no = do not include ghost atoms for offload
-{tpc} value = Ntpc
-Ntpc = number of threads to use on each physical core of coprocessor
-{tptask} value = Ntptask
-Ntptask = max number of threads to use on coprocessor for each MPI task
+{balance} value = split
+split = fraction of work to offload to coprocessor, -1 for dynamic
+{ghost} value = {yes} or {no}
+yes = include ghost atoms for offload
+no = do not include ghost atoms for offload
+{tpc} value = Ntpc
+Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
+{tptask} value = Ntptask
+Ntptask = max number of coprocessor threads per MPI task (default = 240)
 {kokkos} args = keyword value ...
 zero or more keyword/value pairs may be appended
 keywords = {neigh} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward}
@@ -108,7 +110,8 @@ package cuda 1 test 3948
 package kokkos neigh half/thread comm device
 package omp 0 neigh no
 package omp 4
-package intel * mixed balance -1 :pre
+package intel 1
+package intel 2 omp 4 mode mixed balance 0.5 :pre
 
 [Description:]
 
@@ -263,11 +266,6 @@ cutoff of 20*sigma in LJ "units"_units.html and a neighbor skin
 distance of sigma, a {binsize} = 5.25*sigma can be more efficient than
 the default.
 
-
-
-
-
-
 The {split} keyword can be used for load balancing force calculations
 between CPU and GPU cores in GPU-enabled pair styles. If 0 < {split} <
 1.0, a fixed fraction of particles is offloaded to the GPU while force
@@ -323,18 +321,56 @@ lib/gpu/Makefile that is used.
 :line
 
 The {intel} style invokes settings associated with the use of the
-USER-INTEL package. All of its settings, except the {prec} keyword,
-are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
-when building with the USER-INTEL package. All of its settings,
-including the {prec} keyword are applicable if LAMMPS was built with
-coprocessor support.
+USER-INTEL package. All of its settings, except the {omp} and {mode}
+keywords, are ignored if LAMMPS was not built with Xeon Phi
+coprocessor support. All of its settings, including the {omp} and
+{mode} keywords, are applicable if LAMMPS was built with coprocessor
+support.
 
 The {Nphi} argument sets the number of coprocessors per node.
+This can be set to any value, including 0, if LAMMPS was not
+built with coprocessor support.
 
 Optional keyword/value pairs can also be specified. Each has a
 default value as listed below.
 
-The {prec} keyword argument determines the precision mode to use for
+The {omp} keyword determines the number of OpenMP threads allocated
+for each MPI task when any portion of the interactions computed by a
+USER-INTEL pair style are run on the CPU. This can be the case even
+if LAMMPS was built with coprocessor support; see the {balance}
+keyword discussion below. If you are running with fewer MPI tasks/node
+than there are CPUs, it can be advantageous to use OpenMP threading on
+the CPUs.
+
+IMPORTANT NOTE: The {omp} keyword has nothing to do with coprocessor
+threads on the Xeon Phi; see the {tpc} and {tptask} keywords below for
+a discussion of coprocessor threads.
+
+The {Nthread} value for the {omp} keyword sets the number of OpenMP
+threads allocated for each MPI task. Setting {Nthread} = 0 (the
+default) instructs LAMMPS to use whatever value is the default for the
+given OpenMP environment. This is usually determined via the
+{OMP_NUM_THREADS} environment variable or the compiler runtime, which
+is usually a value of 1.
+
+For more details, including examples of how to set the OMP_NUM_THREADS
+environment variable, see the discussion of the {Nthreads} setting on
+this doc page for the "package omp" command. Nthreads is a required
+argument for the USER-OMP package. Its meaning is exactly the same
+for the USER-INTEL package.
+
+IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
+USER-OMP packages, be aware that both packages allow setting of the
+{Nthreads} value via their package commands, but there is only a
+single global {Nthreads} value used by OpenMP. Thus if both package
+commands are invoked, you should ensure the two values are consistent.
+If they are not, the last one invoked will take precedence, for both
+packages. Also note that if the "-sf intel" "command-line
+switch"_Section_start.html#start_7 is used, it invokes a "package
+intel" command, followed by a "package omp" command, both with a
+setting of {Nthreads} = 0.
+
+The {mode} keyword determines the precision mode to use for
 computing pair style forces, either on the CPU or on the coprocessor,
 when using a USER-INTEL supported "pair style"_pair_style.html. It
 can take a value of {single}, {mixed} which is the default, or
@@ -346,12 +382,12 @@ quantities. {Double} means double precision is used for the entire
 force calculation.
 
 The {balance} keyword sets the fraction of "pair
-style"_pair_style.html work offloaded to the coprocessor style for
-split values between 0.0 and 1.0 inclusive. While this fraction of
-work is running on the coprocessor, other calculations will run on the
-host, including neighbor and pair calculations that are not offloaded,
-angle, bond, dihedral, kspace, and some MPI communications. If
-{split} is set to -1, the fraction of work is dynamically adjusted
+style"_pair_style.html work offloaded to the coprocessor for split
+values between 0.0 and 1.0 inclusive. While this fraction of work is
+running on the coprocessor, other calculations will run on the host,
+including neighbor and pair calculations that are not offloaded, as
+well as angle, bond, dihedral, kspace, and some MPI communications.
+If {split} is set to -1, the fraction of work is dynamically adjusted
 automatically throughout the run. This typically gives performance
 within 5 to 10 percent of the optimal fixed fraction.
 
@@ -361,21 +397,28 @@ and force calculations. When the value = "no", ghost atoms are not
 offloaded. This option can reduce the amount of data transfer with
 the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the "newton pair"_newton.html
-setting is "on". When the value = "ues", ghost atoms are offloaded.
+setting is "on". When the value = "yes", ghost atoms are offloaded.
 In some cases this can provide better performance, especially if the
 {balance} fraction is high.
 
-The {tpc} keyword sets the maximum # of threads {Ntpc} that will
-run on each physical core of the coprocessor. The default value is
-set to 4, which is the number of hardware threads per core supported
-by the current generation Xeon Phi chips.
+The {tpc} keyword sets the max # of coprocessor threads {Ntpc} that
+will run on each core of the coprocessor. The default value = 4,
+which is the number of hardware threads per core supported by the
+current generation Xeon Phi chips.
 
-The {tptask} keyword sets the maximum # of threads (Ntptask} that will
-be used on the coprocessor for each MPI task. This, along with the
-{tpc} keyword setting, are the only methods for changing the number of
-threads used on the coprocessor. The default value is set to 240 =
-60*4, which is the maximum # of threads supported by an entire current
-generation Xeon Phi chip.
+The {tptask} keyword sets the max # of coprocessor threads {Ntptask}
+assigned to each MPI task. The default value = 240, which is the
+total # of threads an entire current generation Xeon Phi chip can run
+(240 = 60 cores * 4 threads/core). This means each MPI task assigned
+to the Phi will have enough threads for the chip to run the max allowed,
+even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
+the Phi, each will run with 30 threads. If you wish to limit the
+number of threads per MPI task, set {tptask} to a smaller value.
+E.g. for {tptask} = 16, if 8 MPI tasks are assigned, each will run
+with 16 threads, for a total of 128.
+
+Note that the default settings for {tpc} and {tptask} are fine for
+most problems, regardless of how many MPI tasks you assign to a Phi.
 
 :line
 
@@ -580,15 +623,16 @@ must invoke the package gpu command in your input script or via the
 "-pk gpu" "command-line switch"_Section_start.html#start_7.
 
 For the USER-INTEL package, the default is Nphi = 1 and the option
-defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
-that all of these settings, except "prec", are ignored if LAMMPS was
-not built with Xeon Phi coprocessor support. The default ghost option
-is determined by the pair style being used. This value is output to
-the screen in the offload report at the end of each run. These
-settings are made automatically if the "-sf intel" "command-line
-switch"_Section_start.html#start_7 is used. If it is not used, you
-must invoke the package intel command in your input script or or via
-the "-pk intel" "command-line switch"_Section_start.html#start_7.
+defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
+240. The default ghost option is determined by the pair style being
+used. This value is output to the screen in the offload report at the
+end of each run. Note that all of these settings, except "omp" and
+"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
+support. These settings are made automatically if the "-sf intel"
+"command-line switch"_Section_start.html#start_7 is used. If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" "command-line
+switch"_Section_start.html#start_7.
 
 For the KOKKOS package, the option defaults neigh = full, newton =
 off, binsize = 0.0, and comm = host. These settings are made