git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12508 f3b2605a-c512-4ea7-a41b-209d697bcdaa

sjplimp 2014-09-12 21:19:51 +00:00
parent 16864ce4e3
commit d0b6d228c7
12 changed files with 362 additions and 257 deletions

View File

@ -137,7 +137,7 @@ library.
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
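<P>For example, a hypothetical launch of 32 MPI tasks spread across two
16-core nodes might look like this; the exact switch names depend on
your MPI library:
</P>
<PRE>mpirun -np 32 -ppn 16 lmp_machine -in in.script       # MPICH
mpirun -np 32 -npernode 16 lmp_machine -in in.script  # OpenMPI
</PRE>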
<P>When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.

View File

@ -134,7 +134,7 @@ library.
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.

View File

@ -133,7 +133,7 @@ re-compiled and linked to the new GPU library.
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
<P>When using the GPU package, you cannot assign more than one GPU to a
single MPI task. However multiple MPI tasks can share the same GPU,

View File

@ -130,7 +130,7 @@ re-compiled and linked to the new GPU library.
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
When using the GPU package, you cannot assign more than one GPU to a
single MPI task. However multiple MPI tasks can share the same GPU,

View File

@ -28,10 +28,10 @@ once with an offload flag.
package. This is useful when offloading pair style computations to
coprocessors, so that other styles not supported by the USER-INTEL
package, e.g. bond, angle, dihedral, improper, and long-range
electrostatics, can be run simultaneously in threaded mode on CPU
electrostatics, can run simultaneously in threaded mode on the CPU
cores. Since fewer MPI tasks than CPU cores will typically be invoked
when running with coprocessors, this enables the extra cores to be
utilized for useful computation.
when running with coprocessors, this enables the extra CPU cores to be
used for useful computation.
</P>
<P>If LAMMPS is built with both the USER-INTEL and USER-OMP packages
installed, this mode of operation is made easier to use, because the
@ -42,13 +42,13 @@ if available, after first testing if a style from the USER-INTEL
package is available.
</P>
<P>Here is a quick overview of how to use the USER-INTEL package
for CPU acceleration:
for CPU-only acceleration:
</P>
<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
<LI>specify -fopenmp with LINKFLAGS in your Makefile.machine
<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
<LI>specify -openmp with LINKFLAGS in your Makefile.machine
<LI>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
<LI>if using the USER-OMP package, specify how many threads per MPI task to use
<LI>use USER-INTEL styles in your input script
<LI>specify how many OpenMP threads per MPI task to use
<LI>use USER-INTEL and (optionally) USER-OMP styles in your input script
</UL>
<P>Using the USER-INTEL package to offload work to the Intel(R)
Xeon Phi(TM) coprocessor is the same except for these additional
@ -56,15 +56,14 @@ steps:
</P>
<UL><LI>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
<LI>add the flag -offload to LINKFLAGS in your Makefile.machine
<LI>specify how many threads per coprocessor to use
<LI>specify how many coprocessor threads per MPI task to use
</UL>
<P>The latter two steps in the first case and the last step in the
coprocessor case can be done using the "-pk omp" and "-sf intel" and
"-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix
intel</A> or <A HREF = "package.html">package intel</A> commands
respectively to your input script.
coprocessor case can be done using the "-pk intel" and "-sf intel"
<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the <A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix intel</A>
commands respectively to your input script.
</P>
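<P>For example, the following command-line launch (a sketch; the
executable and script names are placeholders):
</P>
<PRE>mpirun -np 4 lmp_machine -sf intel -pk intel 1 -in in.script
</PRE>
<P>has the same effect as launching without the switches and putting
these two lines at the top of in.script:
</P>
<PRE>package intel 1
suffix intel
</PRE>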
<P><B>Required hardware/software:</B>
</P>
@ -99,9 +98,9 @@ Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
the runs, adding the flag <I>-xHost</I> to CCFLAGS will enable
vectorization with the Intel(R) compiler.
</P>
<P>In order to build with support for an Intel(R) coprocessor, the flag
<I>-offload</I> should be added to the LINKFLAGS line and the flag
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
<P>In order to build with support for an Intel(R) Xeon Phi(TM)
coprocessor, the flag <I>-offload</I> should be added to the LINKFLAGS line
and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
</P>
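<P>As a rough sketch (actual machine makefiles contain additional
settings), the relevant lines in a src/MAKE/Makefile.machine for the
Intel compiler with coprocessor support might look like this:
</P>
<PRE>CCFLAGS =   -openmp -restrict -xHost -DLAMMPS_MEMALIGN=64 -DLMP_INTEL_OFFLOAD
LINKFLAGS = -openmp -offload
</PRE>
<P>Omit the -DLMP_INTEL_OFFLOAD and -offload flags for a CPU-only build.
</P>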
<P>Note that the machine makefiles Makefile.intel and
Makefile.intel_offload are included in the src/MAKE directory with
@ -118,71 +117,77 @@ higher is recommended.
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
<P>If LAMMPS was also built with the USER-OMP package, you need to choose
how many OpenMP threads per MPI task will be used by the USER-OMP
package. Note that the product of MPI tasks * OpenMP threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.
<P>If you plan to compute (any portion of) pairwise interactions using
USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
you need to choose how many OpenMP threads per MPI task to use. Note
that the product of MPI tasks * OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer.
</P>
<P>If LAMMPS was built with coprocessor support for the USER-INTEL
package, you need to specify the number of coprocessor/node and the
number of threads to use on the coprocessor per MPI task. Note that
package, you also need to specify the number of coprocessors/node and
the number of coprocessor threads per MPI task to use. Note that
coprocessor threads (which run on the coprocessor) are totally
independent from OpenMP threads (which run on the CPU). The product
of MPI tasks * coprocessor threads/task should not exceed the maximum
number of threads the coproprocessor is designed to run, otherwise
performance will suffer. This value is 240 for current generation
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
threads/core value can be set to a smaller value if desired by an
option on the <A HREF = "package.html">package intel</A> command, in which case the
maximum number of threads is also reduced.
independent from OpenMP threads (which run on the CPU). The default
values for the settings that affect coprocessor threads are typically
fine, as discussed below.
</P>
<P>Use the "-sf intel" <A HREF = "Section_start.html#start_7">command-line switch</A>,
which will automatically append "intel" to styles that support it. If
a style does not support it, a "omp" suffix is tried next. Use the
"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A>, to set
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
the USER-OMP package. Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
a style does not support it, an "omp" suffix is tried next. OpenMP
threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switches</A>, which
set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
is only allowed if LAMMPS was also built with the USER-OMP package.
</P>
<P>Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
switch</A> to set Nphi = # of Xeon Phi(TM)
coprocessors/node, if LAMMPS was built with coprocessor support.
coprocessors/node, if LAMMPS was built with coprocessor support. All
the available coprocessor threads on each Phi will be divided among
MPI tasks, unless the <I>tptask</I> option of the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
switch</A> is used to limit the coprocessor
threads per MPI task. See the <A HREF = "package.html">package intel</A> command
for details.
</P>
<PRE>CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
</PRE>
<PRE>CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes
</PRE>
<PRE>CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
# each MPI task uses 60 threads on 1 coprocessor
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
# each MPI task uses 120 threads on one of 2 coprocessors
<PRE>CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task
</PRE>
<P>Note that if the "-sf intel" switch is used, it also issues two
default commands: <A HREF = "package.html">package omp 0</A> and <A HREF = "package.html">package intel
1</A> command. These set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable, and the number
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
LAMMPS was not built with the USER-OMP package. The latter is ignored
is LAMMPS was not built with coprocessor support, except for its
optional precision setting.
<P>Note that if the "-sf intel" switch is used, it also invokes two
default commands: <A HREF = "package.html">package intel 1</A>, followed by <A HREF = "package.html">package
omp 0</A>. These both set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable. The first
command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
the precision mode to "mixed", as one of its option defaults). The
latter command is not invoked if LAMMPS was not built with the
USER-OMP package. The Nphi = 1 value for the first command is ignored
if LAMMPS was not built with coprocessor support.
</P>
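<P>Written as input-script commands, these two defaults are simply:
</P>
<PRE>package intel 1
package omp 0
</PRE>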
<P>Using the "-pk omp" switch explicitly allows for direct setting of the
number of OpenMP threads per MPI task, and additional options. Using
the "-pk intel" switch explicitly allows for direct setting of the
number of coprocessors/node, and additional options. The syntax for
these two switches is the same as the <A HREF = "package.html">package omp</A> and
<A HREF = "package.html">package intel</A> commands. See the <A HREF = "package.html">package</A>
command doc page for details, including the default values used for
all its options if these switches are not specified, and how to set
the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.
<P>Using the "-pk intel" or "-pk omp" switches explicitly allows for
direct setting of the number of OpenMP threads per MPI task, and
additional options for either of the USER-INTEL or USER-OMP packages.
In particular, the "-pk intel" switch sets the number of
coprocessors/node and can limit the number of coprocessor threads per
MPI task. The syntax for these two switches is the same as the
<A HREF = "package.html">package omp</A> and <A HREF = "package.html">package intel</A> commands.
See the <A HREF = "package.html">package</A> command doc page for details, including
the default values used for all its options if these switches are not
specified, and how to set the number of OpenMP threads via the
OMP_NUM_THREADS environment variable if desired.
</P>
<P><B>Or run with the USER-INTEL package by editing an input script:</B>
</P>
@ -195,19 +200,20 @@ the same.
</P>
<PRE>pair_style lj/cut/intel 2.5
</PRE>
<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line switches</A>
were used. It specifies how many OpenMP threads per MPI task to use,
as well as other options. Its doc page explains how to set the number
of threads via an environment variable if desired.
<P>You must also use the <A HREF = "package.html">package intel</A> command, unless the
"-sf intel" or "-pk intel" <A HREF = "Section_start.html#start_7">command-line
switches</A> were used. It specifies how many
coprocessors/node to use, as well as other OpenMP threading and
coprocessor options. Its doc page explains how to set the number of
OpenMP threads via an environment variable if desired.
</P>
<P>You must also use the <A HREF = "package.html">package intel</A> command to enable
coprocessor support within the USER-INTEL package (assuming LAMMPS was
built with coprocessor support) unless the "-sf intel" or "-pk intel"
<A HREF = "Section_start.html#start_7">command-line switches</A> were used. It
specifies how many coprocessors/node to use, as well as other
coprocessor options.
<P>If LAMMPS was also built with the USER-OMP package, you must also use
the <A HREF = "package.html">package omp</A> command to enable that package, unless
the "-sf intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line
switches</A> were used. It specifies how many
OpenMP threads per MPI task to use, as well as other options. Its doc
page explains how to set the number of OpenMP threads via an
environment variable if desired.
</P>
<P><B>Speed-ups to expect:</B>
</P>

View File

@ -25,10 +25,10 @@ The USER-INTEL package can be used in tandem with the USER-OMP
package. This is useful when offloading pair style computations to
coprocessors, so that other styles not supported by the USER-INTEL
package, e.g. bond, angle, dihedral, improper, and long-range
electrostatics, can be run simultaneously in threaded mode on CPU
electrostatics, can run simultaneously in threaded mode on the CPU
cores. Since fewer MPI tasks than CPU cores will typically be invoked
when running with coprocessors, this enables the extra cores to be
utilized for useful computation.
when running with coprocessors, this enables the extra CPU cores to be
used for useful computation.
If LAMMPS is built with both the USER-INTEL and USER-OMP packages
installed, this mode of operation is made easier to use, because the
@ -39,13 +39,13 @@ if available, after first testing if a style from the USER-INTEL
package is available.
Here is a quick overview of how to use the USER-INTEL package
for CPU acceleration:
for CPU-only acceleration:
specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
specify -fopenmp with LINKFLAGS in your Makefile.machine
specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
specify -openmp with LINKFLAGS in your Makefile.machine
include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
if using the USER-OMP package, specify how many threads per MPI task to use
use USER-INTEL styles in your input script :ul
specify how many OpenMP threads per MPI task to use
use USER-INTEL and (optionally) USER-OMP styles in your input script :ul
Using the USER-INTEL package to offload work to the Intel(R)
Xeon Phi(TM) coprocessor is the same except for these additional
@ -53,15 +53,14 @@ steps:
add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
add the flag -offload to LINKFLAGS in your Makefile.machine
specify how many threads per coprocessor to use :ul
specify how many coprocessor threads per MPI task to use :ul
The latter two steps in the first case and the last step in the
coprocessor case can be done using the "-pk omp" and "-sf intel" and
"-pk intel" "command-line switches"_Section_start.html#start_7
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package omp"_package.html or "suffix
intel"_suffix.html or "package intel"_package.html commands
respectively to your input script.
coprocessor case can be done using the "-pk intel" and "-sf intel"
"command-line switches"_Section_start.html#start_7 respectively. Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package intel"_package.html or "suffix intel"_suffix.html
commands respectively to your input script.
[Required hardware/software:]
@ -96,9 +95,9 @@ If you are compiling on the same architecture that will be used for
the runs, adding the flag {-xHost} to CCFLAGS will enable
vectorization with the Intel(R) compiler.
In order to build with support for an Intel(R) coprocessor, the flag
{-offload} should be added to the LINKFLAGS line and the flag
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
In order to build with support for an Intel(R) Xeon Phi(TM)
coprocessor, the flag {-offload} should be added to the LINKFLAGS line
and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
Note that the machine makefiles Makefile.intel and
Makefile.intel_offload are included in the src/MAKE directory with
@ -115,71 +114,77 @@ higher is recommended.
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
If LAMMPS was also built with the USER-OMP package, you need to choose
how many OpenMP threads per MPI task will be used by the USER-OMP
package. Note that the product of MPI tasks * OpenMP threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.
If you plan to compute (any portion of) pairwise interactions using
USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
you need to choose how many OpenMP threads per MPI task to use. Note
that the product of MPI tasks * OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer.
If LAMMPS was built with coprocessor support for the USER-INTEL
package, you need to specify the number of coprocessor/node and the
number of threads to use on the coprocessor per MPI task. Note that
package, you also need to specify the number of coprocessors/node and
the number of coprocessor threads per MPI task to use. Note that
coprocessor threads (which run on the coprocessor) are totally
independent from OpenMP threads (which run on the CPU). The product
of MPI tasks * coprocessor threads/task should not exceed the maximum
number of threads the coproprocessor is designed to run, otherwise
performance will suffer. This value is 240 for current generation
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
threads/core value can be set to a smaller value if desired by an
option on the "package intel"_package.html command, in which case the
maximum number of threads is also reduced.
independent from OpenMP threads (which run on the CPU). The default
values for the settings that affect coprocessor threads are typically
fine, as discussed below.
Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
which will automatically append "intel" to styles that support it. If
a style does not support it, a "omp" suffix is tried next. Use the
"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
the USER-OMP package. Use the "-pk intel Nphi" "command-line
a style does not support it, an "omp" suffix is tried next. OpenMP
threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
"-pk omp Nt" "command-line switches"_Section_start.html#start_7, which
set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
is only allowed if LAMMPS was also built with the USER-OMP package.
Use the "-pk intel Nphi" "command-line
switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
coprocessors/node, if LAMMPS was built with coprocessor support.
coprocessors/node, if LAMMPS was built with coprocessor support. All
the available coprocessor threads on each Phi will be divided among
MPI tasks, unless the {tptask} option of the "-pk intel" "command-line
switch"_Section_start.html#start_7 is used to limit the coprocessor
threads per MPI task. See the "package intel"_package.html command
for details.
CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre
CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes :pre
lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre
CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
# each MPI task uses 60 threads on 1 coprocessor
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
# each MPI task uses 120 threads on one of 2 coprocessors :pre
CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task :pre
Note that if the "-sf intel" switch is used, it also issues two
default commands: "package omp 0"_package.html and "package intel
1"_package.html command. These set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable, and the number
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
LAMMPS was not built with the USER-OMP package. The latter is ignored
is LAMMPS was not built with coprocessor support, except for its
optional precision setting.
Note that if the "-sf intel" switch is used, it also invokes two
default commands: "package intel 1"_package.html, followed by "package
omp 0"_package.html. These both set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable. The first
command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
the precision mode to "mixed", as one of its option defaults). The
latter command is not invoked if LAMMPS was not built with the
USER-OMP package. The Nphi = 1 value for the first command is ignored
if LAMMPS was not built with coprocessor support.
Using the "-pk omp" switch explicitly allows for direct setting of the
number of OpenMP threads per MPI task, and additional options. Using
the "-pk intel" switch explicitly allows for direct setting of the
number of coprocessors/node, and additional options. The syntax for
these two switches is the same as the "package omp"_package.html and
"package intel"_package.html commands. See the "package"_package.html
command doc page for details, including the default values used for
all its options if these switches are not specified, and how to set
the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.
Using the "-pk intel" or "-pk omp" switches explicitly allows for
direct setting of the number of OpenMP threads per MPI task, and
additional options for either of the USER-INTEL or USER-OMP packages.
In particular, the "-pk intel" switch sets the number of
coprocessors/node and can limit the number of coprocessor threads per
MPI task. The syntax for these two switches is the same as the
"package omp"_package.html and "package intel"_package.html commands.
See the "package"_package.html command doc page for details, including
the default values used for all its options if these switches are not
specified, and how to set the number of OpenMP threads via the
OMP_NUM_THREADS environment variable if desired.
[Or run with the USER-INTEL package by editing an input script:]
@ -192,19 +197,20 @@ Use the "suffix intel"_suffix.html command, or you can explicitly add an
pair_style lj/cut/intel 2.5 :pre
You must also use the "package omp"_package.html command to enable the
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
were used. It specifies how many OpenMP threads per MPI task to use,
as well as other options. Its doc page explains how to set the number
of threads via an environment variable if desired.
You must also use the "package intel"_package.html command, unless the
"-sf intel" or "-pk intel" "command-line
switches"_Section_start.html#start_7 were used. It specifies how many
coprocessors/node to use, as well as other OpenMP threading and
coprocessor options. Its doc page explains how to set the number of
OpenMP threads via an environment variable if desired.
You must also use the "package intel"_package.html command to enable
coprocessor support within the USER-INTEL package (assuming LAMMPS was
built with coprocessor support) unless the "-sf intel" or "-pk intel"
"command-line switches"_Section_start.html#start_7 were used. It
specifies how many coprocessors/node to use, as well as other
coprocessor options.
If LAMMPS was also built with the USER-OMP package, you must also use
the "package omp"_package.html command to enable that package, unless
the "-sf intel" or "-pk omp" "command-line
switches"_Section_start.html#start_7 were used. It specifies how many
OpenMP threads per MPI task to use, as well as other options. Its doc
page explains how to set the number of OpenMP threads via an
environment variable if desired.
[Speed-ups to expect:]

View File

@ -178,7 +178,7 @@ double precision.
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
<P>When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line

View File

@ -175,7 +175,7 @@ double precision.
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line

View File

@ -57,7 +57,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
<P>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
</P>
<P>You need to choose how many threads per MPI task will be used by the
USER-OMP package. Note that the product of MPI tasks * threads/task

View File

@ -54,7 +54,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
You need to choose how many threads per MPI task will be used by the
USER-OMP package. Note that the product of MPI tasks * threads/task

View File

@ -59,20 +59,22 @@
<I>intel</I> args = NPhi keyword value ...
Nphi = # of coprocessors per node
zero or more keyword/value pairs may be appended
keywords = <I>prec</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
<I>prec</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
keywords = <I>omp</I> or <I>mode</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
<I>omp</I> value = Nthreads
Nthreads = number of OpenMP threads to use on CPU (default = 0)
<I>mode</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
single = perform force calculations in single precision
mixed = perform force calculations in mixed precision
double = perform force calculations in double precision
<I>balance</I> value = split
split = fraction of work to offload to coprocessor, -1 for dynamic
<I>ghost</I> value = <I>yes</I> or <I>no</I>
yes = include ghost atoms for offload
no = do not include ghost atoms for offload
<I>tpc</I> value = Ntpc
Ntpc = number of threads to use on each physical core of coprocessor
<I>tptask</I> value = Ntptask
Ntptask = max number of threads to use on coprocessor for each MPI task
<I>balance</I> value = split
split = fraction of work to offload to coprocessor, -1 for dynamic
<I>ghost</I> value = <I>yes</I> or <I>no</I>
yes = include ghost atoms for offload
no = do not include ghost atoms for offload
<I>tpc</I> value = Ntpc
Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
<I>tptask</I> value = Ntptask
Ntptask = max number of coprocessor threads per MPI task (default = 240)
<I>kokkos</I> args = keyword value ...
zero or more keyword/value pairs may be appended
keywords = <I>neigh</I> or <I>newton</I> or <I>binsize</I> or <I>comm</I> or <I>comm/exchange</I> or <I>comm/forward</I>
@ -114,7 +116,8 @@ package cuda 1 test 3948
package kokkos neigh half/thread comm device
package omp 0 neigh no
package omp 4
package intel * mixed balance -1
package intel 1
package intel 2 omp 4 mode mixed balance 0.5
</PRE>
<P><B>Description:</B>
</P>
@ -324,18 +327,56 @@ lib/gpu/Makefile that is used.
<HR>
<P>The <I>intel</I> style invokes settings associated with the use of the
USER-INTEL package. All of its settings, except the <I>prec</I> keyword,
are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
when building with the USER-INTEL package. All of its settings,
including the <I>prec</I> keyword are applicable if LAMMPS was built with
coprocessor support.
USER-INTEL package. All of its settings, except the <I>omp</I> and <I>mode</I>
keywords, are ignored if LAMMPS was not built with Xeon Phi
coprocessor support. All of its settings, including the <I>omp</I> and
<I>mode</I> keywords, are applicable if LAMMPS was built with coprocessor
support.
</P>
<P>The <I>Nphi</I> argument sets the number of coprocessors per node.
This can be set to any value, including 0, if LAMMPS was not
built with coprocessor support.
</P>
<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>
<P>The <I>prec</I> keyword argument determines the precision mode to use for
<P>The <I>omp</I> keyword determines the number of OpenMP threads allocated
for each MPI task when any portion of the interactions computed by a
USER-INTEL pair style is run on the CPU. This can be the case even
if LAMMPS was built with coprocessor support; see the <I>balance</I>
keyword discussion below. If you are running with fewer MPI tasks/node
than there are CPUs, it can be advantageous to use OpenMP threading on
the CPUs.
</P>
<P>IMPORTANT NOTE: The <I>omp</I> keyword has nothing to do with coprocessor
threads on the Xeon Phi; see the <I>tpc</I> and <I>tptask</I> keywords below for
a discussion of coprocessor threads.
</P>
<P>The <I>Nthread</I> value for the <I>omp</I> keyword sets the number of OpenMP
threads allocated for each MPI task. Setting <I>Nthread</I> = 0 (the
default) instructs LAMMPS to use whatever value is the default for the
given OpenMP environment. This is usually determined via the
<I>OMP_NUM_THREADS</I> environment variable or the compiler runtime, which
is usually a value of 1.
</P>
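<P>For example, either of these gives each MPI task 4 OpenMP threads on
the CPU (a sketch): the package command
</P>
<PRE>package intel 1 omp 4
</PRE>
<P>in the input script, or (assuming a bash-like shell, and relying on
the <I>Nthread</I> = 0 default)
</P>
<PRE>export OMP_NUM_THREADS=4
</PRE>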
<P>For more details, including examples of how to set the OMP_NUM_THREADS
environment variable, see the discussion of the <I>Nthreads</I> setting on
this doc page for the "package omp" command. Nthreads is a required
argument for the USER-OMP package. Its meaning is exactly the same
for the USER-INTEL package.
</P>
<P>IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
USER-OMP packages, be aware that both packages allow setting of the
<I>Nthreads</I> value via their package commands, but there is only a
single global <I>Nthreads</I> value used by OpenMP. Thus if both package
commands are invoked, you should ensure the two values are consistent.
If they are not, the last one invoked will take precedence, for both
packages. Also note that if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
switch</A> is used, it invokes a "package
intel" command, followed by a "package omp" command, both with a
setting of <I>Nthreads</I> = 0.
</P>
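<P>For example, if both packages are in use, keep the two thread counts
the same (a sketch):
</P>
<PRE>package intel 1 omp 4
package omp 4
</PRE>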
<P>The <I>mode</I> keyword determines the precision mode to use for
computing pair style forces, either on the CPU or on the coprocessor,
when using a USER-INTEL supported <A HREF = "pair_style.html">pair style</A>. It
can take a value of <I>single</I>, <I>mixed</I> which is the default, or
@ -347,12 +388,12 @@ quantities. <I>Double</I> means double precision is used for the entire
force calculation.
</P>
<P>The <I>balance</I> keyword sets the fraction of <A HREF = "pair_style.html">pair
style</A> work offloaded to the coprocessor style for
split values between 0.0 and 1.0 inclusive. While this fraction of
work is running on the coprocessor, other calculations will run on the
host, including neighbor and pair calculations that are not offloaded,
angle, bond, dihedral, kspace, and some MPI communications. If
<I>split</I> is set to -1, the fraction of work is dynamically adjusted
style</A> work offloaded to the coprocessor for split
values between 0.0 and 1.0 inclusive. While this fraction of work is
running on the coprocessor, other calculations will run on the host,
including neighbor and pair calculations that are not offloaded, as
well as angle, bond, dihedral, kspace, and some MPI communications.
If <I>split</I> is set to -1, the fraction of work is dynamically adjusted
automatically throughout the run. This typically gives performance
within 5 to 10 percent of the optimal fixed fraction.
</P>
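<P>For example (a sketch), to offload a fixed 75% of the pair work to
the coprocessor, or to let LAMMPS adjust the split dynamically:
</P>
<PRE>package intel 1 balance 0.75
package intel 1 balance -1
</PRE>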
@ -362,21 +403,28 @@ and force calculations. When the value = "no", ghost atoms are not
offloaded. This option can reduce the amount of data transfer with
the coprocessor and can also overlap MPI communication of forces with
computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
setting is "on". When the value = "ues", ghost atoms are offloaded.
setting is "on". When the value = "yes", ghost atoms are offloaded.
In some cases this can provide better performance, especially if the
<I>balance</I> fraction is high.
</P>
<P>The <I>tpc</I> keyword sets the maximum # of threads <I>Ntpc</I> that will
run on each physical core of the coprocessor. The default value is
set to 4, which is the number of hardware threads per core supported
by the current generation Xeon Phi chips.
<P>The <I>tpc</I> keyword sets the max # of coprocessor threads <I>Ntpc</I> that
will run on each core of the coprocessor. The default value = 4,
which is the number of hardware threads per core supported by the
current generation Xeon Phi chips.
</P>
<P>The <I>tptask</I> keyword sets the maximum # of threads (Ntptask</I> that will
be used on the coprocessor for each MPI task. This, along with the
<I>tpc</I> keyword setting, are the only methods for changing the number of
threads used on the coprocessor. The default value is set to 240 =
60*4, which is the maximum # of threads supported by an entire current
generation Xeon Phi chip.
<P>The <I>tptask</I> keyword sets the max # of coprocessor threads <I>Ntptask</I>
assigned to each MPI task.  The default value = 240, which is the
total # of threads an entire current generation Xeon Phi chip can run
(240 = 60 cores * 4 threads/core).  This means each MPI task assigned
to the Phi will have enough threads for the chip to run the max allowed,
even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
the Phi, each will run with 30 threads. If you wish to limit the
number of threads per MPI task, set <I>tptask</I> to a smaller value.
E.g. for <I>tptask</I> = 16, if 8 MPI tasks are assigned, each will run
with 16 threads, for a total of 128.
</P>
<P>Note that the default settings for <I>tpc</I> and <I>tptask</I> are fine for
most problems, regardless of how many MPI tasks you assign to a Phi.
</P>
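<P>For example (a sketch), to cap each MPI task at 60 coprocessor
threads, or to also restrict each coprocessor core to 2 hardware
threads:
</P>
<PRE>package intel 1 tptask 60
package intel 1 tpc 2 tptask 120
</PRE>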
<HR>
@ -581,15 +629,16 @@ must invoke the package gpu command in your input script or via the
"-pk gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>.
</P>
<P>For the USER-INTEL package, the default is Nphi = 1 and the option
defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
that all of these settings, except "prec", are ignored if LAMMPS was
not built with Xeon Phi coprocessor support. The default ghost option
is determined by the pair style being used. This value is output to
the screen in the offload report at the end of each run. These
settings are made automatically if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
switch</A> is used. If it is not used, you
must invoke the package intel command in your input script or or via
the "-pk intel" <A HREF = "Section_start.html#start_7">command-line switch</A>.
defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
240. The default ghost option is determined by the pair style being
used. This value is output to the screen in the offload report at the
end of each run. Note that all of these settings, except "omp" and
"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
support. These settings are made automatically if the "-sf intel"
<A HREF = "Section_start.html#start_7">command-line switch</A> is used. If it is
not used, you must invoke the package intel command in your input
script or via the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
switch</A>.
</P>
<P>For the KOKKOS package, the option defaults neigh = full, newton =
off, binsize = 0.0, and comm = host. These settings are made

View File

@ -54,20 +54,22 @@ args = arguments specific to the style :l
{intel} args = NPhi keyword value ...
Nphi = # of coprocessors per node
zero or more keyword/value pairs may be appended
keywords = {prec} or {balance} or {ghost} or {tpc} or {tptask}
{prec} value = {single} or {mixed} or {double}
keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask}
{omp} value = Nthreads
Nthreads = number of OpenMP threads to use on CPU (default = 0)
{mode} value = {single} or {mixed} or {double}
single = perform force calculations in single precision
mixed = perform force calculations in mixed precision
double = perform force calculations in double precision
{balance} value = split
split = fraction of work to offload to coprocessor, -1 for dynamic
{ghost} value = {yes} or {no}
yes = include ghost atoms for offload
no = do not include ghost atoms for offload
{tpc} value = Ntpc
Ntpc = number of threads to use on each physical core of coprocessor
{tptask} value = Ntptask
Ntptask = max number of threads to use on coprocessor for each MPI task
{balance} value = split
split = fraction of work to offload to coprocessor, -1 for dynamic
{ghost} value = {yes} or {no}
yes = include ghost atoms for offload
no = do not include ghost atoms for offload
{tpc} value = Ntpc
Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
{tptask} value = Ntptask
Ntptask = max number of coprocessor threads per MPI task (default = 240)
{kokkos} args = keyword value ...
zero or more keyword/value pairs may be appended
keywords = {neigh} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward}
@ -108,7 +110,8 @@ package cuda 1 test 3948
package kokkos neigh half/thread comm device
package omp 0 neigh no
package omp 4
package intel * mixed balance -1 :pre
package intel 1
package intel 2 omp 4 mode mixed balance 0.5 :pre
[Description:]
@ -263,11 +266,6 @@ cutoff of 20*sigma in LJ "units"_units.html and a neighbor skin
distance of sigma, a {binsize} = 5.25*sigma can be more efficient than
the default.
The {split} keyword can be used for load balancing force calculations
between CPU and GPU cores in GPU-enabled pair styles. If 0 < {split} <
1.0, a fixed fraction of particles is offloaded to the GPU while force
@ -323,18 +321,56 @@ lib/gpu/Makefile that is used.
:line
The {intel} style invokes settings associated with the use of the
USER-INTEL package. All of its settings, except the {prec} keyword,
are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
when building with the USER-INTEL package. All of its settings,
including the {prec} keyword are applicable if LAMMPS was built with
coprocessor support.
USER-INTEL package. All of its settings, except the {omp} and {mode}
keywords, are ignored if LAMMPS was not built with Xeon Phi
coprocessor support. All of its settings, including the {omp} and
{mode} keywords, are applicable if LAMMPS was built with coprocessor
support.
The {Nphi} argument sets the number of coprocessors per node.
This can be set to any value, including 0, if LAMMPS was not
built with coprocessor support.
Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
The {prec} keyword argument determines the precision mode to use for
The {omp} keyword determines the number of OpenMP threads allocated
for each MPI task when any portion of the interactions computed by a
USER-INTEL pair style is run on the CPU. This can be the case even
if LAMMPS was built with coprocessor support; see the {balance}
keyword discussion below. If you are running with fewer MPI tasks/node
than there are CPUs, it can be advantageous to use OpenMP threading on
the CPUs.
IMPORTANT NOTE: The {omp} keyword has nothing to do with coprocessor
threads on the Xeon Phi; see the {tpc} and {tptask} keywords below for
a discussion of coprocessor threads.
The {Nthread} value for the {omp} keyword sets the number of OpenMP
threads allocated for each MPI task. Setting {Nthread} = 0 (the
default) instructs LAMMPS to use whatever value is the default for the
given OpenMP environment. This is usually determined via the
{OMP_NUM_THREADS} environment variable or the compiler runtime, which
is usually a value of 1.
For more details, including examples of how to set the OMP_NUM_THREADS
environment variable, see the discussion of the {Nthreads} setting on
this doc page for the "package omp" command. Nthreads is a required
argument for the USER-OMP package. Its meaning is exactly the same
for the USER-INTEL package.
IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
USER-OMP packages, be aware that both packages allow setting of the
{Nthreads} value via their package commands, but there is only a
single global {Nthreads} value used by OpenMP. Thus if both package
commands are invoked, you should ensure the two values are consistent.
If they are not, the last one invoked will take precedence, for both
packages. Also note that if the "-sf intel" "command-line
switch"_"_Section_start.html#start_7 is used, it invokes a "package
intel" command, followed by a "package omp" command, both with a
setting of {Nthreads} = 0.
The {mode} keyword determines the precision mode to use for
computing pair style forces, either on the CPU or on the coprocessor,
when using a USER-INTEL supported "pair style"_pair_style.html. It
can take a value of {single}, {mixed} which is the default, or
@ -346,12 +382,12 @@ quantities. {Double} means double precision is used for the entire
force calculation.
The {balance} keyword sets the fraction of "pair
style"_pair_style.html work offloaded to the coprocessor style for
split values between 0.0 and 1.0 inclusive. While this fraction of
work is running on the coprocessor, other calculations will run on the
host, including neighbor and pair calculations that are not offloaded,
angle, bond, dihedral, kspace, and some MPI communications. If
{split} is set to -1, the fraction of work is dynamically adjusted
style"_pair_style.html work offloaded to the coprocessor for split
values between 0.0 and 1.0 inclusive. While this fraction of work is
running on the coprocessor, other calculations will run on the host,
including neighbor and pair calculations that are not offloaded, as
well as angle, bond, dihedral, kspace, and some MPI communications.
If {split} is set to -1, the fraction of work is dynamically adjusted
automatically throughout the run. This typically gives performance
within 5 to 10 percent of the optimal fixed fraction.
@ -361,21 +397,28 @@ and force calculations. When the value = "no", ghost atoms are not
offloaded. This option can reduce the amount of data transfer with
the coprocessor and can also overlap MPI communication of forces with
computation on the coprocessor when the "newton pair"_newton.html
setting is "on". When the value = "ues", ghost atoms are offloaded.
setting is "on". When the value = "yes", ghost atoms are offloaded.
In some cases this can provide better performance, especially if the
{balance} fraction is high.
The {tpc} keyword sets the maximum # of threads {Ntpc} that will
run on each physical core of the coprocessor. The default value is
set to 4, which is the number of hardware threads per core supported
by the current generation Xeon Phi chips.
The {tpc} keyword sets the max # of coprocessor threads {Ntpc} that
will run on each core of the coprocessor. The default value = 4,
which is the number of hardware threads per core supported by the
current generation Xeon Phi chips.
The {tptask} keyword sets the maximum # of threads (Ntptask} that will
be used on the coprocessor for each MPI task. This, along with the
{tpc} keyword setting, are the only methods for changing the number of
threads used on the coprocessor. The default value is set to 240 =
60*4, which is the maximum # of threads supported by an entire current
generation Xeon Phi chip.
The {tptask} keyword sets the max # of coprocessor threads {Ntptask}
assigned to each MPI task.  The default value = 240, which is the
total # of threads an entire current generation Xeon Phi chip can run
(240 = 60 cores * 4 threads/core).  This means each MPI task assigned
to the Phi will have enough threads for the chip to run the max allowed,
even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
the Phi, each will run with 30 threads. If you wish to limit the
number of threads per MPI task, set {tptask} to a smaller value.
E.g. for {tptask} = 16, if 8 MPI tasks are assigned, each will run
with 16 threads, for a total of 128.
Note that the default settings for {tpc} and {tptask} are fine for
most problems, regardless of how many MPI tasks you assign to a Phi.
:line
@ -580,15 +623,16 @@ must invoke the package gpu command in your input script or via the
"-pk gpu" "command-line switch"_Section_start.html#start_7.
For the USER-INTEL package, the default is Nphi = 1 and the option
defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
that all of these settings, except "prec", are ignored if LAMMPS was
not built with Xeon Phi coprocessor support. The default ghost option
is determined by the pair style being used. This value is output to
the screen in the offload report at the end of each run. These
settings are made automatically if the "-sf intel" "command-line
switch"_Section_start.html#start_7 is used. If it is not used, you
must invoke the package intel command in your input script or or via
the "-pk intel" "command-line switch"_Section_start.html#start_7.
defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
240. The default ghost option is determined by the pair style being
used. This value is output to the screen in the offload report at the
end of each run. Note that all of these settings, except "omp" and
"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
support. These settings are made automatically if the "-sf intel"
"command-line switch"_Section_start.html#start_7 is used. If it is
not used, you must invoke the package intel command in your input
script or via the "-pk intel" "command-line
switch"_Section_start.html#start_7.
For the KOKKOS package, the option defaults neigh = full, newton =
off, binsize = 0.0, and comm = host. These settings are made