.. index:: package

package command
===============

Syntax
""""""

.. parsed-literal::

   package style args

* style = *cuda* or *gpu* or *intel* or *kokkos* or *omp*
* args = arguments specific to the style

.. parsed-literal::

     *cuda* args = Ngpu keyword value ...
       Ngpu = # of GPUs per node
       zero or more keyword/value pairs may be appended
       keywords = *newton* or *gpuID* or *timing* or *test* or *thread*
         *newton* = *off* or *on*
           off = set Newton pairwise and bonded flags off (default)
           on = set Newton pairwise and bonded flags on
         *gpuID* values = gpu1 .. gpuN
           gpu1 .. gpuN = IDs of the Ngpu GPUs to use
         *timing* values = none
         *test* values = id
           id = atom-ID of a test particle
         *thread* = auto or tpa or bpa
           auto = test whether tpa or bpa is faster
           tpa = one thread per atom
           bpa = one block per atom
     *gpu* args = Ngpu keyword value ...
       Ngpu = # of GPUs per node
       zero or more keyword/value pairs may be appended
       keywords = *neigh* or *newton* or *binsize* or *split* or *gpuID* or *tpa* or *device* or *blocksize*
         *neigh* value = *yes* or *no*
           yes = neighbor list build on GPU (default)
           no = neighbor list build on CPU
         *newton* = *off* or *on*
           off = set Newton pairwise flag off (default and required)
           on = set Newton pairwise flag on (currently not allowed)
         *binsize* value = size
           size = bin size for neighbor list construction (distance units)
         *split* = fraction
           fraction = fraction of atoms assigned to GPU (default = 1.0)
         *gpuID* values = first last
           first = ID of first GPU to be used on each node
           last = ID of last GPU to be used on each node
         *tpa* value = Nthreads
           Nthreads = # of GPU threads used per atom
         *device* value = device_type
           device_type = *kepler* or *fermi* or *cypress* or *generic*
         *blocksize* value = size
           size = thread block size for pair force computation
     *intel* args = Nphi keyword value ...
       Nphi = # of coprocessors per node
       zero or more keyword/value pairs may be appended
       keywords = *omp* or *mode* or *balance* or *ghost* or *tpc* or *tptask* or *no_affinity*
         *omp* value = Nthreads
           Nthreads = number of OpenMP threads to use on CPU (default = 0)
         *mode* value = *single* or *mixed* or *double*
           single = perform force calculations in single precision
           mixed = perform force calculations in mixed precision
           double = perform force calculations in double precision
         *balance* value = split
           split = fraction of work to offload to coprocessor, -1 for dynamic
         *ghost* value = *yes* or *no*
           yes = include ghost atoms for offload
           no = do not include ghost atoms for offload
         *tpc* value = Ntpc
           Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
         *tptask* value = Ntptask
           Ntptask = max number of coprocessor threads per MPI task (default = 240)
         *no_affinity* values = none
     *kokkos* args = keyword value ...
       zero or more keyword/value pairs may be appended
       keywords = *neigh* or *newton* or *binsize* or *comm* or *comm/exchange* or *comm/forward*
         *neigh* value = *full* or *half/thread* or *half* or *n2* or *full/cluster*
           full = full neighbor list
           half/thread = half neighbor list built in thread-safe manner
           half = half neighbor list, not thread-safe, only use when 1 thread/MPI task
           n2 = non-binning neighbor list build, O(N^2) algorithm
           full/cluster = full neighbor list with clustered groups of atoms
         *newton* = *off* or *on*
           off = set Newton pairwise and bonded flags off (default)
           on = set Newton pairwise and bonded flags on
         *binsize* value = size
           size = bin size for neighbor list construction (distance units)
         *comm* value = *no* or *host* or *device*
           use value for both comm/exchange and comm/forward
         *comm/exchange* value = *no* or *host* or *device*
         *comm/forward* value = *no* or *host* or *device*
           no = perform communication pack/unpack in non-KOKKOS mode
           host = perform pack/unpack on host (e.g. with OpenMP threading)
           device = perform pack/unpack on device (e.g. on GPU)
     *omp* args = Nthreads keyword value ...
       Nthreads = # of OpenMP threads to associate with each MPI process
       zero or more keyword/value pairs may be appended
       keywords = *neigh*
         *neigh* value = *yes* or *no*
           yes = threaded neighbor list build (default)
           no = non-threaded neighbor list build

Examples
""""""""

.. parsed-literal::

   package gpu 1
   package gpu 1 split 0.75
   package gpu 2 split -1.0
   package cuda 2 gpuID 0 2
   package cuda 1 test 3948
   package kokkos neigh half/thread comm device
   package omp 0 neigh no
   package omp 4
   package intel 1
   package intel 2 omp 4 mode mixed balance 0.5

Description
"""""""""""

This command invokes package-specific settings for the various
accelerator packages available in LAMMPS.  Currently the following
packages use settings from this command: USER-CUDA, GPU, USER-INTEL,
KOKKOS, and USER-OMP.

If this command is specified in an input script, it must be near the
top of the script, before the simulation box has been defined.  This
is because it specifies settings that the accelerator packages use in
their initialization, before a simulation is defined.

This command can also be specified from the command line when
launching LAMMPS, using the "-pk" :ref:`command-line switch <start_7>`.
The syntax is exactly the same as when used in an input script.

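For example, here is a minimal sketch (the executable name, GPU count,
and input file are placeholders) of setting GPU package options via
the "-pk" switch instead of in the input script:

.. parsed-literal::

   lmp_machine -sf gpu -pk gpu 2 -in in.script
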
Note that all of the accelerator packages require the package command
to be specified (except the OPT package), if the package is to be used
in a simulation (LAMMPS can be built with an accelerator package
without using it in a particular simulation).  However, in all cases,
a default version of the command is typically invoked by other
accelerator settings.

The USER-CUDA and KOKKOS packages require a "-c on" or "-k on"
:ref:`command-line switch <start_7>` respectively, which invokes a
"package cuda" or "package kokkos" command with default settings.

For the GPU, USER-INTEL, and USER-OMP packages, if a "-sf gpu" or "-sf
intel" or "-sf omp" :ref:`command-line switch <start_7>` is used to
auto-append accelerator suffixes to various styles in the input
script, then those switches also invoke a "package gpu", "package
intel", or "package omp" command with default settings.

.. note::

   A package command for a particular style can be invoked multiple
   times when a simulation is set up, e.g. by the "-c on", "-k on",
   "-sf", and "-pk" :ref:`command-line switches <start_7>`, and by
   using this command in an input script.  Each time it is used all of
   the style options are set, either to default values or to specified
   settings.  I.e. settings from previous invocations do not persist
   across multiple invocations.

See the :doc:`Section Accelerate <Section_accelerate>` section of the
manual for more details about using the various accelerator packages
for speeding up LAMMPS simulations.

----------

The *cuda* style invokes settings associated with the use of the
USER-CUDA package.

The *Ngpu* argument sets the number of GPUs per node.  There must be
exactly one MPI task per GPU, as set by the mpirun or mpiexec command.

Optional keyword/value pairs can also be specified.  Each has a
default value as listed below.

The *newton* keyword sets the Newton flags for pairwise and bonded
interactions to *off* or *on*, the same as the :doc:`newton <newton>`
command allows.  The default is *off* because this will almost always
give better performance for the USER-CUDA package.  This means more
computation is done, but less communication.

The *gpuID* keyword allows selection of which GPUs on each node will
be used for a simulation.  GPU IDs range from 0 to N-1 where N is the
physical number of GPUs/node.  An ID is specified for each of the
Ngpu GPUs being used.  For example, if you have three GPUs on a
machine, one of which is used for the X-Server (the GPU with the ID 1)
while the others (with IDs 0 and 2) are used for computations, you
would specify:

.. parsed-literal::

   package cuda 2 gpuID 0 2

The purpose of the *gpuID* keyword is to allow two (or more)
simulations to be run on one workstation.  In that case one could set
the first simulation to use GPU 0 and the second to use GPU 1.  This
is not necessary, however, if the GPUs are in what is called *compute
exclusive* mode.  Using that setting, every process will get its own
GPU automatically.  This *compute exclusive* mode can be set as root
using the *nvidia-smi* tool which is part of the CUDA installation.

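As a sketch (run as root; the numeric mode codes can vary across
driver generations), compute exclusive mode could be enabled for GPU 0
with something like:

.. parsed-literal::

   nvidia-smi -i 0 -c 3    # -i selects the GPU ID, -c 3 = EXCLUSIVE_PROCESS
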
Also note that if the *gpuID* keyword is not used, the USER-CUDA
package sorts existing GPUs on each node according to their number of
multiprocessors.  This way, compute GPUs will be prioritized over
X-Server GPUs.

If the *timing* keyword is specified, detailed timing information for
various subroutines will be output.

If the *test* keyword is specified, information for the specified atom
with atom-ID will be output at several points during each timestep.
This is mainly useful for debugging purposes.  Note that the
simulation will slow down dramatically if this option is used.

The *thread* keyword can be used to specify how GPU threads are
assigned work during pair style force evaluation.  If the value =
*tpa*, one thread per atom is used.  If the value = *bpa*, one block
per atom is used.  If the value = *auto*, a short test is performed at
the beginning of each run to determine whether *tpa* or *bpa* mode is
faster.  The result of this test is output.  Since *auto* is the
default value, it is usually not necessary to use this keyword.

----------

The *gpu* style invokes settings associated with the use of the GPU
package.

The *Ngpu* argument sets the number of GPUs per node.  There must be
at least as many MPI tasks per node as GPUs, as set by the mpirun or
mpiexec command.  If there are more MPI tasks (per node) than GPUs,
multiple MPI tasks will share each GPU.

Optional keyword/value pairs can also be specified.  Each has a
default value as listed below.

The *neigh* keyword specifies where neighbor lists for pair style
computation will be built.  If *neigh* is *yes*, which is the default,
neighbor list building is performed on the GPU.  If *neigh* is *no*,
neighbor list building is performed on the CPU.  GPU neighbor list
building currently cannot be used with a triclinic box.  GPU neighbor
list calculation currently cannot be used with
:doc:`hybrid <pair_hybrid>` pair styles.  GPU neighbor lists are not
compatible with commands that are not GPU-enabled.  When a non-GPU
enabled command requires a neighbor list, it will also be built on the
CPU.  In these cases, it will typically be more efficient to only use
CPU neighbor list builds.

The *newton* keyword sets the Newton flags for pairwise (not bonded)
interactions to *off* or *on*, the same as the :doc:`newton <newton>`
command allows.  Currently, only an *off* value is allowed, since all
the GPU package pair styles require this setting.  This means more
computation is done, but less communication.  In the future a value of
*on* may be allowed, so the *newton* keyword is included as an option
for compatibility with the package command for other accelerator
styles.  Note that the newton setting for bonded interactions is not
affected by this keyword.

The *binsize* keyword sets the size of bins used to bin atoms in
neighbor list builds performed on the GPU, if *neigh* = *yes* is set.
If *binsize* is set to 0.0 (the default), then bins = the size of the
pairwise cutoff + neighbor skin distance.  This is 2x larger than the
LAMMPS default used for neighbor list building on the CPU.  This will
be close to optimal for the GPU, so you do not normally need to use
this keyword.  Note that if you use a longer-than-usual pairwise
cutoff, e.g. to allow for a smaller fraction of KSpace work with a
:doc:`long-range Coulombic solver <kspace_style>` because the GPU is
faster at performing pairwise interactions, then it may be optimal to
make the *binsize* smaller than the default.  For example, with a
cutoff of 20*sigma in LJ :doc:`units <units>` and a neighbor skin
distance of sigma, a *binsize* = 5.25*sigma can be more efficient than
the default.

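As an illustrative sketch using the numbers from the example above (in
LJ units, where sigma = 1.0), such a custom bin size could be
requested with:

.. parsed-literal::

   package gpu 1 binsize 5.25
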
The *split* keyword can be used for load balancing force calculations
between CPU and GPU cores in GPU-enabled pair styles.  If 0 < *split* <
1.0, a fixed fraction of particles is offloaded to the GPU while force
calculation for the other particles occurs simultaneously on the CPU.
If *split* < 0.0, the optimal fraction (based on CPU and GPU timings)
is calculated every 25 timesteps, i.e. dynamic load-balancing across
the CPU and GPU is performed.  If *split* = 1.0, all force
calculations for GPU accelerated pair styles are performed on the GPU.
In this case, other :doc:`hybrid <pair_hybrid>` pair interactions,
:doc:`bond <bond_style>`, :doc:`angle <angle_style>`,
:doc:`dihedral <dihedral_style>`, :doc:`improper <improper_style>`, and
:doc:`long-range <kspace_style>` calculations can be performed on the
CPU while the GPU is performing force calculations for the GPU-enabled
pair style.  If all CPU force computations complete before the GPU
completes, LAMMPS will block until the GPU has finished before
continuing the timestep.

As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify

.. parsed-literal::

   mpirun -np 32 -sf gpu -in in.script    # launch command
   package gpu 2 split -1                 # input script command

In this case, all CPU cores and GPU devices on the nodes would be
utilized.  Each GPU device would be shared by 4 CPU cores.  The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.

The *gpuID* keyword allows selection of which GPUs on each node will
be used for a simulation.  The *first* and *last* values specify the
GPU IDs to use (from 0 to Ngpu-1).  By default, first = 0 and last =
Ngpu-1, so that all GPUs are used, assuming Ngpu is set to the number
of physical GPUs.  If you only wish to use a subset, set Ngpu to a
smaller number and first/last to a sub-range of the available GPUs.

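For instance (a sketch, assuming a node with at least two physical
GPUs), to use only the second GPU on each node you could set Ngpu = 1
and select the sub-range consisting of GPU 1 alone:

.. parsed-literal::

   package gpu 1 gpuID 1 1
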
The *tpa* keyword sets the number of GPU threads per atom used to
perform force calculations.  With a default value of 1, the number of
threads will be chosen based on the pair style, however, the value can
be set explicitly with this keyword to fine-tune performance.  For
large cutoffs or with a small number of particles per GPU, increasing
the value can improve performance.  The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.

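For example (a sketch; whether this actually helps is hardware- and
input-dependent), a long-cutoff run might try 8 threads per atom:

.. parsed-literal::

   package gpu 1 tpa 8
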
The *device* keyword can be used to tune parameters optimized for a
specific accelerator, when using OpenCL.  For CUDA, the *device*
keyword is ignored.  Currently, the device type is limited to NVIDIA
Kepler, NVIDIA Fermi, AMD Cypress, or a generic device.  More devices
may be added later.  The default device type can be specified when
building LAMMPS with the GPU library, via settings in the
lib/gpu/Makefile that is used.

The *blocksize* keyword allows you to tweak the number of threads used
per thread block.  This number should be a multiple of 32 (for GPUs)
and its maximum depends on the specific GPU hardware.  Typical choices
are 64, 128, or 256.  A larger blocksize increases occupancy of
individual GPU cores, but reduces the total number of thread blocks,
and thus may lead to load imbalance.

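As a sketch (the best value must be found by experiment on the target
GPU), one of the typical block sizes could be requested with:

.. parsed-literal::

   package gpu 1 blocksize 128
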
----------

The *intel* style invokes settings associated with the use of the
USER-INTEL package.  All of its settings, except the *omp* and *mode*
keywords, are ignored if LAMMPS was not built with Xeon Phi
coprocessor support.  All of its settings, including the *omp* and
*mode* keywords, are applicable if LAMMPS was built with coprocessor
support.

The *Nphi* argument sets the number of coprocessors per node.
This can be set to any value, including 0, if LAMMPS was not
built with coprocessor support.

Optional keyword/value pairs can also be specified.  Each has a
default value as listed below.

The *omp* keyword determines the number of OpenMP threads allocated
for each MPI task when any portion of the interactions computed by a
USER-INTEL pair style are run on the CPU.  This can be the case even
if LAMMPS was built with coprocessor support; see the *balance*
keyword discussion below.  If you are running with fewer MPI
tasks/node than there are CPUs, it can be advantageous to use OpenMP
threading on the CPUs.

.. note::

   The *omp* keyword has nothing to do with coprocessor threads on
   the Xeon Phi; see the *tpc* and *tptask* keywords below for a
   discussion of coprocessor threads.

The *Nthread* value for the *omp* keyword sets the number of OpenMP
threads allocated for each MPI task.  Setting *Nthread* = 0 (the
default) instructs LAMMPS to use whatever value is the default for the
given OpenMP environment.  This is usually determined via the
*OMP_NUM_THREADS* environment variable or the compiler runtime, which
is usually a value of 1.

For more details, including examples of how to set the OMP_NUM_THREADS
environment variable, see the discussion of the *Nthreads* setting on
this doc page for the "package omp" command.  Nthreads is a required
argument for the USER-OMP package.  Its meaning is exactly the same
for the USER-INTEL package.

.. note::

   If you build LAMMPS with both the USER-INTEL and USER-OMP
   packages, be aware that both packages allow setting of the
   *Nthreads* value via their package commands, but there is only a
   single global *Nthreads* value used by OpenMP.  Thus if both
   package commands are invoked, you should ensure the two values are
   consistent.  If they are not, the last one invoked will take
   precedence, for both packages.  Also note that if the "-sf hybrid
   intel omp" :ref:`command-line switch <start_7>` is used, it invokes
   a "package intel" command, followed by a "package omp" command,
   both with a setting of *Nthreads* = 0.

The *mode* keyword determines the precision mode to use for
computing pair style forces, either on the CPU or on the coprocessor,
when using a USER-INTEL supported :doc:`pair style <pair_style>`.  It
can take a value of *single*, *mixed* which is the default, or
*double*.  *Single* means single precision is used for the entire
force calculation.  *Mixed* means forces between a pair of atoms are
computed in single precision, but accumulated and stored in double
precision, including storage of forces, torques, energies, and virial
quantities.  *Double* means double precision is used for the entire
force calculation.

The *balance* keyword sets the fraction of :doc:`pair style <pair_style>`
work offloaded to the coprocessor for split values between 0.0 and 1.0
inclusive.  While this fraction of work is running on the coprocessor,
other calculations will run on the host, including neighbor and pair
calculations that are not offloaded, as well as angle, bond, dihedral,
kspace, and some MPI communications.  If *split* is set to -1, the
fraction of work is dynamically adjusted automatically throughout the
run.  This typically gives performance within 5 to 10 percent of the
optimal fixed fraction.

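For example (a sketch combining keywords already shown above), dynamic
offload balancing with 4 CPU threads per MPI task could be requested
as:

.. parsed-literal::

   package intel 1 omp 4 balance -1
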
The *ghost* keyword determines whether or not ghost atoms, i.e. atoms
at the boundaries of processor sub-domains, are offloaded for neighbor
and force calculations.  When the value = "no", ghost atoms are not
offloaded.  This option can reduce the amount of data transfer with
the coprocessor and can also overlap MPI communication of forces with
computation on the coprocessor when the :doc:`newton pair <newton>`
setting is "on".  When the value = "yes", ghost atoms are offloaded.
In some cases this can provide better performance, especially if the
*balance* fraction is high.

The *tpc* keyword sets the max # of coprocessor threads *Ntpc* that
will run on each core of the coprocessor.  The default value = 4,
which is the number of hardware threads per core supported by the
current generation Xeon Phi chips.

The *tptask* keyword sets the max # of coprocessor threads *Ntptask*
assigned to each MPI task.  The default value = 240, which is the
total # of threads an entire current generation Xeon Phi chip can run
(240 = 60 cores * 4 threads/core).  This means each MPI task assigned
to the Phi will have enough threads for the chip to run the max
allowed, even if only 1 MPI task is assigned.  If 8 MPI tasks are
assigned to the Phi, each will run with 30 threads.  If you wish to
limit the number of threads per MPI task, set *tptask* to a smaller
value.  E.g. for *tptask* = 16, if 8 MPI tasks are assigned, each will
run with 16 threads, for a total of 128.

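As a sketch of the worked example just above (8 MPI tasks on the Phi,
capped at 16 threads each):

.. parsed-literal::

   package intel 1 tptask 16
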
Note that the default settings for *tpc* and *tptask* are fine for
most problems, regardless of how many MPI tasks you assign to a Phi.

The *no_affinity* keyword will turn off automatic setting of core
affinity for MPI tasks and OpenMP threads on the host when using
offload to a coprocessor.  Affinity settings are used when possible
to prevent MPI tasks and OpenMP threads from being on separate NUMA
domains and to prevent offload threads from interfering with other
processes/threads used for LAMMPS.

----------

The *kokkos* style invokes settings associated with the use of the
KOKKOS package.

All of the settings are optional keyword/value pairs.  Each has a
default value as listed below.

The *neigh* keyword determines how neighbor lists are built.  A value
of *half* uses half-neighbor lists, the same as used by most pair
styles in LAMMPS.  A value of *half/thread* uses a thread-safe variant
of the half-neighbor list.  It should be used instead of *half* when
running with more than 1 thread per MPI task on a CPU.  A value of
*n2* uses an O(N^2) algorithm to build the neighbor list without
binning, where N = # of atoms on a processor.  It is typically slower
than the other methods, which use binning.

A value of *full* uses a full neighbor list and is the default.  This
performs twice as much computation as the *half* option, however that
is often a win because it is thread-safe and doesn't require atomic
operations in the calculation of pair forces.  For that reason, *full*
is the default setting.  However, when running in MPI-only mode with 1
thread per MPI task, *half* neighbor lists will typically be faster,
just as they are for non-accelerated pair styles.

A value of *full/cluster* is an experimental neighbor style, where
particles interact with all particles within a small cluster, if at
least one of the cluster's particles is within the neighbor cutoff
range.  This potentially allows for better vectorization on
architectures such as the Intel Phi.  It also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.

The *newton* keyword sets the Newton flags for pairwise and bonded
interactions to *off* or *on*, the same as the :doc:`newton <newton>`
command allows.  The default is *off* because this will almost always
give better performance for the KOKKOS package.  This means more
computation is done, but less communication.  However, when running in
MPI-only mode with 1 thread per MPI task, a value of *on* will
typically be faster, just as it is for non-accelerated pair styles.

The *binsize* keyword sets the size of bins used to bin atoms in
neighbor list builds.  The same value can be set by the
:doc:`neigh_modify binsize <neigh_modify>` command.  Making it an
option in the package kokkos command allows it to be set from the
command line.  The default value is 0.0, which means the LAMMPS
default will be used, which is bins = 1/2 the size of the pairwise
cutoff + neighbor skin distance.  This is fine when neighbor lists are
built on the CPU.  For GPU builds, a 2x larger binsize, equal to the
pairwise cutoff + neighbor skin, is often faster, which can be set by
this keyword.  Note that if you use a longer-than-usual pairwise
cutoff, e.g. to allow for a smaller fraction of KSpace work with a
:doc:`long-range Coulombic solver <kspace_style>` because the GPU is
faster at performing pairwise interactions, then this rule of thumb
may give too large a binsize.

The *comm* and *comm/exchange* and *comm/forward* keywords determine
whether the host or device performs the packing and unpacking of data
when communicating per-atom data between processors.  "Exchange"
communication happens only on timesteps that neighbor lists are
rebuilt.  The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep.  The data is for atom
coordinates and any other atom properties that need to be updated for
ghost atoms owned by each processor.

The *comm* keyword is simply a short-cut to set the same value
for both the *comm/exchange* and *comm/forward* keywords.

The value options for all 3 keywords are *no* or *host* or *device*.
A value of *no* means to use the standard non-KOKKOS method of
packing/unpacking data for the communication.  A value of *host* means
to use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads.  A value of *device* means
to use the device, typically a GPU, to perform the packing/unpacking
operation.

The optimal choice for these keywords depends on the input script and
the hardware used.  The *no* value is useful for verifying that the
Kokkos-based *host* and *device* values are working correctly.  It may
also be the fastest choice when using Kokkos styles in MPI-only mode
(i.e. with a thread count of 1).

When running on CPUs or Xeon Phi, the *host* and *device* values work
identically.  When using GPUs, the *device* value will typically be
optimal if all of the styles used in your input script are supported
by the KOKKOS package.  In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
*device* value.  This requires that your MPI is able to access GPU
memory directly.  Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI.  If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
*host* value.  Using *host* instead of *no* will enable use of
multiple threads to pack/unpack communicated data.

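For example (a sketch; the "-k on" and "-sf kk" switches are described
in the sections referenced above, and the thread count is a
placeholder), host-side packing with threads could be selected from
the command line:

.. parsed-literal::

   lmp_machine -k on t 4 -sf kk -pk kokkos comm host -in in.script
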
----------

The *omp* style invokes settings associated with the use of the
USER-OMP package.

The *Nthread* argument sets the number of OpenMP threads allocated for
each MPI task.  For example, if your system has nodes with dual
quad-core processors, it has a total of 8 cores per node.  You could
use two MPI tasks per node (e.g. using the -ppn option of the mpirun
command in MPICH or -npernode in OpenMPI), and set *Nthreads* = 4.
This would use all 8 cores on each node.  Note that the product of MPI
tasks * threads/task should not exceed the physical number of cores
(on a node), otherwise performance will suffer.

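A sketch of the dual quad-core layout described above, on 4 nodes (as
the text notes, "-ppn" is MPICH-specific; OpenMPI uses "-npernode"):

.. parsed-literal::

   mpirun -np 8 -ppn 2 lmp_machine -sf omp -pk omp 4 -in in.script
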
Setting *Nthread* = 0 instructs LAMMPS to use whatever value is the
default for the given OpenMP environment.  This is usually determined
via the *OMP_NUM_THREADS* environment variable or the compiler
runtime.  Note that in most cases the default for OpenMP capable
compilers is to use one thread for each available CPU core when
*OMP_NUM_THREADS* is not explicitly set, which can lead to poor
performance.

Here are examples of how to set the environment variable when
launching LAMMPS:

.. parsed-literal::

   env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
   env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
   mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script

or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.

Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes.  The
2nd example line above is for MPICH; the 3rd example line with -x is
for OpenMPI.  Check your MPI documentation for additional details.

What combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your
input.  Not all features of LAMMPS support OpenMP threading via the
USER-OMP package and the parallel efficiency can be very different,
too.

Optional keyword/value pairs can also be specified.  Each has a
default value as listed below.

The *neigh* keyword specifies whether neighbor list building will be
multi-threaded in addition to force calculations.  If *neigh* is set
to *no* then neighbor list calculation is performed only by MPI tasks
with no OpenMP threading.  If *neigh* is *yes* (the default), a
multi-threaded neighbor list build is used.  Using *neigh* = *yes* is
almost always faster and should produce identical neighbor lists at the
expense of using more memory.  Specifically, neighbor list pages are
allocated for all threads at the same time and each thread works
within its own pages.

----------

Restrictions
""""""""""""

This command cannot be used after the simulation box is defined by a
:doc:`read_data <read_data>` or :doc:`create_box <create_box>` command.

The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package.  See the :ref:`Making LAMMPS <start_3>`
section for more info.

The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package.  See the :ref:`Making LAMMPS <start_3>` section
for more info.

The intel style of this command can only be invoked if LAMMPS was
built with the USER-INTEL package.  See the :ref:`Making LAMMPS <start_3>`
section for more info.

The kokkos style of this command can only be invoked if LAMMPS was
built with the KOKKOS package.  See the :ref:`Making LAMMPS <start_3>`
section for more info.

The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package.  See the :ref:`Making LAMMPS <start_3>`
section for more info.

Related commands
""""""""""""""""

:doc:`suffix <suffix>`, "-pk" :ref:`command-line setting <start_7>`

Default
"""""""

For the USER-CUDA package, the default is Ngpu = 1 and the option
defaults are newton = off, gpuID = 0 to Ngpu-1, timing = not enabled,
test = not enabled, and thread = auto.  These settings are made
automatically by the required "-c on" :ref:`command-line switch <start_7>`.
You can change them by using the package cuda command in your input
script or via the "-pk cuda" :ref:`command-line switch <start_7>`.

For the GPU package, the default is Ngpu = 1 and the option defaults
are neigh = yes, newton = off, binsize = 0.0, split = 1.0, gpuID = 0
to Ngpu-1, tpa = 1, and device = not used.  These settings are made
automatically if the "-sf gpu" :ref:`command-line switch <start_7>` is
used.  If it is not used, you must invoke the package gpu command in
your input script or via the "-pk gpu" :ref:`command-line switch <start_7>`.

For the USER-INTEL package, the default is Nphi = 1 and the option
defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
240.  The default ghost option is determined by the pair style being
used.  This value is output to the screen in the offload report at the
end of each run.  Note that all of these settings, except "omp" and
"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
support.  These settings are made automatically if the "-sf intel"
:ref:`command-line switch <start_7>` is used.  If it is not used, you
must invoke the package intel command in your input script or via the
"-pk intel" :ref:`command-line switch <start_7>`.

For the KOKKOS package, the option defaults are neigh = full, newton =
off, binsize = 0.0, and comm = device.  These settings are made
automatically by the required "-k on" :ref:`command-line switch <start_7>`.
You can change them by using the package kokkos command in your input
script or via the "-pk kokkos" :ref:`command-line switch <start_7>`.

For the OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes.  These settings are made automatically if
the "-sf omp" :ref:`command-line switch <start_7>` is used.  If it is
not used, you must invoke the package omp command in your input script
or via the "-pk omp" :ref:`command-line switch <start_7>`.
