"LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

package command :h3

[Syntax:]

package style args :pre

style = {cuda} or {gpu} or {intel} or {kokkos} or {omp} :ulb,l
args = arguments specific to the style :l
  {cuda} args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = {gpu/node} or {gpu/node/special} or {timing} or {test} or {override/bpa}
      {gpu/node} value = N
        N = number of GPUs to be used per node
      {gpu/node/special} values = N gpu1 .. gpuN
        N = number of GPUs to be used per node
        gpu1 .. gpuN = N IDs of the GPUs to use
      {timing} values = none
      {test} values = id
        id = atom-ID of a test particle
      {override/bpa} values = flag
        flag = 0 for TpA algorithm, 1 for BpA algorithm
  {gpu} args = mode first last split keyword value ...
    mode = force or force/neigh
    first = ID of first GPU to be used on each node
    last = ID of last GPU to be used on each node
    split = fraction of particles assigned to the GPU
    zero or more keyword/value pairs may be appended
    keywords = {threads_per_atom} or {cellsize} or {device}
      {threads_per_atom} value = Nthreads
        Nthreads = # of GPU threads used per atom
      {cellsize} value = dist
        dist = length (distance units) in each dimension for neighbor bins
      {device} value = device_type
        device_type = {kepler} or {fermi} or {cypress} or {generic}
  {intel} args = Nthreads precision keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process on host
    precision = {single} or {mixed} or {double}
    keywords = {balance} or {offload_cards} or {offload_ghost} or {offload_tpc} or {offload_threads}
      {balance} value = split
        split = fraction of work to offload to coprocessor, -1 for dynamic
      {offload_cards} value = ncops
        ncops = number of coprocessors to use on each node
      {offload_ghost} value = offload_type
        offload_type = 1 to include ghost atoms for offload, 0 for local only
      {offload_tpc} value = tpc
        tpc = number of threads to use on each core of coprocessor
      {offload_threads} value = tptask
        tptask = max number of threads to use on coprocessor for each MPI task
  {kokkos} args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = {neigh} or {comm/exchange} or {comm/forward}
      {neigh} value = {full} or {half/thread} or {half} or {n2} or {full/cluster}
      {comm/exchange} value = {no} or {host} or {device}
      {comm/forward} value = {no} or {host} or {device}
  {omp} args = Nthreads mode
    Nthreads = # of OpenMP threads to associate with each MPI process
    mode = force or force/neigh (optional) :pre
:ule

[Examples:]

package gpu force 0 0 1.0
package gpu force 0 0 0.75
package gpu force/neigh 0 0 1.0
package gpu force/neigh 0 1 -1.0
package cuda gpu/node/special 2 0 2
package cuda test 3948
package kokkos neigh half/thread comm/forward device
package omp * force/neigh
package omp 4 force
package intel * mixed balance -1 :pre

[Description:]

This command invokes package-specific settings. Currently the
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.

To use the accelerated GPU and USER-OMP styles, the package command is
required. However, as described in the "Default" section below, if you
use the "-sf gpu" or "-sf omp" "command-line
options"_Section_start.html#start_7 to enable use of these styles,
then default package settings are enabled. In that case you only need
to use the package command if you want to change the defaults.

To use the accelerated USER-CUDA and KOKKOS styles, the package
command is not required, as defaults are assigned internally. You only
need to use the package command if you want to change the defaults.

See "Section_accelerate"_Section_accelerate.html of the manual for
more details about using these various packages for accelerating
LAMMPS calculations.

:line

The {cuda} style invokes options associated with the use of the
USER-CUDA package.

The {gpu/node} keyword specifies the number {N} of GPUs to be used on
each node. An MPI process with rank {K} will use the GPU (K mod N).
This implies that processes should be assigned with successive ranks
on each node, which is the default with most (or even all) MPI
implementations. The default value for {N} is 2.
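
For example, to use 4 GPUs on each node (an illustrative value), one
could specify:

package cuda gpu/node 4 :pre

Following the (K mod N) rule above, MPI ranks 0-3 on a node would then
map to GPUs 0-3, and higher ranks would wrap around to GPU 0 again.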

The {gpu/node/special} keyword also specifies the number {N} of GPUs
to be used on each node, but allows more control over their
specification. An MPI process with rank {K} will use the GPU {gpuI}
with I = (K mod N) + 1. This implies that processes should be assigned
with successive ranks on each node, which is the default with most (or
even all) MPI implementations. For example, if you have three GPUs on
a machine, one of which is used for the X-Server (the GPU with ID 1)
while the others (with IDs 0 and 2) are used for computations, you
would specify:

package cuda gpu/node/special 2 0 2 :pre

A main purpose of the {gpu/node/special} option is to allow two (or
more) simulations to be run on one workstation. In that case one
would set the first simulation to use GPU 0 and the second to use GPU
1. This is not necessary, though, if the GPUs are in what is called
{compute exclusive} mode. With that setting, every process will get
its own GPU automatically. This {compute exclusive} mode can be set
as root using the {nvidia-smi} tool, which is part of the CUDA
installation.

Note that if the {gpu/node/special} keyword is not used, the USER-CUDA
package sorts the existing GPUs on each node according to their number
of multiprocessors. This way, compute GPUs will be prioritized over
X-Server GPUs.

Use of the {timing} keyword will output detailed timing information
for various subroutines.

The {test} keyword will output info for the specified atom at
several points during each time step. This is mainly useful for
debugging purposes. Note that the simulation will be severely slowed
down if this option is used.

The {override/bpa} keyword can be used to specify which mode is used
for pair-force evaluation. TpA = one thread per atom; BpA = one block
per atom. If this keyword is not used, a short test at the beginning
of each run will determine which method is more effective (the result
of this test is part of the LAMMPS output). Therefore it is usually
not necessary to use this keyword.

:line

The {gpu} style invokes options associated with the use of the GPU
package.

The {mode} setting specifies where neighbor list calculations will be
performed. If {mode} is force, neighbor list calculation is performed
on the CPU. If {mode} is force/neigh, neighbor list calculation is
performed on the GPU. GPU neighbor list calculation currently cannot
be used with a triclinic box. GPU neighbor list calculation currently
cannot be used with "hybrid"_pair_hybrid.html pair styles. GPU
neighbor lists are not compatible with styles that are not
GPU-enabled. When a non-GPU enabled style requires a neighbor list,
it will also be built using CPU routines. In these cases, it will
typically be more efficient to only use CPU neighbor list builds.

The {first} and {last} settings specify the GPUs that will be used for
simulation. On each node, the GPU IDs in the inclusive range from
{first} to {last} will be used.

The {split} setting can be used for load balancing force calculation
work between CPU and GPU cores in GPU-enabled pair styles. If 0 <
{split} < 1.0, a fixed fraction of particles is offloaded to the GPU
while force calculation for the other particles occurs simultaneously
on the CPU. If {split} < 0, the optimal fraction (based on CPU and GPU
timings) is calculated every 25 timesteps. If {split} = 1.0, all force
calculations for GPU accelerated pair styles are performed on the
GPU. In this case, "hybrid"_pair_hybrid.html, "bond"_bond_style.html,
"angle"_angle_style.html, "dihedral"_dihedral_style.html,
"improper"_improper_style.html, and "long-range"_kspace_style.html
calculations can be performed on the CPU while the GPU is performing
force calculations for the GPU-enabled pair style. If all CPU force
computations complete before the GPU, LAMMPS will block until the GPU
has finished before continuing the timestep.

As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify

package gpu force/neigh 0 1 -1 :pre

In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.
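
A sketch of how such a run might be launched (the executable name
lmp_machine and the input script name in.script are placeholders, and
the mechanism for placing 8 MPI tasks on each of the 4 nodes depends
on your MPI installation):

mpirun -np 32 lmp_machine -sf gpu -in in.script :pre

with the "package gpu force/neigh 0 1 -1" line above placed near the
top of in.script.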

The {threads_per_atom} keyword allows control of the number of GPU
threads used per atom to perform the short range force calculation.
By default, the value will be chosen based on the pair style; however,
the value can be set with this keyword to fine-tune performance. For
large cutoffs or with a small number of particles per GPU, increasing
the value can improve performance. The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.
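
For instance, one might (as an illustration) raise the per-atom thread
count for a pair style with a large cutoff:

package gpu force/neigh 0 0 1.0 threads_per_atom 8 :pre

The value 8 is only an example; any value must be a power of 2 and no
greater than 32, as noted above.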

The {cellsize} keyword can be used to control the size of the cells
used for binning atoms in neighbor list calculations. Setting this
value is normally not needed; the optimal value is close to the
default (equal to the cutoff distance for the short range interactions
plus the neighbor skin). GPUs can perform efficiently with much larger
cutoffs than CPUs, and this can be used to reduce the time required
for long-range calculations or in some cases to eliminate them with
models such as "coul/wolf"_pair_coul.html or "coul/dsf"_pair_coul.html.
For very large cutoffs, it can be more efficient to use smaller values
for cellsize in parallel simulations. For example, with a cutoff of
20*sigma and a neighbor skin of sigma, a cellsize of 5.25*sigma can be
efficient for parallel simulations.
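
A minimal sketch of that last example, assuming reduced (LJ) units so
that sigma = 1.0:

package gpu force/neigh 0 0 1.0 cellsize 5.25 :pre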

The {device} keyword can be used to tune parameters to optimize for a
specific accelerator when using OpenCL. For CUDA, the {device} keyword
is ignored. Currently, the device type is limited to NVIDIA Kepler,
NVIDIA Fermi, AMD Cypress, or a generic device. More devices will be
added soon. The default device type can be specified when building
LAMMPS with the GPU library.
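
For example, in an OpenCL build one might (illustratively) select the
Fermi tuning parameters:

package gpu force/neigh 0 0 1.0 device fermi :pre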

:line

The {intel} style invokes options associated with the use of the
USER-INTEL package.

The {Nthreads} argument allows one to explicitly set the number of
OpenMP threads to be allocated for each MPI process. An {Nthreads}
value of '*' instructs LAMMPS to use whatever is the default for the
given OpenMP environment. This is usually determined via the
OMP_NUM_THREADS environment variable or the compiler runtime.

The {precision} argument determines the precision mode to use and can
take values of {single} (intel styles use single precision for all
calculations), {mixed} (intel styles use double precision for
accumulation and storage of forces, torques, energies, and virial
terms and single precision for everything else), or {double} (intel
styles use double precision for all calculations).

Additional keyword-value pairs are available that are used to
determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS
is built without offload support, these values are ignored. The
additional settings are as follows:

The {balance} setting is used to set the fraction of work offloaded to
the coprocessor for an intel style (in the inclusive range 0.0 to
1.0). While this fraction of work is running on the coprocessor, other
calculations will run on the host, including neighbor and pair
calculations that are not offloaded, angle, bond, dihedral, kspace,
and some MPI communications. If the balance is set to -1, the fraction
of work is dynamically adjusted automatically throughout the run. This
can typically give performance within 5 to 10 percent of the optimal
fixed fraction.
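
For illustration, to offload 60 percent of the work for the intel pair
styles to the coprocessor (0.6 is just an example value), while keeping
the default thread count and mixed precision on the host:

package intel * mixed balance 0.6 :pre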

The {offload_cards} setting determines the number of coprocessors to
use on each node.

Additional options for fine tuning performance with offload are as
follows:

The {offload_ghost} setting determines whether or not ghost atoms,
atoms at the borders between MPI tasks, are offloaded for neighbor and
force calculations. When set to "0", ghost atoms are not offloaded.
This option can reduce the amount of data transfer with the
coprocessor and also can overlap MPI communication of forces with
computation on the coprocessor when the "newton pair"_newton.html
setting is "on". When set to "1", ghost atoms are offloaded. In some
cases this can provide better performance, especially if the offload
fraction is high.

The {offload_tpc} option sets the maximum number of threads that will
run on each core of the coprocessor.

The {offload_threads} option sets the maximum number of threads that
will be used on the coprocessor for each MPI task. This setting,
together with {offload_tpc}, is the only way to change the number of
threads on the coprocessor. The OMP_NUM_THREADS environment variable
and the {Nthreads} argument apply only to threads on the host.
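
As an illustration of combining the offload keywords, the following
simply makes the defaults listed in the "Default" section below
explicit:

package intel * mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240 :pre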

:line

The {kokkos} style invokes options associated with the use of the
KOKKOS package.

The {neigh} keyword determines what kinds of neighbor lists are built.
A value of {half} uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. A value of {half/thread} uses a thread-safe
variant of the half-neighbor list. It should be used instead of
{half} when running with threads on a CPU. A value of {full} uses a
full neighbor list, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the {half} option; however, that
can be a win because it is thread-safe and doesn't require atomic
operations. A value of {full/cluster} is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
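
For example, when running the KOKKOS styles with OpenMP threads on a
CPU, one might (illustratively) request the thread-safe half-neighbor
list:

package kokkos neigh half/thread :pre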

The {comm/exchange} and {comm/forward} keywords determine whether the
host or device performs the packing and unpacking of data when
communicating information between processors. "Exchange"
communication happens only on timesteps on which neighbor lists are
rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. The data is for atom
coordinates and any other atom properties that need to be updated for
ghost atoms owned by each processor.

The value options for these keywords are {no} or {host} or {device}.
A value of {no} means to use the standard non-KOKKOS method of
packing/unpacking data for the communication. A value of {host} means
to use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of {device} means
to use the device, typically a GPU, to perform the packing/unpacking
operation.

The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that the
KOKKOS code is working correctly. It may also be the fastest choice
when using KOKKOS styles in MPI-only mode (i.e. with a thread count of
1). When running on CPUs or Xeon Phi, the {host} and {device} values
work identically. When using GPUs, the {device} value will typically
be optimal if all of the styles used in your input script are
supported by the KOKKOS package. In this case data can stay on the GPU
for many timesteps without being moved between the host and GPU, if
you use the {device} value. This requires that your MPI is able to
access GPU memory directly. Currently that is true for OpenMPI 1.8 (or
later versions), Mvapich2 1.9 (or later), and CrayMPI. If your script
uses styles (e.g. fixes) which are not yet supported by the KOKKOS
package, then data has to be moved between the host and device anyway,
so it is typically faster to let the host handle communication, by
using the {host} value. Using {host} instead of {no} will enable the
use of multiple threads to pack/unpack communicated data.
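
For example, on GPUs with a GPU-direct capable MPI and an input script
that uses only KOKKOS-supported styles, one might (illustratively) keep
both communication steps on the device:

package kokkos comm/exchange device comm/forward device :pre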

:line

The {omp} style invokes options associated with the use of the
USER-OMP package.

The first argument allows one to explicitly set the number of OpenMP
threads to be allocated for each MPI process. For example, if your
system has nodes with dual quad-core processors, it has a total of 8
cores per node. You could run MPI on 2 cores on each node (e.g. using
options for the mpirun command), and set the {Nthreads} setting to 4.
This would effectively use all 8 cores on each node, since each MPI
process would spawn 4 threads (one of which runs as part of the MPI
process itself).
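
A minimal sketch of this dual quad-core scenario, assuming 4 such
nodes (the executable name lmp_machine and the input script name
in.script are placeholders, and the mpirun option for placing 2 tasks
on each node depends on your MPI installation):

mpirun -np 8 lmp_machine -sf omp -in in.script :pre

with the line "package omp 4" near the top of in.script to request 4
threads per MPI task.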

For performance reasons, you should not set {Nthreads} to more threads
than there are physical cores (per MPI task), but LAMMPS cannot check
for this.

An {Nthreads} value of '*' instructs LAMMPS to use whatever is the
default for the given OpenMP environment. This is usually determined
via the {OMP_NUM_THREADS} environment variable or the compiler
runtime. Please note that in most cases the default for OpenMP
capable compilers is to use one thread for each available CPU core
when {OMP_NUM_THREADS} is not set, which can lead to extremely bad
performance.

Which combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your
input. Not all features of LAMMPS support OpenMP, and the parallel
efficiency can differ considerably between styles as well.

The {mode} setting specifies whether neighbor list calculations will
be multi-threaded as well. If {mode} is force, neighbor list
calculation is performed in serial. If {mode} is force/neigh, a
multi-threaded neighbor list build is used. Using the force/neigh
setting is almost always faster and should produce identical neighbor
lists at the expense of using some more memory (neighbor list pages
are always allocated for all threads at the same time and each thread
works on its own pages).

:line

[Restrictions:]

This command cannot be used after the simulation box is defined by a
"read_data"_read_data.html or "create_box"_create_box.html command.

The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.

The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.

The intel style of this command can only be invoked if LAMMPS was
built with the USER-INTEL package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.

The kokkos style of this command can only be invoked if LAMMPS was
built with the KOKKOS package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.

The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package. See the "Making
LAMMPS"_Section_start.html#start_3 section for more info.

[Related commands:]

"suffix"_suffix.html

[Default:]

The default settings for the USER-CUDA package are "package cuda
gpu/node 2". This is the case whether the "-sf cuda" "command-line
switch"_Section_start.html#start_7 is used or not.

If the "-sf gpu" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package gpu force/neigh 0 0 1" were
invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.

The default settings for the USER-INTEL package are "package intel *
mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
The {offload_ghost} default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.

The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
or not.

If the "-sf omp" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package omp *" were invoked, to
specify default settings for the USER-OMP package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.