<HTML>

<CENTER><A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>

<HR>

<H3>package command
</H3>

<P><B>Syntax:</B>
</P>

<PRE>package style args
</PRE>

<UL><LI>style = <I>cuda</I> or <I>gpu</I> or <I>intel</I> or <I>kokkos</I> or <I>omp</I>

<LI>args = arguments specific to the style

<PRE>  <I>cuda</I> args = Ngpu keyword value ...
    Ngpu = # of GPUs per node
    zero or more keyword/value pairs may be appended
    keywords = <I>gpuID</I> or <I>timing</I> or <I>test</I> or <I>thread</I>
      <I>gpuID</I> values = gpu1 .. gpuN
        gpu1 .. gpuN = IDs of the Ngpu GPUs to use
      <I>timing</I> values = none
      <I>test</I> values = id
        id = atom-ID of a test particle
      <I>thread</I> = auto or tpa or bpa
        auto = test whether tpa or bpa is faster
        tpa = one thread per atom
        bpa = one block per atom
  <I>gpu</I> args = Ngpu keyword value ...
    Ngpu = # of GPUs per node
    zero or more keyword/value pairs may be appended
    keywords = <I>neigh</I> or <I>split</I> or <I>gpuID</I> or <I>tpa</I> or <I>binsize</I> or <I>device</I>
      <I>neigh</I> value = <I>yes</I> or <I>no</I>
        yes = neighbor list build on GPU (default)
        no = neighbor list build on CPU
      <I>split</I> = fraction
        fraction = fraction of atoms assigned to GPU (default = 1.0)
      <I>gpuID</I> values = first last
        first = ID of first GPU to be used on each node
        last = ID of last GPU to be used on each node
      <I>tpa</I> value = Nthreads
        Nthreads = # of GPU threads used per atom
      <I>binsize</I> value = size
        size = bin size for neighbor list construction (distance units)
      <I>device</I> value = device_type
        device_type = <I>kepler</I> or <I>fermi</I> or <I>cypress</I> or <I>generic</I>
  <I>intel</I> args = Nthreads precision keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process on the host
    precision = <I>single</I> or <I>mixed</I> or <I>double</I>
    keywords = <I>balance</I> or <I>offload_cards</I> or <I>offload_ghost</I> or <I>offload_tpc</I> or <I>offload_threads</I>
      <I>balance</I> value = split
        split = fraction of work to offload to coprocessor, -1 for dynamic
      <I>offload_cards</I> value = ncops
        ncops = number of coprocessors to use on each node
      <I>offload_ghost</I> value = offload_type
        offload_type = 1 to include ghost atoms for offload, 0 for local only
      <I>offload_tpc</I> value = tpc
        tpc = number of threads to use on each core of coprocessor
      <I>offload_threads</I> value = tptask
        tptask = max number of threads to use on coprocessor for each MPI task
  <I>kokkos</I> args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = <I>neigh</I> or <I>comm/exchange</I> or <I>comm/forward</I>
      <I>neigh</I> value = <I>full</I> or <I>half/thread</I> or <I>half</I> or <I>n2</I> or <I>full/cluster</I>
      <I>comm/exchange</I> value = <I>no</I> or <I>host</I> or <I>device</I>
      <I>comm/forward</I> value = <I>no</I> or <I>host</I> or <I>device</I>
  <I>omp</I> args = Nthreads keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process
    zero or more keyword/value pairs may be appended
    keywords = <I>neigh</I>
      <I>neigh</I> value = <I>yes</I> or <I>no</I>
        yes = threaded neighbor list build (default)
        no = non-threaded neighbor list build
</PRE>

</UL>

<P><B>Examples:</B>
</P>

<PRE>package gpu 1
package gpu 1 split 0.75
package gpu 2 split -1.0
package cuda 2 gpuID 0 2
package cuda 1 test 3948
package kokkos neigh half/thread comm/forward device
package omp 0 neigh yes
package omp 4
package intel * mixed balance -1
</PRE>

<P><B>Description:</B>
</P>

<P>This command invokes package-specific settings. Currently the
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.
</P>

<P>To use the accelerated GPU and USER-OMP styles, the use of the package
command is required. However, as described in the "Defaults" section
below, if you use the "-sf gpu" or "-sf omp" <A HREF = "Section_start.html#start_7">command-line
options</A> to enable use of these styles,
then default package settings are enabled. In that case you only need
to use the package command if you want to change the defaults.
</P>

<P>To use the accelerated USER-CUDA and KOKKOS styles, the package
command is not required, as defaults are assigned internally. You only
need to use the package command if you want to change the defaults.
</P>

<P>See <A HREF = "Section_accelerate.html">Section_accelerate</A> of the manual for
more details about using these various packages for accelerating
LAMMPS calculations.
</P>

<P>Note that the GPU package always sets the <A HREF = "newton.html">newton pair</A>
setting to off; the USER-CUDA package does not.
</P>

<HR>

<P>The <I>cuda</I> style invokes settings associated with the use of the
USER-CUDA package.
</P>

<P>The <I>Ngpu</I> argument sets the number of GPUs per node. There must be
exactly one MPI task per GPU, as set by the mpirun or mpiexec command.
</P>

<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>

<P>The <I>gpuID</I> keyword allows selection of which GPUs on each node will
be used for a simulation. GPU IDs range from 0 to N-1, where N is the
physical number of GPUs per node. An ID is specified for each of the
Ngpu GPUs being used. For example, if you have three GPUs on a machine,
one of which is used for the X server (the GPU with ID 1) while
the others (with IDs 0 and 2) are used for computations, you would
specify:
</P>

<PRE>package cuda 2 gpuID 0 2
</PRE>

<P>The purpose of the <I>gpuID</I> keyword is to allow two (or more)
simulations to be run on one workstation. In that case one could set
the first simulation to use GPU 0 and the second to use GPU 1. This is
not necessary, however, if the GPUs are in what is called <I>compute
exclusive</I> mode. With that setting, every process will get its own
GPU automatically. This <I>compute exclusive</I> mode can be set as root
using the <I>nvidia-smi</I> tool, which is part of the CUDA installation.
</P>

<P>Also note that if the <I>gpuID</I> keyword is not used, the USER-CUDA
package sorts the existing GPUs on each node according to their number of
multiprocessors. This way, compute GPUs will be prioritized over
X-server GPUs.
</P>

<P>If the <I>timing</I> keyword is specified, detailed timing information for
various subroutines will be output.
</P>

<P>If the <I>test</I> keyword is specified, information for the atom with the
specified atom-ID will be output at several points during each timestep.
This is mainly useful for debugging purposes. Note that the
simulation will slow down dramatically if this option is used.
</P>

<P>The <I>thread</I> keyword can be used to specify how GPU threads are
assigned work during pair style force evaluation. If the value is
<I>tpa</I>, one thread per atom is used. If the value is <I>bpa</I>, one block
per atom is used. If the value is <I>auto</I>, a short test is performed at
the beginning of each run to determine whether <I>tpa</I> or <I>bpa</I> mode is
faster. The result of this test is output. Since <I>auto</I> is the
default value, it is usually not necessary to use this keyword.
</P>
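
<P>For example, to force the <I>bpa</I> mode on a single-GPU node, you
could use the following sketch (whether <I>bpa</I> beats <I>tpa</I> depends
on your hardware and pair style, which is why <I>auto</I> is the default):
</P>

<PRE>package cuda 1 thread bpa
</PRE>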

<HR>

<P>The <I>gpu</I> style invokes settings associated with the use of
the GPU package.
</P>

<P>The <I>Ngpu</I> argument sets the number of GPUs per node. There must be
at least as many MPI tasks per node as GPUs, as set by the mpirun or
mpiexec command. If there are more MPI tasks (per node)
than GPUs, multiple MPI tasks will share each GPU.
</P>

<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>

<P>The <I>neigh</I> keyword specifies where neighbor lists for pair style
computation will be built. If <I>neigh</I> is <I>yes</I>, which is the default,
neighbor list building is performed on the GPU. If <I>neigh</I> is <I>no</I>,
neighbor list building is performed on the CPU. GPU neighbor list
building currently cannot be used with a triclinic box. GPU neighbor
list calculation currently cannot be used with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles. GPU neighbor lists are not
compatible with commands that are not GPU-enabled. When a non-GPU-enabled
command requires a neighbor list, it will also be built on the
CPU. In these cases, it will typically be more efficient to only use
CPU neighbor list builds.
</P>

<P>The <I>split</I> keyword can be used for load balancing force calculations
between CPU and GPU cores in GPU-enabled pair styles. If 0 < <I>split</I> <
1.0, a fixed fraction of particles is offloaded to the GPU while force
calculation for the other particles occurs simultaneously on the
CPU. If <I>split</I> < 0.0, the optimal fraction (based on CPU and GPU
timings) is calculated every 25 timesteps. If <I>split</I> = 1.0, all
force calculations for GPU accelerated pair styles are performed on
the GPU. In this case, other <A HREF = "pair_hybrid.html">hybrid</A> pair
interactions, <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> calculations can be performed on the
CPU while the GPU is performing force calculations for the GPU-enabled
pair style. If all CPU force computations complete before the GPU
completes, LAMMPS will block until the GPU has finished before
continuing the timestep.
</P>

<P>As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify
</P>

<PRE>mpirun -np 32 lmp_machine -sf gpu -in in.script  # launch command
package gpu 2 split -1                           # input script command
</PRE>

<P>In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.
</P>

<P>The <I>gpuID</I> keyword allows selection of which GPUs on each node will
be used for a simulation. The <I>first</I> and <I>last</I> values specify the
GPU IDs to use (from 0 to Ngpu-1). By default, first = 0 and last =
Ngpu-1, so that all GPUs are used, assuming Ngpu is set to the number
of physical GPUs. If you only wish to use a subset, set Ngpu to a
smaller number and first/last to a sub-range of the available GPUs.
</P>
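
<P>For instance, on a node with four physical GPUs, the following
hypothetical setting would restrict a run to the two middle devices:
</P>

<PRE>package gpu 2 gpuID 1 2
</PRE>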

<P>The <I>tpa</I> keyword sets the number of GPU threads per atom used to
perform force calculations. With the default value of 1, the number of
threads will be chosen based on the pair style; however, the value can
be set explicitly with this keyword to fine-tune performance. For
large cutoffs or for a small number of particles per GPU, increasing
the value can improve performance. The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.
</P>
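
<P>As a sketch, a large-cutoff simulation on one GPU might benefit from
raising the thread count (8 is an assumed value here; any power of 2
up to 32 is allowed):
</P>

<PRE>package gpu 1 tpa 8
</PRE>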

<P>The <I>binsize</I> keyword sets the size of bins used to bin atoms in
neighbor list builds. Setting this value is normally not needed; the
optimal value is close to the default, which is set equal to the
cutoff distance for the short range interactions plus the neighbor
skin. Note that this is 2x larger than the default bin size for
neighbor list builds on the CPU. This is because GPUs can perform
efficiently with much larger cutoffs than CPUs. This can be used to
reduce the time required for long-range calculations or in some cases
to eliminate them with pair style models such as
<A HREF = "pair_coul.html">coul/wolf</A> or <A HREF = "pair_coul.html">coul/dsf</A>. For very
large cutoffs, it can be more efficient to use smaller values for
<I>binsize</I> in parallel simulations. For example, with a cutoff of
20*sigma in LJ <A HREF = "units.html">units</A> and a neighbor skin distance of
sigma, a <I>binsize</I> = 5.25*sigma can be more efficient than the
default.
</P>
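
<P>With sigma = 1.0 in LJ units, that suggestion corresponds to a
command like the following sketch (the best value depends on the
cutoff and hardware):
</P>

<PRE>package gpu 1 binsize 5.25
</PRE>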

<P>The <I>device</I> keyword can be used to tune parameters optimized for a
specific accelerator, when using OpenCL. For CUDA, the <I>device</I>
keyword is ignored. Currently, the device type is limited to NVIDIA
Kepler, NVIDIA Fermi, AMD Cypress, or a generic device. More devices
may be added later. The default device type can be specified when
building LAMMPS with the GPU library, via settings in the
lib/gpu/Makefile that is used.
</P>
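
<P>For example, an OpenCL build running on a Fermi-class card could
select the matching tuned parameters as follows (this setting is
ignored by CUDA builds, as noted above):
</P>

<PRE>package gpu 1 device fermi
</PRE>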

<HR>

<P>The <I>intel</I> style invokes options associated with the use of the
USER-INTEL package.
</P>

<P>The <I>Nthreads</I> argument allows one to explicitly set the number of
OpenMP threads to be allocated for each MPI process. An <I>Nthreads</I>
value of '*' instructs LAMMPS to use whatever is the default for the
given OpenMP environment. This is usually determined via the
OMP_NUM_THREADS environment variable or the compiler runtime.
</P>

<P>The <I>precision</I> argument determines the precision mode to use and can
take values of <I>single</I> (intel styles use single precision for all
calculations), <I>mixed</I> (intel styles use double precision for
accumulation and storage of forces, torques, energies, and virial
terms and single precision for everything else), or <I>double</I> (intel
styles use double precision for all calculations).
</P>
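
<P>For example, to run intel styles with two host threads per MPI task
in mixed precision (a representative choice, not a recommendation):
</P>

<PRE>package intel 2 mixed
</PRE>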

<P>Additional keyword/value pairs are available that are used to
determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is
built without offload support, these values are ignored. The
additional settings are as follows:
</P>

<P>The <I>balance</I> setting is used to set the fraction of work offloaded to
the coprocessor for an intel style (in the inclusive range 0.0 to
1.0). While this fraction of work is running on the coprocessor, other
calculations will run on the host, including neighbor and pair
calculations that are not offloaded, angle, bond, dihedral, kspace,
and some MPI communications. If the balance is set to -1, the fraction
of work is dynamically adjusted automatically throughout the run. This
can typically give performance within 5 to 10 percent of the optimal
fixed fraction.
</P>

<P>The <I>offload_cards</I> setting determines the number of coprocessors to
use on each node.
</P>

<P>Additional options for fine-tuning performance with offload are as
follows:
</P>

<P>The <I>offload_ghost</I> setting determines whether or not ghost atoms,
atoms at the borders between MPI tasks, are offloaded for neighbor and
force calculations. When set to "0", ghost atoms are not offloaded.
This option can reduce the amount of data transfer with the
coprocessor and can also overlap MPI communication of forces with
computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
setting is "on". When set to "1", ghost atoms are offloaded. In some
cases this can provide better performance, especially if the offload
fraction is high.
</P>

<P>The <I>offload_tpc</I> option sets the maximum number of threads that will
run on each core of the coprocessor.
</P>

<P>The <I>offload_threads</I> option sets the maximum number of threads that
will be used on the coprocessor for each MPI task. This option, along
with the <I>offload_tpc</I> setting, is the only way to change the
number of threads on the coprocessor. The OMP_NUM_THREADS environment
variable and the <I>Nthreads</I> argument apply only to threads on the host.
</P>
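
<P>Putting the offload settings together, here is a sketch of a run that
dynamically balances work onto one coprocessor (the thread values are
assumptions; optimal values depend on the coprocessor model):
</P>

<PRE>package intel 4 mixed balance -1 offload_cards 1 offload_tpc 2 offload_threads 120
</PRE>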

<HR>

<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>

<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. A value of <I>half/thread</I> uses a thread-safe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full neighbor list, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option; however, it
can be a win because it is thread-safe and doesn't require atomic
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>

<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
communicating information between processors. "Exchange"
communication happens only on timesteps that neighbor lists are
rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. The data is for atom
coordinates and any other atom properties that need to be updated for
ghost atoms owned by each processor.
</P>

<P>The value options for these keywords are <I>no</I> or <I>host</I> or <I>device</I>.
A value of <I>no</I> means to use the standard non-KOKKOS method of
packing/unpacking data for the communication. A value of <I>host</I> means
to use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
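
<P>For example, a GPU run whose styles are all KOKKOS-enabled could keep
both communication operations on the device (see the discussion below
for when this choice pays off):
</P>

<PRE>package kokkos comm/exchange device comm/forward device
</PRE>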

<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of the styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), MVAPICH2 1.9 (or later), and Cray MPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable the use of
multiple threads to pack/unpack communicated data.
</P>

<HR>

<P>The <I>omp</I> style invokes settings associated with the use of the
USER-OMP package.
</P>

<P>The <I>Nthreads</I> argument sets the number of OpenMP threads allocated for
each MPI task. For example, if your system has nodes with dual
quad-core processors, it has a total of 8 cores per node. You could
use two MPI tasks per node (e.g. using the -ppn option of the mpirun
command), and set <I>Nthreads</I> = 4. This would use all 8 cores on each
node. Note that the product of MPI tasks * threads/task should not
exceed the physical number of cores (on a node), otherwise performance
will suffer.
</P>
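
<P>Assuming 4 such nodes, that scenario maps onto a launch like the
following sketch (the -ppn flag is launcher-specific; other MPI
implementations spell it differently):
</P>

<PRE>mpirun -np 8 -ppn 2 lmp_machine -sf omp -in in.script  # 2 MPI tasks on each of 4 nodes
package omp 4                                          # input script command: 4 threads per task
</PRE>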

<P>Setting <I>Nthreads</I> = 0 instructs LAMMPS to use whatever value is the
default for the given OpenMP environment. This is usually determined
via the <I>OMP_NUM_THREADS</I> environment variable or the compiler
runtime. Note that in most cases the default for OpenMP-capable
compilers is to use one thread for each available CPU core when
<I>OMP_NUM_THREADS</I> is not explicitly set, which can lead to poor
performance.
</P>

<P>Here are examples of how to set the environment variable when
launching LAMMPS:
</P>

<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>

<P>or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.
</P>

<P>Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes. The
2nd example line above is for MPICH; the 3rd example line with -x is
for OpenMPI. Check your MPI documentation for additional details.
</P>

<P>What combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your
input. Not all features of LAMMPS support OpenMP threading via the
USER-OMP package, and the parallel efficiency can be very different,
too.
</P>

<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>

<P>The <I>neigh</I> keyword specifies whether neighbor list building will be
multi-threaded in addition to force calculations. If <I>neigh</I> is set
to <I>no</I>, then neighbor list calculation is performed only by MPI tasks
with no OpenMP threading. If <I>neigh</I> is <I>yes</I> (the default), a
multi-threaded neighbor list build is used. Using <I>neigh</I> = <I>yes</I> is
almost always faster and should produce identical neighbor lists, at the
expense of using more memory. Specifically, neighbor list pages are
allocated for all threads at the same time and each thread works
within its own pages.
</P>
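
<P>For example, if the extra memory for per-thread neighbor pages is a
concern, the threaded build can be disabled while keeping threaded
force calculations (shown here with 4 threads per task):
</P>

<PRE>package omp 4 neigh no
</PRE>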

<HR>

<P><B>Restrictions:</B>
</P>

<P>This command cannot be used after the simulation box is defined by a
<A HREF = "read_data.html">read_data</A> or <A HREF = "create_box.html">create_box</A> command.
</P>

<P>The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>

<P>The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>

<P>The intel style of this command can only be invoked if LAMMPS was
built with the USER-INTEL package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>

<P>The kokkos style of this command can only be invoked if LAMMPS was built
with the KOKKOS package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>

<P>The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>

<P><B>Related commands:</B>
</P>

<P><A HREF = "suffix.html">suffix</A>, "-pk" <A HREF = "Section_start.html#start_7">command-line
setting</A>
</P>

<P><B>Default:</B>
</P>

<P>To use the USER-CUDA package, the package command must be invoked
explicitly, either via the "-pk cuda" <A HREF = "Section_start.html#start_7">command-line
switch</A> or by invoking the package cuda
command in your input script. This will set the # of GPUs/node. The
option defaults are gpuID = 0 to Ngpu-1, timing not enabled, test not
enabled, and thread = auto.
</P>

<P>For the GPU package, the default is Ngpu = 1 and the option defaults
are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize =
pair cutoff + neighbor skin, device = not used. These settings are
made if the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>
is used. If it is not used, you must invoke the package gpu command
in your input script.
</P>

<P>The default settings for the USER-INTEL package are "package intel *
mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
The <I>offload_ghost</I> default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.
</P>

<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
</P>

<P>For the USER-OMP package, the default is Nthreads = 0 and the option
default is neigh = yes. These settings are made if the "-sf omp"
<A HREF = "Section_start.html#start_7">command-line switch</A> is used. If it is
not used, you must invoke the package omp command in your input
script.
</P>

</HTML>