<HTML>
<CENTER><A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>
<HR>
<H3>package command
</H3>
<P><B>Syntax:</B>
</P>
<PRE>package style args
</PRE>
<UL><LI>style = <I>cuda</I> or <I>gpu</I> or <I>intel</I> or <I>kokkos</I> or <I>omp</I>
<LI>args = arguments specific to the style
<PRE>  <I>cuda</I> args = Ngpu keyword value ...
    Ngpu = # of GPUs per node
    zero or more keyword/value pairs may be appended
    keywords = <I>gpuID</I> or <I>timing</I> or <I>test</I> or <I>thread</I>
      <I>gpuID</I> values = gpu1 .. gpuN
        gpu1 .. gpuN = IDs of the Ngpu GPUs to use
      <I>timing</I> values = none
      <I>test</I> values = id
        id = atom-ID of a test particle
      <I>thread</I> = auto or tpa or bpa
        auto = test whether tpa or bpa is faster
        tpa = one thread per atom
        bpa = one block per atom
  <I>gpu</I> args = Ngpu keyword value ...
    Ngpu = # of GPUs per node
    zero or more keyword/value pairs may be appended
    keywords = <I>neigh</I> or <I>split</I> or <I>gpuID</I> or <I>tpa</I> or <I>binsize</I> or <I>device</I>
      <I>neigh</I> value = <I>yes</I> or <I>no</I>
        yes = neighbor list build on GPU (default)
        no = neighbor list build on CPU
      <I>split</I> = fraction
        fraction = fraction of atoms assigned to GPU (default = 1.0)
      <I>gpuID</I> values = first last
        first = ID of first GPU to be used on each node
        last = ID of last GPU to be used on each node
      <I>tpa</I> value = Nthreads
        Nthreads = # of GPU threads used per atom
      <I>binsize</I> value = size
        size = bin size for neighbor list construction (distance units)
      <I>device</I> value = device_type
        device_type = <I>kepler</I> or <I>fermi</I> or <I>cypress</I> or <I>generic</I>
  <I>intel</I> args = Nthreads precision keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process on host
    precision = <I>single</I> or <I>mixed</I> or <I>double</I>
    keywords = <I>balance</I> or <I>offload_cards</I> or <I>offload_ghost</I> or <I>offload_tpc</I> or <I>offload_threads</I>
      <I>balance</I> value = split
        split = fraction of work to offload to coprocessor, -1 for dynamic
      <I>offload_cards</I> value = ncops
        ncops = number of coprocessors to use on each node
      <I>offload_ghost</I> value = offload_type
        offload_type = 1 to include ghost atoms for offload, 0 for local only
      <I>offload_tpc</I> value = tpc
        tpc = number of threads to use on each core of coprocessor
      <I>offload_threads</I> value = tptask
        tptask = max number of threads to use on coprocessor for each MPI task
  <I>kokkos</I> args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = <I>neigh</I> or <I>comm/exchange</I> or <I>comm/forward</I>
      <I>neigh</I> value = <I>full</I> or <I>half/thread</I> or <I>half</I> or <I>n2</I> or <I>full/cluster</I>
      <I>comm/exchange</I> value = <I>no</I> or <I>host</I> or <I>device</I>
      <I>comm/forward</I> value = <I>no</I> or <I>host</I> or <I>device</I>
  <I>omp</I> args = Nthreads keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process
    zero or more keyword/value pairs may be appended
    keywords = <I>neigh</I>
      <I>neigh</I> value = <I>yes</I> or <I>no</I>
        yes = threaded neighbor list build (default)
        no = non-threaded neighbor list build
</PRE>
</UL>
<P><B>Examples:</B>
</P>
<PRE>package gpu 1
package gpu 1 split 0.75
package gpu 2 split -1.0
package cuda 2 gpuID 0 2
package cuda 1 test 3948
package kokkos neigh half/thread comm/forward device
package omp 0 neigh yes
package omp 4
package intel * mixed balance -1
</PRE>
<P><B>Description:</B>
</P>
<P>This command invokes package-specific settings. Currently the
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.
</P>
<P>To use the accelerated GPU and USER-OMP styles, the use of the package
command is required. However, as described in the "Defaults" section
below, if you use the "-sf gpu" or "-sf omp" <A HREF = "Section_start.html#start_7">command-line
options</A> to enable use of these styles,
then default package settings are enabled. In that case you only need
to use the package command if you want to change the defaults.
</P>
<P>To use the accelerated USER-CUDA and KOKKOS styles, the package
command is not required as defaults are assigned internally. You only
need to use the package command if you want to change the defaults.
</P>
<P>See <A HREF = "Section_accelerate.html">Section_accelerate</A> of the manual for
more details about using these various packages for accelerating
LAMMPS calculations.
</P>
<P>Note that the GPU package always sets the newton pair setting to
<I>off</I>; this is not the case for the USER-CUDA package.
</P>
<HR>
<P>The <I>cuda</I> style invokes settings associated with the use of the
USER-CUDA package.
</P>
<P>The <I>Ngpu</I> argument sets the number of GPUs per node. There must be
exactly one MPI task per GPU, as set by the mpirun or mpiexec command.
</P>
<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>
<P>The <I>gpuID</I> keyword allows selection of which GPUs on each node will
be used for a simulation. GPU IDs range from 0 to N-1 where N is the
physical number of GPUs/node. An ID is specified for each of the
Ngpus being used. For example if you have three GPUs on a machine,
one of which is used for the X-Server (the GPU with the ID 1) while
the others (with IDs 0 and 2) are used for computations you would
specify:
</P>
<PRE>package cuda 2 gpuID 0 2
</PRE>
<P>The purpose of the <I>gpuID</I> keyword is to allow two (or more)
simulations to be run on one workstation. In that case one could set
the first simulation to use GPU 0 and the second to use GPU 1. This is
not necessary however, if the GPUs are in what is called <I>compute
exclusive</I> mode. Using that setting, every process will get its own
GPU automatically. This <I>compute exclusive</I> mode can be set as root
using the <I>nvidia-smi</I> tool which is part of the CUDA installation.
</P>
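<P>As an illustrative sketch, two independent simulations could be run
on the workstation described above by giving each input script its own
GPU ID (the IDs shown assume GPU 1 is reserved for the X-Server):
</P>
<PRE>package cuda 1 gpuID 0    # input script of the first simulation
package cuda 1 gpuID 2    # input script of the second simulation
</PRE>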
<P>Also note that if the <I>gpuID</I> keyword is not used, the USER-CUDA
package sorts existing GPUs on each node according to their number of
multiprocessors. This way, compute GPUs will be prioritized over
X-Server GPUs.
</P>
<P>If the <I>timing</I> keyword is specified, detailed timing information for
various subroutines will be output.
</P>
<P>If the <I>test</I> keyword is specified, information for the specified atom
with atom-ID will be output at several points during each timestep.
This is mainly useful for debugging purposes. Note that the
simulation will slow down dramatically if this option is used.
</P>
<P>The <I>thread</I> keyword can be used to specify how GPU threads are
assigned work during pair style force evaluation. If the value =
<I>tpa</I>, one thread per atom is used. If the value = <I>bpa</I>, one block
per atom is used. If the value = <I>auto</I>, a short test is performed at
the beginning of each run to determine whether <I>tpa</I> or <I>bpa</I> mode is
faster. The result of this test is output. Since <I>auto</I> is the
default value, it is usually not necessary to use this keyword.
</P>
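<P>For example, to force one block per atom on a node with two GPUs,
one could specify the following (a sketch only; the default <I>auto</I>
setting is normally sufficient):
</P>
<PRE>package cuda 2 thread bpa
</PRE>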
<HR>
<P>The <I>gpu</I> style invokes settings associated with the use of
the GPU package.
</P>
<P>The <I>Ngpu</I> argument sets the number of GPUs per node. There must be
at least as many MPI tasks per node as GPUs, as set by the mpirun or
mpiexec command. If there are more MPI tasks (per node)
than GPUs, multiple MPI tasks will share each GPU.
</P>
<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>
<P>The <I>neigh</I> keyword specifies where neighbor lists for pair style
computation will be built. If <I>neigh</I> is <I>yes</I>, which is the default,
neighbor list building is performed on the GPU. If <I>neigh</I> is <I>no</I>,
neighbor list building is performed on the CPU. GPU neighbor list
building currently cannot be used with a triclinic box. GPU neighbor
list calculation currently cannot be used with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles. GPU neighbor lists are not
compatible with commands that are not GPU-enabled. When a non-GPU
enabled command requires a neighbor list, it will also be built on the
CPU. In these cases, it will typically be more efficient to only use
CPU neighbor list builds.
</P>
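<P>For example, a run with a triclinic box or a hybrid pair style could
switch neighbor list builds to the CPU with a command like the
following (Ngpu = 1 is illustrative):
</P>
<PRE>package gpu 1 neigh no
</PRE>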
<P>The <I>split</I> keyword can be used for load balancing force calculations
between CPU and GPU cores in GPU-enabled pair styles. If 0 < <I>split</I> <
1.0, a fixed fraction of particles is offloaded to the GPU while force
calculation for the other particles occurs simultaneously on the
CPU. If <I>split</I> < 0.0, the optimal fraction (based on CPU and GPU
timings) is calculated every 25 timesteps. If <I>split</I> = 1.0, all
force calculations for GPU accelerated pair styles are performed on
the GPU. In this case, other <A HREF = "pair_hybrid.html">hybrid</A> pair
interactions, <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> calculations can be performed on the
CPU while the GPU is performing force calculations for the GPU-enabled
pair style. If all CPU force computations complete before the GPU
completes, LAMMPS will block until the GPU has finished before
continuing the timestep.
</P>
<P>As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify
</P>
<PRE>mpirun -np 32 lmp_machine -sf gpu -in in.script # launch command
package gpu 2 split -1 # input script command
</PRE>
<P>In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.
</P>
<P>The <I>gpuID</I> keyword allows selection of which GPUs on each node will
be used for a simulation. The <I>first</I> and <I>last</I> values specify the
GPU IDs to use (from 0 to Ngpu-1). By default, first = 0 and last =
Ngpu-1, so that all GPUs are used, assuming Ngpu is set to the number
of physical GPUs. If you only wish to use a subset, set Ngpu to a
smaller number and first/last to a sub-range of the available GPUs.
</P>
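<P>As an illustration, on a node with 4 physical GPUs, the following
would restrict a run to the last two of them:
</P>
<PRE>package gpu 2 gpuID 2 3
</PRE>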
<P>The <I>tpa</I> keyword sets the number of GPU threads per atom used to
perform force calculations. With a default value of 1, the number of
threads will be chosen based on the pair style, however, the value can
be set explicitly with this keyword to fine-tune performance. For
large cutoffs or with a small number of particles per GPU, increasing
the value can improve performance. The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.
</P>
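<P>For example, a run with a large cutoff might try 8 threads per atom;
the value shown is only a starting point for tuning:
</P>
<PRE>package gpu 1 tpa 8
</PRE>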
<P>The <I>binsize</I> keyword sets the size of bins used to bin atoms in
neighbor list builds. Setting this value is normally not needed; the
optimal value is close to the default, which is set equal to the
cutoff distance for the short range interactions plus the neighbor
skin. Note that this is 2x larger than the default bin size for
neighbor list builds on the CPU. This is because GPUs can perform
efficiently with much larger cutoffs than CPUs. This can be used to
reduce the time required for long-range calculations or in some cases
to eliminate them with pair style models such as
<A HREF = "pair_coul.html">coul/wolf</A> or <A HREF = "pair_coul.html">coul/dsf</A>. For very
large cutoffs, it can be more efficient to use smaller values for
<I>binsize</I> in parallel simulations. For example, with a cutoff of
20*sigma in LJ <A HREF = "units.html">units</A> and a neighbor skin distance of
sigma, a <I>binsize</I> = 5.25*sigma can be more efficient than the
default.
</P>
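<P>For the example just given, in LJ units with sigma = 1.0, the smaller
bin size could be requested as follows (the value comes from the example
above and would normally be tuned for your system):
</P>
<PRE>package gpu 1 binsize 5.25
</PRE>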
<P>The <I>device</I> keyword can be used to tune parameters optimized for a
specific accelerator, when using OpenCL. For CUDA, the <I>device</I>
keyword is ignored. Currently, the device type is limited to NVIDIA
Kepler, NVIDIA Fermi, AMD Cypress, or a generic device. More devices
may be added later. The default device type can be specified when
building LAMMPS with the GPU library, via settings in the
lib/gpu/Makefile that is used.
</P>
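<P>For example, an OpenCL build running on NVIDIA Fermi hardware could
specify:
</P>
<PRE>package gpu 1 device fermi
</PRE>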
<HR>
<P>The <I>intel</I> style invokes options associated with the use of the
USER-INTEL package.
</P>
<P>The <I>Nthreads</I> argument allows one to explicitly set the number of
OpenMP threads to be allocated for each MPI process. An <I>Nthreads</I>
value of '*' instructs LAMMPS to use whatever is the default for the
given OpenMP environment. This is usually determined via the
OMP_NUM_THREADS environment variable or the compiler runtime.
</P>
<P>The <I>precision</I> argument determines the precision mode to use and can
take values of <I>single</I> (intel styles use single precision for all
calculations), <I>mixed</I> (intel styles use double precision for
accumulation and storage of forces, torques, energies, and virial
terms and single precision for everything else), or <I>double</I> (intel
styles use double precision for all calculations).
</P>
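<P>For example, the following would assign 2 OpenMP threads to each MPI
task on the host and run the intel styles in mixed precision (the
thread count is illustrative):
</P>
<PRE>package intel 2 mixed
</PRE>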
<P>Additional keyword-value pairs are available that are used to
determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is
built without offload support, these values are ignored. The
additional settings are as follows:
</P>
<P>The <I>balance</I> setting is used to set the fraction of work offloaded to
the coprocessor for an intel style (in the inclusive range 0.0 to
1.0). While this fraction of work is running on the coprocessor, other
calculations will run on the host, including neighbor and pair
calculations that are not offloaded, angle, bond, dihedral, kspace,
and some MPI communications. If the balance is set to -1, the fraction
of work is dynamically adjusted automatically throughout the run. This
can typically give performance within 5 to 10 percent of the optimal
fixed fraction.
</P>
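<P>For example, offloading a fixed 60 percent of the work to the
coprocessor (an illustrative fraction) versus letting LAMMPS adjust the
fraction dynamically would look like this:
</P>
<PRE>package intel 2 mixed balance 0.6
package intel 2 mixed balance -1
</PRE>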
<P>The <I>offload_cards</I> setting determines the number of coprocessors to
use on each node.
</P>
<P>Additional options for fine tuning performance with offload are as
follows:
</P>
<P>The <I>offload_ghost</I> setting determines whether or not ghost atoms,
atoms at the borders between MPI tasks, are offloaded for neighbor and
force calculations. When set to "0", ghost atoms are not offloaded.
This option can reduce the amount of data transfer with the
coprocessor and also can overlap MPI communication of forces with
computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
setting is "on". When set to "1", ghost atoms are offloaded. In some
cases this can provide better performance, especially if the offload
fraction is high.
</P>
<P>The <I>offload_tpc</I> option sets the maximum number of threads that will
run on each core of the coprocessor.
</P>
<P>The <I>offload_threads</I> option sets the maximum number of threads that
will be used on the coprocessor for each MPI task. This, along with
the <I>offload_tpc</I> setting, are the only methods for changing the
number of threads on the coprocessor. The OMP_NUM_THREADS keyword and
<I>Nthreads</I> options are only used for threads on the host.
</P>
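<P>As a sketch, a node with one coprocessor could combine these offload
settings as follows; the host thread count of 2 is illustrative and the
offload values simply mirror the package defaults listed below:
</P>
<PRE>package intel 2 mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240
</PRE>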
<HR>
<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>
<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. A value of <I>half/thread</I> uses a threadsafe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full neighbor list, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option, however that
can be a win because it is threadsafe and doesn't require atomic
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>
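<P>For example, a threaded CPU run could request the threadsafe
half-neighbor list, while a GPU run might prefer full lists:
</P>
<PRE>package kokkos neigh half/thread
package kokkos neigh full
</PRE>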
<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
communicating information between processors. "Exchange"
communication happens only on timesteps that neighbor lists are
rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. The data is for atom
coordinates and any other atom properties that need to be updated for
ghost atoms owned by each processor.
</P>
<P>The value options for these keywords are <I>no</I> or <I>host</I> or <I>device</I>.
A value of <I>no</I> means to use the standard non-KOKKOS method of
packing/unpacking data for the communication. A value of <I>host</I> means
to use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable use of
multiple threads to pack/unpack communicated data.
</P>
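<P>For example, a GPU run in which all styles used in the input script
are KOKKOS-enabled could keep the packing/unpacking on the device, as
sketched here:
</P>
<PRE>package kokkos neigh full comm/exchange device comm/forward device
</PRE>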
<HR>
<P>The <I>omp</I> style invokes settings associated with the use of the
USER-OMP package.
</P>
<P>The <I>Nthreads</I> argument sets the number of OpenMP threads allocated for
each MPI task. For example, if your system has nodes with dual
quad-core processors, it has a total of 8 cores per node. You could
use two MPI tasks per node (e.g. using the -ppn option of the mpirun
command), and set <I>Nthreads</I> = 4. This would use all 8 cores on each
node. Note that the product of MPI tasks * threads/task should not
exceed the physical number of cores (on a node), otherwise performance
will suffer.
</P>
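<P>For the dual quad-core node described above, this could look like the
following sketch (the executable name and mpirun options vary by
installation and MPI implementation):
</P>
<PRE>mpirun -np 2 lmp_machine -sf omp -in in.script    # 2 MPI tasks on one node
package omp 4                                     # input script command
</PRE>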
<P>Setting <I>Nthreads</I> = 0 instructs LAMMPS to use whatever value is the
default for the given OpenMP environment. This is usually determined
via the <I>OMP_NUM_THREADS</I> environment variable or the compiler
runtime. Note that in most cases the default for OpenMP capable
compilers is to use one thread for each available CPU core when
<I>OMP_NUM_THREADS</I> is not explicitly set, which can lead to poor
performance.
</P>
<P>Here are examples of how to set the environment variable when
launching LAMMPS:
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
<P>or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.
</P>
<P>Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes. The
2nd example line above is for MPICH; the 3rd example line with -x is
for OpenMPI. Check your MPI documentation for additional details.
</P>
<P>What combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your
input. Not all features of LAMMPS support OpenMP threading via the
USER-OMP package and the parallel efficiency can be very different,
too.
</P>
<P>Optional keyword/value pairs can also be specified. Each has a
default value as listed below.
</P>
<P>The <I>neigh</I> keyword specifies whether neighbor list building will be
multi-threaded in addition to force calculations. If <I>neigh</I> is set
to <I>no</I> then neighbor list calculation is performed only by MPI tasks
with no OpenMP threading. If <I>neigh</I> is <I>yes</I> (the default), a
multi-threaded neighbor list build is used. Using <I>neigh</I> = <I>yes</I> is
almost always faster and produces identical neighbor lists, at the
expense of using more memory. Specifically, neighbor list pages are
allocated for all threads at the same time and each thread works
within its own pages.
</P>
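<P>For example, a memory-constrained run could revert to a non-threaded
neighbor list build with:
</P>
<PRE>package omp 4 neigh no
</PRE>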
<HR>
<P><B>Restrictions:</B>
</P>
<P>This command cannot be used after the simulation box is defined by a
<A HREF = "read_data.html">read_data</A> or <A HREF = "create_box.html">create_box</A> command.
</P>
<P>The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The intel style of this command can only be invoked if LAMMPS was
built with the USER-INTEL package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The kokkos style of this command can only be invoked if LAMMPS was built
with the KOKKOS package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P><B>Related commands:</B>
</P>
<P><A HREF = "suffix.html">suffix</A>, "-pk" <A HREF = "Section_start.html#start_7">command-line
setting</A>
</P>
<P><B>Default:</B>
</P>
<P>To use the USER-CUDA package, the package command must be invoked
explicitly, either via the "-pk cuda" <A HREF = "Section_start.html#start_7">command-line
switch</A> or by invoking the package cuda
command in your input script. This will set the # of GPUs/node. The
option defaults are gpuID = 0 to Ngpu-1, timing not enabled, test not
enabled, and thread = auto.
</P>
<P>For the GPU package, the default is Ngpu = 1 and the option defaults
are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize =
pair cutoff + neighbor skin, device = not used. These settings are
made if the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>
is used. If it is not used, you must invoke the package gpu command
in your input script.
</P>
<P>The default settings for the USER-INTEL package are "package intel *
mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
The <I>offload_ghost</I> default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.
</P>
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
</P>
<P>For the USER-OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes. These settings are made if the "-sf omp"
<A HREF = "Section_start.html#start_7">command-line switch</A> is used. If it is
not used, you must invoke the package omp command in your input
script.
</P>
</HTML>