<HTML>
<CENTER><A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
</CENTER>
<HR>
<H3>package command
</H3>
<P><B>Syntax:</B>
</P>
<PRE>package style args
</PRE>
<UL><LI>style = <I>cuda</I> or <I>gpu</I> or <I>intel</I> or <I>kokkos</I> or <I>omp</I>
<LI>args = arguments specific to the style
<PRE> <I>cuda</I> args = keyword value ...
one or more keyword/value pairs may be appended
keywords = <I>gpu/node</I> or <I>gpu/node/special</I> or <I>timing</I> or <I>test</I> or <I>override/bpa</I>
<I>gpu/node</I> value = N
N = number of GPUs to be used per node
<I>gpu/node/special</I> values = N gpu1 .. gpuN
N = number of GPUs to be used per node
gpu1 .. gpuN = N IDs of the GPUs to use
<I>timing</I> values = none
<I>test</I> values = id
id = atom-ID of a test particle
<I>override/bpa</I> values = flag
flag = 0 for TpA algorithm, 1 for BpA algorithm
<I>gpu</I> args = mode first last split keyword value ...
mode = force or force/neigh
first = ID of first GPU to be used on each node
last = ID of last GPU to be used on each node
split = fraction of particles assigned to the GPU
zero or more keyword/value pairs may be appended
keywords = <I>threads_per_atom</I> or <I>cellsize</I> or <I>device</I>
<I>threads_per_atom</I> value = Nthreads
Nthreads = # of GPU threads used per atom
<I>cellsize</I> value = dist
dist = length (distance units) in each dimension for neighbor bins
<I>device</I> value = device_type
device_type = <I>kepler</I> or <I>fermi</I> or <I>cypress</I> or <I>generic</I>
<I>intel</I> args = Nthreads precision keyword value ...
Nthreads = # of OpenMP threads to associate with each MPI process on host
precision = <I>single</I> or <I>mixed</I> or <I>double</I>
keywords = <I>balance</I> or <I>offload_cards</I> or <I>offload_ghost</I> or <I>offload_tpc</I> or <I>offload_threads</I>
<I>balance</I> value = split
split = fraction of work to offload to coprocessor, -1 for dynamic
<I>offload_cards</I> value = ncops
ncops = number of coprocessors to use on each node
<I>offload_ghost</I> value = offload_type
offload_type = 1 to include ghost atoms for offload, 0 for local only
<I>offload_tpc</I> value = tpc
tpc = number of threads to use on each core of coprocessor
<I>offload_threads</I> value = tptask
tptask = max number of threads to use on coprocessor for each MPI task
<I>kokkos</I> args = keyword value ...
one or more keyword/value pairs may be appended
keywords = <I>neigh</I> or <I>comm/exchange</I> or <I>comm/forward</I>
<I>neigh</I> value = <I>full</I> or <I>half/thread</I> or <I>half</I> or <I>n2</I> or <I>full/cluster</I>
<I>comm/exchange</I> value = <I>no</I> or <I>host</I> or <I>device</I>
<I>comm/forward</I> value = <I>no</I> or <I>host</I> or <I>device</I>
<I>omp</I> args = Nthreads mode
Nthreads = # of OpenMP threads to associate with each MPI process
mode = force or force/neigh (optional)
</PRE>
</UL>
<P><B>Examples:</B>
</P>
<PRE>package gpu force 0 0 1.0
package gpu force 0 0 0.75
package gpu force/neigh 0 0 1.0
package gpu force/neigh 0 1 -1.0
package cuda gpu/node/special 2 0 2
package cuda test 3948
package kokkos neigh half/thread comm/forward device
package omp * force/neigh
package omp 4 force
package intel * mixed balance -1
</PRE>
<P><B>Description:</B>
</P>
<P>This command invokes package-specific settings. Currently the
following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and
USER-OMP.
</P>
<P>To use the accelerated GPU and USER-OMP styles, the use of the package
command is required. However, as described in the "Defaults" section
below, if you use the "-sf gpu" or "-sf omp" <A HREF = "Section_start.html#start_7">command-line
options</A> to enable use of these styles,
then default package settings are enabled. In that case you only need
to use the package command if you want to change the defaults.
</P>
<P>To use the accelerated USER-CUDA and KOKKOS styles, the package
command is not required as defaults are assigned internally. You only
need to use the package command if you want to change the defaults.
</P>
<P>See <A HREF = "Section_accelerate.html">Section_accelerate</A> of the manual for
more details about using these various packages for accelerating
LAMMPS calculations.
</P>
<HR>
<P>The <I>cuda</I> style invokes options associated with the use of the
USER-CUDA package.
</P>
<P>The <I>gpu/node</I> keyword specifies the number <I>N</I> of GPUs to be used on
each node. An MPI process with rank <I>K</I> will use the GPU (K mod N).
This implies that processes should be assigned with successive ranks
on each node, which is the default with most (or even all) MPI
implementations. The default value for <I>N</I> is 2.
</P>
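<P>For example, on a hypothetical node with four GPUs that should all be
used for computation, one could specify:
</P>
<PRE>package cuda gpu/node 4
</PRE>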
<P>The <I>gpu/node/special</I> keyword also specifies the number (N) of GPUs
to be used on each node, but allows more control over their
specification. An MPI process with rank <I>K</I> will use the GPU <I>gpuI</I>
with I = (K mod N) + 1. This implies that processes should be assigned
with successive ranks on each node, which is the default with most (or
even all) MPI implementations. For example, if you have three GPUs on
a machine, one of which is used for the X-Server (the GPU with ID 1)
while the others (with IDs 0 and 2) are used for computations, you
would specify:
</P>
<PRE>package cuda gpu/node/special 2 0 2
</PRE>
<P>A main purpose of the <I>gpu/node/special</I> option is to allow two (or
more) simulations to be run on one workstation. In that case one
would set the first simulation to use GPU 0 and the second to use GPU
1. This is not necessary though, if the GPUs are in what is called
<I>compute exclusive</I> mode. Using that setting, every process will get
its own GPU automatically. This <I>compute exclusive</I> mode can be set
as root using the <I>nvidia-smi</I> tool which is part of the CUDA
installation.
</P>
<P>Note that if the <I>gpu/node/special</I> keyword is not used, the USER-CUDA
package sorts existing GPUs on each node according to their number of
multiprocessors. This way, compute GPUs will be prioritized over
X-Server GPUs.
</P>
<P>Use of the <I>timing</I> keyword will output detailed timing information
for various subroutines.
</P>
<P>The <I>test</I> keyword will output info for the specified atom at
several points during each time step. This is mainly useful for
debugging purposes. Note that the simulation will be severely slowed
down if this option is used.
</P>
<P>The <I>override/bpa</I> keyword can be used to specify which mode is used
for pair-force evaluation. TpA = one thread per atom; BpA = one block
per atom. If this keyword is not used, a short test at the beginning of
each run will determine which method is more effective (the result of
this test is part of the LAMMPS output). Therefore it is usually not
necessary to use this keyword.
</P>
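<P>If, for instance, a previous run has already shown that the BpA
algorithm is faster for your system (a hypothetical scenario), the
test can be skipped by setting the flag explicitly:
</P>
<PRE>package cuda override/bpa 1
</PRE>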
<HR>
<P>The <I>gpu</I> style invokes options associated with the use of the GPU
package.
</P>
<P>The <I>mode</I> setting specifies where neighbor list calculations will be
performed. If <I>mode</I> is force, neighbor list calculation is performed
on the CPU. If <I>mode</I> is force/neigh, neighbor list calculation is
performed on the GPU. GPU neighbor list calculation currently cannot
be used with a triclinic box. GPU neighbor list calculation currently
cannot be used with <A HREF = "pair_hybrid.html">hybrid</A> pair styles. GPU
neighbor lists are not compatible with styles that are not
GPU-enabled. When a non-GPU enabled style requires a neighbor list,
it will also be built using CPU routines. In these cases, it will
typically be more efficient to only use CPU neighbor list builds.
</P>
<P>The <I>first</I> and <I>last</I> settings specify the GPUs that will be used for
simulation. On each node, the GPU IDs in the inclusive range from
<I>first</I> to <I>last</I> will be used.
</P>
<P>The <I>split</I> setting can be used for load balancing force calculation
work between CPU and GPU cores in GPU-enabled pair styles. If 0 <
<I>split</I> < 1.0, a fixed fraction of particles is offloaded to the GPU
while force calculation for the other particles occurs simultaneously
on the CPU. If <I>split</I> < 0, the optimal fraction (based on CPU and GPU
timings) is calculated every 25 timesteps. If <I>split</I> = 1.0, all force
calculations for GPU accelerated pair styles are performed on the
GPU. In this case, <A HREF = "pair_hybrid.html">hybrid</A>, <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations can be performed on the CPU while the GPU is performing
force calculations for the GPU-enabled pair style. If all CPU force
computations complete before the GPU, LAMMPS will block until the GPU
has finished before continuing the timestep.
</P>
<P>As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify
</P>
<PRE>package gpu force/neigh 0 1 -1
</PRE>
<P>In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.
</P>
<P>The <I>threads_per_atom</I> keyword allows control of the number of GPU
threads used per-atom to perform the short range force calculation.
By default, the value will be chosen based on the pair style, however,
the value can be set with this keyword to fine-tune performance. For
large cutoffs or with a small number of particles per GPU, increasing
the value can improve performance. The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.
</P>
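<P>For example, to request 4 GPU threads per atom with one GPU per node
(illustrative values only; the best setting depends on the pair style
and system), one could use:
</P>
<PRE>package gpu force/neigh 0 0 1.0 threads_per_atom 4
</PRE>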
<P>The <I>cellsize</I> keyword can be used to control the size of the cells used
for binning atoms in neighbor list calculations. Setting this value is
normally not needed; the optimal value is close to the default
(equal to the cutoff distance for the short range interactions
plus the neighbor skin). GPUs can perform efficiently with much larger cutoffs
than CPUs and this can be used to reduce the time required for long-range
calculations or in some cases to eliminate them with models such as
<A HREF = "pair_coul.html">coul/wolf</A> or <A HREF = "pair_coul.html">coul/dsf</A>. For very large cutoffs,
it can be more efficient to use smaller values for cellsize in parallel
simulations. For example, with a cutoff of 20*sigma and a neighbor skin of
sigma, a cellsize of 5.25*sigma can be efficient for parallel simulations.
</P>
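<P>As a sketch of the scenario above, assuming reduced units with sigma =
1.0 (the mode, GPU IDs, and split value shown are placeholders), the
smaller cell size could be requested with:
</P>
<PRE>package gpu force/neigh 0 0 1.0 cellsize 5.25
</PRE>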
<P>The <I>device</I> keyword can be used to tune parameters to optimize for a specific
accelerator when using OpenCL. For CUDA, the <I>device</I> keyword is ignored.
Currently, the device type is limited to NVIDIA Kepler, NVIDIA Fermi,
AMD Cypress, or a generic device. More devices will be added soon. The default
device type can be specified when building LAMMPS with the GPU library.
</P>
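<P>For example, when using an OpenCL build of the GPU library on an
NVIDIA Fermi card (the other settings shown are placeholders), the
tuning parameters could be selected with:
</P>
<PRE>package gpu force/neigh 0 0 1.0 device fermi
</PRE>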
<HR>
<P>The <I>intel</I> style invokes options associated with the use of the
USER-INTEL package.
</P>
<P>The <I>Nthreads</I> argument allows one to explicitly set the number of
OpenMP threads to be allocated for each MPI process. An <I>Nthreads</I>
value of '*' instructs LAMMPS to use whatever is the default for the
given OpenMP environment. This is usually determined via the
OMP_NUM_THREADS environment variable or the compiler runtime.
</P>
<P>The <I>precision</I> argument determines the precision mode to use and can
take values of <I>single</I> (intel styles use single precision for all
calculations), <I>mixed</I> (intel styles use double precision for
accumulation and storage of forces, torques, energies, and virial
terms and single precision for everything else), or <I>double</I> (intel
styles use double precision for all calculations).
</P>
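<P>For example, a minimal setting that assigns 2 OpenMP threads per MPI
process and runs the intel styles entirely in double precision
(illustrative values; the keywords described below can be appended as
needed) would be:
</P>
<PRE>package intel 2 double
</PRE>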
<P>Additional keyword-value pairs are available that are used to
determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is
built without offload support, these values are ignored. The
additional settings are as follows:
</P>
<P>The <I>balance</I> setting is used to set the fraction of work offloaded to
the coprocessor for an intel style (in the inclusive range 0.0 to
1.0). While this fraction of work is running on the coprocessor, other
calculations will run on the host, including neighbor and pair
calculations that are not offloaded, angle, bond, dihedral, kspace,
and some MPI communications. If the balance is set to -1, the fraction
of work is dynamically adjusted automatically throughout the run. This
can typically give performance within 5 to 10 percent of the optimal
fixed fraction.
</P>
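<P>For example, to offload a fixed 60 percent of the work to the
coprocessor (an illustrative split; dynamic balancing with -1 is often
a better starting point), one could use:
</P>
<PRE>package intel * mixed balance 0.6
</PRE>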
<P>The <I>offload_cards</I> setting determines the number of coprocessors to
use on each node.
</P>
<P>Additional options for fine tuning performance with offload are as
follows:
</P>
<P>The <I>offload_ghost</I> setting determines whether or not ghost atoms,
atoms at the borders between MPI tasks, are offloaded for neighbor and
force calculations. When set to "0", ghost atoms are not offloaded.
This option can reduce the amount of data transfer with the
coprocessor and also can overlap MPI communication of forces with
computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
setting is "on". When set to "1", ghost atoms are offloaded. In some
cases this can provide better performance, especially if the offload
fraction is high.
</P>
<P>The <I>offload_tpc</I> option sets the maximum number of threads that will
run on each core of the coprocessor.
</P>
<P>The <I>offload_threads</I> option sets the maximum number of threads that
will be used on the coprocessor for each MPI task. This, along with
the <I>offload_tpc</I> setting, are the only methods for changing the
number of threads on the coprocessor. The OMP_NUM_THREADS keyword and
<I>Nthreads</I> options are only used for threads on the host.
</P>
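<P>As a sketch combining these offload settings (the particular values
are illustrative and should be tuned for your hardware), one could
specify:
</P>
<PRE>package intel * mixed balance -1 offload_cards 1 offload_ghost 0 offload_tpc 4 offload_threads 240
</PRE>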
<HR>
<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>
<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. A value of <I>half/thread</I> uses a threadsafe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full neighbor list, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option; however, that
can be a win because it is threadsafe and doesn't require atomic
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>
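<P>For example, when running with OpenMP threads on a CPU, the threadsafe
half-neighbor list could be selected with:
</P>
<PRE>package kokkos neigh half/thread
</PRE>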
<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
communicating information between processors. "Exchange"
communication happens only on timesteps that neighbor lists are
rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. The data is for atom
coordinates and any other atom properties that need to be updated for
ghost atoms owned by each processor.
</P>
<P>The value options for these keywords are <I>no</I> or <I>host</I> or <I>device</I>.
A value of <I>no</I> means to use the standard non-KOKKOS method of
packing/unpacking data for the communication. A value of <I>host</I> means
to use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable use of
multiple threads to pack/unpack communicated data.
</P>
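<P>For example, on a GPU machine with GPU-aware MPI and an input script
that uses only KOKKOS-enabled styles (the scenario described above),
packing and unpacking can be kept on the device with:
</P>
<PRE>package kokkos comm/exchange device comm/forward device
</PRE>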
<HR>
<P>The <I>omp</I> style invokes options associated with the use of the
USER-OMP package.
</P>
<P>The first argument allows one to explicitly set the number of OpenMP
threads to be allocated for each MPI process. For example, if your
system has nodes with dual quad-core processors, it has a total of 8
cores per node. You could run MPI on 2 cores on each node (e.g. using
options for the mpirun command), and set the <I>Nthreads</I> setting to 4.
This would effectively use all 8 cores on each node, since each MPI
process would spawn 4 threads (one of which runs as part of the MPI
process itself).
</P>
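<P>As a sketch of the dual quad-core scenario above (the mpirun options
for placing 2 MPI tasks on each node differ between MPI
implementations and are not shown), the input script would contain:
</P>
<PRE>package omp 4
</PRE>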
<P>For performance reasons, you should not set <I>Nthreads</I> to more threads
than there are physical cores (per MPI task), but LAMMPS cannot check
for this.
</P>
<P>An <I>Nthreads</I> value of '*' instructs LAMMPS to use whatever is the
default for the given OpenMP environment. This is usually determined
via the <I>OMP_NUM_THREADS</I> environment variable or the compiler
runtime. Please note that in most cases the default for OpenMP
capable compilers is to use one thread for each available CPU core
when <I>OMP_NUM_THREADS</I> is not set, which can lead to extremely bad
performance.
</P>
<P>By default LAMMPS uses 1 thread per MPI task. If the environment
variable OMP_NUM_THREADS is set to a valid value, this value is used.
You can set this environment variable when you launch LAMMPS, e.g.
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
<P>or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.
</P>
<P>Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes. The
2nd line above is for MPICH; the 3rd line with -x is for OpenMPI.
Check your MPI documentation for additional details.
</P>
<P>You can also set the number of threads per MPI task via the <A HREF = "package.html">package
omp</A> command, which will override any OMP_NUM_THREADS
setting.
</P>
<P>Which combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your input.
Not all features of LAMMPS support OpenMP, and their parallel
efficiency can vary significantly as well.
</P>
<P>The <I>mode</I> setting specifies whether neighbor list calculations will be
multi-threaded as well. If <I>mode</I> is force, neighbor list calculation
is performed in serial. If <I>mode</I> is force/neigh, a multi-threaded
neighbor list build is used. Using the force/neigh setting is almost
always faster and should produce identical neighbor lists at the
expense of using some more memory (neighbor list pages are always
allocated for all threads at the same time and each thread works on
its own pages).
</P>
<HR>
<P><B>Restrictions:</B>
</P>
<P>This command cannot be used after the simulation box is defined by a
<A HREF = "read_data.html">read_data</A> or <A HREF = "create_box.html">create_box</A> command.
</P>
<P>The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The kokkos style of this command can only be invoked if LAMMPS was built
with the KOKKOS package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P>The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package. See the <A HREF = "Section_start.html#start_3">Making
LAMMPS</A> section for more info.
</P>
<P><B>Related commands:</B>
</P>
<P><A HREF = "suffix.html">suffix</A>
</P>
<P><B>Default:</B>
</P>
<P>The default settings for the USER-CUDA package are "package cuda
gpu/node 2". This is the case whether the "-sf cuda" <A HREF = "Section_start.html#start_7">command-line
switch</A> is used or not.
</P>
<P>If the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package gpu force/neigh 0 0 1" were
invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
</P>
<P>The default settings for the USER-INTEL package are "package intel *
mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
The <I>offload_ghost</I> default setting is determined by the intel style
being used. The value used is output to the screen in the offload
report at the end of each run.
</P>
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
</P>
<P>If the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package omp *" were invoked, to
specify default settings for the USER-OMP package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
</P>
</HTML>