lammps/doc/Section_accelerate.txt

"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc - "Next
Section"_Section_howto.html :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

5. Accelerating LAMMPS performance :h3

This section describes various methods for improving LAMMPS
performance for different classes of problems running on different
kinds of machines.

5.1 "Measuring performance"_#acc_1
5.2 "General strategies"_#acc_2
5.3 "Packages with optimized styles"_#acc_3
5.4 "OPT package"_#acc_4
5.5 "USER-OMP package"_#acc_5
5.6 "GPU package"_#acc_6
5.7 "USER-CUDA package"_#acc_7
5.8 "KOKKOS package"_#acc_8
5.9 "USER-INTEL package"_#acc_9
5.10 "Comparison of USER-CUDA, GPU, and KOKKOS packages"_#acc_10 :all(b)

:line
:line

5.1 Measuring performance :h4,link(acc_1)

Before trying to make your simulation run faster, you should
understand how it currently performs and where the bottlenecks are.

The best way to do this is to run your system (actual number of
atoms) for a modest number of timesteps (say 100, or a few 100 at
most) on several different processor counts, including a single
processor if possible.  Do this for an equilibrium version of your
system, so that the 100-step timings are representative of a much
longer run.  There is typically no need to run for 1000s of timesteps
to get accurate timings; you can simply extrapolate from short runs.

For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run.  "This
section"_Section_start.html#start_8 of the manual has an overview.

Running on one (or a few) processors should give a good estimate of
the serial performance and what portions of the timestep are taking
the most time.  Running the same problem on a few different processor
counts should give an estimate of parallel scalability.  I.e. if the
simulation runs 16x faster on 16 processors, it is 100% parallel
efficient; if it runs 8x faster on 16 processors, it is 50% efficient.

The most important data to look at in the timing info is the timing
breakdown and relative percentages.  For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time.  If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below.  Comparing how the percentages change as
you increase the processor count gives you a sense of how different
operations within the timestep are scaling.  Note that if you are
running with a Kspace solver, there is additional output on the
breakdown of the Kspace time.  For PPPM, this includes the fraction
spent on FFTs, which can be communication intensive.

Other important details in the timing info are the histograms of
atom counts and neighbor counts.  If these vary widely across
processors, you have a load-imbalance issue.  This often results in
inaccurate relative timing data, because processors have to wait when
communication occurs for other processors to catch up.  Thus the
reported times for "Communication" or "Other" may be higher than they
really are, due to load-imbalance.  If this is an issue, you can
uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile
LAMMPS, to obtain synchronized timings.
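
One simple way to quantify how widely the per-processor counts vary is
the ratio of the maximum count to the mean count, sketched below in
Python.  The function name and the sample counts are hypothetical; a
value of 1.0 means the counts are perfectly balanced, and larger values
indicate a larger imbalance.

```python
def imbalance_factor(counts):
    """Return max(counts) / mean(counts) for per-processor counts."""
    if not counts:
        raise ValueError("counts must not be empty")
    mean = sum(counts) / len(counts)
    if mean <= 0:
        raise ValueError("mean count must be positive")
    return max(counts) / mean


# Hypothetical atom counts on 4 processors (not measured data).
print(imbalance_factor([100, 100, 100, 100]))  # balanced
print(imbalance_factor([250, 50, 50, 50]))     # one overloaded processor
```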

:line

5.2 General strategies :h4,link(acc_2)

NOTE: this section is still a work in progress

Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
bottlenecks in the current performance, so let the timing data you
generate be your guide.  It is hard, if not impossible, to predict how
much difference these options will make, since it is a function of
problem size, number of processors used, and your machine.  There is
no substitute for identifying performance bottlenecks, and trying out
various options.

rRESPA
2-FFT PPPM
Staggered PPPM
single vs double PPPM
partial charge PPPM
verlet/split
processor mapping via processors numa command
load-balancing: balance and fix balance
processor command for layout
OMP when lots of cores :ul

2-FFT PPPM, also called {analytic differentiation} or {ad} PPPM, uses
2 FFTs instead of the 4 FFTs used by the default {ik differentiation}
PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to
achieve the same accuracy as 4-FFT PPPM. For problems where the FFT
cost is the performance bottleneck (typically large problems running
on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
  
Staggered PPPM performs calculations using two different meshes, one
shifted slightly with respect to the other.  This can reduce force
aliasing errors and increase the accuracy of the method, but also
doubles the amount of work required. For high relative accuracy, using
staggered PPPM allows one to halve the mesh size in each dimension as
compared to regular PPPM, which can give around a 4x speedup in the
kspace time. However, for low relative accuracy, using staggered PPPM
gives little benefit and can be up to 2x slower in the kspace
time. For example, the rhodopsin benchmark was run on a single
processor, and results for kspace time vs. relative accuracy for the
different methods are shown in the figure below.  For this system,
staggered PPPM (using ik differentiation) becomes useful when using a
relative accuracy of slightly greater than 1e-5 and above.

:c,image(JPG/rhodo_staggered.jpg)

IMPORTANT NOTE: Using staggered PPPM may not give the same increase in
accuracy of energy and pressure as it does in forces, so some caution
must be used if energy and/or pressure are quantities of interest,
such as when using a barostat.
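
The 4x and 2x figures quoted above follow from a simple cost model,
sketched here in Python.  The model assumes that kspace time is
proportional to the number of mesh points, which ignores the
logarithmic factor in FFT cost and any communication; the function
name and its parameters are hypothetical.

```python
def staggered_speedup(mesh_shrink_per_dim, dims=3, extra_work=2.0):
    """Estimated kspace speedup of staggered over regular PPPM.

    mesh_shrink_per_dim: factor by which the mesh shrinks in each
        dimension at equal accuracy (2.0 = halved, 1.0 = unchanged)
    extra_work: staggered PPPM uses two meshes, so the work doubles
    """
    if mesh_shrink_per_dim <= 0 or extra_work <= 0:
        raise ValueError("arguments must be positive")
    fewer_points = mesh_shrink_per_dim ** dims
    return fewer_points / extra_work


print(staggered_speedup(2.0))  # high accuracy: mesh halved per dimension
print(staggered_speedup(1.0))  # low accuracy: no mesh reduction possible
```

A result above 1 means staggered PPPM is estimated to be faster; a
result below 1 means it is estimated to be slower.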

:line

5.3 Packages with optimized styles :h4,link(acc_3)

Accelerated versions of various "pair_style"_pair_style.html,
"fixes"_fix.html, "computes"_compute.html, and other commands have
been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.

All of these commands are in "packages"_Section_packages.html.
Currently, there are 6 such packages in LAMMPS:

USER-CUDA: for NVIDIA GPUs
GPU: for NVIDIA GPUs as well as OpenCL support
USER-INTEL: for Intel CPUs and Intel Xeon Phi
KOKKOS: for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP: for OpenMP threading
OPT: generic CPU optimizations :ul

The accelerated styles have the same name as the standard styles,
except that a suffix is appended.  Otherwise, the syntax for each
command is identical, its functionality is the same, and the
numerical results it produces should also be identical, except for
precision and round-off issues.

For example, all of these styles are variants of the basic
Lennard-Jones pair style "pair_style lj/cut"_pair_lj.html:

"pair_style lj/cut/cuda"_pair_lj.html
"pair_style lj/cut/gpu"_pair_lj.html
"pair_style lj/cut/intel"_pair_lj.html
"pair_style lj/cut/kk"_pair_lj.html
"pair_style lj/cut/omp"_pair_lj.html
"pair_style lj/cut/opt"_pair_lj.html :ul

Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script.  Or you can use the "-suffix command-line
switch"_Section_start.html#start_7 to invoke the accelerated versions
automatically, without changing your input script.  The
"suffix"_suffix.html command allows you to set a suffix explicitly and
to turn off and back on the command-line switch setting, both from
within your input script.
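
The naming rule for accelerated styles can be sketched in Python as
follows.  The helper is hypothetical and only forms the name; whether a
given style actually has a variant with that suffix depends on the
style and on which packages LAMMPS was built with.

```python
# Suffixes listed in this section, one per accelerator package.
SUFFIXES = ("cuda", "gpu", "intel", "kk", "omp", "opt")


def accelerated_style(style, suffix):
    """Return the accelerated style name, e.g. lj/cut -> lj/cut/gpu."""
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return f"{style}/{suffix}"


for suffix in SUFFIXES:
    print("pair_style", accelerated_style("lj/cut", suffix))
```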

To see what styles are currently available in each of the accelerated
packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the
manual.  The doc page for each individual style (e.g. "pair
lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also lists any
accelerated variants available for that style.

Here is a brief summary of what the various packages provide.  Details
are in individual sections below.

Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up on a GPU depends on a variety of factors, as discussed
below.

Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision.  In extreme cases,
this can provide speedups over 3.5x on CPUs.  The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM)
coprocessors.  This can result in additional speedup over 2x depending
on the hardware configuration.

Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, on an NVIDIA GPU, or on an Intel(R) Xeon Phi(TM).
The speed-up depends on a variety of factors, as discussed below.

Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in multi-threaded mode using OpenMP.  This can
be useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processors or when many MPI tasks would
overload the available bandwidth for communication.

Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25% on a
CPU.

The following sections explain:

what hardware and software the accelerated package requires
how to build LAMMPS with the accelerated package
how to run an input script with the accelerated package
speed-ups to expect
guidelines for best performance
restrictions :ul

The final section compares and contrasts the GPU, USER-CUDA, and
KOKKOS packages, since they all allow for use of NVIDIA GPUs.

:line

5.4 OPT package :h4,link(acc_4)

The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies).  It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

Include the package and build LAMMPS.

make yes-opt
make machine :pre

No additional compile/link flags are needed in your low-level
src/MAKE/Makefile.machine.

[Running with the OPT package:]

You can explicitly add an "opt" suffix to the
"pair_style"_pair_style.html command in your input script:

pair_style lj/cut/opt 2.5 :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"opt" to styles that support it.

lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script :pre

[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end
of a run.  On most machines for reasonable problem sizes, it will be a
5 to 20% savings.

[Guidelines for best performance:]

None.  Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.

:line

5.5 USER-OMP package :h4,link(acc_5)

The USER-OMP package was developed by Axel Kohlmeyer at Temple
University.  It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles.  The package currently
uses the OpenMP interface for multi-threading.

[Required hardware/software:]

Your compiler must support the OpenMP interface.  You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

Include the package and build LAMMPS.  

make yes-user-omp
make machine :pre

Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP
support in both the CCFLAGS and LINKFLAGS variables.  For GNU and
Intel compilers, this flag is {-fopenmp}.  Without this flag the
USER-OMP styles will still be compiled and work, but will not support
multi-threading.

[Running with the USER-OMP package:]

You can explicitly add an "omp" suffix to any supported style in your
input script:

pair_style lj/cut/omp 2.5
fix nve/omp :pre

Or you can run with the -sf "command-line
switch"_Section_start.html#start_7, which will automatically append
"omp" to styles that support it.

lmp_machine -sf omp < in.script
mpirun -np 4 lmp_machine -sf omp < in.script :pre

You must also specify how many threads to use per MPI task.  There are
several ways to do this.  Note that the default value for this setting
in the OpenMP environment is 1 thread/task, which may give poor
performance.  Also note that the product of MPI tasks * threads/task
should not exceed the physical number of cores, otherwise performance
will suffer.

a) You can set an environment variable, either in your shell
or its start-up script:

setenv OMP_NUM_THREADS 4 (for csh or tcsh)
export OMP_NUM_THREADS=4 (for bash) :pre

This value will apply to all subsequent runs you perform.

b) You can set the same environment variable when you launch LAMMPS:

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre

All three examples use a total of 4 CPU cores.

Different MPI implementations have different ways of passing the
OMP_NUM_THREADS environment variable to all MPI processes.  Of the two
mpirun examples above, the first is for MPICH and the second is for
OpenMPI.  Check the documentation of your MPI installation for
additional details.

c) Use the "package omp"_package.html command near the top of your
script:

package omp 4 :pre
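
Whichever method you use, the rule that MPI tasks * threads/task should
not exceed the number of physical cores can be checked before
launching.  The Python sketch below is a hypothetical helper, not part
of LAMMPS, and the core count of 8 in the example is an assumed value
for illustration.

```python
def check_layout(mpi_tasks, threads_per_task, physical_cores):
    """Return (cores_used, fits) for an MPI x OpenMP layout."""
    if min(mpi_tasks, threads_per_task, physical_cores) < 1:
        raise ValueError("all arguments must be >= 1")
    cores_used = mpi_tasks * threads_per_task
    return cores_used, cores_used <= physical_cores


# 1 task x 4 threads, 2 tasks x 2 threads, and an oversubscribed case,
# all on an assumed 8-core node.
for tasks, threads in [(1, 4), (2, 2), (4, 4)]:
    used, fits = check_layout(tasks, threads, physical_cores=8)
    status = "ok" if fits else "oversubscribed"
    print(f"{tasks} tasks x {threads} threads = {used} cores: {status}")
```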

[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.  

You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread/MPI task,
versus running standard LAMMPS with its un-accelerated styles (in
serial or all-MPI parallelization with 1 task/core).  This is because
many of the USER-OMP styles contain similar optimizations to those
used in the OPT package, as described above.

With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.

A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task.  This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package.  The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following
circumstances:

Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster.  Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node.  For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers.  As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides.  This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out.  For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes.  The scaling of the "kspace
style"_kspace_style.html can become the performance-limiting
factor.  Using multi-threading allows fewer MPI tasks to be invoked and
can speed-up the long-range solver, while increasing overall
performance by parallelizing the pairwise and bonded calculations via
OpenMP.  Likewise, additional speedup can sometimes be achieved by
increasing the length of the Coulombic cutoff and thus reducing the
work done by the long-range solver. :l,ule

Other performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

Using OpenMP threading (as opposed to all-MPI parallelism) on
hyper-threading enabled cores is usually counter-productive (e.g. on
IBM BG/Q), as the cost in additional memory bandwidth requirements is
not offset by the gain in CPU utilization through
hyper-threading. :l,ule

[Restrictions:]

None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for "rRESPA integration"_run_style.html.
Only the rRESPA "pair" option is supported.

:line

5.6 GPU package :h4,link(acc_6)

The GPU package was developed by Mike Brown at ORNL and his
collaborators.  It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
Coulombics via the PPPM command.  It has the following features:

The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled with many cores of one or more
multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l

Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU(s) and GPU every timestep. :l

Neighbor lists can be constructed on the CPU or on the GPU. :l

The charge assignment and force interpolation portions of PPPM can be
run on the GPU.  The FFT portion, which requires MPI communication
between processors, runs on the CPU. :l

Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU. :l

It allows for GPU computations to be performed in single or double
precision, or in mixed-mode precision, where pairwise forces are
computed in single precision, but accumulated into double-precision
force vectors. :l

LAMMPS-specific code is in the GPU package.  It makes calls to a
generic GPU library in the lib/gpu directory.  This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware. :l,ule

[Hardware and software requirements:]

To use this package, you currently need to have an NVIDIA GPU and
install the NVIDIA CUDA software on your system:

Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/cards/0
Go to http://www.nvidia.com/object/cuda_get.html
Install a driver and toolkit appropriate for your system (SDK is not necessary)
Follow the instructions in lammps/lib/gpu/README to build the library (see below)
Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties :ul

[Building LAMMPS with the GPU package:]

As with other packages that include a separately compiled library, you
need to first build the GPU library, before building LAMMPS itself.
General instructions for doing this are in "this
section"_Section_start.html#start_3 of the manual.  For this package,
use a Makefile in lib/gpu appropriate for your system.

Before building the library, you can set the precision it will use by
editing the CUDA_PREC setting in the Makefile you are using, as
follows:

CUDA_PREC = -D_SINGLE_SINGLE  # Single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE  # Double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE  # Accumulation of forces, etc, in double :pre

The last setting is the mixed mode referred to above.  Note that your
GPU must support double precision to use either the 2nd or 3rd of
these settings.

To build the library, then type:

cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README) :pre

If you are successful, you will produce the file lib/libgpu.a.

Now you are ready to build LAMMPS with the GPU package installed:

cd lammps/src
make yes-gpu
make machine :pre

Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has
these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH.  These need to be
set appropriately to include the paths and settings for the CUDA
system software on your machine.  See src/MAKE/Makefile.g++ for an
example.

Also note that if you change the GPU library precision, you need to
re-build the entire library.  You should do a "clean" first,
e.g. "make -f Makefile.linux clean".  Then you must also re-build
LAMMPS if the library precision has changed, so that it re-links with
the new library.

[Running an input script:]

The examples/gpu and bench/GPU directories have scripts that can be
run with the GPU package, as well as detailed instructions on how to
run them.

The total number of MPI tasks used by LAMMPS (one or multiple per
compute node) is set in the usual manner via the mpirun or mpiexec
commands, and is independent of the GPU package.

When using the GPU package, you cannot assign more than one physical
GPU to an MPI task.  However multiple MPI tasks can share the same
GPU, and in many cases it will be more efficient to run this way.

Input script requirements to run using pair or PPPM styles with a
{gpu} suffix are as follows:

To invoke specific styles from the GPU package, either append "gpu" to
the style name (e.g. pair_style lj/cut/gpu), or use the "-suffix
command-line switch"_Section_start.html#start_7, or use the
"suffix"_suffix.html command in the input script. :ulb,l

The "newton pair"_newton.html setting in the input script must be
{off}. :l

Unless the "-suffix gpu command-line
switch"_Section_start.html#start_7 is used, the "package
gpu"_package.html command must be used near the beginning of the
script to control the GPU selection and initialization settings.  It
also has an option to enable asynchronous splitting of force
computations between the CPUs and GPUs. :l,ule

The default for the "package gpu"_package.html command is to have all
the MPI tasks on the compute node use a single GPU.  If you have
multiple GPUs per node, then be sure to create one or more MPI tasks
per GPU, and use the first/last settings in the "package
gpu"_package.html command to include all the GPU IDs on the node.
E.g. first = 0, last = 1, for 2 GPUs.  For example, on an 8-core 2-GPU
compute node, if you assign 8 MPI tasks to the node, the following
command in the input script

package gpu force/neigh 0 1 -1 :pre

would specify that each GPU is shared by 4 MPI tasks.  The final -1 will
dynamically balance force calculations across the CPU cores and GPUs.
I.e. each CPU core will perform force calculations for some small
fraction of the particles, at the same time the GPUs perform force
calculations for the majority of the particles.
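
The sharing arithmetic in this example can be sketched in Python.  The
helper below only counts how many MPI tasks end up on each GPU if the
tasks on a node are spread as evenly as possible; that even spread is
an assumption of the sketch, and the actual task-to-GPU assignment is
made by LAMMPS.  The function name is hypothetical.

```python
def tasks_per_gpu(mpi_tasks_per_node, first_gpu, last_gpu):
    """Return a list with the number of MPI tasks sharing each GPU."""
    ngpus = last_gpu - first_gpu + 1
    if ngpus < 1 or mpi_tasks_per_node < 1:
        raise ValueError("need at least one GPU and one MPI task")
    base, extra = divmod(mpi_tasks_per_node, ngpus)
    # The first `extra` GPUs take one additional task each.
    return [base + (1 if i < extra else 0) for i in range(ngpus)]


# 8 MPI tasks on a node with GPU IDs 0 and 1, as in the example above.
print(tasks_per_gpu(8, 0, 1))
```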

[Timing output:]

As described by the "package gpu"_package.html command, GPU
accelerated pair styles can perform computations asynchronously with
CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with "bond"_bond_style.html,
"angle"_angle_style.html, "dihedral"_dihedral_style.html,
"improper"_improper_style.html, and "long-range"_kspace_style.html
calculations will not be included in the "Pair" time.

When the {mode} setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time.  An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc) is output only to the LAMMPS
screen (not to the log file) at the end of each run.  These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations.

The output section "GPU Time Info (average)" reports "Max Mem / Proc".
This is the maximum memory used at one time on the GPU for data
storage by a single MPI process.

[Performance tips:]

You should experiment with how many MPI tasks per GPU to use to see
what gives the best performance for your problem.  This is a function
of your problem size and what pair style you are using.  Likewise, you
should also experiment with the precision setting for the GPU library
to see if single or mixed precision will give accurate results, since
they will typically be faster.

Using multiple MPI tasks per GPU will often give the best performance,
as allowed by most multi-core CPU/GPU configurations.

If the number of particles per MPI task is small (e.g. 100s of
particles), it can be more efficient to run with fewer MPI tasks per
GPU, even if you do not use all the cores on the compute node.

The "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS
web site gives GPU performance on a desktop machine and the Titan HPC
platform at ORNL for several of the LAMMPS benchmarks, as a function
of problem size and number of compute nodes.

:line

5.7 USER-CUDA package :h4,link(acc_7)

The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany.  It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
the PPPM command.  It has the following features:

The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU. :ulb,l

The speed-up advantage of this approach is typically better when the
number of atoms per GPU is large. :l

Data will stay on the GPU until a timestep where a non-GPU-ized fix or
compute is invoked.  Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently. :l

Neighbor lists for GPU-ized pair styles are constructed on the
GPU. :l

The package only supports use of a single CPU (core) with each
GPU. :l,ule

[Hardware and software requirements:]

To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system.

Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
help you to find out the Compute Capability of your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the
corresponding GPU drivers. The NVIDIA CUDA SDK is not required for
LAMMPSCUDA but we recommend it be installed.  You can then make sure
that its sample projects can be compiled without problems.

[Building LAMMPS with the USER-CUDA package:]

As with other packages that include a separately compiled library, you
need to first build the USER-CUDA library, before building LAMMPS
itself.  General instructions for doing this are in "this
section"_Section_start.html#start_3 of the manual.  For this package,
do the following, using settings in the lib/cuda Makefiles appropriate
for your system:

Go to the lammps/lib/cuda directory :ulb,l

If your {CUDA} toolkit is not installed in the default system directory
{/usr/local/cuda}, edit the file {lib/cuda/Makefile.common}
accordingly. :l

Type "make OPTIONS", where {OPTIONS} are one or more of the following
options. The settings will be written to the
{lib/cuda/Makefile.defaults} and used in the next step. :l

{precision=N} to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
{arch=M} to set GPU compute capability
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
{prec_timer=0/1} to use hi-precision timers
  0 = do not use them (default)
  1 = use these timers
  this is usually only useful for Mac machines 
{dbg=0/1} to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
{cufft=1} to determine usage of CUDA FFT library
  0 = no CUFFT support (default)
  in the future other CUDA-enabled FFT libraries might be supported :pre

Type "make" to build the library.  If you are successful, you will
produce the file lib/libcuda.a. :l,ule

Now you are ready to build LAMMPS with the USER-CUDA package installed:

cd lammps/src
make yes-user-cuda
make machine :pre

Note that the LAMMPS build references the lib/cuda/Makefile.common
file to extract specific CUDA settings.  So it is important
that you have first built the cuda library (in lib/cuda) using
settings appropriate to your system.

[Input script requirements:]

Additional input script requirements to run styles with a {cuda}
suffix are as follows:

The "-cuda on command-line switch"_Section_start.html#start_7 must be
used when launching LAMMPS to enable the USER-CUDA package. :ulb,l

To invoke specific styles from the USER-CUDA package, you can either
append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
the "-suffix command-line switch"_Section_start.html#start_7, or use
the "suffix"_suffix.html command.  One exception is that the
"kspace_style pppm/cuda"_kspace_style.html command has to be requested
explicitly. :l

To use the USER-CUDA package with its default settings, no additional
command is needed in your input script.  This is because when LAMMPS
starts up, it detects if it has been built with the USER-CUDA package.
See the "-cuda command-line switch"_Section_start.html#start_7 for
more details. :l

To change settings for the USER-CUDA package at run-time, the "package
cuda"_package.html command can be used near the beginning of your
input script.  See the "package"_package.html command doc page for
details. :l,ule

[Performance tips:]

The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of 1000s.

As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file).  The less often this
occurs, the faster your simulation will run.

:line

5.8 KOKKOS package :h4,link(acc_8)

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.

"Kokkos"_http://trilinos.sandia.gov/packages/kokkos is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).

Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists.  The layout is chosen to optimize
performance on different platforms.  Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed.  This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine.  Note that if you are running on a desktop machine, you
typically have one compute node.  On a cluster or supercomputer there
may be dozens or 1000s of compute nodes.  The procedure for building
and running with the Kokkos library is the same, no matter how many
nodes you run on.

All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine.  The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.

Kokkos provides support for one or two modes of execution per MPI
task.  This means that some computational tasks (pairwise
interactions, neighbor list builds, time integration, etc) are
parallelized in one or the other of the two modes.  The first mode is
called the "host" and is one or more threads running on one or more
physical CPUs (within the node).  Currently, both multi-core CPUs and
an Intel Phi processor (running in native mode) are supported.  The
second mode is called the "device" and is an accelerator chip of some
kind.  Currently only an NVIDIA GPU is supported.  If your compute
node does not have a GPU, then there is only one mode of execution,
i.e. the host and device are the same.

IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node.  In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.  Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models (in particular, relatively extensive C++
support is required for the kernel language).  This is expected to
change in the future.

Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
Note that the -np setting for the mpirun command in these examples is
for a run on a single node.  To scale these examples up to run on a
system with N compute nodes, simply multiply the -np setting by N.

All the build steps are performed from within the src directory.  All
the run steps are performed in the bench directory using the in.lj
input script.  It is assumed the LAMMPS executable has been copied to
that directory or whatever directory the runs are being performed in.
Details of the various options are discussed below.

[Compute node(s) = dual hex-core CPUs and no GPU:]

make yes-kokkos                           # install the KOKKOS package
make g++ OMP=yes                          # build with OpenMP, no CUDA :pre

mpirun -np 12 lmp_g++ < in.lj      # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk < in.lj      # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk < in.lj     # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj      # two MPI tasks, 6 threads/task :pre

[Compute node(s) = Intel Phi with 61 cores:]

make yes-kokkos
make g++ OMP=yes MIC=yes                  # build with OpenMP for Phi :pre

mpirun -np 12 lmp_g++ -k on t 20 -sf kk < in.lj      # 12*20 = 240 total cores
mpirun -np 15 lmp_g++ -k on t 16 -sf kk < in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk < in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk < in.lj :pre

[Compute node(s) = dual hex-core CPUs and a single GPU:]

make yes-kokkos
make cuda CUDA=yes             # build for GPU, use src/MAKE/Makefile.cuda :pre

mpirun -np 1 lmp_cuda -k on t 6 -sf kk < in.lj :pre

[Compute node(s) = dual 8-core CPUs and 2 GPUs:]

make yes-kokkos
make cuda CUDA=yes :pre

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk < in.lj     # use both GPUs, one per MPI task :pre

[Building LAMMPS with the KOKKOS package:]

A summary of the build process is given here.  More details and all
the available make variable options are given in "this
section"_Section_start.html#start_3_4 of the manual.

From the src directory, type

make yes-kokkos :pre

to include the KOKKOS package.  Then perform a normal LAMMPS build,
with additional make variable specifications to choose the host and
device you will run the resulting executable on, e.g.

make g++ OMP=yes
make cuda CUDA=yes :pre

As illustrated above, the most important variables to set are OMP,
CUDA, and MIC.  The default settings are OMP=yes, CUDA=no, MIC=no.
Setting OMP to {yes} will use OpenMP for threading on the host, as
well as on the device (if no GPU is present).  Setting CUDA to {yes}
will use one or more GPUs as the device.  Setting MIC=yes is necessary
when building for an Intel Phi processor.

Note that to use a GPU, you must use a low-level Makefile,
e.g. src/MAKE/Makefile.cuda as included in the LAMMPS distro, which
uses the NVIDIA "nvcc" compiler.  You must check that the CCFLAGS -arch
setting is appropriate for your NVIDIA hardware and installed
software.  Typical values for -arch are given in "this
section"_Section_start.html#start_3_4 of the manual, as well as other
settings that must be included in the low-level Makefile, if you create
your own.

[Input scripts and use of command-line switches -kokkos and -suffix:]

To use any Kokkos-enabled style provided in the KOKKOS package, you
must use a Kokkos-enabled atom style.  LAMMPS will give an error if
you do not do this.

There are two command-line switches relevant to using Kokkos, -k or
-kokkos, and -sf or -suffix.  They are described in detail in "this
section"_Section_start.html#start_7 of the manual.

Here are common options to use:

-k on : required to run any KOKKOS-enabled style :ulb,l

-sf kk : enables automatic use of Kokkos versions of atom, pair,
fix, compute styles if they exist.  This can also be done with more
precise control by using the "suffix"_suffix.html command or appending
"kk" to styles within the input script, e.g. "pair_style lj/cut/kk". :l

-k on t Nt : specifies how many threads per MPI task to use within a
 compute node.  For good performance, the product of MPI tasks *
 threads/task should not exceed the number of physical CPU or Intel
 Phi cores. :l

-k on g Ng : specifies how many GPUs per compute node are available.
The default is 1, so this should be specified if you have 2 or more
GPUs per compute node. :ule,l

[Use of package command options:]

Using the "package kokkos"_package.html command in an input script
allows choice of options for neighbor lists and communication.  See
the "package"_package.html command doc page for details and default
settings.

Experimenting with different styles of neighbor lists or inter-node
communication can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

Build with OMP=yes (the default) and CUDA=no (the default).

If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N.  Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
switch"_Section_start.html#start_7.  If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:
  
run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes, for nodes with dual
hex-core CPUs and no GPU, are shown above.
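
The comparison described above can be set up mechanically.  The
following is a minimal sketch, not part of LAMMPS: the helper name
kokkos_cpu_layouts is hypothetical, and the executable lmp_g++ and
input script in.lj are simply the names used in the examples of this
section.  It only prints candidate command lines, one for each way of
splitting N physical cores evenly into MPI tasks and threads/task; it
does not launch anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAMMPS): print one candidate mpirun
# line for every way of splitting N physical cores into
# (MPI tasks) x (threads per task) with no cores left over.
# lmp_g++ and in.lj follow the examples in this section; substitute
# your own executable and input script.
kokkos_cpu_layouts() {
    ncores=$1
    tasks=1
    while [ "$tasks" -le "$ncores" ]; do
        if [ $((ncores % tasks)) -eq 0 ]; then
            threads=$((ncores / tasks))
            echo "mpirun -np $tasks lmp_g++ -k on t $threads -sf kk < in.lj"
        fi
        tasks=$((tasks + 1))
    done
}

# Dual hex-core node: 12 physical cores
kokkos_cpu_layouts 12
```

Each printed line can then be timed on a short run to see which
combination of MPI tasks and threads performs best on your hardware.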

[Running on GPUs:]

Build with CUDA=yes, using src/MAKE/Makefile.cuda.  Ensure the setting
for CUDA_PATH in lib/kokkos/Makefile.lammps is correct for your CUDA
software installation.  Ensure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
"this section"_Section_start.html#start_3_4 of the manual for details).

The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. 

Use the "-kokkos command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task.  As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N.  With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value.  This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.

Examples of mpirun commands that follow these rules, for nodes with
one or two GPUs, are shown above.
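
The rules above (one MPI task per GPU, and MPI tasks * threads/task
not exceeding the physical cores) can be checked before launching.
This is a minimal sketch under stated assumptions: the helper name
kokkos_gpu_cmd is hypothetical and not part of LAMMPS, and lmp_cuda
and in.lj are the names used in the examples of this section.  It
prints a command line rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAMMPS):
#   kokkos_gpu_cmd CORES_PER_NODE GPUS_PER_NODE THREADS_PER_TASK
# Uses one MPI task per GPU and refuses layouts where
# tasks * threads/task would exceed the physical cores on the node.
kokkos_gpu_cmd() {
    ncores=$1
    ngpus=$2
    threads=$3
    if [ $((ngpus * threads)) -gt "$ncores" ]; then
        echo "error: $ngpus tasks * $threads threads exceeds $ncores cores" >&2
        return 1
    fi
    echo "mpirun -np $ngpus lmp_cuda -k on t $threads g $ngpus -sf kk < in.lj"
}

kokkos_gpu_cmd 16 2 8    # dual 8-core CPUs, 2 GPUs
kokkos_gpu_cmd 12 1 6    # dual hex-core CPUs, 1 GPU, one socket's cores
```

The second call deliberately uses only 6 of the 12 cores, reflecting
the note above that a single GPU may run faster when threads are kept
on one socket.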

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.

Build with OMP=yes (the default) and MIC=yes.  The latter
ensures code is correctly compiled for the Intel Phi.  The
OMP setting means OpenMP will be used for parallelization
on the Phi, which is currently the best option within
Kokkos.  In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores.  One
core should be excluded to run the OS, leaving 60 or 56 cores.  Each
core is hyperthreaded, so there are effectively N = 240 (4*60) or N =
224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI
tasks/node.  The "-k on t Nt" command-line switch sets the number of
threads/task as Nt.  The product of these 2 values should be N, i.e.
240 or 224.  Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.

Examples of mpirun commands that follow these rules, for Intel Phi
nodes with 61 cores, are shown above.
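
The arithmetic behind these rules can be sketched as a small check.
The helper names phi_threads and phi_check are hypothetical and not
part of LAMMPS; the sketch simply encodes the three rules stated
above: reserve one core for the OS, count 4 hardware threads per
remaining core, and require threads/task to be a multiple of 4.

```shell
#!/bin/sh
# Hypothetical helpers (not part of LAMMPS).

# Usable hardware threads on a Phi: reserve one core for the OS and
# count 4 hardware threads on each remaining core.
phi_threads() {
    echo $(( ($1 - 1) * 4 ))
}

# phi_check CORES MPI_TASKS THREADS_PER_TASK
# Prints "ok" when tasks * threads equals the usable thread count and
# threads/task is a multiple of 4; otherwise prints the reason.
phi_check() {
    total=$(phi_threads "$1")
    if [ $(( $3 % 4 )) -ne 0 ]; then
        echo "threads/task $3 is not a multiple of 4"
    elif [ $(( $2 * $3 )) -ne "$total" ]; then
        echo "$2 tasks * $3 threads != $total"
    else
        echo "ok"
    fi
}

phi_check 61 12 20    # matches "mpirun -np 12 ... -k on t 20"
phi_check 61 24 10    # product is 240, but 10 is not a multiple of 4
```

The second call shows why the product rule alone is not enough: 24
tasks with 10 threads each fills all 240 threads, but threads from
different MPI tasks would share physical cores.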

[Examples and benchmarks:]

The examples/kokkos and bench/KOKKOS directories have scripts that can
be run with the KOKKOS package, as well as detailed instructions on
how to run them.

IMPORTANT NOTE: the bench/KOKKOS directory does not yet exist.  It
will be added later.

[Additional performance issues:]

When using threads (OpenMP or pthreads), it is important for
performance to bind the threads to physical cores, so they do not
migrate during a simulation.  The same is true for MPI tasks, but the
default binding rules implemented by various MPI versions do not
account for thread binding.

Thus if you use more than one thread per MPI task, you should ensure
MPI tasks are bound to CPU sockets.  Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads.  With OpenMP 3.1 (gcc
4.7 or later, Intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient.  A typical mpirun command
should set these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled.  This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU.  Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package.  We plan to support this in the future, similar to the
GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package.  We
hope to support this in the future, similar to the GPU package in
LAMMPS.

:line

5.9 USER-INTEL package :h4,link(acc_9)


The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)
Xeon Phi(TM) coprocessors.  Additionally, it supports running
simulations in single, mixed, or double precision with vectorization,
even if a coprocessor is not present, i.e. on an Intel(R) CPU.  The same
C++ code is used for both cases.  When offloading to a coprocessor,
the routine is run twice, once with an offload flag.

The USER-INTEL package can be used in tandem with the USER-OMP
package.  This is useful when a USER-INTEL pair style is used, so that
other styles not supported by the USER-INTEL package, e.g. for bond,
angle, dihedral, improper, and long-range electrostatics can be run
with the USER-OMP package versions.  If you have built LAMMPS with
both the USER-INTEL and USER-OMP packages, then this mode of operation
is made easier, because the "-suffix intel" "command-line
switch"_Section_start.html#start_7 and the "suffix
intel"_suffix.html command will both set a second-choice suffix to
"omp" so that styles from the USER-OMP package will be used if
available.

[Building LAMMPS with the USER-INTEL package:]

The procedure for building LAMMPS with the USER-INTEL package is
simple.  You have to edit your machine specific makefile to add the
flags to enable OpenMP support ({-openmp}) to both the CCFLAGS and
LINKFLAGS variables.  You also need to add -DLAMMPS_MEMALIGN=64 and
-restrict to CCFLAGS.

Note that currently you must use the Intel C++ compiler (icc/icpc) to
build the package.  In the future, using other compilers (e.g. g++)
may be possible.

If you are compiling on the same architecture that will be used for
the runs, adding the flag {-xHost} will enable vectorization with the
Intel(R) compiler.  In order to build with support for an Intel(R)
coprocessor, the flag {-offload} should be added to the LINKFLAGS line
and the flag {-DLMP_INTEL_OFFLOAD} should be added to the CCFLAGS
line.

The files src/MAKE/Makefile.intel and src/MAKE/Makefile.intel_offload
are included in the src/MAKE directory with options that perform well
with the Intel(R) compiler. The latter Makefile has support for offload
to coprocessors and the former does not.

It is recommended that Intel(R) Compiler 2013 SP1 update 1 be used for
compiling. Newer versions have some performance issues that are being
addressed. If using Intel(R) MPI, version 5 or higher is recommended.

The rest of the compilation is the same as for any other package that
has no additional library dependencies, e.g.

make yes-user-intel yes-user-omp
make machine :pre

[Running an input script:]

The examples/intel directory has scripts that can be run with the
USER-INTEL package, as well as detailed instructions on how to run
them.

The total number of MPI tasks used by LAMMPS (one or multiple per
compute node) is set in the usual manner via the mpirun or mpiexec
commands, and is independent of the USER-INTEL package.

Input script requirements to run using pair styles with an {intel}
suffix are as follows:

To invoke specific styles from the USER-INTEL package, either append
"intel" to the style name (e.g. pair_style lj/cut/intel), or use the
"-suffix command-line switch"_Section_start.html#start_7, or use the
"suffix"_suffix.html command in the input script.

Unless the "-suffix intel command-line
switch"_Section_start.html#start_7 is used, a "package
intel"_package.html command must be used near the beginning of the
input script.  The default precision mode for the USER-INTEL package
is {mixed}, meaning that accumulation is performed in double precision
and other calculations are performed in single precision.  In order to
use all single or all double precision, the "package
intel"_package.html command must be used in the input script with a
"single" or "double" keyword specified.

[Running with an Intel(R) coprocessor:]

The USER-INTEL package supports offload of a fraction of the work to
Intel(R) Xeon Phi(TM) coprocessors.  This is accomplished by setting a
balance fraction on the "package intel"_package.html command. A
balance of 0 runs all calculations on the CPU.  A balance of 1 runs
all calculations on the coprocessor.  A balance of 0.5 runs half of
the calculations on the coprocessor.  Setting the balance to -1 will
enable dynamic load balancing that continuously adjusts the fraction of
offloaded work throughout the simulation.  This option typically
produces results within 5 to 10 percent of the optimal fixed balance.
By default, using the "suffix"_suffix.html command or "-suffix
command-line switch"_Section_start.html#start_7 will use offload to a
coprocessor with the balance set to -1.  If LAMMPS is built without
offload support, this setting is ignored.

If one is running short benchmark runs with dynamic load balancing,
adding a short warm-up run (10-20 steps) will allow the load-balancer
to find a setting that will carry over to additional runs.

The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
coprocessor.  In general, running with a large number of MPI tasks on
each node will perform best with offload.  Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor.  For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor.  Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keywords to the
"package intel"_package.html command.

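The thread count in the example above follows from dividing the
hardware threads on the card evenly among the MPI tasks on the node.
A minimal sketch of that arithmetic is shown here; the helper name
offload_threads_per_task is hypothetical and not part of LAMMPS, and
the even split with integer division is an assumption based on the
example in the text, not a statement of how the package assigns
threads internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAMMPS):
#   offload_threads_per_task CORES_FOR_OFFLOAD THREADS_PER_CORE MPI_TASKS
# Assumes the hardware threads on the card are shared evenly among the
# MPI tasks on the node; integer division drops any remainder.
offload_threads_per_task() {
    echo $(( $1 * $2 / $3 ))
}

# 60 cores available for offload, 4 hardware threads each, 24 MPI tasks
offload_threads_per_task 60 4 24
```

For the 61-core card described above this prints 10, the per-task
subset quoted in the text.
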
If LAMMPS is using offload to an Intel(R) Xeon Phi(TM) coprocessor, a
diagnostic line during the setup for a run is printed to the screen
(not to log files) indicating that offload is being used and the
number of coprocessor threads per MPI task.  Additionally, an offload
timing summary is printed at the end of each run.  When using offload,
the "sort"_atom_modify.html frequency for atom data is changed to 1 so
that the per-atom data is sorted every neighbor build.

To use multiple coprocessors on each compute node, the
{offload_cards} keyword can be specified with the "package
intel"_package.html command to specify the number of coprocessors to
use.

For simulations with long-range electrostatics or bond, angle,
dihedral, improper calculations, computation and data transfer to the
coprocessor will run concurrently with computations and MPI
communications for these routines on the host.  The USER-INTEL package
has two modes for deciding which atoms will be handled by the
coprocessor.  The setting is controlled with the "offload_ghost"
option.  When set to 0, ghost atoms (atoms at the borders between MPI
tasks) are not offloaded to the card.  This allows for overlap of MPI
communication of forces with computation on the coprocessor when the
"newton"_newton.html setting is "on".  The default is dependent on the
style being used; however, better performance might be achieved by
setting this explicitly.

In order to control the number of OpenMP threads used on the host, the
OMP_NUM_THREADS environment variable should be set. This variable will
not influence the number of threads used on the coprocessor.  Only the
"package intel"_package.html command can be used to control thread
counts on the coprocessor.

[Restrictions:]

When using offload, "hybrid"_pair_hybrid.html styles that require skip
lists for neighbor builds cannot be offloaded to the coprocessor.
Using "hybrid/overlay"_pair_hybrid.html is allowed.  Only one intel
accelerated style may be used with hybrid styles.  Exclusion lists are
not currently supported with offload, however, the same effect can
often be accomplished by setting cutoffs for excluded atom types to 0.
None of the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command.

:line

5.10 Comparison of GPU and USER-CUDA packages :h4,link(acc_10)

Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.

As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other.  We give guidelines below,
but the best way to determine which package is faster for your input
script is to try both of them on your machine.  See the benchmarking
section below for examples where this has been done.

[Guidelines for using each package optimally:]

The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
package does not allow this; you can only use one CPU per GPU. :ulb,l

The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep.  The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized).  Hence, if you
can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package. :l

The GPU package is often faster than the USER-CUDA package, if the
number of atoms per GPU is "small".  The crossover point, in terms of
atoms/GPU at which the USER-CUDA package becomes faster depends
strongly on the pair style.  For example, for a simple Lennard-Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU.  When performing double precision calculations the
crossover point can be significantly smaller. :l

Both packages compute bonded interactions (bonds, angles, etc) on the
CPU.  This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep.  If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster. :l

When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
and the GPU.  Hence its performance is affected if the full 16 PCIe
lanes are not available for each GPU.  In HPC environments this can be the
case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot.  Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots. :l,ule

[Differences between the two packages:]

The GPU package accelerates only pair force, neighbor list, and PPPM
calculations.  The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations. :ulb,l

The USER-CUDA package does not support acceleration for minimization. :l

The USER-CUDA package does not support hybrid pair styles. :l

The USER-CUDA package can order atoms in the neighbor list differently
from run to run resulting in a different order for force accumulation. :l

The USER-CUDA package has a limit on the number of atom types that can be
used in a simulation. :l

The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists or a triclinic simulation box. :l

The GPU package uses more GPU memory than the USER-CUDA package.  This
is generally not a problem since typical runs are computation-limited
rather than memory-limited. :l,ule

[Examples:]

The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.

lammps/examples/gpu = GPU package files
lammps/examples/USER/cuda = USER-CUDA package files :ul

These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.