<HTML>
<CENTER><A HREF = "Section_python.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_errors.html">Next
Section</A>
</CENTER>
<HR>
<H3>10. Using accelerated CPU and GPU styles
</H3>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS. These will typically run faster than the
standard non-accelerated versions, provided you have the appropriate
hardware on your system.
</P>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
numerical results they produce should also be identical, except for
precision and round-off issues.
</P>
<P>For example, all of these variants of the basic Lennard-Jones pair
style exist in LAMMPS:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
</UL>
<P>Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script. Or you can use the <A HREF = "Section_start.html#2_6">-suffix command-line
switch</A> to invoke the accelerated versions
automatically, without changing your input script. The
<A HREF = "suffix.html">suffix</A> command allows you to set a suffix explicitly and
to turn the command-line switch setting off or on, both from within your
input script.
</P>
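<P>For example, here is a minimal sketch (assuming a LAMMPS build that
includes the OPT package; the cutoff value is illustrative) of selecting
an accelerated pair style explicitly versus via the
<A HREF = "suffix.html">suffix</A> command:
</P>
<PRE># explicit accelerated style name
pair_style lj/cut/opt 2.5 

# or: keep the generic style name and set the suffix from the script
suffix     opt
pair_style lj/cut 2.5
suffix     off        # subsequent styles use the standard versions again
</PRE>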
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed-up the pairwise calculations of your simulation by 5-25%.
</P>
<P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#3_5">this section</A> of the manual.
A list of accelerated styles is included in the pair, fix, compute,
and kspace sections.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to build LAMMPS with the accelerated packages in place
<LI>what changes (if any) are needed in your input scripts
<LI>guidelines for best performance
<LI>speed-ups you can expect
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA GPU hardware.
</P>
10.1 <A HREF = "#10_1">OPT package</A><BR>
10.2 <A HREF = "#10_2">GPU package</A><BR>
10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
10.4 <A HREF = "#10_4">Comparison of GPU and USER-CUDA packages</A> <BR>
<HR>
<HR>
<H4><A NAME = "10_1"></A>10.1 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles,
you can run it as follows:
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
</P>
<HR>
<HR>
<H4><A NAME = "10_2"></A>10.2 GPU package
</H4>
<P>The GPU package was developed by Mike Brown at ORNL. It provides GPU
versions of several pair styles and for long-range Coulombics via the
PPPM command. It has the following features:
</P>
<UL><LI>The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled with one or more multi-core CPUs
within a node of a parallel machine.
<LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU and GPU every timestep.
<LI>Neighbor lists can be constructed on the CPU or on the GPU,
controlled by the <A HREF = "fix_gpu.html">fix gpu</A> command.
<LI>The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU.
<LI>Asynchronous force computations can be performed simultaneously on
the CPU and GPU.
<LI>LAMMPS-specific code is in the GPU package. It makes calls to a more
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on other GPU
hardware.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in lammps/lib/gpu/README to build the library (also see below)
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<P><B>Building LAMMPS with the GPU package:</B>
</P>
<P>As with other packages that link with a separately compiled library,
you need to first build the GPU library before building LAMMPS
itself. General instructions for doing this are in <A HREF = "Section_start.html#2_3">this
section</A> of the manual. For this package,
do the following, using a Makefile appropriate for your system:
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README)
</PRE>
<P>If you are successful, you will produce the file lib/libgpu.a.
</P>
<P>Now you are ready to build LAMMPS with the GPU package installed:
</P>
<PRE>cd lammps/src
make yes-gpu
make machine
</PRE>
<P>Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has
these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH. These need to be
set appropriately to include the paths and settings for the CUDA
system software on your machine. See src/MAKE/Makefile.g++ for an
example.
</P>
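<P>As a sketch, these settings might look like the following when the CUDA
toolkit is installed in its default location (the library names and path
shown are assumptions for a typical Linux install; compare with
src/MAKE/Makefile.g++ for your machine):
</P>
<PRE>gpu_SYSINC =
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/usr/local/cuda/lib64
</PRE>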
<P><B>GPU configuration</B>
</P>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process, which is an MPI process running (typically) on a single core
or processor. Multiple processes can share a single GPU and in many
cases it will be more efficient to run with multiple processes per
GPU.
</P>
<P><B>Input script requirements:</B>
</P>
<P>Additional input script requirements to run styles with a <I>gpu</I> suffix
are as follows.
</P>
<P>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I> and the <A HREF = "fix_gpu.html">fix
gpu</A> command must be used. To invoke specific styles
from the GPU package, you can either append "gpu" to the style name
(e.g. pair_style lj/cut/gpu), or use the <A HREF = "Section_start.html#2_6">-suffix command-line
switch</A>, or use the <A HREF = "suffix.html">suffix</A>
command.
</P>
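<P>Taken together, a minimal sketch of these requirements in an input
script might look like the following (the gpu fix arguments, which are
explained below, assume a single GPU per node; the pair coefficients are
illustrative):
</P>
<PRE>newton     off                        # the newton pair setting must be off
fix        0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5
pair_coeff * * 1.0 1.0
</PRE>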
<P>The <A HREF = "fix_gpu.html">fix gpu</A> command controls the GPU selection and
initialization steps.
</P>
<P>The format for the fix is:
</P>
<PRE>fix fix-ID all gpu <I>mode</I> <I>first</I> <I>last</I> <I>split</I>
</PRE>
<P>where fix-ID is the name for the fix. The gpu fix must be the first
fix specified for a given run, otherwise LAMMPS will exit with an
error. The gpu fix does not have any effect on runs that do not use
GPU acceleration, so there should be no problem specifying the fix
first in any input script.
</P>
<P>The <I>mode</I> setting can be either "force" or "force/neigh". In the
former, neighbor list calculation is performed on the CPU using the
standard LAMMPS routines. In the latter, the neighbor list calculation
is performed on the GPU. The GPU neighbor list can give better
performance; however, it cannot be used with a triclinic box or with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles.
</P>
<P>There are cases when it may be more efficient to select the CPU for
neighbor list builds. If a non-GPU enabled style (e.g. a fix or
compute) requires a neighbor list, it will also be built using CPU
routines. Redundant CPU and GPU neighbor list calculations will
typically be less efficient.
</P>
<P>The <I>first</I> setting is the ID (as reported by
lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on
each node. The <I>last</I> setting is the ID of the last GPU that will be
used on each node. If you have only one GPU per node, <I>first</I> and
<I>last</I> will typically both be 0. Selecting a non-sequential set of GPU
IDs (e.g. 0,1,3) is not currently supported.
</P>
<P>The <I>split</I> setting is the fraction of particles whose forces,
torques, energies, and/or virials will be calculated on the GPU. This
can be used to perform CPU and GPU force calculations simultaneously,
e.g. on a hybrid node with a multicore CPU and one or more GPUs. If <I>split</I>
is negative, the software will attempt to calculate the optimal
fraction automatically every 25 timesteps based on CPU and GPU
timings. Because the GPU speedups are dependent on the number of
particles, automatic calculation of the split can be less efficient,
but typically results in loop times within 20% of an optimal fixed
split.
</P>
<P>As an example, if you have two GPUs per node, 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, the fix might be
</P>
<PRE>fix 0 all gpu force/neigh 0 1 -1
</PRE>
<P>In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.
</P>
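<P>As a sketch of how such a run might be launched (assuming your MPI
launcher places 8 processes on each of the 4 nodes and the input script
contains the gpu fix above):
</P>
<PRE>mpirun -np 32 lmp_machine -sf gpu < in.script
</PRE>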
<P><B>Asynchronous pair computation on GPU and CPU</B>
</P>
<P>The GPU accelerated pair styles can perform pair style force
calculation on the GPU at the same time other force calculations
within LAMMPS are being performed on the CPU. These include pair,
bond, angle, etc forces as well as long-range Coulombic forces. This
is enabled by the <I>split</I> setting in the gpu fix as described above.
</P>
<P>With a <I>split</I> setting less than 1.0, a portion of the pair-wise force
calculations will also be performed on the CPU. When the CPU finishes
its pair style computations (if any), the next LAMMPS force
computation will begin (bond, angle, etc), possibly before the GPU has
finished its pair style computations.
</P>
<P>This means that if <I>split</I> is set to 1.0, the CPU performs no
pair-wise force calculations for the GPU-accelerated style and can begin
the next LAMMPS force computation immediately. This can be used to run a
<A HREF = "pair_hybrid.html">hybrid</A> GPU pair style at the same time as a hybrid
CPU pair style. In this case, the GPU pair style should be listed first in
the hybrid command in order to perform simultaneous calculations. This
also allows <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> force computations to run
simultaneously with the GPU pair style. If all CPU force computations
complete before the GPU, LAMMPS will block until the GPU has finished
before continuing the timestep.
</P>
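<P>For example, here is a sketch of a hybrid setup that overlaps a GPU pair
style with a CPU pair style (the sub-styles and coefficients are
hypothetical; note the <I>force</I> mode, since GPU neighbor list builds
cannot be used with hybrid pair styles):
</P>
<PRE>fix        0 all gpu force 0 0 1.0          # split = 1.0
pair_style hybrid lj/cut/gpu 2.5 yukawa 2.0 2.5
pair_coeff 1 1 lj/cut/gpu 1.0 1.0
pair_coeff 2 2 yukawa 1.0
pair_coeff 1 2 yukawa 1.0
</PRE>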
<P><B>Timing output:</B>
</P>
<P>As noted above, GPU accelerated pair styles can perform computations
asynchronously with CPU computations. The "Pair" time reported by
LAMMPS will be the maximum of the time required to complete the CPU
pair style computations and the time required to complete the GPU pair
style computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When the <I>mode</I> setting for the gpu fix is force/neigh, the time for
neighbor list calculations on the GPU will be added into the "Pair"
time, not the "Neigh" time. An additional breakdown of the times
required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc) are output only with the LAMMPS
screen output (not in the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations.
</P>
<P><B>Performance tips:</B>
</P>
<P>Because of the large number of cores within each GPU device, it may be
more efficient to run with fewer MPI processes per GPU when the number
of particles per MPI process is small (hundreds of particles); giving
each process more work can be necessary to keep the GPU cores busy.
</P>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision. The
latter requires that your GPU card support double precision.
</P>
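<P>As a sketch only (the variable name and values shown are an assumption
based on typical lib/gpu Makefiles of this vintage; the README is
authoritative for your version), the precision is selected by a setting
in the library Makefile before it is built:
</P>
<PRE># in lib/gpu/Makefile.linux (assumed), pick one:
CUDA_PRECISION = -D_SINGLE_SINGLE    # single precision throughout
# CUDA_PRECISION = -D_SINGLE_DOUBLE  # mixed precision
# CUDA_PRECISION = -D_DOUBLE_DOUBLE  # double precision (needs a double-capable GPU)
</PRE>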
<HR>
<HR>
<H4><A NAME = "10_3"></A>10.3 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at the Ilmenau
University of Technology in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
the PPPM command. It has the following features:
</P>
<UL><LI>The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU.
<LI>The calculation stays on the GPU until a timestep when a non-GPU-ized
fix or compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently.
<LI>Neighbor lists for GPU-ized pair styles are constructed on the
GPU.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
</P>
<P>Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This
list may help you find out the Compute Capability of your card:
</P>
<P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
</P>
<P>Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the
corresponding GPU drivers. The NVIDIA CUDA SDK is not required by the
USER-CUDA package, but we recommend installing it so you can verify
that its sample projects compile and run without problems.
</P>
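<P>A quick sanity check of the toolkit and driver installation from the
shell (both commands should report version information):
</P>
<PRE>nvcc --version                     # CUDA toolkit compiler
cat /proc/driver/nvidia/version    # installed NVIDIA driver
</PRE>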
<P><B>Building LAMMPS with the USER-CUDA package:</B>
</P>
<P>As with other packages that link with a separately compiled library,
you need to first build the USER-CUDA library before building LAMMPS
itself. General instructions for doing this are in <A HREF = "Section_start.html#2_3">this
section</A> of the manual. For this package,
do the following, using a Makefile appropriate for your system:
</P>
<UL><LI>If your <I>CUDA</I> toolkit is not installed in the default system directory
<I>/usr/local/cuda</I>, edit the file <I>lib/cuda/Makefile.common</I>
accordingly.
<LI>Go to the lammps/lib/cuda directory
<LI>Type "make OPTIONS", where <I>OPTIONS</I> are one or more of the following
options. The settings will be written to the
<I>lib/cuda/Makefile.defaults</I> and used in the next step.
<PRE><I>precision=N</I> to set the precision level
N = 1 for single precision (default)
N = 2 for double precision
N = 3 for positions in double precision
N = 4 for positions and velocities in double precision
<I>arch=M</I> to set GPU compute capability
M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
<I>prec_timer=0/1</I> to use hi-precision timers
0 = do not use them (default)
1 = use these timers
this is usually only useful for Mac machines
<I>dbg=0/1</I> to activate debug mode
0 = no debug mode (default)
1 = yes debug mode
this is only useful for developers
<I>cufft=0/1</I> to select use of the CUDA FFT library (CUFFT)
0 = no CUFFT support (default)
1 = use CUFFT
in the future other CUDA-enabled FFT libraries might be supported
</PRE>
<LI>Type "make" to build the library. If you are successful, you will
produce the file lib/libcuda.a.
</UL>
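<P>For example, to configure the library for double precision on a CC 2.0
card and then build it:
</P>
<PRE>cd lammps/lib/cuda
make precision=2 arch=20
make
</PRE>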
<P>Now you are ready to build LAMMPS with the USER-CUDA package installed:
</P>
<PRE>cd lammps/src
make yes-user-cuda
make machine
</PRE>
<P>Note that the build will reference the lib/cuda/Makefile.common file
to extract settings relevant to the LAMMPS build. So it is important
that you have first built the cuda library (in lib/cuda) using
settings appropriate to your system.
</P>
<P><B>Input script requirements:</B>
</P>
<P>Additional input script requirements to run styles with a <I>cuda</I>
suffix are as follows.
</P>
<P>To invoke specific styles from the USER-CUDA package, you can either
append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
the <A HREF = "Section_start.html#2_6">-suffix command-line switch</A>, or use the
<A HREF = "suffix.html">suffix</A> command. One exception is that the <A HREF = "kspace_style.html">kspace_style
pppm/cuda</A> command has to be requested explicitly.
</P>
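<P>For example, for a charged system the explicit request might look like
this (the accuracy value is illustrative):
</P>
<PRE>kspace_style pppm/cuda 1.0e-4
</PRE>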
<P>To use the USER-CUDA package with its default settings, no additional
command is needed in your input script. This is because when LAMMPS
starts up, it detects if it has been built with the USER-CUDA package.
See the <A HREF = "Section_start.html#2_6">-cuda command-line switch</A> for more
details.
</P>
<P>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
cuda</A> command can be used at the beginning of your input
script. See the commands doc page for details.
</P>
<P><B>Performance tips:</B>
</P>
<P>The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of thousands.
</P>
<P>As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation may run.
</P>
<HR>
<HR>
<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages
</H4>
<P>Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
</P>
<P>As a consequence, for a specific simulation on particular hardware,
one package may be faster than the other. We give guidelines below,
but the best way to determine which package is faster for your input
script is to try both of them on your machine. See the benchmarking
section below for examples where this has been done.
</P>
<P><B>Guidelines for using each package optimally:</B>
</P>
<UL><LI>The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep. The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized). Hence, if you
can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package.
<LI>The GPU package is often faster than the USER-CUDA package, if the
number of atoms per GPU is "small". The crossover point, in terms of
atoms/GPU at which the USER-CUDA package becomes faster depends
strongly on the pair style. For example, for a simple Lennard Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller.
<LI>The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not; it works best when there is one CPU per GPU.
<LI>Both packages compute bonded interactions (bonds, angles, etc) on the
CPU. This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster.
</UL>
<P><B>Chief differences between the two packages:</B>
</P>
<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations.
<LI>The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not much of a problem since typical runs are
computation-limited rather than memory-limited.
<LI>When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
and the GPU. Hence its performance is affected if full 16 PCIe lanes
are not available for each GPU. In HPC environments this can be the
case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots.
</UL>
<P><B>Examples:</B>
</P>
<P>The LAMMPS distribution has two directories with sample
input scripts for the GPU and USER-CUDA packages.
</P>
<UL><LI>lammps/examples/gpu = GPU package files
<LI>lammps/examples/USER/cuda = USER-CUDA package files
</UL>
<P>These are files for identical systems, so they can be
used to benchmark the performance of both packages
on your system.
</P>
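<P>A sketch of how you might benchmark both packages on the same system
(the input file name is a placeholder; use whatever scripts you find in
those directories, and adjust the process count for your hardware):
</P>
<PRE>cd lammps/examples/gpu
mpirun -np 4 lmp_machine -sf gpu < in.one_of_the_examples
cd ../USER/cuda
mpirun -np 4 lmp_machine -sf cuda < in.one_of_the_examples
</PRE>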
<P><B>Benchmark data:</B>
</P>
<P>NOTE: We plan to add some benchmark results and plots here for the
examples described in the previous section.
</P>
<P>Simulations:
</P>
<P>1. Lennard Jones
</P>
<UL><LI>256,000 atoms
<LI>2.5 A cutoff
<LI>0.844 density
</UL>
<P>2. Lennard Jones
</P>
<UL><LI>256,000 atoms
<LI>5.0 A cutoff
<LI>0.844 density
</UL>
<P>3. Rhodopsin model
</P>
<UL><LI>256,000 atoms
<LI>10 A cutoff
<LI>Coulomb via PPPM
</UL>
<P>4. Lithium phosphate
</P>
<UL><LI>295,650 atoms
<LI>15 A cutoff
<LI>Coulomb via PPPM
</UL>
<P>Hardware:
</P>
<P>Workstation:
</P>
<UL><LI>2x GTX470
<LI>i7 950 @ 3 GHz
<LI>24 GB DDR3 @ 1066 MHz
<LI>CentOS 5.5
<LI>CUDA 3.2
<LI>Driver 260.19.12
</UL>
<P>eStella:
</P>
<UL><LI>6 Nodes
<LI>2xC2050
<LI>2x QDR InfiniBand interconnect (aggregate bandwidth 80 Gbps)
<LI>Intel X5650 HexCore @ 2.67 GHz
<LI>SL 5.5
<LI>CUDA 3.2
<LI>Driver 260.19.26
</UL>
<P>Keeneland:
</P>
<UL><LI>HP SL-390 (Ariston) cluster
<LI>120 nodes
<LI>2x Intel Westmere hex-core CPUs
<LI>3xC2070s
<LI>QDR InfiniBand interconnect
</UL>
</HTML>