<HTML>
<CENTER><A HREF = "Section_python.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_errors.html">Next Section</A>
</CENTER>

<HR>
<H3>10. Using accelerated CPU and GPU styles
</H3>
<P>NOTE: These doc pages are still incomplete as of 1Jun11.
</P>
<P>NOTE: The USER-CUDA package discussed below has not yet been
officially released in LAMMPS.
</P>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS.  These will typically run faster than the
standard non-accelerated versions if you have the appropriate
hardware on your system.
</P>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended.  Otherwise, the syntax for the
command is identical, their functionality is the same, and the
numerical results they produce should also be identical, except for
precision and round-off issues.
</P>
<P>For example, all of these variants of the basic Lennard-Jones pair
style exist in LAMMPS:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
</UL>
<P>Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script.  Or you can use the <A HREF = "Section_start.html#2_6">-suffix command-line
switch</A> to invoke the accelerated versions
automatically, without changing your input script.  The
<A HREF = "suffix.html">suffix</A> command also allows you to set a suffix and to
turn the command-line switch setting off/on within your input script.
</P>
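<P>For example, a minimal sketch of toggling the suffix setting inside an
input script (assuming LAMMPS was built with the OPT package; the cutoff
is illustrative):
</P>
<PRE>suffix opt
pair_style lj/cut 2.5     # runs as lj/cut/opt
suffix off
pair_style lj/cut 2.5     # runs as the plain lj/cut style
</PRE>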
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
</P>
<P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#3_5">this section</A> of the manual.
A list of accelerated styles is included in the pair, fix, compute,
and kspace sections.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to install the accelerated packages
<LI>what kind of problems they run best on
<LI>guidelines for how to use them to best advantage
<LI>the kinds of speed-ups you can expect
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA GPU hardware.
</P>
10.1 <A HREF = "#10_1">OPT package</A><BR>
10.2 <A HREF = "#10_2">GPU package</A><BR>
10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
10.4 <A HREF = "#10_4">Comparison of GPU and USER-CUDA packages</A> <BR>

<HR>

<HR>
<H4><A NAME = "10_1"></A>10.1 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie and Vincent Natoli (Stone Ridge
Technologies).  It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple.  It
is the same as for any other package which has no additional library
dependencies:
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles,
you can run it as follows:
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run.  On most machines and problems, this will typically be a 5
to 20% savings.
</P>

<HR>
<H4><A NAME = "10_2"></A>10.2 GPU package
</H4>
<P>The GPU package was developed by Mike Brown at ORNL.
</P>
<P>Additional requirements in your input script to run the styles with a
<I>gpu</I> suffix are as follows: the <A HREF = "newton.html">newton pair</A> setting
must be <I>off</I>, and the <A HREF = "fix_gpu.html">fix gpu</A> command must be used.
The fix controls the GPU selection and initialization steps.
</P>
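<P>A minimal sketch of these required settings in an input script (the
fix ID, GPU IDs, split value, and cutoff are illustrative):
</P>
<PRE>newton off
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5
</PRE>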
<P>A few LAMMPS <A HREF = "pair_style.html">pair styles</A> can be run on graphical
processing units (GPUs).  We plan to add more over time.  Currently,
they only support NVIDIA GPU cards.  To use them you need to install
certain NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in the README in lammps/lib/gpu to build the library
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<H4>GPU configuration
</H4>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process.  Multiple processes can share a single GPU, and in many cases
it will be more efficient to run with multiple processes per GPU.  Any
GPU-accelerated style requires that <A HREF = "fix_gpu.html">fix gpu</A> be used in
the input script to select and initialize the GPUs.  The format for the
fix is:
</P>
<PRE>fix <I>name</I> all gpu <I>mode</I> <I>first</I> <I>last</I> <I>split</I>
</PRE>
<P>where <I>name</I> is the name for the fix.  The gpu fix must be the first
fix specified for a given run, otherwise the program will exit with an
error.  The gpu fix will not have any effect on runs that do not use
GPU acceleration; there should be no problem with specifying the fix
first in any input script.
</P>
<P><I>mode</I> can be either "force" or "force/neigh".  In the former, the
neighbor list calculation is performed on the CPU using the standard
LAMMPS routines.  In the latter, the neighbor list calculation is
performed on the GPU.  The GPU neighbor list can be used for better
performance; however, it cannot be used with a triclinic box or with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles.
</P>
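<P>For example, the two modes would be selected as follows (the fix ID
and the GPU/split arguments are illustrative):
</P>
<PRE>fix 0 all gpu force 0 0 1.0          # neighbor lists built on the CPU
fix 0 all gpu force/neigh 0 0 1.0    # neighbor lists built on the GPU
</PRE>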
<P>There are cases when it might be more efficient to select the CPU for
neighbor list builds.  If a non-GPU-enabled style requires a neighbor
list, it will also be built using CPU routines.  Redundant CPU and GPU
neighbor list calculations will typically be less efficient.
</P>
<P><I>first</I> is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of
the first GPU that will be used on each node.  <I>last</I> is the ID of the
last GPU that will be used on each node.  If you have only one GPU per
node, <I>first</I> and <I>last</I> will typically both be 0.  Selecting a
non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
</P>
<P><I>split</I> is the fraction of particles whose forces, torques, energies,
and/or virials will be calculated on the GPU.  This can be used to
perform CPU and GPU force calculations simultaneously.  If <I>split</I> is
negative, the software will attempt to calculate the optimal fraction
automatically every 25 timesteps, based on CPU and GPU timings.  Because
the GPU speed-ups depend on the number of particles, automatic
calculation of the split can be less efficient, but typically results
in loop times within 20% of an optimal fixed split.
</P>
<P>If you have two GPUs per node, 8 CPU cores per node, and would like to
run on 4 nodes with dynamic balancing of force calculation across CPU
and GPU cores, the fix might be
</P>
<PRE>fix 0 all gpu force/neigh 0 1 -1
</PRE>
<P>with LAMMPS run on 32 processes.  In this case, all CPU cores and GPU
devices on the nodes would be utilized.  Each GPU device would be
shared by 4 CPU cores.  The CPU cores would perform force calculations
for some fraction of the particles at the same time the GPUs performed
force calculation for the other particles.
</P>
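<P>A corresponding launch command might look like the following; the
exact flags for placing 8 processes on each of the 4 nodes depend on
your MPI implementation:
</P>
<PRE>mpirun -np 32 lmp_machine < in.script
</PRE>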
<P>Because of the large number of cores on each GPU device, it might be
more efficient to run with fewer processes per GPU when the number of
particles per process is small (hundreds of particles); this can be
necessary to keep the GPU cores busy.
</P>
<H4>GPU input script
</H4>
<P>To use GPU acceleration in LAMMPS, the <A HREF = "fix_gpu.html">fix gpu</A> command
must be used to initialize and configure the GPUs.  Additionally,
GPU-enabled styles must be selected in the input script.  Currently,
this is limited to a few <A HREF = "pair_style.html">pair styles</A> and PPPM.
Some GPU-enabled styles have additional restrictions listed in their
documentation.
</P>
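<P>Putting the pieces together, a minimal GPU-accelerated input script
might look like this (the data file name and all numeric parameters are
illustrative):
</P>
<PRE>units lj
newton off
atom_style atomic
read_data data.lj
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5
pair_coeff * * 1.0 1.0
run 100
</PRE>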
<H4>GPU asynchronous pair computation
</H4>
<P>The GPU accelerated pair styles can be used to perform pair style
force calculation on the GPU while other calculations are performed on
the CPU.  One method to do this is to specify a <I>split</I> in the gpu fix
as described above.  In this case, force calculation for the pair
style will also be performed on the CPU.
</P>
<P>When the CPU work in a GPU pair style has finished, the next force
computation will begin, possibly before the GPU has finished.  If
<I>split</I> is 1.0 in the gpu fix, the next force computation will begin
almost immediately.  This can be used to run a
<A HREF = "pair_hybrid.html">hybrid</A> GPU pair style at the same time as a hybrid
CPU pair style.  In this case, the GPU pair style should be first in
the hybrid command in order to perform simultaneous calculations.  This
also allows <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> force computations to be run
simultaneously with the GPU pair style.  Once all CPU force
computations have completed, the gpu fix will block until the GPU has
finished all work before continuing the run.
</P>
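<P>As a sketch, a hybrid pairing that lists the GPU sub-style first, so
that it runs concurrently with the CPU sub-style (the sub-styles,
cutoffs, and coefficients are illustrative):
</P>
<PRE>fix 0 all gpu force 0 0 1.0
pair_style hybrid lj/cut/gpu 2.5 coul/long 10.0
pair_coeff * * lj/cut/gpu 1.0 1.0
pair_coeff * * coul/long
kspace_style pppm 1e-4
</PRE>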
<H4>GPU timing
</H4>
<P>GPU accelerated pair styles can perform computations asynchronously
with CPU computations.  The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations.  Any time spent in GPU-enabled pair styles on
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When <I>mode</I> for the gpu fix is force/neigh, the time for neighbor list
calculations on the GPU will be added into the "Pair" time, not the
"Neigh" time.  A breakdown of the times required for various tasks on
the GPU (data copy, neighbor calculations, force computations, etc.)
is output only with the LAMMPS screen output at the end of each
run.  These timings represent total time spent on the GPU for each
routine, regardless of asynchronous CPU calculations.
</P>
<H4>GPU single vs double precision
</H4>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision.  The
latter requires that your GPU card supports double precision.
</P>
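<P>As a sketch, the library build itself is done in lib/gpu, assuming one
of the provided makefiles fits your platform (the makefile name and the
precision setting inside it vary; see the README):
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
</PRE>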

<HR>
<H4><A NAME = "10_3"></A>10.3 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at Ilmenau
University of Technology in Germany.
</P>
<P>This package is only useful if you have an NVIDIA(tm) graphics card
that is CUDA(tm) enabled.  Your GPU needs to support Compute Capability
1.3 or higher.  This list may help you find the Compute Capability of
your card:
</P>
<P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
</P>
<P>Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the
corresponding GPU drivers.  The NVIDIA CUDA SDK is not required for
LAMMPSCUDA, but we recommend installing it and making sure that the
sample projects can be compiled without problems.
</P>
<P>You should also be able to compile LAMMPS by typing
</P>
<P><I>make YourMachine</I>
</P>
<P>inside the src directory of the LAMMPS root path.  If not, you should
consult the LAMMPS documentation.
</P>
<H4>Compilation
</H4>
<P>If your <I>CUDA</I> toolkit is not installed in the default directory
<I>/usr/local/cuda</I>, edit the file <I>lib/cuda/Makefile.common</I>
accordingly.
</P>
<P>Go to <I>lib/cuda/</I> and type
</P>
<P><I>make OPTIONS</I>
</P>
<P>where <I>OPTIONS</I> are one or more of the following:
</P>
<UL><LI><I>precision = 2</I> set precision level: 1 .. single precision, 2
.. double precision, 3 .. positions in double precision, 4
.. positions and velocities in double precision

<LI><I>arch = 20</I> set GPU compute capability: 20 .. CC2.0 (GF100/110,
e.g. C2050, GTX580, GTX470), 21 .. CC2.1 (GF104/114, e.g. GTX560,
GTX460, GTX450), 13 .. CC1.3 (GT200, e.g. C1060, GTX285)

<LI><I>prec_timer = 1</I> do not use precision timers if set to 0.  This is
usually only useful when compiling on Mac machines.

<LI><I>dbg = 0</I> activate debug mode when set to 1.  Only useful for
developers.

<LI><I>cufft = 1</I> set the CUDA FFT library.  Currently this can only be
used to compile without cufft support (set to 0).  In the future, other
CUDA-enabled FFT libraries might be supported.
</UL>
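<P>For example, to build the library in double precision for a CC 2.0
card, one might type:
</P>
<PRE>make precision=2 arch=20
</PRE>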
<P>The settings will be written to <I>lib/cuda/Makefile.defaults</I>.  When
compiling with <I>make</I>, only those settings will be used.
</P>
<P>Go to <I>src</I>, install the USER-CUDA package with <I>make yes-USER-CUDA</I>,
and compile the binary with <I>make YourMachine</I>.  You might need to
delete old object files if you previously compiled without the USER-CUDA
package using the same machine file (<I>rm Obj_YourMachine/*</I>).
</P>
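<P>Collected in one place, the steps might look like:
</P>
<PRE>cd src
make yes-USER-CUDA
rm Obj_YourMachine/*     # only if you previously built without USER-CUDA
make YourMachine
</PRE>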
<P>CUDA versions of classes are only installed if the corresponding CPU
versions are installed as well.  E.g. you need to install the KSPACE
package to use <I>pppm/cuda</I>.
</P>
<H4>Usage
</H4>
<P>In order to make use of the GPU acceleration provided by the USER-CUDA
package, you only have to add
</P>
<P><I>accelerator cuda</I>
</P>
<P>at the top of your input script.  See the <A HREF = "accelerator.html">accelerator</A> command for details of additional options.
</P>
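<P>A minimal sketch of such a script header (assuming the cuda variants
of the styles used below it are installed; the pair style and cutoff
are illustrative):
</P>
<PRE>accelerator cuda
units lj
pair_style lj/cut 2.5     # executed as lj/cut/cuda via the accelerator setting
</PRE>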
<P>When compiling with USER-CUDA support, the <A HREF = "Section_start.html#2_6">-accelerator command-line
switch</A> is effectively set to "cuda" by default
and does not have to be given.
</P>
<P>If you want to run simulations without using the "cuda" styles with
the same binary, you need to turn it off explicitly by giving "-a
none", "-a opt" or "-a gpu" as a command-line argument.
</P>
<P>The kspace style <I>pppm/cuda</I> has to be requested explicitly.
</P>
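<P>For example (the accuracy value is illustrative):
</P>
<PRE>kspace_style pppm/cuda 1e-4
</PRE>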

<HR>

<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages
</H4>
<P>The USER-CUDA package is an alternative package for GPU acceleration
that runs as much of the simulation as possible on the GPU.  Depending on
the simulation, this can provide a significant speedup when the number
of atoms per GPU is large.
</P>
<P>The styles available for GPU acceleration
differ between the two packages.
</P>
<P>The main difference between the GPU and the USER-CUDA package is
that while the latter aims at calculating everything on the device, the
GPU package uses it as an accelerator for the pair force, neighbor
list, and PPPM calculations only.  As a consequence, in different
scenarios either package can be faster.  Generally, the GPU package is
faster than the USER-CUDA package if the number of atoms per device
is small.  The GPU package also benefits from oversubscribing devices;
hence one usually wants to launch two (or more) MPI processes per
device.
</P>
<P>The exact crossover where the USER-CUDA package becomes faster depends
strongly on the pair style.  For example, for a simple Lennard-Jones
system the crossover (in single precision) can often be found between
50,000 and 100,000 atoms per device.  When performing double-precision
calculations, this threshold can be significantly smaller.  As a result,
the GPU package can show better "strong scaling" behaviour in
comparison with the USER-CUDA package, as long as this limit of atoms
per GPU is not reached.
</P>
<P>Another scenario where the GPU package can be faster is when a lot of
bonded interactions are calculated.  Those are handled by both packages
on the host while the device simultaneously calculates the
pair forces.  Since, when using the GPU package, one launches several
MPI processes per device, this work is spread over more CPU cores
compared to running the same simulation with the USER-CUDA package.
</P>
<P>As a side note: the GPU package performance depends to some extent on
optimal bandwidth between host and device.  Hence its performance is
reduced if fewer than 16 PCIe lanes are available for each device.  In
HPC environments this can be the case if S2050/70 servers are used,
where two devices generally share one PCIe 2.0 16x slot.  Also, many
multi-GPU mainboards do not provide full 16 lanes to each of the PCIe
2.0 16x slots.
</P>
<P>While the GPU package uses considerably more device memory than the
USER-CUDA package, this is generally not much of a problem.  Typically,
run times become longer than desired before the memory is exhausted.
</P>
<P>Currently the USER-CUDA package supports a wider range of
force fields.  On the other hand, its performance is considerably
reduced if one has to use, at every timestep, a fix that is not yet
available in a "CUDA"-accelerated version.
</P>
<P>In the end, for each simulation it is best to just try both packages
and see which one performs better in the particular situation.
</P>
<H4>Benchmark
</H4>
<P>In the following, four benchmark systems that are supported by both
the GPU and the USER-CUDA packages are shown:
</P>
<P>1. Lennard-Jones, 2.5 A:<BR>
256,000 atoms<BR>
2.5 A cutoff<BR>
0.844 density
</P>
<P>2. Lennard-Jones, 5.0 A:<BR>
256,000 atoms<BR>
5.0 A cutoff<BR>
0.844 density
</P>
<P>3. Rhodopsin model:<BR>
256,000 atoms<BR>
10 A cutoff<BR>
Coulomb via PPPM
</P>
<P>4. Lithium-Phosphate:<BR>
295,650 atoms<BR>
15 A cutoff<BR>
Coulomb via PPPM
</P>
<P>Hardware:
</P>
<P>Workstation:<BR>
2x GTX470<BR>
i7 950 @ 3 GHz<BR>
24 GB DDR3 @ 1066 MHz<BR>
CentOS 5.5<BR>
CUDA 3.2<BR>
Driver 260.19.12
</P>
<P>eStella:<BR>
6 nodes<BR>
2x C2050<BR>
2x QDR Infiniband interconnect (aggregate bandwidth 80 Gbps)<BR>
Intel X5650 HexCore @ 2.67 GHz<BR>
SL 5.5<BR>
CUDA 3.2<BR>
Driver 260.19.26
</P>
<P>Keeneland:<BR>
HP SL-390 (Ariston) cluster<BR>
120 nodes<BR>
2x Intel Westmere hex-core CPUs<BR>
3x C2070s<BR>
QDR InfiniBand interconnect
</P>
</HTML>