<HTML>
<CENTER><A HREF = "Section_python.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_errors.html">Next Section</A>
</CENTER>

<HR>
<H3>10. Using accelerated CPU and GPU styles
</H3>
<P>NOTE: These doc pages are still incomplete as of 1Jun11.
</P>
<P>NOTE: The USER-CUDA package discussed below has not yet been
officially released in LAMMPS.
</P>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS.  These will typically run faster than the
standard non-accelerated versions if you have the appropriate
hardware on your system.
</P>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended.  Otherwise, the syntax for the
command is identical, their functionality is the same, and the
numerical results they produce should also be identical, except for
precision and round-off issues.
</P>
<P>For example, all of these variants of the basic Lennard-Jones pair
style exist in LAMMPS:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
</UL>
<P>Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script.  Or you can use the <A HREF = "Section_start.html#2_6">-suffix command-line
switch</A> to invoke the accelerated versions
automatically, without changing your input script.  The
<A HREF = "suffix.html">suffix</A> command also allows you to set a suffix and to
turn the command-line switch setting off/on within your input script.
</P>
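<P>For example, a minimal sketch of toggling the suffix setting inside an
input script (assuming LAMMPS was built with the OPT package; the cutoff
is illustrative):
</P>
<PRE>suffix opt
pair_style lj/cut 2.5     # runs as lj/cut/opt
suffix off
pair_style lj/cut 2.5     # runs as the plain lj/cut style
</PRE>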
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
</P>
<P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#3_5">this section</A> of the manual.
A list of accelerated styles is included in the pair, fix, compute,
and kspace sections.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to install the accelerated packages
<LI>what kind of problems they run best on
<LI>guidelines for how to use them to best advantage
<LI>the kinds of speed-ups you can expect
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA GPU hardware.
</P>
10.1 <A HREF = "#10_1">OPT package</A><BR>
10.2 <A HREF = "#10_2">GPU package</A><BR>
10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
10.4 <A HREF = "#10_4">Comparison of GPU and USER-CUDA packages</A> <BR>

<HR>

<HR>
<H4><A NAME = "10_1"></A>10.1 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie and Vincent Natoli (Stone Ridge
Technologies).  It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple.  It
is the same as for any other package which has no additional library
dependencies:
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles,
you can run it as follows:
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run.  On most machines and problems, this will typically be a 5
to 20% savings.
</P>

<HR>
<H4><A NAME = "10_2"></A>10.2 GPU package
</H4>
<P>The GPU package was developed by Mike Brown at ORNL.
</P>
<P>Additional requirements in your input script to run the styles with a
<I>gpu</I> suffix are as follows: the <A HREF = "newton.html">newton pair</A> setting
must be <I>off</I>, and the <A HREF = "fix_gpu.html">fix gpu</A> command must be used.
The fix controls the GPU selection and initialization steps.
</P>
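<P>A minimal sketch of these required settings in an input script (the
fix ID, GPU IDs, split value, and cutoff are illustrative):
</P>
<PRE>newton off
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5
</PRE>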
<P>A few LAMMPS <A HREF = "pair_style.html">pair styles</A> can be run on graphical
processing units (GPUs).  We plan to add more over time.  Currently,
they only support NVIDIA GPU cards.  To use them you need to install
certain NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in the README in lammps/lib/gpu to build the library
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<H4>GPU configuration
</H4>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process.  Multiple processes can share a single GPU, and in many cases
it will be more efficient to run with multiple processes per GPU.  Any
GPU-accelerated style requires that <A HREF = "fix_gpu.html">fix gpu</A> be used in
the input script to select and initialize the GPUs.  The format for the
fix is:
</P>
<PRE>fix <I>name</I> all gpu <I>mode</I> <I>first</I> <I>last</I> <I>split</I>
</PRE>
<P>where <I>name</I> is the name for the fix.  The gpu fix must be the first
fix specified for a given run, otherwise the program will exit with an
error.  The gpu fix will not have any effect on runs that do not use
GPU acceleration; there should be no problem with specifying the fix
first in any input script.
</P>
<P><I>mode</I> can be either "force" or "force/neigh".  In the former, the
neighbor list calculation is performed on the CPU using the standard
LAMMPS routines.  In the latter, the neighbor list calculation is
performed on the GPU.  The GPU neighbor list can be used for better
performance; however, it cannot be used with a triclinic box or with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles.
</P>
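<P>For example, the two modes would be selected as follows (the fix ID
and the GPU/split arguments are illustrative):
</P>
<PRE>fix 0 all gpu force 0 0 1.0          # neighbor lists built on the CPU
fix 0 all gpu force/neigh 0 0 1.0    # neighbor lists built on the GPU
</PRE>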
<P>There are cases when it might be more efficient to select the CPU for
neighbor list builds.  If a non-GPU-enabled style requires a neighbor
list, it will also be built using CPU routines.  Redundant CPU and GPU
neighbor list calculations will typically be less efficient.
</P>
<P><I>first</I> is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of
the first GPU that will be used on each node.  <I>last</I> is the ID of the
last GPU that will be used on each node.  If you have only one GPU per
node, <I>first</I> and <I>last</I> will typically both be 0.  Selecting a
non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
</P>
<P><I>split</I> is the fraction of particles whose forces, torques, energies,
and/or virials will be calculated on the GPU.  This can be used to
perform CPU and GPU force calculations simultaneously.  If <I>split</I> is
negative, the software will attempt to calculate the optimal fraction
automatically every 25 timesteps, based on CPU and GPU timings.  Because
the GPU speed-ups depend on the number of particles, automatic
calculation of the split can be less efficient, but typically results
in loop times within 20% of an optimal fixed split.
</P>
<P>If you have two GPUs per node, 8 CPU cores per node, and would like to
run on 4 nodes with dynamic balancing of force calculation across CPU
and GPU cores, the fix might be
</P>
<PRE>fix 0 all gpu force/neigh 0 1 -1
</PRE>
<P>with LAMMPS run on 32 processes.  In this case, all CPU cores and GPU
devices on the nodes would be utilized.  Each GPU device would be
shared by 4 CPU cores.  The CPU cores would perform force calculations
for some fraction of the particles at the same time the GPUs performed
force calculation for the other particles.
</P>
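<P>A corresponding launch command might look like the following; the
exact flags for placing 8 processes on each of the 4 nodes depend on
your MPI implementation:
</P>
<PRE>mpirun -np 32 lmp_machine < in.script
</PRE>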
<P>Because of the large number of cores on each GPU device, it might be
more efficient to run with fewer processes per GPU when the number of
particles per process is small (hundreds of particles); this can be
necessary to keep the GPU cores busy.
</P>
<H4>GPU input script
</H4>
<P>To use GPU acceleration in LAMMPS, the <A HREF = "fix_gpu.html">fix gpu</A> command
must be used to initialize and configure the GPUs.  Additionally,
GPU-enabled styles must be selected in the input script.  Currently,
this is limited to a few <A HREF = "pair_style.html">pair styles</A> and PPPM.
Some GPU-enabled styles have additional restrictions listed in their
documentation.
</P>
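<P>Putting the pieces together, a minimal GPU-accelerated input script
might look like this (the data file name and all numeric parameters are
illustrative):
</P>
<PRE>units lj
newton off
atom_style atomic
read_data data.lj
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5
pair_coeff * * 1.0 1.0
run 100
</PRE>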
<H4>GPU asynchronous pair computation
</H4>
<P>The GPU accelerated pair styles can be used to perform pair style
force calculation on the GPU while other calculations are performed on
the CPU.  One method to do this is to specify a <I>split</I> in the gpu fix
as described above.  In this case, force calculation for the pair
style will also be performed on the CPU.
</P>
<P>When the CPU work in a GPU pair style has finished, the next force
computation will begin, possibly before the GPU has finished.  If
<I>split</I> is 1.0 in the gpu fix, the next force computation will begin
almost immediately.  This can be used to run a
<A HREF = "pair_hybrid.html">hybrid</A> GPU pair style at the same time as a hybrid
CPU pair style.  In this case, the GPU pair style should be first in
the hybrid command in order to perform simultaneous calculations.  This
also allows <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> force computations to be run
simultaneously with the GPU pair style.  Once all CPU force
computations have completed, the gpu fix will block until the GPU has
finished all work before continuing the run.
</P>
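<P>As a sketch, a hybrid pairing that lists the GPU sub-style first, so
that it runs concurrently with the CPU sub-style (the sub-styles,
cutoffs, and coefficients are illustrative):
</P>
<PRE>fix 0 all gpu force 0 0 1.0
pair_style hybrid lj/cut/gpu 2.5 coul/long 10.0
pair_coeff * * lj/cut/gpu 1.0 1.0
pair_coeff * * coul/long
kspace_style pppm 1e-4
</PRE>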
<H4>GPU timing
</H4>
<P>GPU accelerated pair styles can perform computations asynchronously
with CPU computations.  The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations.  Any time spent in GPU-enabled pair styles on
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When <I>mode</I> for the gpu fix is force/neigh, the time for neighbor list
calculations on the GPU will be added into the "Pair" time, not the
"Neigh" time.  A breakdown of the times required for various tasks on
the GPU (data copy, neighbor calculations, force computations, etc.)
is output only with the LAMMPS screen output at the end of each
run.  These timings represent total time spent on the GPU for each
routine, regardless of asynchronous CPU calculations.
</P>
<H4>GPU single vs double precision
</H4>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision.  The
latter requires that your GPU card supports double precision.
</P>
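<P>As a sketch, the library build itself is done in lib/gpu, assuming one
of the provided makefiles fits your platform (the makefile name and the
precision setting inside it vary; see the README):
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
</PRE>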

<HR>
<H4><A NAME = "10_3"></A>10.3 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at Ilmenau
University of Technology in Germany.
</P>
<P>This package is only useful if you have an NVIDIA(tm) graphics card
that is CUDA(tm) enabled.  Your GPU needs to support Compute Capability
1.3 or higher.  This list may help you find the Compute Capability of
your card:
</P>
<P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
</P>
<P>Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the
corresponding GPU drivers.  The NVIDIA CUDA SDK is not required for
LAMMPSCUDA, but we recommend installing it and making sure that the
sample projects can be compiled without problems.
</P>
<P>You should also be able to compile LAMMPS by typing
</P>
<P><I>make YourMachine</I>
</P>
<P>inside the src directory of the LAMMPS root path.  If not, you should
consult the LAMMPS documentation.
</P>
<H4>Compilation
</H4>
<P>If your <I>CUDA</I> toolkit is not installed in the default directory
<I>/usr/local/cuda</I>, edit the file <I>lib/cuda/Makefile.common</I>
accordingly.
</P>
<P>Go to <I>lib/cuda/</I> and type
</P>
<P><I>make OPTIONS</I>
</P>
<P>where <I>OPTIONS</I> are one or more of the following:
</P>
<UL><LI><I>precision = 2</I> set precision level: 1 .. single precision, 2
.. double precision, 3 .. positions in double precision, 4
.. positions and velocities in double precision

<LI><I>arch = 20</I> set GPU compute capability: 20 .. CC2.0 (GF100/110,
e.g. C2050, GTX580, GTX470), 21 .. CC2.1 (GF104/114, e.g. GTX560,
GTX460, GTX450), 13 .. CC1.3 (GT200, e.g. C1060, GTX285)

<LI><I>prec_timer = 1</I> do not use precision timers if set to 0.  This is
usually only useful when compiling on Mac machines.

<LI><I>dbg = 0</I> activate debug mode when set to 1.  Only useful for
developers.

<LI><I>cufft = 1</I> set the CUDA FFT library.  Currently this can only be
used to compile without cufft support (set to 0).  In the future, other
CUDA-enabled FFT libraries might be supported.
</UL>
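<P>For example, to build the library in double precision for a CC 2.0
card, one might type:
</P>
<PRE>make precision=2 arch=20
</PRE>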
<P>The settings will be written to <I>lib/cuda/Makefile.defaults</I>.  When
compiling with <I>make</I>, only those settings will be used.
</P>
<P>Go to <I>src</I>, install the USER-CUDA package with <I>make yes-USER-CUDA</I>,
and compile the binary with <I>make YourMachine</I>.  You might need to
delete old object files if you previously compiled without the USER-CUDA
package using the same machine file (<I>rm Obj_YourMachine/*</I>).
</P>
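<P>Collected in one place, the steps might look like:
</P>
<PRE>cd src
make yes-USER-CUDA
rm Obj_YourMachine/*     # only if you previously built without USER-CUDA
make YourMachine
</PRE>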
<P>CUDA versions of classes are only installed if the corresponding CPU
versions are installed as well.  E.g. you need to install the KSPACE
package to use <I>pppm/cuda</I>.
</P>
<H4>Usage
</H4>
<P>In order to make use of the GPU acceleration provided by the USER-CUDA
package, you only have to add
</P>
<P><I>accelerator cuda</I>
</P>
<P>at the top of your input script.  See the <A HREF = "accelerator.html">accelerator</A> command for details of additional options.
</P>
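<P>A minimal sketch of such a script header (assuming the cuda variants
of the styles used below it are installed; the pair style and cutoff
are illustrative):
</P>
<PRE>accelerator cuda
units lj
pair_style lj/cut 2.5     # executed as lj/cut/cuda via the accelerator setting
</PRE>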
<P>When compiling with USER-CUDA support, the <A HREF = "Section_start.html#2_6">-accelerator command-line
switch</A> is effectively set to "cuda" by default
and does not have to be given.
</P>
<P>If you want to run simulations without using the "cuda" styles with
the same binary, you need to turn it off explicitly by giving "-a
none", "-a opt" or "-a gpu" as a command-line argument.
</P>
<P>The kspace style <I>pppm/cuda</I> has to be requested explicitly.
</P>
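<P>For example (the accuracy value is illustrative):
</P>
<PRE>kspace_style pppm/cuda 1e-4
</PRE>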

<HR>

<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages
</H4>
<P>The USER-CUDA package is an alternative package for GPU acceleration
that runs as much of the simulation as possible on the GPU.  Depending on
the simulation, this can provide a significant speedup when the number
of atoms per GPU is large.
</P>
<P>The styles available for GPU acceleration
differ between the two packages.
</P>
<P>The main difference between the GPU and the USER-CUDA package is
that while the latter aims at calculating everything on the device, the
GPU package uses it as an accelerator for the pair force, neighbor
list, and PPPM calculations only.  As a consequence, in different
scenarios either package can be faster.  Generally, the GPU package is
faster than the USER-CUDA package if the number of atoms per device
is small.  The GPU package also benefits from oversubscribing devices;
hence one usually wants to launch two (or more) MPI processes per
device.
</P>
<P>The exact crossover where the USER-CUDA package becomes faster depends
strongly on the pair style.  For example, for a simple Lennard-Jones
system the crossover (in single precision) can often be found between
50,000 and 100,000 atoms per device.  When performing double-precision
calculations, this threshold can be significantly smaller.  As a result,
the GPU package can show better "strong scaling" behaviour in
comparison with the USER-CUDA package, as long as this limit of atoms
per GPU is not reached.
</P>
<P>Another scenario where the GPU package can be faster is when a lot of
bonded interactions are calculated.  Those are handled by both packages
on the host while the device simultaneously calculates the
pair forces.  Since, when using the GPU package, one launches several
MPI processes per device, this work is spread over more CPU cores
compared to running the same simulation with the USER-CUDA package.
</P>
<P>As a side note: the GPU package performance depends to some extent on
optimal bandwidth between host and device.  Hence its performance is
reduced if fewer than 16 PCIe lanes are available for each device.  In
HPC environments this can be the case if S2050/70 servers are used,
where two devices generally share one PCIe 2.0 16x slot.  Also, many
multi-GPU mainboards do not provide full 16 lanes to each of the PCIe
2.0 16x slots.
</P>
<P>While the GPU package uses considerably more device memory than the
USER-CUDA package, this is generally not much of a problem.  Typically,
run times become longer than desired before the memory is exhausted.
</P>
<P>Currently the USER-CUDA package supports a wider range of
force fields.  On the other hand, its performance is considerably
reduced if one has to use, at every timestep, a fix that is not yet
available in a "CUDA"-accelerated version.
</P>
<P>In the end, for each simulation it is best to just try both packages
and see which one performs better in the particular situation.
</P>
<H4>Benchmark
</H4>
<P>In the following, four benchmark systems that are supported by both
the GPU and the USER-CUDA packages are shown:
</P>
<P>1. Lennard-Jones, 2.5 A:<BR>
256,000 atoms<BR>
2.5 A cutoff<BR>
0.844 density
</P>
<P>2. Lennard-Jones, 5.0 A:<BR>
256,000 atoms<BR>
5.0 A cutoff<BR>
0.844 density
</P>
<P>3. Rhodopsin model:<BR>
256,000 atoms<BR>
10 A cutoff<BR>
Coulomb via PPPM
</P>
<P>4. Lithium-Phosphate:<BR>
295,650 atoms<BR>
15 A cutoff<BR>
Coulomb via PPPM
</P>
<P>Hardware:
</P>
<P>Workstation:<BR>
2x GTX470<BR>
i7 950 @ 3 GHz<BR>
24 GB DDR3 @ 1066 MHz<BR>
CentOS 5.5<BR>
CUDA 3.2<BR>
Driver 260.19.12
</P>
<P>eStella:<BR>
6 nodes<BR>
2x C2050<BR>
2x QDR Infiniband interconnect (aggregate bandwidth 80 Gbps)<BR>
Intel X5650 HexCore @ 2.67 GHz<BR>
SL 5.5<BR>
CUDA 3.2<BR>
Driver 260.19.26
</P>
<P>Keeneland:<BR>
HP SL-390 (Ariston) cluster<BR>
120 nodes<BR>
2x Intel Westmere hex-core CPUs<BR>
3x C2070s<BR>
QDR InfiniBand interconnect
</P>
</HTML>