2011-05-27 07:45:30 +08:00
<HTML>
<CENTER><A HREF = "Section_python.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> - <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_errors.html">Next Section</A>
</CENTER>

<HR>

<H3>10. Using accelerated CPU and GPU styles
</H3>
<P>NOTE: These doc pages are still incomplete as of 1Jun11.
</P>
<P>NOTE: The USER-CUDA package discussed below has not yet been
officially released in LAMMPS.
</P>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS. They will typically run faster than the
standard non-accelerated versions if you have the appropriate
hardware on your system.
</P>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax of each
command is identical, its functionality is the same, and the
numerical results it produces should also be identical, except for
precision and round-off issues.
</P>
<P>For example, all of these variants of the basic Lennard-Jones pair
style exist in LAMMPS:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
</UL>
<P>Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script. Or you can use the <A HREF = "Section_start.html#2_6">-suffix command-line
switch</A> to invoke the accelerated versions
automatically, without changing your input script. The
<A HREF = "suffix.html">suffix</A> command also allows you to set a suffix and to
turn the command-line switch setting off or on within your input script.
</P>
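<P>As a minimal sketch of the two approaches, assuming LAMMPS was built
with the OPT package and using an illustrative cutoff of 2.5, either of
the following input-script fragments would select the optimized
Lennard-Jones style. The first names the accelerated style explicitly.
The second leaves the style name unchanged and relies on the suffix
command; check its exact arguments against the <A HREF = "suffix.html">suffix</A>
doc page:
</P>
<PRE># explicit: name the accelerated style directly
pair_style lj/cut/opt 2.5
</PRE>
<PRE># implicit: append the "opt" suffix to styles that have an accelerated variant
suffix opt
pair_style lj/cut 2.5
</PRE>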
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
</P>
<P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#3_5">this section</A> of the manual.
A list of accelerated styles is included in the pair, fix, compute,
and kspace sections.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to install the accelerated packages
<LI>what kind of problems they run best on
<LI>guidelines for how to use them to best advantage
<LI>the kinds of speed-ups you can expect
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA GPU hardware.
</P>
10.1 <A HREF = "#10_1">OPT package</A><BR>
10.2 <A HREF = "#10_2">GPU package</A><BR>
10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
10.4 <A HREF = "#10_4">Comparison of GPU and USER-CUDA packages</A><BR>

<HR>

<H4><A NAME = "10_1"></A>10.1 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package that has no additional library
dependencies:
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles,
you can run it as follows:
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
</P>

<HR>

<H4><A NAME = "10_2"></A>10.2 GPU package
</H4>
<P>Additional requirements in your input script to run the styles with a
<I>gpu</I> suffix are as follows:
</P>
<P>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I> and the <A HREF = "fix_gpu.html">fix
gpu</A> command must be used. The fix controls the GPU
selection and initialization steps.
</P>
<P>The GPU package was developed by Mike Brown at ORNL.
</P>
<P>A few LAMMPS <A HREF = "pair_style.html">pair styles</A> can be run on graphics
processing units (GPUs). We plan to add more over time. Currently,
they only support NVIDIA GPU cards. To use them you need to install
certain NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in README in lammps/lib/gpu to build the library
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<H4>GPU configuration
</H4>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process. Multiple processes can share a single GPU and in many cases
it will be more efficient to run with multiple processes per GPU. Any
GPU accelerated style requires that <A HREF = "fix_gpu.html">fix gpu</A> be used in
the input script to select and initialize the GPUs. The format for the
fix is:
</P>
<PRE>fix <I>name</I> all gpu <I>mode</I> <I>first</I> <I>last</I> <I>split</I>
</PRE>
<P>where <I>name</I> is the name for the fix. The gpu fix must be the first
fix specified for a given run, otherwise the program will exit with an
error. The gpu fix will not have any effect on runs that do not use
GPU acceleration; there should be no problem with specifying the fix
first in any input script.
</P>
<P><I>mode</I> can be either "force" or "force/neigh". In the former, neighbor
list calculation is performed on the CPU using the standard LAMMPS
routines. In the latter, the neighbor list calculation is performed on
the GPU. The GPU neighbor list can be used for better performance;
however, it cannot be used with a triclinic box or with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles.
</P>
<P>There are cases when it might be more efficient to select the CPU for
neighbor list builds. If a non-GPU enabled style requires a neighbor
list, it will also be built using CPU routines. Redundant CPU and GPU
neighbor list calculations will typically be less efficient.
</P>
<P><I>first</I> is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of
the first GPU that will be used on each node. <I>last</I> is the ID of the
last GPU that will be used on each node. If you have only one GPU per
node, <I>first</I> and <I>last</I> will typically both be 0. Selecting a
non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
</P>
<P><I>split</I> is the fraction of particles whose forces, torques, energies,
and/or virials will be calculated on the GPU. This can be used to
perform CPU and GPU force calculations simultaneously. If <I>split</I> is
negative, the software will attempt to calculate the optimal fraction
automatically every 25 timesteps based on CPU and GPU timings. Because
the GPU speedups are dependent on the number of particles, automatic
calculation of the split can be less efficient, but typically results
in loop times within 20% of an optimal fixed split.
</P>
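<P>The arguments above can be illustrated with a few alternative
settings. This is only a sketch: the fix name "0" is arbitrary, the
split value 0.75 is an illustration rather than a tuned recommendation,
and only one gpu fix would appear in a real script (the alternatives
are commented out):
</P>
<PRE># one GPU per node (ID 0), CPU neighbor lists, all pair forces on the GPU
fix 0 all gpu force 0 0 1.0
# same GPU, neighbor lists built on the GPU
# fix 0 all gpu force/neigh 0 0 1.0
# GPUs 0 and 1 on each node, fixed split: 75% of particles on the GPU
# fix 0 all gpu force/neigh 0 1 0.75
# GPUs 0 and 1 on each node, split adjusted automatically during the run
# fix 0 all gpu force/neigh 0 1 -1
</PRE>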
<P>If you have two GPUs per node, 8 CPU cores per node, and would like to
run on 4 nodes with dynamic balancing of force calculation across CPU
and GPU cores, the fix might be
</P>
<PRE>fix 0 all gpu force/neigh 0 1 -1
</PRE>
<P>with LAMMPS run on 32 processes. In this case, all CPU cores and GPU
devices on the nodes would be utilized. Each GPU device would be
shared by 4 CPU cores. The CPU cores would perform force calculations
for some fraction of the particles at the same time the GPUs performed
force calculation for the other particles.
</P>
<P>Because of the large number of cores on each GPU device, it might be
more efficient to run on fewer processes per GPU when the number of
particles per process is small (hundreds of particles); this can be
necessary to keep the GPU cores busy.
</P>
<H4>GPU input script
</H4>
<P>To use GPU acceleration in LAMMPS, <A HREF = "fix_gpu.html">fix gpu</A>
should be used to initialize and configure the GPUs.
Additionally, GPU enabled styles must be selected in the input
script. Currently, this is limited to a few <A HREF = "pair_style.html">pair
styles</A> and PPPM. Some GPU-enabled styles have
additional restrictions listed in their documentation.
</P>
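<P>Putting these requirements together, the GPU-related part of an input
script might look like the following sketch. It assumes a build with
the GPU package, a single GPU with ID 0 on each node, and an
illustrative Lennard-Jones cutoff of 2.5. It is a fragment, not a
complete script:
</P>
<PRE>newton off                          # newton pair must be off for gpu styles
# ... define units, atoms, and the simulation box here ...
fix 0 all gpu force/neigh 0 0 1.0   # gpu fix must be the first fix specified
pair_style lj/cut/gpu 2.5           # GPU-enabled pair style
</PRE>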
<H4>GPU asynchronous pair computation
</H4>
<P>The GPU accelerated pair styles can be used to perform pair style
force calculation on the GPU while other calculations are performed on
the CPU. One method to do this is to specify a <I>split</I> in the gpu fix
as described above. In this case, force calculation for the pair
style will also be performed on the CPU.
</P>
<P>When the CPU work in a GPU pair style has finished, the next force
computation will begin, possibly before the GPU has finished. If
<I>split</I> is 1.0 in the gpu fix, the next force computation will begin
almost immediately. This can be used to run a
<A HREF = "pair_hybrid.html">hybrid</A> GPU pair style at the same time as a hybrid
CPU pair style. In this case, the GPU pair style should be first in
the hybrid command in order to perform simultaneous calculations. This
also allows <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> force computations to be run
simultaneously with the GPU pair style. Once all CPU force
computations have completed, the gpu fix will block until the GPU has
finished all work before continuing the run.
</P>
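<P>A sketch of this arrangement follows. It assumes two atom types, uses
the standard morse style as the CPU sub-style purely for illustration,
and all cutoffs and coefficients are placeholder values. Because GPU
neighbor lists cannot be used with hybrid pair styles, the gpu fix uses
the "force" mode, and <I>split</I> is 1.0 so that the CPU is free to work on
the other sub-style:
</P>
<PRE>newton off
# ... define units, atoms, and the simulation box here ...
fix 0 all gpu force 0 0 1.0
# GPU sub-style listed first so it can overlap with the CPU sub-style
pair_style hybrid lj/cut/gpu 2.5 morse 2.5
pair_coeff 1 1 lj/cut/gpu 1.0 1.0
pair_coeff 1 2 lj/cut/gpu 1.0 1.0
pair_coeff 2 2 morse 100.0 2.0 1.5
</PRE>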
<H4>GPU timing
</H4>
<P>GPU accelerated pair styles can perform computations asynchronously
with CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When <I>mode</I> for the gpu fix is force/neigh, the time for neighbor list
calculations on the GPU will be added into the "Pair" time, not the
"Neigh" time. A breakdown of the times required for various tasks on
the GPU (data copy, neighbor calculations, force computations, etc.)
is output only with the LAMMPS screen output at the end of each
run. These timings represent total time spent on the GPU for each
routine, regardless of asynchronous CPU calculations.
</P>
<H4>GPU single vs double precision
</H4>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision. The
latter requires that your GPU card supports double precision.
</P>

<HR>

<H4><A NAME = "10_3"></A>10.3 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany.
</P>

<HR>

<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages
</H4>
</HTML>