:doc:`Return to Section accelerate overview <Section_accelerate>`
|
|
|
|
5.USER-CUDA package
|
|
-------------------
|
|
|
|
The USER-CUDA package was developed by Christian Trott (Sandia) while
|
|
at the Ilmenau University of Technology in Germany. It provides NVIDIA GPU versions
|
|
of many pair styles, many fixes, a few computes, and long-range
|
|
Coulombics via the PPPM command. It has the following general
|
|
features:
|
|
|
|
* The package is designed to allow an entire LAMMPS calculation, for
|
|
many timesteps, to run entirely on the GPU (except for inter-processor
|
|
MPI communication), so that atom-based data (e.g. coordinates, forces)
|
|
do not have to move back-and-forth between the CPU and GPU.
|
|
* The speed-up advantage of this approach is typically better when the
|
|
number of atoms per GPU is large.
|
|
* Data will stay on the GPU until a timestep where a non-USER-CUDA fix
|
|
or compute is invoked. Whenever a non-GPU operation occurs (fix,
|
|
compute, output), data automatically moves back to the CPU as needed.
|
|
This may incur a performance penalty, but should otherwise work
|
|
transparently.
|
|
* Neighbor lists are constructed on the GPU.
|
|
* The package only supports use of a single MPI task, running on a
|
|
single CPU (core), assigned to each GPU.
|
|
Here is a quick overview of how to use the USER-CUDA package:
|
|
|
|
* build the library in lib/cuda for your GPU hardware with desired precision
|
|
* include the USER-CUDA package and build LAMMPS
|
|
* use the mpirun command to specify 1 MPI task per GPU (on each node)
|
|
* enable the USER-CUDA package via the "-c on" command-line switch
|
|
* specify the # of GPUs per node
|
|
* use USER-CUDA styles in your input script
|
|
|
|
The latter two steps can be done using the "-pk cuda" and "-sf cuda"
|
|
:ref:`command-line switches <start_7>` respectively. Or
|
|
the effect of the "-pk" or "-sf" switches can be duplicated by adding
|
|
the :doc:`package cuda <package>` or :doc:`suffix cuda <suffix>` commands
|
|
respectively to your input script.
|
|
|
|
**Required hardware/software:**
|
|
|
|
To use this package, you need to have one or more NVIDIA GPUs and
|
|
install the NVIDIA CUDA software on your system:
|
|
|
|
Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This list may
|
|
help you to find out the Compute Capability of your card:
|
|
|
|
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
|
|
|
|
Install the NVIDIA CUDA Toolkit (version 3.2 or higher) and the
|
|
corresponding GPU drivers. The NVIDIA CUDA SDK is not required, but
|
|
we recommend it also be installed. You can then make sure its sample
|
|
projects can be compiled without problems.
|
|
|
|
**Building LAMMPS with the USER-CUDA package:**
|
|
|
|
This requires two steps (a,b): build the USER-CUDA library, then build
|
|
LAMMPS with the USER-CUDA package.
|
|
|
|
You can do both these steps in one line, using the src/Make.py script,
|
|
described in :ref:`Section 2.4 <start_4>` of the manual.
|
|
Type "Make.py -h" for help. If run from the src directory, this
|
|
command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the
|
|
starting Makefile.machine:
|
|
|
|
.. parsed-literal::
|
|
|
|
Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi
|
|
|
|
Or you can follow these two (a,b) steps:
|
|
|
|
(a) Build the USER-CUDA library
|
|
|
|
The USER-CUDA library is in lammps/lib/cuda. If your *CUDA* toolkit
|
|
is not installed in the default system directory */usr/local/cuda*, edit
|
|
the file *lib/cuda/Makefile.common* accordingly.
|
|
|
|
To build the library with the settings in lib/cuda/Makefile.defaults,
|
|
simply type:
|
|
|
|
.. parsed-literal::
|
|
|
|
make
|
|
|
|
To set options when the library is built, type "make OPTIONS", where
|
|
*OPTIONS* are one or more of the following. The settings will be
|
|
written to the *lib/cuda/Makefile.defaults* before the build.
|
|
|
|
.. parsed-literal::
|
|
|
|
*precision=N* to set the precision level
|
|
N = 1 for single precision (default)
|
|
N = 2 for double precision
|
|
N = 3 for positions in double precision
|
|
N = 4 for positions and velocities in double precision
|
|
*arch=M* to set GPU compute capability
|
|
M = 35 for Kepler GPUs
|
|
M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
|
|
M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
|
|
M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
|
|
*prec_timer=0/1* to use hi-precision timers
|
|
0 = do not use them (default)
|
|
1 = use them
|
|
this is usually only useful for Mac machines
|
|
*dbg=0/1* to activate debug mode
|
|
0 = no debug mode (default)
|
|
1 = yes debug mode
|
|
this is only useful for developers
|
|
*cufft=1* for use of the CUDA FFT library
|
|
0 = no CUFFT support (default)
|
|
in the future other CUDA-enabled FFT libraries might be supported
|
|
|
|
If the build is successful, it will produce the files liblammpscuda.a and
|
|
Makefile.lammps.
|
|
|
|
Note that if you change any of the options (like precision), you need
|
|
to re-build the entire library. Do a "make clean" first, followed by
|
|
"make".
|
|
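As a rough illustration of the build options above, the following Python sketch assembles a "make OPTIONS" command line and rejects values that are not listed in this section. The helper name ``build_make_command`` is hypothetical and not part of LAMMPS. It only formats a string and does not run the build.

```python
# Hypothetical helper (not part of LAMMPS): format a "make OPTIONS" command
# for lib/cuda using only the option values listed in this section.
ALLOWED = {
    "precision": {1, 2, 3, 4},
    "arch": {13, 20, 21, 35},
    "prec_timer": {0, 1},
    "dbg": {0, 1},
    "cufft": {0, 1},
}


def build_make_command(**options):
    """Return a 'make name=value ...' string after validating each option."""
    parts = ["make"]
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"unsupported value for {name}: {value}")
        parts.append(f"{name}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Changing an option requires a full rebuild: clean first, then make.
    print("make clean")
    print(build_make_command(precision=2, arch=35))
```

Run from lib/cuda, the two printed commands correspond to the "make clean" followed by "make" sequence described above, here with double precision on a Kepler GPU.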
|
|
(b) Build LAMMPS with the USER-CUDA package
|
|
|
|
.. parsed-literal::
|
|
|
|
cd lammps/src
|
|
make yes-user-cuda
|
|
make machine
|
|
|
|
No additional compile/link flags are needed in Makefile.machine.
|
|
|
|
Note that if you change the USER-CUDA library precision (discussed
|
|
above) and rebuild the USER-CUDA library, then you also need to
|
|
re-install the USER-CUDA package and re-build LAMMPS, so that all
|
|
affected files are re-compiled and linked to the new USER-CUDA
|
|
library.
|
|
|
|
**Run with the USER-CUDA package from the command line:**
|
|
|
|
The mpirun or mpiexec command sets the total number of MPI tasks used
|
|
by LAMMPS (one or multiple per compute node) and the number of MPI
|
|
tasks used per node. E.g. the mpirun command in MPICH does this via
|
|
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
|
|
|
|
When using the USER-CUDA package, you must use exactly one MPI task
|
|
per physical GPU.
|
|
|
|
You must use the "-c on" :ref:`command-line switch <start_7>` to enable the USER-CUDA package.
|
|
The "-c on" switch also issues a default :doc:`package cuda 1 <package>`
|
|
command which sets various USER-CUDA options to default values, as
|
|
discussed on the :doc:`package <package>` command doc page.
|
|
|
|
Use the "-sf cuda" :ref:`command-line switch <start_7>`,
|
|
which will automatically append "cuda" to styles that support it. Use
|
|
the "-pk cuda Ng" :ref:`command-line switch <start_7>` to
|
|
set Ng = # of GPUs per node to a different value than the default set
|
|
by the "-c on" switch (1 GPU) or change other :doc:`package cuda <package>` options.
|
|
|
|
.. parsed-literal::
|
|
|
|
lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU
|
|
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
|
|
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes
|
|
|
|
The syntax for the "-pk" switch is the same as the "package
|
|
cuda" command. See the :doc:`package <package>` command doc page for
|
|
details, including the default values used for all its options if it
|
|
is not specified.
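The rule of one MPI task per GPU fixes the relation between the mpirun arguments and the "-pk cuda Ng" value. The Python sketch below shows one way to derive the command line from the node and GPU counts. The function ``build_run_command`` is a hypothetical helper, the executable and script names are the placeholders used in the examples above, and the "-ppn" switch assumes an MPICH-style mpirun.

```python
# Hypothetical helper (not part of LAMMPS): build the launch command so that
# the number of MPI tasks per node equals the number of GPUs per node.
def build_run_command(nodes, gpus_per_node, script="in.script", exe="lmp_machine"):
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    total_tasks = nodes * gpus_per_node  # exactly one MPI task per GPU
    lammps = [exe, "-c", "on", "-sf", "cuda",
              "-pk", "cuda", str(gpus_per_node), "-in", script]
    if total_tasks == 1:
        return " ".join(lammps)  # a single task needs no mpirun
    mpi = ["mpirun", "-np", str(total_tasks)]
    if nodes > 1:
        # MPICH-style switch; OpenMPI uses -npernode instead
        mpi += ["-ppn", str(gpus_per_node)]
    return " ".join(mpi + lammps)


if __name__ == "__main__":
    print(build_run_command(1, 1))
    print(build_run_command(1, 2))
    print(build_run_command(12, 2))
```

With these arguments the three printed lines reproduce the three example commands shown above.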
|
|
|
|
Note that the default for the :doc:`package cuda <package>` command is
|
|
to set the Newton flag to "off" for both pairwise and bonded
|
|
interactions. This typically gives fastest performance. If the
|
|
:doc:`newton <newton>` command is used in the input script, it can
|
|
override these defaults.
|
|
|
|
**Or run with the USER-CUDA package by editing an input script:**
|
|
|
|
The discussion above for the mpirun/mpiexec command and the requirement
|
|
of one MPI task per GPU also apply here.
|
|
|
|
You must still use the "-c on" :ref:`command-line switch <start_7>` to enable the USER-CUDA package.
|
|
|
|
Use the :doc:`suffix cuda <suffix>` command, or you can explicitly add a
|
|
"cuda" suffix to individual styles in your input script, e.g.
|
|
|
|
.. parsed-literal::
|
|
|
|
pair_style lj/cut/cuda 2.5
|
|
|
|
You only need to use the :doc:`package cuda <package>` command if you
|
|
wish to change any of its option defaults, including the number of
|
|
GPUs/node (default = 1), as set by the "-c on" :ref:`command-line switch <start_7>`.
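To make the input-script route concrete, here is a small Python sketch that produces the relevant script lines as text. Both helpers are hypothetical and not part of LAMMPS. They do not check whether a given style actually has a USER-CUDA variant, which the "-sf cuda" switch and the suffix command handle on their own.

```python
# Hypothetical helpers (not part of LAMMPS); they only manipulate text.
def with_cuda_suffix(style):
    """Append the 'cuda' suffix to a style name unless it is already there."""
    return style if style.endswith("/cuda") else style + "/cuda"


def cuda_header(gpus_per_node=1):
    """Return input-script lines that select USER-CUDA styles.

    The 'package cuda' line is only emitted when the number of GPUs per
    node differs from the default of 1 set by the "-c on" switch.
    """
    lines = []
    if gpus_per_node != 1:
        lines.append(f"package cuda {gpus_per_node}")
    lines.append("suffix cuda")
    return lines


if __name__ == "__main__":
    print("\n".join(cuda_header(gpus_per_node=2)))
    print(f"pair_style {with_cuda_suffix('lj/cut')} 2.5")
```

The last printed line matches the explicit pair_style example above. In practice either the suffix command or the explicit suffix is sufficient, so the two are shown together only for comparison.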
|
|
|
|
**Speed-ups to expect:**
|
|
|
|
The performance of a GPU versus a multi-core CPU is a function of your
|
|
hardware, which pair style is used, the number of atoms/GPU, and the
|
|
precision used on the GPU (double, single, mixed).
|
|
|
|
See the `Benchmark page <http://lammps.sandia.gov/bench.html>`_ of the
|
|
LAMMPS web site for performance of the USER-CUDA package on different
|
|
hardware.
|
|
|
|
**Guidelines for best performance:**
|
|
|
|
* The USER-CUDA package offers more speed-up relative to CPU performance
|
|
when the number of atoms per GPU is large, e.g. on the order of tens
|
|
or hundreds of thousands.
|
|
* As noted above, this package will continue to run a simulation
|
|
entirely on the GPU(s) (except for inter-processor MPI communication),
|
|
for multiple timesteps, until a CPU calculation is required, either by
|
|
a fix or compute that is non-GPU-ized, or until output is performed
|
|
(thermo or dump snapshot or restart file). The less often this
|
|
occurs, the faster your simulation will run.
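The effect of output frequency in the last guideline can be estimated with a simple count. The Python sketch below uses a deliberately simplified model that is an assumption, not a description of LAMMPS internals: output occurs on every timestep that is a multiple of its interval, and each such timestep forces one transfer of data back to the CPU. The function name and parameters are hypothetical.

```python
# Simplified model (an assumption, not LAMMPS internals): output happens on
# every timestep that is a multiple of its interval, and each such timestep
# requires one transfer of data back to the CPU.
def count_cpu_sync_steps(total_steps, thermo_every=0, dump_every=0):
    """Count timesteps in 1..total_steps on which thermo or dump output occurs.

    An interval of 0 means that kind of output is disabled.
    """
    intervals = [n for n in (thermo_every, dump_every) if n > 0]
    return sum(
        1
        for step in range(1, total_steps + 1)
        if any(step % n == 0 for n in intervals)
    )


if __name__ == "__main__":
    # Same run length, thermo output every 10 steps versus every 100 steps.
    print(count_cpu_sync_steps(10000, thermo_every=10, dump_every=1000))
    print(count_cpu_sync_steps(10000, thermo_every=100, dump_every=1000))
```

Under this model, raising the thermo interval from 10 to 100 steps cuts the number of synchronizing timesteps in a 10000-step run by a factor of ten. Any non-USER-CUDA fix or compute adds to the count in the same way.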
|
|
Restrictions
|
|
""""""""""""
|
|
|
|
|
|
None.
|
|
|
|
|
|
.. _lws: http://lammps.sandia.gov
|
|
.. _ld: Manual.html
|
|
.. _lc: Section_commands.html#comm
|