"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.1 USER-CUDA package :h4
|
The USER-CUDA package was developed by Christian Trott (Sandia) while
at the Ilmenau University of Technology in Germany. It provides NVIDIA
GPU versions of many pair styles, many fixes, a few computes, and
long-range Coulombics via the PPPM command. It has the following
general features:
|
The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back and forth between the CPU and GPU. :ulb,l

The speed-up from this approach is typically larger when the
number of atoms per GPU is large. :l

Data will stay on the GPU until a timestep where a non-USER-CUDA fix
or compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently. :l

Neighbor lists are constructed on the GPU. :l

The package only supports use of a single MPI task, running on a
single CPU (core), assigned to each GPU. :l,ule
|
Here is a quick overview of how to use the USER-CUDA package:

build the library in lib/cuda for your GPU hardware with the desired precision
include the USER-CUDA package and build LAMMPS
use the mpirun command to specify 1 MPI task per GPU (on each node)
enable the USER-CUDA package via the "-c on" command-line switch
specify the number of GPUs per node
use USER-CUDA styles in your input script :ul

The latter two steps can be done using the "-pk cuda" and "-sf cuda"
"command-line switches"_Section_start.html#start_7 respectively.
Alternatively, the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package cuda"_package.html or "suffix
cuda"_suffix.html commands respectively to your input script.
|
[Required hardware/software:]

To use this package, you need to have one or more NVIDIA GPUs and
install the NVIDIA CUDA software on your system.

Your NVIDIA GPU needs to support Compute Capability 1.3 or higher.
The following list may help you to find out the Compute Capability of
your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the NVIDIA CUDA Toolkit (version 3.2 or higher) and the
corresponding GPU drivers. The NVIDIA CUDA SDK is not required, but
we recommend installing it as well, so that you can verify that its
sample projects compile without problems.
|
[Building LAMMPS with the USER-CUDA package:]

This requires two steps (a,b): build the USER-CUDA library, then build
LAMMPS with the USER-CUDA package.

You can do both these steps in one line, using the src/Make.py script,
described in "Section 2.4"_Section_start.html#start_4 of the manual.
Type "Make.py -h" for help. If run from the src directory, this
command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the
starting Makefile.machine:

Make.py -p cuda -cuda mode=single arch=20 -o cuda lib-cuda file mpi :pre

Or you can follow these two (a,b) steps:
|
(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda. If your {CUDA} toolkit
is not installed in the default system directory {/usr/local/cuda}, edit
the file {lib/cuda/Makefile.common} accordingly.

To build the library with the settings in lib/cuda/Makefile.defaults,
simply type:

make :pre

To set options when the library is built, type "make OPTIONS", where
{OPTIONS} are one or more of the following. The settings will be
written to {lib/cuda/Makefile.defaults} before the build.

{precision=N} to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
{arch=M} to set GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
{prec_timer=0/1} to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  this is usually only useful for Mac machines
{dbg=0/1} to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
{cufft=0/1} to use the CUDA FFT library
  0 = no CUFFT support (default)
  1 = use CUFFT
  in the future other CUDA-enabled FFT libraries might be supported :pre

If the build is successful, it will produce the files liblammpscuda.a and
Makefile.lammps.

Note that if you change any of the options (like precision), you need
to re-build the entire library. Do a "make clean" first, followed by
"make".
|
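As a rough illustration of how the options above combine, the following shell sketch checks a precision and arch value against the lists above and prints the matching make command. It does not build anything. The helper name cuda_make_cmd is hypothetical and not part of LAMMPS; only the option names and values come from the lists above.

```shell
#!/bin/sh
# Sketch only: validate two USER-CUDA library options and print the
# resulting make command. cuda_make_cmd is a hypothetical helper,
# not something shipped with LAMMPS.
cuda_make_cmd() {
    precision=$1   # 1..4, see the precision=N list above
    arch=$2        # 13, 20, 21 or 35, see the arch=M list above

    case "$precision" in
        1|2|3|4) ;;
        *) echo "unsupported precision: $precision" >&2; return 1 ;;
    esac
    case "$arch" in
        13|20|21|35) ;;
        *) echo "unsupported arch: $arch" >&2; return 1 ;;
    esac

    echo "make precision=$precision arch=$arch"
}

# Example: double precision on a Kepler GPU
cuda_make_cmd 2 35
```

The printed command would then be run inside lib/cuda, after a "make clean" if the library was built before with different options.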
|
(b) Build LAMMPS with the USER-CUDA package

cd lammps/src
make yes-user-cuda
make machine :pre

No additional compile/link flags are needed in Makefile.machine.

Note that if you change the USER-CUDA library precision (discussed
above) and rebuild the USER-CUDA library, then you also need to
re-install the USER-CUDA package and re-build LAMMPS, so that all
affected files are re-compiled and linked to the new USER-CUDA
library.
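
The rebuild sequence described in the note above could look roughly like the following dry-run sketch, which only prints the commands rather than running them. It assumes that "re-install the package" means removing it with "make no-user-cuda" and adding it again with "make yes-user-cuda". The helper name rebuild_steps, the directory name, and the target name "machine" are placeholders.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for a full rebuild after changing
# the USER-CUDA library precision. Nothing is executed.
# Assumption: re-installing the package = "make no-user-cuda" followed
# by "make yes-user-cuda". rebuild_steps is a hypothetical helper.
rebuild_steps() {
    lammps_dir=$1   # top-level LAMMPS directory (placeholder)
    precision=$2    # new library precision, 1..4
    machine=$3      # your Makefile.machine target (placeholder)
    cat <<EOF
cd $lammps_dir/lib/cuda
make clean
make precision=$precision
cd $lammps_dir/src
make no-user-cuda
make yes-user-cuda
make $machine
EOF
}

rebuild_steps lammps 2 machine
```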
|
[Run with the USER-CUDA package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.

You must use the "-c on" "command-line
switch"_Section_start.html#start_7 to enable the USER-CUDA package.
The "-c on" switch also issues a default "package cuda 1"_package.html
command which sets various USER-CUDA options to default values, as
discussed on the "package"_package.html command doc page.

Use the "-sf cuda" "command-line switch"_Section_start.html#start_7,
which will automatically append "cuda" to styles that support it. Use
the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to
set Ng = number of GPUs per node to a different value than the default
set by the "-c on" switch (1 GPU) or to change other "package
cuda"_package.html options.

lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core node
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes :pre

The syntax for the "-pk" switch is the same as for the "package
cuda" command. See the "package"_package.html command doc page for
details, including the default values used for all its options if it
is not specified.

Note that the default for the "package cuda"_package.html command is
to set the Newton flag to "off" for both pairwise and bonded
interactions. This typically gives fastest performance. If the
"newton"_newton.html command is used in the input script, it can
override these defaults.
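
Because exactly one MPI task is required per GPU, the total task count in the mpirun examples above is the number of nodes times the number of GPUs per node. The following sketch assembles such a command line for MPICH-style switches (-np, -ppn). The helper name cuda_run_cmd is hypothetical, and lmp_machine stands for your own executable as in the examples above.

```shell
#!/bin/sh
# Sketch only: print an MPICH-style launch line with one MPI task per GPU.
# cuda_run_cmd is a hypothetical helper; lmp_machine is a placeholder
# for the name of your LAMMPS executable.
cuda_run_cmd() {
    nodes=$1           # number of compute nodes
    gpus_per_node=$2   # physical GPUs on each node
    input=$3           # LAMMPS input script

    # one MPI task per physical GPU
    ntasks=$((nodes * gpus_per_node))

    echo "mpirun -np $ntasks -ppn $gpus_per_node lmp_machine -c on -sf cuda -pk cuda $gpus_per_node -in $input"
}

# 12 nodes with 2 GPUs each
cuda_run_cmd 12 2 in.script
```

With OpenMPI, the -ppn switch would be replaced by -npernode, as noted above.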
|
[Or run with the USER-CUDA package by editing an input script:]

The discussion above for the mpirun/mpiexec command and the requirement
of one MPI task per GPU applies here as well.

You must still use the "-c on" "command-line
switch"_Section_start.html#start_7 to enable the USER-CUDA package.

Use the "suffix cuda"_suffix.html command, or you can explicitly add a
"cuda" suffix to individual styles in your input script, e.g.

pair_style lj/cut/cuda 2.5 :pre

You only need to use the "package cuda"_package.html command if you
wish to change any of its option defaults, including the number of
GPUs/node (default = 1), as set by the "-c on" "command-line
switch"_Section_start.html#start_7.
|
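To make the input-script route concrete, the following sketch writes a small input fragment that requests 2 GPUs per node and applies the cuda suffix to all styles that support it. The file name in.cuda_sketch is hypothetical, and the fragment is not a complete simulation: the remaining setup commands (units, atoms, run, and so on) are omitted. LAMMPS must still be launched with the "-c on" switch.

```shell
#!/bin/sh
# Sketch only: write a hypothetical input fragment to in.cuda_sketch.
# The quoted heredoc delimiter keeps the text exactly as written.
cat > in.cuda_sketch <<'EOF'
# fragment only; launch LAMMPS with "-c on" to enable USER-CUDA
package cuda 2            # 2 GPUs per node instead of the default 1
suffix cuda               # append "cuda" to all styles that support it
pair_style lj/cut 2.5     # runs as lj/cut/cuda because of the suffix
EOF
```

Using the suffix command keeps the style names in the script unchanged, so the same script can still be run without the USER-CUDA package by removing the first two commands.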
|
[Speed-ups to expect:]

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the USER-CUDA package on different
hardware.
|
[Guidelines for best performance:]

The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of thousands. :ulb,l

As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is not GPU-enabled, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation will run. :l,ule
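
A quick way to apply the first guideline is to estimate the number of atoms each GPU will hold before launching a run. The sketch below does this with integer arithmetic. The helper names are hypothetical, and the 10000-atom warning threshold is only an assumption taken from the "tens of thousands" lower end mentioned above, not a documented limit.

```shell
#!/bin/sh
# Sketch only: estimate atoms per GPU for a planned run.
# atoms_per_gpu and check_gpu_load are hypothetical helpers.
atoms_per_gpu() {
    natoms=$1; nodes=$2; gpus_per_node=$3
    # integer division: total atoms over total GPUs
    echo $((natoms / (nodes * gpus_per_node)))
}

check_gpu_load() {
    n=$(atoms_per_gpu "$@")
    # assumed threshold: below roughly 10000 atoms per GPU the
    # speed-up is likely to be small
    if [ "$n" -lt 10000 ]; then
        echo "$n atoms/GPU: likely too few for a good speed-up"
    else
        echo "$n atoms/GPU"
    fi
}

# 2 million atoms on 12 nodes with 2 GPUs each
check_gpu_load 2000000 12 2
```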
|
[Restrictions:]

None.