lammps/doc/accelerate_cuda.txt

"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.1 USER-CUDA package :h4

The USER-CUDA package was developed by Christian Trott (Sandia) while
at Ilmenau University of Technology in Germany.  It provides NVIDIA GPU
versions of many pair styles, many fixes, and a few computes, and
supports long-range Coulombics via the PPPM command.  It has the
following general features:

The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU. :ulb,l

The speed-up advantage of this approach is typically greater when the
number of atoms per GPU is large. :l

Data will stay on the GPU until a timestep where a non-USER-CUDA fix
or compute is invoked.  Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently. :l

Neighbor lists are constructed on the GPU. :l

The package only supports use of a single MPI task, running on a
single CPU (core), assigned to each GPU. :l,ule

Here is a quick overview of how to use the USER-CUDA package:

build the library in lib/cuda for your GPU hardware with desired precision
include the USER-CUDA package and build LAMMPS
use the mpirun command to specify 1 MPI task per GPU (on each node)
enable the USER-CUDA package via the "-c on" command-line switch
specify the # of GPUs per node
use USER-CUDA styles in your input script :ul

The latter two steps can be done using the "-pk cuda" and "-sf cuda"
"command-line switches"_Section_start.html#start_7 respectively.  Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package cuda"_package.html or "suffix cuda"_suffix.html commands
respectively to your input script.

[Required hardware/software:]

To use this package, you need to have one or more NVIDIA GPUs and
install the NVIDIA Cuda software on your system:

Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This
list may help you to find out the Compute Capability of your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
corresponding GPU drivers.  The Nvidia Cuda SDK is not required, but
we recommend it also be installed.  You can then make sure its sample
projects can be compiled without problems.
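
As a rough sketch of how you might check the toolkit requirement above,
the shell helpers below extract the release number from the output of
"nvcc --version" and compare it against the 3.2 minimum.  The function
names {nvcc_release} and {version_at_least} are hypothetical (they are
not part of LAMMPS or CUDA), and the parsing assumes that nvcc prints a
line containing "release X.Y".

```shell
# Hypothetical helper: read "nvcc --version" output on stdin and print
# the toolkit release as major.minor (assumes a "release X.Y" line).
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if version $1 >= version $2.
# Both arguments are expected in major.minor form.
version_at_least() {
  have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
  need_major=${2%%.*}; need_rest=${2#*.}; need_minor=${need_rest%%.*}
  [ "$have_major" -gt "$need_major" ] ||
    { [ "$have_major" -eq "$need_major" ] &&
      [ "$have_minor" -ge "$need_minor" ]; }
}

# Usage on a machine where the toolkit is installed:
#   rel=$(nvcc --version | nvcc_release)
#   version_at_least "$rel" 3.2 && echo "toolkit $rel is new enough"
```

The comparison is done on integers rather than strings, so that a
release such as 10.1 is correctly treated as newer than 3.2.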

[Building LAMMPS with the USER-CUDA package:]

This requires two steps (a,b): build the USER-CUDA library, then build
LAMMPS with the USER-CUDA package.

(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda.  If your {CUDA} toolkit
is not installed in the default system directory {/usr/local/cuda},
edit the file {lib/cuda/Makefile.common} accordingly.

To set options for the library build, type "make OPTIONS", where
{OPTIONS} are one or more of the following. The settings will be
written to the file {lib/cuda/Makefile.defaults} and used when
the library is built.

{precision=N} to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
{arch=M} to set GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
{prec_timer=0/1} to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  this is usually only useful for Mac machines 
{dbg=0/1} to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
{cufft=0/1} for use of the CUDA FFT library
  0 = no CUFFT support (default)
  1 = use CUFFT
  in the future other CUDA-enabled FFT libraries might be supported :pre

To build the library, simply type:

make :pre

If successful, it will produce the files libcuda.a and Makefile.lammps.
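
To make the two library build steps above concrete, the sketch below
prints (but does not run) one possible command sequence.  The helper
names {cc_to_arch} and {cuda_lib_build_cmds} are hypothetical, and the
values used (double precision, a Kepler GPU with compute capability
3.5) are only example choices; substitute the settings that match your
hardware.

```shell
# Hypothetical helper: turn a compute capability such as "3.5" into
# the arch=M value listed above (35) by dropping the dot.
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Hypothetical helper: print the commands for a library build.
# $1 = precision level (1-4), $2 = compute capability (e.g. 3.5)
cuda_lib_build_cmds() {
  echo "cd lammps/lib/cuda"
  echo "make precision=$1 arch=$(cc_to_arch "$2")"   # set the options
  echo "make"                                        # build the library
}

# Example: double precision on a Kepler GPU
cuda_lib_build_cmds 2 3.5
```

Printing the commands first lets you inspect the option values before
running them by hand in lammps/lib/cuda.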

Note that if you change any of the options (like precision), you need
to re-build the entire library.  Do a "make clean" first, followed by
"make".

(b) Build LAMMPS with the USER-CUDA package

cd lammps/src
make yes-user-cuda
make machine :pre

No additional compile/link flags are needed in your Makefile.machine
in src/MAKE.

Note that if you change the USER-CUDA library precision (discussed
above) and rebuild the USER-CUDA library, then you also need to
re-install the USER-CUDA package and re-build LAMMPS, so that all
affected files are re-compiled and linked to the new USER-CUDA
library.

[Run with the USER-CUDA package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node.  E.g. the mpirun command does this via its -np
and -ppn switches.

When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.

You must use the "-c on" "command-line
switch"_Section_start.html#start_7 to enable the USER-CUDA package.
The "-c on" switch also issues a default "package cuda 1"_package.html
command which sets various USER-CUDA options to default values, as
discussed on the "package"_package.html command doc page.

Use the "-sf cuda" "command-line switch"_Section_start.html#start_7,
which will automatically append "cuda" to styles that support it.  Use
the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to
set Ng = # of GPUs per node to a different value than the default set
by the "-c on" switch (1 GPU) or change other "package
cuda"_package.html options.

lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes :pre
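
The task counts in the examples above follow from the rule of one MPI
task per GPU: the total number of tasks is the number of nodes times
the number of GPUs per node.  As a minimal sketch, the hypothetical
helper {cuda_mpirun_cmd} below prints the matching launch line.  It
assumes an mpirun that accepts the -np and -ppn switches and an
executable named lmp_machine, as in the examples above.

```shell
# Hypothetical helper: print (not run) an mpirun command line that
# starts exactly one MPI task per GPU.
# $1 = number of nodes, $2 = GPUs per node, $3 = input script
cuda_mpirun_cmd() {
  ntasks=$(( $1 * $2 ))   # one MPI task for every GPU
  echo "mpirun -np $ntasks -ppn $2 lmp_machine -c on -sf cuda -pk cuda $2 -in $3"
}

# Example: 12 nodes with 2 GPUs each gives 24 MPI tasks
cuda_mpirun_cmd 12 2 in.script
```

For 12 nodes with 2 GPUs each, this reproduces the last command line
shown above.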

The syntax for the "-pk" switch is the same as that of the "package
cuda" command.  See the "package"_package.html command doc page for
details, including the default values used for all its options if it
is not specified.

Note that the default for the "package cuda"_package.html command is
to set the Newton flag to "off" for both pairwise and bonded
interactions.  This typically gives fastest performance.  If the
"newton"_newton.html command is used in the input script, it can
override these defaults.

[Or run with the USER-CUDA package by editing an input script:]

The discussion above of the mpirun/mpiexec command and the requirement
of one MPI task per GPU applies here as well.

You must still use the "-c on" "command-line
switch"_Section_start.html#start_7 to enable the USER-CUDA package.

Use the "suffix cuda"_suffix.html command, or you can explicitly add a
"cuda" suffix to individual styles in your input script, e.g.

pair_style lj/cut/cuda 2.5 :pre

You only need to use the "package cuda"_package.html command if you
wish to change any of its option defaults, including the number of
GPUs/node (default = 1), as set by the "-c on" "command-line
switch"_Section_start.html#start_7.
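
As a rough illustration of the input-script route, the sketch below
prints the few lines you might place near the top of an input script.
The helper name {write_cuda_fragment} is hypothetical, the GPU count
passed to it is an example value, and the output is a fragment rather
than a complete input script.  LAMMPS must still be launched with the
"-c on" switch.

```shell
# Hypothetical helper: print input-script lines that select the
# USER-CUDA styles.  $1 = number of GPUs per node.
write_cuda_fragment() {
  cat <<EOF
package cuda $1
suffix cuda
pair_style lj/cut 2.5
EOF
}

# Example: fragment for a node with 2 GPUs
write_cuda_fragment 2
```

With "suffix cuda" in effect, the pair style is written without an
explicit suffix; the "cuda" variant is then selected automatically for
styles that support it.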

[Speed-ups to expect:]

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the USER-CUDA package on different
hardware.

[Guidelines for best performance:]

The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of thousands. :ulb,l

As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file).  The less often this
occurs, the faster your simulation will run. :l,ule

[Restrictions:]

None.