lammps/doc/accelerate_gpu.txt

"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.2 GPU package :h4

The GPU package was developed by Mike Brown at ORNL and his
collaborators, particularly Trung Nguyen (ORNL).  It provides GPU
versions of many pair styles, including the 3-body Stillinger-Weber
pair style, and for "kspace_style pppm"_kspace_style.html for
long-range Coulombics.  It has the following general features:

It is designed to exploit common GPU hardware configurations where one
or more GPUs are coupled to many cores of one or more multi-core CPUs,
e.g. within a node of a parallel machine. :ulb,l

Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU(s) and GPU every timestep. :l

Neighbor lists can be built on the CPU or on the GPU :l

The charge assignement and force interpolation portions of PPPM can be
run on the GPU.  The FFT portion, which requires MPI communication
between processors, runs on the CPU. :l

Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU. :l

It allows for GPU computations to be performed in single or double
precision, or in mixed-mode precision, where pairwise forces are
computed in single precision, but accumulated into double-precision
force vectors. :l

LAMMPS-specific code is in the GPU package.  It makes calls to a
generic GPU library in the lib/gpu directory.  This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware. :l,ule

Here is a quick overview of how to use the GPU package:

build the library in lib/gpu for your GPU hardware wity desired precision
include the GPU package and build LAMMPS
use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
specify the # of GPUs per node
use GPU styles in your input script :ul

The latter two steps can be done using the "-pk gpu" and "-sf gpu"
"command-line switches"_Section_start.html#start_7 respectively.  Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package gpu"_package.html or "suffix gpu"_suffix.html commands
respectively to your input script.

[Required hardware/software:]

To use this package, you currently need to have an NVIDIA GPU and
install the NVIDIA Cuda software on your system:

Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
Go to http://www.nvidia.com/object/cuda_get.html
Install a driver and toolkit appropriate for your system (SDK is not necessary)
Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul

[Building LAMMPS with the GPU package:]

This requires two steps (a,b): build the GPU library, then build
LAMMPS with the GPU package.

You can do both these steps in one line, using the src/Make.py script,
described in "Section 2.4"_Section_start.html#start_4 of the manual.
Type "Make.py -h" for help.  If run from the src directory, this
command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the
starting Makefile.machine:

Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi :pre

Or you can follow these two (a,b) steps:

(a) Build the GPU library

The GPU library is in lammps/lib/gpu.  Select a Makefile.machine (in
lib/gpu) appropriate for your system.  You should pay special
attention to 3 settings in this makefile.

CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
CUDA_ARCH = needs to be appropriate to your GPUs
CUDA_PREC = precision (double, mixed, single) you desire :ul

See lib/gpu/Makefile.linux.double for examples of the ARCH settings
for different GPU choices, e.g. Fermi vs Kepler.  It also lists the
possible precision settings:

CUDA_PREC = -D_SINGLE_SINGLE  # single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE  # double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE  # accumulation of forces, etc, in double :pre

The last setting is the mixed mode referred to above.  Note that your
GPU must support double precision to use either the 2nd or 3rd of
these settings.

To build the library, type:

make -f Makefile.machine :pre

If successful, it will produce the files libgpu.a and Makefile.lammps.

The latter file has 3 settings that need to be appropriate for the
paths and settings for the CUDA system software on your machine.
Makefile.lammps is a copy of the file specified by the EXTRAMAKE
setting in Makefile.machine.  You can change EXTRAMAKE or create your
own Makefile.lammps.machine if needed.

Note that to change the precision of the GPU library, you need to
re-build the entire library.  Do a "clean" first, e.g. "make -f
Makefile.linux clean", followed by the make command above.

(b) Build LAMMPS with the GPU package

cd lammps/src
make yes-gpu
make machine :pre

No additional compile/link flags are needed in Makefile.machine.

Note that if you change the GPU library precision (discussed above)
and rebuild the GPU library, then you also need to re-install the GPU
package and re-build LAMMPS, so that all affected files are
re-compiled and linked to the new GPU library.

[Run with the GPU package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node.  E.g. the mpirun command in MPICH does this via
its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.

When using the GPU package, you cannot assign more than one GPU to a
single MPI task.  However multiple MPI tasks can share the same GPU,
and in many cases it will be more efficient to run this way.  Likewise
it may be more efficient to use less MPI tasks/node than the available
# of CPU cores.  Assignment of multiple MPI tasks to a GPU will happen
automatically if you create more MPI tasks/node than there are
GPUs/mode.  E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
shared by 4 MPI tasks.

Use the "-sf gpu" "command-line switch"_Section_start.html#start_7,
which will automatically append "gpu" to styles that support it.  Use
the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to
set Ng = # of GPUs/node to use.

lmp_machine -sf gpu -pk gpu 1 -in in.script                         # 1 MPI task uses 1 GPU
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script           # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script   # ditto on 4 16-core nodes :pre

Note that if the "-sf gpu" switch is used, it also issues a default
"package gpu 1"_package.html command, which sets the number of
GPUs/node to 1.

Using the "-pk" switch explicitly allows for setting of the number of
GPUs/node to use and additional options.  Its syntax is the same as
same as the "package gpu" command.  See the "package"_package.html
command doc page for details, including the default values used for
all its options if it is not specified.

Note that the default for the "package gpu"_package.html command is to
set the Newton flag to "off" pairwise interactions.  It does not
affect the setting for bonded interactions (LAMMPS default is "on").
The "off" setting for pairwise interaction is currently required for
GPU package pair styles.

[Or run with the GPU package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node,
and use of multiple MPI tasks/GPU is the same.

Use the "suffix gpu"_suffix.html command, or you can explicitly add an
"gpu" suffix to individual styles in your input script, e.g.

pair_style lj/cut/gpu 2.5 :pre

You must also use the "package gpu"_package.html command to enable the
GPU package, unless the "-sf gpu" or "-pk gpu" "command-line
switches"_Section_start.html#start_7 were used.  It specifies the
number of GPUs/node to use, as well as other options.

[Speed-ups to expect:]

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the GPU package on various
hardware, including the Titan HPC platform at ORNL.

You should also experiment with how many MPI tasks per GPU to use to
give the best performance for your problem and machine.  This is also
a function of the problem size and the pair style being using.
Likewise, you should experiment with the precision setting for the GPU
library to see if single or mixed precision will give accurate
results, since they will typically be faster.

[Guidelines for best performance:]

Using multiple MPI tasks per GPU will often give the best performance,
as allowed my most multi-core CPU/GPU configurations. :ulb,l

If the number of particles per MPI task is small (e.g. 100s of
particles), it can be more efficient to run with fewer MPI tasks per
GPU, even if you do not use all the cores on the compute node. :l

The "package gpu"_package.html command has several options for tuning
performance.  Neighbor lists can be built on the GPU or CPU.  Force
calculations can be dynamically balanced across the CPU cores and
GPUs.  GPU-specific settings can be made which can be optimized
for different hardware.  See the "packakge"_package.html command
doc page for details. :l

As described by the "package gpu"_package.html command, GPU
accelerated pair styles can perform computations asynchronously with
CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with "bond"_bond_style.html,
"angle"_angle_style.html, "dihedral"_dihedral_style.html,
"improper"_improper_style.html, and "long-range"_kspace_style.html
calculations will not be included in the "Pair" time. :l

When the {mode} setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time.  An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc) are output only with the LAMMPS
screen output (not in the log file) at the end of each run.  These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations. :l

The output section "GPU Time Info (average)" reports "Max Mem / Proc".
This is the maximum memory used at one time on the GPU for data
storage by a single MPI process. :l,ule

[Restrictions:]

None.
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00			`"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -`
			`"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c`

			`:link(lws,http://lammps.sandia.gov)`
			`:link(ld,Manual.html)`
			`:link(lc,Section_commands.html#comm)`

			`:line`

			`"Return to Section accelerate overview"_Section_accelerate.html`

			`5.3.2 GPU package :h4`

			`The GPU package was developed by Mike Brown at ORNL and his`
			`collaborators, particularly Trung Nguyen (ORNL). It provides GPU`
			`versions of many pair styles, including the 3-body Stillinger-Weber`
			`pair style, and for "kspace_style pppm"_kspace_style.html for`
			`long-range Coulombics. It has the following general features:`

			`It is designed to exploit common GPU hardware configurations where one`
			`or more GPUs are coupled to many cores of one or more multi-core CPUs,`
			`e.g. within a node of a parallel machine. :ulb,l`

			`Atom-based data (e.g. coordinates, forces) moves back-and-forth`
			`between the CPU(s) and GPU every timestep. :l`

			`Neighbor lists can be built on the CPU or on the GPU :l`

			`The charge assignement and force interpolation portions of PPPM can be`
			`run on the GPU. The FFT portion, which requires MPI communication`
			`between processors, runs on the CPU. :l`

			`Asynchronous force computations can be performed simultaneously on the`
			`CPU(s) and GPU. :l`

			`It allows for GPU computations to be performed in single or double`
			`precision, or in mixed-mode precision, where pairwise forces are`
			`computed in single precision, but accumulated into double-precision`
			`force vectors. :l`

			`LAMMPS-specific code is in the GPU package. It makes calls to a`
			`generic GPU library in the lib/gpu directory. This library provides`
			`NVIDIA support as well as more general OpenCL support, so that the`
			`same functionality can eventually be supported on a variety of GPU`
			`hardware. :l,ule`

			`Here is a quick overview of how to use the GPU package:`

			`build the library in lib/gpu for your GPU hardware wity desired precision`
			`include the GPU package and build LAMMPS`
			`use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU`
			`specify the # of GPUs per node`
			`use GPU styles in your input script :ul`

			`The latter two steps can be done using the "-pk gpu" and "-sf gpu"`
			`"command-line switches"_Section_start.html#start_7 respectively. Or`
			`the effect of the "-pk" or "-sf" switches can be duplicated by adding`
			`the "package gpu"_package.html or "suffix gpu"_suffix.html commands`
			`respectively to your input script.`

			`[Required hardware/software:]`

			`To use this package, you currently need to have an NVIDIA GPU and`
			`install the NVIDIA Cuda software on your system:`

			`Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information`
			`Go to http://www.nvidia.com/object/cuda_get.html`
			`Install a driver and toolkit appropriate for your system (SDK is not necessary)`
			`Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul`

			`[Building LAMMPS with the GPU package:]`

			`This requires two steps (a,b): build the GPU library, then build`
			`LAMMPS with the GPU package.`

git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12592 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-10-07 23:08:33 +08:00			`You can do both these steps in one line, using the src/Make.py script,`
			`described in "Section 2.4"_Section_start.html#start_4 of the manual.`
			`Type "Make.py -h" for help. If run from the src directory, this`
			`command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the`
			`starting Makefile.machine:`

			`Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi :pre`

			`Or you can follow these two (a,b) steps:`

git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00			`(a) Build the GPU library`

			`The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in`
			`lib/gpu) appropriate for your system. You should pay special`
			`attention to 3 settings in this makefile.`

			`CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system`
			`CUDA_ARCH = needs to be appropriate to your GPUs`
			`CUDA_PREC = precision (double, mixed, single) you desire :ul`

			`See lib/gpu/Makefile.linux.double for examples of the ARCH settings`
			`for different GPU choices, e.g. Fermi vs Kepler. It also lists the`
			`possible precision settings:`

			`CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations`
			`CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations`
			`CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double :pre`

			`The last setting is the mixed mode referred to above. Note that your`
			`GPU must support double precision to use either the 2nd or 3rd of`
			`these settings.`

			`To build the library, type:`

			`make -f Makefile.machine :pre`

			`If successful, it will produce the files libgpu.a and Makefile.lammps.`

			`The latter file has 3 settings that need to be appropriate for the`
			`paths and settings for the CUDA system software on your machine.`
			`Makefile.lammps is a copy of the file specified by the EXTRAMAKE`
			`setting in Makefile.machine. You can change EXTRAMAKE or create your`
			`own Makefile.lammps.machine if needed.`

			`Note that to change the precision of the GPU library, you need to`
			`re-build the entire library. Do a "clean" first, e.g. "make -f`
			`Makefile.linux clean", followed by the make command above.`

			`(b) Build LAMMPS with the GPU package`

			`cd lammps/src`
			`make yes-gpu`
			`make machine :pre`

git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12592 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-10-07 23:08:33 +08:00			`No additional compile/link flags are needed in Makefile.machine.`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00
			`Note that if you change the GPU library precision (discussed above)`
			`and rebuild the GPU library, then you also need to re-install the GPU`
			`package and re-build LAMMPS, so that all affected files are`
			`re-compiled and linked to the new GPU library.`

			`[Run with the GPU package from the command line:]`

			`The mpirun or mpiexec command sets the total number of MPI tasks used`
			`by LAMMPS (one or multiple per compute node) and the number of MPI`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12503 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-12 05:14:29 +08:00			`tasks used per node. E.g. the mpirun command in MPICH does this via`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12508 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-13 05:19:51 +08:00			`its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00
			`When using the GPU package, you cannot assign more than one GPU to a`
			`single MPI task. However multiple MPI tasks can share the same GPU,`
			`and in many cases it will be more efficient to run this way. Likewise`
			`it may be more efficient to use less MPI tasks/node than the available`
			`# of CPU cores. Assignment of multiple MPI tasks to a GPU will happen`
			`automatically if you create more MPI tasks/node than there are`
			`GPUs/mode. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be`
			`shared by 4 MPI tasks.`

			`Use the "-sf gpu" "command-line switch"_Section_start.html#start_7,`
			`which will automatically append "gpu" to styles that support it. Use`
			`the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to`
			`set Ng = # of GPUs/node to use.`

			`lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU`
			`mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node`
			`mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes :pre`

			`Note that if the "-sf gpu" switch is used, it also issues a default`
			`"package gpu 1"_package.html command, which sets the number of`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12465 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:47:24 +08:00			`GPUs/node to 1.`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12465 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:47:24 +08:00			`Using the "-pk" switch explicitly allows for setting of the number of`
			`GPUs/node to use and additional options. Its syntax is the same as`
			`same as the "package gpu" command. See the "package"_package.html`
			`command doc page for details, including the default values used for`
			`all its options if it is not specified.`
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12473 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-11 01:48:28 +08:00			`Note that the default for the "package gpu"_package.html command is to`
			`set the Newton flag to "off" pairwise interactions. It does not`
			`affect the setting for bonded interactions (LAMMPS default is "on").`
			`The "off" setting for pairwise interaction is currently required for`
			`GPU package pair styles.`

git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00			`[Or run with the GPU package by editing an input script:]`

			`The discussion above for the mpirun/mpiexec command, MPI tasks/node,`
			`and use of multiple MPI tasks/GPU is the same.`

			`Use the "suffix gpu"_suffix.html command, or you can explicitly add an`
			`"gpu" suffix to individual styles in your input script, e.g.`

			`pair_style lj/cut/gpu 2.5 :pre`

			`You must also use the "package gpu"_package.html command to enable the`
			`GPU package, unless the "-sf gpu" or "-pk gpu" "command-line`
			`switches"_Section_start.html#start_7 were used. It specifies the`
			`number of GPUs/node to use, as well as other options.`

			`[Speed-ups to expect:]`

			`The performance of a GPU versus a multi-core CPU is a function of your`
			`hardware, which pair style is used, the number of atoms/GPU, and the`
			`precision used on the GPU (double, single, mixed).`

			`See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the`
			`LAMMPS web site for performance of the GPU package on various`
			`hardware, including the Titan HPC platform at ORNL.`

			`You should also experiment with how many MPI tasks per GPU to use to`
			`give the best performance for your problem and machine. This is also`
			`a function of the problem size and the pair style being using.`
			`Likewise, you should experiment with the precision setting for the GPU`
			`library to see if single or mixed precision will give accurate`
			`results, since they will typically be faster.`

			`[Guidelines for best performance:]`

			`Using multiple MPI tasks per GPU will often give the best performance,`
			`as allowed my most multi-core CPU/GPU configurations. :ulb,l`

			`If the number of particles per MPI task is small (e.g. 100s of`
			`particles), it can be more efficient to run with fewer MPI tasks per`
			`GPU, even if you do not use all the cores on the compute node. :l`

			`The "package gpu"_package.html command has several options for tuning`
			`performance. Neighbor lists can be built on the GPU or CPU. Force`
			`calculations can be dynamically balanced across the CPU cores and`
			`GPUs. GPU-specific settings can be made which can be optimized`
			`for different hardware. See the "packakge"_package.html command`
			`doc page for details. :l`

			`As described by the "package gpu"_package.html command, GPU`
			`accelerated pair styles can perform computations asynchronously with`
			`CPU computations. The "Pair" time reported by LAMMPS will be the`
			`maximum of the time required to complete the CPU pair style`
			`computations and the time required to complete the GPU pair style`
			`computations. Any time spent for GPU-enabled pair styles for`
			`computations that run simultaneously with "bond"_bond_style.html,`
			`"angle"_angle_style.html, "dihedral"_dihedral_style.html,`
			`"improper"_improper_style.html, and "long-range"_kspace_style.html`
			`calculations will not be included in the "Pair" time. :l`

			`When the {mode} setting for the package gpu command is force/neigh,`
			`the time for neighbor list calculations on the GPU will be added into`
			`the "Pair" time, not the "Neigh" time. An additional breakdown of the`
			`times required for various tasks on the GPU (data copy, neighbor`
			`calculations, force computations, etc) are output only with the LAMMPS`
			`screen output (not in the log file) at the end of each run. These`
			`timings represent total time spent on the GPU for each routine,`
			`regardless of asynchronous CPU calculations. :l`

			`The output section "GPU Time Info (average)" reports "Max Mem / Proc".`
			`This is the maximum memory used at one time on the GPU for data`
			`storage by a single MPI process. :l,ule`

			`[Restrictions:]`

			`None.`