git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@6711 f3b2605a-c512-4ea7-a41b-209d697bcdaa

sjplimp 2011-08-17 21:55:22 +00:00
parent b416be6cbc
commit dcc7913857
12 changed files with 288 additions and 209 deletions


@ -30,6 +30,7 @@ style exist in LAMMPS:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/omp</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
</UL>
@ -45,6 +46,12 @@ input script.
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in threaded mode using OpenMP. This can be
useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processes.
</P>
<P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
@ -67,8 +74,9 @@ and kspace sections.
packages, since they are both designed to use NVIDIA GPU hardware.
</P>
10.1 <A HREF = "#10_1">OPT package</A><BR>
10.2 <A HREF = "#10_2">GPU package</A><BR>
10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
10.2 <A HREF = "#10_2">USER-OMP package</A><BR>
10.3 <A HREF = "#10_3">GPU package</A><BR>
10.4 <A HREF = "#10_4">USER-CUDA package</A><BR>
10.5 <A HREF = "#10_5">Comparison of GPU and USER-CUDA packages</A> <BR>
<HR>
@ -104,53 +112,62 @@ to 20% savings.
<HR>
<H4><A NAME = "10_2"></A>10.2 GPU package
<H4><A NAME = "10_2"></A>10.2 USER-OMP package
</H4>
<P>This section will be written when the USER-OMP package is released
in main LAMMPS.
</P>
<HR>
<HR>
<H4><A NAME = "10_3"></A>10.3 GPU package
</H4>
<P>The GPU package was developed by Mike Brown at ORNL. It provides GPU
versions of several pair styles and for long-range Coulombics via the
PPPM command. It has the following features:
</P>
<UL><LI>The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled with one or more multi-core CPUs
within a node of a parallel machine.
where one or more GPUs are coupled with many cores of one or more
multi-core CPUs, e.g. within a node of a parallel machine.
<LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU and GPU every timestep.
between the CPU(s) and GPU every timestep.
<LI>Neighbor lists can be constructed by on the CPU or on the GPU,
controlled by the <A HREF = "fix_gpu.html">fix gpu</A> command.
<LI>Neighbor lists can be constructed on the CPU or on the GPU.
<LI>The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU.
<LI>Asynchronous force computations can be performed simulataneously on
the CPU and GPU.
<LI>Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU.
<LI>LAMMPS-specific code is in the GPU package. It makee calls to a more
<LI>LAMMPS-specific code is in the GPU package. It makes calls to a
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as a more general OpenCL support, so that the
same functionality can eventually be supported on other GPU
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
<P>To use this package, you currently need to have specific NVIDIA
hardware and install specific NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in lammps/lib/gpu/README to build the library (also see below)
<LI>Follow the instructions in lammps/lib/gpu/README to build the library (see below)
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<P><B>Building LAMMPS with the GPU package:</B>
</P>
<P>As with other packages that link with a separately complied library,
you need to first build the GPU library, before building LAMMPS
itself. General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
<P>As with other packages that include a separately compiled library, you
need to first build the GPU library, before building LAMMPS itself.
General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
section</A> of the manual. For this package,
do the following, using a Makefile appropriate for your system:
do the following, using a Makefile in lib/gpu appropriate for your
system:
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
@ -160,7 +177,7 @@ make -f Makefile.linux
</P>
<P>Now you are ready to build LAMMPS with the GPU package installed:
</P>
<PRE>cd lammps/lib/src
<PRE>cd lammps/src
make yes-gpu
make machine
</PRE>
@ -173,28 +190,27 @@ example.
<P><B>GPU configuration</B>
</P>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process, which is an MPI process running (typically) on a single core
or processor. Multiple processes can share a single GPU and in many
cases it will be more efficient to run with multiple processes per
GPU.
process, which is an MPI process running on a single core or
processor. Multiple MPI processes (CPU cores) can share a single GPU,
and in many cases it will be more efficient to run this way.
</P>
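<P>As a rough illustration of the sharing described above, the following
Python sketch computes how many MPI processes end up sharing each GPU
on a node. The function name and the assumptions (one MPI process per
core, processes dividing evenly among the GPUs) are hypothetical and
not part of LAMMPS.
</P>

```python
def mpi_procs_per_gpu(cores_per_node: int, gpus_per_node: int) -> int:
    """Return how many MPI processes share each GPU on one node.

    Assumes one MPI process per core and that the processes divide
    evenly among the GPUs (an illustrative simplification).
    """
    if cores_per_node <= 0 or gpus_per_node <= 0:
        raise ValueError("core and GPU counts must be positive")
    if cores_per_node % gpus_per_node != 0:
        raise ValueError("cores do not divide evenly among GPUs")
    return cores_per_node // gpus_per_node


if __name__ == "__main__":
    # A node with 8 cores and 2 GPUs: 4 MPI processes share each GPU.
    print(mpi_procs_per_gpu(8, 2))
```

<P>Whether such a layout is actually faster than one process per GPU
depends on the problem size and hardware, so it is worth timing both.
</P>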
<P><B>Input script requirements:</B>
</P>
<P>Additional input script requirements to run styles with a <I>gpu</I> suffix
are as follows.
<P>Additional input script requirements to run pair or PPPM styles with a
<I>gpu</I> suffix are as follows:
</P>
<P>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I>.
</P>
<P>To invoke specific styles from the GPU package, you can either append
<UL><LI>To invoke specific styles from the GPU package, you can either append
"gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the
<A HREF = "Section_start.html#2_6">-suffix command-line switch</A>, or use the
<A HREF = "suffix.html">suffix</A> command.
</P>
<P>The <A HREF = "package.html">package gpu</A> command must be used near the beginning
of your script to control the GPU selection and initialization steps.
It also enables asynchronous splitting of force computations between
the CPUs and GPUs.
</P>
<A HREF = "suffix.html">suffix</A> command.
<LI>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I>.
<LI>The <A HREF = "package.html">package gpu</A> command must be used near the beginning
of your script to control the GPU selection and initialization
settings. It also has an option to enable asynchronous splitting of
force computations between the CPUs and GPUs.
</UL>
<P>As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify
@ -220,10 +236,10 @@ computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When the <I>mode</I> setting for the gpu fix is force/neigh, the time for
neighbor list calculations on the GPU will be added into the "Pair"
time, not the "Neigh" time. An additional breakdown of the times
required for various tasks on the GPU (data copy, neighbor
<P>When the <I>mode</I> setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time. An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc.) is output only to the LAMMPS
screen (not to the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
@ -231,20 +247,23 @@ regardless of asynchronous CPU calculations.
</P>
<P><B>Performance tips:</B>
</P>
<P>Generally speaking, for best performance, you should use multiple CPUs
per GPU, as provided by most multi-core CPU/GPU configurations.
</P>
<P>Because of the large number of cores within each GPU device, it may be
more efficient to run on fewer processes per GPU when the number of
particles per MPI process is small (100's of particles); this can be
necessary to keep the GPU cores busy.
</P>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision. The
latter requires that your GPU card support double precision.
the GPU library for single, mixed, or double precision. The latter
requires that your GPU card support double precision.
</P>
<HR>
<HR>
<H4><A NAME = "10_3"></A>10.3 USER-CUDA package
<H4><A NAME = "10_4"></A>10.4 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at the Technical
University of Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
@ -256,19 +275,22 @@ many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU.
<LI>This will occur until a timestep where a non-GPU-ized fix or compute
is invoked. E.g. whenever a non-GPU operation occurs (fix, compute,
output), data automatically moves back to the CPU as needed. This may
incur a performance penalty, but should otherwise just work
<LI>Data will stay on the GPU until a timestep where a non-GPU-ized fix or
compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently.
<LI>Neighbor lists for GPU-ized pair styles are constructed on the
GPU.
<LI>The package only supports use of a single CPU (core) with each
GPU.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
install specific NVIDIA CUDA software on your system.
</P>
<P>Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
help you to find out the Compute Capability of your card:
@ -282,18 +304,19 @@ that its sample projects can be compiled without problems.
</P>
<P><B>Building LAMMPS with the USER-CUDA package:</B>
</P>
<P>As with other packages that link with a separately complied library,
you need to first build the USER-CUDA library, before building LAMMPS
<P>As with other packages that include a separately compiled library, you
need to first build the USER-CUDA library, before building LAMMPS
itself. General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
section</A> of the manual. For this package,
do the following, using a Makefile appropriate for your system:
do the following, using settings in the lib/cuda Makefiles appropriate
for your system:
</P>
<UL><LI>If your <I>CUDA</I> toolkit is not installed in the default system directoy
<UL><LI>Go to the lammps/lib/cuda directory
<LI>If your <I>CUDA</I> toolkit is not installed in the default system directory
<I>/usr/local/cuda</I>, edit the file <I>lib/cuda/Makefile.common</I>
accordingly.
<LI>Go to the lammps/lib/cuda directory
<LI>Type "make OPTIONS", where <I>OPTIONS</I> are one or more of the following
options. The settings will be written to the
<I>lib/cuda/Makefile.defaults</I> and used in the next step.
@ -324,36 +347,38 @@ produce the file lib/libcuda.a.
</UL>
<P>Now you are ready to build LAMMPS with the USER-CUDA package installed:
</P>
<PRE>cd lammps/lib/src
<PRE>cd lammps/src
make yes-user-cuda
make machine
</PRE>
<P>Note that the build will reference the lib/cuda/Makefile.common file
to extract setting relevant to the LAMMPS build. So it is important
<P>Note that the LAMMPS build references the lib/cuda/Makefile.common
file to extract specific CUDA settings. So it is important
that you have first built the cuda library (in lib/cuda) using
settings appropriate to your system.
</P>
<P><B>Input script requirements:</B>
</P>
<P>Additional input script requirements to run styles with a <I>cuda</I>
suffix are as follows.
suffix are as follows:
</P>
<P>To invoke specific styles from the USER-CUDA package, you can either
<UL><LI>To invoke specific styles from the USER-CUDA package, you can either
append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
the <A HREF = "Section_start.html#2_6">-suffix command-line switch</A>, or use the
<A HREF = "suffix.html">suffix</A> command. One exception is that the <A HREF = "kspace_style.html">kspace_style
pppm/cuda</A> command has to be requested explicitly.
</P>
<P>To use the USER-CUDA package with its default settings, no additional
pppm/cuda</A> command has to be requested
explicitly.
<LI>To use the USER-CUDA package with its default settings, no additional
command is needed in your input script. This is because when LAMMPS
starts up, it detects if it has been built with the USER-CUDA package.
See the <A HREF = "Section_start.html#2_6">-cuda command-line switch</A> for more
details.
</P>
<P>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
cuda</A> command can be used at the beginning of your input
script. See the commands doc page for details.
</P>
details.
<LI>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
cuda</A> command can be used near the beginning of your
input script. See the <A HREF = "package.html">package</A> command doc page for
details.
</UL>
<P><B>Performance tips:</B>
</P>
<P>The USER-CUDA package offers more speed-up relative to CPU performance
@ -365,18 +390,18 @@ entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation may run.
occurs, the faster your simulation will run.
</P>
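<P>To get a feel for how often output alone would pull data back to the
CPU, the following Python sketch counts the timesteps in a run on which
thermo or dump output falls. It is a simplified model: the function and
parameter names are hypothetical, and it ignores restart files and any
non-GPU-ized fixes or computes, which also trigger transfers.
</P>

```python
from math import gcd


def output_steps(nsteps: int, thermo_every: int, dump_every: int) -> int:
    """Count timesteps in 1..nsteps that are multiples of either interval.

    Each such step is one on which (in this simplified model) data would
    have to move from the GPU back to the CPU for output.
    """
    if nsteps < 0 or thermo_every <= 0 or dump_every <= 0:
        raise ValueError("nsteps must be >= 0 and intervals must be positive")
    # Steps that are multiples of both intervals must not be counted twice.
    lcm = thermo_every * dump_every // gcd(thermo_every, dump_every)
    return nsteps // thermo_every + nsteps // dump_every - nsteps // lcm


if __name__ == "__main__":
    # 1000 steps, thermo every 100 steps, dump every 250 steps.
    print(output_steps(1000, 100, 250))  # prints 12
```

<P>Increasing either interval lowers the count, which is the sense in
which less frequent output keeps more of the run on the GPU.
</P>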
<HR>
<HR>
<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages
<H4><A NAME = "10_5"></A>10.5 Comparison of GPU and USER-CUDA packages
</H4>
<P>Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
</P>
<P>As a consequence, for a specific simulation on particular hardware,
<P>As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other. We give guidelines below,
but the best way to determine which package is faster for your input
script is to try both of them on your machine. See the benchmarking
@ -384,7 +409,12 @@ section below for examples where this has been done.
</P>
<P><B>Guidelines for using each package optimally:</B>
</P>
<UL><LI>The GPU package moves per-atom data (coordinates, forces)
<UL><LI>The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not allow this; you can only use one CPU per GPU.
<LI>The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep. The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized). Hence, if you
@ -402,28 +432,12 @@ system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller.
<LI>The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not; it works best when there is one CPU per GPU.
<LI>Both packages compute bonded interactions (bonds, angles, etc) on the
CPU. This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster.
</UL>
<P><B>Chief differences between the two packages:</B>
</P>
<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations.
<LI>The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not much of a problem since typical runs are
computation-limited rather than memory-limited.
<LI>When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
@ -433,18 +447,30 @@ case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots.
</UL>
<P><B>Differences between the two packages:</B>
</P>
<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations.
<LI>The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not a problem since typical runs are computation-limited
rather than memory-limited.
</UL>
<P><B>Examples:</B>
</P>
<P>The LAMMPS distribution has two directories with sample
input scripts for the GPU and USER-CUDA packages.
<P>The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.
</P>
<UL><LI>lammps/examples/gpu = GPU package files
<LI>lammps/examples/USER/cuda = USER-CUDA package files
</UL>
<P>These are files for identical systems, so they can be
used to benchmark the performance of both packages
on your system.
<P>These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.
</P>
<HR>
<P><B>Benchmark data:</B>
</P>
<P>NOTE: We plan to add some benchmark results and plots here for the


@ -27,6 +27,7 @@ style exist in LAMMPS:
"pair_style lj/cut"_pair_lj.html
"pair_style lj/cut/opt"_pair_lj.html
"pair_style lj/cut/omp"_pair_lj.html
"pair_style lj/cut/gpu"_pair_lj.html
"pair_style lj/cut/cuda"_pair_lj.html :ul
@ -42,6 +43,12 @@ input script.
Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair-style to be run in threaded mode using OpenMP. This can be
useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processes.
Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
@ -64,8 +71,9 @@ The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA GPU hardware.
10.1 "OPT package"_#10_1
10.2 "GPU package"_#10_2
10.3 "USER-CUDA package"_#10_3
10.2 "USER-OMP package"_#10_2
10.3 "GPU package"_#10_3
10.4 "USER-CUDA package"_#10_4
10.5 "Comparison of GPU and USER-CUDA packages"_#10_5 :all(b)
:line
@ -99,53 +107,61 @@ to 20% savings.
:line
:line
10.2 GPU package :h4,link(10_2)
10.2 USER-OMP package :h4,link(10_2)
This section will be written when the USER-OMP package is released
in main LAMMPS.
:line
:line
10.3 GPU package :h4,link(10_3)
The GPU package was developed by Mike Brown at ORNL. It provides GPU
versions of several pair styles and for long-range Coulombics via the
PPPM command. It has the following features:
The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled with one or more multi-core CPUs
within a node of a parallel machine. :ulb,l
where one or more GPUs are coupled with many cores of one or more
multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l
Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU and GPU every timestep. :l
between the CPU(s) and GPU every timestep. :l
Neighbor lists can be constructed by on the CPU or on the GPU,
controlled by the "fix gpu"_fix_gpu.html command. :l
Neighbor lists can be constructed on the CPU or on the GPU. :l
The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU. :l
Asynchronous force computations can be performed simulataneously on
the CPU and GPU. :l
Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU. :l
LAMMPS-specific code is in the GPU package. It makee calls to a more
LAMMPS-specific code is in the GPU package. It makes calls to a
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as a more general OpenCL support, so that the
same functionality can eventually be supported on other GPU
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware. :l,ule
[Hardware and software requirements:]
To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
To use this package, you currently need to have specific NVIDIA
hardware and install specific NVIDIA CUDA software on your system:
Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
Go to http://www.nvidia.com/object/cuda_get.html
Install a driver and toolkit appropriate for your system (SDK is not necessary)
Follow the instructions in lammps/lib/gpu/README to build the library (also see below)
Follow the instructions in lammps/lib/gpu/README to build the library (see below)
Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties :ul
[Building LAMMPS with the GPU package:]
As with other packages that link with a separately complied library,
you need to first build the GPU library, before building LAMMPS
itself. General instructions for doing this are in "this
As with other packages that include a separately compiled library, you
need to first build the GPU library, before building LAMMPS itself.
General instructions for doing this are in "this
section"_doc/Section_start.html#2_3 of the manual. For this package,
do the following, using a Makefile appropriate for your system:
do the following, using a Makefile in lib/gpu appropriate for your
system:
cd lammps/lib/gpu
make -f Makefile.linux
@ -155,7 +171,7 @@ If you are successful, you will produce the file lib/libgpu.a.
Now you are ready to build LAMMPS with the GPU package installed:
cd lammps/lib/src
cd lammps/src
make yes-gpu
make machine :pre
@ -168,27 +184,26 @@ example.
[GPU configuration]
When using GPUs, you are restricted to one physical GPU per LAMMPS
process, which is an MPI process running (typically) on a single core
or processor. Multiple processes can share a single GPU and in many
cases it will be more efficient to run with multiple processes per
GPU.
process, which is an MPI process running on a single core or
processor. Multiple MPI processes (CPU cores) can share a single GPU,
and in many cases it will be more efficient to run this way.
[Input script requirements:]
Additional input script requirements to run styles with a {gpu} suffix
are as follows.
The "newton pair"_newton.html setting must be {off}.
Additional input script requirements to run pair or PPPM styles with a
{gpu} suffix are as follows:
To invoke specific styles from the GPU package, you can either append
"gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the
"-suffix command-line switch"_Section_start.html#2_6, or use the
"suffix"_suffix.html command.
"suffix"_suffix.html command. :ulb,l
The "newton pair"_newton.html setting must be {off}. :l
The "package gpu"_package.html command must be used near the beginning
of your script to control the GPU selection and initialization steps.
It also enables asynchronous splitting of force computations between
the CPUs and GPUs.
of your script to control the GPU selection and initialization
settings. It also has an option to enable asynchronous splitting of
force computations between the CPUs and GPUs. :l,ule
As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
@ -215,10 +230,10 @@ computations that run simultaneously with "bond"_bond_style.html,
"improper"_improper_style.html, and "long-range"_kspace_style.html
calculations will not be included in the "Pair" time.
When the {mode} setting for the gpu fix is force/neigh, the time for
neighbor list calculations on the GPU will be added into the "Pair"
time, not the "Neigh" time. An additional breakdown of the times
required for various tasks on the GPU (data copy, neighbor
When the {mode} setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time. An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc.) is output only to the LAMMPS
screen (not to the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
@ -226,19 +241,22 @@ regardless of asynchronous CPU calculations.
[Performance tips:]
Generally speaking, for best performance, you should use multiple CPUs
per GPU, as provided by most multi-core CPU/GPU configurations.
Because of the large number of cores within each GPU device, it may be
more efficient to run on fewer processes per GPU when the number of
particles per MPI process is small (100's of particles); this can be
necessary to keep the GPU cores busy.
See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision. The
latter requires that your GPU card support double precision.
the GPU library for single, mixed, or double precision. The latter
requires that your GPU card support double precision.
:line
:line
10.3 USER-CUDA package :h4,link(10_3)
10.4 USER-CUDA package :h4,link(10_4)
The USER-CUDA package was developed by Christian Trott at the Technical
University of Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
@ -250,19 +268,22 @@ many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU. :ulb,l
This will occur until a timestep where a non-GPU-ized fix or compute
is invoked. E.g. whenever a non-GPU operation occurs (fix, compute,
output), data automatically moves back to the CPU as needed. This may
incur a performance penalty, but should otherwise just work
Data will stay on the GPU until a timestep where a non-GPU-ized fix or
compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently. :l
Neighbor lists for GPU-ized pair styles are constructed on the
GPU. :l
The package only supports use of a single CPU (core) with each
GPU. :l,ule
[Hardware and software requirements:]
To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system:
install specific NVIDIA CUDA software on your system.
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
help you to find out the Compute Capability of your card:
@ -276,17 +297,18 @@ that its sample projects can be compiled without problems.
[Building LAMMPS with the USER-CUDA package:]
As with other packages that link with a separately complied library,
you need to first build the USER-CUDA library, before building LAMMPS
As with other packages that include a separately compiled library, you
need to first build the USER-CUDA library, before building LAMMPS
itself. General instructions for doing this are in "this
section"_doc/Section_start.html#2_3 of the manual. For this package,
do the following, using a Makefile appropriate for your system:
do the following, using settings in the lib/cuda Makefiles appropriate
for your system:
Go to the lammps/lib/cuda directory :ulb,l
If your {CUDA} toolkit is not installed in the default system directory
{/usr/local/cuda}, edit the file {lib/cuda/Makefile.common}
accordingly. :ulb,l
Go to the lammps/lib/cuda directory :l
accordingly. :l
Type "make OPTIONS", where {OPTIONS} are one or more of the following
options. The settings will be written to the
@ -318,35 +340,37 @@ produce the file lib/libcuda.a. :l,ule
Now you are ready to build LAMMPS with the USER-CUDA package installed:
cd lammps/lib/src
cd lammps/src
make yes-user-cuda
make machine :pre
Note that the build will reference the lib/cuda/Makefile.common file
to extract setting relevant to the LAMMPS build. So it is important
Note that the LAMMPS build references the lib/cuda/Makefile.common
file to extract specific CUDA settings. So it is important
that you have first built the cuda library (in lib/cuda) using
settings appropriate to your system.
[Input script requirements:]
Additional input script requirements to run styles with a {cuda}
suffix are as follows.
suffix are as follows:
To invoke specific styles from the USER-CUDA package, you can either
append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
the "-suffix command-line switch"_Section_start.html#2_6, or use the
"suffix"_suffix.html command. One exception is that the "kspace_style
pppm/cuda"_kspace_style.html command has to be requested explicitly.
pppm/cuda"_kspace_style.html command has to be requested
explicitly. :ulb,l
To use the USER-CUDA package with its default settings, no additional
command is needed in your input script. This is because when LAMMPS
starts up, it detects if it has been built with the USER-CUDA package.
See the "-cuda command-line switch"_Section_start.html#2_6 for more
details.
details. :l
To change settings for the USER-CUDA package at run-time, the "package
cuda"_package.html command can be used at the beginning of your input
script. See the commands doc page for details.
cuda"_package.html command can be used near the beginning of your
input script. See the "package"_package.html command doc page for
details. :l,ule
[Performance tips:]
@ -359,17 +383,17 @@ entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation may run.
occurs, the faster your simulation will run.
:line
:line
10.4 Comparison of GPU and USER-CUDA packages :h4,link(10_4)
10.5 Comparison of GPU and USER-CUDA packages :h4,link(10_5)
Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
As a consequence, for a specific simulation on particular hardware,
As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other. We give guidelines below,
but the best way to determine which package is faster for your input
script is to try both of them on your machine. See the benchmarking
@ -377,6 +401,11 @@ section below for examples where this has been done.
[Guidelines for using each package optimally:]
The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not allow this; you can only use one CPU per GPU. :ulb,l
The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep. The USER-CUDA
package only does this on timesteps when a CPU calculation is required
@ -385,7 +414,7 @@ can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package. :ulb,l
GPU package. :l
The GPU package is often faster than the USER-CUDA package, if the
number of atoms per GPU is "small". The crossover point, in terms of
@ -395,28 +424,12 @@ system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller. :l
The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode. The USER-CUDA
package does not; it works best when there is one CPU per GPU. :l
Both packages compute bonded interactions (bonds, angles, etc) on the
CPU. This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster. :l,ule
[Chief differences between the two packages:]
The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations. :ulb,l
The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not much of a problem since typical runs are
computation-limited rather than memory-limited. :l
spread across more CPUs and hence the GPU package can run faster. :l
When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
@ -426,17 +439,29 @@ case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots. :l,ule
[Differences between the two packages:]
The GPU package accelerates only pair force, neighbor list, and PPPM
calculations. The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations. :ulb,l
The GPU package uses more GPU memory than the USER-CUDA package. This
is generally not a problem since typical runs are computation-limited
rather than memory-limited. :l,ule
[Examples:]
The LAMMPS distribution has two directories with sample
input scripts for the GPU and USER-CUDA packages.
The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.
lammps/examples/gpu = GPU package files
lammps/examples/USER/cuda = USER-CUDA package files :ul
These are files for identical systems, so they can be
used to benchmark the performance of both packages
on your system.
These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.
:line
[Benchmark data:]

@ -58,8 +58,8 @@ LAMMPS output options.
</P>
<P><B>Restrictions:</B>
</P>
<P>This compute is part of the "user-ackland" package. It is only
enabled if LAMMPS was built with that package. See the <A HREF = "Section_start.html#2_3">Making
<P>This compute is part of the "user-misc" package. It is only enabled
if LAMMPS was built with that package. See the <A HREF = "Section_start.html#2_3">Making
LAMMPS</A> section for more info.
</P>
<P><B>Related commands:</B>

@ -55,8 +55,8 @@ LAMMPS output options.
[Restrictions:]
This compute is part of the "user-ackland" package. It is only
enabled if LAMMPS was built with that package. See the "Making
This compute is part of the "user-misc" package. It is only enabled
if LAMMPS was built with that package. See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.
[Related commands:]

@ -43,9 +43,22 @@ fix comm all imd 8888 trate 5 unwrap on fscale 10.0
<P><B>Description:</B>
</P>
<P>This fix implements the "Interactive MD" (IMD) protocol which allows
to connect an IMD client, for example the <A HREF = "http://www.ks.uiuc.edu/Research/vmd">VMD visualization
program</A>, to a running LAMMPS simulation and monitor the progress
of the simulation and interactively apply forces to selected atoms.
realtime visualization and manipulation of MD simulations through the
IMD protocol, as initially implemented in VMD and NAMD. Specifically
it allows LAMMPS to connect an IMD client, for example the <A HREF = "http://www.ks.uiuc.edu/Research/vmd">VMD
visualization program</A>, so that it can monitor the progress of the
simulation and interactively apply forces to selected atoms.
</P>
<P>If LAMMPS is compiled with the preprocessor flag -DLAMMPS_ASYNC_IMD
then fix imd will use POSIX threads to spawn a thread on MPI rank 0 in
order to offload data reading and writing from the main execution
thread and potentially lower the latency incurred on slow
communication links. This feature has only been tested under Linux.
</P>
<P>There are example scripts for using this package with LAMMPS in
examples/USER/imd. Additional examples and a driver for use with the
Novint Falcon game controller as a haptic device can be found at:
http://sites.google.com/site/akohlmey/software/vrpn-icms.
</P>
<P>The source code for this fix includes code developed by the
Theoretical and Computational Biophysics Group in the Beckman
@ -138,15 +151,16 @@ This fix is not invoked during <A HREF = "minimize.html">energy minimization</A>
</P>
<P><B>Restrictions:</B>
</P>
<P>This fix is part of the "user-imd" package. It is only enabled if
<P>This fix is part of the "user-misc" package. It is only enabled if
LAMMPS was built with that package. See the <A HREF = "Section_start.html#2_3">Making
LAMMPS</A> section for more info.
This on platforms that support multi-threading, this fix can be
compiled in a way that the coordinate transfers to the IMD client
can be handled from a separate thread, when LAMMPS is compiled with
the -DLAMMPS_ASYNC_IMD preprocessor flag. This should to keep
MD loop times low and transfer rates high, especially for systems
with many atoms and for slow connections.
</P>
<P>On platforms that support multi-threading, this fix can be compiled in
a way that the coordinate transfers to the IMD client can be handled
from a separate thread, when LAMMPS is compiled with the
-DLAMMPS_ASYNC_IMD preprocessor flag. This should keep MD loop
times low and transfer rates high, especially for systems with many
atoms and for slow connections.
</P>
<P>When used in combination with VMD, a topology or coordinate file has
to be loaded, which matches (in number and ordering of atoms) the

@ -35,9 +35,22 @@ fix comm all imd 8888 trate 5 unwrap on fscale 10.0 :pre
[Description:]
This fix implements the "Interactive MD" (IMD) protocol which allows
to connect an IMD client, for example the "VMD visualization
program"_VMD, to a running LAMMPS simulation and monitor the progress
of the simulation and interactively apply forces to selected atoms.
realtime visualization and manipulation of MD simulations through the
IMD protocol, as initially implemented in VMD and NAMD. Specifically
it allows LAMMPS to connect an IMD client, for example the "VMD
visualization program"_VMD, so that it can monitor the progress of the
simulation and interactively apply forces to selected atoms.
If LAMMPS is compiled with the preprocessor flag -DLAMMPS_ASYNC_IMD
then fix imd will use POSIX threads to spawn a thread on MPI rank 0 in
order to offload data reading and writing from the main execution
thread and potentially lower the latency incurred on slow
communication links. This feature has only been tested under Linux.
There are example scripts for using this package with LAMMPS in
examples/USER/imd. Additional examples and a driver for use with the
Novint Falcon game controller as a haptic device can be found at:
http://sites.google.com/site/akohlmey/software/vrpn-icms.
The source code for this fix includes code developed by the
Theoretical and Computational Biophysics Group in the Beckman
@ -128,15 +141,16 @@ This fix is not invoked during "energy minimization"_minimize.html.
[Restrictions:]
This fix is part of the "user-imd" package. It is only enabled if
This fix is part of the "user-misc" package. It is only enabled if
LAMMPS was built with that package. See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.
This on platforms that support multi-threading, this fix can be
compiled in a way that the coordinate transfers to the IMD client
can be handled from a separate thread, when LAMMPS is compiled with
the -DLAMMPS_ASYNC_IMD preprocessor flag. This should to keep
MD loop times low and transfer rates high, especially for systems
with many atoms and for slow connections.
On platforms that support multi-threading, this fix can be compiled in
a way that the coordinate transfers to the IMD client can be handled
from a separate thread, when LAMMPS is compiled with the
-DLAMMPS_ASYNC_IMD preprocessor flag. This should keep MD loop
times low and transfer rates high, especially for systems with many
atoms and for slow connections.
When used in combination with VMD, a topology or coordinate file has
to be loaded, which matches (in number and ordering of atoms) the

@ -132,7 +132,7 @@ minimization</A>.
</P>
<P><B>Restrictions:</B>
</P>
<P>This fix is part of the "user-smd" package. It is only enabled if
<P>This fix is part of the "user-misc" package. It is only enabled if
LAMMPS was built with that package. See the <A HREF = "Section_start.html#2_3">Making
LAMMPS</A> section for more info.
</P>

@ -123,7 +123,7 @@ minimization"_minimize.html.
[Restrictions:]
This fix is part of the "user-smd" package. It is only enabled if
This fix is part of the "user-misc" package. It is only enabled if
LAMMPS was built with that package. See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.

@ -101,7 +101,7 @@ the other particles.
<HR>
<P>The <I>cuda</I> style invokes options associated with the use of the
USER-CUDA package. These need to be documented.
USER-CUDA package. These still need to be documented.
</P>
<HR>

@ -95,7 +95,7 @@ the other particles.
:line
The {cuda} style invokes options associated with the use of the
USER-CUDA package. These need to be documented.
USER-CUDA package. These still need to be documented.
:line

@ -415,7 +415,7 @@ an input script that reads a restart file.
that package (which it is by default). See the <A HREF = "Section_start.html#2_3">Making
LAMMPS</A> section for more info.
</P>
<P>The <I>eam/cd</I> style is part of the "user-cd-eam" package and also
<P>The <I>eam/cd</I> style is part of the "user-misc" package and also
requires the "manybody" package. It is only enabled if LAMMPS was
built with those packages. See the <A HREF = "Section_start.html#2_3">Making
LAMMPS</A> section for more info.

@ -403,7 +403,7 @@ All of these styles except the {eam/cd} style are part of the
that package (which it is by default). See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.
The {eam/cd} style is part of the "user-cd-eam" package and also
The {eam/cd} style is part of the "user-misc" package and also
requires the "manybody" package. It is only enabled if LAMMPS was
built with those packages. See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.