git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@6711 f3b2605a-c512-4ea7-a41b-209d697bcdaa

2011-08-17 21:55:22 +00:00 · 2011-08-17 21:55:22 +00:00 · dcc7913857
parent b416be6cbc
commit dcc7913857
12 changed files with 288 additions and 209 deletions
--- a/doc/Section_accelerate.html
+++ b/doc/Section_accelerate.html
@@ -30,6 +30,7 @@ style exist in LAMMPS:
 </P>
 <UL><LI><A HREF = "pair_lj.html">pair_style lj/cut</A>
 <LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
+<LI><A HREF = "pair_lj.html">pair_style lj/cut/omp</A>
 <LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
 <LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A> 
 </UL>
@@ -45,6 +46,12 @@ input script.
 <P>Styles with an "opt" suffix are part of the OPT package and typically
 speed-up the pairwise calculations of your simulation by 5-25%.
 </P>
+<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
+a pair-style to be run in threaded mode using OpenMP.  This can be
+useful on nodes with high core counts when using fewer MPI processes
+than cores is advantageous, e.g. when running with PPPM so that FFTs
+are run on fewer MPI processors.
+</P>
 <P>Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
 packages, and can be run on NVIDIA GPUs associated with your CPUs.
 The speed-up due to GPU usage depends on a variety of factors, as
@@ -67,8 +74,9 @@ and kspace sections.
 packages, since they are both designed to use NVIDIA GPU hardware.
 </P>
 10.1 <A HREF = "#10_1">OPT package</A><BR>
-10.2 <A HREF = "#10_2">GPU package</A><BR>
-10.3 <A HREF = "#10_3">USER-CUDA package</A><BR>
+10.5 <A HREF = "#10_2">USER-OMP package</A><BR>
+10.2 <A HREF = "#10_3">GPU package</A><BR>
+10.3 <A HREF = "#10_4">USER-CUDA package</A><BR>
 10.4 <A HREF = "#10_4">Comparison of GPU and USER-CUDA packages</A> <BR>

 <HR>
@@ -104,53 +112,62 @@ to 20% savings.

 <HR>

-<H4><A NAME = "10_2"></A>10.2 GPU package 
+<H4><A NAME = "10_2"></A>10.2 USER-OMP package 
+</H4>
+<P>This section will be written when the USER-OMP package is released
+in main LAMMPS.
+</P>
+<HR>
+
+<HR>
+
+<H4><A NAME = "10_3"></A>10.3 GPU package 
 </H4>
 <P>The GPU package was developed by Mike Brown at ORNL.  It provides GPU
 versions of several pair styles and for long-range Coulombics via the
 PPPM command.  It has the following features:
 </P>
 <UL><LI>The package is designed to exploit common GPU hardware configurations
-where one or more GPUs are coupled with one or more multi-core CPUs
-within a node of a parallel machine. 
+where one or more GPUs are coupled with many cores of one or more
+multi-core CPUs, e.g. within a node of a parallel machine. 

 <LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
-between the CPU and GPU every timestep. 
+between the CPU(s) and GPU every timestep. 

-<LI>Neighbor lists can be constructed by on the CPU or on the GPU,
-controlled by the <A HREF = "fix_gpu.html">fix gpu</A> command. 
+<LI>Neighbor lists can be constructed on the CPU or on the GPU. 

 <LI>The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU. 

-<LI>Asynchronous force computations can be performed simulataneously on
-the CPU and GPU. 
+<LI>Asynchronous force computations can be performed simultaneously on the
+CPU(s) and GPU. 

-<LI>LAMMPS-specific code is in the GPU package.  It makee calls to a more
+<LI>LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
-NVIDIA support as well as a more general OpenCL support, so that the
-same functionality can eventually be supported on other GPU
+NVIDIA support as well as more general OpenCL support, so that the
+same functionality can eventually be supported on a variety of GPU
 hardware. 
 </UL>
 <P><B>Hardware and software requirements:</B>
 </P>
-<P>To use this package, you need to have specific NVIDIA hardware and
-install specific NVIDIA CUDA software on your system:
+<P>To use this package, you currently need to have specific NVIDIA
+hardware and install specific NVIDIA CUDA software on your system:
 </P>
 <UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
 <LI>Go to http://www.nvidia.com/object/cuda_get.html
 <LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
-<LI>Follow the instructions in lammps/lib/gpu/README to build the library (also see below)
+<LI>Follow the instructions in lammps/lib/gpu/README to build the library (see below)
 <LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties 
 </UL>
 <P><B>Building LAMMPS with the GPU package:</B>
 </P>
-<P>As with other packages that link with a separately complied library,
-you need to first build the GPU library, before building LAMMPS
-itself.  General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
+<P>As with other packages that include a separately compiled library, you
+need to first build the GPU library, before building LAMMPS itself.
+General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
 section</A> of the manual.  For this package,
-do the following, using a Makefile appropriate for your system:
+do the following, using a Makefile in lib/gpu appropriate for your
+system:
 </P>
 <PRE>cd lammps/lib/gpu
 make -f Makefile.linux
@@ -160,7 +177,7 @@ make -f Makefile.linux
 </P>
 <P>Now you are ready to build LAMMPS with the GPU package installed:
 </P>
-<PRE>cd lammps/lib/src
+<PRE>cd lammps/src
 make yes-gpu
 make machine 
 </PRE>
@@ -173,28 +190,27 @@ example.
 <P><B>GPU configuration</B>
 </P>
 <P>When using GPUs, you are restricted to one physical GPU per LAMMPS
-process, which is an MPI process running (typically) on a single core
-or processor.  Multiple processes can share a single GPU and in many
-cases it will be more efficient to run with multiple processes per
-GPU.
+process, which is an MPI process running on a single core or
+processor.  Multiple MPI processes (CPU cores) can share a single GPU,
+and in many cases it will be more efficient to run this way.
 </P>
 <P><B>Input script requirements:</B>
 </P>
-<P>Additional input script requirements to run styles with a <I>gpu</I> suffix
-are as follows.
+<P>Additional input script requirements to run pair or PPPM styles with a
+<I>gpu</I> suffix are as follows:
 </P>
-<P>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I>.
-</P>
-<P>To invoke specific styles from the GPU package, you can either append
+<UL><LI>To invoke specific styles from the GPU package, you can either append
 "gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the
 <A HREF = "Section_start.html#2_6">-suffix command-line switch</A>, or use the
-<A HREF = "suffix.html">suffix</A> command.
-</P>
-<P>The <A HREF = "package.html">package gpu</A> command must be used near the beginning
-of your script to control the GPU selection and initialization steps.
-It also enables asynchronous splitting of force computations between
-the CPUs and GPUs.
-</P>
+<A HREF = "suffix.html">suffix</A> command. 
+
+<LI>The <A HREF = "newton.html">newton pair</A> setting must be <I>off</I>. 
+
+<LI>The <A HREF = "package.html">package gpu</A> command must be used near the beginning
+of your script to control the GPU selection and initialization
+settings.  It also has an option to enable asynchronous splitting of
+force computations between the CPUs and GPUs. 
+</UL>
 <P>As an example, if you have two GPUs per node and 8 CPU cores per node,
 and would like to run on 4 nodes (32 cores) with dynamic balancing of
 force calculation across CPU and GPU cores, you could specify
@@ -220,10 +236,10 @@ computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
 <A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
 calculations will not be included in the "Pair" time.
 </P>
-<P>When the <I>mode</I> setting for the gpu fix is force/neigh, the time for
-neighbor list calculations on the GPU will be added into the "Pair"
-time, not the "Neigh" time.  An additional breakdown of the times
-required for various tasks on the GPU (data copy, neighbor
+<P>When the <I>mode</I> setting for the package gpu command is force/neigh,
+the time for neighbor list calculations on the GPU will be added into
+the "Pair" time, not the "Neigh" time.  An additional breakdown of the
+times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc) are output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
@@ -231,20 +247,23 @@ regardless of asynchronous CPU calculations.
 </P>
 <P><B>Performance tips:</B>
 </P>
+<P>Generally speaking, for best performance, you should use multiple CPUs
+per GPU, as provided by most multi-core CPU/GPU configurations.
+</P>
 <P>Because of the large number of cores within each GPU device, it may be
 more efficient to run on fewer processes per GPU when the number of
 particles per MPI process is small (100's of particles); this can be
 necessary to keep the GPU cores busy.
 </P>
 <P>See the lammps/lib/gpu/README file for instructions on how to build
-the LAMMPS gpu library for single, mixed, and double precision.  The
-latter requires that your GPU card support double precision.
+the GPU library for single, mixed, or double precision.  The latter
+requires that your GPU card support double precision.
 </P>
 <HR>

 <HR>

-<H4><A NAME = "10_3"></A>10.3 USER-CUDA package 
+<H4><A NAME = "10_4"></A>10.4 USER-CUDA package 
 </H4>
 <P>The USER-CUDA package was developed by Christian Trott at U Technology
 Ilmenau in Germany.  It provides NVIDIA GPU versions of many pair
@@ -256,19 +275,22 @@ many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU. 

-<LI>This will occur until a timestep where a non-GPU-ized fix or compute
-is invoked.  E.g. whenever a non-GPU operation occurs (fix, compute,
-output), data automatically moves back to the CPU as needed.  This may
-incur a performance penalty, but should otherwise just work
+<LI>Data will stay on the GPU until a timestep where a non-GPU-ized fix or
+compute is invoked.  Whenever a non-GPU operation occurs (fix,
+compute, output), data automatically moves back to the CPU as needed.
+This may incur a performance penalty, but should otherwise work
 transparently. 

 <LI>Neighbor lists for GPU-ized pair styles are constructed on the
 GPU. 
+
+<LI>The package only supports use of a single CPU (core) with each
+GPU. 
 </UL>
 <P><B>Hardware and software requirements:</B>
 </P>
 <P>To use this package, you need to have specific NVIDIA hardware and
-install specific NVIDIA CUDA software on your system:
+install specific NVIDIA CUDA software on your system.
 </P>
 <P>Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
 help you to find out the Compute Capability of your card:
@@ -282,18 +304,19 @@ that its sample projects can be compiled without problems.
 </P>
 <P><B>Building LAMMPS with the USER-CUDA package:</B>
 </P>
-<P>As with other packages that link with a separately complied library,
-you need to first build the USER-CUDA library, before building LAMMPS
+<P>As with other packages that include a separately compiled library, you
+need to first build the USER-CUDA library, before building LAMMPS
 itself.  General instructions for doing this are in <A HREF = "doc/Section_start.html#2_3">this
 section</A> of the manual.  For this package,
-do the following, using a Makefile appropriate for your system:
+do the following, using settings in the lib/cuda Makefiles appropriate
+for your system:
 </P>
-<UL><LI>If your <I>CUDA</I> toolkit is not installed in the default system directoy
+<UL><LI>Go to the lammps/lib/cuda directory 
+
+<LI>If your <I>CUDA</I> toolkit is not installed in the default system directory
 <I>/usr/local/cuda</I> edit the file <I>lib/cuda/Makefile.common</I>
 accordingly. 

-<LI>Go to the lammps/lib/cuda directory 
-
 <LI>Type "make OPTIONS", where <I>OPTIONS</I> are one or more of the following
 options. The settings will be written to the
 <I>lib/cuda/Makefile.defaults</I> and used in the next step. 
@@ -324,36 +347,38 @@ produce the file lib/libcuda.a.
 </UL>
 <P>Now you are ready to build LAMMPS with the USER-CUDA package installed:
 </P>
-<PRE>cd lammps/lib/src
+<PRE>cd lammps/src
 make yes-user-cuda
 make machine 
 </PRE>
-<P>Note that the build will reference the lib/cuda/Makefile.common file
-to extract setting relevant to the LAMMPS build.  So it is important
+<P>Note that the LAMMPS build references the lib/cuda/Makefile.common
+file to extract specific CUDA settings.  So it is important
 that you have first built the cuda library (in lib/cuda) using
 settings appropriate to your system.
 </P>
 <P><B>Input script requirements:</B>
 </P>
 <P>Additional input script requirements to run styles with a <I>cuda</I>
-suffix are as follows.
+suffix are as follows:
 </P>
-<P>To invoke specific styles from the USER-CUDA package, you can either
+<UL><LI>To invoke specific styles from the USER-CUDA package, you can either
 append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
 the <A HREF = "Section_start.html#2_6">-suffix command-line switch</A>, or use the
 <A HREF = "suffix.html">suffix</A> command.  One exception is that the <A HREF = "kspace_style.html">kspace_style
-pppm/cuda</A> command has to be requested explicitly.
-</P>
-<P>To use the USER-CUDA package with its default settings, no additional
+pppm/cuda</A> command has to be requested
+explicitly. 
+
+<LI>To use the USER-CUDA package with its default settings, no additional
 command is needed in your input script.  This is because when LAMMPS
 starts up, it detects if it has been built with the USER-CUDA package.
 See the <A HREF = "Section_start.html#2_6">-cuda command-line switch</A> for more
-details.
-</P>
-<P>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
-cuda</A> command can be used at the beginning of your input
-script.  See the commands doc page for details.
-</P>
+details. 
+
+<LI>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
+cuda</A> command can be used near the beginning of your
+input script.  See the <A HREF = "package.html">package</A> command doc page for
+details. 
+</UL>
 <P><B>Performance tips:</B>
 </P>
 <P>The USER-CUDA package offers more speed-up relative to CPU performance
@@ -365,18 +390,18 @@ entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
-occurs, the faster your simulation may run.
+occurs, the faster your simulation will run.
 </P>
 <HR>

 <HR>

-<H4><A NAME = "10_4"></A>10.4 Comparison of GPU and USER-CUDA packages 
+<H4><A NAME = "10_5"></A>10.5 Comparison of GPU and USER-CUDA packages 
 </H4>
 <P>Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
 using NVIDIA hardware, but they do it in different ways.
 </P>
-<P>As a consequence, for a specific simulation on particular hardware,
+<P>As a consequence, for a particular simulation on specific hardware,
 one package may be faster than the other.  We give guidelines below,
 but the best way to determine which package is faster for your input
 script is to try both of them on your machine.  See the benchmarking
@@ -384,7 +409,12 @@ section below for examples where this has been done.
 </P>
 <P><B>Guidelines for using each package optimally:</B>
 </P>
-<UL><LI>The GPU package moves per-atom data (coordinates, forces)
+<UL><LI>The GPU package allows you to assign multiple CPUs (cores) to a single
+GPU (a common configuration for "hybrid" nodes that contain multicore
+CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
+package does not allow this; you can only use one CPU per GPU. 
+
+<LI>The GPU package moves per-atom data (coordinates, forces)
 back-and-forth between the CPU and GPU every timestep.  The USER-CUDA
 package only does this on timesteps when a CPU calculation is required
 (e.g. to invoke a fix or compute that is non-GPU-ized).  Hence, if you
@@ -402,28 +432,12 @@ system the crossover (in single precision) is often about 50K-100K
 atoms per GPU.  When performing double precision calculations the
 crossover point can be significantly smaller. 

-<LI>The GPU package allows you to assign multiple CPUs (cores) to a single
-GPU (a common configuration for "hybrid" nodes that contain multicore
-CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
-package does not; it works best when there is one CPU per GPU. 
-
 <LI>Both packages compute bonded interactions (bonds, angles, etc) on the
 CPU.  This means a model with bonds will force the USER-CUDA package
 to transfer per-atom data back-and-forth between the CPU and GPU every
 timestep.  If the GPU package is running with several MPI processes
 assigned to one GPU, the cost of computing the bonded interactions is
 spread across more CPUs and hence the GPU package can run faster. 
-</UL>
-<P><B>Chief differences between the two packages:</B>
-</P>
-<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
-calculations.  The USER-CUDA package currently supports a wider range
-of pair styles and can also accelerate many fix styles and some
-compute styles, as well as neighbor list and PPPM calculations. 
-
-<LI>The GPU package uses more GPU memory than the USER-CUDA package.  This
-is generally not much of a problem since typical runs are
-computation-limited rather than memory-limited. 

 <LI>When using the GPU package with multiple CPUs assigned to one GPU, its
 performance depends to some extent on high bandwidth between the CPUs
@@ -433,18 +447,30 @@ case if S2050/70 servers are used, where two devices generally share
 one PCIe 2.0 16x slot.  Also many multi-GPU mainboards do not provide
 full 16 lanes to each of the PCIe 2.0 16x slots. 
 </UL>
+<P><B>Differences between the two packages:</B>
+</P>
+<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
+calculations.  The USER-CUDA package currently supports a wider range
+of pair styles and can also accelerate many fix styles and some
+compute styles, as well as neighbor list and PPPM calculations. 
+
+<LI>The GPU package uses more GPU memory than the USER-CUDA package.  This
+is generally not a problem since typical runs are computation-limited
+rather than memory-limited. 
+</UL>
 <P><B>Examples:</B>
 </P>
-<P>The LAMMPS distribution has two directories with sample
-input scripts for the GPU and USER-CUDA packages.
+<P>The LAMMPS distribution has two directories with sample input scripts
+for the GPU and USER-CUDA packages.
 </P>
 <UL><LI>lammps/examples/gpu = GPU package files
 <LI>lammps/examples/USER/cuda = USER-CUDA package files 
 </UL>
-<P>These are files for identical systems, so they can be
-used to benchmark the performance of both packages
-on your system.
+<P>These contain input scripts for identical systems, so they can be used
+to benchmark the performance of both packages on your system.
 </P>
+<HR>
+
 <P><B>Benchmark data:</B>
 </P>
 <P>NOTE: We plan to add some benchmark results and plots here for the
--- a/doc/Section_accelerate.txt
+++ b/doc/Section_accelerate.txt
@@ -27,6 +27,7 @@ style exist in LAMMPS:

 "pair_style lj/cut"_pair_lj.html
 "pair_style lj/cut/opt"_pair_lj.html
+"pair_style lj/cut/omp"_pair_lj.html
 "pair_style lj/cut/gpu"_pair_lj.html
 "pair_style lj/cut/cuda"_pair_lj.html :ul

@@ -42,6 +43,12 @@ input script.
 Styles with an "opt" suffix are part of the OPT package and typically
 speed-up the pairwise calculations of your simulation by 5-25%.

+Styles with an "omp" suffix are part of the USER-OMP package and allow
+a pair-style to be run in threaded mode using OpenMP.  This can be
+useful on nodes with high core counts when using fewer MPI processes
+than cores is advantageous, e.g. when running with PPPM so that FFTs
+are run on fewer MPI processors.
+
 Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA
 packages, and can be run on NVIDIA GPUs associated with your CPUs.
 The speed-up due to GPU usage depends on a variety of factors, as
@@ -64,8 +71,9 @@ The final section compares and contrasts the GPU and USER-CUDA
 packages, since they are both designed to use NVIDIA GPU hardware.

 10.1 "OPT package"_#10_1
-10.2 "GPU package"_#10_2
-10.3 "USER-CUDA package"_#10_3
+10.5 "USER-OMP package"_#10_2
+10.2 "GPU package"_#10_3
+10.3 "USER-CUDA package"_#10_4
 10.4 "Comparison of GPU and USER-CUDA packages"_#10_4 :all(b)

 :line
@@ -99,53 +107,61 @@ to 20% savings.
 :line
 :line

-10.2 GPU package :h4,link(10_2)
+10.2 USER-OMP package :h4,link(10_2)
+
+This section will be written when the USER-OMP package is released
+in main LAMMPS.
+
+:line
+:line
+
+10.3 GPU package :h4,link(10_3)

 The GPU package was developed by Mike Brown at ORNL.  It provides GPU
 versions of several pair styles and for long-range Coulombics via the
 PPPM command.  It has the following features:

 The package is designed to exploit common GPU hardware configurations
-where one or more GPUs are coupled with one or more multi-core CPUs
-within a node of a parallel machine. :ulb,l
+where one or more GPUs are coupled with many cores of one or more
+multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l

 Atom-based data (e.g. coordinates, forces) moves back-and-forth
-between the CPU and GPU every timestep. :l
+between the CPU(s) and GPU every timestep. :l

-Neighbor lists can be constructed by on the CPU or on the GPU,
-controlled by the "fix gpu"_fix_gpu.html command. :l
+Neighbor lists can be constructed on the CPU or on the GPU. :l

 The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU. :l

-Asynchronous force computations can be performed simulataneously on
-the CPU and GPU. :l
+Asynchronous force computations can be performed simultaneously on the
+CPU(s) and GPU. :l

-LAMMPS-specific code is in the GPU package.  It makee calls to a more
+LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
-NVIDIA support as well as a more general OpenCL support, so that the
-same functionality can eventually be supported on other GPU
+NVIDIA support as well as more general OpenCL support, so that the
+same functionality can eventually be supported on a variety of GPU
 hardware. :l,ule

 [Hardware and software requirements:]

-To use this package, you need to have specific NVIDIA hardware and
-install specific NVIDIA CUDA software on your system:
+To use this package, you currently need to have specific NVIDIA
+hardware and install specific NVIDIA CUDA software on your system:

 Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
 Go to http://www.nvidia.com/object/cuda_get.html
 Install a driver and toolkit appropriate for your system (SDK is not necessary)
-Follow the instructions in lammps/lib/gpu/README to build the library (also see below)
+Follow the instructions in lammps/lib/gpu/README to build the library (see below)
 Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties :ul

 [Building LAMMPS with the GPU package:]

-As with other packages that link with a separately complied library,
-you need to first build the GPU library, before building LAMMPS
-itself.  General instructions for doing this are in "this
+As with other packages that include a separately compiled library, you
+need to first build the GPU library, before building LAMMPS itself.
+General instructions for doing this are in "this
 section"_doc/Section_start.html#2_3 of the manual.  For this package,
-do the following, using a Makefile appropriate for your system:
+do the following, using a Makefile in lib/gpu appropriate for your
+system:

 cd lammps/lib/gpu
 make -f Makefile.linux
@@ -155,7 +171,7 @@ If you are successful, you will produce the file lib/libgpu.a.

 Now you are ready to build LAMMPS with the GPU package installed:

-cd lammps/lib/src
+cd lammps/src
 make yes-gpu
 make machine :pre

@@ -168,27 +184,26 @@ example.
 [GPU configuration]

 When using GPUs, you are restricted to one physical GPU per LAMMPS
-process, which is an MPI process running (typically) on a single core
-or processor.  Multiple processes can share a single GPU and in many
-cases it will be more efficient to run with multiple processes per
-GPU.
+process, which is an MPI process running on a single core or
+processor.  Multiple MPI processes (CPU cores) can share a single GPU,
+and in many cases it will be more efficient to run this way.

 [Input script requirements:]

-Additional input script requirements to run styles with a {gpu} suffix
-are as follows.
-
-The "newton pair"_newton.html setting must be {off}.
+Additional input script requirements to run pair or PPPM styles with a
+{gpu} suffix are as follows:

 To invoke specific styles from the GPU package, you can either append
 "gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the
 "-suffix command-line switch"_Section_start.html#2_6, or use the
-"suffix"_suffix.html command.
+"suffix"_suffix.html command. :ulb,l
+
+The "newton pair"_newton.html setting must be {off}. :l

 The "package gpu"_package.html command must be used near the beginning
-of your script to control the GPU selection and initialization steps.
-It also enables asynchronous splitting of force computations between
-the CPUs and GPUs.
+of your script to control the GPU selection and initialization
+settings.  It also has an option to enable asynchronous splitting of
+force computations between the CPUs and GPUs. :l,ule

 As an example, if you have two GPUs per node and 8 CPU cores per node,
 and would like to run on 4 nodes (32 cores) with dynamic balancing of
@@ -215,10 +230,10 @@ computations that run simultaneously with "bond"_bond_style.html,
 "improper"_improper_style.html, and "long-range"_kspace_style.html
 calculations will not be included in the "Pair" time.

-When the {mode} setting for the gpu fix is force/neigh, the time for
-neighbor list calculations on the GPU will be added into the "Pair"
-time, not the "Neigh" time.  An additional breakdown of the times
-required for various tasks on the GPU (data copy, neighbor
+When the {mode} setting for the package gpu command is force/neigh,
+the time for neighbor list calculations on the GPU will be added into
+the "Pair" time, not the "Neigh" time.  An additional breakdown of the
+times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc) are output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
@@ -226,19 +241,22 @@ regardless of asynchronous CPU calculations.

 [Performance tips:]

+Generally speaking, for best performance, you should use multiple CPUs
+per GPU, as provided by most multi-core CPU/GPU configurations.
+
 Because of the large number of cores within each GPU device, it may be
 more efficient to run on fewer processes per GPU when the number of
 particles per MPI process is small (100's of particles); this can be
 necessary to keep the GPU cores busy.

 See the lammps/lib/gpu/README file for instructions on how to build
-the LAMMPS gpu library for single, mixed, and double precision.  The
-latter requires that your GPU card support double precision.
+the GPU library for single, mixed, or double precision.  The latter
+requires that your GPU card support double precision.

 :line
 :line

-10.3 USER-CUDA package :h4,link(10_3)
+10.4 USER-CUDA package :h4,link(10_4)

 The USER-CUDA package was developed by Christian Trott at U Technology
 Ilmenau in Germany.  It provides NVIDIA GPU versions of many pair
@@ -250,19 +268,22 @@ many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU. :ulb,l

-This will occur until a timestep where a non-GPU-ized fix or compute
-is invoked.  E.g. whenever a non-GPU operation occurs (fix, compute,
-output), data automatically moves back to the CPU as needed.  This may
-incur a performance penalty, but should otherwise just work
+Data will stay on the GPU until a timestep where a non-GPU-ized fix or
+compute is invoked.  Whenever a non-GPU operation occurs (fix,
+compute, output), data automatically moves back to the CPU as needed.
+This may incur a performance penalty, but should otherwise work
 transparently. :l

 Neighbor lists for GPU-ized pair styles are constructed on the
+GPU. :l
+
+The package only supports use of a single CPU (core) with each
 GPU. :l,ule

 [Hardware and software requirements:]

 To use this package, you need to have specific NVIDIA hardware and
-install specific NVIDIA CUDA software on your system:
+install specific NVIDIA CUDA software on your system.

 Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
 help you to find out the Compute Capability of your card:
@@ -276,17 +297,18 @@ that its sample projects can be compiled without problems.

 [Building LAMMPS with the USER-CUDA package:]

-As with other packages that link with a separately complied library,
-you need to first build the USER-CUDA library, before building LAMMPS
+As with other packages that include a separately compiled library, you
+need to first build the USER-CUDA library, before building LAMMPS
 itself.  General instructions for doing this are in "this
 section"_doc/Section_start.html#2_3 of the manual.  For this package,
-do the following, using a Makefile appropriate for your system:
+do the following, using settings in the lib/cuda Makefiles appropriate
+for your system:
+
+Go to the lammps/lib/cuda directory :ulb,l

 If your {CUDA} toolkit is not installed in the default system directory
 {/usr/local/cuda} edit the file {lib/cuda/Makefile.common}
-accordingly. :ulb,l
-
-Go to the lammps/lib/cuda directory :l
+accordingly. :l

 Type "make OPTIONS", where {OPTIONS} are one or more of the following
 options. The settings will be written to the
@@ -318,35 +340,37 @@ produce the file lib/libcuda.a. :l,ule

 Now you are ready to build LAMMPS with the USER-CUDA package installed:

-cd lammps/lib/src
+cd lammps/src
 make yes-user-cuda
 make machine :pre

-Note that the build will reference the lib/cuda/Makefile.common file
-to extract setting relevant to the LAMMPS build.  So it is important
+Note that the LAMMPS build references the lib/cuda/Makefile.common
+file to extract specific CUDA settings.  So it is important
 that you have first built the cuda library (in lib/cuda) using
 settings appropriate to your system.

 [Input script requirements:]

 Additional input script requirements to run styles with a {cuda}
-suffix are as follows.
+suffix are as follows:

 To invoke specific styles from the USER-CUDA package, you can either
 append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
 the "-suffix command-line switch"_Section_start.html#2_6, or use the
 "suffix"_suffix.html command.  One exception is that the "kspace_style
-pppm/cuda"_kspace_style.html command has to be requested explicitly.
+pppm/cuda"_kspace_style.html command has to be requested
+explicitly. :ulb,l

 To use the USER-CUDA package with its default settings, no additional
 command is needed in your input script.  This is because when LAMMPS
 starts up, it detects if it has been built with the USER-CUDA package.
 See the "-cuda command-line switch"_Section_start.html#2_6 for more
-details.
+details. :l

 To change settings for the USER-CUDA package at run-time, the "package
-cuda"_package.html command can be used at the beginning of your input
-script.  See the commands doc page for details.
+cuda"_package.html command can be used near the beginning of your
+input script.  See the "package"_package.html command doc page for
+details. :l,ule

 [Performance tips:]

@ -359,17 +383,17 @@ entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
-occurs, the faster your simulation may run.
+occurs, the faster your simulation will run.

 :line
 :line

-10.4 Comparison of GPU and USER-CUDA packages :h4,link(10_4)
+10.5 Comparison of GPU and USER-CUDA packages :h4,link(10_5)

 Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
 using NVIDIA hardware, but they do it in different ways.

-As a consequence, for a specific simulation on particular hardware,
+As a consequence, for a particular simulation on specific hardware,
 one package may be faster than the other.  We give guidelines below,
 but the best way to determine which package is faster for your input
 script is to try both of them on your machine.  See the benchmarking
@ -377,6 +401,11 @@ section below for examples where this has been done.

 [Guidelines for using each package optimally:]

+The GPU package allows you to assign multiple CPUs (cores) to a single
+GPU (a common configuration for "hybrid" nodes that contain multicore
+CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
+package does not allow this; you can only use one CPU per GPU. :ulb,l
+
 The GPU package moves per-atom data (coordinates, forces)
 back-and-forth between the CPU and GPU every timestep.  The USER-CUDA
 package only does this on timesteps when a CPU calculation is required
@ -385,7 +414,7 @@ can formulate your input script to only use GPU-ized fixes and
 computes, and avoid doing I/O too often (thermo output, dump file
 snapshots, restart files), then the data transfer cost of the
 USER-CUDA package can be very low, causing it to run faster than the
-GPU package. :ulb,l
+GPU package. :l

 The GPU package is often faster than the USER-CUDA package, if the
 number of atoms per GPU is "small".  The crossover point, in terms of
@ -395,28 +424,12 @@ system the crossover (in single precision) is often about 50K-100K
 atoms per GPU.  When performing double precision calculations the
 crossover point can be significantly smaller. :l

-The GPU package allows you to assign multiple CPUs (cores) to a single
-GPU (a common configuration for "hybrid" nodes that contain multicore
-CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
-package does not; it works best when there is one CPU per GPU. :l
-
 Both packages compute bonded interactions (bonds, angles, etc) on the
 CPU.  This means a model with bonds will force the USER-CUDA package
 to transfer per-atom data back-and-forth between the CPU and GPU every
 timestep.  If the GPU package is running with several MPI processes
 assigned to one GPU, the cost of computing the bonded interactions is
-spread across more CPUs and hence the GPU package can run faster. :l,ule
-
-[Chief differences between the two packages:]
-
-The GPU package accelerates only pair force, neighbor list, and PPPM
-calculations.  The USER-CUDA package currently supports a wider range
-of pair styles and can also accelerate many fix styles and some
-compute styles, as well as neighbor list and PPPM calculations. :ulb,l
-
-The GPU package uses more GPU memory than the USER-CUDA package.  This
-is generally not much of a problem since typical runs are
-computation-limited rather than memory-limited. :l
+spread across more CPUs and hence the GPU package can run faster. :l

 When using the GPU package with multiple CPUs assigned to one GPU, its
 performance depends to some extent on high bandwidth between the CPUs
@ -426,17 +439,29 @@ case if S2050/70 servers are used, where two devices generally share
 one PCIe 2.0 16x slot.  Also many multi-GPU mainboards do not provide
 full 16 lanes to each of the PCIe 2.0 16x slots. :l,ule

+[Differences between the two packages:]
+
+The GPU package accelerates only pair force, neighbor list, and PPPM
+calculations.  The USER-CUDA package currently supports a wider range
+of pair styles and can also accelerate many fix styles and some
+compute styles, as well as neighbor list and PPPM calculations. :ulb,l
+
+The GPU package uses more GPU memory than the USER-CUDA package.  This
+is generally not a problem since typical runs are computation-limited
+rather than memory-limited. :l,ule
+
 [Examples:]

-The LAMMPS distribution has two directories with sample
-input scripts for the GPU and USER-CUDA packages.
+The LAMMPS distribution has two directories with sample input scripts
+for the GPU and USER-CUDA packages.

 lammps/examples/gpu = GPU package files
 lammps/examples/USER/cuda = USER-CUDA package files :ul

-These are files for identical systems, so they can be
-used to benchmark the performance of both packages
-on your system.
+These contain input scripts for identical systems, so they can be used
+to benchmark the performance of both packages on your system.
+
+:line

 [Benchmark data:]

--- a/doc/compute_ackland_atom.html
+++ b/doc/compute_ackland_atom.html
@ -58,8 +58,8 @@ LAMMPS output options.
 </P>
 <P><B>Restrictions:</B>
 </P>
-<P>This compute is part of the "user-ackland" package.  It is only
-enabled if LAMMPS was built with that package.  See the <A HREF = "Section_start.html#2_3">Making
+<P>This compute is part of the "user-misc" package.  It is only enabled
+if LAMMPS was built with that package.  See the <A HREF = "Section_start.html#2_3">Making
 LAMMPS</A> section for more info.
 </P>
 <P><B>Related commands:</B>
--- a/doc/compute_ackland_atom.txt
+++ b/doc/compute_ackland_atom.txt
@ -55,8 +55,8 @@ LAMMPS output options.

 [Restrictions:]

-This compute is part of the "user-ackland" package.  It is only
-enabled if LAMMPS was built with that package.  See the "Making
+This compute is part of the "user-misc" package.  It is only enabled
+if LAMMPS was built with that package.  See the "Making
 LAMMPS"_Section_start.html#2_3 section for more info.

 [Related commands:]
--- a/doc/fix_imd.html
+++ b/doc/fix_imd.html
@ -43,9 +43,22 @@ fix comm all imd 8888 trate 5 unwrap on fscale 10.0
 <P><B>Description:</B>
 </P>
 <P>This fix implements the "Interactive MD" (IMD) protocol which allows
-to connect an IMD client, for example the <A HREF = "http://www.ks.uiuc.edu/Research/vmd">VMD visualization
-program</A>, to a running LAMMPS simulation and monitor the progress
-of the simulation and interactively apply forces to selected atoms.
+realtime visualization and manipulation of MD simulations through the
+IMD protocol, as initially implemented in VMD and NAMD.  Specifically
+it allows LAMMPS to connect an IMD client, for example the <A HREF = "http://www.ks.uiuc.edu/Research/vmd">VMD
+visualization program</A>, so that it can monitor the progress of the
+simulation and interactively apply forces to selected atoms.
+</P>
+<P>If LAMMPS is compiled with the preprocessor flag -DLAMMPS_ASYNC_IMD
+then fix imd will use POSIX threads to spawn a thread on MPI rank 0 in
+order to offload data reading and writing from the main execution
+thread and potentially lower the inferred latencies for slow
+communication links.  This feature has only been tested under Linux.
+</P>
+<P>There are example scripts for using this package with LAMMPS in
+examples/USER/imd. Additional examples and a driver for use with the
+Novint Falcon game controller as a haptic device can be found at:
+http://sites.google.com/site/akohlmey/software/vrpn-icms.
 </P>
 <P>The source code for this fix includes code developed by the
 Theoretical and Computational Biophysics Group in the Beckman
@ -138,15 +151,16 @@ This fix is not invoked during <A HREF = "minimize.html">energy minimization</A>
 </P>
 <P><B>Restrictions:</B>
 </P>
-<P>This fix is part of the "user-imd" package.  It is only enabled if
+<P>This fix is part of the "user-misc" package.  It is only enabled if
 LAMMPS was built with that package.  See the <A HREF = "Section_start.html#2_3">Making
 LAMMPS</A> section for more info.
-This on platforms that support multi-threading, this fix can be 
-compiled in a way that the coordinate transfers to the IMD client
-can be handled from a separate thread, when LAMMPS is compiled with
-the -DLAMMPS_ASYNC_IMD preprocessor flag. This should to keep 
-MD loop times low and transfer rates high, especially for systems
-with many atoms and for slow connections.
+</P>
+<P>On platforms that support multi-threading, this fix can be compiled in
+a way that the coordinate transfers to the IMD client can be handled
+from a separate thread, when LAMMPS is compiled with the
+-DLAMMPS_ASYNC_IMD preprocessor flag. This should keep MD loop
+times low and transfer rates high, especially for systems with many
+atoms and for slow connections.
 </P>
 <P>When used in combination with VMD, a topology or coordinate file has
 to be loaded, which matches (in number and ordering of atoms) the
--- a/doc/fix_imd.txt
+++ b/doc/fix_imd.txt
@ -35,9 +35,22 @@ fix comm all imd 8888 trate 5 unwrap on fscale 10.0 :pre
 [Description:]

 This fix implements the "Interactive MD" (IMD) protocol which allows
-to connect an IMD client, for example the "VMD visualization
-program"_VMD, to a running LAMMPS simulation and monitor the progress
-of the simulation and interactively apply forces to selected atoms.
+realtime visualization and manipulation of MD simulations through the
+IMD protocol, as initially implemented in VMD and NAMD.  Specifically
+it allows LAMMPS to connect an IMD client, for example the "VMD
+visualization program"_VMD, so that it can monitor the progress of the
+simulation and interactively apply forces to selected atoms.
+
+If LAMMPS is compiled with the preprocessor flag -DLAMMPS_ASYNC_IMD
+then fix imd will use POSIX threads to spawn a thread on MPI rank 0 in
+order to offload data reading and writing from the main execution
+thread and potentially lower the inferred latencies for slow
+communication links.  This feature has only been tested under Linux.
+
+There are example scripts for using this package with LAMMPS in
+examples/USER/imd. Additional examples and a driver for use with the
+Novint Falcon game controller as a haptic device can be found at:
+http://sites.google.com/site/akohlmey/software/vrpn-icms.

 The source code for this fix includes code developed by the
 Theoretical and Computational Biophysics Group in the Beckman
@ -128,15 +141,16 @@ This fix is not invoked during "energy minimization"_minimize.html.

 [Restrictions:]

-This fix is part of the "user-imd" package.  It is only enabled if
+This fix is part of the "user-misc" package.  It is only enabled if
 LAMMPS was built with that package.  See the "Making
 LAMMPS"_Section_start.html#2_3 section for more info.
-This on platforms that support multi-threading, this fix can be 
-compiled in a way that the coordinate transfers to the IMD client
-can be handled from a separate thread, when LAMMPS is compiled with
-the -DLAMMPS_ASYNC_IMD preprocessor flag. This should to keep 
-MD loop times low and transfer rates high, especially for systems
-with many atoms and for slow connections.
+
+On platforms that support multi-threading, this fix can be compiled in
+a way that the coordinate transfers to the IMD client can be handled
+from a separate thread, when LAMMPS is compiled with the
+-DLAMMPS_ASYNC_IMD preprocessor flag. This should keep MD loop
+times low and transfer rates high, especially for systems with many
+atoms and for slow connections.

 When used in combination with VMD, a topology or coordinate file has
 to be loaded, which matches (in number and ordering of atoms) the
--- a/doc/fix_smd.html
+++ b/doc/fix_smd.html
@ -132,7 +132,7 @@ minimization</A>.
 </P>
 <P><B>Restrictions:</B>
 </P>
-<P>This fix is part of the "user-smd" package.  It is only enabled if
+<P>This fix is part of the "user-misc" package.  It is only enabled if
 LAMMPS was built with that package.  See the <A HREF = "Section_start.html#2_3">Making
 LAMMPS</A> section for more info.
 </P>
--- a/doc/fix_smd.txt
+++ b/doc/fix_smd.txt
@ -123,7 +123,7 @@ minimization"_minimize.html.

 [Restrictions:]

-This fix is part of the "user-smd" package.  It is only enabled if
+This fix is part of the "user-misc" package.  It is only enabled if
 LAMMPS was built with that package.  See the "Making
 LAMMPS"_Section_start.html#2_3 section for more info.

--- a/doc/package.html
+++ b/doc/package.html
@ -101,7 +101,7 @@ the other particles.
 <HR>

 <P>The <I>cuda</I> style invokes options associated with the use of the
-USER-CUDA package.  These need to be documented.
+USER-CUDA package.  These still need to be documented.
 </P>
 <HR>

--- a/doc/package.txt
+++ b/doc/package.txt
@ -95,7 +95,7 @@ the other particles.
 :line

 The {cuda} style invokes options associated with the use of the
-USER-CUDA package.  These need to be documented.
+USER-CUDA package.  These still need to be documented.

 :line

--- a/doc/pair_eam.html
+++ b/doc/pair_eam.html
@ -415,7 +415,7 @@ an input script that reads a restart file.
 that package (which it is by default).  See the <A HREF = "Section_start.html#2_3">Making
 LAMMPS</A> section for more info.
 </P>
-<P>The <I>eam/cd</I> style is part of the "user-cd-eam" package and also
+<P>The <I>eam/cd</I> style is part of the "user-misc" package and also
 requires the "manybody" package.  It is only enabled if LAMMPS was
 built with those packages.  See the <A HREF = "Section_start.html#2_3">Making
 LAMMPS</A> section for more info.
--- a/doc/pair_eam.txt
+++ b/doc/pair_eam.txt
@ -403,7 +403,7 @@ All of these styles except the {eam/cd} style are part of the
 that package (which it is by default).  See the "Making
 LAMMPS"_Section_start.html#2_3 section for more info.

-The {eam/cd} style is part of the "user-cd-eam" package and also
+The {eam/cd} style is part of the "user-misc" package and also
 requires the "manybody" package.  It is only enabled if LAMMPS was
 built with those packages.  See the "Making
 LAMMPS"_Section_start.html#2_3 section for more info.