<HTML>
<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> - <A HREF = "Section_howto.html">Next
Section</A>
</CENTER>
<HR>
<H3>5. Accelerating LAMMPS performance
</H3>
<P>This section describes various methods for improving LAMMPS
performance for different classes of problems running on different
kinds of machines.
</P>
5.1 <A HREF = "#acc_1">Measuring performance</A><BR>
5.2 <A HREF = "#acc_2">General strategies</A><BR>
5.3 <A HREF = "#acc_3">Packages with optimized styles</A><BR>
5.4 <A HREF = "#acc_4">OPT package</A><BR>
5.5 <A HREF = "#acc_5">USER-OMP package</A><BR>
5.6 <A HREF = "#acc_6">GPU package</A><BR>
5.7 <A HREF = "#acc_7">USER-CUDA package</A><BR>
5.8 <A HREF = "#acc_8">KOKKOS package</A><BR>
5.9 <A HREF = "#acc_9">USER-INTEL package</A><BR>
5.10 <A HREF = "#acc_10">Comparison of GPU and USER-CUDA packages</A> <BR>
<HR>
<HR>
<H4><A NAME = "acc_1"></A>5.1 Measuring performance
</H4>
<P>Before trying to make your simulation run faster, you should
understand how it currently performs and where the bottlenecks are.
</P>
<P>The best way to do this is to run your system (actual number of
atoms) for a modest number of timesteps (say 100, or a few 100 at
most) on several different processor counts, including a single
processor if possible. Do this for an equilibrated version of your
system, so that the 100-step timings are representative of a much
longer run. There is typically no need to run for 1000s of timesteps
to get accurate timings; you can simply extrapolate from short runs.
</P>
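<P>For example, a minimal set of scaling runs might look like the
following (a sketch only; the executable name lmp_machine and the
input script name in.script are placeholders for your own build and
problem, with the run length in the script set to roughly 100 steps):
</P>
<PRE>lmp_machine -in in.script                # 1 processor
mpirun -np 4 lmp_machine -in in.script   # 4 processors
mpirun -np 16 lmp_machine -in in.script  # 16 processors
</PRE>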
<P>For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. <A HREF = "Section_start.html#start_8">This
section</A> of the manual has an overview.
</P>
<P>Running on one (or a few) processors should give a good estimate of
the serial performance and what portions of the timestep are taking
the most time. Running the same problem on a few different processor
counts should give an estimate of parallel scalability. I.e. if the
simulation runs 16x faster on 16 processors, it's 100% parallel
efficient; if it runs 8x faster on 16 processors, it's 50% efficient.
</P>
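<P>Expressed as a formula, with T(1) the serial run time and T(P) the
run time on P processors:
</P>
<PRE>parallel efficiency = T(1) / (P * T(P))
e.g. T(1) = 80 s, T(16) = 10 s  ->  80 / (16 * 10) = 0.5 = 50% efficient
</PRE>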
<P>The most important data to look at in the timing info is the timing
breakdown and relative percentages. For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time. If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below. Comparing how the percentages change as
you increase the processor count gives you a sense of how different
operations within the timestep are scaling. Note that if you are
running with a Kspace solver, there is additional output on the
breakdown of the Kspace time. For PPPM, this includes the fraction
spent on FFTs, which can be communication intensive.
</P>
<P>Other important details in the timing info are the histograms of
atom counts and neighbor counts. If these vary widely across
processors, you have a load-imbalance issue. This often results in
inaccurate relative timing data, because processors have to wait when
communication occurs for other processors to catch up. Thus the
reported times for "Communication" or "Other" may be higher than they
really are, due to load-imbalance. If this is an issue, you can
uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile
LAMMPS, to obtain synchronized timings.
</P>
<HR>
<H4><A NAME = "acc_2"></A>5.2 General strategies
</H4>
<P>NOTE: this sub-section is still a work in progress
</P>
<P>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
bottlenecks in the current performance, so let the timing data you
generate be your guide. It is hard, if not impossible, to predict how
much difference these options will make, since it is a function of
problem size, number of processors used, and your machine. There is
no substitute for identifying performance bottlenecks, and trying out
various options.
</P>
<UL><LI>rRESPA
<LI>2-FFT PPPM
<LI>Staggered PPPM
<LI>single vs double PPPM
<LI>partial charge PPPM
<LI>verlet/split
<LI>processor mapping via processors numa command
<LI>load-balancing: balance and fix balance
<LI>processors command for layout
<LI>OMP when lots of cores
</UL>
<P>2-FFT PPPM, also called <I>analytic differentiation</I> or <I>ad</I> PPPM, uses
2 FFTs instead of the 4 FFTs used by the default <I>ik differentiation</I>
PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to
achieve the same accuracy as 4-FFT PPPM. For problems where the FFT
cost is the performance bottleneck (typically large problems running
on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
</P>
<P>Staggered PPPM performs calculations using two different meshes, one
shifted slightly with respect to the other. This can reduce force
aliasing errors and increase the accuracy of the method, but also
doubles the amount of work required. For high relative accuracy, using
staggered PPPM allows one to halve the mesh size in each dimension as
compared to regular PPPM, which can give around a 4x speedup in the
kspace time. However, for low relative accuracy, using staggered PPPM
gives little benefit and can be up to 2x slower in the kspace
time. For example, the rhodopsin benchmark was run on a single
processor, and results for kspace time vs. relative accuracy for the
different methods are shown in the figure below. For this system,
staggered PPPM (using ik differentiation) becomes useful at relative
accuracies slightly greater than 1e-5 and above.
</P>
<CENTER><IMG SRC = "JPG/rhodo_staggered.jpg">
</CENTER>
<P>IMPORTANT NOTE: Using staggered PPPM may not give the same increase in
accuracy of energy and pressure as it does in forces, so some caution
must be used if energy and/or pressure are quantities of interest,
such as when using a barostat.
</P>
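<P>As a sketch of how these variants are selected in an input script
(assuming the <I>diff</I> keyword of the kspace_modify command and the
pppm/stagger style; check the <A HREF = "kspace_style.html">kspace_style</A> and
<A HREF = "kspace_modify.html">kspace_modify</A> doc pages for the exact syntax and
accuracy values appropriate to your problem):
</P>
<PRE>kspace_style pppm 1.0e-4      # default 4-FFT ik-differentiation PPPM
kspace_modify diff ad         # switch to 2-FFT analytic-differentiation PPPM
</PRE>
<PRE>kspace_style pppm/stagger 1.0e-4   # staggered PPPM on two shifted meshes
</PRE>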
<HR>
<H4><A NAME = "acc_3"></A>5.3 Packages with optimized styles
</H4>
<P>Accelerated versions of various <A HREF = "pair_style.html">pair styles</A>,
<A HREF = "fix.html">fixes</A>, <A HREF = "compute.html">computes</A>, and other commands have
been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions, if you have the appropriate
hardware on your system.
</P>
<P>The accelerated styles have the same name as the standard styles,
except that a suffix is appended. Otherwise, the syntax for the
command is identical, their functionality is the same, and the
numerical results they produce should also be identical, except for
precision and round-off issues.
</P>
<P>For example, all of these styles are variants of the basic
Lennard-Jones pair style <A HREF = "pair_lj.html">pair_style lj/cut</A>:
</P>
<UL><LI><A HREF = "pair_lj.html">pair_style lj/cut/cuda</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/gpu</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/intel</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/kk</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/omp</A>
<LI><A HREF = "pair_lj.html">pair_style lj/cut/opt</A>
</UL>
<P>Assuming you have built LAMMPS with the appropriate package, these
styles can be invoked by specifying them explicitly in your input
script. Or you can use the <A HREF = "Section_start.html#start_7">-suffix command-line
switch</A> to invoke the accelerated versions
automatically, without changing your input script. The
<A HREF = "suffix.html">suffix</A> command allows you to set a suffix explicitly and
to turn the command-line switch setting off and back on, both from
within your input script.
</P>
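<P>For example, a minimal input-script sketch of explicit suffix control
(assuming a LAMMPS build that includes the USER-OMP package; see the
<A HREF = "suffix.html">suffix</A> doc page for details):
</P>
<PRE>suffix omp              # subsequent styles use their omp variants if available
pair_style lj/cut 2.5   # actually instantiates pair_style lj/cut/omp
suffix off              # revert to the plain, unaccelerated styles
</PRE>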
<P>Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU
packages, and can be run on NVIDIA GPUs associated with your CPUs.
The speed-up due to GPU usage depends on a variety of factors, as
discussed below.
</P>
<P>Styles with an "intel" suffix are part of the USER-INTEL
package. These styles support vectorized single and mixed precision
calculations, in addition to full double precision. In extreme cases,
this can provide speedups over 3.5x on CPUs. The package also
supports acceleration with offload to Intel(R) Xeon Phi(TM) coprocessors.
This can result in additional speedup over 2x depending on the
hardware configuration.
</P>
<P>Styles with a "kk" suffix are part of the KOKKOS package, and can be
run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends
on a variety of factors, as discussed below.
</P>
<P>Styles with an "omp" suffix are part of the USER-OMP package and allow
a pair style to be run in multi-threaded mode using OpenMP. This can
be useful on nodes with high core counts when using fewer MPI processes
than cores is advantageous, e.g. when running with PPPM so that FFTs
are run on fewer MPI processors, or when the many MPI tasks would
overload the available bandwidth for communication.
</P>
<P>Styles with an "opt" suffix are part of the OPT package and typically
speed up the pairwise calculations of your simulation by 5-25%.
</P>
<P>To see what styles are currently available in each of the accelerated
packages, see <A HREF = "Section_commands.html#cmd_5">Section_commands 5</A> of the
manual. A list of accelerated styles is included in the pair, fix,
compute, and kspace sections. The doc page for each individual style
(e.g. <A HREF = "pair_lj.html">pair lj/cut</A> or <A HREF = "fix_nve.html">fix nve</A>) will also
list any accelerated variants available for that style.
</P>
<P>The following sections explain:
</P>
<UL><LI>what hardware and software the accelerated styles require
<LI>how to build LAMMPS with the accelerated package in place
<LI>what changes (if any) are needed in your input scripts
<LI>guidelines for best performance
<LI>speed-ups you can expect
</UL>
<P>The final section compares and contrasts the GPU and USER-CUDA
packages, since they are both designed to use NVIDIA hardware.
</P>
<HR>
<H4><A NAME = "acc_4"></A>5.4 OPT package
</H4>
<P>The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
Technologies). It contains a handful of pair styles whose compute()
methods were rewritten in C++ templated form to reduce the overhead
due to if tests and other conditional code.
</P>
<P>The procedure for building LAMMPS with the OPT package is simple. It
is the same as for any other package which has no additional library
dependencies:
</P>
<PRE>make yes-opt
make machine
</PRE>
<P>If your input script uses one of the OPT pair styles, you can run it
as follows:
</P>
<PRE>lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
</PRE>
<P>You should see a reduction in the "Pair time" printed out at the end
of the run. On most machines and problems, this will typically be a 5
to 20% savings.
</P>
<HR>
<H4><A NAME = "acc_5"></A>5.5 USER-OMP package
</H4>
<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral styles, and a few fixes in LAMMPS. The package currently
uses the OpenMP interface, which requires using a specific compiler
flag in the makefile to enable multiple threads; without this flag the
corresponding pair styles will still be compiled and work, but do not
support multi-threading.
</P>
<P><B>Building LAMMPS with the USER-OMP package:</B>
</P>
<P>The procedure for building LAMMPS with the USER-OMP package is simple.
You have to edit your machine-specific makefile to add the flag that
enables OpenMP support to both the CCFLAGS and LINKFLAGS variables.
For the GNU and Intel compilers, this flag is called
<I>-fopenmp</I>. Check your compiler documentation to find out which flag
you need to add. The rest of the compilation is the same as for any
other package which has no additional library dependencies:
</P>
<PRE>make yes-user-omp
make machine
</PRE>
<P>If your input script uses one of the regular styles that also
exist as an OpenMP version in the USER-OMP package, you can run
it as follows:
</P>
<PRE>env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
</PRE>
<P>The value of the environment variable OMP_NUM_THREADS determines how
many threads per MPI task are launched. All three examples above use a
total of 4 CPU cores. For different MPI implementations the method to
pass the OMP_NUM_THREADS environment variable to all processes is
different. Two different variants, one each for MPICH and OpenMPI,
are shown above. Please check the documentation of your
MPI installation for additional details. Alternatively, the value
provided by OMP_NUM_THREADS can be overridden with the <A HREF = "package.html">package
omp</A> command. Depending on which styles are accelerated
in your input, you should see a reduction in the "Pair time" and/or
"Bond time" and "Loop time" printed out at the end of the run. The
optimal ratio of MPI to OpenMP can vary a lot and should always be
confirmed through some benchmark runs for the current system and on
the current machine.
</P>
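<P>As a sketch, overriding the thread count from within the input script
might look like this (the exact arguments of the package omp command
differ between LAMMPS versions, so check the <A HREF = "package.html">package</A>
doc page before copying this):
</P>
<PRE>package omp 4           # request 4 OpenMP threads per MPI task
suffix omp              # select omp variants of subsequent styles
pair_style lj/cut 2.5
</PRE>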
<P><B>Restrictions:</B>
</P>
<P>None of the pair styles in the USER-OMP package support the "inner",
"middle", "outer" options for r-RESPA integration; only the "pair"
option is supported.
</P>
<P><B>Parallel efficiency and performance tips:</B>
</P>
<P>In most simple cases the MPI parallelization in LAMMPS is more
efficient than the multi-threading implemented in the USER-OMP package.
Also, the parallel efficiency varies between individual styles.
On the other hand, in many cases you still want to use the <I>omp</I> versions
- even when compiling or running without OpenMP support - since they
all contain optimizations similar to those in the OPT package, which
can result in a serial speedup.
</P>
<P>Using multi-threading is most effective under the following
circumstances:
</P>
<UL><LI>Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad-core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster. Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup from utilizing the otherwise idle CPU
cores.
<LI>The interconnect used for MPI communication is not able to provide
sufficient bandwidth for a large number of MPI tasks per node. This
applies for example to running over gigabit ethernet or on Cray XT4 or
XT5 series supercomputers. As in the previous case, this effect
worsens as the number of nodes increases.
<LI>The input is a system that has an inhomogeneous particle density which
cannot be mapped well to the domain decomposition scheme that LAMMPS
employs. While this can be alleviated to some degree through use of the
<A HREF = "processors.html">processors</A> keyword, multi-threading provides
parallelism over the number of particles, not their
distribution in space.
<LI>Finally, multi-threaded styles can improve performance when running
LAMMPS in "capability mode", i.e. near the point where the MPI
parallelism scales out. This can happen in particular when using a
kspace style for long-range electrostatics. Here the scaling of the
kspace style is the performance-limiting factor, and using
multi-threaded styles allows one to operate the kspace style at the limit
of scaling and then increase performance by parallelizing the real-space
calculations with hybrid MPI+OpenMP. Sometimes additional speedup can
be achieved by increasing the real-space Coulomb cutoff and thus
reducing the work in the kspace part.
</UL>
<P>The best parallel efficiency from <I>omp</I> styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die.
</P>
<P>Using threads on hyper-threading enabled cores is usually
counterproductive, as the cost in additional memory bandwidth
requirements is not offset by the gain in CPU utilization through
hyper-threading.
</P>
<P>A description of the multi-threading strategy and some performance
examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
here</A>.
</P>
<HR>
<H4><A NAME = "acc_6"></A>5.6 GPU package
</H4>
<P>The GPU package was developed by Mike Brown at ORNL and his
collaborators. It provides GPU versions of several pair styles,
including the 3-body Stillinger-Weber pair style, and for long-range
Coulombics via the PPPM command. It has the following features:
</P>
<UL><LI>The package is designed to exploit common GPU hardware configurations
where one or more GPUs are coupled with many cores of multi-core
CPUs, e.g. within a node of a parallel machine.
<LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU(s) and GPU every timestep.
<LI>Neighbor lists can be constructed on the CPU or on the GPU.
<LI>The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU.
<LI>Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU.
<LI>It allows for GPU computations to be performed in single or double
precision, or in mixed-mode precision, where pairwise forces are
computed in single precision, but accumulated into double-precision
force vectors.
<LI>LAMMPS-specific code is in the GPU package. It makes calls to a
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you currently need to have an NVIDIA GPU and
install the NVIDIA Cuda software on your system:
</P>
<UL><LI>Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
<LI>Follow the instructions in lammps/lib/gpu/README to build the library (see below)
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
<P><B>Building LAMMPS with the GPU package:</B>
</P>
<P>As with other packages that include a separately compiled library, you
need to first build the GPU library, before building LAMMPS itself.
General instructions for doing this are in <A HREF = "Section_start.html#start_3">this
section</A> of the manual. For this package,
use a Makefile in lib/gpu appropriate for your system.
</P>
<P>Before building the library, you can set the precision it will use by
editing the CUDA_PREC setting in the Makefile you are using, as
follows:
</P>
<PRE>CUDA_PREC = -D_SINGLE_SINGLE # Single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE # Double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE # Accumulation of forces, etc, in double
</PRE>
<P>The last setting is the mixed mode referred to above. Note that your
GPU must support double precision to use either the 2nd or 3rd of
these settings.
</P>
<P>To build the library, type:
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README)
</PRE>
<P>If you are successful, you will produce the file lib/libgpu.a.
</P>
<P>Now you are ready to build LAMMPS with the GPU package installed:
</P>
<PRE>cd lammps/src
make yes-gpu
make machine
</PRE>
<P>Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has
these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH. These need to be
set appropriately to include the paths and settings for the CUDA
system software on your machine. See src/MAKE/Makefile.g++ for an
example.
</P>
<P>Also note that if you change the GPU library precision, you need to
re-build the entire library. You should do a "clean" first,
e.g. "make -f Makefile.linux clean". Then you must also re-build
LAMMPS if the library precision has changed, so that it re-links with
the new library.
</P>
<P><B>Running an input script:</B>
</P>
<P>The examples/gpu and bench/GPU directories have scripts that can be
run with the GPU package, as well as detailed instructions on how to
run them.
</P>
<P>The total number of MPI tasks used by LAMMPS (one or multiple per
compute node) is set in the usual manner via the mpirun or mpiexec
commands, and is independent of the GPU package.
</P>
<P>When using the GPU package, you cannot assign more than one physical
GPU to an MPI task. However, multiple MPI tasks can share the same
GPU, and in many cases it will be more efficient to run this way.
</P>
<P>Input script requirements to run using pair or PPPM styles with a
<I>gpu</I> suffix are as follows:
</P>
<UL><LI>To invoke specific styles from the GPU package, either append "gpu" to
the style name (e.g. pair_style lj/cut/gpu), or use the <A HREF = "Section_start.html#start_7">-suffix
command-line switch</A>, or use the
<A HREF = "suffix.html">suffix</A> command in the input script.
<LI>The <A HREF = "newton.html">newton pair</A> setting in the input script must be
<I>off</I>.
<LI>Unless the <A HREF = "Section_start.html#start_7">-suffix gpu command-line
switch</A> is used, the <A HREF = "package.html">package
gpu</A> command must be used near the beginning of the
script to control the GPU selection and initialization settings. It
also has an option to enable asynchronous splitting of force
computations between the CPUs and GPUs.
</UL>
<P>The default for the <A HREF = "package.html">package gpu</A> command is to have all
the MPI tasks on the compute node use a single GPU. If you have
multiple GPUs per node, then be sure to create one or more MPI tasks
per GPU, and use the first/last settings in the <A HREF = "package.html">package
gpu</A> command to include all the GPU IDs on the node.
E.g. first = 0, last = 1, for 2 GPUs. For example, on an 8-core 2-GPU
compute node, if you assign 8 MPI tasks to the node, the following
command in the input script
</P>
<P>package gpu force/neigh 0 1 -1
</P>
<P>would specify that each GPU is shared by 4 MPI tasks. The final -1 will
dynamically balance force calculations across the CPU cores and GPUs.
I.e. each CPU core will perform force calculations for some small
fraction of the particles, at the same time as the GPUs perform force
calculations for the majority of the particles.
</P>
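<P>Putting these requirements together, a minimal input-script sketch for
the 8-core 2-GPU node above might begin as follows (hypothetical
cutoff; the package gpu line can be omitted if the -suffix gpu
command-line switch is used with default settings):
</P>
<PRE>newton off                      # newton pair must be off for gpu pair styles
package gpu force/neigh 0 1 -1  # GPU IDs 0-1, dynamic CPU/GPU force balancing
pair_style lj/cut/gpu 2.5       # or use plain lj/cut with the -sf gpu switch
</PRE>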
<P><B>Timing output:</B>
</P>
<P>As described by the <A HREF = "package.html">package gpu</A> command, GPU
accelerated pair styles can perform computations asynchronously with
CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When the <I>mode</I> setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the "Pair" time, not the "Neigh" time. An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc) is output only with the LAMMPS
screen output (not in the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations.
</P>
<P>The output section "GPU Time Info (average)" reports "Max Mem / Proc".
This is the maximum memory used at one time on the GPU for data
storage by a single MPI process.
</P>
<P><B>Performance tips:</B>
</P>
<P>You should experiment with how many MPI tasks per GPU to use to see
what gives the best performance for your problem. This is a function
of your problem size and what pair style you are using. Likewise, you
should also experiment with the precision setting for the GPU library
to see if single or mixed precision will give accurate results, since
they will typically be faster.
</P>
<P>Using multiple MPI tasks per GPU will often give the best performance,
as allowed by most multi-core CPU/GPU configurations.
</P>
<P>If the number of particles per MPI task is small (e.g. 100s of
particles), it can be more efficient to run with fewer MPI tasks per
GPU, even if you do not use all the cores on the compute node.
</P>
<P>The <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the LAMMPS
web site gives GPU performance on a desktop machine and the Titan HPC
platform at ORNL for several of the LAMMPS benchmarks, as a function
of problem size and number of compute nodes.
</P>
<HR>
<H4><A NAME = "acc_7"></A>5.7 USER-CUDA package
</H4>
<P>The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
styles, many fixes, a few computes, and for long-range Coulombics via
the PPPM command. It has the following features:
</P>
<UL><LI>The package is designed to allow an entire LAMMPS calculation, for
many timesteps, to run entirely on the GPU (except for inter-processor
MPI communication), so that atom-based data (e.g. coordinates, forces)
do not have to move back-and-forth between the CPU and GPU.
<LI>The speed-up advantage of this approach is typically better when the
number of atoms per GPU is large.
<LI>Data will stay on the GPU until a timestep where a non-GPU-ized fix or
compute is invoked. Whenever a non-GPU operation occurs (fix,
compute, output), data automatically moves back to the CPU as needed.
This may incur a performance penalty, but should otherwise work
transparently.
<LI>Neighbor lists for GPU-ized pair styles are constructed on the
GPU.
<LI>The package only supports use of a single CPU (core) with each
GPU.
</UL>
<P><B>Hardware and software requirements:</B>
</P>
<P>To use this package, you need to have specific NVIDIA hardware and
install specific NVIDIA CUDA software on your system.
</P>
<P>Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
help you to find out the Compute Capability of your card:
</P>
<P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
</P>
<P>Install the Nvidia Cuda Toolkit version 3.2 or higher and the
corresponding GPU drivers. The Nvidia Cuda SDK is not required for
the USER-CUDA package, but we recommend installing it. You can then make sure
that its sample projects can be compiled without problems.
</P>
<P><B>Building LAMMPS with the USER-CUDA package:</B>
</P>
<P>As with other packages that include a separately compiled library, you
need to first build the USER-CUDA library, before building LAMMPS
itself. General instructions for doing this are in <A HREF = "Section_start.html#start_3">this
section</A> of the manual. For this package,
do the following, using settings in the lib/cuda Makefiles appropriate
for your system:
</P>
<UL><LI>Go to the lammps/lib/cuda directory
<LI>If your <I>CUDA</I> toolkit is not installed in the default system directory
<I>/usr/local/cuda</I>, edit the file <I>lib/cuda/Makefile.common</I>
accordingly.
<LI>Type "make OPTIONS", where <I>OPTIONS</I> are one or more of the following
options. The settings will be written to the
<I>lib/cuda/Makefile.defaults</I> and used in the next step.
<PRE><I>precision=N</I> to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
<I>arch=M</I> to set GPU compute capability
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
<I>prec_timer=0/1</I> to use hi-precision timers
  0 = do not use them (default)
  1 = use these timers
  this is usually only useful for Mac machines
<I>dbg=0/1</I> to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
<I>cufft=1</I> to determine usage of CUDA FFT library
  0 = no CUFFT support (default)
  in the future other CUDA-enabled FFT libraries might be supported
</PRE>
<LI>Type "make" to build the library. If you are successful, you will
produce the file lib/libcuda.a.
</UL>
<P>Now you are ready to build LAMMPS with the USER-CUDA package installed:
</P>
<PRE>cd lammps/src
make yes-user-cuda
make machine
</PRE>
<P>Note that the LAMMPS build references the lib/cuda/Makefile.common
file to extract CUDA-specific settings. So it is important
that you have first built the cuda library (in lib/cuda) using
settings appropriate to your system.
</P>
<P><B>Input script requirements:</B>
</P>
<P>Additional input script requirements to run styles with a <I>cuda</I>
suffix are as follows:
</P>
<UL><LI>The <A HREF = "Section_start.html#start_7">-cuda on command-line switch</A> must be
used when launching LAMMPS to enable the USER-CUDA package.
<LI>To invoke specific styles from the USER-CUDA package, you can either
append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use
the <A HREF = "Section_start.html#start_7">-suffix command-line switch</A>, or use
the <A HREF = "suffix.html">suffix</A> command. One exception is that the
<A HREF = "kspace_style.html">kspace_style pppm/cuda</A> command has to be requested
explicitly (see the sketch below).
<LI>To use the USER-CUDA package with its default settings, no additional
command is needed in your input script. This is because when LAMMPS
starts up, it detects if it has been built with the USER-CUDA package.
See the <A HREF = "Section_start.html#start_7">-cuda command-line switch</A> for
more details.
<LI>To change settings for the USER-CUDA package at run-time, the <A HREF = "package.html">package
cuda</A> command can be used near the beginning of your
input script. See the <A HREF = "package.html">package</A> command doc page for
details.
</UL>
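<P>For example, a minimal sketch of a launch command and the
corresponding input-script lines (hypothetical accuracy and cutoff
values):
</P>
<PRE>mpirun -np 2 lmp_machine -cuda on -sf cuda -in in.script
</PRE>
<PRE>kspace_style pppm/cuda 1.0e-4   # must be requested explicitly, even with -sf cuda
pair_style lj/cut/cuda 2.5      # or plain lj/cut, converted by the -sf cuda switch
</PRE>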
<P><B>Performance tips:</B>
</P>
<P>The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of 1000s.
</P>
<P>As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation will run.
</P>
<HR>
<H4><A NAME = "acc_8"></A>5.8 KOKKOS package
</H4>
<P>The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures, methods, and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
</P>
<P><A HREF = "http://trilinos.sandia.gov/packages/kokkos">Kokkos</A> is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
</P>
<P>Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
</P>
<P>These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
"device" to build for, compatible with the compute nodes in your
machine. Note that if you are running on a desktop machine, you
typically have one compute node. On a cluster or supercomputer there
may be dozens or 1000s of compute nodes. The procedure for building
and running with the Kokkos library is the same, no matter how many
nodes you run on.
</P>
<P>All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos.
</P>
<P>Kokkos provides support for one or two modes of execution per MPI
task. This means that some computational tasks (pairwise
interactions, neighbor list builds, time integration, etc) are
parallelized in one or the other of the two modes. The first mode is
called the "host" and is one or more threads running on one or more
physical CPUs (within the node). Currently, both multi-core CPUs and
an Intel Phi processor (running in native mode) are supported. The
second mode is called the "device" and is an accelerator chip of some
kind. Currently only an NVIDIA GPU is supported. If your compute
node does not have a GPU, then there is only one mode of execution,
i.e. the host and device are the same.
</P>
<P>IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models (in particular, relatively extensive C++
support is required for the kernel language). This is expected to
change in the future.
</P>
<P>Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
Note that the -np setting for the mpirun command in these examples is
for a run on a single node. To scale these examples up to run on a
system with N compute nodes, simply multiply the -np setting by N.
</P>
<P>All the build steps are performed from within the src directory. All
the run steps are performed in the bench directory using the in.lj
input script. It is assumed the LAMMPS executable has been copied to
that directory or whatever directory the runs are being performed in.
Details of the various options are discussed below.
</P>
<P><B>Compute node(s) = dual hex-core CPUs and no GPU:</B>
</P>
<PRE>make yes-kokkos # install the KOKKOS package
make g++ OMP=yes # build with OpenMP, no CUDA
</PRE>
<PRE>mpirun -np 12 lmp_g++ < in.lj # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk < in.lj # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk < in.lj # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj # two MPI tasks, 6 threads/task
</PRE>
<P><B>Compute node(s) = Intel Phi with 61 cores:</B>
</P>
<PRE>make yes-kokkos
make g++ OMP=yes MIC=yes # build with OpenMP for Phi
</PRE>
<PRE>mpirun -np 12 lmp_g++ -k on t 20 -sf kk < in.lj # 12*20 = 240 total cores
mpirun -np 15 lmp_g++ -k on t 16 -sf kk < in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk < in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk < in.lj
</PRE>
<P><B>Compute node(s) = dual hex-core CPUs and a single GPU:</B>
</P>
<PRE>make yes-kokkos
make cuda CUDA=yes # build for GPU, use src/MAKE/Makefile.cuda
</PRE>
<PRE>mpirun -np 1 lmp_cuda -k on t 6 -sf kk < in.lj
</PRE>
<P><B>Compute node(s) = dual 8-core CPUs and 2 GPUs:</B>
</P>
<PRE>make yes-kokkos
make cuda CUDA=yes
</PRE>
<PRE>mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk < in.lj # use both GPUs, one per MPI task
</PRE>
<P><B>Building LAMMPS with the KOKKOS package:</B>
</P>
<P>A summary of the build process is given here. More details and all
the available make variable options are given in <A HREF = "Section_start.html#start_3_4">this
section</A> of the manual.
</P>
<P>From the src directory, type
</P>
<PRE>make yes-kokkos
</PRE>
<P>to include the KOKKOS package. Then perform a normal LAMMPS build,
with additional make variable specifications to choose the host and
device you will run the resulting executable on, e.g.
</P>
<PRE>make g++ OMP=yes
make cuda CUDA=yes
</PRE>
<P>As illustrated above, the most important variables to set are OMP,
CUDA, and MIC. The default settings are OMP=yes, CUDA=no, MIC=no.
Setting OMP to <I>yes</I> will use OpenMP for threading on the host, as
well as on the device (if no GPU is present). Setting CUDA to <I>yes</I>
will use one or more GPUs as the device. Setting MIC=yes is necessary
when building for an Intel Phi processor.
</P>
<P>Note that to use a GPU, you must use a low-level Makefile,
e.g. src/MAKE/Makefile.cuda as included in the LAMMPS distro, which
uses the NVIDIA "nvcc" compiler. You must check that the CCFLAGS -arch
setting is appropriate for your NVIDIA hardware and installed
software. Typical values for -arch are given in <A HREF = "Section_start.html#start_3_4">this
section</A> of the manual, as well as other
settings that must be included in the low-level Makefile, if you create
your own.
</P>
<P><B>Input scripts and use of command-line switches -kokkos and -suffix:</B>
</P>
<P>To use any Kokkos-enabled style provided in the KOKKOS package, you
must use a Kokkos-enabled atom style. LAMMPS will give an error if
you do not do this.
</P>
<P>There are two command-line switches relevant to using Kokkos, -k or
-kokkos, and -sf or -suffix. They are described in detail in <A HREF = "Section_start.html#start_7">this
section</A> of the manual.
</P>
<P>Here are common options to use:
</P>
<UL><LI>-k on : required to run any KOKKOS-enabled style
<LI>-sf kk : enables automatic use of Kokkos versions of atom, pair,
fix, and compute styles if they exist. This can also be done with more
precise control by using the <A HREF = "suffix.html">suffix</A> command or appending
"kk" to styles within the input script, e.g. "pair_style lj/cut/kk".
<LI>-k on t Nt : specifies how many threads per MPI task to use within a
compute node. For good performance, the product of MPI tasks *
threads/task should not exceed the number of physical CPU or Intel
Phi cores.
<LI>-k on g Ng : specifies how many GPUs per compute node are available.
The default is 1, so this should be specified if you have 2 or more
GPUs per compute node.
</UL>
<P><B>Use of package command options:</B>
</P>
<P>Using the <A HREF = "package.html">package kokkos</A> command in an input script
allows choice of options for neighbor lists and communication. See
the <A HREF = "package.html">package</A> command doc page for details and default
settings.
</P>
<P>Experimenting with different styles of neighbor lists or inter-node
communication can provide a speed-up for specific calculations.
</P>
<P><B>Running on a multi-core CPU:</B>
</P>
<P>Build with OMP=yes (the default) and CUDA=no (the default).
</P>
<P>If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
</P>
<P>You can compare the performance running in different modes:
</P>
<UL><LI>run with 1 MPI task/node and N threads/task
<LI>run with N MPI tasks/node and 1 thread/task
<LI>run with settings in between these extremes
</UL>
<P>Examples of mpirun commands in these modes, for nodes with dual
hex-core CPUs and no GPU, are shown above.
</P>
<P><B>Running on GPUs:</B>
</P>
<P>Build with CUDA=yes, using src/MAKE/Makefile.cuda. Ensure the setting
for CUDA_PATH in lib/kokkos/Makefile.lammps is correct for your Cuda
software installation. Ensure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
<A HREF = "Section_start.html#start_3_4">this section</A> of the manual for details).
</P>
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the number of physical GPUs on the node.
</P>
<P>Use the <A HREF = "Section_commands.html#start_7">-kokkos command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
task. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N. With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value. This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.
</P>
<P>Examples of mpirun commands that follow these rules, for nodes with
dual hex-core CPUs and one or two GPUs, are shown above.
</P>
<P><B>Running on an Intel Phi:</B>
</P>
<P>Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.
</P>
<P>Build with OMP=yes (the default) and MIC=yes. The latter
ensures code is correctly compiled for the Intel Phi. The
OMP setting means OpenMP will be used for parallelization
on the Phi, which is currently the best option within
Kokkos. In the future, other options may be added.
</P>
<P>Current-generation Intel Phi chips have either 61 or 57 cores. One
core should be excluded to run the OS, leaving 60 or 56 cores. Each
core is hyperthreaded, so there are effectively N = 240 (4*60) or N =
224 (4*56) cores to run on.
</P>
<P>The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these 2 values should be N, i.e.
240 or 224. Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.
</P>
<P>Examples of mpirun commands that follow these rules, for Intel Phi
nodes with 61 cores, are shown above.
</P>
<P><B>Examples and benchmarks:</B>
</P>
<P>The examples/kokkos and bench/KOKKOS directories have scripts that can
be run with the KOKKOS package, as well as detailed instructions on
how to run them.
</P>
<P>IMPORTANT NOTE: the bench/KOKKOS directory does not yet exist. It
will be added later.
</P>
<P><B>Additional performance issues:</B>
</P>
<P>When using threads (OpenMP or pthreads), it is important for
performance to bind the threads to physical cores, so they do not
migrate during a simulation.  The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions do not
account for thread binding.
</P>
<P>Thus if you use more than one thread per MPI task, you should ensure
MPI tasks are bound to CPU sockets.  Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP, and
compile with hwloc support when using pthreads.  With OpenMP 3.1 (gcc
4.7 or later, Intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient.  A typical mpirun command
should set these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
</PRE>
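<P>For example, combining socket binding with the OpenMP affinity
variable mentioned above might look as follows (illustrative only; the
executable name, thread count, and KOKKOS switches are placeholders):
</P>
<PRE>export OMP_PROC_BIND=true    # pin OpenMP threads so they do not migrate
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 6 -sf kk -in in.lj
</PRE>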
<P>When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled.  This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU.  Invoking a
non-Kokkos fix or compute, or performing I/O for
<A HREF = "thermo_style.html">thermo</A> or <A HREF = "dump.html">dump</A> output, will cause data
to be copied back to the CPU.
</P>
<P>You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package.  We plan to support this in the future, similar to the
GPU package in LAMMPS.
</P>
<P>You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package.  We
hope to support this in the future, similar to the GPU package in
LAMMPS.
</P>
<HR>

<H4><A NAME = "acc_9"></A>5.9 USER-INTEL package
</H4>
<P>The USER-INTEL package was developed by Mike Brown at Intel
Corporation.  It accelerates simulations by offloading neighbor list
and non-bonded force calculations to Intel(R) Xeon Phi(TM)
coprocessors.  Additionally, it supports running simulations in
single, mixed, or double precision with vectorization, even if a
coprocessor is not present, i.e. on an Intel(R) CPU.  The same C++
code is used for both cases.  When offloading to a coprocessor, the
routine is run twice, once with an offload flag.
</P>
<P>The USER-INTEL package can be used in tandem with the USER-OMP
package.  This is useful when a USER-INTEL pair style is used, so that
other styles not supported by the USER-INTEL package, e.g. for bond,
angle, dihedral, improper, and long-range electrostatics, can be run
with the USER-OMP package versions.  If you have built LAMMPS with
both the USER-INTEL and USER-OMP packages, then this mode of operation
is made easier, because the "-suffix intel" <A HREF = "Section_start.html#start_7">command-line
switch</A> and the <A HREF = "suffix.html">suffix
intel</A> command will both set a second-choice suffix to
"omp" so that styles from the USER-OMP package will be used if
available.
</P>
<P><B>Building LAMMPS with the USER-INTEL package:</B>
</P>
<P>The procedure for building LAMMPS with the USER-INTEL package is
simple.  Edit your machine-specific makefile to add the flag that
enables OpenMP support (<I>-openmp</I>) to both the CCFLAGS and
LINKFLAGS variables.  You also need to add -DLAMMPS_MEMALIGN=64 and
-restrict to CCFLAGS.
</P>
<P>Note that currently you must use the Intel C++ compiler (icc/icpc) to
build the package.  In the future, using other compilers (e.g. g++)
may be possible.
</P>
<P>If you are compiling on the same architecture that will be used for
the runs, adding the flag <I>-xHost</I> will enable vectorization with the
Intel(R) compiler.  In order to build with support for an Intel(R)
coprocessor, the flag <I>-offload</I> should be added to the LINKFLAGS line
and the flag <I>-DLMP_INTEL_OFFLOAD</I> should be added to the CCFLAGS
line.
</P>
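<P>A hypothetical makefile fragment collecting the flags mentioned above,
for an Intel(R) compiler build with coprocessor offload support (the
compiler wrapper and optimization flags will differ between machine
makefiles):
</P>
<PRE>CC =        mpiicpc
CCFLAGS =   -O3 -openmp -restrict -xHost -DLAMMPS_MEMALIGN=64 -DLMP_INTEL_OFFLOAD
LINK =      mpiicpc
LINKFLAGS = -O3 -openmp -xHost -offload
</PRE>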
<P>The files Makefile.intel and Makefile.intel_offload in the src/MAKE
directory contain options that perform well with the Intel(R)
compiler.  The latter Makefile supports offload to coprocessors; the
former does not.
</P>
<P>It is recommended that Intel(R) Compiler 2013 SP1 update 1 be used for
compiling.  Newer versions have some performance issues that are being
addressed.  If using Intel(R) MPI, version 5 or higher is recommended.
</P>
<P>The rest of the compilation is the same as for any other package that
has no additional library dependencies, e.g.
</P>
<PRE>make yes-user-intel yes-user-omp
make machine
</PRE>
<P><B>Running an input script:</B>
</P>
<P>The examples/intel directory has scripts that can be run with the
USER-INTEL package, as well as detailed instructions on how to run
them.
</P>
<P>The total number of MPI tasks used by LAMMPS (one or multiple per
compute node) is set in the usual manner via the mpirun or mpiexec
commands, and is independent of the USER-INTEL package.
</P>
<P>Input script requirements to run using pair styles with an <I>intel</I>
suffix are as follows:
</P>
<P>To invoke specific styles from the USER-INTEL package, either append
"intel" to the style name (e.g. pair_style lj/cut/intel), or use the
<A HREF = "Section_start.html#start_7">-suffix command-line switch</A>, or use the
<A HREF = "suffix.html">suffix</A> command in the input script.
</P>
<P>Unless the <A HREF = "Section_start.html#start_7">-suffix intel command-line
switch</A> is used, a <A HREF = "package.html">package
intel</A> command must be used near the beginning of the
input script.  The default precision mode for the USER-INTEL package
is <I>mixed</I>, meaning that accumulation is performed in double precision
and other calculations are performed in single precision.  In order to
use all single or all double precision, the <A HREF = "package.html">package
intel</A> command must be used in the input script with a
"single" or "double" keyword specified.
</P>
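<P>A minimal input-script sketch follows; the placement of the precision
keyword on the <A HREF = "package.html">package intel</A> command is an
assumption here, so see the package doc page for the exact argument
syntax:
</P>
<PRE>suffix intel            # or start LAMMPS with the -suffix intel command-line switch
package intel double    # assumed keyword placement: request all-double precision
pair_style lj/cut 2.5   # runs as pair_style lj/cut/intel via the suffix
</PRE>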
<P><B>Running with an Intel(R) coprocessor:</B>
</P>
<P>The USER-INTEL package supports offload of a fraction of the work to
Intel(R) Xeon Phi(TM) coprocessors.  This is accomplished by setting a
balance fraction on the <A HREF = "package.html">package intel</A> command.  A
balance of 0 runs all calculations on the CPU.  A balance of 1 runs
all calculations on the coprocessor.  A balance of 0.5 runs half of
the calculations on the coprocessor.  Setting the balance to -1 will
enable dynamic load balancing that continuously adjusts the fraction of
offloaded work throughout the simulation.  This option typically
produces results within 5 to 10 percent of the optimal fixed balance.
By default, using the <A HREF = "suffix.html">suffix</A> command or <A HREF = "Section_start.html#start_7">-suffix
command-line switch</A> will use offload to a
coprocessor with the balance set to -1.  If LAMMPS is built without
offload support, this setting is ignored.
</P>
<P>For short benchmark runs with dynamic load balancing, adding a short
warm-up run (10-20 steps) allows the load balancer to find a setting
that carries over to additional runs, as in the sketch below.
</P>
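<P>A sketch of this warm-up pattern in an input script (step counts are
illustrative):
</P>
<PRE>run 20       # short warm-up so the dynamic load balancer settles
run 1000     # this benchmark run reuses the tuned balance
</PRE>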
<P>The default for the <A HREF = "package.html">package intel</A> command is to have
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
coprocessor.  In general, running with a large number of MPI tasks on
each node will perform best with offload.  Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor.  For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor.  Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keywords to the
<A HREF = "package.html">package intel</A> command.  A launch command matching this
example is sketched below.
</P>
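<P>For the 61-core example above, a matching launch might look as follows
(the executable and input script names are placeholders):
</P>
<PRE>mpirun -np 24 ./lmp_intel_offload -suffix intel -in in.script    # each task offloads to a subset of ~10 threads
</PRE>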
<P>If LAMMPS is using offload to an Intel(R) Xeon Phi(TM) coprocessor, a
diagnostic line is printed to the screen (not to log files) during the
setup for a run, indicating that offload is being used and giving the
number of coprocessor threads per MPI task.  Additionally, an offload
timing summary is printed at the end of each run.  When using offload,
the <A HREF = "atom_modify.html">sort</A> frequency for atom data is changed to 1 so
that the per-atom data is sorted every neighbor build.
</P>
<P>To use multiple coprocessors on each compute node, set the number of
coprocessors to use with the <I>offload_cards</I> keyword of the
<A HREF = "package.html">package intel</A> command.
</P>
<P>For simulations with long-range electrostatics or bond, angle,
dihedral, or improper calculations, computation and data transfer to
the coprocessor will run concurrently with computations and MPI
communications for these routines on the host.  The USER-INTEL package
has two modes for deciding which atoms will be handled by the
coprocessor.  The setting is controlled with the "offload_ghost"
option.  When set to 0, ghost atoms (atoms at the borders between MPI
tasks) are not offloaded to the card.  This allows for overlap of MPI
communication of forces with computation on the coprocessor when the
<A HREF = "newton.html">newton</A> setting is "on".  The default depends on the
style being used; however, better performance might be achieved by
setting this explicitly.
</P>
<P>In order to control the number of OpenMP threads used on the host, the
OMP_NUM_THREADS environment variable should be set.  This variable will
not influence the number of threads used on the coprocessor.  Only the
<A HREF = "package.html">package intel</A> command can be used to control thread
counts on the coprocessor.
</P>
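<P>For example, to use 4 OpenMP threads per MPI task on the host (the
executable and input script names are placeholders; coprocessor thread
counts are unaffected by this variable):
</P>
<PRE>export OMP_NUM_THREADS=4
mpirun -np 6 ./lmp_intel_offload -suffix intel -in in.script
</PRE>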
<P><B>Restrictions:</B>
</P>
<P>When using offload, <A HREF = "pair_hybrid.html">hybrid</A> styles that require skip
lists for neighbor builds cannot be offloaded to the coprocessor.
Using <A HREF = "pair_hybrid.html">hybrid/overlay</A> is allowed.  Only one intel
accelerated style may be used with hybrid styles.  Exclusion lists are
not currently supported with offload; however, the same effect can
often be accomplished by setting cutoffs for excluded atom types to 0.
None of the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
<A HREF = "run_style.html">run_style respa</A> command.
</P>
<HR>

<H4><A NAME = "acc_10"></A>5.10 Comparison of GPU and USER-CUDA packages
</H4>
<P>Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
</P>
<P>As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other.  We give guidelines below,
but the best way to determine which package is faster for your input
script is to try both of them on your machine.  See the benchmarking
section below for examples where this has been done.
</P>
<P><B>Guidelines for using each package optimally:</B>
</P>
<UL><LI>The GPU package allows you to assign multiple CPUs (cores) to a single
GPU (a common configuration for "hybrid" nodes that contain multicore
CPU(s) and GPU(s)) and works effectively in this mode.  The USER-CUDA
package does not allow this; you can only use one CPU per GPU.

<LI>The GPU package moves per-atom data (coordinates, forces)
back-and-forth between the CPU and GPU every timestep.  The USER-CUDA
package only does this on timesteps when a CPU calculation is required
(e.g. to invoke a fix or compute that is non-GPU-ized).  Hence, if you
can formulate your input script to only use GPU-ized fixes and
computes, and avoid doing I/O too often (thermo output, dump file
snapshots, restart files), then the data transfer cost of the
USER-CUDA package can be very low, causing it to run faster than the
GPU package.

<LI>The GPU package is often faster than the USER-CUDA package if the
number of atoms per GPU is "small".  The crossover point, in terms of
atoms/GPU, at which the USER-CUDA package becomes faster depends
strongly on the pair style.  For example, for a simple Lennard-Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU.  When performing double precision calculations the
crossover point can be significantly smaller.

<LI>Both packages compute bonded interactions (bonds, angles, etc.) on the
CPU.  This means a model with bonds will force the USER-CUDA package
to transfer per-atom data back-and-forth between the CPU and GPU every
timestep.  If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster.

<LI>When using the GPU package with multiple CPUs assigned to one GPU, its
performance depends to some extent on high bandwidth between the CPUs
and the GPU.  Hence its performance is affected if full 16 PCIe lanes
are not available for each GPU.  In HPC environments this can be the
case if S2050/70 servers are used, where two devices generally share
one PCIe 2.0 16x slot.  Also many multi-GPU mainboards do not provide
full 16 lanes to each of the PCIe 2.0 16x slots.
</UL>
<P><B>Differences between the two packages:</B>
</P>
<UL><LI>The GPU package accelerates only pair force, neighbor list, and PPPM
calculations.  The USER-CUDA package currently supports a wider range
of pair styles and can also accelerate many fix styles and some
compute styles, as well as neighbor list and PPPM calculations.

<LI>The USER-CUDA package does not support acceleration for minimization.

<LI>The USER-CUDA package does not support hybrid pair styles.

<LI>The USER-CUDA package can order atoms in the neighbor list differently
from run to run, resulting in a different order for force accumulation.

<LI>The USER-CUDA package has a limit on the number of atom types that can be
used in a simulation.

<LI>The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists or a triclinic simulation box.

<LI>The GPU package uses more GPU memory than the USER-CUDA package.  This
is generally not a problem since typical runs are computation-limited
rather than memory-limited.
</UL>
<P><B>Examples:</B>
</P>
<P>The LAMMPS distribution has two directories with sample input scripts
for the GPU and USER-CUDA packages.
</P>
<UL><LI>lammps/examples/gpu = GPU package files

<LI>lammps/examples/USER/cuda = USER-CUDA package files
</UL>
<P>These contain input scripts for identical systems, so they can be used
to benchmark the performance of both packages on your system.
</P>
</HTML>