<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>5. USER-INTEL package — LAMMPS documentation</title>
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
<link rel="top" title="LAMMPS documentation" href="index.html"/>
<script src="_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-nav-search">
<a href="Manual.html" class="icon icon-home"> LAMMPS
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
<li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
</ul>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="Manual.html">LAMMPS</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="Manual.html">Docs</a> »</li>
<li>5. USER-INTEL package</li>
<li class="wy-breadcrumbs-aside">
<a href="http://lammps.sandia.gov">Website</a>
<a href="Section_commands.html#comm">Commands</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<p><a class="reference internal" href="Section_accelerate.html"><span class="doc">Return to Section accelerate overview</span></a></p>
<div class="section" id="user-intel-package">
<h1>5. USER-INTEL package</h1>
<p>The USER-INTEL package is maintained by Mike Brown at Intel
Corporation. It provides two methods for accelerating simulations,
depending on the hardware you have. The first is acceleration on
Intel CPUs by running in single, mixed, or double precision with
vectorization. The second is acceleration on Intel Xeon Phi
coprocessors via offloading of neighbor list and non-bonded force
calculations to the Phi. The same C++ code is used in both cases.
When offloading to a coprocessor from a CPU, the same routine is run
twice, once on the CPU and once with an offload flag. This allows
LAMMPS to run on the CPU cores and coprocessor cores simultaneously.</p>
<p><strong>Currently Available USER-INTEL Styles:</strong></p>
<ul class="simple">
<li>Angle Styles: charmm, harmonic</li>
<li>Bond Styles: fene, harmonic</li>
<li>Dihedral Styles: charmm, harmonic, opls</li>
<li>Fixes: nve, npt, nvt, nvt/sllod</li>
<li>Improper Styles: cvff, harmonic</li>
<li>Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne,
charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff</li>
<li>K-Space Styles: pppm</li>
</ul>
<p><strong>Speed-ups to expect:</strong></p>
<p>The speedups will depend on your simulation, the hardware, which
styles are used, the number of atoms, and the floating-point
precision mode. Performance improvements are shown compared to
LAMMPS <em>without using other acceleration packages</em>, as these are
under active development (and subject to performance changes). The
measurements were performed using the input files available in
the src/USER-INTEL/TEST directory. These are scalable in size; the
results given are with 512K particles (524K for Liquid Crystal).
Most of the simulations are standard LAMMPS benchmarks (indicated
by the filename extension in parentheses) with modifications to the
run length and to add a warmup run (for use with offload
benchmarks).</p>
<img alt="_images/user_intel.png" class="align-center" src="_images/user_intel.png" />
<p>Results are speedups obtained on Intel Xeon E5-2697v4 processors
(code-named Broadwell) and Intel Xeon Phi 7250 processors
(code-named Knights Landing) with the “18 Jun 2016” LAMMPS built with
Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
per physical core. See <em>src/USER-INTEL/TEST/README</em> for the raw
simulation rates and instructions to reproduce.</p>
<hr class="docutils" />
<p><strong>Quick Start for Experienced Users:</strong></p>
<p>LAMMPS should be built with the USER-INTEL package installed.
Simulations should be run with 1 MPI task per physical <em>core</em>,
not <em>hardware thread</em>.</p>
<p>For Intel Xeon CPUs:</p>
<ul class="simple">
<li>Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.</li>
<li>If using <em>kspace_style pppm</em> in the input script, add “neigh_modify
binsize 3” and “kspace_modify diff ad” to the input script for better
performance.</li>
<li>Add “-pk intel 0 omp 2 -sf intel” to the LAMMPS command line.</li>
</ul>
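<p>Put together, a typical Xeon launch might look like the following sketch; the
binary name and the 36-core node are assumptions for illustration only:</p>

```shell
# 1 MPI task per physical core (36 assumed), 2 OpenMP threads per task via SMT.
# "-sf intel" appends the intel suffix to supported styles; "-pk intel 0 omp 2"
# sets package options (0 coprocessors, 2 threads per task).
# Binary name assumed from "make intel_cpu_intelmpi".
mpirun -np 36 ./lmp_intel_cpu_intelmpi -pk intel 0 omp 2 -sf intel -in in.script
```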
<p>For Intel Xeon Phi CPUs for simulations without <em>kspace_style
pppm</em> in the input script:</p>
<ul class="simple">
<li>Edit src/MAKE/OPTIONS/Makefile.knl as necessary.</li>
<li>Runs should be performed using MCDRAM.</li>
<li>Add “-pk intel 0 omp 2 -sf intel” <em>or</em> “-pk intel 0 omp 4 -sf intel”
to the LAMMPS command line. The best choice depends on the
simulation.</li>
</ul>
<p>For Intel Xeon Phi CPUs for simulations with <em>kspace_style
pppm</em> in the input script:</p>
<ul class="simple">
<li>Edit src/MAKE/OPTIONS/Makefile.knl as necessary.</li>
<li>Runs should be performed using MCDRAM.</li>
<li>Add “neigh_modify binsize 3” and “kspace_modify diff ad” to the
input script for better performance.</li>
<li>export KMP_AFFINITY=none</li>
<li>Add “-pk intel 0 omp 3 lrt yes -sf intel” or “-pk intel 0 omp 1 lrt yes
-sf intel” to the LAMMPS command line. The best choice depends on the
simulation.</li>
</ul>
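<p>As a sketch, the steps above combine into a launch like the following. The
68-core node, the binary name, and the use of numactl to bind allocations to
the MCDRAM NUMA node (commonly node 1 when MCDRAM is in “Flat” mode) are
assumptions for illustration; check your system documentation for the correct
way to target MCDRAM:</p>

```shell
export KMP_AFFINITY=none
# 1 MPI task per physical core (68 assumed); 3 OpenMP threads plus the extra
# LRT pthread occupy the 4 hardware threads of each core.
mpirun -np 68 numactl --membind=1 ./lmp_knl \
    -pk intel 0 omp 3 lrt yes -sf intel -in in.script
```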
<p>For Intel Xeon Phi coprocessors (Offload):</p>
<ul class="simple">
<li>Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary.</li>
<li>Add “-pk intel N omp 1” to the command line, where N is the number of
coprocessors per node.</li>
</ul>
<hr class="docutils" />
<p><strong>Required hardware/software:</strong></p>
<p>In order to use offload to coprocessors, an Intel Xeon Phi
coprocessor and an Intel compiler are required. For this, the
recommended version of the Intel compiler is 14.0.1.106, or
versions 15.0.2.044 and higher.</p>
<p>Although any compiler can be used with the USER-INTEL package,
vectorization directives are currently disabled by default when
not using Intel compilers, due to the lack of standard support and
observations of decreased performance. The OpenMP standard now
supports directives for vectorization, and we plan to transition the
code to this standard once it is available in most compilers. We
expect this to allow improved performance and support with other
compilers.</p>
<p>For Intel Xeon Phi x200 series processors (code-named Knights
Landing), there are multiple configuration options for the hardware.
For best performance, we recommend that the MCDRAM is configured in
“Flat” mode and with the cluster mode set to “Quadrant” or “SNC4”.
“Cache” mode can also be used, although the performance might be
slightly lower.</p>
<p><strong>Notes about Simultaneous Multithreading:</strong></p>
<p>Modern CPUs often support Simultaneous Multithreading (SMT). On
Intel processors, this is called Hyper-Threading (HT) technology.
SMT is hardware support for running multiple threads efficiently on
a single core. <em>Hardware threads</em> or <em>logical cores</em> are often used
to refer to the number of threads that are supported in hardware.
For example, the Intel Xeon E5-2697v4 processor is described
as having 36 cores and 72 threads. This means that 36 MPI processes
or OpenMP threads can run simultaneously on separate cores, but that
up to 72 MPI processes or OpenMP threads can be running on the CPU
without costly operating system context switches.</p>
<p>Molecular dynamics simulations will often run faster when making use
of SMT. If a thread becomes stalled, for example because it is
waiting on data that has not yet arrived from memory, another thread
can start running so that the CPU pipeline is still being used
efficiently. Although benefits can be seen by launching an MPI task
for every hardware thread, for multinode simulations we recommend
that OpenMP threads are used for SMT instead, either with the
USER-INTEL package, the <a class="reference external" href="accelerate_omp.html">USER-OMP package</a>, or the
<a class="reference internal" href="accelerate_kokkos.html"><span class="doc">KOKKOS package</span></a>. In the example above, up
to 36X speedups can be observed by using all 36 physical cores with
LAMMPS. By using all 72 hardware threads, an additional 10-30%
performance gain can be achieved.</p>
<p>The BIOS on many platforms allows SMT to be disabled; however, we do
not recommend this on modern processors, as there is little to no
benefit for any software package in most cases. The operating system
will report every hardware thread as a separate core, allowing one to
determine the number of hardware threads available. On Linux systems,
this information can normally be obtained with:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cat</span> <span class="o">/</span><span class="n">proc</span><span class="o">/</span><span class="n">cpuinfo</span>
</pre></div>
</div>
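<p>A quicker summary than reading /proc/cpuinfo entry by entry can be had with
standard Linux tools; this is a sketch, and lscpu's exact field names may vary
slightly between distributions:</p>

```shell
# Total logical CPUs (hardware threads) visible to the OS:
grep -c ^processor /proc/cpuinfo

# Threads per core, cores per socket, and socket count in one view:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
```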
<p><strong>Building LAMMPS with the USER-INTEL package:</strong></p>
<p>The USER-INTEL package must be installed into the source directory:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">make</span> <span class="n">yes</span><span class="o">-</span><span class="n">user</span><span class="o">-</span><span class="n">intel</span>
</pre></div>
</div>
<p>Several example Makefiles for building with the Intel compiler are
included with LAMMPS in the src/MAKE/OPTIONS/ directory:</p>
<pre class="literal-block">
Makefile.intel_cpu_intelmpi   # Intel Compiler, Intel MPI, No Offload
Makefile.knl                  # Intel Compiler, Intel MPI, No Offload
Makefile.intel_cpu_mpich      # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi    # Intel Compiler, OpenMPI, No Offload
Makefile.intel_coprocessor    # Intel Compiler, Intel MPI, Offload
</pre>
<p>Makefile.knl is identical to Makefile.intel_cpu_intelmpi, except that
it explicitly specifies that vectorization should be for Intel
Xeon Phi x200 processors, making it easier to cross-compile. For
users with recent installations of Intel Parallel Studio, the
process can be as simple as:</p>
<pre class="literal-block">
make yes-user-intel
source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
# or psxevars.csh for C-shell
make intel_cpu_intelmpi
</pre>
<p>Alternatively, the build can be accomplished with the src/Make.py
script, described in <a class="reference internal" href="Section_start.html#start-4"><span class="std std-ref">Section 2.4</span></a> of the
manual. Type “Make.py -h” for help. For example:</p>
<pre class="literal-block">
Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi
</pre>
<p>Note that if you build with support for a Phi coprocessor, the same
binary can be used on nodes with or without coprocessors installed.
However, if you do not have coprocessors on your system, building
without offload support will produce a smaller binary.</p>
<p>The general requirements for Makefiles with the USER-INTEL package
are as follows. “-DLAMMPS_MEMALIGN=64” is required for CCFLAGS. When
using Intel compilers, “-restrict” is required and “-qopenmp” is
highly recommended for CCFLAGS and LINKFLAGS. LIB should include
“-ltbbmalloc”. For builds supporting offload, “-DLMP_INTEL_OFFLOAD”
is required for CCFLAGS and “-qoffload” is required for LINKFLAGS.
Other recommended CCFLAGS options for best performance are
“-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2
-no-prec-div”. The Make.py command will add all of these
automatically.</p>
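<p>Collected in one place, the requirements above amount to Makefile lines like
the following sketch (Intel compiler and Intel MPI, no offload). The mpiicpc
wrapper and the exact layout are illustrative; adapt an existing file from
src/MAKE/OPTIONS rather than writing one from scratch:</p>

```make
# Sketch of the flag-related lines in a src/MAKE machine Makefile.
# -DLAMMPS_MEMALIGN=64 is required; -restrict and -qopenmp apply to the
# Intel compiler; the remaining CCFLAGS are the recommended optimizations.
CC =        mpiicpc
CCFLAGS =   -qopenmp -DLAMMPS_MEMALIGN=64 -restrict -O2 -fno-alias \
            -ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div -xHost
LINK =      mpiicpc
LINKFLAGS = -qopenmp -xHost -O2
LIB =       -ltbbmalloc
```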
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The vectorization and math capabilities can differ depending on
the CPU. For Intel compilers, the “-x” flag specifies the type of
processor for which to optimize. “-xHost” specifies that the compiler
should build for the processor used for compiling. For Intel Xeon Phi
x200 series processors, this option is “-xMIC-AVX512”. For fourth
generation Intel Xeon (v4/Broadwell) processors, “-xCORE-AVX2” should
be used. For older Intel Xeon processors, “-xAVX” will perform best
in general for the different simulations in LAMMPS. The default
in most of the example Makefiles is to use “-xHost”; however, this
should not be used when cross-compiling.</p>
</div>
<p><strong>Running LAMMPS with the USER-INTEL package:</strong></p>
<p>Running LAMMPS with the USER-INTEL package is similar to normal use,
with the exceptions that one should 1) specify that LAMMPS should use
the USER-INTEL package, 2) specify the number of OpenMP threads, and
3) optionally specify the specific LAMMPS styles that should use the
USER-INTEL package. 1) and 2) can be performed from the command line
or by editing the input script. 3) requires editing the input script.
Advanced performance tuning options are also described below to get
the best performance.</p>
<p>When running on a single node (including runs using offload to a
coprocessor), best performance is normally obtained by using 1 MPI
task per physical core and additional OpenMP threads with SMT. For
Intel Xeon processors, 2 OpenMP threads should be used for SMT.
For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used
(best choice depends on the simulation). In cases where the user
specifies that LRT mode is used (described below), 1 or 3 OpenMP
threads should be used. For multi-node runs, using 1 MPI task per
physical core will often perform best; however, depending on the
machine and scale, users might get better performance by decreasing
the number of MPI tasks and using more OpenMP threads. For
performance, the product of the number of MPI tasks and OpenMP
threads should not exceed the number of available hardware threads in
almost all cases.</p>
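<p>The rule of thumb in the last sentence is simple arithmetic; the following
small sketch makes it explicit (the node sizes and the helper name are
hypothetical, for illustration only):</p>

```python
def max_omp_threads(hw_threads_per_node: int, mpi_tasks_per_node: int) -> int:
    """Largest OpenMP thread count per MPI task that avoids
    oversubscribing the node's hardware threads."""
    return hw_threads_per_node // mpi_tasks_per_node

# A 36-core Xeon node with 2-way SMT, 1 MPI task per core:
print(max_omp_threads(36 * 2, 36))   # -> 2
# A 68-core Xeon Phi x200 node with 4-way SMT, 1 MPI task per core:
print(max_omp_threads(68 * 4, 68))   # -> 4
```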
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Setting core affinity is often used to pin MPI tasks and OpenMP
threads to a core or group of cores so that memory access can be
uniform. Unless disabled at build time, affinity for MPI tasks and
OpenMP threads on the host (CPU) will be set by default on the host
<em>when using offload to a coprocessor</em>. In this case, it is unnecessary
to use other methods to control affinity (e.g. taskset, numactl,
I_MPI_PIN_DOMAIN, etc.). This can be disabled with the <em>no_affinity</em>
option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command or by disabling the
option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
CCFLAGS line of your Makefile). Disabling this option is not
recommended, especially when running on a machine with Intel
Hyper-Threading technology disabled.</p>
</div>
<p><strong>Run with the USER-INTEL package from the command line:</strong></p>
<p>To enable USER-INTEL optimizations for all available styles used in
the input script, the “-sf intel”
<a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> can be used without
any requirement for editing the input script. This switch will
automatically append “intel” to styles that support it. It also
invokes a default command: <a class="reference internal" href="package.html"><span class="doc">package intel 1</span></a>. This
package command is used to set options for the USER-INTEL package.
The default package command will specify that USER-INTEL calculations
are performed in mixed precision, that the number of OpenMP threads
is specified by the OMP_NUM_THREADS environment variable, and that
if coprocessors are present and the binary was built with offload
support, 1 coprocessor per node will be used with automatic
balancing of work between the CPU and the coprocessor.</p>
<p>You can specify different options for the USER-INTEL package by using
the “-pk intel Nphi” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a>
with keyword/value pairs as specified in the documentation. Here,
Nphi = # of Xeon Phi coprocessors/node (ignored without offload
support). Common options to the USER-INTEL package include <em>omp</em> to
override any OMP_NUM_THREADS setting and specify the number of OpenMP
threads, <em>mode</em> to set the floating-point precision mode, and
<em>lrt</em> to enable Long-Range Thread mode, as described below. See the
<a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command for details, including the
default values used for all its options if not specified, and how to
set the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.</p>
<p>Examples (see the documentation for your MPI/machine for differences in
launching MPI applications):</p>
<pre class="literal-block">
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script                               # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP threads
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision
</pre>
<p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p>
<p>As an alternative to adding command-line arguments, the input script
can be edited to enable the USER-INTEL package. This requires adding
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command to the top of the input
script. For the second example above, this would be:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">package</span> <span class="n">intel</span> <span class="mi">0</span> <span class="n">omp</span> <span class="mi">2</span> <span class="n">mode</span> <span class="n">double</span>
</pre></div>
</div>
<p>To enable the USER-INTEL package only for individual styles, you can
add an “intel” suffix to the individual style, e.g.:</p>
<pre class="literal-block">
pair_style lj/cut/intel 2.5
</pre>
<p>Alternatively, the <a class="reference internal" href="suffix.html"><span class="doc">suffix intel</span></a> command can be added to
the input script to enable USER-INTEL styles for the commands that
follow in the input script.</p>
<p><strong>Tuning for Performance:</strong></p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The USER-INTEL package will perform better with modifications
to the input script when <a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> is used:
<a class="reference internal" href="kspace_modify.html"><span class="doc">kspace_modify diff ad</span></a> and <a class="reference internal" href="neigh_modify.html"><span class="doc">neigh_modify binsize 3</span></a> should be added to the input script.</p>
</div>
<p>Long-Range Thread (LRT) mode is an option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command that can improve performance when using
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> for long-range electrostatics on processors
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. On Intel Xeon Phi x200 series CPUs, this will likely
always improve performance, even on a single node. On Intel Xeon
processors, using this mode might result in better performance when
using multiple nodes, depending on the machine. To use this mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run, and add the “lrt yes” option to the “-pk”
command-line switch or the “package intel” command. For example, if a run
would normally perform best with “-pk intel 0 omp 4”, instead use
“-pk intel 0 omp 3 lrt yes”. When using LRT, you should set the
environment variable “KMP_AFFINITY=none”. LRT mode is not supported
when using offload.</p>
<p>Not all styles are supported in the USER-INTEL package. You can mix
the USER-INTEL package with styles from the <a class="reference internal" href="accelerate_opt.html"><span class="doc">OPT</span></a>
package or the <a class="reference external" href="accelerate_omp.html">USER-OMP package</a>. Of course,
this requires that these packages were installed at build time. This
can be performed automatically by using the “-sf hybrid intel opt” or
“-sf hybrid intel omp” command-line options. Alternatively, the “opt”
and “omp” suffixes can be appended manually in the input script. For
the latter, the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command must be in the
input script, or the “-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> must be used, where Nt is the
number of OpenMP threads. The number of OpenMP threads should not be
set differently for the different packages. Note that the <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> command can also be used within the
input script to automatically append the “omp” suffix to styles when
USER-INTEL styles are not available.</p>
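<p>For instance, a hybrid launch could look like the following sketch; the binary
name and the task count are hypothetical:</p>

```shell
# Use USER-INTEL styles where available and fall back to USER-OMP styles
# elsewhere; both packages are given the same 2 OpenMP threads per task.
mpirun -np 36 lmp_machine -sf hybrid intel omp \
    -pk intel 0 omp 2 -pk omp 2 -in in.script
```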
<p>When running on many nodes, performance might be better when using
fewer OpenMP threads and more MPI tasks. This will depend on the
simulation and the machine. Using the <a class="reference internal" href="run_style.html"><span class="doc">verlet/split</span></a>
run style might also give better performance for simulations with
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> electrostatics. Note that this is an
alternative to LRT mode, and the two cannot be used together.</p>
<p>Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable “I_MPI_SHM_LMT=shm” for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
that MPI runs are performed in MCDRAM.</p>
<p><strong>Tuning for Offload Performance:</strong></p>
<p>The default settings for offload should give good performance.</p>
<p>When using LAMMPS with offload to Intel coprocessors, best performance
will typically be achieved with concurrent calculations performed on
both the CPU and the coprocessor. This is achieved by offloading only
a fraction of the neighbor and pair computations to the coprocessor, or by
using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> pair styles where only one style uses
the “intel” suffix. For simulations with long-range electrostatics or
bond, angle, dihedral, or improper calculations, computation and data
transfer to the coprocessor will run concurrently with computations
and MPI communications for these calculations on the host CPU. This
is illustrated in the figure below for the rhodopsin protein benchmark
running on E5-2697v2 processors with an Intel Xeon Phi 7120p
coprocessor. In this plot, the vertical axis is time, and routines
running at the same time are running concurrently on both the host and
the coprocessor.</p>
<img alt="_images/offload_knc.png" class="align-center" src="_images/offload_knc.png" />
<p>The fraction of the offloaded work is controlled by the <em>balance</em>
keyword in the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. A balance of 0
runs all calculations on the CPU. A balance of 1 runs all
supported calculations on the coprocessor. A balance of 0.5 runs half
of the calculations on the coprocessor. Setting the balance to -1
(the default) will enable dynamic load balancing that continuously
adjusts the fraction of offloaded work throughout the simulation.
Because data transfer cannot be timed, this option typically produces
results within 5 to 10 percent of the optimal fixed balance.</p>
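<p>In an input script, a fixed balance can be set directly on the package
command; this sketch assumes 1 coprocessor per node and offloads half of the
supported work:</p>

```
package intel 1 balance 0.5
```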
<p>If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs.</p>
<p>The default for the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command is to have
all the MPI tasks on a given compute node use a single Xeon Phi
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command.</p>
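<p>The thread-partitioning arithmetic in the example above can be sketched as
follows; the 61-core card is the example from the text, and the helper name is
made up for illustration:</p>

```python
def coprocessor_threads_per_task(offload_cores: int, threads_per_core: int,
                                 mpi_tasks_per_node: int) -> int:
    """Hardware threads on the coprocessor assigned to each MPI task."""
    total_threads = offload_cores * threads_per_core
    return total_threads // mpi_tasks_per_node

# 61-core card, 60 cores available for offload, 4 hardware threads per core,
# 24 MPI tasks per node -> 10 coprocessor threads per task.
print(coprocessor_threads_per_task(60, 4, 24))   # -> 10
```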
<p>The USER-INTEL package has two modes for deciding which atoms will be
handled by the coprocessor. This choice is controlled with the <em>ghost</em>
keyword of the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the coprocessor when the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a>
setting is “on”. The default is dependent on the style being used;
however, better performance may be achieved by setting this option
explicitly.</p>
<p>When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is because additional threads are generated
internally to handle the asynchronous offload tasks.</p>
<p>If pair computations are being offloaded to an Intel Xeon Phi
coprocessor, a diagnostic line is printed to the screen (not to the
log file) during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><span class="doc">atom sorting</span></a> is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available coprocessor threads on each Phi will be divided among MPI
tasks, unless the <em>tptask</em> option of the “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> is used to limit the coprocessor
threads per MPI task.</p>
<div class="section" id="restrictions">
<h2>Restrictions</h2>
<p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> styles
that require skip lists for neighbor builds cannot be offloaded.
Using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid/overlay</span></a> is allowed. Only one intel
accelerated style may be used with hybrid styles.
<a class="reference internal" href="special_bonds.html"><span class="doc">Special_bonds</span></a> exclusion lists are not currently
supported with offload; however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the USER-INTEL package currently support the
“inner”, “middle”, “outer” options for rRESPA integration via the
<a class="reference internal" href="run_style.html"><span class="doc">run_style respa</span></a> command; only the “pair” option is
supported.</p>
<p><strong>References:</strong></p>
<ul class="simple">
<li>Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakkar, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann.</li>
<li>Brown, W.M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press.</li>
<li>Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101.</li>
</ul>
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
© Copyright 2013 Sandia Corporation.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'./',
VERSION:'',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
<script type="text/javascript" src="_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
    SphinxRtdTheme.StickyNav.enable();
});
</script>
</body>
</html>