forked from lijiext/lammps
462 lines
33 KiB
HTML
462 lines
33 KiB
HTML
|
|
|
|
<!DOCTYPE html>
|
|
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
|
|
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
|
|
<head>
|
|
<meta charset="utf-8">
|
|
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
|
|
<title>5.USER-INTEL package — LAMMPS documentation</title>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
|
|
|
|
|
|
|
|
<link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
|
|
|
|
|
|
|
|
<link rel="top" title="LAMMPS documentation" href="index.html"/>
|
|
|
|
|
|
<script src="_static/js/modernizr.min.js"></script>
|
|
|
|
</head>
|
|
|
|
<body class="wy-body-for-nav" role="document">
|
|
|
|
<div class="wy-grid-for-nav">
|
|
|
|
|
|
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
|
<div class="wy-side-nav-search">
|
|
|
|
|
|
|
|
<a href="Manual.html" class="icon icon-home"> LAMMPS
|
|
|
|
|
|
|
|
</a>
|
|
|
|
|
|
<div role="search">
|
|
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
|
|
<input type="text" name="q" placeholder="Search docs" />
|
|
<input type="hidden" name="check_keywords" value="yes" />
|
|
<input type="hidden" name="area" value="default" />
|
|
</form>
|
|
</div>
|
|
|
|
|
|
</div>
|
|
|
|
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
|
|
|
|
|
|
|
|
<ul>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
|
|
<li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
|
|
</ul>
|
|
|
|
|
|
|
|
</div>
|
|
|
|
</nav>
|
|
|
|
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
|
|
|
|
|
|
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
|
|
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
|
<a href="Manual.html">LAMMPS</a>
|
|
</nav>
|
|
|
|
|
|
|
|
<div class="wy-nav-content">
|
|
<div class="rst-content">
|
|
<div role="navigation" aria-label="breadcrumbs navigation">
|
|
<ul class="wy-breadcrumbs">
|
|
<li><a href="Manual.html">Docs</a> »</li>
|
|
|
|
<li>5.USER-INTEL package</li>
|
|
<li class="wy-breadcrumbs-aside">
|
|
|
|
|
|
<a href="http://lammps.sandia.gov">Website</a>
|
|
<a href="Section_commands.html#comm">Commands</a>
|
|
|
|
</li>
|
|
</ul>
|
|
<hr/>
|
|
|
|
</div>
|
|
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
|
<div itemprop="articleBody">
|
|
|
|
<p><a class="reference internal" href="Section_accelerate.html"><span class="doc">Return to Section accelerate overview</span></a></p>
|
|
<div class="section" id="user-intel-package">
|
|
<h1>5.USER-INTEL package</h1>
|
|
<p>The USER-INTEL package was developed by Mike Brown at Intel
|
|
Corporation. It provides two methods for accelerating simulations,
|
|
depending on the hardware you have. The first is acceleration on
|
|
Intel(R) CPUs by running in single, mixed, or double precision with
|
|
vectorization. The second is acceleration on Intel(R) Xeon Phi(TM)
|
|
coprocessors via offloading neighbor list and non-bonded force
|
|
calculations to the Phi. The same C++ code is used in both cases.
|
|
When offloading to a coprocessor from a CPU, the same routine is run
|
|
twice, once on the CPU and once with an offload flag.</p>
|
|
<p>Note that the USER-INTEL package supports use of the Phi in “offload”
|
|
mode, not “native” mode like the <a class="reference internal" href="accelerate_kokkos.html"><span class="doc">KOKKOS package</span></a>.</p>
|
|
<p>Also note that the USER-INTEL package can be used in tandem with the
|
|
<a class="reference internal" href="accelerate_omp.html"><span class="doc">USER-OMP package</span></a>. This is useful when
|
|
offloading pair style computations to the Phi, so that other styles
|
|
not supported by the USER-INTEL package, e.g. bond, angle, dihedral,
|
|
improper, and long-range electrostatics, can run simultaneously in
|
|
threaded mode on the CPU cores. Since less MPI tasks than CPU cores
|
|
will typically be invoked when running with coprocessors, this enables
|
|
the extra CPU cores to be used for useful computation.</p>
|
|
<p>As illustrated below, if LAMMPS is built with both the USER-INTEL and
|
|
USER-OMP packages, this dual mode of operation is made easier to use,
|
|
via the “-suffix hybrid intel omp” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> or the <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> command. Both set a second-choice suffix to “omp” so
|
|
that styles from the USER-INTEL package will be used if available,
|
|
with styles from the USER-OMP package as a second choice.</p>
|
|
<p>Here is a quick overview of how to use the USER-INTEL package for CPU
|
|
acceleration, assuming one or more 16-core nodes. More details
|
|
follow.</p>
|
|
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">use</span> <span class="n">an</span> <span class="n">Intel</span> <span class="n">compiler</span>
|
|
<span class="n">use</span> <span class="n">these</span> <span class="n">CCFLAGS</span> <span class="n">settings</span> <span class="ow">in</span> <span class="n">Makefile</span><span class="o">.</span><span class="n">machine</span><span class="p">:</span> <span class="o">-</span><span class="n">fopenmp</span><span class="p">,</span> <span class="o">-</span><span class="n">DLAMMPS_MEMALIGN</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span> <span class="o">-</span><span class="n">restrict</span><span class="p">,</span> <span class="o">-</span><span class="n">xHost</span><span class="p">,</span> <span class="o">-</span><span class="n">fno</span><span class="o">-</span><span class="n">alias</span><span class="p">,</span> <span class="o">-</span><span class="n">ansi</span><span class="o">-</span><span class="n">alias</span><span class="p">,</span> <span class="o">-</span><span class="n">override</span><span class="o">-</span><span class="n">limits</span>
|
|
<span class="n">use</span> <span class="n">these</span> <span class="n">LINKFLAGS</span> <span class="n">settings</span> <span class="ow">in</span> <span class="n">Makefile</span><span class="o">.</span><span class="n">machine</span><span class="p">:</span> <span class="o">-</span><span class="n">fopenmp</span><span class="p">,</span> <span class="o">-</span><span class="n">xHost</span>
|
|
<span class="n">make</span> <span class="n">yes</span><span class="o">-</span><span class="n">user</span><span class="o">-</span><span class="n">intel</span> <span class="n">yes</span><span class="o">-</span><span class="n">user</span><span class="o">-</span><span class="n">omp</span> <span class="c1"># including user-omp is optional</span>
|
|
<span class="n">make</span> <span class="n">mpi</span> <span class="c1"># build with the USER-INTEL package, if settings (including compiler) added to Makefile.mpi</span>
|
|
<span class="n">make</span> <span class="n">intel_cpu</span> <span class="c1"># or Makefile.intel_cpu already has settings, uses Intel MPI wrapper</span>
|
|
<span class="n">Make</span><span class="o">.</span><span class="n">py</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">p</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">intel</span> <span class="n">cpu</span> <span class="o">-</span><span class="n">a</span> <span class="n">file</span> <span class="n">mpich_icc</span> <span class="c1"># or one-line build via Make.py for MPICH</span>
|
|
<span class="n">Make</span><span class="o">.</span><span class="n">py</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">p</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">intel</span> <span class="n">cpu</span> <span class="o">-</span><span class="n">a</span> <span class="n">file</span> <span class="n">ompi_icc</span> <span class="c1"># or for OpenMPI</span>
|
|
<span class="n">Make</span><span class="o">.</span><span class="n">py</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">p</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">intel</span> <span class="n">cpu</span> <span class="o">-</span><span class="n">a</span> <span class="n">file</span> <span class="n">intel_cpu</span> <span class="c1"># or for Intel MPI wrapper</span>
|
|
</pre></div>
|
|
</div>
|
|
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">0</span> <span class="n">omp</span> <span class="mi">16</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 1 node, 1 MPI task/node, 16 threads/task, no USER-OMP</span>
|
|
<span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">32</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 2 nodess, 16 MPI tasks/node, no threads, no USER-OMP</span>
|
|
<span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">hybrid</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">0</span> <span class="n">omp</span> <span class="mi">16</span> <span class="o">-</span><span class="n">pk</span> <span class="n">omp</span> <span class="mi">16</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 1 node, 1 MPI task/node, 16 threads/task, with USER-OMP</span>
|
|
<span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">32</span> <span class="o">-</span><span class="n">ppn</span> <span class="mi">4</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">hybrid</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">pk</span> <span class="n">omp</span> <span class="mi">4</span> <span class="o">-</span><span class="n">pk</span> <span class="n">omp</span> <span class="mi">4</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 8 nodes, 4 MPI tasks/node, 4 threads/task, with USER-OMP</span>
|
|
</pre></div>
|
|
</div>
|
|
<p>Here is a quick overview of how to use the USER-INTEL package for the
|
|
same CPUs as above (16 cores/node), with an additional Xeon Phi(TM)
|
|
coprocessor per node. More details follow.</p>
|
|
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">Same</span> <span class="k">as</span> <span class="n">above</span> <span class="k">for</span> <span class="n">building</span><span class="p">,</span> <span class="k">with</span> <span class="n">these</span> <span class="n">additions</span><span class="o">/</span><span class="n">changes</span><span class="p">:</span>
|
|
<span class="n">add</span> <span class="n">the</span> <span class="n">flag</span> <span class="o">-</span><span class="n">DLMP_INTEL_OFFLOAD</span> <span class="n">to</span> <span class="n">CCFLAGS</span> <span class="ow">in</span> <span class="n">Makefile</span><span class="o">.</span><span class="n">machine</span>
|
|
<span class="n">add</span> <span class="n">the</span> <span class="n">flag</span> <span class="o">-</span><span class="n">offload</span> <span class="n">to</span> <span class="n">LINKFLAGS</span> <span class="ow">in</span> <span class="n">Makefile</span><span class="o">.</span><span class="n">machine</span>
|
|
<span class="k">for</span> <span class="n">Make</span><span class="o">.</span><span class="n">py</span> <span class="n">change</span> <span class="s2">"-intel cpu"</span> <span class="n">to</span> <span class="s2">"-intel phi"</span><span class="p">,</span> <span class="ow">and</span> <span class="s2">"file intel_cpu"</span> <span class="n">to</span> <span class="s2">"file intel_phi"</span>
|
|
</pre></div>
|
|
</div>
|
|
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">32</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">1</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 2 nodes, 16 MPI tasks/node, 240 total threads on coprocessor, no USER-OMP</span>
|
|
<span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">16</span> <span class="o">-</span><span class="n">ppn</span> <span class="mi">8</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">1</span> <span class="n">omp</span> <span class="mi">2</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 2 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, no USER-OMP</span>
|
|
<span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">32</span> <span class="o">-</span><span class="n">ppn</span> <span class="mi">8</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">hybrid</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">1</span> <span class="n">omp</span> <span class="mi">2</span> <span class="o">-</span><span class="n">pk</span> <span class="n">omp</span> <span class="mi">2</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 4 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, with USER-OMP</span>
|
|
</pre></div>
|
|
</div>
|
|
<p><strong>Required hardware/software:</strong></p>
|
|
<p>Your compiler must support the OpenMP interface. Use of an Intel(R)
|
|
C++ compiler is recommended, but not required. However, g++ will not
|
|
recognize some of the settings listed above, so they cannot be used.
|
|
Optimizations for vectorization have only been tested with the
|
|
Intel(R) compiler. Use of other compilers may not result in
|
|
vectorization, or give poor performance.</p>
|
|
<p>The recommended version of the Intel(R) compiler is 14.0.1.106.
|
|
Versions 15.0.1.133 and later are also supported. If using Intel(R)
|
|
MPI, versions 15.0.2.044 and later are recommended.</p>
|
|
<p>To use the offload option, you must have one or more Intel(R) Xeon
|
|
Phi(TM) coprocessors and use an Intel(R) C++ compiler.</p>
|
|
<p><strong>Building LAMMPS with the USER-INTEL package:</strong></p>
|
|
<p>The lines above illustrate how to include/build with the USER-INTEL
|
|
package, for either CPU or Phi support, in two steps, using the “make”
|
|
command. Or how to do it with one command via the src/Make.py script,
|
|
described in <a class="reference internal" href="Section_start.html#start-4"><span class="std std-ref">Section 2.4</span></a> of the manual.
|
|
Type “Make.py -h” for help. Because the mechanism for specifing what
|
|
compiler to use (Intel in this case) is different for different MPI
|
|
wrappers, 3 versions of the Make.py command are shown.</p>
|
|
<p>Note that if you build with support for a Phi coprocessor, the same
|
|
binary can be used on nodes with or without coprocessors installed.
|
|
However, if you do not have coprocessors on your system, building
|
|
without offload support will produce a smaller binary.</p>
|
|
<p>If you also build with the USER-OMP package, you can use styles from
|
|
both packages, as described below.</p>
|
|
<p>Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
|
|
include “-fopenmp”. Likewise, if you use an Intel compiler, the
|
|
CCFLAGS setting must include “-restrict”. For Phi support, the
|
|
“-DLMP_INTEL_OFFLOAD” (CCFLAGS) and “-offload” (LINKFLAGS) settings
|
|
are required. The other settings listed above are optional, but will
|
|
typically improve performance. The Make.py command will add all of
|
|
these automatically.</p>
|
|
<p>If you are compiling on the same architecture that will be used for
|
|
the runs, adding the flag <em>-xHost</em> to CCFLAGS enables vectorization
|
|
with the Intel(R) compiler. Otherwise, you must provide the correct
|
|
compute node architecture to the -x option (e.g. -xAVX).</p>
|
|
<p>Example machines makefiles Makefile.intel_cpu and Makefile.intel_phi
|
|
are included in the src/MAKE/OPTIONS directory with settings that
|
|
perform well with the Intel(R) compiler. The latter has support for
|
|
offload to Phi coprocessors; the former does not.</p>
|
|
<p><strong>Run with the USER-INTEL package from the command line:</strong></p>
|
|
<p>The mpirun or mpiexec command sets the total number of MPI tasks used
|
|
by LAMMPS (one or multiple per compute node) and the number of MPI
|
|
tasks used per node. E.g. the mpirun command in MPICH does this via
|
|
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.</p>
|
|
<p>If you compute (any portion of) pairwise interactions using USER-INTEL
|
|
pair styles on the CPU, or use USER-OMP styles on the CPU, you need to
|
|
choose how many OpenMP threads per MPI task to use. If both packages
|
|
are used, it must be done for both packages, and the same thread count
|
|
value should be used for both. Note that the product of MPI tasks *
|
|
threads/task should not exceed the physical number of cores (on a
|
|
node), otherwise performance will suffer.</p>
|
|
<p>When using the USER-INTEL package for the Phi, you also need to
|
|
specify the number of coprocessor/node and optionally the number of
|
|
coprocessor threads per MPI task to use. Note that coprocessor
|
|
threads (which run on the coprocessor) are totally independent from
|
|
OpenMP threads (which run on the CPU). The default values for the
|
|
settings that affect coprocessor threads are typically fine, as
|
|
discussed below.</p>
|
|
<p>As in the lines above, use the “-sf intel” or “-sf hybrid intel omp”
|
|
<a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a>, which will
|
|
automatically append “intel” to styles that support it. In the second
|
|
case, “omp” will be appended if an “intel” style does not exist.</p>
|
|
<p>Note that if either switch is used, it also invokes a default command:
|
|
<a class="reference internal" href="package.html"><span class="doc">package intel 1</span></a>. If the “-sf hybrid intel omp” switch
|
|
is used, the default USER-OMP command <a class="reference internal" href="package.html"><span class="doc">package omp 0</span></a> is
|
|
also invoked (if LAMMPS was built with USER-OMP). Both set the number
|
|
of OpenMP threads per MPI task via the OMP_NUM_THREADS environment
|
|
variable. The first command sets the number of Xeon Phi(TM)
|
|
coprocessors/node to 1 (ignored if USER-INTEL is built for CPU-only),
|
|
and the precision mode to “mixed” (default value).</p>
|
|
<p>You can also use the “-pk intel Nphi” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> to explicitly set Nphi = # of Xeon
|
|
Phi(TM) coprocessors/node, as well as additional options. Nphi should
|
|
be >= 1 if LAMMPS was built with coprocessor support, otherswise Nphi
|
|
= 0 for a CPU-only build. All the available coprocessor threads on
|
|
each Phi will be divided among MPI tasks, unless the <em>tptask</em> option
|
|
of the “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> is
|
|
used to limit the coprocessor threads per MPI task. See the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command for details, including the default values
|
|
used for all its options if not specified, and how to set the number
|
|
of OpenMP threads via the OMP_NUM_THREADS environment variable if
|
|
desired.</p>
|
|
<p>If LAMMPS was built with the USER-OMP package, you can also use the
|
|
“-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> to
|
|
explicitly set Nt = # of OpenMP threads per MPI task to use, as well
|
|
as additional options. Nt should be the same threads per MPI task as
|
|
set for the USER-INTEL package, e.g. via the “-pk intel Nphi omp Nt”
|
|
command. Again, see the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command for
|
|
details, including the default values used for all its options if not
|
|
specified, and how to set the number of OpenMP threads via the
|
|
OMP_NUM_THREADS environment variable if desired.</p>
|
|
<p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p>
|
|
<p>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
|
OpenMP threads per MPI task, and coprocessor threads per MPI task is
|
|
the same.</p>
|
|
<p>Use the <a class="reference internal" href="suffix.html"><span class="doc">suffix intel</span></a> or <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> commands, or you can explicitly add an “intel” or
|
|
“omp” suffix to individual styles in your input script, e.g.</p>
|
|
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">pair_style</span> <span class="n">lj</span><span class="o">/</span><span class="n">cut</span><span class="o">/</span><span class="n">intel</span> <span class="mf">2.5</span>
|
|
</pre></div>
|
|
</div>
|
|
<p>You must also use the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command, unless the
|
|
“-sf intel” or “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switches</span></a> were used. It specifies how many
|
|
coprocessors/node to use, as well as other OpenMP threading and
|
|
coprocessor options. The <a class="reference internal" href="package.html"><span class="doc">package</span></a> doc page explains how
|
|
to set the number of OpenMP threads via an environment variable if
|
|
desired.</p>
|
|
<p>If LAMMPS was also built with the USER-OMP package, you must also use
|
|
the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command to enable that package, unless
|
|
the “-sf hybrid intel omp” or “-pk omp” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switches</span></a> were used. It specifies how many
|
|
OpenMP threads per MPI task to use (should be same as the setting for
|
|
the USER-INTEL package), as well as other options. Its doc page
|
|
explains how to set the number of OpenMP threads via an environment
|
|
variable if desired.</p>
|
|
<p><strong>Speed-ups to expect:</strong></p>
|
|
<p>If LAMMPS was not built with coprocessor support (CPU only) when
|
|
including the USER-INTEL package, then acclerated styles will run on
|
|
the CPU using vectorization optimizations and the specified precision.
|
|
This may give a substantial speed-up for a pair style, particularly if
|
|
mixed or single precision is used.</p>
|
|
<p>If LAMMPS was built with coproccesor support, the pair styles will run
|
|
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
|
|
performance of a Xeon Phi versus a multi-core CPU is a function of
|
|
your hardware, which pair style is used, the number of
|
|
atoms/coprocessor, and the precision used on the coprocessor (double,
|
|
single, mixed).</p>
|
|
<p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
|
|
LAMMPS web site for performance of the USER-INTEL package on different
|
|
hardware.</p>
|
|
<div class="admonition note">
|
|
<p class="first admonition-title">Note</p>
|
|
<p class="last">Setting core affinity is often used to pin MPI tasks and OpenMP
|
|
threads to a core or group of cores so that memory access can be
|
|
uniform. Unless disabled at build time, affinity for MPI tasks and
|
|
OpenMP threads on the host (CPU) will be set by default on the host
|
|
when using offload to a coprocessor. In this case, it is unnecessary
|
|
to use other methods to control affinity (e.g. taskset, numactl,
|
|
I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with
|
|
the <em>no_affinity</em> option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command
|
|
or by disabling the option at build time (by adding
|
|
-DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
|
|
Disabling this option is not recommended, especially when running on a
|
|
machine with hyperthreading disabled.</p>
|
|
</div>
|
|
<p><strong>Guidelines for best performance on an Intel(R) Xeon Phi(TM)
|
|
coprocessor:</strong></p>
|
|
<ul class="simple">
|
|
<li>The default for the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command is to have
|
|
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
|
|
coprocessor. In general, running with a large number of MPI tasks on
|
|
each node will perform best with offload. Each MPI task will
|
|
automatically get affinity to a subset of the hardware threads
|
|
available on the coprocessor. For example, if your card has 61 cores,
|
|
with 60 cores available for offload and 4 hardware threads per core
|
|
(240 total threads), running with 24 MPI tasks per node will cause
|
|
each MPI task to use a subset of 10 threads on the coprocessor. Fine
|
|
tuning of the number of threads to use per MPI task or the number of
|
|
threads to use per core can be accomplished with keyword settings of
|
|
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command.</li>
|
|
<li>If desired, only a fraction of the pair style computation can be
|
|
offloaded to the coprocessors. This is accomplished by using the
|
|
<em>balance</em> keyword in the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. A
|
|
balance of 0 runs all calculations on the CPU. A balance of 1 runs
|
|
all calculations on the coprocessor. A balance of 0.5 runs half of
|
|
the calculations on the coprocessor. Setting the balance to -1 (the
|
|
default) will enable dynamic load balancing that continously adjusts
|
|
the fraction of offloaded work throughout the simulation. This option
|
|
typically produces results within 5 to 10 percent of the optimal fixed
|
|
balance.</li>
|
|
<li>When using offload with CPU hyperthreading disabled, it may help
|
|
performance to use fewer MPI tasks and OpenMP threads than available
|
|
cores. This is due to the fact that additional threads are generated
|
|
internally to handle the asynchronous offload tasks.</li>
|
|
<li>If running short benchmark runs with dynamic load balancing, adding a
|
|
short warm-up run (10-20 steps) will allow the load-balancer to find a
|
|
near-optimal setting that will carry over to additional runs.</li>
|
|
<li>If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
|
|
coprocessor, a diagnostic line is printed to the screen (not to the
|
|
log file), during the setup phase of a run, indicating that offload
|
|
mode is being used and indicating the number of coprocessor threads
|
|
per MPI task. Additionally, an offload timing summary is printed at
|
|
the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><span class="doc">atom sorting</span></a> is changed to 1 so that the per-atom data is
|
|
effectively sorted at every rebuild of the neighbor lists.</li>
|
|
<li>For simulations with long-range electrostatics or bond, angle,
|
|
dihedral, improper calculations, computation and data transfer to the
|
|
coprocessor will run concurrently with computations and MPI
|
|
communications for these calculations on the host CPU. The USER-INTEL
|
|
package has two modes for deciding which atoms will be handled by the
|
|
coprocessor. This choice is controlled with the <em>ghost</em> keyword of
|
|
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. When set to 0, ghost atoms
|
|
(atoms at the borders between MPI tasks) are not offloaded to the
|
|
card. This allows for overlap of MPI communication of forces with
|
|
computation on the coprocessor when the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a> setting
|
|
is “on”. The default is dependent on the style being used, however,
|
|
better performance may be achieved by setting this option
|
|
explictly.</li>
|
|
</ul>
|
|
<div class="section" id="restrictions">
|
|
<h2>Restrictions</h2>
|
|
<p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> styles
|
|
that require skip lists for neighbor builds cannot be offloaded.
|
|
Using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid/overlay</span></a> is allowed. Only one intel
|
|
accelerated style may be used with hybrid styles.
|
|
<a class="reference internal" href="special_bonds.html"><span class="doc">Special_bonds</span></a> exclusion lists are not currently
|
|
supported with offload, however, the same effect can often be
|
|
accomplished by setting cutoffs for excluded atom types to 0. None of
|
|
the pair styles in the USER-INTEL package currently support the
|
|
“inner”, “middle”, “outer” options for rRESPA integration via the
|
|
<a class="reference internal" href="run_style.html"><span class="doc">run_style respa</span></a> command; only the “pair” option is
|
|
supported.</p>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
</div>
|
|
</div>
|
|
<footer>
|
|
|
|
|
|
<hr/>
|
|
|
|
<div role="contentinfo">
|
|
<p>
|
|
© Copyright 2013 Sandia Corporation.
|
|
</p>
|
|
</div>
|
|
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
|
|
|
|
</footer>
|
|
|
|
</div>
|
|
</div>
|
|
|
|
</section>
|
|
|
|
</div>
|
|
|
|
|
|
|
|
|
|
|
|
<script type="text/javascript">
|
|
var DOCUMENTATION_OPTIONS = {
|
|
URL_ROOT:'./',
|
|
VERSION:'',
|
|
COLLAPSE_INDEX:false,
|
|
FILE_SUFFIX:'.html',
|
|
HAS_SOURCE: true
|
|
};
|
|
</script>
|
|
<script type="text/javascript" src="_static/jquery.js"></script>
|
|
<script type="text/javascript" src="_static/underscore.js"></script>
|
|
<script type="text/javascript" src="_static/doctools.js"></script>
|
|
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
|
|
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
|
|
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
|
|
<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
|
|
|
|
|
|
|
|
|
|
|
|
<script type="text/javascript" src="_static/js/theme.js"></script>
|
|
|
|
|
|
|
|
|
|
<script type="text/javascript">
|
|
jQuery(function () {
|
|
SphinxRtdTheme.StickyNav.enable();
|
|
});
|
|
</script>
|
|
|
|
|
|
</body>
|
|
</html> |