"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
|
|
|
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
|
|
|
|
|
|
|
:link(lws,http://lammps.sandia.gov)
|
|
|
|
:link(ld,Manual.html)
|
|
|
|
:link(lc,Section_commands.html#comm)
|
|
|
|
|
|
|
|
:line
|
|
|
|
|
|
|
|
"Return to Section accelerate overview"_Section_accelerate.html
|
|
|
|
|
|
|
|
5.3.3 USER-INTEL package :h4
|
|
|
|
|
|
|
|
The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)
Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
Additionally, it supports running simulations in single, mixed, or
double precision with vectorization, even if a coprocessor is not
present, i.e. on an Intel(R) CPU. The same C++ code is used for both
cases. When offloading to a coprocessor, the routine is run twice,
once with an offload flag.

The USER-INTEL package can be used in tandem with the USER-OMP
package. This is useful when offloading pair style computations to
coprocessors, so that other styles not supported by the USER-INTEL
package, e.g. bond, angle, dihedral, improper, and long-range
electrostatics, can be run simultaneously in threaded mode on CPU
cores. Since fewer MPI tasks than CPU cores will typically be invoked
when running with coprocessors, this enables the extra cores to be
utilized for useful computation.

If LAMMPS is built with both the USER-INTEL and USER-OMP packages
installed, this mode of operation is made easier to use, because the
"-suffix intel" "command-line switch"_Section_start.html#start_7 or
the "suffix intel"_suffix.html command will both set a second-choice
suffix to "omp", so that styles from the USER-OMP package will be used
if available, after first testing if a style from the USER-INTEL
package is available.

Here is a quick overview of how to use the USER-INTEL package
for CPU acceleration:

specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
specify -fopenmp with LINKFLAGS in your Makefile.machine (see the sketch below this list)
include the USER-INTEL package and (optionally) the USER-OMP package and build LAMMPS
if using the USER-OMP package, specify how many threads per MPI task to use
use USER-INTEL styles in your input script :ul

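As a concrete illustration, the Makefile edits from the first two
steps might look like this in a src/MAKE/Makefile.machine built around
the Intel(R) compiler. This is only a sketch: the mpiicpc wrapper and
the -O2 optimization level are assumptions, not requirements.

CC =        mpiicpc                # assumed Intel(R) MPI C++ wrapper; yours may differ
CCFLAGS =   -O2 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost
LINK =      mpiicpc
LINKFLAGS = -O2 -fopenmp :pre
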
Using the USER-INTEL package to offload work to the Intel(R)
Xeon Phi(TM) coprocessor is the same except for these additional
steps:

add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
add the flag -offload to LINKFLAGS in your Makefile.machine (see the sketch below this list)
specify how many threads per coprocessor to use :ul

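Continuing the earlier Makefile sketch, an offload-enabled build would
add the two flags from this list (again only a sketch; the other
settings are unchanged):

CCFLAGS =   -O2 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
LINKFLAGS = -O2 -fopenmp -offload :pre
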
The latter two steps in the first case, and the last step in the
coprocessor case, can be done using the "-pk omp", "-sf intel", and
"-pk intel" "command-line switches"_Section_start.html#start_7
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package omp"_package.html, "suffix
intel"_suffix.html, or "package intel"_package.html commands
respectively to your input script.

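For example, input-script lines like these duplicate the effect of the
switches (a sketch; the thread and coprocessor counts are illustrative
values, not defaults):

package omp 4
suffix intel
package intel 1 :pre
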
[Required hardware/software:]

To use the offload option, you must have one or more Intel(R) Xeon
Phi(TM) coprocessors.

Optimizations for vectorization have only been tested with the
Intel(R) compiler. Use of other compilers may not result in
vectorization, or may give poor performance.

Use of an Intel C++ compiler is recommended, but not required. The
compiler must support the OpenMP interface.

[Building LAMMPS with the USER-INTEL package:]

Include the package(s) and build LAMMPS:

cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine :pre

If the USER-OMP package is also installed, you can use styles from
both packages, as described below.

The low-level src/MAKE/Makefile.machine needs a flag for OpenMP support
in both the CCFLAGS and LINKFLAGS variables, which is {-openmp} for
Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
-restrict to CCFLAGS.

If you are compiling on the same architecture that will be used for
the runs, adding the flag {-xHost} to CCFLAGS will enable
vectorization with the Intel(R) compiler.

In order to build with support for an Intel(R) coprocessor, the flag
{-offload} should be added to the LINKFLAGS line and the flag
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.

Note that the machine makefiles Makefile.intel and
Makefile.intel_offload are included in the src/MAKE directory with
options that perform well with the Intel(R) compiler. The latter file
has support for offload to coprocessors; the former does not.

If using an Intel compiler, it is recommended that Intel(R) Compiler
2013 SP1 update 1 be used. Newer versions have some performance
issues that are being addressed. If using Intel(R) MPI, version 5 or
higher is recommended.

[Running with the USER-INTEL package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.

If LAMMPS was also built with the USER-OMP package, you need to choose
how many OpenMP threads per MPI task will be used by the USER-OMP
package. Note that the product of MPI tasks * OpenMP threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.

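As a sketch of this rule on a hypothetical 16-core node, either
invocation below fills the node without oversubscribing it. Whether
the OMP_NUM_THREADS setting reaches all ranks this way depends on how
your mpirun propagates environment variables:

env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf intel -in in.script   # 4 tasks x 4 threads = 16
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf intel -in in.script   # 8 tasks x 2 threads = 16 :pre
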
If LAMMPS was built with coprocessor support for the USER-INTEL
package, you need to specify the number of coprocessors/node and the
number of threads to use on the coprocessor per MPI task. Note that
coprocessor threads (which run on the coprocessor) are totally
independent from OpenMP threads (which run on the CPU). The product
of MPI tasks * coprocessor threads/task should not exceed the maximum
number of threads the coprocessor is designed to run, otherwise
performance will suffer. This value is 240 for current generation
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
threads/core value can be set to a smaller value if desired by an
option on the "package intel"_package.html command, in which case the
maximum number of threads is also reduced.

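As a quick worked example of that budget: with one 240-thread
coprocessor shared by 16 MPI tasks on a node, each task should use at
most 240 / 16 = 15 coprocessor threads; with 24 tasks, at most 10.
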
Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
|
|
|
|
which will automatically append "intel" to styles that support it. If
|
|
|
|
a style does not support it, a "omp" suffix is tried next. Use the
|
|
|
|
"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
|
|
|
|
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
|
|
|
|
the USER-OMP package. Use the "-pk intel Nphi" "command-line
|
|
|
|
switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
|
|
|
|
coprocessors/node, if LAMMPS was built with coprocessor support.
|
|
|
|
|
|
|
|
CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script                 # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script   # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre

CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk intel 16 0 -in in.script                # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script    # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script   # ditto on 8 16-core nodes :pre

CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
lmp_machine -sf intel -pk intel 16 1 -in in.script                            # 1 MPI task, 240 threads on 1 coprocessor
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script      # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
                                                                              # each MPI task uses 60 threads on 1 coprocessor
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script  # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
                                                                                    # each MPI task uses 120 threads on one of 2 coprocessors :pre

Note that if the "-sf intel" switch is used, it also issues two
default commands: "package omp 0"_package.html and "package intel
1"_package.html. These set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable, and the number
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
LAMMPS was not built with the USER-OMP package. The latter is ignored
if LAMMPS was not built with coprocessor support, except for its
optional precision setting.

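Given those defaults, the two invocations below should behave the same
(a sketch; both assume a build with USER-OMP and coprocessor support):

lmp_machine -sf intel -in in.script
lmp_machine -sf intel -pk omp 0 -pk intel 1 -in in.script :pre
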
Using the "-pk omp" switch explicitly allows for direct setting of the
|
|
|
|
number of OpenMP threads per MPI task, and additional options. Using
|
|
|
|
the "-pk intel" switch explicitly allows for direct setting of the
|
|
|
|
number of coprocessors/node, and additional options. The syntax for
|
|
|
|
these two switches is the same as the "package omp"_package.html and
|
|
|
|
"package intel"_package.html commands. See the "package"_package.html
|
|
|
|
command doc page for details, including the default values used for
|
|
|
|
all its options if these switches are not specified, and how to set
|
|
|
|
the number of OpenMP threads via the OMP_NUM_THREADS environment
|
|
|
|
variable if desired.
|
|
|
|
|
|
|
|
[Or run with the USER-INTEL package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node,
OpenMP threads per MPI task, and coprocessor threads per MPI task is
the same.

Use the "suffix intel"_suffix.html command, or you can explicitly add an
|
|
|
|
"intel" suffix to individual styles in your input script, e.g.
|
|
|
|
|
|
|
|
pair_style lj/cut/intel 2.5 :pre
|
|
|
|
|
|
|
|
You must also use the "package omp"_package.html command to enable the
|
|
|
|
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
|
|
|
|
intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
|
|
|
|
were used. It specifies how many OpenMP threads per MPI task to use,
|
|
|
|
as well as other options. Its doc page explains how to set the number
|
|
|
|
of threads via an environment variable if desired.
|
|
|
|
|
|
|
|
You must also use the "package intel"_package.html command to enable
|
|
|
|
coprocessor support within the USER-INTEL package (assuming LAMMPS was
|
|
|
|
built with coprocessor support) unless the "-sf intel" or "-pk intel"
|
|
|
|
"command-line switches"_Section_start.html#start_7 were used. It
|
|
|
|
specifies how many coprocessors/node to use, as well as other
|
|
|
|
coprocessor options.
|
|
|
|
|
|
|
|
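Putting those pieces together, the top of an input script might look
like this (a sketch with illustrative values; "package omp 4" assumes
a USER-OMP build and "package intel 1" assumes coprocessor support):

package omp 4
package intel 1
suffix intel
pair_style lj/cut 2.5 :pre

With the "suffix intel"_suffix.html command in place, the lj/cut line
is resolved to the lj/cut/intel style shown earlier.
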
[Speed-ups to expect:]

If LAMMPS was not built with coprocessor support when including the
USER-INTEL package, then accelerated styles will run on the CPU using
vectorization optimizations and the specified precision. This may
give a substantial speed-up for a pair style, particularly if mixed or
single precision is used.

If LAMMPS was built with coprocessor support, the pair styles will run
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
performance of a Xeon Phi versus a multi-core CPU is a function of
your hardware, which pair style is used, the number of
atoms/coprocessor, and the precision used on the coprocessor (double,
single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
|
|
|
LAMMPS web site for performance of the USER-INTEL package on different
|
|
|
|
hardware.
|
|
|
|
|
|
|
|
[Guidelines for best performance on an Intel(R) Xeon Phi(TM)
coprocessor:]

The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the "package intel"_package.html command (see the sketch below this
list). :ulb,l

If desired, only a fraction of the pair style computation can be
offloaded to the coprocessors. This is accomplished by using the
{balance} keyword in the "package intel"_package.html command. A
balance of 0 runs all calculations on the CPU. A balance of 1 runs
all calculations on the coprocessor. A balance of 0.5 runs half of
the calculations on the coprocessor. Setting the balance to -1 (the
default) will enable dynamic load balancing that continuously adjusts
the fraction of offloaded work throughout the simulation. This option
typically produces results within 5 to 10 percent of the optimal fixed
balance. :l

When using offload with CPU hyperthreading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks. :l

If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs. :l

If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
coprocessor, a diagnostic line is printed to the screen (not to the
log file), during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for "atom
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. :l

For simulations with long-range electrostatics or bond, angle,
dihedral, improper calculations, computation and data transfer to the
coprocessor will run concurrently with computations and MPI
communications for these calculations on the host CPU. The USER-INTEL
package has two modes for deciding which atoms will be handled by the
coprocessor. This choice is controlled with the {ghost} keyword of
the "package intel"_package.html command. When set to 0, ghost atoms
(atoms at the borders between MPI tasks) are not offloaded to the
card. This allows for overlap of MPI communication of forces with
computation on the coprocessor when the "newton"_newton.html setting
is "on". The default is dependent on the style being used, however,
better performance may be achieved by setting this option
explicitly. :l,ule

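As one sketch that combines these guidelines (illustrative values
only; the tptask keyword usage is modeled on the command-line examples
earlier in this section, and the exact positional arguments are
documented on the "package intel"_package.html doc page), the
following runs 24 MPI tasks on a node sharing one coprocessor, caps
each task at 10 coprocessor threads, and keeps the default dynamic
load balancing:

mpirun -np 24 -ppn 24 lmp_machine -sf intel -pk intel 1 1 tptask 10 -in in.script :pre
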
[Restrictions:]

When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
that require skip lists for neighbor builds cannot be offloaded.
Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel
accelerated style may be used with hybrid styles.
"Special_bonds"_special_bonds.html exclusion lists are not currently
supported with offload; however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command; only the "pair" option is
supported.