"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
|
|
|
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
|
|
|
|
|
|
|
:link(lws,http://lammps.sandia.gov)
|
|
|
|
:link(ld,Manual.html)
|
|
|
|
:link(lc,Section_commands.html#comm)
|
|
|
|
|
|
|
|
:line
|
|
|
|
|
|
|
|
"Return to Section accelerate overview"_Section_accelerate.html
|
|
|
|
|
|
|
|
5.3.3 USER-INTEL package :h4
|
|
|
|
|
|
|
|
The USER-INTEL package was developed by Mike Brown at Intel
Corporation. It provides a capability to accelerate simulations by
offloading neighbor list and non-bonded force calculations to Intel(R)
Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
Additionally, it supports running simulations in single, mixed, or
double precision with vectorization, even if a coprocessor is not
present, i.e. on an Intel(R) CPU. The same C++ code is used for both
cases. When offloading to a coprocessor, the routine is run twice,
once with an offload flag.

The USER-INTEL package can be used in tandem with the USER-OMP
package. This is useful when offloading pair style computations to
coprocessors, so that other styles not supported by the USER-INTEL
package, e.g. bond, angle, dihedral, improper, and long-range
electrostatics, can be run simultaneously in threaded mode on CPU
cores. Since fewer MPI tasks than CPU cores will typically be invoked
when running with coprocessors, this enables the extra cores to be
utilized for useful computation.

If LAMMPS is built with both the USER-INTEL and USER-OMP packages
installed, this mode of operation is made easier to use, because the
"-suffix intel" "command-line switch"_Section_start.html#start_7 or
the "suffix intel"_suffix.html command will both set a second-choice
suffix to "omp", so that styles from the USER-OMP package will be used
if available, after first testing if a style from the USER-INTEL
package is available.

Here is a quick overview of how to use the USER-INTEL package
for CPU acceleration:

specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
specify -fopenmp with LINKFLAGS in your Makefile.machine (see the sketch below this list)
include the USER-INTEL package and (optionally) the USER-OMP package and build LAMMPS
if using the USER-OMP package, specify how many threads per MPI task to use
use USER-INTEL styles in your input script :ul

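As a concrete illustration, the Makefile edits from the first two
steps might look like this in a src/MAKE/Makefile.machine built around
the Intel(R) compiler. This is only a sketch: the mpiicpc wrapper and
the -O2 optimization level are assumptions, not requirements.

CC =        mpiicpc                # assumed Intel(R) MPI C++ wrapper; yours may differ
CCFLAGS =   -O2 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost
LINK =      mpiicpc
LINKFLAGS = -O2 -fopenmp :pre
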
Using the USER-INTEL package to offload work to the Intel(R)
Xeon Phi(TM) coprocessor is the same except for these additional
steps:

add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
add the flag -offload to LINKFLAGS in your Makefile.machine (see the sketch below this list)
specify how many threads per coprocessor to use :ul

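Continuing the earlier Makefile sketch, an offload-enabled build would
add the two flags from this list (again only a sketch; the other
settings are unchanged):

CCFLAGS =   -O2 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
LINKFLAGS = -O2 -fopenmp -offload :pre
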
The latter two steps in the first case, and the last step in the
coprocessor case, can be done using the "-pk omp", "-sf intel", and
"-pk intel" "command-line switches"_Section_start.html#start_7
respectively. Or the effect of the "-pk" or "-sf" switches can be
duplicated by adding the "package omp"_package.html, "suffix
intel"_suffix.html, or "package intel"_package.html commands
respectively to your input script.

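For example, input-script lines like these duplicate the effect of the
switches (a sketch; the thread and coprocessor counts are illustrative
values, not defaults):

package omp 4
suffix intel
package intel 1 :pre
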
[Required hardware/software:]

To use the offload option, you must have one or more Intel(R) Xeon
Phi(TM) coprocessors.

Optimizations for vectorization have only been tested with the
Intel(R) compiler. Use of other compilers may not result in
vectorization, or may give poor performance.

Use of an Intel C++ compiler is recommended, but not required. The
compiler must support the OpenMP interface.

[Building LAMMPS with the USER-INTEL package:]

Include the package(s) and build LAMMPS:

cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine :pre

If the USER-OMP package is also installed, you can use styles from
both packages, as described below.

The low-level src/MAKE/Makefile.machine needs a flag for OpenMP support
in both the CCFLAGS and LINKFLAGS variables, which is {-openmp} for
Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
-restrict to CCFLAGS.

If you are compiling on the same architecture that will be used for
the runs, adding the flag {-xHost} to CCFLAGS will enable
vectorization with the Intel(R) compiler.

In order to build with support for an Intel(R) coprocessor, the flag
{-offload} should be added to the LINKFLAGS line and the flag
-DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.

Note that the machine makefiles Makefile.intel and
Makefile.intel_offload are included in the src/MAKE directory with
options that perform well with the Intel(R) compiler. The latter file
has support for offload to coprocessors; the former does not.

If using an Intel compiler, it is recommended that Intel(R) Compiler
2013 SP1 update 1 be used. Newer versions have some performance
issues that are being addressed. If using Intel(R) MPI, version 5 or
higher is recommended.

[Running with the USER-INTEL package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.

If LAMMPS was also built with the USER-OMP package, you need to choose
how many OpenMP threads per MPI task will be used by the USER-OMP
package. Note that the product of MPI tasks * OpenMP threads/task
should not exceed the physical number of cores (on a node), otherwise
performance will suffer.

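As a sketch of this rule on a hypothetical 16-core node, either
invocation below fills the node without oversubscribing it. Whether
the OMP_NUM_THREADS setting reaches all ranks this way depends on how
your mpirun propagates environment variables:

env OMP_NUM_THREADS=4 mpirun -np 4 lmp_machine -sf intel -in in.script   # 4 tasks x 4 threads = 16
env OMP_NUM_THREADS=2 mpirun -np 8 lmp_machine -sf intel -in in.script   # 8 tasks x 2 threads = 16 :pre
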
If LAMMPS was built with coprocessor support for the USER-INTEL
package, you need to specify the number of coprocessors/node and the
number of threads to use on the coprocessor per MPI task. Note that
coprocessor threads (which run on the coprocessor) are totally
independent from OpenMP threads (which run on the CPU). The product
of MPI tasks * coprocessor threads/task should not exceed the maximum
number of threads the coprocessor is designed to run, otherwise
performance will suffer. This value is 240 for current generation
Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
threads/core value can be set to a smaller value if desired by an
option on the "package intel"_package.html command, in which case the
maximum number of threads is also reduced.

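As a quick worked example of that budget: with one 240-thread
coprocessor shared by 16 MPI tasks on a node, each task should use at
most 240 / 16 = 15 coprocessor threads; with 24 tasks, at most 10.
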
Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
|
|
|
|
which will automatically append "intel" to styles that support it. If
|
|
|
|
a style does not support it, a "omp" suffix is tried next. Use the
|
|
|
|
"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
|
|
|
|
Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
|
|
|
|
the USER-OMP package. Use the "-pk intel Nphi" "command-line
|
|
|
|
switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
|
|
|
|
coprocessors/node, if LAMMPS was built with coprocessor support.
|
|
|
|
|
|
|
|
CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script                 # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script   # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre

CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk intel 16 0 -in in.script                # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script    # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script   # ditto on 8 16-core nodes :pre

CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
lmp_machine -sf intel -pk intel 16 1 -in in.script                            # 1 MPI task, 240 threads on 1 coprocessor
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script      # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
                                                                              # each MPI task uses 60 threads on 1 coprocessor
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script  # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
                                                                                    # each MPI task uses 120 threads on one of 2 coprocessors :pre

Note that if the "-sf intel" switch is used, it also issues two
default commands: "package omp 0"_package.html and "package intel
1"_package.html. These set the number of OpenMP threads per
MPI task via the OMP_NUM_THREADS environment variable, and the number
of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
LAMMPS was not built with the USER-OMP package. The latter is ignored
if LAMMPS was not built with coprocessor support, except for its
optional precision setting.

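Given those defaults, the two invocations below should behave the same
(a sketch; both assume a build with USER-OMP and coprocessor support):

lmp_machine -sf intel -in in.script
lmp_machine -sf intel -pk omp 0 -pk intel 1 -in in.script :pre
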
Using the "-pk omp" switch explicitly allows for direct setting of the
|
|
|
|
number of OpenMP threads per MPI task, and additional options. Using
|
|
|
|
the "-pk intel" switch explicitly allows for direct setting of the
|
|
|
|
number of coprocessors/node, and additional options. The syntax for
|
|
|
|
these two switches is the same as the "package omp"_package.html and
|
|
|
|
"package intel"_package.html commands. See the "package"_package.html
|
|
|
|
command doc page for details, including the default values used for
|
|
|
|
all its options if these switches are not specified, and how to set
|
|
|
|
the number of OpenMP threads via the OMP_NUM_THREADS environment
|
|
|
|
variable if desired.
|
|
|
|
|
|
|
|
[Or run with the USER-INTEL package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node,
OpenMP threads per MPI task, and coprocessor threads per MPI task is
the same.

Use the "suffix intel"_suffix.html command, or you can explicitly add an
|
|
|
|
"intel" suffix to individual styles in your input script, e.g.
|
|
|
|
|
|
|
|
pair_style lj/cut/intel 2.5 :pre
|
|
|
|
|
|
|
|
You must also use the "package omp"_package.html command to enable the
|
|
|
|
USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
|
|
|
|
intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
|
|
|
|
were used. It specifies how many OpenMP threads per MPI task to use,
|
|
|
|
as well as other options. Its doc page explains how to set the number
|
|
|
|
of threads via an environment variable if desired.
|
|
|
|
|
|
|
|
You must also use the "package intel"_package.html command to enable
|
|
|
|
coprocessor support within the USER-INTEL package (assuming LAMMPS was
|
|
|
|
built with coprocessor support) unless the "-sf intel" or "-pk intel"
|
|
|
|
"command-line switches"_Section_start.html#start_7 were used. It
|
|
|
|
specifies how many coprocessors/node to use, as well as other
|
|
|
|
coprocessor options.
|
|
|
|
|
|
|
|
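Putting those pieces together, the top of an input script might look
like this (a sketch with illustrative values; "package omp 4" assumes
a USER-OMP build and "package intel 1" assumes coprocessor support):

package omp 4
package intel 1
suffix intel
pair_style lj/cut 2.5 :pre

With the "suffix intel"_suffix.html command in place, the lj/cut line
is resolved to the lj/cut/intel style shown earlier.
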
[Speed-ups to expect:]

If LAMMPS was not built with coprocessor support when including the
USER-INTEL package, then accelerated styles will run on the CPU using
vectorization optimizations and the specified precision. This may
give a substantial speed-up for a pair style, particularly if mixed or
single precision is used.

If LAMMPS was built with coprocessor support, the pair styles will run
on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The
performance of a Xeon Phi versus a multi-core CPU is a function of
your hardware, which pair style is used, the number of
atoms/coprocessor, and the precision used on the coprocessor (double,
single, mixed).

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
|
|
|
LAMMPS web site for performance of the USER-INTEL package on different
|
|
|
|
hardware.
|
|
|
|
|
|
|
|
[Guidelines for best performance on an Intel(R) Xeon Phi(TM)
coprocessor:]

The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi(TM)
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the "package intel"_package.html command (see the sketch below this
list). :ulb,l

If desired, only a fraction of the pair style computation can be
offloaded to the coprocessors. This is accomplished by using the
{balance} keyword in the "package intel"_package.html command. A
balance of 0 runs all calculations on the CPU. A balance of 1 runs
all calculations on the coprocessor. A balance of 0.5 runs half of
the calculations on the coprocessor. Setting the balance to -1 (the
default) will enable dynamic load balancing that continuously adjusts
the fraction of offloaded work throughout the simulation. This option
typically produces results within 5 to 10 percent of the optimal fixed
balance. :l

When using offload with CPU hyperthreading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks. :l

If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs. :l

If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
coprocessor, a diagnostic line is printed to the screen (not to the
log file), during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for "atom
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. :l

For simulations with long-range electrostatics or bond, angle,
dihedral, improper calculations, computation and data transfer to the
coprocessor will run concurrently with computations and MPI
communications for these calculations on the host CPU. The USER-INTEL
package has two modes for deciding which atoms will be handled by the
coprocessor. This choice is controlled with the {ghost} keyword of
the "package intel"_package.html command. When set to 0, ghost atoms
(atoms at the borders between MPI tasks) are not offloaded to the
card. This allows for overlap of MPI communication of forces with
computation on the coprocessor when the "newton"_newton.html setting
is "on". The default is dependent on the style being used, however,
better performance may be achieved by setting this option
explicitly. :l,ule

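As one sketch that combines these guidelines (illustrative values
only; the tptask keyword usage is modeled on the command-line examples
earlier in this section, and the exact positional arguments are
documented on the "package intel"_package.html doc page), the
following runs 24 MPI tasks on a node sharing one coprocessor, caps
each task at 10 coprocessor threads, and keeps the default dynamic
load balancing:

mpirun -np 24 -ppn 24 lmp_machine -sf intel -pk intel 1 1 tptask 10 -in in.script :pre
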
[Restrictions:]

When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
that require skip lists for neighbor builds cannot be offloaded.
Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel
accelerated style may be used with hybrid styles.
"Special_bonds"_special_bonds.html exclusion lists are not currently
supported with offload; however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command; only the "pair" option is
supported.