git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12508 f3b2605a-c512-4ea7-a41b-209d697bcdaa
parent 16864ce4e3
commit d0b6d228c7
@@ -137,7 +137,7 @@ library.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
@@ -134,7 +134,7 @@ library.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
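Note: the two launchers named in the hunks above differ only in their per-node switch. A minimal sketch, with the lmp_machine binary name and task counts chosen purely for illustration:

    mpirun -np 16 -ppn 8 lmp_machine -in in.script        # MPICH: 16 MPI tasks total, 8 per node
    mpirun -np 16 -npernode 8 lmp_machine -in in.script   # OpenMPI: same task layout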
@@ -133,7 +133,7 @@ re-compiled and linked to the new GPU library.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using the GPU package, you cannot assign more than one GPU to a
 single MPI task. However multiple MPI tasks can share the same GPU,
@@ -130,7 +130,7 @@ re-compiled and linked to the new GPU library.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using the GPU package, you cannot assign more than one GPU to a
 single MPI task. However multiple MPI tasks can share the same GPU,
@@ -28,10 +28,10 @@ once with an offload flag.
 package. This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
-electrostatics, can be run simultaneously in threaded mode on CPU
+electrostatics, can run simultaneously in threaded mode on the CPU
 cores. Since fewer MPI tasks than CPU cores will typically be invoked
-when running with coprocessors, this enables the extra cores to be
-utilized for useful computation.
+when running with coprocessors, this enables the extra CPU cores to be
+used for useful computation.
 </P>
 <P>If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
@@ -42,13 +42,13 @@ if available, after first testing if a style from the USER-INTEL
 package is available.
 </P>
 <P>Here is a quick overview of how to use the USER-INTEL package
-for CPU acceleration:
+for CPU-only acceleration:
 </P>
-<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
-<LI>specify -fopenmp with LINKFLAGS in your Makefile.machine
+<UL><LI>specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
+<LI>specify -openmp with LINKFLAGS in your Makefile.machine
 <LI>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
-<LI>if using the USER-OMP package, specify how many threads per MPI task to use
-<LI>use USER-INTEL styles in your input script
+<LI>specify how many OpenMP threads per MPI task to use
+<LI>use USER-INTEL and (optionally) USER-OMP styles in your input script
 </UL>
 <P>Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
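Note: collected into one place, the build settings listed above would land in a machine makefile roughly as follows. This is a sketch: only the four CCFLAGS entries and the -openmp LINKFLAGS come from the text; the compiler name and -O3 level are assumed (the stock Makefile.intel mentioned in a later hunk ships ready-made settings):

    CC =        mpiicpc
    CCFLAGS =   -O3 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost
    LINK =      mpiicpc
    LINKFLAGS = -O3 -openmp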
@@ -56,15 +56,14 @@ steps:
 </P>
 <UL><LI>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
 <LI>add the flag -offload to LINKFLAGS in your Makefile.machine
-<LI>specify how many threads per coprocessor to use
+<LI>specify how many coprocessor threads per MPI task to use
 </UL>
 <P>The latter two steps in the first case and the last step in the
-coprocessor case can be done using the "-pk omp" and "-sf intel" and
-"-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
-respectively. Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix
-intel</A> or <A HREF = "package.html">package intel</A> commands
-respectively to your input script.
+coprocessor case can be done using the "-pk intel" and "-sf intel"
+<A HREF = "Section_start.html#start_7">command-line switches</A> respectively. Or
+the effect of the "-pk" or "-sf" switches can be duplicated by adding
+the <A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix intel</A>
+commands respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
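Note: for the coprocessor build, the same hypothetical fragment grows by exactly the two flags the list names:

    CCFLAGS =   -O3 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
    LINKFLAGS = -O3 -openmp -offload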
@@ -99,9 +98,9 @@ Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and
 the runs, adding the flag <I>-xHost</I> to CCFLAGS will enable
 vectorization with the Intel(R) compiler.
 </P>
-<P>In order to build with support for an Intel(R) coprocessor, the flag
-<I>-offload</I> should be added to the LINKFLAGS line and the flag
--DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
+<P>In order to build with support for an Intel(R) Xeon Phi(TM)
+coprocessor, the flag <I>-offload</I> should be added to the LINKFLAGS line
+and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
 </P>
 <P>Note that the machine makefiles Makefile.intel and
 Makefile.intel_offload are included in the src/MAKE directory with
@@ -118,71 +117,77 @@ higher is recommended.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
-<P>If LAMMPS was also built with the USER-OMP package, you need to choose
-how many OpenMP threads per MPI task will be used by the USER-OMP
-package. Note that the product of MPI tasks * OpenMP threads/task
-should not exceed the physical number of cores (on a node), otherwise
-performance will suffer.
+<P>If you plan to compute (any portion of) pairwise interactions using
+USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
+you need to choose how many OpenMP threads per MPI task to use. Note
+that the product of MPI tasks * OpenMP threads/task should not exceed
+the physical number of cores (on a node), otherwise performance will
+suffer.
 </P>
 <P>If LAMMPS was built with coprocessor support for the USER-INTEL
-package, you need to specify the number of coprocessor/node and the
-number of threads to use on the coprocessor per MPI task. Note that
+package, you also need to specify the number of coprocessor/node and
+the number of coprocessor threads per MPI task to use. Note that
 coprocessor threads (which run on the coprocessor) are totally
-independent from OpenMP threads (which run on the CPU). The product
-of MPI tasks * coprocessor threads/task should not exceed the maximum
-number of threads the coproprocessor is designed to run, otherwise
-performance will suffer. This value is 240 for current generation
-Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
-threads/core value can be set to a smaller value if desired by an
-option on the <A HREF = "package.html">package intel</A> command, in which case the
-maximum number of threads is also reduced.
+independent from OpenMP threads (which run on the CPU). The default
+values for the settings that affect coprocessor threads are typically
+fine, as discussed below.
 </P>
 <P>Use the "-sf intel" <A HREF = "Section_start.html#start_7">command-line switch</A>,
 which will automatically append "intel" to styles that support it. If
-a style does not support it, a "omp" suffix is tried next. Use the
-"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A>, to set
-Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
-the USER-OMP package. Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
+a style does not support it, an "omp" suffix is tried next. OpenMP
+threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
+"-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switches</A>, which
+set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
+is only allowed if LAMMPS was also built with the USER-OMP package.
+</P>
+<P>Use the "-pk intel Nphi" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to set Nphi = # of Xeon Phi(TM)
-coprocessors/node, if LAMMPS was built with coprocessor support.
+coprocessors/node, if LAMMPS was built with coprocessor support. All
+the available coprocessor threads on each Phi will be divided among
+MPI tasks, unless the <I>tptask</I> option of the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A> is used to limit the coprocessor
+threads per MPI task. See the <A HREF = "package.html">package intel</A> command
+for details.
 </P>
 <PRE>CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
 </PRE>
 <PRE>CPU-only with USER-OMP (and Intel vectorization on CPU):
-lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
-mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes
+lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
+mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
+mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes
 </PRE>
-<PRE>CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
-lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
-# each MPI task uses 60 threads on 1 coprocessor
-mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
-# each MPI task uses 120 threads on one of 2 coprocessors
+<PRE>CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
+lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
+lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
+mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
+mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
+mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task
 </PRE>
-<P>Note that if the "-sf intel" switch is used, it also issues two
-default commands: <A HREF = "package.html">package omp 0</A> and <A HREF = "package.html">package intel
-1</A> command. These set the number of OpenMP threads per
-MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
-LAMMPS was not built with the USER-OMP package. The latter is ignored
-is LAMMPS was not built with coprocessor support, except for its
-optional precision setting.
+<P>Note that if the "-sf intel" switch is used, it also invokes two
+default commands: <A HREF = "package.html">package intel 1</A>, followed by <A HREF = "package.html">package
+omp 0</A>. These both set the number of OpenMP threads per
+MPI task via the OMP_NUM_THREADS environment variable. The first
+command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
+the precision mode to "mixed", as one of its option defaults). The
+latter command is not invoked if LAMMPS was not built with the
+USER-OMP package. The Nphi = 1 value for the first command is ignored
+if LAMMPS was not built with coprocessor support.
 </P>
-<P>Using the "-pk omp" switch explicitly allows for direct setting of the
-number of OpenMP threads per MPI task, and additional options. Using
-the "-pk intel" switch explicitly allows for direct setting of the
-number of coprocessors/node, and additional options. The syntax for
-these two switches is the same as the <A HREF = "package.html">package omp</A> and
-<A HREF = "package.html">package intel</A> commands. See the <A HREF = "package.html">package</A>
-command doc page for details, including the default values used for
-all its options if these switches are not specified, and how to set
-the number of OpenMP threads via the OMP_NUM_THREADS environment
-variable if desired.
+<P>Using the "-pk intel" or "-pk omp" switches explicitly allows for
+direct setting of the number of OpenMP threads per MPI task, and
+additional options for either of the USER-INTEL or USER-OMP packages.
+In particular, the "-pk intel" switch sets the number of
+coprocessors/node and can limit the number of coprocessor threads per
+MPI task. The syntax for these two switches is the same as the
+<A HREF = "package.html">package omp</A> and <A HREF = "package.html">package intel</A> commands.
+See the <A HREF = "package.html">package</A> command doc page for details, including
+the default values used for all its options if these switches are not
+specified, and how to set the number of OpenMP threads via the
+OMP_NUM_THREADS environment variable if desired.
 </P>
 <P><B>Or run with the USER-INTEL package by editing an input script:</B>
 </P>
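Note: per the new text, Nt = 0 (the default) defers to the OpenMP environment, so the thread counts in the examples above can also come from OMP_NUM_THREADS; a sketch with illustrative counts:

    export OMP_NUM_THREADS=4
    mpirun -np 4 lmp_machine -sf intel -in in.script   # 4 MPI tasks, 4 OpenMP threads per task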
@@ -195,19 +200,20 @@ the same.
 </P>
 <PRE>pair_style lj/cut/intel 2.5
 </PRE>
-<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
-USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
-intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line switches</A>
-were used. It specifies how many OpenMP threads per MPI task to use,
-as well as other options. Its doc page explains how to set the number
-of threads via an environment variable if desired.
+<P>You must also use the <A HREF = "package.html">package intel</A> command, unless the
+"-sf intel" or "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switches</A> were used. It specifies how many
+coprocessors/node to use, as well as other OpenMP threading and
+coprocessor options. Its doc page explains how to set the number of
+OpenMP threads via an environment variable if desired.
 </P>
-<P>You must also use the <A HREF = "package.html">package intel</A> command to enable
-coprocessor support within the USER-INTEL package (assuming LAMMPS was
-built with coprocessor support) unless the "-sf intel" or "-pk intel"
-<A HREF = "Section_start.html#start_7">command-line switches</A> were used. It
-specifies how many coprocessors/node to use, as well as other
-coprocessor options.
+<P>If LAMMPS was also built with the USER-OMP package, you must also use
+the <A HREF = "package.html">package omp</A> command to enable that package, unless
+the "-sf intel" or "-pk omp" <A HREF = "Section_start.html#start_7">command-line
+switches</A> were used. It specifies how many
+OpenMP threads per MPI task to use, as well as other options. Its doc
+page explains how to set the number of OpenMP threads via an
+environment variable if desired.
 </P>
 <P><B>Speed-ups to expect:</B>
 </P>
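Note: the switch-driven setup in this hunk can equivalently be written into the input script itself; a minimal sketch built only from commands this page names (counts illustrative):

    package intel 1 omp 4     # 1 coprocessor/node, 4 OpenMP threads per MPI task
    package omp 4             # only if built with USER-OMP; keep Nthreads consistent
    suffix intel              # same effect as the "-sf intel" switch
    pair_style lj/cut 2.5     # resolves to lj/cut/intel via the suffix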
@@ -25,10 +25,10 @@ The USER-INTEL package can be used in tandem with the USER-OMP
 package. This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
-electrostatics, can be run simultaneously in threaded mode on CPU
+electrostatics, can run simultaneously in threaded mode on the CPU
 cores. Since fewer MPI tasks than CPU cores will typically be invoked
-when running with coprocessors, this enables the extra cores to be
-utilized for useful computation.
+when running with coprocessors, this enables the extra CPU cores to be
+used for useful computation.
 
 If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
@@ -39,13 +39,13 @@ if available, after first testing if a style from the USER-INTEL
 package is available.
 
 Here is a quick overview of how to use the USER-INTEL package
-for CPU acceleration:
+for CPU-only acceleration:
 
-specify these CCFLAGS in your src/MAKE/Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
-specify -fopenmp with LINKFLAGS in your Makefile.machine
+specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
+specify -openmp with LINKFLAGS in your Makefile.machine
 include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
-if using the USER-OMP package, specify how many threads per MPI task to use
-use USER-INTEL styles in your input script :ul
+specify how many OpenMP threads per MPI task to use
+use USER-INTEL and (optionally) USER-OMP styles in your input script :ul
 
 Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
@@ -53,15 +53,14 @@ steps:
 
 add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
 add the flag -offload to LINKFLAGS in your Makefile.machine
-specify how many threads per coprocessor to use :ul
+specify how many coprocessor threads per MPI task to use :ul
 
 The latter two steps in the first case and the last step in the
-coprocessor case can be done using the "-pk omp" and "-sf intel" and
-"-pk intel" "command-line switches"_Section_start.html#start_7
-respectively. Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the "package omp"_package.html or "suffix
-intel"_suffix.html or "package intel"_package.html commands
-respectively to your input script.
+coprocessor case can be done using the "-pk intel" and "-sf intel"
+"command-line switches"_Section_start.html#start_7 respectively. Or
+the effect of the "-pk" or "-sf" switches can be duplicated by adding
+the "package intel"_package.html or "suffix intel"_suffix.html
+commands respectively to your input script.
 
 [Required hardware/software:]
 
@@ -96,9 +95,9 @@ If you are compiling on the same architecture that will be used for
 the runs, adding the flag {-xHost} to CCFLAGS will enable
 vectorization with the Intel(R) compiler.
 
-In order to build with support for an Intel(R) coprocessor, the flag
-{-offload} should be added to the LINKFLAGS line and the flag
--DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
+In order to build with support for an Intel(R) Xeon Phi(TM)
+coprocessor, the flag {-offload} should be added to the LINKFLAGS line
+and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
 
 Note that the machine makefiles Makefile.intel and
 Makefile.intel_offload are included in the src/MAKE directory with
@@ -115,71 +114,77 @@ higher is recommended.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
-If LAMMPS was also built with the USER-OMP package, you need to choose
-how many OpenMP threads per MPI task will be used by the USER-OMP
-package. Note that the product of MPI tasks * OpenMP threads/task
-should not exceed the physical number of cores (on a node), otherwise
-performance will suffer.
+If you plan to compute (any portion of) pairwise interactions using
+USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
+you need to choose how many OpenMP threads per MPI task to use. Note
+that the product of MPI tasks * OpenMP threads/task should not exceed
+the physical number of cores (on a node), otherwise performance will
+suffer.
 
 If LAMMPS was built with coprocessor support for the USER-INTEL
-package, you need to specify the number of coprocessor/node and the
-number of threads to use on the coprocessor per MPI task. Note that
+package, you also need to specify the number of coprocessor/node and
+the number of coprocessor threads per MPI task to use. Note that
 coprocessor threads (which run on the coprocessor) are totally
-independent from OpenMP threads (which run on the CPU). The product
-of MPI tasks * coprocessor threads/task should not exceed the maximum
-number of threads the coproprocessor is designed to run, otherwise
-performance will suffer. This value is 240 for current generation
-Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The
-threads/core value can be set to a smaller value if desired by an
-option on the "package intel"_package.html command, in which case the
-maximum number of threads is also reduced.
+independent from OpenMP threads (which run on the CPU). The default
+values for the settings that affect coprocessor threads are typically
+fine, as discussed below.
 
 Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
 which will automatically append "intel" to styles that support it. If
-a style does not support it, a "omp" suffix is tried next. Use the
-"-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
-Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
-the USER-OMP package. Use the "-pk intel Nphi" "command-line
+a style does not support it, an "omp" suffix is tried next. OpenMP
+threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
+"-pk omp Nt" "command-line switches"_Section_start.html#start_7, which
+set Nt = # of OpenMP threads per MPI task to use. The "-pk omp" form
+is only allowed if LAMMPS was also built with the USER-OMP package.
+
+Use the "-pk intel Nphi" "command-line
 switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
-coprocessors/node, if LAMMPS was built with coprocessor support.
+coprocessors/node, if LAMMPS was built with coprocessor support. All
+the available coprocessor threads on each Phi will be divided among
+MPI tasks, unless the {tptask} option of the "-pk intel" "command-line
+switch"_Section_start.html#start_7 is used to limit the coprocessor
+threads per MPI task. See the "package intel"_package.html command
+for details.
 
 CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre
 
 CPU-only with USER-OMP (and Intel vectorization on CPU):
-lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
-mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script # ditto on 8 16-core nodes :pre
+lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node
+mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node
+mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre
 
-CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
-lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
-mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
-# each MPI task uses 60 threads on 1 coprocessor
-mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
-# each MPI task uses 120 threads on one of 2 coprocessors :pre
+CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
+lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
+lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
+mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
+mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes
+mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task :pre
 
-Note that if the "-sf intel" switch is used, it also issues two
-default commands: "package omp 0"_package.html and "package intel
-1"_package.html command. These set the number of OpenMP threads per
-MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if
-LAMMPS was not built with the USER-OMP package. The latter is ignored
-is LAMMPS was not built with coprocessor support, except for its
-optional precision setting.
+Note that if the "-sf intel" switch is used, it also invokes two
+default commands: "package intel 1"_package.html, followed by "package
+omp 0"_package.html. These both set the number of OpenMP threads per
+MPI task via the OMP_NUM_THREADS environment variable. The first
+command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
+the precision mode to "mixed", as one of its option defaults). The
+latter command is not invoked if LAMMPS was not built with the
+USER-OMP package. The Nphi = 1 value for the first command is ignored
+if LAMMPS was not built with coprocessor support.
 
-Using the "-pk omp" switch explicitly allows for direct setting of the
-number of OpenMP threads per MPI task, and additional options. Using
-the "-pk intel" switch explicitly allows for direct setting of the
-number of coprocessors/node, and additional options. The syntax for
-these two switches is the same as the "package omp"_package.html and
-"package intel"_package.html commands. See the "package"_package.html
-command doc page for details, including the default values used for
-all its options if these switches are not specified, and how to set
-the number of OpenMP threads via the OMP_NUM_THREADS environment
-variable if desired.
+Using the "-pk intel" or "-pk omp" switches explicitly allows for
+direct setting of the number of OpenMP threads per MPI task, and
+additional options for either of the USER-INTEL or USER-OMP packages.
+In particular, the "-pk intel" switch sets the number of
+coprocessors/node and can limit the number of coprocessor threads per
+MPI task. The syntax for these two switches is the same as the
+"package omp"_package.html and "package intel"_package.html commands.
+See the "package"_package.html command doc page for details, including
+the default values used for all its options if these switches are not
+specified, and how to set the number of OpenMP threads via the
+OMP_NUM_THREADS environment variable if desired.
 
 [Or run with the USER-INTEL package by editing an input script:]
 
@@ -192,19 +197,20 @@ Use the "suffix intel"_suffix.html command, or you can explicitly add an
 
 pair_style lj/cut/intel 2.5 :pre
 
-You must also use the "package omp"_package.html command to enable the
-USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf
-intel" or "-pk omp" "command-line switches"_Section_start.html#start_7
-were used. It specifies how many OpenMP threads per MPI task to use,
-as well as other options. Its doc page explains how to set the number
-of threads via an environment variable if desired.
+You must also use the "package intel"_package.html command, unless the
+"-sf intel" or "-pk intel" "command-line
+switches"_Section_start.html#start_7 were used. It specifies how many
+coprocessors/node to use, as well as other OpenMP threading and
+coprocessor options. Its doc page explains how to set the number of
+OpenMP threads via an environment variable if desired.
 
-You must also use the "package intel"_package.html command to enable
-coprocessor support within the USER-INTEL package (assuming LAMMPS was
-built with coprocessor support) unless the "-sf intel" or "-pk intel"
-"command-line switches"_Section_start.html#start_7 were used. It
-specifies how many coprocessors/node to use, as well as other
-coprocessor options.
+If LAMMPS was also built with the USER-OMP package, you must also use
+the "package omp"_package.html command to enable that package, unless
+the "-sf intel" or "-pk omp" "command-line
+switches"_Section_start.html#start_7 were used. It specifies how many
+OpenMP threads per MPI task to use, as well as other options. Its doc
+page explains how to set the number of OpenMP threads via an
+environment variable if desired.
 
 [Speed-ups to expect:]
 
@@ -178,7 +178,7 @@ double precision.
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
@@ -175,7 +175,7 @@ double precision.
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
@@ -57,7 +57,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 </P>
 <P>You need to choose how many threads per MPI task will be used by the
 USER-OMP package. Note that the product of MPI tasks * threads/task
@@ -54,7 +54,7 @@ Intel compilers the CCFLAGS setting also needs to include "-restrict".
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node. E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches. Ditto OpenMPI via -np and -npernode.
+its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
 
 You need to choose how many threads per MPI task will be used by the
 USER-OMP package. Note that the product of MPI tasks * threads/task
doc/package.html (137 lines changed)
@@ -59,20 +59,22 @@
 <I>intel</I> args = NPhi keyword value ...
 Nphi = # of coprocessors per node
 zero or more keyword/value pairs may be appended
-keywords = <I>prec</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
-<I>prec</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
+keywords = <I>omp</I> or <I>mode</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
+<I>omp</I> value = Nthreads
+Nthreads = number of OpenMP threads to use on CPU (default = 0)
+<I>mode</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
 single = perform force calculations in single precision
 mixed = perform force calculations in mixed precision
 double = perform force calculations in double precision
-<I>balance</I> value = split
-split = fraction of work to offload to coprocessor, -1 for dynamic
-<I>ghost</I> value = <I>yes</I> or <I>no</I>
-yes = include ghost atoms for offload
-no = do not include ghost atoms for offload
-<I>tpc</I> value = Ntpc
-Ntpc = number of threads to use on each physical core of coprocessor
-<I>tptask</I> value = Ntptask
-Ntptask = max number of threads to use on coprocessor for each MPI task
+<I>balance</I> value = split
+split = fraction of work to offload to coprocessor, -1 for dynamic
+<I>ghost</I> value = <I>yes</I> or <I>no</I>
+yes = include ghost atoms for offload
+no = do not include ghost atoms for offload
+<I>tpc</I> value = Ntpc
+Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
+<I>tptask</I> value = Ntptask
+Ntptask = max number of coprocessor threads per MPI task (default = 240)
 <I>kokkos</I> args = keyword value ...
 zero or more keyword/value pairs may be appended
 keywords = <I>neigh</I> or <I>newton</I> or <I>binsize</I> or <I>comm</I> or <I>comm/exchange</I> or <I>comm/forward</I>
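Note: read against the argument list above, a command exercising every new keyword would look like this sketch (all values illustrative):

    package intel 2 omp 8 mode mixed balance 0.5 ghost no tpc 4 tptask 120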
@@ -114,7 +116,8 @@ package cuda 1 test 3948
 package kokkos neigh half/thread comm device
 package omp 0 neigh no
 package omp 4
-package intel * mixed balance -1
+package intel 1
+package intel 2 omp 4 mode mixed balance 0.5
 </PRE>
 <P><B>Description:</B>
 </P>
@@ -324,18 +327,56 @@ lib/gpu/Makefile that is used.
 <HR>
 
 <P>The <I>intel</I> style invokes settings associated with the use of the
-USER-INTEL package. All of its settings, except the <I>prec</I> keyword,
-are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
-when building with the USER-INTEL package. All of its settings,
-including the <I>prec</I> keyword are applicable if LAMMPS was built with
-coprocessor support.
+USER-INTEL package. All of its settings, except the <I>omp</I> and <I>mode</I>
+keywords, are ignored if LAMMPS was not built with Xeon Phi
+coprocessor support. All of its settings, including the <I>omp</I> and
+<I>mode</I> keywords, are applicable if LAMMPS was built with coprocessor
+support.
 </P>
 <P>The <I>Nphi</I> argument sets the number of coprocessors per node.
+This can be set to any value, including 0, if LAMMPS was not
+built with coprocessor support.
 </P>
 <P>Optional keyword/value pairs can also be specified. Each has a
 default value as listed below.
 </P>
-<P>The <I>prec</I> keyword argument determines the precision mode to use for
+<P>The <I>omp</I> keyword determines the number of OpenMP threads allocated
+for each MPI task when any portion of the interactions computed by a
+USER-INTEL pair style are run on the CPU. This can be the case even
+if LAMMPS was built with coprocessor support; see the <I>balance</I>
+keyword discussion below. If you are running with fewer MPI tasks/node
+than there are CPUs, it can be advantageous to use OpenMP threading on
+the CPUs.
+</P>
+<P>IMPORTANT NOTE: The <I>omp</I> keyword has nothing to do with coprocessor
+threads on the Xeon Phi; see the <I>tpc</I> and <I>tptask</I> keywords below for
+a discussion of coprocessor threads.
+</P>
+<P>The <I>Nthread</I> value for the <I>omp</I> keyword sets the number of OpenMP
+threads allocated for each MPI task. Setting <I>Nthread</I> = 0 (the
+default) instructs LAMMPS to use whatever value is the default for the
+given OpenMP environment. This is usually determined via the
+<I>OMP_NUM_THREADS</I> environment variable or the compiler runtime, which
+is usually a value of 1.
+</P>
+<P>For more details, including examples of how to set the OMP_NUM_THREADS
+environment variable, see the discussion of the <I>Nthreads</I> setting on
+this doc page for the "package omp" command. Nthreads is a required
+argument for the USER-OMP package. Its meaning is exactly the same
+for the USER-INTEL package.
+</P>
+<P>IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
+USER-OMP packages, be aware that both packages allow setting of the
+<I>Nthreads</I> value via their package commands, but there is only a
+single global <I>Nthreads</I> value used by OpenMP. Thus if both package
+commands are invoked, you should ensure the two values are consistent.
+If they are not, the last one invoked will take precedence, for both
+packages. Also note that if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A> is used, it invokes a "package
+intel" command, followed by a "package omp" command, both with a
+setting of <I>Nthreads</I> = 0.
+</P>
+<P>The <I>mode</I> keyword determines the precision mode to use for
 computing pair style forces, either on the CPU or on the coprocessor,
 when using a USER-INTEL supported <A HREF = "pair_style.html">pair style</A>. It
 can take a value of <I>single</I>, <I>mixed</I> which is the default, or
@@ -347,12 +388,12 @@ quantities. <I>Double</I> means double precision is used for the entire
 force calculation.
 </P>
 <P>The <I>balance</I> keyword sets the fraction of <A HREF = "pair_style.html">pair
-style</A> work offloaded to the coprocessor style for
-split values between 0.0 and 1.0 inclusive. While this fraction of
-work is running on the coprocessor, other calculations will run on the
-host, including neighbor and pair calculations that are not offloaded,
-angle, bond, dihedral, kspace, and some MPI communications. If
-<I>split</I> is set to -1, the fraction of work is dynamically adjusted
+style</A> work offloaded to the coprocessor for split
+values between 0.0 and 1.0 inclusive. While this fraction of work is
+running on the coprocessor, other calculations will run on the host,
+including neighbor and pair calculations that are not offloaded, as
+well as angle, bond, dihedral, kspace, and some MPI communications.
+If <I>split</I> is set to -1, the fraction of work is dynamically adjusted
 automatically throughout the run. This typically gives performance
 within 5 to 10 percent of the optimal fixed fraction.
 </P>
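Note: in input-script form, the two balance modes just described would read (0.75 is an arbitrary fixed split; -1 comes from the text):

    package intel 1 balance 0.75   # offload a fixed 75% of pair style work
    package intel 1 balance -1     # adjust the offload fraction dynamically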
@@ -362,21 +403,28 @@ and force calculations. When the value = "no", ghost atoms are not
 offloaded. This option can reduce the amount of data transfer with
 the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
-setting is "on". When the value = "ues", ghost atoms are offloaded.
+setting is "on". When the value = "yes", ghost atoms are offloaded.
 In some cases this can provide better performance, especially if the
 <I>balance</I> fraction is high.
 </P>
-<P>The <I>tpc</I> keyword sets the maximum # of threads <I>Ntpc</I> that will
-run on each physical core of the coprocessor. The default value is
-set to 4, which is the number of hardware threads per core supported
-by the current generation Xeon Phi chips.
+<P>The <I>tpc</I> keyword sets the max # of coprocessor threads <I>Ntpc</I> that
+will run on each core of the coprocessor. The default value = 4,
+which is the number of hardware threads per core supported by the
+current generation Xeon Phi chips.
 </P>
-<P>The <I>tptask</I> keyword sets the maximum # of threads (Ntptask</I> that will
-be used on the coprocessor for each MPI task. This, along with the
-<I>tpc</I> keyword setting, are the only methods for changing the number of
-threads used on the coprocessor. The default value is set to 240 =
-60*4, which is the maximum # of threads supported by an entire current
-generation Xeon Phi chip.
+<P>The <I>tptask</I> keyword sets the max # of coprocessor threads <I>Ntptask</I>
+assigned to each MPI task. The default value = 240, which is the
+total # of threads an entire current generation Xeon Phi chip can run
+(240 = 60 cores * 4 threads/core). This means each MPI task assigned
+to the Phi will have enough threads for the chip to run the max allowed,
+even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
+the Phi, each will run with 30 threads. If you wish to limit the
+number of threads per MPI task, set <I>tptask</I> to a smaller value.
+E.g. for <I>tptask</I> = 16, if 8 MPI tasks are assigned, each will run
+with 16 threads, for a total of 128.
+</P>
+<P>Note that the default settings for <I>tpc</I> and <I>tptask</I> are fine for
+most problems, regardless of how many MPI tasks you assign to a Phi.
 </P>
 <HR>
 
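Note: a worked instance of the arithmetic above; with 8 MPI tasks per Phi the default already yields 240/8 = 30 threads each, and the same cap can be made explicit:

    package intel 1 tptask 30   # at most 30 coprocessor threads per MPI task (8 tasks * 30 = 240)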
@@ -581,15 +629,16 @@ must invoke the package gpu command in your input script or via the
 "-pk gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>.
 </P>
 <P>For the USER-INTEL package, the default is Nphi = 1 and the option
-defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
-that all of these settings, except "prec", are ignored if LAMMPS was
-not built with Xeon Phi coprocessor support. The default ghost option
-is determined by the pair style being used. This value is output to
-the screen in the offload report at the end of each run. These
-settings are made automatically if the "-sf intel" <A HREF = "Section_start.html#start_7">command-line
-switch</A> is used. If it is not used, you
-must invoke the package intel command in your input script or or via
-the "-pk intel" <A HREF = "Section_start.html#start_7">command-line switch</A>.
+defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
+240. The default ghost option is determined by the pair style being
+used. This value is output to the screen in the offload report at the
+end of each run. Note that all of these settings, except "omp" and
+"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
+support. These settings are made automatically if the "-sf intel"
+<A HREF = "Section_start.html#start_7">command-line switch</A> is used. If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.
 </P>
 <P>For the KOKKOS package, the option defaults neigh = full, newton =
 off, binsize = 0.0, and comm = host. These settings are made
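Note: if the stated defaults are spelled out, these two launches should be equivalent (a sketch; the second line omits the implicit "package omp 0" that "-sf intel" also issues when USER-OMP is installed):

    lmp_machine -sf intel -in in.script
    lmp_machine -sf intel -pk intel 1 omp 0 mode mixed balance -1 tpc 4 tptask 240 -in in.script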
doc/package.txt (142 lines changed)
@@ -54,20 +54,22 @@ args = arguments specific to the style :l
 {intel} args = NPhi keyword value ...
 Nphi = # of coprocessors per node
 zero or more keyword/value pairs may be appended
-keywords = {prec} or {balance} or {ghost} or {tpc} or {tptask}
-{prec} value = {single} or {mixed} or {double}
+keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask}
+{omp} value = Nthreads
+Nthreads = number of OpenMP threads to use on CPU (default = 0)
+{mode} value = {single} or {mixed} or {double}
 single = perform force calculations in single precision
 mixed = perform force calculations in mixed precision
 double = perform force calculations in double precision
-{balance} value = split
-split = fraction of work to offload to coprocessor, -1 for dynamic
-{ghost} value = {yes} or {no}
-yes = include ghost atoms for offload
-no = do not include ghost atoms for offload
-{tpc} value = Ntpc
-Ntpc = number of threads to use on each physical core of coprocessor
-{tptask} value = Ntptask
-Ntptask = max number of threads to use on coprocessor for each MPI task
+{balance} value = split
+split = fraction of work to offload to coprocessor, -1 for dynamic
+{ghost} value = {yes} or {no}
+yes = include ghost atoms for offload
+no = do not include ghost atoms for offload
+{tpc} value = Ntpc
+Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
+{tptask} value = Ntptask
+Ntptask = max number of coprocessor threads per MPI task (default = 240)
 {kokkos} args = keyword value ...
 zero or more keyword/value pairs may be appended
 keywords = {neigh} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward}
@@ -108,7 +110,8 @@ package cuda 1 test 3948
 package kokkos neigh half/thread comm device
 package omp 0 neigh no
 package omp 4
-package intel * mixed balance -1 :pre
+package intel 1
+package intel 2 omp 4 mode mixed balance 0.5 :pre
 
 [Description:]
 
@@ -263,11 +266,6 @@ cutoff of 20*sigma in LJ "units"_units.html and a neighbor skin
 distance of sigma, a {binsize} = 5.25*sigma can be more efficient than
 the default.
 
-
-
-
-
-
 The {split} keyword can be used for load balancing force calculations
 between CPU and GPU cores in GPU-enabled pair styles. If 0 < {split} <
 1.0, a fixed fraction of particles is offloaded to the GPU while force
@@ -323,18 +321,56 @@ lib/gpu/Makefile that is used.
 :line
 
 The {intel} style invokes settings associated with the use of the
-USER-INTEL package. All of its settings, except the {prec} keyword,
-are ignored if LAMMPS was not built with Xeon Phi coprocessor support,
-when building with the USER-INTEL package. All of its settings,
-including the {prec} keyword are applicable if LAMMPS was built with
-coprocessor support.
+USER-INTEL package. All of its settings, except the {omp} and {mode}
+keywords, are ignored if LAMMPS was not built with Xeon Phi
+coprocessor support. All of its settings, including the {omp} and
+{mode} keywords, are applicable if LAMMPS was built with coprocessor
+support.
 
 The {Nphi} argument sets the number of coprocessors per node.
+This can be set to any value, including 0, if LAMMPS was not
+built with coprocessor support.
 
 Optional keyword/value pairs can also be specified. Each has a
 default value as listed below.
 
-The {prec} keyword argument determines the precision mode to use for
+The {omp} keyword determines the number of OpenMP threads allocated
+for each MPI task when any portion of the interactions computed by a
+USER-INTEL pair style are run on the CPU. This can be the case even
+if LAMMPS was built with coprocessor support; see the {balance}
+keyword discussion below. If you are running with fewer MPI tasks/node
+than there are CPUs, it can be advantageous to use OpenMP threading on
+the CPUs.
+
+IMPORTANT NOTE: The {omp} keyword has nothing to do with coprocessor
+threads on the Xeon Phi; see the {tpc} and {tptask} keywords below for
+a discussion of coprocessor threads.
+
+The {Nthread} value for the {omp} keyword sets the number of OpenMP
+threads allocated for each MPI task. Setting {Nthread} = 0 (the
+default) instructs LAMMPS to use whatever value is the default for the
+given OpenMP environment. This is usually determined via the
+{OMP_NUM_THREADS} environment variable or the compiler runtime, which
+is usually a value of 1.
+
+For more details, including examples of how to set the OMP_NUM_THREADS
+environment variable, see the discussion of the {Nthreads} setting on
+this doc page for the "package omp" command. Nthreads is a required
+argument for the USER-OMP package. Its meaning is exactly the same
+for the USER-INTEL package.
+
+IMPORTANT NOTE: If you build LAMMPS with both the USER-INTEL and
+USER-OMP packages, be aware that both packages allow setting of the
+{Nthreads} value via their package commands, but there is only a
+single global {Nthreads} value used by OpenMP. Thus if both package
+commands are invoked, you should ensure the two values are consistent.
+If they are not, the last one invoked will take precedence, for both
+packages. Also note that if the "-sf intel" "command-line
+switch"_Section_start.html#start_7 is used, it invokes a "package
+intel" command, followed by a "package omp" command, both with a
+setting of {Nthreads} = 0.
+
+The {mode} keyword determines the precision mode to use for
 computing pair style forces, either on the CPU or on the coprocessor,
 when using a USER-INTEL supported "pair style"_pair_style.html. It
 can take a value of {single}, {mixed} which is the default, or
@@ -346,12 +382,12 @@ quantities. {Double} means double precision is used for the entire
 force calculation.
 
 The {balance} keyword sets the fraction of "pair
-style"_pair_style.html work offloaded to the coprocessor style for
-split values between 0.0 and 1.0 inclusive. While this fraction of
-work is running on the coprocessor, other calculations will run on the
-host, including neighbor and pair calculations that are not offloaded,
-angle, bond, dihedral, kspace, and some MPI communications. If
-{split} is set to -1, the fraction of work is dynamically adjusted
+style"_pair_style.html work offloaded to the coprocessor for split
+values between 0.0 and 1.0 inclusive. While this fraction of work is
+running on the coprocessor, other calculations will run on the host,
+including neighbor and pair calculations that are not offloaded, as
+well as angle, bond, dihedral, kspace, and some MPI communications.
+If {split} is set to -1, the fraction of work is dynamically adjusted
 automatically throughout the run. This typically gives performance
 within 5 to 10 percent of the optimal fixed fraction.
 
@@ -361,21 +397,28 @@ and force calculations. When the value = "no", ghost atoms are not
 offloaded. This option can reduce the amount of data transfer with
 the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the "newton pair"_newton.html
-setting is "on". When the value = "ues", ghost atoms are offloaded.
+setting is "on". When the value = "yes", ghost atoms are offloaded.
 In some cases this can provide better performance, especially if the
 {balance} fraction is high.
 
-The {tpc} keyword sets the maximum # of threads {Ntpc} that will
-run on each physical core of the coprocessor. The default value is
-set to 4, which is the number of hardware threads per core supported
-by the current generation Xeon Phi chips.
+The {tpc} keyword sets the max # of coprocessor threads {Ntpc} that
+will run on each core of the coprocessor. The default value = 4,
+which is the number of hardware threads per core supported by the
+current generation Xeon Phi chips.
 
-The {tptask} keyword sets the maximum # of threads (Ntptask} that will
-be used on the coprocessor for each MPI task. This, along with the
-{tpc} keyword setting, are the only methods for changing the number of
-threads used on the coprocessor. The default value is set to 240 =
-60*4, which is the maximum # of threads supported by an entire current
-generation Xeon Phi chip.
+The {tptask} keyword sets the max # of coprocessor threads {Ntptask}
+assigned to each MPI task. The default value = 240, which is the
+total # of threads an entire current generation Xeon Phi chip can run
+(240 = 60 cores * 4 threads/core). This means each MPI task assigned
+to the Phi will have enough threads for the chip to run the max allowed,
+even if only 1 MPI task is assigned. If 8 MPI tasks are assigned to
+the Phi, each will run with 30 threads. If you wish to limit the
+number of threads per MPI task, set {tptask} to a smaller value.
+E.g. for {tptask} = 16, if 8 MPI tasks are assigned, each will run
+with 16 threads, for a total of 128.
+
+Note that the default settings for {tpc} and {tptask} are fine for
+most problems, regardless of how many MPI tasks you assign to a Phi.
 
 :line
 
@@ -580,15 +623,16 @@ must invoke the package gpu command in your input script or via the
 "-pk gpu" "command-line switch"_Section_start.html#start_7.
 
 For the USER-INTEL package, the default is Nphi = 1 and the option
-defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note
-that all of these settings, except "prec", are ignored if LAMMPS was
-not built with Xeon Phi coprocessor support. The default ghost option
-is determined by the pair style being used. This value is output to
-the screen in the offload report at the end of each run. These
-settings are made automatically if the "-sf intel" "command-line
-switch"_Section_start.html#start_7 is used. If it is not used, you
-must invoke the package intel command in your input script or or via
-the "-pk intel" "command-line switch"_Section_start.html#start_7.
+defaults are omp = 0, mode = mixed, balance = -1, tpc = 4, tptask =
+240. The default ghost option is determined by the pair style being
+used. This value is output to the screen in the offload report at the
+end of each run. Note that all of these settings, except "omp" and
+"mode", are ignored if LAMMPS was not built with Xeon Phi coprocessor
+support. These settings are made automatically if the "-sf intel"
+"command-line switch"_Section_start.html#start_7 is used. If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" "command-line
+switch"_Section_start.html#start_7.
 
 For the KOKKOS package, the option defaults neigh = full, newton =
 off, binsize = 0.0, and comm = host. These settings are made