git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12461 f3b2605a-c512-4ea7-a41b-209d697bcdaa

2014-09-09 22:50:42 +00:00 · 2014-09-09 22:50:42 +00:00 · 4d9d81fe69
parent ca8fd22c19
commit 4d9d81fe69
4 changed files with 205 additions and 174 deletions
--- a/doc/Section_accelerate.html
+++ b/doc/Section_accelerate.html
@ -1390,8 +1390,9 @@ steps:
 coprocessor case can be done using the "-pk omp" and "-sf intel" and
 "-pk intel" <A HREF = "Section_start.html#start_7">command-line switches</A>
 respectively.  Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the <A HREF = "package.html">package intel</A> or <A HREF = "suffix.html">suffix
-intel</A> commands respectively to your input script.
+duplicated by adding the <A HREF = "package.html">package omp</A> or <A HREF = "suffix.html">suffix
+intel</A> or <A HREF = "package.html">package intel</A> commands
+respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
@ -1470,9 +1471,10 @@ maximum number of threads is also reduced.
 which will automatically append "intel" to styles that support it.  If
 a style does not support it, a "omp" suffix is tried next.  Use the
 "-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line switch</A>, to set
-Nt = # of OpenMP threads per MPI task to use.  Use the "-pk intel Nt
-Nphi" <A HREF = "Section_start.html#start_7">command-line switch</A> to set Nphi = #
-of Xeon Phi(TM) coprocessors/node.  
+Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
+the USER-OMP package.  Use the "-pk intel Nt Nphi" <A HREF = "Section_start.html#start_7">command-line
+switch</A> to set Nphi = # of Xeon Phi(TM)
+coprocessors/node, if LAMMPS was built with coprocessor support.
 </P>
 <PRE>CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script                 # 1 MPI task
@ -1494,8 +1496,9 @@ mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.scrip
 default commands: <A HREF = "package.html">package omp 0</A> and <A HREF = "package.html">package intel
 1</A> command.  These set the number of OpenMP threads per
 MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1.  The latter is ignored is
-LAMMPS was not built with coprocessor support.
+of Xeon Phi(TM) coprocessors/node to 1.  The former is ignored if
+LAMMPS was not built with the USER-OMP package.  The latter is ignored
+if LAMMPS was not built with coprocessor support.
 </P>
 <P>Using the "-pk omp" switch explicitly allows for direct setting of the
 number of OpenMP threads per MPI task, and additional options.  Using
--- a/doc/Section_accelerate.txt
+++ b/doc/Section_accelerate.txt
@ -1385,8 +1385,9 @@ The latter two steps in the first case and the last step in the
 coprocessor case can be done using the "-pk omp" and "-sf intel" and
 "-pk intel" "command-line switches"_Section_start.html#start_7
 respectively.  Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the "package intel"_package.html or "suffix
-intel"_suffix.html commands respectively to your input script.
+duplicated by adding the "package omp"_package.html or "suffix
+intel"_suffix.html or "package intel"_package.html commands
+respectively to your input script.

 [Required hardware/software:]

@ -1465,9 +1466,10 @@ Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
 which will automatically append "intel" to styles that support it.  If
 a style does not support it, a "omp" suffix is tried next.  Use the
 "-pk omp Nt" "command-line switch"_Section_start.html#start_7, to set
-Nt = # of OpenMP threads per MPI task to use.  Use the "-pk intel Nt
-Nphi" "command-line switch"_Section_start.html#start_7 to set Nphi = #
-of Xeon Phi(TM) coprocessors/node.  
+Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with
+the USER-OMP package.  Use the "-pk intel Nt Nphi" "command-line
+switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
+coprocessors/node, if LAMMPS was built with coprocessor support.

 CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script                 # 1 MPI task
@ -1489,8 +1491,9 @@ Note that if the "-sf intel" switch is used, it also issues two
 default commands: "package omp 0"_package.html and "package intel
 1"_package.html command.  These set the number of OpenMP threads per
 MPI task via the OMP_NUM_THREADS environment variable, and the number
-of Xeon Phi(TM) coprocessors/node to 1.  The latter is ignored is
-LAMMPS was not built with coprocessor support.
+of Xeon Phi(TM) coprocessors/node to 1.  The former is ignored if
+LAMMPS was not built with the USER-OMP package.  The latter is ignored
+if LAMMPS was not built with coprocessor support.

 Using the "-pk omp" switch explicitly allows for direct setting of the
 number of OpenMP threads per MPI task, and additional options.  Using
--- a/doc/package.html
+++ b/doc/package.html
@ -50,20 +50,23 @@
        size = bin size for neighbor list construction (distance units)
      <I>device</I> value = device_type
        device_type = <I>kepler</I> or <I>fermi</I> or <I>cypress</I> or <I>generic</I>
-  <I>intel</I> args = Nthreads precision keyword value ...
-    Nthreads = # of OpenMP threads to associate with each MPI process on host
-    precision = <I>single</I> or <I>mixed</I> or <I>double</I>
-    keywords = <I>balance</I> or <I>offload_cards</I> or <I>offload_ghost</I> or <I>offload_tpc</I> or <I>offload_threads</I>
+  <I>intel</I> args = Nphi keyword value ...
+    Nphi = # of coprocessors per node
+    zero or more keyword/value pairs may be appended 
+    keywords = <I>prec</I> or <I>balance</I> or <I>ghost</I> or <I>tpc</I> or <I>tptask</I>
+      <I>prec</I> value = <I>single</I> or <I>mixed</I> or <I>double</I>
+        single = perform force calculations in single precision
+        mixed = perform force calculations in mixed precision
+        double = perform force calculations in double precision
     <I>balance</I> value = split
       split = fraction of work to offload to coprocessor, -1 for dynamic
-     <I>offload_cards</I> value = ncops
-       ncops = number of coprocessors to use on each node
-     <I>offload_ghost</I> value = offload_type
-       offload_type = 1 to include ghost atoms for offload, 0 for local only
-     <I>offload_tpc</I> value = tpc
-       tpc = number of threads to use on each core of coprocessor
-     <I>offload_threads</I> value = tptask
-       tptask = max number of threads to use on coprocessor for each MPI task
+     <I>ghost</I> value = <I>yes</I> or <I>no</I>
+       yes = include ghost atoms for offload
+       no = do not include ghost atoms for offload
+     <I>tpc</I> value = Ntpc
+       Ntpc = number of threads to use on each physical core of coprocessor
+     <I>tptask</I> value = Ntptask
+       Ntptask = max number of threads to use on coprocessor for each MPI task
  <I>kokkos</I> args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = <I>neigh</I> or <I>comm/exchange</I> or <I>comm/forward</I>
@ -171,8 +174,8 @@ default value, it is usually not necessary to use this keyword.
 </P>
 <HR>

-<P>The <I>gpu</I> style invokes settings settings associated with the use of
-the GPU package.
+<P>The <I>gpu</I> style invokes settings associated with the use of the GPU
+package.
 </P>
 <P>The <I>Ngpu</I> argument sets the number of GPUs per node.  There must be
 at least as many MPI tasks per node as GPUs, as set by the mpirun or
@ -264,65 +267,64 @@ lib/gpu/Makefile that is used.
 </P>
 <HR>

-<P>The <I>intel</I> style invokes options associated with the use of the
-USER-INTEL package.
+<P>The <I>intel</I> style invokes settings associated with the use of the
+USER-INTEL package.  If LAMMPS was built with the USER-INTEL package
+but without Xeon Phi coprocessor support, all of its settings except
+the <I>prec</I> keyword are ignored.  If LAMMPS was built with
+coprocessor support, all of its settings, including the <I>prec</I>
+keyword, are applicable.
 </P>
-<P>The <I>Nthread</I> argument allows to one explicitly set the number of
-OpenMP threads to be allocated for each MPI process, An <I>Nthreads</I>
-value of '*' instructs LAMMPS to use whatever is the default for the
-given OpenMP environment. This is usually determined via the
-OMP_NUM_THREADS environment variable or the compiler runtime.
+<P>The <I>Nphi</I> argument sets the number of coprocessors per node.
 </P>
-<P>The <I>precision</I> argument determines the precision mode to use and can
-take values of <I>single</I> (intel styles use single precision for all
-calculations), <I>mixed</I> (intel styles use double precision for
-accumulation and storage of forces, torques, energies, and virial
-terms and single precision for everything else), or <I>double</I> (intel
-styles use double precision for all calculations).
+<P>Optional keyword/value pairs can also be specified.  Each has a
+default value as listed below.
 </P>
-<P>Additional keyword-value pairs are available that are used to
-determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is
-built without offload support, these values are ignored. The
-additional settings are as follows:
+<P>The <I>prec</I> keyword argument determines the precision mode to use for
+computing pair style forces, either on the CPU or on the coprocessor,
+when using a USER-INTEL supported <A HREF = "pair_style.html">pair style</A>.  It
+can take a value of <I>single</I>, <I>mixed</I> which is the default, or
+<I>double</I>.  <I>Single</I> means single precision is used for the entire
+force calculation.  <I>Mixed</I> means forces between a pair of atoms are
+computed in single precision, but accumulated and stored in double
+precision, including storage of forces, torques, energies, and virial
+quantities.  <I>Double</I> means double precision is used for the entire
+force calculation.
 </P>
-<P>The <I>balance</I> setting is used to set the fraction of work offloaded to
-the coprocessor for an intel style (in the inclusive range 0.0 to
-1.0).  While this fraction of work is running on the coprocessor, other
-calculations will run on the host, including neighbor and pair
-calculations that are not offloaded, angle, bond, dihedral, kspace,
-and some MPI communications. If the balance is set to -1, the fraction
-of work is dynamically adjusted automatically throughout the run. This
-can typically give performance within 5 to 10 percent of the optimal
-fixed fraction.
+<P>The <I>balance</I> keyword sets the fraction of <A HREF = "pair_style.html">pair
+style</A> work offloaded to the coprocessor for
+split values between 0.0 and 1.0 inclusive.  While this fraction of
+work is running on the coprocessor, other calculations will run on the
+host, including neighbor and pair calculations that are not offloaded,
+angle, bond, dihedral, kspace, and some MPI communications.  If
+<I>split</I> is set to -1, the fraction of work is dynamically adjusted
+automatically throughout the run.  This typically gives performance
+within 5 to 10 percent of the optimal fixed fraction.
 </P>
-<P>The <I>offload_cards</I> setting determines the number of coprocessors to
-use on each node.
-</P>
-<P>Additional options for fine tuning performance with offload are as
-follows:
-</P>
-<P>The <I>offload_ghost</I> setting determines whether or not ghost atoms,
-atoms at the borders between MPI tasks, are offloaded for neighbor and
-force calculations. When set to "0", ghost atoms are not offloaded.
-This option can reduce the amount of data transfer with the
-coprocessor and also can overlap MPI communication of forces with
+<P>The <I>ghost</I> keyword determines whether or not ghost atoms, i.e. atoms
+at the boundaries of processor sub-domains, are offloaded for neighbor
+and force calculations.  When the value = "no", ghost atoms are not
+offloaded.  This option can reduce the amount of data transfer with
+the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the <A HREF = "newton.html">newton pair</A>
-setting is "on".  When set to "1", ghost atoms are offloaded. In some
-cases this can provide better performance, especially if the offload
-fraction is high.
+setting is "on".  When the value = "yes", ghost atoms are offloaded.
+In some cases this can provide better performance, especially if the
+<I>balance</I> fraction is high.
 </P>
-<P>The <I>offload_tpc</I> option sets the maximum number of threads that will
-run on each core of the coprocessor.
+<P>The <I>tpc</I> keyword sets the maximum # of threads <I>Ntpc</I> that will
+run on each physical core of the coprocessor.  The default value is
+set to 4, which is the number of hardware threads per core supported
+by the current generation Xeon Phi chips.
 </P>
-<P>The <I>offload_threads</I> option sets the maximum number of threads that
-will be used on the coprocessor for each MPI task. This, along with
-the <I>offload_tpc</I> setting, are the only methods for changing the
-number of threads on the coprocessor. The OMP_NUM_THREADS keyword and
-<I>Nthreads</I> options are only used for threads on the host.
+<P>The <I>tptask</I> keyword sets the maximum # of threads <I>Ntptask</I> that will
+be used on the coprocessor for each MPI task.  This, along with the
+<I>tpc</I> keyword setting, are the only methods for changing the number of
+threads used on the coprocessor.  The default value is set to 240 =
+60*4, which is the maximum # of threads supported by an entire current
+generation Xeon Phi chip.
 </P>
 <HR>

-<P>The <I>kokkos</I> style invokes options associated with the use of the
+<P>The <I>kokkos</I> style invokes settings associated with the use of the
 KOKKOS package.
 </P>
 <P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
@ -466,35 +468,45 @@ setting</A>
 </P>
 <P><B>Default:</B>
 </P>
-<P>To use the USER-CUDA package, the package command must be invoked
-explicitly, either via the "-pk cuda" <A HREF = "Section_start.html#start_7">command-line
-switch</A> or by invoking the package cuda
-command in your input script.  This will set the # of GPUs/node.  The
-options defaults are gpuID = 0 to Ngpu-1, timing not enabled, test not
-enabled, and thread = auto.
+<P>To use the USER-CUDA package, the package cuda command must be invoked
+explicitly in your input script or via the "-pk cuda" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.  This will set the # of GPUs/node.
+The option defaults are gpuID = 0 to Ngpu-1, timing = not enabled,
+test = not enabled, and thread = auto.
 </P>
 <P>For the GPU package, the default is Ngpu = 1 and the option defaults
 are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize =
 pair cutoff + neighbor skin, device = not used.  These settings are
-made if the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>
-is used.  If it is not used, you must invoke the package gpu command
-in your input script.
+made automatically if the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line
+switch</A> is used.  If it is not used, you
+must invoke the package gpu command in your input script or via the
+"-pk gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>.
 </P>
-<P>The default settings for the USER-INTEL package are "package intel *
-mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
-The <I>offload_ghost</I> default setting is determined by the intel style
-being used.  The value used is output to the screen in the offload
-report at the end of each run.
+<P>For the USER-INTEL package, the default is Nphi = 1 and the option
+defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240.  The
+default ghost option is determined by the pair style being used.  The
+value used is output to the screen in the offload report at the end of
+each run.  These settings are made automatically if the "-sf intel"
+<A HREF = "Section_start.html#start_7">command-line switch</A> is used.  If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.
 </P>
 <P>The default settings for the KOKKOS package are "package kokkos neigh
 full comm/exchange host comm/forward host".  This is the case whether
 the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
 or not.
+To use the KOKKOS package, the package kokkos command must be invoked
+explicitly in your input script or via the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.  This will set the # of GPUs/node.
+The option defaults are gpuID = 0 to Ngpu-1, timing = not enabled,
+test = not enabled, and thread = auto.
 </P>
 <P>For the OMP package, the default is Nthreads = 0 and the option
-defaults are neigh = yes.  These settings are made if the "-sf omp"
-<A HREF = "Section_start.html#start_7">command-line switch</A> is used.  If it is
-not used, you must invoke the package omp command in your input
-script.
+defaults are neigh = yes.  These settings are made automatically if
+the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
+used.  If it is not used, you must invoke the package omp command in
+your input script or via the "-pk omp" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.
 </P>
 </HTML>
--- a/doc/package.txt
+++ b/doc/package.txt
@ -45,20 +45,23 @@ args = arguments specific to the style :l
        size = bin size for neighbor list construction (distance units)
      {device} value = device_type
        device_type = {kepler} or {fermi} or {cypress} or {generic}
-  {intel} args = Nthreads precision keyword value ...
-    Nthreads = # of OpenMP threads to associate with each MPI process on host
-    precision = {single} or {mixed} or {double}
-    keywords = {balance} or {offload_cards} or {offload_ghost} or {offload_tpc} or {offload_threads}
+  {intel} args = Nphi keyword value ...
+    Nphi = # of coprocessors per node
+    zero or more keyword/value pairs may be appended 
+    keywords = {prec} or {balance} or {ghost} or {tpc} or {tptask}
+      {prec} value = {single} or {mixed} or {double}
+        single = perform force calculations in single precision
+        mixed = perform force calculations in mixed precision
+        double = perform force calculations in double precision
     {balance} value = split
       split = fraction of work to offload to coprocessor, -1 for dynamic
-     {offload_cards} value = ncops
-       ncops = number of coprocessors to use on each node
-     {offload_ghost} value = offload_type
-       offload_type = 1 to include ghost atoms for offload, 0 for local only
-     {offload_tpc} value = tpc
-       tpc = number of threads to use on each core of coprocessor
-     {offload_threads} value = tptask
-       tptask = max number of threads to use on coprocessor for each MPI task
+     {ghost} value = {yes} or {no}
+       yes = include ghost atoms for offload
+       no = do not include ghost atoms for offload
+     {tpc} value = Ntpc
+       Ntpc = number of threads to use on each physical core of coprocessor
+     {tptask} value = Ntptask
+       Ntptask = max number of threads to use on coprocessor for each MPI task
  {kokkos} args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = {neigh} or {comm/exchange} or {comm/forward}
@ -165,8 +168,8 @@ default value, it is usually not necessary to use this keyword.

 :line

-The {gpu} style invokes settings settings associated with the use of
-the GPU package.
+The {gpu} style invokes settings associated with the use of the GPU
+package.

 The {Ngpu} argument sets the number of GPUs per node.  There must be
 at least as many MPI tasks per node as GPUs, as set by the mpirun or
@ -258,65 +261,64 @@ lib/gpu/Makefile that is used.

 :line

-The {intel} style invokes options associated with the use of the
-USER-INTEL package.
+The {intel} style invokes settings associated with the use of the
+USER-INTEL package.  If LAMMPS was built with the USER-INTEL package
+but without Xeon Phi coprocessor support, all of its settings except
+the {prec} keyword are ignored.  If LAMMPS was built with
+coprocessor support, all of its settings, including the {prec}
+keyword, are applicable.

-The {Nthread} argument allows to one explicitly set the number of
-OpenMP threads to be allocated for each MPI process, An {Nthreads}
-value of '*' instructs LAMMPS to use whatever is the default for the
-given OpenMP environment. This is usually determined via the
-OMP_NUM_THREADS environment variable or the compiler runtime.
+The {Nphi} argument sets the number of coprocessors per node.

-The {precision} argument determines the precision mode to use and can
-take values of {single} (intel styles use single precision for all
-calculations), {mixed} (intel styles use double precision for
-accumulation and storage of forces, torques, energies, and virial
-terms and single precision for everything else), or {double} (intel
-styles use double precision for all calculations).
+Optional keyword/value pairs can also be specified.  Each has a
+default value as listed below.

-Additional keyword-value pairs are available that are used to
-determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is
-built without offload support, these values are ignored. The
-additional settings are as follows:
+The {prec} keyword argument determines the precision mode to use for
+computing pair style forces, either on the CPU or on the coprocessor,
+when using a USER-INTEL supported "pair style"_pair_style.html.  It
+can take a value of {single}, {mixed} which is the default, or
+{double}.  {Single} means single precision is used for the entire
+force calculation.  {Mixed} means forces between a pair of atoms are
+computed in single precision, but accumulated and stored in double
+precision, including storage of forces, torques, energies, and virial
+quantities.  {Double} means double precision is used for the entire
+force calculation.

-The {balance} setting is used to set the fraction of work offloaded to
-the coprocessor for an intel style (in the inclusive range 0.0 to
-1.0).  While this fraction of work is running on the coprocessor, other
-calculations will run on the host, including neighbor and pair
-calculations that are not offloaded, angle, bond, dihedral, kspace,
-and some MPI communications. If the balance is set to -1, the fraction
-of work is dynamically adjusted automatically throughout the run. This
-can typically give performance within 5 to 10 percent of the optimal
-fixed fraction.
+The {balance} keyword sets the fraction of "pair
+style"_pair_style.html work offloaded to the coprocessor for
+split values between 0.0 and 1.0 inclusive.  While this fraction of
+work is running on the coprocessor, other calculations will run on the
+host, including neighbor and pair calculations that are not offloaded,
+angle, bond, dihedral, kspace, and some MPI communications.  If
+{split} is set to -1, the fraction of work is dynamically adjusted
+automatically throughout the run.  This typically gives performance
+within 5 to 10 percent of the optimal fixed fraction.

-The {offload_cards} setting determines the number of coprocessors to
-use on each node.
-
-Additional options for fine tuning performance with offload are as
-follows:
-
-The {offload_ghost} setting determines whether or not ghost atoms,
-atoms at the borders between MPI tasks, are offloaded for neighbor and
-force calculations. When set to "0", ghost atoms are not offloaded.
-This option can reduce the amount of data transfer with the
-coprocessor and also can overlap MPI communication of forces with
+The {ghost} keyword determines whether or not ghost atoms, i.e. atoms
+at the boundaries of processor sub-domains, are offloaded for neighbor
+and force calculations.  When the value = "no", ghost atoms are not
+offloaded.  This option can reduce the amount of data transfer with
+the coprocessor and can also overlap MPI communication of forces with
 computation on the coprocessor when the "newton pair"_newton.html
-setting is "on".  When set to "1", ghost atoms are offloaded. In some
-cases this can provide better performance, especially if the offload
-fraction is high.
+setting is "on".  When the value = "yes", ghost atoms are offloaded.
+In some cases this can provide better performance, especially if the
+{balance} fraction is high.

-The {offload_tpc} option sets the maximum number of threads that will
-run on each core of the coprocessor.
+The {tpc} keyword sets the maximum # of threads {Ntpc} that will
+run on each physical core of the coprocessor.  The default value is
+set to 4, which is the number of hardware threads per core supported
+by the current generation Xeon Phi chips.

-The {offload_threads} option sets the maximum number of threads that
-will be used on the coprocessor for each MPI task. This, along with
-the {offload_tpc} setting, are the only methods for changing the
-number of threads on the coprocessor. The OMP_NUM_THREADS keyword and
-{Nthreads} options are only used for threads on the host.
+The {tptask} keyword sets the maximum # of threads {Ntptask} that will
+be used on the coprocessor for each MPI task.  This, along with the
+{tpc} keyword setting, are the only methods for changing the number of
+threads used on the coprocessor.  The default value is set to 240 =
+60*4, which is the maximum # of threads supported by an entire current
+generation Xeon Phi chip.

 :line

-The {kokkos} style invokes options associated with the use of the
+The {kokkos} style invokes settings associated with the use of the
 KOKKOS package.

 The {neigh} keyword determines what kinds of neighbor lists are built.
@ -460,33 +462,44 @@ setting"_Section_start.html#start_7

 [Default:]

-To use the USER-CUDA package, the package command must be invoked
-explicitly, either via the "-pk cuda" "command-line
-switch"_Section_start.html#start_7 or by invoking the package cuda
-command in your input script.  This will set the # of GPUs/node.  The
-options defaults are gpuID = 0 to Ngpu-1, timing not enabled, test not
-enabled, and thread = auto.
+To use the USER-CUDA package, the package cuda command must be invoked
+explicitly in your input script or via the "-pk cuda" "command-line
+switch"_Section_start.html#start_7.  This will set the # of GPUs/node.
+The option defaults are gpuID = 0 to Ngpu-1, timing = not enabled,
+test = not enabled, and thread = auto.

 For the GPU package, the default is Ngpu = 1 and the option defaults
 are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize =
 pair cutoff + neighbor skin, device = not used.  These settings are
-made if the "-sf gpu" "command-line switch"_Section_start.html#start_7
-is used.  If it is not used, you must invoke the package gpu command
-in your input script.
+made automatically if the "-sf gpu" "command-line
+switch"_Section_start.html#start_7 is used.  If it is not used, you
+must invoke the package gpu command in your input script or via the
+"-pk gpu" "command-line switch"_Section_start.html#start_7.

-The default settings for the USER-INTEL package are "package intel *
-mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240".
-The {offload_ghost} default setting is determined by the intel style
-being used.  The value used is output to the screen in the offload
-report at the end of each run.
+For the USER-INTEL package, the default is Nphi = 1 and the option
+defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240.  The
+default ghost option is determined by the pair style being used.  The
+value used is output to the screen in the offload report at the end of
+each run.  These settings are made automatically if the "-sf intel"
+"command-line switch"_Section_start.html#start_7 is used.  If it is
+not used, you must invoke the package intel command in your input
+script or via the "-pk intel" "command-line
+switch"_Section_start.html#start_7.

 The default settings for the KOKKOS package are "package kokkos neigh
 full comm/exchange host comm/forward host".  This is the case whether
 the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
 or not.
+To use the KOKKOS package, the package kokkos command must be invoked
+explicitly in your input script or via the "-pk kokkos" "command-line
+switch"_Section_start.html#start_7.  This will set the # of GPUs/node.
+The option defaults are gpuID = 0 to Ngpu-1, timing = not enabled,
+test = not enabled, and thread = auto.

 For the OMP package, the default is Nthreads = 0 and the option
-defaults are neigh = yes.  These settings are made if the "-sf omp"
-"command-line switch"_Section_start.html#start_7 is used.  If it is
-not used, you must invoke the package omp command in your input
-script.
+defaults are neigh = yes.  These settings are made automatically if
+the "-sf omp" "command-line switch"_Section_start.html#start_7 is
+used.  If it is not used, you must invoke the package omp command in
+your input script or via the "-pk omp" "command-line
+switch"_Section_start.html#start_7.
+
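
Migration note: this commit changes the "package intel" syntax from "Nthreads precision keyword value ..." to "Nphi keyword value ...", so input scripts written for the old form need rewriting. The sketch below shows one way the renames in the diff could be applied mechanically. It is a hypothetical helper, not part of LAMMPS, and the function name migrate_package_intel is invented for illustration. It assumes that the old host thread count belongs in a separate "package omp" command, and that the old '*' value corresponds to "package omp 0" (which the text above says defers to OMP_NUM_THREADS). Check both assumptions against your build before relying on the output.

```python
"""Hypothetical migration helper for old-style 'package intel' lines.

Not part of LAMMPS; the keyword renames follow the diff above.
"""

# Old keyword -> new keyword, for keywords whose value is passed through unchanged.
RENAMED = {
    "balance": "balance",
    "offload_tpc": "tpc",
    "offload_threads": "tptask",
}


def migrate_package_intel(line):
    """Translate one old-style command into a list of new-style commands."""
    words = line.split()
    if words[:2] != ["package", "intel"] or len(words) < 4:
        raise ValueError("expected: package intel Nthreads precision keyword value ...")

    nthreads, precision, rest = words[2], words[3], words[4:]
    if precision not in ("single", "mixed", "double"):
        raise ValueError(f"unknown precision: {precision}")
    if len(rest) % 2 != 0:
        raise ValueError("keywords must come in keyword/value pairs")

    nphi = "1"  # documented default number of coprocessors per node
    options = ["prec", precision]  # precision is now the 'prec' keyword
    for key, value in zip(rest[0::2], rest[1::2]):
        if key == "offload_cards":
            nphi = value  # now the leading Nphi argument
        elif key == "offload_ghost":
            if value not in ("0", "1"):
                raise ValueError(f"offload_ghost must be 0 or 1, got {value}")
            options += ["ghost", "yes" if value == "1" else "no"]
        elif key in RENAMED:
            options += [RENAMED[key], value]
        else:
            raise ValueError(f"unknown old-style keyword: {key}")

    # Assumption: '*' (use the OpenMP default) maps to 'package omp 0'.
    omp_threads = "0" if nthreads == "*" else nthreads
    return [
        f"package omp {omp_threads}",
        " ".join(["package", "intel", nphi] + options),
    ]


if __name__ == "__main__":
    old = "package intel * mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240"
    for new_line in migrate_package_intel(old):
        print(new_line)
```

Run on the old default string quoted in the removed lines, this prints "package omp 0" followed by "package intel 1 prec mixed balance -1 tpc 4 tptask 240", matching the new defaults listed above. The "package omp" line only has an effect if LAMMPS was built with the USER-OMP package.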