git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12048 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2014-05-29 23:07:14 +00:00
parent 0fc38af7a3
commit 85d9a319d7
6 changed files with 154 additions and 136 deletions

View File

@ -683,19 +683,20 @@ occurs, the faster your simulation will run.
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
</P>
<P>Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
<P><A HREF = "http://trilinos.sandia.gov/packages/kokkos">Kokkos</A> is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
</P>
<P>Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.
</P>
<P>CT NOTE: Pointer to Kokkos web page???
<P>Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
</P>
<P>These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@ -727,9 +728,11 @@ i.e. the host and device are the same.
<P>IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.
</P>
<P>CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models; in particular, relatively extensive C++
support is required for the kernel language. This is expected to
change in the future.
</P>
<P>Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@ -857,8 +860,8 @@ communication can provide a speed-up for specific calculations.
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k <A HREF = "Section_start.html#start_7">command-line
switch</A>. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch</A>. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
</P>
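<P>For example, on a node with 16 physical cores, a sketch of a run with
2 MPI tasks and 8 threads per task would be (the executable and input
file names here are placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_kokkos_omp -k on t 8 -sf kk -in in.lj
</PRE>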
<P>You can compare the performance running in different modes:
@ -878,9 +881,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
<A HREF = "Section_start.html#start_3_4">this section</A> of the manual for details.
</P>
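<P>For example, for a GPU with compute capability 3.5 (Kepler), the
-arch setting would look like the following, shown here appended to the
compiler flags; the exact variable name in src/MAKE/Makefile.cuda may
differ in your version:
</P>
<PRE>CCFLAGS = -O3 -arch=sm_35
</PRE>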
<P>The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
<P>The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
</P>
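<P>For example, on nodes with 2 physical GPUs, a sketch of the launch
would be the following, using the -kokkos switch described next to
request both GPUs (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj
</PRE>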
<P>Use the <A HREF = "Section_commands.html#start_7">-kokkos command-line switch</A> to
specify the number of GPUs per node, and the number of threads per MPI
@ -936,9 +938,19 @@ will be added later.
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.
</P>
<P>Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:
</P>
<PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
</PRE>
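<P>For example, on a dual-socket node with 8 cores per socket, a sketch
combining the OpenMP binding variable with socket binding of the MPI
tasks would be (the executable, thread count, and input file are
illustrative):
</P>
<PRE>export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj
</PRE>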
<P>When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
@ -956,8 +968,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
</P>
<P>CT NOTE: other performance tips??
</P>
<HR>
<HR>

View File

@ -679,19 +679,20 @@ The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and methods and macros provided by the Kokkos
library, which is included with LAMMPS in lib/kokkos.
Kokkos is a C++ library that provides two key abstractions for an
application like LAMMPS. First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware (GPU, Intel Phi, many-core chip).
"Kokkos"_http://trilinos.sandia.gov/packages/kokkos is a C++ library
that provides two key abstractions for an application like LAMMPS.
First, it allows a single implementation of an application kernel
(e.g. a pair style) to run efficiently on different kinds of hardware
(GPU, Intel Phi, many-core chip).
Second, it adjusts the memory layout of basic data structures like 2d
and 3d arrays specifically for the chosen hardware. These are used in
LAMMPS to store atom coordinates or forces or neighbor lists. The
layout is chosen to optimize performance on different platforms.
Again this operation is hidden from the developer, and does not affect
how the single implementation of the kernel is coded.
CT NOTE: Pointer to Kokkos web page???
Second, it provides data abstractions to adjust (at compile time) the
memory layout of basic data structures like 2d and 3d arrays and allow
the transparent utilization of special hardware load and store units.
Such data structures are used in LAMMPS to store atom coordinates or
forces or neighbor lists. The layout is chosen to optimize
performance on different platforms. Again this operation is hidden
from the developer, and does not affect how the single implementation
of the kernel is coded.
These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed. This is done by selecting a "host" and
@ -723,9 +724,11 @@ i.e. the host and device are the same.
IMPORTANT NOTE: Currently, if using GPUs, you should set the number
of MPI tasks per compute node to be equal to the number of GPUs per
compute node. In the future Kokkos will support assigning one GPU to
multiple MPI tasks or using multiple GPUs per MPI task.
CT NOTE: what about AMD GPUs running OpenCL? are they supported?
multiple MPI tasks or using multiple GPUs per MPI task. Currently
Kokkos does not support AMD GPUs due to limits in the available
backend programming models; in particular, relatively extensive C++
support is required for the kernel language. This is expected to
change in the future.
Here are several examples of how to build LAMMPS and run a simulation
using the KOKKOS package for typical compute node configurations.
@ -853,8 +856,8 @@ If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N. Note that the default threads/task is 1, as set by
the "t" keyword of the -k "command-line
switch"_Section_start.html#start_7. If you do not change this, there
will no additional parallelism (beyond MPI) invoked on the host
switch"_Section_start.html#start_7. If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).
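For example, on a node with 16 physical cores, a sketch of a run with
2 MPI tasks and 8 threads per task would be (the executable and input
file names here are placeholders):
mpirun -np 2 ./lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre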
You can compare the performance running in different modes:
@ -874,9 +877,8 @@ software installation. Insure the -arch setting in
src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see
"this section"_Section_start.html#start_3_4 of the manual for details.
The -np setting of the mpirun command must set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node. CT
NOTE: does LAMMPS enforce this?
The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
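For example, on nodes with 2 physical GPUs, a sketch of the launch
would be the following, using the -kokkos switch described next to
request both GPUs (the executable and input file names are
placeholders):
mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj :pre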
Use the "-kokkos command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
@ -932,8 +934,18 @@ When using threads (OpenMP or pthreads), it is important for
performance to bind the threads to physical cores, so they do not
migrate during a simulation. The same is true for MPI tasks, but the
default binding rules implemented for various MPI versions do not
account for thread binding. Thus you should do the following if using
multiple threads per MPI task. CT NOTE: explain what to do.
account for thread binding.
Thus if you use more than one thread per MPI task, you should insure
MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
environment variables from the OpenMP runtime when using OpenMP and
compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
4.7 or later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. A typical mpirun command
should set these flags:
OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
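For example, on a dual-socket node with 8 cores per socket, a sketch
combining the OpenMP binding variable with socket binding of the MPI
tasks would be (the executable, thread count, and input file are
illustrative):
export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj :pre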
When using a GPU, you will achieve the best performance if your input
script does not use any fix or compute styles which are not yet
@ -952,8 +964,6 @@ together to compute pairwise interactions with the KOKKOS package. We
hope to support this in the future, similar to the GPU package in
LAMMPS.
CT NOTE: other performance tips??
:line
:line

View File

@ -1259,13 +1259,13 @@ Ng = 1 and Ns is not set.
<P>Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables
</P>
<PRE>SLURM_LOCALID (???) CT NOTE: what MPI is this for?
<PRE>SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI)
</PRE>
<P>which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
<P>which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.
</P>
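<P>For example, with OpenMPI and 2 GPUs per node, the assignment works
out as sketched below (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 2 ./lmp_cuda -k on g 2 -sf kk -in in.lj
# rank with OMPI_COMM_WORLD_LOCAL_RANK = 0 is assigned GPU 0
# rank with OMPI_COMM_WORLD_LOCAL_RANK = 1 is assigned GPU 1
</PRE>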
<PRE>threads Nt
</PRE>
@ -1274,13 +1274,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.
</P>
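<P>For example, on a 16-core node running 4 MPI tasks per node, a sketch
would use Nt = 4 (the executable and input file names are
placeholders):
</P>
<PRE>mpirun -np 4 ./lmp_kokkos_omp -k on threads 4 -sf kk -in in.lj
</PRE>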
<PRE>numa Nm
</PRE>
<P>CT NOTE: what does numa set, and why use it?
</P>
<P>Explain. The default is Nm = 1.
<P>This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used, the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic, which is slow.
</P>
<PRE>-log file
</PRE>

View File

@ -1253,13 +1253,13 @@ Ng = 1 and Ns is not set.
Depending on which flavor of MPI you are running, LAMMPS will look for
one of these 3 environment variables
SLURM_LOCALID (???) CT NOTE: what MPI is this for?
SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre
which are initialized by "mpirun" or "mpiexec". The environment
variable setting for each MPI rank is used to assign a unique GPU ID
to the MPI task.
which are initialized by the "srun", "mpirun" or "mpiexec" commands.
The environment variable setting for each MPI rank is used to assign a
unique GPU ID to the MPI task.
threads Nt :pre
@ -1268,13 +1268,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode.
The default is Nt = 1, which essentially runs in MPI-only mode. If
there are Np MPI tasks per physical node, you generally want Np*Nt =
the number of physical cores per node, to use your available hardware
optimally.
optimally. This also sets the number of threads used by the host when
LAMMPS is compiled with CUDA=yes.
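For example, on a 16-core node running 4 MPI tasks per node, a sketch
would use Nt = 4 (the executable and input file names are
placeholders):
mpirun -np 4 ./lmp_kokkos_omp -k on threads 4 -sf kk -in in.lj :pre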
numa Nm :pre
CT NOTE: what does numa set, and why use it?
Explain. The default is Nm = 1.
This option is only relevant when using pthreads with hwloc support.
In this case Nm defines the number of NUMA regions (typically sockets)
on a node which will be utilized by a single MPI rank. By default Nm
= 1. If this option is used, the total number of worker-threads per
MPI rank is threads*numa. Currently it is almost always better to
assign at least one MPI rank per NUMA region, and leave numa set to
its default value of 1. This is because letting a single process span
multiple NUMA regions induces a significant amount of cross-NUMA data
traffic, which is slow.
-log file :pre

View File

@ -216,26 +216,24 @@ device type can be specified when building LAMMPS with the GPU library.
</P>
<HR>
<P>The <I>kk</I> style invokes options associated with the use of the
<P>The <I>kokkos</I> style invokes options associated with the use of the
KOKKOS package.
</P>
<P>The <I>neigh</I> keyword determines what kinds of neighbor lists are built.
A value of <I>half</I> uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of <I>half/thread</I> uses a threadsafe variant
of the half-neighbor list. It should be used instead of <I>half</I> when
running with threads on a CPU. A value of <I>full</I> uses a
pair styles in LAMMPS. A value of <I>half/thread</I> uses a threadsafe
variant of the half-neighbor list. It should be used instead of
<I>half</I> when running with threads on a CPU. A value of <I>full</I> uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the <I>half</I> option; however, that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of <I>full/cluster</I> is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the clusters particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. If also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of <I>full/cluster</I> is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
</P>
<P>The <I>comm/exchange</I> and <I>comm/forward</I> keywords determine whether the
host or device performs the packing and unpacking of data when
@ -254,26 +252,23 @@ packing/unpacking in parallel with threads. A value of <I>device</I> means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
</P>
<P>CT NOTE: please read this paragraph, to make sure it is correct:
</P>
<P>The optimal choice for these keywords depends on the input script and
the hardware used. The <I>no</I> value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values
should work identically. When using GPUs, the <I>device</I> value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the <I>device</I> value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be move between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the <I>host</I> value. Using <I>host</I> instead of
<I>no</I> will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the <I>host</I> and <I>device</I> values work
identically. When using GPUs, the <I>device</I> value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
<I>device</I> value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
<I>host</I> value. Using <I>host</I> instead of <I>no</I> will enable use of
multiple threads to pack/unpack communicated data.
</P>
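<P>For example, for a GPU run in which all styles in the input script are
Kokkos-enabled, a hypothetical input-script setting (written with the
style name as quoted in the defaults below) would be:
</P>
<PRE>package kk neigh full comm/exchange device comm/forward device
</PRE>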
<HR>
@ -354,13 +349,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
</P>
<P>CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.
</P>
<P>The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used
or not.
<P>The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A> is used or
not.
</P>
<P>If the "-sf omp" <A HREF = "Section_start.html#start_7">command-line switch</A> is
used then it is as if the command "package omp *" were invoked, to

View File

@ -210,26 +210,24 @@ device type can be specified when building LAMMPS with the GPU library.
:line
The {kk} style invokes options associated with the use of the
The {kokkos} style invokes options associated with the use of the
KOKKOS package.
The {neigh} keyword determines what kinds of neighbor lists are built.
A value of {half} uses half-neighbor lists, the same as used by most
pair styles in LAMMPS. This is the default when running without
threads on a CPU. A value of {half/thread} uses a threadsafe variant
of the half-neighbor list. It should be used instead of {half} when
running with threads on a CPU. A value of {full} uses a
pair styles in LAMMPS. A value of {half/thread} uses a threadsafe
variant of the half-neighbor list. It should be used instead of
{half} when running with threads on a CPU. A value of {full} uses a
full-neighborlist, i.e. f_ij and f_ji are both calculated. This
performs twice as much computation as the {half} option; however, that
can be a win because it is threadsafe and doesn't require atomic
operations. This is the default when running in threaded mode or on
GPUs. A value of {full/cluster} is an experimental neighbor style,
where particles interact with all particles within a small cluster, if
at least one of the clusters particles is within the neighbor cutoff
range. This potentially allows for better vectorization on
architectures such as the Intel Phi. If also reduces the size of the
neighbor list by roughly a factor of the cluster size, thus reducing
the total memory footprint considerably.
operations. A value of {full/cluster} is an experimental neighbor
style, where particles interact with all particles within a small
cluster, if at least one of the cluster's particles is within the
neighbor cutoff range. This potentially allows for better
vectorization on architectures such as the Intel Phi. It also reduces
the size of the neighbor list by roughly a factor of the cluster size,
thus reducing the total memory footprint considerably.
The {comm/exchange} and {comm/forward} keywords determine whether the
host or device performs the packing and unpacking of data when
@ -248,26 +246,23 @@ packing/unpacking in parallel with threads. A value of {device} means
to use the device, typically a GPU, to perform the packing/unpacking
operation.
CT NOTE: please read this paragraph, to make sure it is correct:
The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that Kokkos
code is working correctly. It may also be the fastest choice when
using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the {host} and {device} values
should work identically. When using GPUs, the {device} value will
typically be optimal if all of your styles used in your input script
are supported by the KOKKOS package. In this case data can stay on
the GPU for many timesteps without being moved between the host and
GPU, if you use the {device} value. This requires that your MPI is
able to access GPU memory directly. Currently that is true for
OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI.
If your script uses styles (e.g. fixes) which are not yet supported by
the KOKKOS package, then data has to be move between the host and
device anyway, so it is typically faster to let the host handle
communication, by using the {host} value. Using {host} instead of
{no} will enable use of multiple threads to pack/unpack communicated
data.
When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value will typically be
optimal if all of your styles used in your input script are supported
by the KOKKOS package. In this case data can stay on the GPU for many
timesteps without being moved between the host and GPU, if you use the
{device} value. This requires that your MPI is able to access GPU
memory directly. Currently that is true for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses
styles (e.g. fixes) which are not yet supported by the KOKKOS package,
then data has to be moved between the host and device anyway, so it is
typically faster to let the host handle communication, by using the
{host} value. Using {host} instead of {no} will enable use of
multiple threads to pack/unpack communicated data.
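For example, for a GPU run in which all styles in the input script are
Kokkos-enabled, a hypothetical input-script setting (written with the
style name as quoted in the defaults below) would be:
package kk neigh full comm/exchange device comm/forward device :pre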
:line
@ -348,13 +343,10 @@ invoked, to specify default settings for the GPU package. If the
command-line switch is not used, then no defaults are set, and you
must specify the appropriate package command in your input script.
CT NOTE: is this correct? The above sems to say the
choice of neigh value depends on use of threads or not.
The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" "command-line switch"_Section_start.html#start_7 is used
or not.
The default settings for the KOKKOS package are "package kk neigh full
comm/exchange host comm/forward host". This is the case whether the
"-sf kk" "command-line switch"_Section_start.html#start_7 is used or
not.
If the "-sf omp" "command-line switch"_Section_start.html#start_7 is
used then it is as if the command "package omp *" were invoked, to