git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@7342 f3b2605a-c512-4ea7-a41b-209d697bcdaa

2011-12-13 15:58:47 +00:00 · 2011-12-13 15:58:47 +00:00 · 9b72a103ea
parent 8325ae6954
commit 9b72a103ea
4 changed files with 375 additions and 129 deletions
--- a/doc/Section_start.html
+++ b/doc/Section_start.html
@ -853,6 +853,7 @@ letter abbreviation can be used:
 <LI>-p or -partition
 <LI>-pl or -plog
 <LI>-ps or -pscreen
+<LI>-r or -reorder
 <LI>-sc or -screen
 <LI>-sf or -suffix
 <LI>-v or -var 
@ -961,10 +962,78 @@ partition screen files are created.  This overrides the filename
 specified in the -screen command-line option.  This option is useful
 when working with large numbers of partitions, allowing the partition
 screen files to be suppressed (-pscreen none) or placed in a
-sub-directory (-pscreen replica_files/screen) If this option is not
+sub-directory (-pscreen replica_files/screen).  If this option is not
 used the screen file for partition N is screen.N or whatever is
 specified by the -screen command-line option.
 </P>
+<PRE>-reorder nth N
+-reorder custom filename 
+</PRE>
+<P>Reorder the processors in the MPI communicator used to instantiate
+LAMMPS, in one of several ways.  The original MPI communicator ranks
+all P processors from 0 to P-1.  The mapping of these ranks to
+physical processors is done by MPI before LAMMPS begins.  It may be
+useful in some cases to alter the rank order, e.g. to ensure that
+cores within each node are ranked in a desired order, or, when using
+the <A HREF = "run_style.html">run_style verlet/split</A> command with 2 partitions,
+to ensure that a specific Kspace processor (in the 2nd partition) is
+matched up with a specific set of processors in the 1st partition.
+See the <A HREF = "Section_accelerate.html">Section_accelerate</A> doc pages for
+more details.
+</P>
+<P>If the keyword <I>nth</I> is used with a setting <I>N</I>, then it means every
+Nth processor will be moved to the end of the ranking.  This is useful
+when using the <A HREF = "run_style.html">run_style verlet/split</A> command with 2
+partitions via the -partition command-line switch.  The first set of
+processors will be in the first partition, the 2nd set in the 2nd
+partition.  The -reorder command-line switch can alter this so that
+N-1 procs in the 1st partition and one proc in the 2nd partition come
+from N consecutive ranks in the original communicator, e.g. the cores
+on one physical node.
+This can boost performance.  For example, if you use "-reorder nth 4"
+and "-partition 9 3" and you are running on 12 processors, the
+processors will be reordered from
+</P>
+<PRE>0 1 2 3 4 5 6 7 8 9 10 11 
+</PRE>
+<P>to
+</P>
+<PRE>0 1 2 4 5 6 8 9 10 3 7 11 
+</PRE>
+<P>so that the processors in each partition will be
+</P>
+<PRE>0 1 2 4 5 6 8 9 10 
+3 7 11 
+</PRE>
+<P>See the "processors" command for how to ensure that the processors
+from each partition can then be grouped optimally for quad-core nodes.
+</P>
+<P>If the keyword is <I>custom</I>, then a file that specifies a permutation
+of the processor ranks is also specified.  The format of the reorder
+file is as follows.  Any number of initial blank or comment lines
+(starting with a "#" character) can be present.  These should be
+followed by P lines of the form:
+</P>
+<PRE>I J 
+</PRE>
+<P>where P is the number of processors LAMMPS was launched with.  Note
+that if running in multi-partition mode (see the -partition switch
+above) P is the total number of processors in all partitions.  The I
+and J values describe a permutation of the P processors.  Every I and
+J should be values from 0 to P-1 inclusive.  In the set of P I values,
+every proc ID should appear exactly once.  Ditto for the set of P J
+values.  A single I,J pairing means that the physical processor with
+rank I in the original MPI communicator will have rank J in the
+reordered communicator.
+</P>
+<P>Note that rank ordering can also be specified by many MPI
+implementations, either by environment variables that specify how to
+order physical processors, or by config files that specify what
+physical processors to assign to each MPI rank.  The -reorder switch
+simply gives you a portable way to do this without relying on MPI
+itself.  See the <A HREF = "processors.html">processors file</A> command for how to output
+info on the final assignment of physical processors to the LAMMPS
+simulation domain.
+</P>
 <PRE>-screen file 
 </PRE>
 <P>Specify a file for LAMMPS to write its screen information to.  In
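The "-reorder nth N" rule documented in this hunk can be illustrated with a minimal Python sketch. It is not LAMMPS code: the function name reorder_nth is hypothetical, and the rule (every rank r with (r+1) divisible by N moves to the end, relative order preserved) is inferred from the 12-processor example in the text.

```python
def reorder_nth(nprocs, n):
    """Return the original ranks in their new order for '-reorder nth n'.

    Assumption (inferred from the documented example): every nth rank,
    i.e. ranks n-1, 2n-1, ..., is moved to the end, order preserved.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    kept = [r for r in range(nprocs) if (r + 1) % n != 0]
    moved = [r for r in range(nprocs) if (r + 1) % n == 0]
    return kept + moved


# Documented example: 12 processors, "-reorder nth 4", "-partition 9 3"
order = reorder_nth(12, 4)
first_partition = order[:9]    # original ranks placed in the 1st partition
second_partition = order[9:]   # original ranks placed in the 2nd partition
print(order)
print(first_partition, second_partition)
```

With these inputs the sketch reproduces the ordering 0 1 2 4 5 6 8 9 10 3 7 11 shown in the text.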
--- a/doc/Section_start.txt
+++ b/doc/Section_start.txt
@ -844,6 +844,7 @@ letter abbreviation can be used:
 -p or -partition
 -pl or -plog
 -ps or -pscreen
+-r or -reorder
 -sc or -screen
 -sf or -suffix
 -v or -var :ul
@ -952,10 +953,78 @@ partition screen files are created.  This overrides the filename
 specified in the -screen command-line option.  This option is useful
 when working with large numbers of partitions, allowing the partition
 screen files to be suppressed (-pscreen none) or placed in a
-sub-directory (-pscreen replica_files/screen) If this option is not
+sub-directory (-pscreen replica_files/screen).  If this option is not
 used the screen file for partition N is screen.N or whatever is
 specified by the -screen command-line option.

+-reorder nth N
+-reorder custom filename :pre
+
+Reorder the processors in the MPI communicator used to instantiate
+LAMMPS, in one of several ways.  The original MPI communicator ranks
+all P processors from 0 to P-1.  The mapping of these ranks to
+physical processors is done by MPI before LAMMPS begins.  It may be
+useful in some cases to alter the rank order, e.g. to ensure that
+cores within each node are ranked in a desired order, or, when using
+the "run_style verlet/split"_run_style.html command with 2 partitions,
+to ensure that a specific Kspace processor (in the 2nd partition) is
+matched up with a specific set of processors in the 1st partition.
+See the "Section_accelerate"_Section_accelerate.html doc pages for
+more details.
+
+If the keyword {nth} is used with a setting {N}, then it means every
+Nth processor will be moved to the end of the ranking.  This is useful
+when using the "run_style verlet/split"_run_style.html command with 2
+partitions via the -partition command-line switch.  The first set of
+processors will be in the first partition, the 2nd set in the 2nd
+partition.  The -reorder command-line switch can alter this so that
+N-1 procs in the 1st partition and one proc in the 2nd partition come
+from N consecutive ranks in the original communicator, e.g. the cores
+on one physical node.
+This can boost performance.  For example, if you use "-reorder nth 4"
+and "-partition 9 3" and you are running on 12 processors, the
+processors will be reordered from
+
+0 1 2 3 4 5 6 7 8 9 10 11 :pre
+
+to
+
+0 1 2 4 5 6 8 9 10 3 7 11 :pre
+
+so that the processors in each partition will be
+
+0 1 2 4 5 6 8 9 10 
+3 7 11 :pre
+
+See the "processors" command for how to ensure that the processors
+from each partition can then be grouped optimally for quad-core nodes.
+
+If the keyword is {custom}, then a file that specifies a permutation
+of the processor ranks is also specified.  The format of the reorder
+file is as follows.  Any number of initial blank or comment lines
+(starting with a "#" character) can be present.  These should be
+followed by P lines of the form:
+
+I J :pre
+
+where P is the number of processors LAMMPS was launched with.  Note
+that if running in multi-partition mode (see the -partition switch
+above) P is the total number of processors in all partitions.  The I
+and J values describe a permutation of the P processors.  Every I and
+J should be values from 0 to P-1 inclusive.  In the set of P I values,
+every proc ID should appear exactly once.  Ditto for the set of P J
+values.  A single I,J pairing means that the physical processor with
+rank I in the original MPI communicator will have rank J in the
+reordered communicator.
+
+Note that rank ordering can also be specified by many MPI
+implementations, either by environment variables that specify how to
+order physical processors, or by config files that specify what
+physical processors to assign to each MPI rank.  The -reorder switch
+simply gives you a portable way to do this without relying on MPI
+itself.  See the "processors file"_processors.html command for how to output
+info on the final assignment of physical processors to the LAMMPS
+simulation domain.
+
 -screen file :pre

 Specify a file for LAMMPS to write its screen information to.  In
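The rules for the "-reorder custom" file (leading blank or "#" lines, then P lines "I J" that each form a permutation of 0 to P-1) can be checked with a short Python sketch. This is an illustration under assumptions, not the LAMMPS parser: the function name parse_reorder_file is hypothetical, and whitespace-separated integers are assumed.

```python
def parse_reorder_file(text, nprocs):
    """Parse and validate a reorder file; return {original_rank: new_rank}."""
    lines = text.splitlines()
    # Skip any number of initial blank or comment lines.
    start = 0
    while start < len(lines) and (
        not lines[start].strip() or lines[start].lstrip().startswith("#")
    ):
        start += 1
    pairs = []
    for line in lines[start:start + nprocs]:
        i, j = (int(tok) for tok in line.split())  # ValueError if not "I J"
        pairs.append((i, j))
    if len(pairs) != nprocs:
        raise ValueError("expected %d 'I J' lines, found %d" % (nprocs, len(pairs)))
    expected = set(range(nprocs))
    # P values that cover 0..P-1 necessarily appear exactly once each.
    if {i for i, _ in pairs} != expected:
        raise ValueError("I values are not a permutation of 0..P-1")
    if {j for _, j in pairs} != expected:
        raise ValueError("J values are not a permutation of 0..P-1")
    return dict(pairs)


# Hypothetical 4-processor file that swaps ranks 1 and 2.
example = "# swap ranks 1 and 2\n\n0 0\n1 2\n2 1\n3 3\n"
mapping = parse_reorder_file(example, 4)
print(mapping)
```

Running such a check before launching a job is one way to catch a duplicated or missing rank early.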
--- a/doc/processors.html
+++ b/doc/processors.html
@ -19,27 +19,31 @@

 <LI>zero or more keyword/arg pairs may be appended 

-<LI>keyword = <I>grid</I> or <I>numa</I> or <I>part</I> 
+<LI>keyword = <I>grid</I> or <I>level2</I> or <I>level3</I> or <I>numa</I> or <I>part</I> or <I>file</I> 

 <PRE>  <I>grid</I> arg = <I>cart</I> or <I>cart/reorder</I> or <I>xyz</I> or <I>xzy</I> or <I>yxz</I> or <I>yzx</I> or <I>zxy</I> or <I>zyx</I>
     cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0
     cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1
-     xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order, where I varies fastest, then J, and K slowest
+     xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order
  <I>numa</I> arg = none
  <I>part</I> args = Psend Precv cstyle
    Psend = partition # (1 to Np) which will send its processor layout
    Precv = partition # (1 to Np) which will recv the processor layout
    cstyle = <I>multiple</I>
-      <I>multiple</I> = Psend layout will be multiple of Precv layout in each dimension 
+      <I>multiple</I> = Psend layout will be multiple of Precv layout in each dimension
+  <I>file</I> arg = fname
+    fname = name of file to write processor mapping info to 
 </PRE>

 </UL>
 <P><B>Examples:</B>
 </P>
-<PRE>processors 2 4 4
-processors * * 5
-processors * * * grid xyz
+<PRE>processors * * 5 
+processors 2 4 4
+processors 2 4 4 grid xyz
+processors * * 8 grid xyz
 processors * * * numa
+processors 4 8 16 custom myfile
 processors * * * part 1 2 multiple 
 </PRE>
 <P><B>Description:</B>
@ -49,57 +53,67 @@ simulation box.  This involves 2 steps.  First if there are P
 processors it means choosing a factorization P = Px by Py by Pz so
 that there are Px processors in the x dimension, and similarly for the
 y and z dimensions.  Second, the P processors (with MPI ranks 0 to
-P-1) are mapped to the logical grid so that each grid cell is a
-processor.  The arguments to this command control each of these 2
-steps.
+P-1) are mapped to the logical 3d grid.  The arguments to this command
+control each of these 2 steps.
 </P>
 <P>The Px, Py, Pz parameters affect the factorization.  Any of the 3
 parameters can be specified with an asterisk "*", which means LAMMPS
-will choose the number of processors in that dimension.  It will do
-this based on the size and shape of the global simulation box so as to
-minimize the surface-to-volume ratio of each processor's sub-domain.
+will choose the number of processors in that dimension of the grid.
+It will do this based on the size and shape of the global simulation
+box so as to minimize the surface-to-volume ratio of each processor's
+sub-domain.
 </P>
 <P>Since LAMMPS does not load-balance by changing the grid of 3d
-processors on-the-fly, this choosing explicit values for Px or Py or
-Pz can be used to override the LAMMPS default if it is known to be
-sub-optimal for a particular problem.  For example, a problem where
-the extent of atoms will change dramatically in a particular dimension
-over the course of the simulation.
+processors on-the-fly, choosing explicit values for Px or Py or Pz can
+be used to override the LAMMPS default if it is known to be
+sub-optimal for a particular problem, e.g. a problem where the extent
+of atoms will change dramatically in a particular dimension over the
+course of the simulation.
 </P>
 <P>The product of Px, Py, Pz must equal P, the total # of processors
 LAMMPS is running on.  For a <A HREF = "dimension.html">2d simulation</A>, Pz must
-equal 1.  If multiple partitions are being used then P is the number
-of processors in this partition; see <A HREF = "Section_start.html#start_6">this
-section</A> for an explanation of the
-partition command-line switch.
+equal 1.
 </P>
 <P>Note that if you run on a large, prime number of processors P, then a
 grid such as 1 x P x 1 will be required, which may incur extra
 communication costs due to the high surface area of each processor's
 sub-domain.
 </P>
+<P>Also note that if multiple partitions are being used then P is the
+number of processors in this partition; see <A HREF = "Section_start.html#start_6">this
+section</A> for an explanation of the
+-partition command-line switch.
+</P>
+<P>You can prefix the processors command with the
+<A HREF = "partition.html">partition</A> command to specify different
+Px,Py,Pz values, i.e. different processor grids, for different
+partitions, e.g.
+</P>
+<PRE>partition yes 1 processors 4 4 4
+partition yes 2 processors 2 3 2 
+</PRE>
 <HR>

-<P>The <I>grid</I> keyword affects how processor IDs are mapped to the 3d grid
-of processors.
+<P>The <I>grid</I> keyword affects how the P processor IDs (from 0 to P-1) are
+mapped to the 3d grid of processors.
 </P>
-<P>The <I>cart</I> style uses the family of MPI Cartesian functions to do
-this, namely MPI_Cart_create(), MPI_Cart_get(), MPI_Cart_shift(), and
-MPI_Cart_rank().  It invokes the MPI_Cart_create() function with its
-reorder flag = 0, so that MPI is not free to reorder the processors.
+<P>The <I>cart</I> style uses the family of MPI Cartesian functions to perform
+the mapping, namely MPI_Cart_create(), MPI_Cart_get(),
+MPI_Cart_shift(), and MPI_Cart_rank().  It invokes the
+MPI_Cart_create() function with its reorder flag = 0, so that MPI is
+not free to reorder the processors.
 </P>
 <P>The <I>cart/reorder</I> style does the same thing as the <I>cart</I> style
-except it sets the reorder flag to 1, so that MPI is free to reorder
+except it sets the reorder flag to 1, so that MPI can reorder
 processors if it desires.
 </P>
 <P>The <I>xyz</I>, <I>xzy</I>, <I>yxz</I>, <I>yzx</I>, <I>zxy</I>, and <I>zyx</I> styles are all
-similar.  If the style is IJK, then it explicitly maps the P
-processors to the grid so that the processor ID in the I direction
-varies fastest, the processor ID in the J direction varies next
-fastest, and the processor ID in the K direction varies slowest.  For
-example, if you select style <I>xyz</I> and you have a 2x2x2 grid of 8
-processors, the assignments of the 8 octants of the simulation domain
-will be:
+similar.  If the style is IJK, then it maps the P processors to the
+grid so that the processor ID in the I direction varies fastest, the
+processor ID in the J direction varies next fastest, and the processor
+ID in the K direction varies slowest.  For example, if you select
+style <I>xyz</I> and you have a 2x2x2 grid of 8 processors, the assignments
+of the 8 octants of the simulation domain will be:
 </P>
 <PRE>proc 0 = lo x, lo y, lo z octant
 proc 1 = hi x, lo y, lo z octant
@ -114,21 +128,28 @@ proc 7 = hi x, hi y, hi z octant
 should be aware of both the machine's network topology and the
 specific subset of processors and nodes that were assigned to your
 simulation.  Thus its MPI_Cart calls can optimize the assignment of
-MPI processes to the 3d grid to minimize communication costs.  However
-in practice, few if any MPI implementations actually do this.  So it
-is likely that the <I>cart</I> and <I>cart/reorder</I> styles simply give the
-same result as one of the IJK styles.
+MPI processes to the 3d grid to minimize communication costs.  In
+practice, however, few if any MPI implementations actually do this.
+So it is likely that the <I>cart</I> and <I>cart/reorder</I> styles simply give
+the same result as one of the IJK styles.
 </P>
 <HR>

 <P>The <I>numa</I> keyword affects both the factorization of P into Px,Py,Pz
 and the mapping of processors to the 3d grid.
 </P>
-<P>It will perform a two-level factorization of the simulation box to
-minimize inter-node communication.  This can improve parallel
-efficiency by reducing network traffic.  When this keyword is set, the
-simulation box is first divided across nodes.  Then within each node,
-the subdomain is further divided between the cores of each node.
+<P>It operates similarly to the <I>level2</I> and <I>level3</I> keywords except that
+it tries to auto-detect the count and topology of the processors and
+cores within a node.  Currently, it does this in only 2 levels
+(assumes procs/node = 1), but it may be extended in the future.
+</P>
+<P>It also uses a different algorithm (iterative) than the <I>level2</I>
+keyword for doing the two-level factorization of the simulation box
+into a 3d processor grid to minimize off-node communication.  Thus it
+may give a different or improved mapping of processors to the 3d grid.
+</P>
+<P>The numa setting will give an error if the number of MPI processes
+is not evenly divisible by the number of cores used per node.
 </P>
 <P>The numa setting will be ignored if (a) there are less than 4 cores
 per node, or (b) the number of MPI processes is not divisible by the
@ -137,14 +158,16 @@ any of the Px or Py of Pz values is greater than 1.
 </P>
 <HR>

-<P>The <I>part</I> keyword can be useful when running in multi-partition mode,
-e.g. with the <A HREF = "run_style.html<A HREF = "Section_start.html#start_6">-partition">>run_style verlet/split</A> command.  It
-specifies a dependency bewteen a sending partition <I>Psend</I> and a
-receiving partition <I>Precv</I> which is enforced when each is setting up
-their own mapping of the partitions processors to the simulation box.
-Each of <I>Psend</I> and <I>Precv</I> must be integers from 1 to Np, where Np is
-the number of partitions you have defined via the <A HREF = </A>
-command-line switch</A>.
+<P>The <I>part</I> keyword affects the factorization of P into Px,Py,Pz.
+</P>
+<P>It can be useful when running in multi-partition mode, e.g. with the
+<A HREF = "run_style.html">run_style verlet/split</A> command.  It specifies a
+dependency between a sending partition <I>Psend</I> and a receiving
+partition <I>Precv</I> which is enforced when each is setting up their own
+mapping of their processors to the simulation box.  Each of <I>Psend</I>
+and <I>Precv</I> must be integers from 1 to Np, where Np is the number of
+partitions you have defined via the <A HREF = "Section_start.html#start_6">-partition command-line
+switch</A>.
 </P>
 <P>A "dependency" means that the sending partition will create its 3d
 logical grid as Px by Py by Pz and after it has done this, it will
@ -165,14 +188,6 @@ processors, it could create a 4x2x10 grid, but it will not create a
 2x4x10 grid, since in the y-dimension, 6 is not an integer multiple of
 4.
 </P>
-<HR>
-
-<P>Note that you can use the <A HREF = "partition.html">partition</A> command to
-specify different processor grids for different partitions, e.g.
-</P>
-<PRE>partition yes 1 processors 4 4 4
-partition yes 2 processors 2 3 2 
-</PRE>
 <P>IMPORTANT NOTE: If you use the <A HREF = "partition.html">partition</A> command to
 invoke different "processsors" commands on different partitions, and
 you also use the <I>part</I> keyword, then you must insure that both the
@ -183,6 +198,39 @@ setup phase if this error has been made.
 </P>
 <HR>

+<P>The <I>file</I> keyword writes the factorization of the P processors
+and their mapping to the 3d grid to the specified file
+<I>fname</I>.  This is useful to check that you assigned physical
+processors in the manner you desired, which can be tricky to figure
+out, especially when running on multiple partitions or on a multicore
+machine or when the processor ranks were reordered by use of the
+<A HREF = "Section_start.html#start_6">-reorder command-line switch</A> or due to
+use of MPI-specific launch options such as a config file.
+</P>
+<P>If you have multiple partitions you should ensure that each one writes
+to a different file, e.g. using a <A HREF = "variable.html">world-style variable</A>
+for the filename.  The file will have a self-explanatory header,
+followed by one line per processor in this format:
+</P>
+<PRE>I J K: world-ID universe-ID original-ID: name 
+</PRE>
+<P>I,J,K are the indices of the processor in the 3d logical grid.  The
+IDs are the processor's rank in this simulation (the world), the
+universe (of multiple simulations), and the original MPI communicator
+used to instantiate LAMMPS, respectively.  The world and universe IDs
+will only be different if you are running on more than one partition;
+see the <A HREF = "Section_start.html#start_6">-partition command-line switch</A>.
+The universe and original IDs will only be different if you used the
+<A HREF = "Section_start.html#start_6">-reorder command-line switch</A> to reorder
+the processors differently than their rank in the original
+communicator LAMMPS was instantiated with.  The <I>name</I> is what is
+returned by a call to MPI_Get_processor_name() and should represent an
+identifier relevant to the physical processors in your machine.  Note
+that depending on the MPI implementation, multiple cores can have the
+same <I>name</I>.
+</P>
+<HR>
+
 <P><B>Restrictions:</B>
 </P>
 <P>This command cannot be used after the simulation box is defined by a
@ -190,13 +238,19 @@ setup phase if this error has been made.
 It can be used before a restart file is read to change the 3d
 processor grid from what is specified in the restart file.
 </P>
-<P>The <I>numa</I> keyword cannot be used with the <I>part</I> keyword, or
-with any <I>grid</I> setting other than <I>cart</I>.
+<P>You cannot use more than one of the <I>level2</I>, <I>level3</I>, or <I>numa</I>
+keywords.
 </P>
-<P><B>Related commands:</B> none
+<P>The <I>numa</I> keyword cannot be used with the <I>part</I> keyword, and it
+ignores the <I>grid</I> setting.
+</P>
+<P><B>Related commands:</B>
+</P>
+<P><A HREF = "partition.html">partition</A>, <A HREF = "Section_start.html#start_6">-reorder command-line
+switch</A>
 </P>
 <P><B>Default:</B>
 </P>
-<P>The option defaults are Px Py Pz = * * *, grid = cart, numa = 0.
+<P>The option defaults are Px Py Pz = * * * and grid = cart.
 </P>
 </HTML>
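The IJK mapping styles described in this file (I varies fastest, then J, then K) can be sketched in a few lines of Python. This is an illustration of the documented ordering only, not the LAMMPS implementation; the function name grid_coords is hypothetical.

```python
def grid_coords(rank, px, py, pz, style="xyz"):
    """Return the (x, y, z) grid indices of a rank for an IJK style.

    The first letter of style is the fastest-varying dimension,
    the last letter the slowest, as described in the text.
    """
    if sorted(style) != ["x", "y", "z"]:
        raise ValueError("style must be a permutation of 'xyz'")
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank out of range for this grid")
    sizes = {"x": px, "y": py, "z": pz}
    coords = {}
    for axis in style:               # fastest-varying dimension first
        coords[axis] = rank % sizes[axis]
        rank //= sizes[axis]
    return coords["x"], coords["y"], coords["z"]


# 2x2x2 grid with style xyz: index 0 = "lo" half, 1 = "hi" half
for proc in range(8):
    print(proc, grid_coords(proc, 2, 2, 2, "xyz"))
```

For the 2x2x2 case, proc 1 lands in the hi x, lo y, lo z octant and proc 7 in the hi x, hi y, hi z octant, matching the listing in the description. Passing "zyx" instead makes the z index vary fastest.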
--- a/doc/processors.txt
+++ b/doc/processors.txt
@ -14,25 +14,29 @@ processors Px Py Pz keyword args ... :pre

 Px,Py,Pz = # of processors in each dimension of a 3d grid :ulb,l
 zero or more keyword/arg pairs may be appended :l
-keyword = {grid} or {numa} or {part} :l
+keyword = {grid} or {level2} or {level3} or {numa} or {part} or {file} :l
  {grid} arg = {cart} or {cart/reorder} or {xyz} or {xzy} or {yxz} or {yzx} or {zxy} or {zyx}
     cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0
     cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1
-     xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order, where I varies fastest, then J, and K slowest
+     xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order
  {numa} arg = none
  {part} args = Psend Precv cstyle
    Psend = partition # (1 to Np) which will send its processor layout
    Precv = partition # (1 to Np) which will recv the processor layout
    cstyle = {multiple}
-      {multiple} = Psend layout will be multiple of Precv layout in each dimension :pre
+      {multiple} = Psend layout will be multiple of Precv layout in each dimension
+  {file} arg = fname
+    fname = name of file to write processor mapping info to :pre
 :ule

 [Examples:]

+processors * * 5 
 processors 2 4 4
-processors * * 5
-processors * * * grid xyz
+processors 2 4 4 grid xyz
+processors * * 8 grid xyz
 processors * * * numa
+processors 4 8 16 custom myfile
 processors * * * part 1 2 multiple :pre

 [Description:]
@ -42,57 +46,67 @@ simulation box.  This involves 2 steps.  First if there are P
 processors it means choosing a factorization P = Px by Py by Pz so
 that there are Px processors in the x dimension, and similarly for the
 y and z dimensions.  Second, the P processors (with MPI ranks 0 to
-P-1) are mapped to the logical grid so that each grid cell is a
-processor.  The arguments to this command control each of these 2
-steps.
+P-1) are mapped to the logical 3d grid.  The arguments to this command
+control each of these 2 steps.

 The Px, Py, Pz parameters affect the factorization.  Any of the 3
 parameters can be specified with an asterisk "*", which means LAMMPS
-will choose the number of processors in that dimension.  It will do
-this based on the size and shape of the global simulation box so as to
-minimize the surface-to-volume ratio of each processor's sub-domain.
+will choose the number of processors in that dimension of the grid.
+It will do this based on the size and shape of the global simulation
+box so as to minimize the surface-to-volume ratio of each processor's
+sub-domain.

 Since LAMMPS does not load-balance by changing the grid of 3d
-processors on-the-fly, this choosing explicit values for Px or Py or
-Pz can be used to override the LAMMPS default if it is known to be
-sub-optimal for a particular problem.  For example, a problem where
-the extent of atoms will change dramatically in a particular dimension
-over the course of the simulation.
+processors on-the-fly, choosing explicit values for Px or Py or Pz can
+be used to override the LAMMPS default if it is known to be
+sub-optimal for a particular problem, e.g. a problem where the extent
+of atoms will change dramatically in a particular dimension over the
+course of the simulation.

 The product of Px, Py, Pz must equal P, the total # of processors
 LAMMPS is running on.  For a "2d simulation"_dimension.html, Pz must
-equal 1.  If multiple partitions are being used then P is the number
-of processors in this partition; see "this
-section"_Section_start.html#start_6 for an explanation of the
-partition command-line switch.
+equal 1.

 Note that if you run on a large, prime number of processors P, then a
 grid such as 1 x P x 1 will be required, which may incur extra
 communication costs due to the high surface area of each processor's
 sub-domain.

+Also note that if multiple partitions are being used then P is the
+number of processors in this partition; see "this
+section"_Section_start.html#start_6 for an explanation of the
+-partition command-line switch.
+
+You can prefix the processors command with the
+"partition"_partition.html command to specify different Px,Py,Pz
+values, i.e. different processor grids, for different partitions, e.g.
+
+partition yes 1 processors 4 4 4
+partition yes 2 processors 2 3 2 :pre
+
 :line

-The {grid} keyword affects how processor IDs are mapped to the 3d grid
-of processors.
+The {grid} keyword affects how the P processor IDs (from 0 to P-1) are
+mapped to the 3d grid of processors.

-The {cart} style uses the family of MPI Cartesian functions to do
-this, namely MPI_Cart_create(), MPI_Cart_get(), MPI_Cart_shift(), and
-MPI_Cart_rank().  It invokes the MPI_Cart_create() function with its
-reorder flag = 0, so that MPI is not free to reorder the processors.
+The {cart} style uses the family of MPI Cartesian functions to perform
+the mapping, namely MPI_Cart_create(), MPI_Cart_get(),
+MPI_Cart_shift(), and MPI_Cart_rank().  It invokes the
+MPI_Cart_create() function with its reorder flag = 0, so that MPI is
+not free to reorder the processors.

 The {cart/reorder} style does the same thing as the {cart} style
-except it sets the reorder flag to 1, so that MPI is free to reorder
+except it sets the reorder flag to 1, so that MPI can reorder
 processors if it desires.

 The {xyz}, {xzy}, {yxz}, {yzx}, {zxy}, and {zyx} styles are all
-similar.  If the style is IJK, then it explicitly maps the P
-processors to the grid so that the processor ID in the I direction
-varies fastest, the processor ID in the J direction varies next
-fastest, and the processor ID in the K direction varies slowest.  For
-example, if you select style {xyz} and you have a 2x2x2 grid of 8
-processors, the assignments of the 8 octants of the simulation domain
-will be:
+similar.  If the style is IJK, then it maps the P processors to the
+grid so that the processor ID in the I direction varies fastest, the
+processor ID in the J direction varies next fastest, and the processor
+ID in the K direction varies slowest.  For example, if you select
+style {xyz} and you have a 2x2x2 grid of 8 processors, the assignments
+of the 8 octants of the simulation domain will be:

 proc 0 = lo x, lo y, lo z octant
 proc 1 = hi x, lo y, lo z octant
@ -107,21 +121,28 @@ Note that, in principle, an MPI implementation on a particular machine
 should be aware of both the machine's network topology and the
 specific subset of processors and nodes that were assigned to your
 simulation.  Thus its MPI_Cart calls can optimize the assignment of
-MPI processes to the 3d grid to minimize communication costs.  However
-in practice, few if any MPI implementations actually do this.  So it
-is likely that the {cart} and {cart/reorder} styles simply give the
-same result as one of the IJK styles.
+MPI processes to the 3d grid to minimize communication costs.  In
+practice, however, few if any MPI implementations actually do this.
+So it is likely that the {cart} and {cart/reorder} styles simply give
+the same result as one of the IJK styles.

 :line

 The {numa} keyword affects both the factorization of P into Px,Py,Pz
 and the mapping of processors to the 3d grid.

-It will perform a two-level factorization of the simulation box to
-minimize inter-node communication.  This can improve parallel
-efficiency by reducing network traffic.  When this keyword is set, the
-simulation box is first divided across nodes.  Then within each node,
-the subdomain is further divided between the cores of each node.
+It operates similarly to the {level2} and {level3} keywords except that
+it tries to auto-detect the count and topology of the processors and
+cores within a node.  Currently, it does this in only 2 levels
+(assumes procs/node = 1), but it may be extended in the future.
+
+It also uses a different algorithm (iterative) than the {level2}
+keyword for doing the two-level factorization of the simulation box
+into a 3d processor grid to minimize off-node communication.  Thus it
+may give a different or improved mapping of processors to the 3d grid.
+
+The numa setting will give an error if the number of MPI processes
+is not evenly divisible by the number of cores used per node.

 The numa setting will be ignored if (a) there are less than 4 cores
 per node, or (b) the number of MPI processes is not divisible by the
@ -130,14 +151,16 @@ any of the Px or Py of Pz values is greater than 1.

 :line

-The {part} keyword can be useful when running in multi-partition mode,
-e.g. with the "run_style verlet/split"_run_style.html command.  It
-specifies a dependency bewteen a sending partition {Psend} and a
-receiving partition {Precv} which is enforced when each is setting up
-their own mapping of the partitions processors to the simulation box.
-Each of {Psend} and {Precv} must be integers from 1 to Np, where Np is
-the number of partitions you have defined via the "-partition
-command-line switch"__Section_start.html#start_6.
+The {part} keyword affects the factorization of P into Px,Py,Pz.
+
+It can be useful when running in multi-partition mode, e.g. with the
+"run_style verlet/split"_run_style.html command.  It specifies a
+dependency between a sending partition {Psend} and a receiving
+partition {Precv} which is enforced when each is setting up their own
+mapping of their processors to the simulation box.  Each of {Psend}
+and {Precv} must be integers from 1 to Np, where Np is the number of
+partitions you have defined via the "-partition command-line
+switch"_Section_start.html#start_6.

 A "dependency" means that the sending partition will create its 3d
 logical grid as Px by Py by Pz and after it has done this, it will
@ -158,14 +181,6 @@ processors, it could create a 4x2x10 grid, but it will not create a
 2x4x10 grid, since in the y-dimension, 6 is not an integer multiple of
 4.

-:line
-
-Note that you can use the "partition"_partition.html command to
-specify different processor grids for different partitions, e.g.
-
-partition yes 1 processors 4 4 4
-partition yes 2 processors 2 3 2 :pre
-
 IMPORTANT NOTE: If you use the "partition"_partition.html command to
 invoke different "processsors" commands on different partitions, and
 you also use the {part} keyword, then you must insure that both the
@ -176,6 +191,39 @@ setup phase if this error has been made.

 :line

+The {file} keyword writes the factorization of the P processors
+and their mapping to the 3d grid to the specified file
+{fname}.  This is useful to check that you assigned physical
+processors in the manner you desired, which can be tricky to figure
+out, especially when running on multiple partitions or on a multicore
+machine or when the processor ranks were reordered by use of the
+"-reorder command-line switch"_Section_start.html#start_6 or due to
+use of MPI-specific launch options such as a config file.
+
+If you have multiple partitions you should ensure that each one writes
+to a different file, e.g. using a "world-style variable"_variable.html
+for the filename.  The file will have a self-explanatory header,
+followed by one line per processor in this format:
+
+I J K: world-ID universe-ID original-ID: name :pre
+
+I,J,K are the indices of the processor in the 3d logical grid.  The
+IDs are the processor's rank in this simulation (the world), the
+universe (of multiple simulations), and the original MPI communicator
+used to instantiate LAMMPS, respectively.  The world and universe IDs
+will only be different if you are running on more than one partition;
+see the "-partition command-line switch"_Section_start.html#start_6.
+The universe and original IDs will only be different if you used the
+"-reorder command-line switch"_Section_start.html#start_6 to reorder
+the processors differently than their rank in the original
+communicator LAMMPS was instantiated with.  The {name} is what is
+returned by a call to MPI_Get_processor_name() and should represent an
+identifier relevant to the physical processors in your machine.  Note
+that depending on the MPI implementation, multiple cores can have the
+same {name}.
+
+:line
+
 [Restrictions:]

 This command cannot be used after the simulation box is defined by a
@ -183,11 +231,17 @@ This command cannot be used after the simulation box is defined by a
 It can be used before a restart file is read to change the 3d
 processor grid from what is specified in the restart file.

-The {numa} keyword cannot be used with the {part} keyword, or
-with any {grid} setting other than {cart}.
+You cannot use more than one of the {level2}, {level3}, or {numa}
+keywords.

-[Related commands:] none
+The {numa} keyword cannot be used with the {part} keyword, and it
+ignores the {grid} setting.
+
+[Related commands:]
+
+"partition"_partition.html, "-reorder command-line
+switch"_Section_start.html#start_6

 [Default:]

-The option defaults are Px Py Pz = * * *, grid = cart, numa = 0.
+The option defaults are Px Py Pz = * * * and grid = cart.
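The per-processor line format of the mapping file ("I J K: world-ID universe-ID original-ID: name") can be read back with a small Python sketch. The exact spacing LAMMPS writes is an assumption here, the function name parse_map_line is hypothetical, and the sample line is made up for illustration rather than taken from a real run.

```python
def parse_map_line(line):
    """Split one mapping line into grid indices, rank IDs, and processor name."""
    # The name may itself contain ':' so split at most twice.
    grid, ids, name = line.split(":", 2)
    i, j, k = (int(tok) for tok in grid.split())
    world, universe, original = (int(tok) for tok in ids.split())
    return {
        "grid": (i, j, k),
        "world": world,
        "universe": universe,
        "original": original,
        "name": name.strip(),
    }


# Hypothetical line, not output from a real simulation.
entry = parse_map_line("1 0 2: 5 14 11: node03")
multi_partition = entry["world"] != entry["universe"]   # more than one partition
reordered = entry["universe"] != entry["original"]      # -reorder was used
print(entry, multi_partition, reordered)
```

Comparing the three IDs this way follows the description: world and universe IDs differ only with multiple partitions, and universe and original IDs differ only when the ranks were reordered.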