git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@7342 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp 2011-12-13 15:58:47 +00:00
parent 8325ae6954
commit 9b72a103ea
4 changed files with 375 additions and 129 deletions


@ -853,6 +853,7 @@ letter abbreviation can be used:
<LI>-p or -partition <LI>-p or -partition
<LI>-pl or -plog <LI>-pl or -plog
<LI>-ps or -pscreen <LI>-ps or -pscreen
<LI>-r or -reorder
<LI>-sc or -screen <LI>-sc or -screen
<LI>-sf or -suffix <LI>-sf or -suffix
<LI>-v or -var <LI>-v or -var
@ -961,10 +962,78 @@ partition screen files are created. This overrides the filename
specified in the -screen command-line option. This option is useful specified in the -screen command-line option. This option is useful
when working with large numbers of partitions, allowing the partition when working with large numbers of partitions, allowing the partition
screen files to be suppressed (-pscreen none) or placed in a screen files to be suppressed (-pscreen none) or placed in a
sub-directory (-pscreen replica_files/screen) If this option is not sub-directory (-pscreen replica_files/screen). If this option is not
used the screen file for partition N is screen.N or whatever is used the screen file for partition N is screen.N or whatever is
specified by the -screen command-line option. specified by the -screen command-line option.
</P> </P>
<PRE>-reorder nth N
-reorder custom filename
</PRE>
<P>Reorder the processors in the MPI communicator used to instantiate
LAMMPS, in one of several ways. The original MPI communicator ranks
all P processors from 0 to P-1. The mapping of these ranks to
physical processors is done by MPI before LAMMPS begins. It may be
useful in some cases to alter the rank order, e.g. to ensure that
cores within each node are ranked in a desired order, or, when using
the <A HREF = "run_style.html">run_style verlet/split</A> command with 2 partitions,
to ensure that a specific Kspace processor (in the 2nd partition) is
matched up with a specific set of processors in the 1st partition.
See the <A HREF = "Section_accelerate.html">Section_accelerate</A> doc pages for
more details.
</P>
<P>If the keyword <I>nth</I> is used with a setting <I>N</I>, then it means every
Nth processor will be moved to the end of the ranking. This is useful
when using the <A HREF = "run_style.html">run_style verlet/split</A> command with 2
partitions via the -partition command-line switch. The first set of
processors will be in the first partition, the 2nd set in the 2nd
partition. The -reorder command-line switch can alter this so that
groups of N-1 procs from the 1st partition and one proc from the 2nd
partition are ordered consecutively, e.g. as the cores on one physical node.
This can boost performance. For example, if you use "-reorder nth 4"
and "-partition 9 3" and you are running on 12 processors, the
processors will be reordered from
</P>
<PRE>0 1 2 3 4 5 6 7 8 9 10 11
</PRE>
<P>to
</P>
<PRE>0 1 2 4 5 6 8 9 10 3 7 11
</PRE>
<P>so that the processors in each partition will be
</P>
<PRE>0 1 2 4 5 6 8 9 10
3 7 11
</PRE>
<P>See the <A HREF = "processors.html">processors</A> command for how to ensure that
processors from each partition are then grouped optimally on quad-core nodes.
</P>
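As an illustration, the <I>nth</I> rule described above can be sketched in a few lines of Python. This is a hypothetical helper written for this doc page, not LAMMPS source code:

```python
# Hypothetical sketch of the "-reorder nth N" rule (illustration only,
# not LAMMPS source): every Nth rank is moved to the end of the ranking.
def reorder_nth(P, N):
    moved = [r for r in range(P) if (r + 1) % N == 0]   # ranks N-1, 2N-1, ...
    kept = [r for r in range(P) if (r + 1) % N != 0]
    return kept + moved

# The 12-processor example from the text with "-reorder nth 4":
print(reorder_nth(12, 4))  # [0, 1, 2, 4, 5, 6, 8, 9, 10, 3, 7, 11]
```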
<P>If the keyword is <I>custom</I>, then a file that specifies a permutation
of the processor ranks is also specified. The format of the reorder
file is as follows. Any number of initial blank or comment lines
(starting with a "#" character) can be present. These should be
followed by P lines of the form:
</P>
<PRE>I J
</PRE>
<P>where P is the number of processors LAMMPS was launched with. Note
that if running in multi-partition mode (see the -partition switch
above) P is the total number of processors in all partitions. The I
and J values describe a permutation of the P processors. Every I and
J should be values from 0 to P-1 inclusive. In the set of P I values,
every proc ID should appear exactly once. Ditto for the set of P J
values. A single I,J pairing means that the physical processor with
rank I in the original MPI communicator will have rank J in the
reordered communicator.
</P>
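The file format rules above (blank/comment lines, then P lines of I,J pairs forming a permutation) can be sketched as a small parser. This is a hypothetical illustration, not LAMMPS source code:

```python
# Hypothetical parser for a "-reorder custom" file (illustration only,
# not LAMMPS source). Blank and "#" comment lines are skipped; the P
# remaining "I J" lines must each column-wise be a permutation of 0..P-1.
def read_reorder_file(lines, P):
    pairs = []
    for line in lines:
        s = line.strip()
        if not s or s.startswith("#"):
            continue
        i, j = map(int, s.split())
        pairs.append((i, j))
    if len(pairs) != P:
        raise ValueError("expected %d I,J lines" % P)
    if sorted(i for i, _ in pairs) != list(range(P)) or \
       sorted(j for _, j in pairs) != list(range(P)):
        raise ValueError("I and J must each be a permutation of 0..P-1")
    return dict(pairs)  # old rank I -> new rank J

perm = read_reorder_file(["# swap ranks 0 and 1", "", "0 1", "1 0", "2 2", "3 3"], 4)
print(perm)  # {0: 1, 1: 0, 2: 2, 3: 3}
```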
<P>Note that rank ordering can also be specified by many MPI
implementations, either by environment variables that specify how to
order physical processors, or by config files that specify what
physical processors to assign to each MPI rank. The -reorder switch
simply gives you a portable way to do this without relying on MPI
itself. See the <A HREF = "processors.html">processors out</A> command for how to output
info on the final assignment of physical processors to the LAMMPS
simulation domain.
</P>
<PRE>-screen file <PRE>-screen file
</PRE> </PRE>
<P>Specify a file for LAMMPS to write its screen information to. In <P>Specify a file for LAMMPS to write its screen information to. In


@ -844,6 +844,7 @@ letter abbreviation can be used:
-p or -partition -p or -partition
-pl or -plog -pl or -plog
-ps or -pscreen -ps or -pscreen
-r or -reorder
-sc or -screen -sc or -screen
-sf or -suffix -sf or -suffix
-v or -var :ul -v or -var :ul
@ -952,10 +953,78 @@ partition screen files are created. This overrides the filename
specified in the -screen command-line option. This option is useful specified in the -screen command-line option. This option is useful
when working with large numbers of partitions, allowing the partition when working with large numbers of partitions, allowing the partition
screen files to be suppressed (-pscreen none) or placed in a screen files to be suppressed (-pscreen none) or placed in a
sub-directory (-pscreen replica_files/screen) If this option is not sub-directory (-pscreen replica_files/screen). If this option is not
used the screen file for partition N is screen.N or whatever is used the screen file for partition N is screen.N or whatever is
specified by the -screen command-line option. specified by the -screen command-line option.
-reorder nth N
-reorder custom filename :pre
Reorder the processors in the MPI communicator used to instantiate
LAMMPS, in one of several ways. The original MPI communicator ranks
all P processors from 0 to P-1. The mapping of these ranks to
physical processors is done by MPI before LAMMPS begins. It may be
useful in some cases to alter the rank order, e.g. to ensure that
cores within each node are ranked in a desired order, or, when using
the "run_style verlet/split"_run_style.html command with 2 partitions,
to ensure that a specific Kspace processor (in the 2nd partition) is
matched up with a specific set of processors in the 1st partition.
See the "Section_accelerate"_Section_accelerate.html doc pages for
more details.
If the keyword {nth} is used with a setting {N}, then it means every
Nth processor will be moved to the end of the ranking. This is useful
when using the "run_style verlet/split"_run_style.html command with 2
partitions via the -partition command-line switch. The first set of
processors will be in the first partition, the 2nd set in the 2nd
partition. The -reorder command-line switch can alter this so that
groups of N-1 procs from the 1st partition and one proc from the 2nd
partition are ordered consecutively, e.g. as the cores on one physical node.
This can boost performance. For example, if you use "-reorder nth 4"
and "-partition 9 3" and you are running on 12 processors, the
processors will be reordered from
0 1 2 3 4 5 6 7 8 9 10 11 :pre
to
0 1 2 4 5 6 8 9 10 3 7 11 :pre
so that the processors in each partition will be
0 1 2 4 5 6 8 9 10
3 7 11 :pre
See the "processors"_processors.html command for how to ensure that
processors from each partition are then grouped optimally on quad-core nodes.
If the keyword is {custom}, then a file that specifies a permutation
of the processor ranks is also specified. The format of the reorder
file is as follows. Any number of initial blank or comment lines
(starting with a "#" character) can be present. These should be
followed by P lines of the form:
I J :pre
where P is the number of processors LAMMPS was launched with. Note
that if running in multi-partition mode (see the -partition switch
above) P is the total number of processors in all partitions. The I
and J values describe a permutation of the P processors. Every I and
J should be values from 0 to P-1 inclusive. In the set of P I values,
every proc ID should appear exactly once. Ditto for the set of P J
values. A single I,J pairing means that the physical processor with
rank I in the original MPI communicator will have rank J in the
reordered communicator.
Note that rank ordering can also be specified by many MPI
implementations, either by environment variables that specify how to
order physical processors, or by config files that specify what
physical processors to assign to each MPI rank. The -reorder switch
simply gives you a portable way to do this without relying on MPI
itself. See the "processors out"_processors.html command for how to output
info on the final assignment of physical processors to the LAMMPS
simulation domain.
-screen file :pre -screen file :pre
Specify a file for LAMMPS to write its screen information to. In Specify a file for LAMMPS to write its screen information to. In


@ -19,27 +19,31 @@
<LI>zero or more keyword/arg pairs may be appended <LI>zero or more keyword/arg pairs may be appended
<LI>keyword = <I>grid</I> or <I>numa</I> or <I>part</I> <LI>keyword = <I>grid</I> or <I>level2</I> or <I>level3</I> or <I>numa</I> or <I>part</I> or <I>file</I>
<PRE> <I>grid</I> arg = <I>cart</I> or <I>cart/reorder</I> or <I>xyz</I> or <I>xzy</I> or <I>yxz</I> or <I>yzx</I> or <I>zxy</I> or <I>zyx</I> <PRE> <I>grid</I> arg = <I>cart</I> or <I>cart/reorder</I> or <I>xyz</I> or <I>xzy</I> or <I>yxz</I> or <I>yzx</I> or <I>zxy</I> or <I>zyx</I>
cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0 cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0
cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1 cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1
xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order, where I varies fastest, then J, and K slowest xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order
<I>numa</I> arg = none <I>numa</I> arg = none
<I>part</I> args = Psend Precv cstyle <I>part</I> args = Psend Precv cstyle
Psend = partition # (1 to Np) which will send its processor layout Psend = partition # (1 to Np) which will send its processor layout
Precv = partition # (1 to Np) which will recv the processor layout Precv = partition # (1 to Np) which will recv the processor layout
cstyle = <I>multiple</I> cstyle = <I>multiple</I>
<I>multiple</I> = Psend layout will be multiple of Precv layout in each dimension <I>multiple</I> = Psend layout will be multiple of Precv layout in each dimension
<I>file</I> arg = fname
fname = name of file to write processor mapping info to
</PRE> </PRE>
</UL> </UL>
<P><B>Examples:</B> <P><B>Examples:</B>
</P> </P>
<PRE>processors 2 4 4 <PRE>processors * * 5
processors * * 5 processors 2 4 4
processors * * * grid xyz processors 2 4 4 grid xyz
processors * * 8 grid xyz
processors * * * numa processors * * * numa
processors 4 8 16 custom myfile
processors * * * part 1 2 multiple processors * * * part 1 2 multiple
</PRE> </PRE>
<P><B>Description:</B> <P><B>Description:</B>
@ -49,57 +53,67 @@ simulation box. This involves 2 steps. First if there are P
processors it means choosing a factorization P = Px by Py by Pz so processors it means choosing a factorization P = Px by Py by Pz so
that there are Px processors in the x dimension, and similarly for the that there are Px processors in the x dimension, and similarly for the
y and z dimensions. Second, the P processors (with MPI ranks 0 to y and z dimensions. Second, the P processors (with MPI ranks 0 to
P-1) are mapped to the logical grid so that each grid cell is a P-1) are mapped to the logical 3d grid. The arguments to this command
processor. The arguments to this command control each of these 2 control each of these 2 steps.
steps.
</P> </P>
<P>The Px, Py, Pz parameters affect the factorization. Any of the 3 <P>The Px, Py, Pz parameters affect the factorization. Any of the 3
parameters can be specified with an asterisk "*", which means LAMMPS parameters can be specified with an asterisk "*", which means LAMMPS
will choose the number of processors in that dimension. It will do will choose the number of processors in that dimension of the grid.
this based on the size and shape of the global simulation box so as to It will do this based on the size and shape of the global simulation
minimize the surface-to-volume ratio of each processor's sub-domain. box so as to minimize the surface-to-volume ratio of each processor's
sub-domain.
</P> </P>
<P>Since LAMMPS does not load-balance by changing the grid of 3d <P>Since LAMMPS does not load-balance by changing the grid of 3d
processors on-the-fly, this choosing explicit values for Px or Py or processors on-the-fly, choosing explicit values for Px or Py or Pz can
Pz can be used to override the LAMMPS default if it is known to be be used to override the LAMMPS default if it is known to be
sub-optimal for a particular problem. For example, a problem where sub-optimal for a particular problem. E.g. a problem where the extent
the extent of atoms will change dramatically in a particular dimension of atoms will change dramatically in a particular dimension over the
over the course of the simulation. course of the simulation.
</P> </P>
<P>The product of Px, Py, Pz must equal P, the total # of processors <P>The product of Px, Py, Pz must equal P, the total # of processors
LAMMPS is running on. For a <A HREF = "dimension.html">2d simulation</A>, Pz must LAMMPS is running on. For a <A HREF = "dimension.html">2d simulation</A>, Pz must
equal 1. If multiple partitions are being used then P is the number equal 1.
of processors in this partition; see <A HREF = "Section_start.html#start_6">this
section</A> for an explanation of the
-partition command-line switch.
</P> </P>
<P>Note that if you run on a large, prime number of processors P, then a <P>Note that if you run on a large, prime number of processors P, then a
grid such as 1 x P x 1 will be required, which may incur extra grid such as 1 x P x 1 will be required, which may incur extra
communication costs due to the high surface area of each processor's communication costs due to the high surface area of each processor's
sub-domain. sub-domain.
</P> </P>
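The surface-to-volume criterion described above can be illustrated with a brute-force search over factorizations. This is a hypothetical sketch for this doc page, not the actual LAMMPS algorithm:

```python
# Hypothetical brute-force search (not the actual LAMMPS algorithm) for
# a factorization P = Px*Py*Pz that minimizes the surface area of each
# processor's sub-domain, for a box of size Lx x Ly x Lz.
def best_grid(P, Lx, Ly, Lz):
    best = None
    for px in range(1, P + 1):
        if P % px:
            continue
        for py in range(1, P // px + 1):
            if (P // px) % py:
                continue
            pz = P // (px * py)
            dx, dy, dz = Lx / px, Ly / py, Lz / pz
            half_surf = dx * dy + dy * dz + dx * dz  # half of one sub-domain's surface
            if best is None or half_surf < best[0]:
                best = (half_surf, (px, py, pz))
    return best[1]

print(best_grid(12, 1.0, 1.0, 1.0))  # (2, 2, 3) for a cubic box
```

This also shows why a prime P is bad: the only factorizations are 1 x P x 1 and its permutations, giving thin, high-surface-area slabs.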
<P>Also note that if multiple partitions are being used then P is the
number of processors in this partition; see <A HREF = "Section_start.html#start_6">this
section</A> for an explanation of the
-partition command-line switch. Note as well that you can prefix the
processors command with the <A HREF = "partition.html">partition</A> command to
easily specify different Px,Py,Pz values for different partitions.
</P>
<P>You can use the <A HREF = "partition.html">partition</A> command to specify
different processor grids for different partitions, e.g.
</P>
<PRE>partition yes 1 processors 4 4 4
partition yes 2 processors 2 3 2
</PRE>
<HR> <HR>
<P>The <I>grid</I> keyword affects how processor IDs are mapped to the 3d grid <P>The <I>grid</I> keyword affects how the P processor IDs (from 0 to P-1) are
of processors. mapped to the 3d grid of processors.
</P> </P>
<P>The <I>cart</I> style uses the family of MPI Cartesian functions to do <P>The <I>cart</I> style uses the family of MPI Cartesian functions to perform
this, namely MPI_Cart_create(), MPI_Cart_get(), MPI_Cart_shift(), and the mapping, namely MPI_Cart_create(), MPI_Cart_get(),
MPI_Cart_rank(). It invokes the MPI_Cart_create() function with its MPI_Cart_shift(), and MPI_Cart_rank(). It invokes the
reorder flag = 0, so that MPI is not free to reorder the processors. MPI_Cart_create() function with its reorder flag = 0, so that MPI is
not free to reorder the processors.
</P> </P>
<P>The <I>cart/reorder</I> style does the same thing as the <I>cart</I> style <P>The <I>cart/reorder</I> style does the same thing as the <I>cart</I> style
except it sets the reorder flag to 1, so that MPI is free to reorder except it sets the reorder flag to 1, so that MPI can reorder
processors if it desires. processors if it desires.
</P> </P>
<P>The <I>xyz</I>, <I>xzy</I>, <I>yxz</I>, <I>yzx</I>, <I>zxy</I>, and <I>zyx</I> styles are all <P>The <I>xyz</I>, <I>xzy</I>, <I>yxz</I>, <I>yzx</I>, <I>zxy</I>, and <I>zyx</I> styles are all
similar. If the style is IJK, then it explicitly maps the P similar. If the style is IJK, then it maps the P processors to the
processors to the grid so that the processor ID in the I direction grid so that the processor ID in the I direction varies fastest, the
varies fastest, the processor ID in the J direction varies next processor ID in the J direction varies next fastest, and the processor
fastest, and the processor ID in the K direction varies slowest. For ID in the K direction varies slowest. For example, if you select
example, if you select style <I>xyz</I> and you have a 2x2x2 grid of 8 style <I>xyz</I> and you have a 2x2x2 grid of 8 processors, the assignments
processors, the assignments of the 8 octants of the simulation domain of the 8 octants of the simulation domain will be:
will be:
</P> </P>
<PRE>proc 0 = lo x, lo y, lo z octant <PRE>proc 0 = lo x, lo y, lo z octant
proc 1 = hi x, lo y, lo z octant proc 1 = hi x, lo y, lo z octant
@ -114,21 +128,28 @@ proc 7 = hi x, hi y, hi z octant
should be aware of both the machine's network topology and the should be aware of both the machine's network topology and the
specific subset of processors and nodes that were assigned to your specific subset of processors and nodes that were assigned to your
simulation. Thus its MPI_Cart calls can optimize the assignment of simulation. Thus its MPI_Cart calls can optimize the assignment of
MPI processes to the 3d grid to minimize communication costs. However MPI processes to the 3d grid to minimize communication costs. In
in practice, few if any MPI implementations actually do this. So it practice, however, few if any MPI implementations actually do this.
is likely that the <I>cart</I> and <I>cart/reorder</I> styles simply give the So it is likely that the <I>cart</I> and <I>cart/reorder</I> styles simply give
same result as one of the IJK styles. the same result as one of the IJK styles.
</P> </P>
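The IJK orderings described above amount to three nested loops. A hypothetical sketch (not LAMMPS source code) for the <I>xyz</I> style, reproducing the 2x2x2 octant assignments listed above:

```python
# Hypothetical sketch (not LAMMPS source) of the "xyz" grid ordering:
# the x index varies fastest, then y, and z varies slowest.
def xyz_map(Px, Py, Pz):
    grid = {}
    rank = 0
    for k in range(Pz):            # z varies slowest
        for j in range(Py):        # y next fastest
            for i in range(Px):    # x varies fastest
                grid[rank] = (i, j, k)
                rank += 1
    return grid

g = xyz_map(2, 2, 2)               # the 2x2x2 example from the text
print(g[0], g[1], g[7])  # (0, 0, 0) (1, 0, 0) (1, 1, 1)
```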
<HR> <HR>
<P>The <I>numa</I> keyword affects both the factorization of P into Px,Py,Pz <P>The <I>numa</I> keyword affects both the factorization of P into Px,Py,Pz
and the mapping of processors to the 3d grid. and the mapping of processors to the 3d grid.
</P> </P>
<P>It will perform a two-level factorization of the simulation box to <P>It operates similarly to the <I>level2</I> and <I>level3</I> keywords except that
minimize inter-node communication. This can improve parallel it tries to auto-detect the count and topology of the processors and
efficiency by reducing network traffic. When this keyword is set, the cores within a node. Currently, it does this in only 2 levels
simulation box is first divided across nodes. Then within each node, (assuming one MPI process per core), but it may be extended in the future.
the subdomain is further divided between the cores of each node. </P>
<P>It also uses a different algorithm (iterative) than the <I>level2</I>
keyword for doing the two-level factorization of the simulation box
into a 3d processor grid to minimize off-node communication. Thus it
may give a different or improved mapping of processors to the 3d grid.
</P>
<P>The numa setting will give an error if the number of MPI processes
is not evenly divisible by the number of cores used per node.
</P> </P>
<P>The numa setting will be ignored if (a) there are less than 4 cores <P>The numa setting will be ignored if (a) there are less than 4 cores
per node, or (b) the number of MPI processes is not divisible by the per node, or (b) the number of MPI processes is not divisible by the
@ -137,14 +158,16 @@ any of the Px or Py of Pz values is greater than 1.
</P> </P>
<HR> <HR>
<P>The <I>part</I> keyword can be useful when running in multi-partition mode, <P>The <I>part</I> keyword affects the factorization of P into Px,Py,Pz.
e.g. with the <A HREF = "run_style.html<A HREF = "Section_start.html#start_6">-partition">>run_style verlet/split</A> command. It </P>
specifies a dependency between a sending partition <I>Psend</I> and a <P>It can be useful when running in multi-partition mode, e.g. with the
receiving partition <I>Precv</I> which is enforced when each is setting up <A HREF = "run_style.html">run_style verlet/split</A> command. It specifies a
their own mapping of the partitions processors to the simulation box. dependency between a sending partition <I>Psend</I> and a receiving
Each of <I>Psend</I> and <I>Precv</I> must be integers from 1 to Np, where Np is partition <I>Precv</I> which is enforced when each is setting up their own
the number of partitions you have defined via the <A HREF = </A> mapping of their processors to the simulation box. Each of <I>Psend</I>
command-line switch</A>. and <I>Precv</I> must be integers from 1 to Np, where Np is the number of
partitions you have defined via the <A HREF = "Section_start.html#start_6">-partition command-line
switch</A>.
</P> </P>
<P>A "dependency" means that the sending partition will create its 3d <P>A "dependency" means that the sending partition will create its 3d
logical grid as Px by Py by Pz and after it has done this, it will logical grid as Px by Py by Pz and after it has done this, it will
@ -165,14 +188,6 @@ processors, it could create a 4x2x10 grid, but it will not create a
2x4x10 grid, since in the y-dimension, 6 is not an integer multiple of 2x4x10 grid, since in the y-dimension, 6 is not an integer multiple of
4. 4.
</P> </P>
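The <I>multiple</I> constraint just described is a per-dimension divisibility check. A hypothetical sketch (not LAMMPS source code):

```python
# Hypothetical check (not LAMMPS source) of the {multiple} constraint:
# each dimension of the sending partition's grid must be an integer
# multiple of the corresponding dimension of the receiving grid.
def is_multiple(psend, precv):
    return all(s % r == 0 for s, r in zip(psend, precv))

print(is_multiple((4, 6, 10), (2, 3, 2)))  # True: 4/2, 6/3, 10/2
print(is_multiple((2, 4, 10), (2, 3, 2)))  # False: 4 is not a multiple of 3
```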
<HR>
<P>Note that you can use the <A HREF = "partition.html">partition</A> command to
specify different processor grids for different partitions, e.g.
</P>
<PRE>partition yes 1 processors 4 4 4
partition yes 2 processors 2 3 2
</PRE>
<P>IMPORTANT NOTE: If you use the <A HREF = "partition.html">partition</A> command to <P>IMPORTANT NOTE: If you use the <A HREF = "partition.html">partition</A> command to
invoke different "processors" commands on different partitions, and invoke different "processors" commands on different partitions, and
you also use the <I>part</I> keyword, then you must ensure that both the you also use the <I>part</I> keyword, then you must ensure that both the
@ -183,6 +198,39 @@ setup phase if this error has been made.
</P> </P>
<HR> <HR>
<P>The <I>out</I> keyword writes the factorization of the P
processors and their mapping to the 3d grid to the specified file
<I>fname</I>. This is useful to check that you assigned physical
processors in the manner you desired, which can be tricky to figure
out, especially when running on multiple partitions or on a multicore
machine or when the processor ranks were reordered by use of the
<A HREF = "Section_start.html#start_6">-reorder command-line switch</A> or due to
use of MPI-specific launch options such as a config file.
</P>
<P>If you have multiple partitions you should ensure that each one writes
to a different file, e.g. using a <A HREF = "variable.html">world-style variable</A>
for the filename. The file will have a self-explanatory header,
followed by one line per processor in this format:
</P>
<PRE>I J K: world-ID universe-ID original-ID: name
</PRE>
<P>I,J,K are the indices of the processor in the 3d logical grid. The
IDs are the processor's rank in this simulation (the world), the
universe (of multiple simulations), and the original MPI communicator
used to instantiate LAMMPS, respectively. The world and universe IDs
will only be different if you are running on more than one partition;
see the <A HREF = "Section_start.html#start_6">-partition command-line switch</A>.
The universe and original IDs will only be different if you used the
<A HREF = "Section_start.html#start_6">-reorder command-line switch</A> to reorder
the processors differently than their rank in the original
communicator LAMMPS was instantiated with. The <I>name</I> is what is
returned by a call to MPI_Get_processor_name() and should represent an
identifier relevant to the physical processors in your machine. Note
that depending on the MPI implementation, multiple cores can have the
same <I>name</I>.
</P>
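Assuming the one-line-per-processor format shown above, the file body can be parsed with a short helper. This is a hypothetical illustration, not LAMMPS source code:

```python
# Hypothetical parser (not LAMMPS source) for one body line of the
# mapping file, assuming the format shown above:
#   I J K: world-ID universe-ID original-ID: name
def parse_map_line(line):
    coords, ids, name = line.split(":", 2)
    i, j, k = map(int, coords.split())
    world, universe, orig = map(int, ids.split())
    return (i, j, k), (world, universe, orig), name.strip()

print(parse_map_line("0 1 0: 2 2 2: node17"))
# ((0, 1, 0), (2, 2, 2), 'node17')
```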
<HR>
<P><B>Restrictions:</B> <P><B>Restrictions:</B>
</P> </P>
<P>This command cannot be used after the simulation box is defined by a <P>This command cannot be used after the simulation box is defined by a
@ -190,13 +238,19 @@ setup phase if this error has been made.
It can be used before a restart file is read to change the 3d It can be used before a restart file is read to change the 3d
processor grid from what is specified in the restart file. processor grid from what is specified in the restart file.
</P> </P>
<P>The <I>numa</I> keyword cannot be used with the <I>part</I> keyword, or <P>You cannot use more than one of the <I>level2</I>, <I>level3</I>, or <I>numa</I>
with any <I>grid</I> setting other than <I>cart</I>. keywords.
</P> </P>
<P><B>Related commands:</B> none <P>The <I>numa</I> keyword cannot be used with the <I>part</I> keyword, and it
ignores the <I>grid</I> setting.
</P>
<P><B>Related commands:</B>
</P>
<P><A HREF = "partition.html">partition</A>, <A HREF = "Section_start.html#start_6">-reorder command-line
switch</A>
</P> </P>
<P><B>Default:</B> <P><B>Default:</B>
</P> </P>
<P>The option defaults are Px Py Pz = * * *, grid = cart, numa = 0. <P>The option defaults are Px Py Pz = * * * and grid = cart.
</P> </P>
</HTML> </HTML>


@ -14,25 +14,29 @@ processors Px Py Pz keyword args ... :pre
Px,Py,Pz = # of processors in each dimension of a 3d grid :ulb,l Px,Py,Pz = # of processors in each dimension of a 3d grid :ulb,l
zero or more keyword/arg pairs may be appended :l zero or more keyword/arg pairs may be appended :l
keyword = {grid} or {numa} or {part} :l keyword = {grid} or {level2} or {level3} or {numa} or {part} or {file} :l
{grid} arg = {cart} or {cart/reorder} or {xyz} or {xzy} or {yxz} or {yzx} or {zxy} or {zyx} {grid} arg = {cart} or {cart/reorder} or {xyz} or {xzy} or {yxz} or {yzx} or {zxy} or {zyx}
cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0 cart = use MPI_Cart() methods to layout 3d grid of procs with reorder = 0
cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1 cart/reorder = use MPI_Cart() methods to layout 3d grid of procs with reorder = 1
xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order, where I varies fastest, then J, and K slowest xyz,xzy,yxz,yzx,zxy,zyx = layout 3d grid of procs in IJK order
{numa} arg = none {numa} arg = none
{part} args = Psend Precv cstyle {part} args = Psend Precv cstyle
Psend = partition # (1 to Np) which will send its processor layout Psend = partition # (1 to Np) which will send its processor layout
Precv = partition # (1 to Np) which will recv the processor layout Precv = partition # (1 to Np) which will recv the processor layout
cstyle = {multiple} cstyle = {multiple}
{multiple} = Psend layout will be multiple of Precv layout in each dimension :pre {multiple} = Psend layout will be multiple of Precv layout in each dimension
{file} arg = fname
fname = name of file to write processor mapping info to :pre
:ule :ule
[Examples:] [Examples:]
processors 2 4 4
processors * * 5 processors * * 5
processors * * * grid xyz processors 2 4 4
processors 2 4 4 grid xyz
processors * * 8 grid xyz
processors * * * numa processors * * * numa
processors 4 8 16 custom myfile
processors * * * part 1 2 multiple :pre processors * * * part 1 2 multiple :pre
[Description:] [Description:]
@ -42,57 +46,67 @@ simulation box. This involves 2 steps. First if there are P
processors it means choosing a factorization P = Px by Py by Pz so processors it means choosing a factorization P = Px by Py by Pz so
that there are Px processors in the x dimension, and similarly for the that there are Px processors in the x dimension, and similarly for the
y and z dimensions. Second, the P processors (with MPI ranks 0 to y and z dimensions. Second, the P processors (with MPI ranks 0 to
P-1) are mapped to the logical grid so that each grid cell is a P-1) are mapped to the logical 3d grid. The arguments to this command
processor. The arguments to this command control each of these 2 control each of these 2 steps.
steps.
The Px, Py, Pz parameters affect the factorization. Any of the 3 The Px, Py, Pz parameters affect the factorization. Any of the 3
parameters can be specified with an asterisk "*", which means LAMMPS parameters can be specified with an asterisk "*", which means LAMMPS
will choose the number of processors in that dimension. It will do will choose the number of processors in that dimension of the grid.
this based on the size and shape of the global simulation box so as to It will do this based on the size and shape of the global simulation
minimize the surface-to-volume ratio of each processor's sub-domain. box so as to minimize the surface-to-volume ratio of each processor's
sub-domain.
Since LAMMPS does not load-balance by changing the grid of 3d Since LAMMPS does not load-balance by changing the grid of 3d
processors on-the-fly, this choosing explicit values for Px or Py or processors on-the-fly, choosing explicit values for Px or Py or Pz can
Pz can be used to override the LAMMPS default if it is known to be be used to override the LAMMPS default if it is known to be
sub-optimal for a particular problem. For example, a problem where sub-optimal for a particular problem. E.g. a problem where the extent
the extent of atoms will change dramatically in a particular dimension of atoms will change dramatically in a particular dimension over the
over the course of the simulation. course of the simulation.
The product of Px, Py, Pz must equal P, the total # of processors The product of Px, Py, Pz must equal P, the total # of processors
LAMMPS is running on. For a "2d simulation"_dimension.html, Pz must LAMMPS is running on. For a "2d simulation"_dimension.html, Pz must
equal 1. If multiple partitions are being used then P is the number equal 1.
of processors in this partition; see "this
section"_Section_start.html#start_6 for an explanation of the
-partition command-line switch.
Note that if you run on a large, prime number of processors P, then a Note that if you run on a large, prime number of processors P, then a
grid such as 1 x P x 1 will be required, which may incur extra grid such as 1 x P x 1 will be required, which may incur extra
communication costs due to the high surface area of each processor's communication costs due to the high surface area of each processor's
sub-domain. sub-domain.
Also note that if multiple partitions are being used then P is the
number of processors in this partition; see "this
section"_Section_start.html#start_6 for an explanation of the
-partition command-line switch. Note as well that you can prefix the
processors command with the "partition"_partition.html command to
easily specify different Px,Py,Pz values for different partitions.
You can use the "partition"_partition.html command to specify
different processor grids for different partitions, e.g.
partition yes 1 processors 4 4 4
partition yes 2 processors 2 3 2 :pre
:line :line
The {grid} keyword affects how the P processor IDs (from 0 to P-1) are
mapped to the 3d grid of processors.

The {cart} style uses the family of MPI Cartesian functions to perform
the mapping, namely MPI_Cart_create(), MPI_Cart_get(),
MPI_Cart_shift(), and MPI_Cart_rank().  It invokes the
MPI_Cart_create() function with its reorder flag = 0, so that MPI is
not free to reorder the processors.
The {cart/reorder} style does the same thing as the {cart} style
except it sets the reorder flag to 1, so that MPI can reorder
processors if it desires.
The {xyz}, {xzy}, {yxz}, {yzx}, {zxy}, and {zyx} styles are all
similar.  If the style is IJK, then it maps the P processors to the
grid so that the processor ID in the I direction varies fastest, the
processor ID in the J direction varies next fastest, and the processor
ID in the K direction varies slowest.  For example, if you select
style {xyz} and you have a 2x2x2 grid of 8 processors, the assignments
of the 8 octants of the simulation domain will be:

proc 0 = lo x, lo y, lo z octant
proc 1 = hi x, lo y, lo z octant
proc 2 = lo x, hi y, lo z octant
proc 3 = hi x, hi y, lo z octant
proc 4 = lo x, lo y, hi z octant
proc 5 = hi x, lo y, hi z octant
proc 6 = lo x, hi y, hi z octant
proc 7 = hi x, hi y, hi z octant :pre

Note that, in principle, an MPI implementation on a particular machine
should be aware of both the machine's network topology and the
specific subset of processors and nodes that were assigned to your
simulation.  Thus its MPI_Cart calls can optimize the assignment of
MPI processes to the 3d grid to minimize communication costs.  In
practice, however, few if any MPI implementations actually do this.
So it is likely that the {cart} and {cart/reorder} styles simply give
the same result as one of the IJK styles.
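As an illustration (not LAMMPS source code), the IJK ordering described
above amounts to treating the rank as a mixed-radix number whose
fastest-varying digit is the I index:

```python
# Illustration (not LAMMPS source): map ranks 0..P-1 onto a 3d grid in
# IJK order, where the I index varies fastest and the K index slowest.

def ijk_map(rank, pi, pj, pk):
    """Return (i, j, k) grid indices for a rank; I fastest, K slowest."""
    i = rank % pi
    j = (rank // pi) % pj
    k = rank // (pi * pj)
    return i, j, k

# Style {xyz} on a 2x2x2 grid of 8 processors, (i,j,k) = (x,y,z):
# rank 0 -> (0,0,0), the lo-x, lo-y, lo-z octant
# rank 1 -> (1,0,0), the hi-x, lo-y, lo-z octant
for rank in range(8):
    print(rank, ijk_map(rank, 2, 2, 2))
```

The other five styles follow by permuting which of x, y, z plays the I, J,
and K roles.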
:line
The {numa} keyword affects both the factorization of P into Px,Py,Pz
and the mapping of processors to the 3d grid.
It operates similar to the {level2} and {level3} keywords except that
it tries to auto-detect the count and topology of the processors and
cores within a node.  Currently, it does this in only 2 levels (nodes
and the cores within a node), but it may be extended in the future.

It also uses a different algorithm (iterative) than the {level2}
keyword for doing the two-level factorization of the simulation box
into a 3d processor grid to minimize off-node communication.  Thus it
may give a different or improved mapping of processors to the 3d grid.
The numa setting will be ignored if (a) there are less than 4 cores
per node, or (b) the number of MPI processes is not divisible by the
number of cores used per node, or (c) any of the Px or Py or Pz values
is greater than 1.
:line
The {part} keyword affects the factorization of P into Px,Py,Pz.

It can be useful when running in multi-partition mode, e.g. with the
"run_style verlet/split"_run_style.html command.  It specifies a
dependency between a sending partition {Psend} and a receiving
partition {Precv} which is enforced when each is setting up its own
mapping of its processors to the simulation box.  Each of {Psend} and
{Precv} must be an integer from 1 to Np, where Np is the number of
partitions you have defined via the "-partition command-line
switch"_Section_start.html#start_6.
A "dependency" means that the sending partition will create its 3d
logical grid as Px by Py by Pz and after it has done this, it will
communicate its grid to the receiving partition.  The receiving
partition must then create its own logical grid so that each of its
Px,Py,Pz values evenly divides the corresponding value of the sending
partition's grid.  E.g. if the sending partition's grid has Py = 6 and
the receiving partition is setting up a 3d grid for 80
processors, it could create a 4x2x10 grid, but it will not create a
2x4x10 grid, since in the y-dimension, 6 is not an integer multiple of
4.
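The divisibility check behind that example can be sketched as follows (an
illustration, not LAMMPS source; the 8x6x10 sending grid is hypothetical,
chosen only so that its Py matches the 6 in the example):

```python
# Illustration (not LAMMPS source): check the {part} constraint implied
# by the example above, that each dimension of the sending partition's
# grid must be an integer multiple of the receiving partition's.

def compatible(psend_grid, precv_grid):
    """True if each sending dimension is a multiple of the receiving one."""
    return all(s % r == 0 for s, r in zip(psend_grid, precv_grid))

# Hypothetical sending grid 8x6x10; receiving partition has 80 procs:
print(compatible((8, 6, 10), (4, 2, 10)))  # True:  6 is a multiple of 2
print(compatible((8, 6, 10), (2, 4, 10)))  # False: 6 is not a multiple of 4
```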
IMPORTANT NOTE: If you use the "partition"_partition.html command to
invoke different "processors" commands on different partitions, and
you also use the {part} keyword, then you must insure that both the
sending and receiving partitions create consistent 3d processor
grids.  LAMMPS will generate an error during the
setup phase if this error has been made.
:line
The {out} keyword writes the factorization of P into Px,Py,Pz and the
mapping of processors to the 3d grid to the specified file {fname}.
This is useful to check that you assigned physical processors in the
manner you desired, which can be tricky to figure out, especially when
running on multiple partitions or on a multicore machine or when the
processor ranks were reordered by use of the "-reorder command-line
switch"_Section_start.html#start_6 or due to use of MPI-specific
launch options such as a config file.
If you have multiple partitions you should insure that each one writes
to a different file, e.g. using a "world-style variable"_variable.html
for the filename.  The file will have a self-explanatory header,
followed by one line per processor in this format:

I J K: world-ID universe-ID original-ID: name :pre
I,J,K are the indices of the processor in the 3d logical grid. The
IDs are the processor's rank in this simulation (the world), the
universe (of multiple simulations), and the original MPI communicator
used to instantiate LAMMPS, respectively. The world and universe IDs
will only be different if you are running on more than one partition;
see the "-partition command-line switch"_Section_start.html#start_6.
The universe and original IDs will only be different if you used the
"-reorder command-line switch"_Section_start.html#start_6 to reorder
the processors differently than their rank in the original
communicator LAMMPS was instantiated with. The {name} is what is
returned by a call to MPI_Get_processor_name() and should represent an
identifier relevant to the physical processors in your machine. Note
that depending on the MPI implementation, multiple cores can have the
same {name}.
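A line of this format can be picked apart as sketched below (an
illustration, not LAMMPS source; the "node17" name and the ID values are
hypothetical):

```python
# Illustration (not LAMMPS source): parse one line of the {out} file
# format described above, "I J K: world-ID universe-ID original-ID: name".

def parse_out_line(line):
    """Split a line into grid indices, the three rank IDs, and the name."""
    grid_part, ids_part, name = (s.strip() for s in line.split(":", 2))
    i, j, k = (int(x) for x in grid_part.split())
    world, universe, original = (int(x) for x in ids_part.split())
    return (i, j, k), (world, universe, original), name

# A hypothetical entry for the processor at grid position (0, 1, 0),
# where all three IDs happen to agree (single partition, no reordering):
print(parse_out_line("0 1 0: 4 4 4: node17"))
```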
:line
[Restrictions:]
This command cannot be used after the simulation box is defined by a
"read_data"_read_data.html or "create_box"_create_box.html command.
It can be used before a restart file is read to change the 3d
processor grid from what is specified in the restart file.
You cannot use more than one of the {level2}, {level3}, or {numa}
keywords.

The {numa} keyword cannot be used with the {part} keyword, and it
ignores the {grid} setting.

[Related commands:]

"partition"_partition.html, "-reorder command-line
switch"_Section_start.html#start_6
[Default:]
The option defaults are Px Py Pz = * * * and grid = cart.