forked from lijiext/lammps
187 lines
8.4 KiB
Plaintext
187 lines
8.4 KiB
Plaintext
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
|
|
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
|
|
|
|
:link(lws,http://lammps.sandia.gov)
|
|
:link(ld,Manual.html)
|
|
:link(lc,Section_commands.html#comm)
|
|
|
|
:line
|
|
|
|
"Return to Section accelerate overview"_Section_accelerate.html
|
|
|
|
5.3.5 USER-OMP package :h4
|
|
|
|
The USER-OMP package was developed by Axel Kohlmeyer at Temple
|
|
University. It provides multi-threaded versions of most pair styles,
|
|
nearly all bonded styles (bond, angle, dihedral, improper), several
|
|
Kspace styles, and a few fix styles. The package currently uses the
|
|
OpenMP interface for multi-threading.
|
|
|
|
Here is a quick overview of how to use the USER-OMP package, assuming
|
|
one or more 16-core nodes. More details follow.
|
|
|
|
use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine
|
|
make yes-user-omp
|
|
make mpi # build with USER-OMP package, if settings added to Makefile.mpi
|
|
make omp # or Makefile.omp already has settings
|
|
Make.py -v -p omp -o mpi -a file mpi # or one-line build via Make.py :pre
|
|
|
|
lmp_mpi -sf omp -pk omp 16 < in.script # 1 MPI task, 16 threads
|
|
mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 4 MPI tasks, 4 threads/task
|
|
mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
|
|
|
|
[Required hardware/software:]
|
|
|
|
Your compiler must support the OpenMP interface. You should have one
|
|
or more multi-core CPUs so that multiple threads can be launched by
|
|
each MPI task running on a CPU.
|
|
|
|
[Building LAMMPS with the USER-OMP package:]
|
|
|
|
The lines above illustrate how to include/build with the USER-OMP
|
|
package in two steps, using the "make" command. Or how to do it with
|
|
one command via the src/Make.py script, described in "Section
|
|
2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for
|
|
help.
|
|
|
|
Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
|
|
include "-fopenmp". Likewise, if you use an Intel compiler, the
|
|
CCFLAGS setting must include "-restrict". The Make.py command will
|
|
add these automatically.
|
|
|
|
[Run with the USER-OMP package from the command line:]
|
|
|
|
The mpirun or mpiexec command sets the total number of MPI tasks used
|
|
by LAMMPS (one or multiple per compute node) and the number of MPI
|
|
tasks used per node. E.g. the mpirun command in MPICH does this via
|
|
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
|
|
|
|
You need to choose how many OpenMP threads per MPI task will be used
|
|
by the USER-OMP package. Note that the product of MPI tasks *
|
|
threads/task should not exceed the physical number of cores (on a
|
|
node), otherwise performance will suffer.
|
|
|
|
As in the lines above, use the "-sf omp" "command-line
|
|
switch"_Section_start.html#start_7, which will automatically append
|
|
"omp" to styles that support it. The "-sf omp" switch also issues a
|
|
default "package omp 0"_package.html command, which will set the
|
|
number of threads per MPI task via the OMP_NUM_THREADS environment
|
|
variable.
|
|
|
|
You can also use the "-pk omp Nt" "command-line
|
|
switch"_Section_start.html#start_7, to explicitly set Nt = # of OpenMP
|
|
threads per MPI task to use, as well as additional options. Its
|
|
syntax is the same as the "package omp"_package.html command whose doc
|
|
page gives details, including the default values used if it is not
|
|
specified. It also gives more details on how to set the number of
|
|
threads via the OMP_NUM_THREADS environment variable.
|
|
|
|
[Or run with the USER-OMP package by editing an input script:]
|
|
|
|
The discussion above for the mpirun/mpiexec command, MPI tasks/node,
|
|
and threads/MPI task is the same.
|
|
|
|
Use the "suffix omp"_suffix.html command, or you can explicitly add an
|
|
"omp" suffix to individual styles in your input script, e.g.
|
|
|
|
pair_style lj/cut/omp 2.5 :pre
|
|
|
|
You must also use the "package omp"_package.html command to enable the
|
|
USER-OMP package. When you do this you also specify how many threads
|
|
per MPI task to use. The command doc page explains other options and
|
|
how to set the number of threads via the OMP_NUM_THREADS environment
|
|
variable.
|
|
|
|
[Speed-ups to expect:]
|
|
|
|
Depending on which styles are accelerated, you should look for a
|
|
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
|
|
time" values printed at the end of a run.
|
|
|
|
You may see a small performance advantage (5 to 20%) when running a
|
|
USER-OMP style (in serial or parallel) with a single thread per MPI
|
|
task, versus running standard LAMMPS with its standard un-accelerated
|
|
styles (in serial or all-MPI parallelization with 1 task/core). This
|
|
is because many of the USER-OMP styles contain similar optimizations
|
|
to those used in the OPT package, described in "Section accelerate
|
|
5.3.6"_accelerate_opt.html.
|
|
|
|
With multiple threads/task, the optimal choice of number of MPI
|
|
tasks/node and OpenMP threads/task can vary a lot and should always be
|
|
tested via benchmark runs for a specific simulation running on a
|
|
specific machine, paying attention to guidelines discussed in the next
|
|
sub-section.
|
|
|
|
A description of the multi-threading strategy used in the USER-OMP
|
|
package and some performance examples are "presented
|
|
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
|
|
|
|
[Guidelines for best performance:]
|
|
|
|
For many problems on current generation CPUs, running the USER-OMP
|
|
package with a single thread/task is faster than running with multiple
|
|
threads/task. This is because the MPI parallelization in LAMMPS is
|
|
often more efficient than multi-threading as implemented in the
|
|
USER-OMP package. The parallel efficiency (in a threaded sense) also
|
|
varies for different USER-OMP styles.
|
|
|
|
Using multiple threads/task can be more effective under the following
|
|
circumstances:
|
|
|
|
Individual compute nodes have a significant number of CPU cores but
|
|
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
|
|
(Clovertown) and 54xx (Harpertown) quad-core processors. Running one
|
|
MPI task per CPU core will result in significant performance
|
|
degradation, so that running with 4 or even only 2 MPI tasks per node
|
|
is faster. Running in hybrid MPI+OpenMP mode will reduce the
|
|
inter-node communication bandwidth contention in the same way, but
|
|
offers an additional speedup by utilizing the otherwise idle CPU
|
|
cores. :ulb,l
|
|
|
|
The interconnect used for MPI communication does not provide
|
|
sufficient bandwidth for a large number of MPI tasks per node. For
|
|
example, this applies to running over gigabit ethernet or on Cray XT4
|
|
or XT5 series supercomputers. As in the aforementioned case, this
|
|
effect worsens when using an increasing number of nodes. :l
|
|
|
|
The system has a spatially inhomogeneous particle density which does
|
|
not map well to the "domain decomposition scheme"_processors.html or
|
|
"load-balancing"_balance.html options that LAMMPS provides. This is
|
|
because multi-threading achives parallelism over the number of
|
|
particles, not via their distribution in space. :l
|
|
|
|
A machine is being used in "capability mode", i.e. near the point
|
|
where MPI parallelism is maxed out. For example, this can happen when
|
|
using the "PPPM solver"_kspace_style.html for long-range
|
|
electrostatics on large numbers of nodes. The scaling of the KSpace
|
|
calculation (see the "kspace_style"_kspace_style.html command) becomes
|
|
the performance-limiting factor. Using multi-threading allows less
|
|
MPI tasks to be invoked and can speed-up the long-range solver, while
|
|
increasing overall performance by parallelizing the pairwise and
|
|
bonded calculations via OpenMP. Likewise additional speedup can be
|
|
sometimes be achived by increasing the length of the Coulombic cutoff
|
|
and thus reducing the work done by the long-range solver. Using the
|
|
"run_style verlet/split"_run_style.html command, which is compatible
|
|
with the USER-OMP package, is an alternative way to reduce the number
|
|
of MPI tasks assigned to the KSpace calculation. :l,ule
|
|
|
|
Additional performance tips are as follows:
|
|
|
|
The best parallel efficiency from {omp} styles is typically achieved
|
|
when there is at least one MPI task per physical CPU chip, i.e. socket
|
|
or die. :ulb,l
|
|
|
|
It is usually most efficient to restrict threading to a single
|
|
socket, i.e. use one or more MPI task per socket. :l
|
|
|
|
IMPORTANT NOTE: By default, several current MPI implementations use a
|
|
processor affinity setting that restricts each MPI task to a single
|
|
CPU core. Using multi-threading in this mode will force all threads
|
|
to share the one core and thus is likely to be counterproductive.
|
|
Instead, binding MPI tasks to a (multi-core) socket, should solve this
|
|
issue. :l,ule
|
|
|
|
[Restrictions:]
|
|
|
|
None.
|