lammps/doc/accelerate_omp.txt

"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.5 USER-OMP package :h4

The USER-OMP package was developed by Axel Kohlmeyer at Temple
University.  It provides multi-threaded versions of most pair styles,
nearly all bonded styles (bond, angle, dihedral, improper), several
Kspace styles, and a few fix styles.  The package currently
uses the OpenMP interface for multi-threading.

Here is a quick overview of how to use the USER-OMP package:

use the -fopenmp flag for compiling and linking in your Makefile.machine
include the USER-OMP package and build LAMMPS
use the mpirun command to set the number of MPI tasks/node
specify how many threads per MPI task to use
use USER-OMP styles in your input script :ul

The last two steps can be done using the "-pk omp" and "-sf omp"
"command-line switches"_Section_start.html#start_7 respectively.  Or
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the "package omp"_package.html or "suffix omp"_suffix.html commands
respectively to your input script.

[Required hardware/software:]

Your compiler must support the OpenMP interface.  You should have one
or more multi-core CPUs so that multiple threads can be launched by an
MPI task running on a CPU.

[Building LAMMPS with the USER-OMP package:]

Include the package and build LAMMPS:

cd lammps/src
make yes-user-omp
make machine :pre

Your src/MAKE/Makefile.machine needs a flag for OpenMP support in both
the CCFLAGS and LINKFLAGS variables.  For GNU and Intel compilers,
this flag is "-fopenmp".  Without this flag the USER-OMP styles will
still be compiled and work, but will not support multi-threading.
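
Since a build without the flag still succeeds, it can help to check the
makefile before compiling.  The following is a minimal sketch, not part
of LAMMPS: the function name check_openmp_flags is hypothetical, and it
assumes the flag is spelled "-fopenmp" and that each variable is
assigned on a single line.

```shell
# check_openmp_flags MAKEFILE
# Prints "ok" if both CCFLAGS and LINKFLAGS assignments contain -fopenmp,
# otherwise reports the first variable that lacks it and returns 1.
# (Hypothetical helper; assumes single-line variable assignments.)
check_openmp_flags() {
  makefile=$1
  for var in CCFLAGS LINKFLAGS; do
    if ! grep -Eq "^[[:space:]]*${var}[[:space:]]*[+:]?=.*-fopenmp" "$makefile"; then
      echo "missing -fopenmp in $var"
      return 1
    fi
  done
  echo "ok"
}

# Example call (path taken from the text above):
#   check_openmp_flags src/MAKE/Makefile.machine
```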

[Run with the USER-OMP package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node.  E.g. the mpirun command does this via its -np
and -ppn switches.

You need to choose how many threads per MPI task will be used by the
USER-OMP package.  Note that the product of MPI tasks/node *
threads/task should not exceed the number of physical cores on a node,
otherwise performance will suffer.
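
The constraint above can be turned into a small calculation.  This is
an illustrative sketch only; the function name threads_per_task is
hypothetical, and the core count must be supplied by you for your own
machine.

```shell
# threads_per_task CORES_PER_NODE TASKS_PER_NODE
# Prints the largest thread count such that tasks * threads <= cores.
# (Hypothetical helper, not part of LAMMPS.)
threads_per_task() {
  cores=$1
  tasks=$2
  if [ "$tasks" -lt 1 ] || [ "$tasks" -gt "$cores" ]; then
    echo "tasks per node must be between 1 and $cores" >&2
    return 1
  fi
  echo $(( cores / tasks ))
}

threads_per_task 16 4    # 4 MPI tasks on a 16-core node
```

The printed value is what would be passed as Nt to the "-pk omp Nt"
switch.  When the core count is not divisible by the task count, the
result is rounded down, which leaves some cores idle.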

Use the "-sf omp" "command-line switch"_Section_start.html#start_7,
which will automatically append "omp" to styles that support it.  Use
the "-pk omp Nt" "command-line switch"_Section_start.html#start_7, to
set Nt = # of OpenMP threads per MPI task to use.

lmp_machine -sf omp -pk omp 16 -in in.script                       # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script           # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script   # ditto on 8 16-core nodes :pre

Note that if the "-sf omp" switch is used, it also issues a default
"package omp 0"_package.html command, which sets the number of threads
per MPI task via the OMP_NUM_THREADS environment variable.

Using the "-pk" switch explicitly allows for direct setting of the
number of threads and additional options.  Its syntax is the same as
the "package omp" command.  See the "package"_package.html command doc
page for details, including the default values used for all its
options if it is not specified, and how to set the number of threads
via the OMP_NUM_THREADS environment variable if desired.
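
As a sketch of the environment-variable route: with "-sf omp" and no
"-pk omp Nt" switch, the default "package omp 0" command takes the
thread count from OMP_NUM_THREADS.  The value 4 and the task count are
assumptions for a 16-core node.  Whether the variable is forwarded to
MPI tasks on other nodes depends on your MPI implementation, so check
its documentation.

```shell
# Assumption: 4 threads per MPI task on a 16-core node; adjust as needed.
export OMP_NUM_THREADS=4

# No "-pk omp Nt" here, so the default "package omp 0" issued by "-sf omp"
# reads the thread count from OMP_NUM_THREADS.
cmd="mpirun -np 4 lmp_machine -sf omp -in in.script"

# Print the command; run it directly instead of echoing to launch LAMMPS.
echo "$cmd"
```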

[Or run with the USER-OMP package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node,
and threads/MPI task is the same.

Use the "suffix omp"_suffix.html command, or you can explicitly add an
"omp" suffix to individual styles in your input script, e.g.

pair_style lj/cut/omp 2.5 :pre

You must also use the "package omp"_package.html command to enable the
USER-OMP package, unless the "-sf omp" or "-pk omp" "command-line
switches"_Section_start.html#start_7 were used.  It specifies how many
threads per MPI task to use, as well as other options.  Its doc page
explains how to set the number of threads via an environment variable
if desired.
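
Putting the two commands together, a minimal sketch of the relevant
input-script lines could look as follows.  The file name in.omp_sketch
and the thread count of 4 are assumptions for illustration, and the
fragment is not a complete input script.

```shell
# Write a hypothetical input-script fragment to in.omp_sketch.
cat > in.omp_sketch <<'EOF'
package omp 4              # use 4 OpenMP threads per MPI task
suffix omp                 # append "omp" to styles that support it
pair_style lj/cut 2.5      # runs as lj/cut/omp because of the suffix command
EOF

# Show the fragment.
cat in.omp_sketch
```

With these lines in the input script, neither the "-sf omp" nor the
"-pk omp" switch is needed on the command line.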

[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a
reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
time" values printed at the end of a run.  

You may see a small performance advantage (5 to 20%) when running a
USER-OMP style (in serial or parallel) with a single thread per MPI
task, versus running standard LAMMPS with its standard
(un-accelerated) styles (in serial or all-MPI parallelization with 1
task/core).  This is because many of the USER-OMP styles contain
optimizations similar to those used in the OPT package.

With multiple threads/task, the optimal choice of MPI tasks/node and
OpenMP threads/task can vary a lot and should always be tested via
benchmark runs for a specific simulation running on a specific
machine, paying attention to guidelines discussed in the next
sub-section.
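
One way to organize such benchmark runs on a single node is to list
every combination of MPI tasks and threads whose product equals the
core count.  The sketch below only prints the command lines; the
function name list_hybrid_runs is hypothetical, and lmp_machine and
in.script are the placeholder names used earlier on this page.

```shell
# list_hybrid_runs CORES INPUT
# Prints one mpirun command per (tasks, threads) pair with
# tasks * threads == CORES.  (Hypothetical helper, not part of LAMMPS.)
list_hybrid_runs() {
  cores=$1
  input=$2
  tasks=1
  while [ "$tasks" -le "$cores" ]; do
    if [ $(( cores % tasks )) -eq 0 ]; then
      threads=$(( cores / tasks ))
      echo "mpirun -np $tasks lmp_machine -sf omp -pk omp $threads -in $input"
    fi
    tasks=$(( tasks + 1 ))
  done
}

list_hybrid_runs 16 in.script
```

Running each printed command and comparing the "Loop time" values
shows which split works best for that simulation on that machine.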

A description of the multi-threading strategy used in the USER-OMP
package and some performance examples are "presented
here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP
package with a single thread/task is faster than running with multiple
threads/task.  This is because the MPI parallelization in LAMMPS is
often more efficient than multi-threading as implemented in the
USER-OMP package.  The parallel efficiency (in a threaded sense) also
varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following
circumstances:

Individual compute nodes have a significant number of CPU cores but
the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
(Clovertown) and 54xx (Harpertown) quad core processors. Running one
MPI task per CPU core will result in significant performance
degradation, so that running with 4 or even only 2 MPI tasks per node
is faster.  Running in hybrid MPI+OpenMP mode will reduce the
inter-node communication bandwidth contention in the same way, but
offers an additional speedup by utilizing the otherwise idle CPU
cores. :ulb,l

The interconnect used for MPI communication does not provide
sufficient bandwidth for a large number of MPI tasks per node.  For
example, this applies to running over gigabit ethernet or on Cray XT4
or XT5 series supercomputers.  As in the aforementioned case, this
effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does
not map well to the "domain decomposition scheme"_processors.html or
"load-balancing"_balance.html options that LAMMPS provides.  This is
because multi-threading achieves parallelism over the number of
particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point
where MPI parallelism is maxed out.  For example, this can happen when
using the "PPPM solver"_kspace_style.html for long-range
electrostatics on large numbers of nodes.  The scaling of the KSpace
calculation (see the "kspace_style"_kspace_style.html command) becomes
the performance-limiting factor.  Using multi-threading allows fewer
MPI tasks to be invoked and can speed up the long-range solver, while
increasing overall performance by parallelizing the pairwise and
bonded calculations via OpenMP.  Likewise, additional speedup can
sometimes be achieved by increasing the length of the Coulombic cutoff
and thus reducing the work done by the long-range solver.  Using the
"run_style verlet/split"_run_style.html command, which is compatible
with the USER-OMP package, is an alternative way to reduce the number
of MPI tasks assigned to the KSpace calculation. :l,ule

Additional performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved
when there is at least one MPI task per physical processor,
i.e. socket or die. :ulb,l

It is usually most efficient to restrict threading to a single
socket, i.e. use one or more MPI tasks per socket. :l

Several current MPI implementations by default use a processor affinity
setting that restricts each MPI task to a single CPU core.  Using
multi-threading in this mode will force the threads to share that core
and thus is likely to be counterproductive.  Instead, binding MPI
tasks to a (multi-core) socket should solve this issue. :l,ule
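
How binding is requested depends on the MPI implementation.  As a
sketch only, and assuming a recent Open MPI whose mpirun accepts the
--map-by, --bind-to, and --report-bindings options, one MPI task per
socket on a hypothetical node with 2 sockets of 8 cores each could be
requested as shown below.  Other MPI libraries use different switches,
so consult their documentation.

```shell
# Assumptions: Open MPI mpirun; one node with 2 sockets x 8 cores.
# One MPI task per socket, 8 OpenMP threads per task.
cmd="mpirun -np 2 --map-by socket --bind-to socket --report-bindings lmp_machine -sf omp -pk omp 8 -in in.script"

# Print the command; run it directly instead of echoing to launch LAMMPS.
echo "$cmd"
```

The --report-bindings option makes mpirun print where each task was
bound, which is a quick way to confirm that tasks are not pinned to a
single core.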

[Restrictions:]

None.
git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12464 f3b2605a-c512-4ea7-a41b-209d697bcdaa 2014-09-10 23:32:24 +08:00