From 2bb69773d3bf7e342ce4747281027684fad34ecf Mon Sep 17 00:00:00 2001 From: Stan Moore Date: Mon, 8 Apr 2019 13:07:29 -0600 Subject: [PATCH] Update Kokkos docs --- doc/src/Build_extras.txt | 5 +- doc/src/Speed_kokkos.txt | 61 +++++++-------- doc/src/package.txt | 164 ++++++++++++++++++++------------------- 3 files changed, 115 insertions(+), 115 deletions(-) diff --git a/doc/src/Build_extras.txt b/doc/src/Build_extras.txt index 17d18243f2..3b9da2db39 100644 --- a/doc/src/Build_extras.txt +++ b/doc/src/Build_extras.txt @@ -247,7 +247,10 @@ Maxwell50 = NVIDIA Maxwell generation CC 5.0 Maxwell52 = NVIDIA Maxwell generation CC 5.2 Maxwell53 = NVIDIA Maxwell generation CC 5.3 Pascal60 = NVIDIA Pascal generation CC 6.0 -Pascal61 = NVIDIA Pascal generation CC 6.1 :ul +Pascal61 = NVIDIA Pascal generation CC 6.1 +Volta70 = NVIDIA Volta generation CC 7.0 +Volta72 = NVIDIA Volta generation CC 7.2 +Turing75 = NVIDIA Turing generation CC 7.5 :ul [CMake build]: diff --git a/doc/src/Speed_kokkos.txt b/doc/src/Speed_kokkos.txt index d04f8ac6f1..1750d2400f 100644 --- a/doc/src/Speed_kokkos.txt +++ b/doc/src/Speed_kokkos.txt @@ -111,16 +111,10 @@ Makefile.kokkos_mpi_only) will give better performance than the OpenMP back end (i.e. Makefile.kokkos_omp) because some of the overhead to make the code thread-safe is removed. -NOTE: The default for the "package kokkos"_package.html command is to -use "full" neighbor lists and set the Newton flag to "off" for both -pairwise and bonded interactions. However, when running on CPUs, it -will typically be faster to use "half" neighbor lists and set the -Newton flag to "on", just as is the case for non-accelerated pair -styles. It can also be faster to use non-threaded communication. Use -the "-pk kokkos" "command-line switch"_Run_options.html to change the -default "package kokkos"_package.html options. See its doc page for -details and default settings. Experimenting with its options can -provide a speed-up for specific calculations. 
For example: +NOTE: Use the "-pk kokkos" "command-line switch"_Run_options.html to +change the default "package kokkos"_package.html options. See its doc +page for details and default settings. Experimenting with its options +can provide a speed-up for specific calculations. For example: mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre @@ -190,19 +184,18 @@ tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these two values should be N, i.e. 256 or 264. -NOTE: The default for the "package kokkos"_package.html command is to -use "full" neighbor lists and set the Newton flag to "off" for both -pairwise and bonded interactions. When running on KNL, this will -typically be best for pair-wise potentials. For many-body potentials, -using "half" neighbor lists and setting the Newton flag to "on" may be -faster. It can also be faster to use non-threaded communication. Use -the "-pk kokkos" "command-line switch"_Run_options.html to change the -default "package kokkos"_package.html options. See its doc page for -details and default settings. Experimenting with its options can -provide a speed-up for specific calculations. For example: +NOTE: The default for the "package kokkos"_package.html command when +running on KNL is to use "half" neighbor lists and set the Newton flag +to "on" for both pairwise and bonded interactions. This will typically +be best for many-body potentials. For simpler pair-wise potentials, it +may be faster to use a "full" neighbor list with the Newton flag set to "off". +Use the "-pk kokkos" "command-line switch"_Run_options.html to change +the default "package kokkos"_package.html options. See its doc page for +details and default settings. Experimenting with its options can provide +a speed-up for specific calculations. 
For example: -mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm -mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre +mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.lj # Newton on, half neighbor list, threaded comm +mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton off neigh full comm no -in in.reax # Newton off, full neighbor list, non-threaded comm :pre NOTE: MPI tasks and threads should be bound to cores as described above for CPUs. @@ -236,18 +229,18 @@ one or more nodes, each with two GPUs: mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre -NOTE: The default for the "package kokkos"_package.html command is to -use "full" neighbor lists and set the Newton flag to "off" for both -pairwise and bonded interactions, along with threaded communication. -When running on Maxwell or Kepler GPUs, this will typically be -best. For Pascal GPUs, using "half" neighbor lists and setting the -Newton flag to "on" may be faster. For many pair styles, setting the -neighbor binsize equal to twice the CPU default value will give speedup, -which is the default when running on GPUs. -Use the "-pk kokkos" "command-line switch"_Run_options.html to change -the default "package kokkos"_package.html options. See its doc page -for details and default settings. Experimenting with its options can -provide a speed-up for specific calculations. For example: +NOTE: The default for the "package kokkos"_package.html command when +running on GPUs is to use "full" neighbor lists and set the Newton flag +to "off" for both pairwise and bonded interactions, along with threaded +communication. 
When running on Maxwell or Kepler GPUs, this will +typically be best. For Pascal GPUs, using "half" neighbor lists and +setting the Newton flag to "on" may be faster. For many pair styles, +setting the neighbor binsize equal to twice the CPU default value will +give speedup, which is the default when running on GPUs. Use the "-pk +kokkos" "command-line switch"_Run_options.html to change the default +"package kokkos"_package.html options. See its doc page for details and +default settings. Experimenting with its options can provide a speed-up +for specific calculations. For example: mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighbor list, set binsize = neighbor ghost cutoff :pre diff --git a/doc/src/package.txt b/doc/src/package.txt index 89cfd03f5f..94ee93a743 100644 --- a/doc/src/package.txt +++ b/doc/src/package.txt @@ -64,7 +64,7 @@ args = arguments specific to the style :l {no_affinity} values = none {kokkos} args = keyword value ... zero or more keyword/value pairs may be appended - keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} + keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct} {neigh} value = {full} or {half} full = full neighbor list half = half neighbor list built in thread-safe manner @@ -72,7 +72,7 @@ args = arguments specific to the style :l full = full neighbor list half = half neighbor list built in thread-safe manner {newton} = {off} or {on} - off = set Newton pairwise and bonded flags off (default) + off = set Newton pairwise and bonded flags off on = set Newton pairwise and bonded flags on {binsize} value = size size = bin size for neighbor list construction (distance units) @@ -422,33 +422,35 @@ processes/threads used for LAMMPS. 
:line -The {kokkos} style invokes settings associated with the use of the -KOKKOS package. +The {kokkos} style invokes settings associated with the use of the +KOKKOS package. -All of the settings are optional keyword/value pairs. Each has a -default value as listed below. +All of the settings are optional keyword/value pairs. Each has a default +value as listed below. -The {neigh} keyword determines how neighbor lists are built. A value -of {half} uses a thread-safe variant of half-neighbor lists, -the same as used by most pair styles in LAMMPS. +The {neigh} keyword determines how neighbor lists are built. A value of +{half} uses a thread-safe variant of half-neighbor lists, the same as +used by most pair styles in LAMMPS, which is the default when running on +CPUs (i.e. CUDA backend not enabled). -A value of {full} uses a full neighbor lists and is the default. This -performs twice as much computation as the {half} option, however that -is often a win because it is thread-safe and doesn't require atomic -operations in the calculation of pair forces. For that reason, {full} -is the default setting. However, when running in MPI-only mode with 1 -thread per MPI task, {half} neighbor lists will typically be faster, -just as it is for non-accelerated pair styles. Similarly, the {neigh/qeq} -keyword determines how neighbor lists are built for "fix qeq/reax/kk"_fix_qeq_reax.html. -If not explicitly set, the value of {neigh/qeq} will match {neigh}. +A value of {full} uses a full neighbor list and is the default when +running on GPUs. This performs twice as much computation as the {half} +option, however that is often a win because it is thread-safe and +doesn't require atomic operations in the calculation of pair forces. For +that reason, {full} is the default setting for GPUs. However, when +running on CPUs, a {half} neighbor list is the default because it is +often faster, just as it is for non-accelerated pair styles. 
Similarly, +the {neigh/qeq} keyword determines how neighbor lists are built for "fix +qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of +{neigh/qeq} will match {neigh}. -The {newton} keyword sets the Newton flags for pairwise and bonded -interactions to {off} or {on}, the same as the "newton"_newton.html -command allows. The default is {off} because this will almost always -give better performance for the KOKKOS package. This means more -computation is done, but less communication. However, when running in -MPI-only mode with 1 thread per MPI task, a value of {on} will -typically be faster, just as it is for non-accelerated pair styles. +The {newton} keyword sets the Newton flags for pairwise and bonded +interactions to {off} or {on}, the same as the "newton"_newton.html +command allows. The default for GPUs is {off} because this will almost +always give better performance for the KOKKOS package. This means more +computation is done, but less communication. However, when running on +CPUs a value of {on} is the default since it can often be faster, just +as it is for non-accelerated pair styles. The {binsize} keyword sets the size of bins used to bin atoms in neighbor list builds. The same value can be set by the "neigh_modify @@ -465,58 +467,58 @@ because the GPU is faster at performing pairwise interactions, then this rule of thumb may give too large a binsize and the default should be overridden with a smaller value. -The {comm} and {comm/exchange} and {comm/forward} and {comm/reverse} keywords determine -whether the host or device performs the packing and unpacking of data -when communicating per-atom data between processors. "Exchange" -communication happens only on timesteps that neighbor lists are -rebuilt. The data is only for atoms that migrate to new processors. -"Forward" communication happens every timestep. "Reverse" communication -happens every timestep if the {newton} option is on. 
The data is for atom -coordinates and any other atom properties that needs to be updated for -ghost atoms owned by each processor. +The {comm} and {comm/exchange} and {comm/forward} and {comm/reverse} +keywords determine whether the host or device performs the packing and +unpacking of data when communicating per-atom data between processors. +"Exchange" communication happens only on timesteps that neighbor lists +are rebuilt. The data is only for atoms that migrate to new processors. +"Forward" communication happens every timestep. "Reverse" communication +happens every timestep if the {newton} option is on. The data is for +atom coordinates and any other atom properties that need to be updated +for ghost atoms owned by each processor. -The {comm} keyword is simply a short-cut to set the same value -for both the {comm/exchange} and {comm/forward} and {comm/reverse} keywords. +The {comm} keyword is simply a short-cut to set the same value for all of +the {comm/exchange} and {comm/forward} and {comm/reverse} keywords. -The value options for all 3 keywords are {no} or {host} or {device}. -A value of {no} means to use the standard non-KOKKOS method of -packing/unpacking data for the communication. A value of {host} means -to use the host, typically a multi-core CPU, and perform the -packing/unpacking in parallel with threads. A value of {device} -means to use the device, typically a GPU, to perform the -packing/unpacking operation. +The value options for all 3 keywords are {no} or {host} or {device}. A +value of {no} means to use the standard non-KOKKOS method of +packing/unpacking data for the communication. A value of {host} means to +use the host, typically a multi-core CPU, and perform the +packing/unpacking in parallel with threads. A value of {device} means to +use the device, typically a GPU, to perform the packing/unpacking +operation. -The optimal choice for these keywords depends on the input script and -the hardware used. +The optimal choice for these keywords depends on the input script and +the hardware used. 
The {no} value is useful for verifying that the -Kokkos-based {host} and {device} values are working correctly. -It may also be the fastest choice when using Kokkos styles in -MPI-only mode (i.e. with a thread count of 1). +The optimal choice for these keywords depends on the input script and +the hardware used. The {no} value is useful for verifying that the +Kokkos-based {host} and {device} values are working correctly. It is the +default when running on CPUs since it is usually the fastest. -When running on CPUs or Xeon Phi, the {host} and {device} values work -identically. When using GPUs, the {device} value will typically be -optimal if all of your styles used in your input script are supported -by the KOKKOS package. In this case data can stay on the GPU for many -timesteps without being moved between the host and GPU, if you use the -{device} value. This requires that your MPI is able to access GPU -memory directly. Currently that is true for OpenMPI 1.8 (or later -versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses -styles (e.g. fixes) which are not yet supported by the KOKKOS package, -then data has to be move between the host and device anyway, so it is -typically faster to let the host handle communication, by using the -{host} value. Using {host} instead of {no} will enable use of -multiple threads to pack/unpack communicated data. +When running on CPUs or Xeon Phi, the {host} and {device} values work +identically. When using GPUs, the {device} value is the default since it +will typically be optimal if all of your styles used in your input +script are supported by the KOKKOS package. In this case data can stay +on the GPU for many timesteps without being moved between the host and +GPU, if you use the {device} value. This requires that your MPI is able +to access GPU memory directly. Currently that is true for OpenMPI 1.8 +(or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your +script uses styles (e.g. 
fixes) which are not yet supported by the +KOKKOS package, then data has to be moved between the host and device +anyway, so it is typically faster to let the host handle communication, +by using the {host} value. Using {host} instead of {no} will enable use +of multiple threads to pack/unpack communicated data. -The {gpu/direct} keyword chooses whether GPU-direct will be used. When -this keyword is set to {on}, buffers in GPU memory are passed directly -through MPI send/receive calls. This reduces overhead of first copying -the data to the host CPU. However GPU-direct is not supported on all -systems, which can lead to segmentation faults and would require -using a value of {off}. If LAMMPS can safely detect that GPU-direct is -not available (currently only possible with OpenMPI v2.0.0 or later), -then the {gpu/direct} keyword is automatically set to {off} by default. -When the {gpu/direct} keyword is set to {off} while any of the {comm} -keywords are set to {device}, the value for these {comm} keywords will -be automatically changed to {host}. +The {gpu/direct} keyword chooses whether GPU-direct will be used. When +this keyword is set to {on}, buffers in GPU memory are passed directly +through MPI send/receive calls. This avoids the overhead of first copying +the data to the host CPU. However, GPU-direct is not supported on all +systems, which can lead to segmentation faults and would require using a +value of {off}. If LAMMPS can safely detect that GPU-direct is not +available (currently only possible with OpenMPI v2.0.0 or later), then +the {gpu/direct} keyword is automatically set to {off} by default. When +the {gpu/direct} keyword is set to {off} while any of the {comm} +keywords are set to {device}, the value for these {comm} keywords will +be automatically changed to {host}. This setting has no effect if not +running on GPUs. 
:line @@ -623,14 +625,16 @@ not used, you must invoke the package intel command in your input script or or via the "-pk intel" "command-line switch"_Run_options.html. -For the KOKKOS package, the option defaults neigh = full, neigh/qeq = -full, newton = off, binsize for CPUs = 0.0, binsize for GPUs = 2x LAMMPS -default value, and comm = device, gpu/direct = on. When LAMMPS can -safely detect, that GPU-direct is not available, the default value of -gpu/direct becomes "off". These settings are made automatically by the -required "-k on" "command-line switch"_Run_options.html. You can change -them by using the package kokkos command in your input script or via the -"-pk kokkos command-line switch"_Run_options.html. +For the KOKKOS package, the option defaults for GPUs are neigh = full, +neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default +value, comm = device, gpu/direct = on. When LAMMPS can safely detect +that GPU-direct is not available, the default value of gpu/direct +becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = +half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. These +settings are made automatically by the required "-k on" "command-line +switch"_Run_options.html. You can change them by using the package +kokkos command in your input script or via the "-pk kokkos command-line +switch"_Run_options.html. For the OMP package, the default is Nthreads = 0 and the option defaults are neigh = yes. These settings are made automatically if