lammps/examples/accelerate/README

These are example scripts that can be run with any of
the accelerator packages in LAMMPS:

GPU, USER-INTEL, KOKKOS, USER-OMP, OPT

The easiest way to build LAMMPS with these packages
is via the flags described in Section 4 of the manual.
The easiest way to run these scripts is with the
example run commands listed below.

Details on the individual accelerator packages
can be found in doc/Section_accelerate.html.
---------------------
Build LAMMPS with one or more of the accelerator packages
Note that in addition to any accelerator packages, these packages also
need to be installed to run all of the example scripts: ASPHERE,
MOLECULE, KSPACE, RIGID.
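
For example, with the traditional make build these packages can be
installed before compiling (a sketch only; "mpi" is just one common
machine target, use whichever machine makefile fits your system):

cd src
make yes-asphere yes-molecule yes-kspace yes-rigid   # packages required by the example scripts
make yes-opt yes-user-omp                            # plus the accelerator packages you want
make mpi                                             # compile with your machine makefile
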
You can build a single LAMMPS executable with all the CPU
accelerator packages installed (USER-INTEL for CPU, KOKKOS for
OMP, USER-OMP, OPT), or one with all the GPU accelerator packages
installed (GPU, KOKKOS for CUDA).
For any build with the GPU package or KOKKOS for CUDA, be sure to
set the arch=XX setting to the appropriate value for the GPUs and
CUDA environment on your system.
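
As a rough sketch (file and variable names depend on your LAMMPS
version, so check the makefiles and the manual for your system): the
GPU package takes its architecture from the CUDA_ARCH setting in the
lib/gpu makefile, and a KOKKOS CUDA build takes a Kokkos architecture
setting in its machine makefile:

cd lib/gpu
make -f Makefile.linux   # first set CUDA_ARCH (e.g. -arch=sm_35) in this makefile for your GPU
cd ../../src
make yes-gpu yes-kokkos
make kokkos_cuda_mpi     # first set KOKKOS_ARCH in src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
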
---------------------
Running with each of the accelerator packages
All of the input scripts have a default problem size and number of
timesteps:
in.lj = LJ melt with cutoff of 2.5 = 32K atoms for 100 steps
in.lj.5.0 = same with cutoff of 5.0 = 32K atoms for 100 steps
in.phosphate = 11K atoms for 100 steps
in.rhodo = 32K atoms for 100 steps
in.lc = 33K atoms for 100 steps (after 200 steps equilibration)
These can be reset using the x,y,z and t variables on the command
line.  E.g. adding "-v x 2 -v y 2 -v z 4 -v t 1000" to any of the
run commands below would run a 16x larger problem (2x2x4) for
1000 steps.
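
For example, a resized run with the plain CPU executable used below
might look like this:

mpirun -np 4 lmp_cpu -v x 2 -v y 2 -v z 4 -v t 1000 -in in.lj   # 16x larger LJ melt for 1000 steps
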
Here are example run commands using each of the accelerator packages:
** CPU only
lmp_cpu < in.lj
mpirun -np 4 lmp_cpu -in in.lj
** OPT package
lmp_opt -sf opt < in.lj
mpirun -np 4 lmp_opt -sf opt -in in.lj
** USER-OMP package
lmp_omp -sf omp -pk omp 1 < in.lj
mpirun -np 4 lmp_omp -sf omp -pk omp 1 -in in.lj   # 4 MPI, 1 thread/MPI
mpirun -np 2 lmp_omp -sf omp -pk omp 4 -in in.lj   # 2 MPI, 4 thread/MPI
** GPU package
lmp_gpu_double -sf gpu < in.lj
mpirun -np 8 lmp_gpu_double -sf gpu < in.lj # 8 MPI, 8 MPI/GPU
mpirun -np 12 lmp_gpu_double -sf gpu -pk gpu 2 < in.lj # 12 MPI, 6 MPI/GPU
mpirun -np 4 lmp_gpu_double -sf gpu -pk gpu 2 tpa 8 < in.lj.5.0 # 4 MPI, 2 MPI/GPU
Note that when running in.lj.5.0 (which has a long cutoff) with the
GPU package, the tpa setting of the "-pk gpu" command should be > 1
(e.g. 8) for best performance.
** KOKKOS package for OMP
lmp_kokkos_omp -k on t 1 -sf kk -pk kokkos neigh half < in.lj
mpirun -np 2 lmp_kokkos_omp -k on t 4 -sf kk < in.lj # 2 MPI, 4 thread/MPI
Note that when running with just 1 thread/MPI, "-pk kokkos neigh half"
was specified to use half neighbor lists which are faster when running
on just 1 thread.
** KOKKOS package for CUDA
lmp_kokkos_cuda -k on t 1 -sf kk < in.lj # 1 thread, 1 GPU
mpirun -np 2 lmp_kokkos_cuda -k on t 6 g 2 -sf kk < in.lj # 2 MPI, 6 thread/MPI, 1 MPI/GPU
** KOKKOS package for PHI
mpirun -np 1 lmp_kokkos_phi -k on t 240 -sf kk -in in.lj # 1 MPI, 240 threads/MPI
mpirun -np 30 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 30 MPI, 8 threads/MPI
** USER-INTEL package for CPU
lmp_intel_cpu -sf intel < in.lj
mpirun -np 4 lmp_intel_cpu -sf intel < in.lj             # 4 MPI
mpirun -np 4 lmp_intel_cpu -sf intel -pk omp 2 < in.lj   # 4 MPI, 2 thread/MPI
** USER-INTEL package for PHI
lmp_intel_phi -sf intel -pk intel 1 omp 16 < in.lc # 1 MPI, 16 CPU thread/MPI, 1 Phi, 240 Phi thread/MPI
mpirun -np 4 lmp_intel_phi -sf intel -pk intel 1 omp 2 < in.lc # 4 MPI, 2 CPU threads/MPI, 1 Phi, 60 Phi thread/MPI
Note that there is currently no Phi support for pair_style lj/cut in
the USER-INTEL package.