git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12601 f3b2605a-c512-4ea7-a41b-209d697bcdaa

2014-10-07 16:52:39 +00:00 · 2014-10-07 16:52:39 +00:00 · 6b7d937194
parent 6bea73f245
commit 6b7d937194
20 changed files with 23333 additions and 0 deletions
--- a/examples/accelerate/Make.list
+++ b/examples/accelerate/Make.list
@ -0,0 +1,41 @@
+# desktop builds for dual hex-core Xeons and Fermi GPUs
+# use mpicxx or nvcc with its default compiler
+# use default FFT support = KISS library
+
+# build with one accelerator package
+
+cpu: -d ../.. -j 16 -p none asphere molecule kspace rigid orig -o cpu file clean mpi
+
+omp: -d ../.. -j 16 -p none asphere molecule kspace rigid omp orig -o omp file clean mpi
+
+opt: -d ../.. -j 16 -p none asphere molecule kspace rigid opt orig -o opt file clean mpi
+
+cuda_double: -d ../.. -j 16 -p none asphere molecule kspace rigid cuda orig -cuda mode=double arch=21 -o cuda_double lib-cuda file clean mpi
+
+cuda_mixed: -d ../.. -j 16 -p none asphere molecule kspace rigid cuda orig -cuda mode=mixed arch=21 -o cuda_mixed lib-cuda file clean mpi
+
+cuda_single: -d ../.. -j 16 -p none asphere molecule kspace rigid cuda orig -cuda mode=single arch=21 -o cuda_single lib-cuda file clean mpi
+
+gpu_double: -d ../.. -j 16 -p none asphere molecule kspace rigid gpu orig -gpu mode=double arch=21 -o gpu_double lib-gpu file clean mpi
+
+gpu_mixed: -d ../.. -j 16 -p none asphere molecule kspace rigid gpu orig -gpu mode=mixed arch=21 -o gpu_mixed lib-gpu file clean mpi
+
+gpu_single: -d ../.. -j 16 -p none asphere molecule kspace rigid gpu orig -gpu mode=single arch=21 -o gpu_single lib-gpu file clean mpi
+
+intel_cpu: -d ../.. -j 16 -p none asphere molecule kspace rigid intel omp orig -cc mpi wrap=icc -intel cpu -o intel_cpu file clean mpi
+
+#intel_phi: -d ../.. -j 16 -p none asphere molecule kspace rigid intel omp orig -intel phi -o intel_phi file clean mpi 
+
+kokkos_omp: -d ../.. -j 16 -p none asphere molecule kspace rigid kokkos orig -kokkos omp -o kokkos_omp file clean mpi 
+
+kokkos_cuda: -d ../.. -j 16 -p none asphere molecule kspace rigid kokkos orig -cc nvcc wrap=mpi -kokkos cuda arch=21 -o kokkos_cuda file clean mpi
+
+#kokkos_phi: -d ../.. -j 16 -p none asphere molecule kspace rigid kokkos orig -kokkos phi -o kokkos_phi file clean mpi 
+
+# build with all accelerator packages for CPU
+
+all_cpu: -d ../.. -j 16 -p none asphere molecule kspace rigid opt omp intel kokkos orig -cc mpi wrap=icc -intel cpu -kokkos omp -o all_cpu file clean mpi
+
+# build with all accelerator packages for GPU
+
+all_gpu: -d ../.. -j 16 -p none asphere molecule kspace rigid omp gpu cuda kokkos orig -cc nvcc wrap=mpi -cuda mode=double arch=21 -gpu mode=double arch=21 -kokkos cuda arch=21 -o all_gpu lib-all file clean mpi
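The Make.list layout above (one "target: arguments" entry per line, with "#" marking comments and disabled targets) can be read with a few lines of Python. This is only a minimal sketch for inspecting the file; the function name parse_make_list is hypothetical and the code does not reproduce the parsing that src/Make.py itself performs.

```python
# Sketch only: a simplified reader for "name: arguments" lines.
# It is NOT the parser used by src/Make.py.

def parse_make_list(text):
    """Return {target: [argument, ...]} for every non-comment line."""
    targets = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out targets
        name, _, args = line.partition(":")
        targets[name.strip()] = args.split()
    return targets


# Three entries copied from Make.list (one of them commented out).
SAMPLE = """\
# build with one accelerator package
cpu: -d ../.. -j 16 -p none asphere molecule kspace rigid orig -o cpu file clean mpi
#intel_phi: -d ../.. -j 16 -p none asphere molecule kspace rigid intel omp orig -intel phi -o intel_phi file clean mpi
gpu_mixed: -d ../.. -j 16 -p none asphere molecule kspace rigid gpu orig -gpu mode=mixed arch=21 -o gpu_mixed lib-gpu file clean mpi
"""

targets = parse_make_list(SAMPLE)
print(sorted(targets))

# Show the two settings that follow the -gpu switch.
gpu_args = targets["gpu_mixed"]
start = gpu_args.index("-gpu") + 1
print(gpu_args[start:start + 2])
```

Listing the parsed targets this way is a quick check of which entries are active before invoking Make.py with "-r Make.list target".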
--- a/examples/accelerate/README
+++ b/examples/accelerate/README
@ -0,0 +1,158 @@
+These are example scripts that can be run with any of
+the accelerator packages in LAMMPS:
+
+USER-CUDA, GPU, USER-INTEL, KOKKOS, USER-OMP, OPT
+
+The easiest way to build LAMMPS with these packages
+is via the src/Make.py tool described in Section 2.4
+of the manual.  You can also type "Make.py -h" to see
+its options.  The easiest way to run these scripts
+is by using the appropriate run commands shown below.
+
+Details on the individual accelerator packages
+can be found in doc/Section_accelerate.html.
+
+---------------------
+
+Build LAMMPS with one or more of the accelerator packages
+
+The following command will invoke the src/Make.py tool with one of the
+command-lines from the Make.list file:
+
+../../src/Make.py -r Make.list target
+
+target = one or more of the following:
+  cpu, omp, opt
+  cuda_double, cuda_mixed, cuda_single
+  gpu_double, gpu_mixed, gpu_single
+  intel_cpu, intel_phi
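+  kokkos_omp, kokkos_cuda, kokkos_phi
+
+Note that the intel_phi and kokkos_phi targets are commented out in
+Make.list; uncomment them there before building.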
+
+If successful, the build will produce the file lmp_target in this
+directory.
+
+Note that in addition to any accelerator packages, these packages also
+need to be installed to run all of the example scripts: ASPHERE,
+MOLECULE, KSPACE, RIGID.
+
+These two targets will build a single LAMMPS executable with all the
+CPU accelerator packages installed (USER-INTEL for CPU, KOKKOS for
+OMP, USER-OMP, OPT) or all the GPU accelerator packages installed
+(USER-CUDA, GPU, KOKKOS for CUDA):
+
+target = all_cpu, all_gpu
+
+Note that the Make.py commands in Make.list assume an MPI environment
+exists on your machine and use mpicxx as the wrapper compiler with
+whatever underlying compiler it wraps by default.  If you add "-cc mpi
+wrap=g++" or "-cc mpi wrap=icc" after the target, you can choose the
+underlying compiler for mpicxx to invoke.  E.g.
+
+../../src/Make.py -r Make.list intel_cpu -cc mpi wrap=icc
+
+You should do this for any build that includes the USER-INTEL
+package, since it will perform best with the Intel compilers.
+
+Note that for kokkos_cuda, it needs to be "-cc nvcc" instead of "mpi",
+since a KOKKOS for CUDA build requires NVIDIA nvcc as the wrapper
+compiler.
+
+Also note that the Make.py commands in Make.list use the default
+FFT support which is via the KISS library.  If you want to
+build with another FFT library, e.g. FFTW3, then you can add
+"-fft fftw3" after the target, e.g.
+
+../../src/Make.py -r Make.list gpu_double -fft fftw3
+
+For any build with USER-CUDA, GPU, or KOKKOS for CUDA, be sure to set
+the arch=XX setting to the appropriate value for the GPUs and CUDA
+environment on your system.  What is defined in the Make.list file is
+arch=21 for older Fermi GPUs.  This can be overridden as follows,
+e.g. for Kepler GPUs:
+
+../../src/Make.py -r Make.list gpu_double -gpu mode=double arch=35
+
+---------------------
+
+Running with each of the accelerator packages
+
+All of the input scripts have a default problem size and number of
+timesteps:
+
+in.lj = LJ melt with cutoff of 2.5 = 32K atoms for 100 steps
+in.lj.5.0 = same with cutoff of 5.0 = 32K atoms for 100 steps
+in.phosphate = 11K atoms for 100 steps
+in.rhodo = 32K atoms for 100 steps
+in.lc = 33K atoms for 100 steps (after 200 steps equilibration)
+
+These can be reset using the x, y, z, and t variables on the command
+line.  E.g. adding "-v x 2 -v y 2 -v z 4 -v t 1000" to any of the run
+commands below would run a 16x larger problem (2x2x4) for 1000 steps.
+
+Here are example run commands using each of the accelerator packages:
+
+** CPU only
+
+lmp_cpu < in.lj
+mpirun -np 4 lmp_cpu -in in.lj
+
+** OPT package
+
+lmp_opt -sf opt < in.lj
+mpirun -np 4 lmp_opt -sf opt -in in.lj
+
+** USER-OMP package
+
+lmp_omp -sf omp -pk omp 1 < in.lj
+mpirun -np 4 lmp_omp -sf omp -pk omp 1 -in in.lj   # 4 MPI, 1 thread/MPI
+mpirun -np 2 lmp_omp -sf omp -pk omp 4 -in in.lj   # 2 MPI, 4 thread/MPI
+
+** GPU package
+
+lmp_gpu_double -sf gpu < in.lj               
+mpirun -np 8 lmp_gpu_double -sf gpu < in.lj        # 8 MPI, 8 MPI/GPU
+mpirun -np 12 lmp_gpu_double -sf gpu -pk gpu 2 < in.lj  # 12 MPI, 6 MPI/GPU
+mpirun -np 4 lmp_gpu_double -sf gpu -pk gpu 2 tpa 8 < in.lj.5.0   # 4 MPI, 2 MPI/GPU
+
+Note that when running in.lj.5.0 (which has a long cutoff) with the
+GPU package, the "tpa" setting of "-pk gpu" should be > 1 (e.g. 8) for
+best performance.
+
+** USER-CUDA package
+
+lmp_cuda_double -c on -sf cuda < in.lj
+mpirun -np 1 lmp_cuda_double -c on -sf cuda < in.lj    # 1 MPI, 1 MPI/GPU
+mpirun -np 2 lmp_cuda_double -c on -sf cuda -pk cuda 2 < in.lj  # 2 MPI, 1 MPI/GPU
+
+** KOKKOS package for OMP
+
+lmp_kokkos_omp -k on t 1 -sf kk -pk kokkos neigh half < in.lj
+mpirun -np 2 lmp_kokkos_omp -k on t 4 -sf kk < in.lj  # 2 MPI, 4 thread/MPI
+
+Note that when running with just 1 thread/MPI, "-pk kokkos neigh half"
+was specified to use half neighbor lists, which are faster when running
+on just 1 thread.
+
+** KOKKOS package for CUDA
+
+lmp_kokkos_cuda -k on t 1 -sf kk < in.lj    # 1 thread, 1 GPU
+mpirun -np 2 lmp_kokkos_cuda -k on t 6 g 2 -sf kk < in.lj   # 2 MPI, 6 thread/MPI, 1 MPI/GPU
+
+** KOKKOS package for PHI
+
+mpirun -np 1 lmp_kokkos_phi -k on t 240 -sf kk -in in.lj   # 1 MPI, 240 threads/MPI
+mpirun -np 30 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj    # 30 MPI, 8 threads/MPI
+
+** USER-INTEL package for CPU
+
+lmp_intel_cpu -sf intel < in.lj
+mpirun -np 4 lmp_intel_cpu -sf intel < in.lj             # 4 MPI
+mpirun -np 4 lmp_intel_cpu -sf intel -pk omp 2 < in.lj   # 4 MPI, 2 thread/MPI
+
+** USER-INTEL package for PHI
+
+lmp_intel_phi -sf intel -pk intel 1 omp 16 < in.lc      # 1 MPI, 16 CPU thread/MPI, 1 Phi, 240 Phi thread/MPI
+mpirun -np 4 lmp_intel_phi -sf intel -pk intel 1 omp 2 < in.lc  # 4 MPI, 2 CPU threads/MPI, 1 Phi, 60 Phi thread/MPI
+
+Note that there is currently no Phi support for pair_style lj/cut in
+the USER-INTEL package.
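The trailing comments on the run commands above ("12 MPI, 6 MPI/GPU", "2 MPI, 6 thread/MPI", and so on) follow from simple arithmetic on the MPI task, GPU, and thread counts. The sketch below reproduces that arithmetic; the helper names are hypothetical, and it assumes MPI tasks are spread evenly over the GPUs, which is how the comments read but is not something this code checks against LAMMPS itself.

```python
def tasks_per_gpu(mpi_tasks, gpus=1):
    """MPI tasks sharing each GPU, assuming an even split (an assumption)."""
    if gpus < 1 or mpi_tasks % gpus != 0:
        raise ValueError("expected MPI tasks to divide evenly across GPUs")
    return mpi_tasks // gpus


def total_threads(mpi_tasks, threads_per_task):
    """Total threads in use = MPI tasks x threads per task."""
    return mpi_tasks * threads_per_task


# GPU package examples from the README comments.
print(tasks_per_gpu(8))        # mpirun -np 8,  no -pk gpu  -> 8 MPI/GPU
print(tasks_per_gpu(12, 2))    # mpirun -np 12, -pk gpu 2   -> 6 MPI/GPU
print(tasks_per_gpu(4, 2))     # mpirun -np 4,  -pk gpu 2   -> 2 MPI/GPU

# KOKKOS for CUDA example: 2 MPI tasks, 6 threads each.
print(total_threads(2, 6))     # 12

# KOKKOS for PHI examples: both layouts use 240 threads in total.
print(total_threads(1, 240), total_threads(30, 8))
```

The two KOKKOS for PHI layouts give the same thread total, so they differ only in how the work is divided between MPI tasks and threads.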
--- a/examples/accelerate/data.phosphate
+++ b/examples/accelerate/data.phosphate
--- a/examples/accelerate/in.lc
+++ b/examples/accelerate/in.lc
@ -0,0 +1,57 @@
+# Gay-Berne benchmark
+# biaxial ellipsoid mesogens in isotropic phase
+# shape: 2 1.5 1
+# cutoff 4.0 with skin 0.8
+# NPT, T=2.4, P=8.0
+
+variable        x index 1
+variable        y index 1
+variable        z index 1
+variable        t index 100
+
+variable        i equal $x*32
+variable        j equal $y*32
+variable        k equal $z*32
+
+units	        lj
+atom_style      ellipsoid
+
+# create lattice of ellipsoids
+
+lattice	      sc 0.22
+region	      box block 0 $i 0 $j 0 $k
+create_box    1 box
+create_atoms  1 box
+
+set           type 1 mass 1.5
+set           type 1 shape 1 1.5 2
+set	      group all quat/random 982381
+
+compute	       rot all temp/asphere
+group	       spheroid type 1
+variable       dof equal count(spheroid)+3
+compute_modify rot extra ${dof}
+
+velocity      all create 2.4 41787 loop geom
+
+pair_style    gayberne 1.0 3.0 1.0 4.0
+pair_coeff    1 1 1.0 1.0 1.0 0.5 0.2 1.0 0.5 0.2
+
+neighbor      0.8 bin
+
+timestep      0.002
+thermo	      100
+
+# equilibration run
+
+fix	       1 all npt/asphere temp 2.4 2.4 0.1 iso 5.0 8.0 0.1
+compute_modify 1_temp extra ${dof}
+run	       200
+
+# dynamics run
+
+reset_timestep 0
+unfix          1
+fix            1 all nve/asphere
+
+run	       $t
--- a/examples/accelerate/in.lj
+++ b/examples/accelerate/in.lj
@ -0,0 +1,33 @@
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+variable        t index 100
+
+variable	xx equal 20*$x
+variable	yy equal 20*$y
+variable	zz equal 20*$z
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+create_box	1 box
+create_atoms	1 box
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo          100
+
+run		$t
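The problem size in in.lj follows directly from the script: the box is 20*x by 20*y by 20*z fcc unit cells, and an fcc unit cell holds 4 atoms. The sketch below works through that arithmetic and the lattice spacing for reduced density 0.8442; the function names are hypothetical and the code is a check of the numbers, not part of LAMMPS.

```python
FCC_ATOMS_PER_CELL = 4   # basis atoms in one fcc unit cell
CELLS_PER_DIM = 20       # from "variable xx equal 20*$x" in in.lj
DENSITY = 0.8442         # reduced density given to "lattice fcc"


def lj_atom_count(x=1, y=1, z=1):
    """Atoms created by in.lj for the given -v x/y/z multipliers."""
    return (FCC_ATOMS_PER_CELL
            * (CELLS_PER_DIM * x) * (CELLS_PER_DIM * y) * (CELLS_PER_DIM * z))


def fcc_lattice_spacing(density=DENSITY):
    """Cubic cell edge such that 4 atoms per cell give the requested density."""
    return (FCC_ATOMS_PER_CELL / density) ** (1.0 / 3.0)


print(lj_atom_count())                                # default problem: 32000 atoms
print(lj_atom_count(2, 2, 2))                         # 256000 atoms
print(lj_atom_count(2, 2, 4) // lj_atom_count())      # 16x larger problem
print(round(fcc_lattice_spacing(), 4))                # lattice spacing
print(round(CELLS_PER_DIM * fcc_lattice_spacing(), 3))  # box edge for x = 1
```

The 32000 and 256000 atom counts match the "Created ... atoms" lines in the log files included in this commit.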
--- a/examples/accelerate/in.lj.5.0
+++ b/examples/accelerate/in.lj.5.0
@ -0,0 +1,33 @@
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+variable        t index 100
+
+variable	xx equal 20*$x
+variable	yy equal 20*$y
+variable	zz equal 20*$z
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+create_box	1 box
+create_atoms	1 box
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 5.0
+pair_coeff	1 1 1.0 1.0
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo          100
+
+run		$t
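in.lj.5.0 differs from in.lj only in the pair cutoff (5.0 instead of 2.5), yet that change makes each step far more expensive. A rough way to see why is to estimate how many neighbors fall inside the cutoff plus the 0.3 skin at the lattice density of 0.8442. The sketch below assumes a uniform density, so it is only an estimate and will not match the neighbor counts LAMMPS reports exactly.

```python
import math

DENSITY = 0.8442  # reduced density from "lattice fcc 0.8442"
SKIN = 0.3        # from "neighbor 0.3 bin"


def estimated_neighbors(cutoff, skin=SKIN, density=DENSITY):
    """Uniform-density estimate: atoms inside a sphere of radius cutoff + skin."""
    radius = cutoff + skin
    return density * (4.0 / 3.0) * math.pi * radius ** 3


short = estimated_neighbors(2.5)
long_ = estimated_neighbors(5.0)
print(round(short, 1))          # estimate for in.lj
print(round(long_, 1))          # estimate for in.lj.5.0
print(round(long_ / short, 2))  # growth factor, (5.3 / 2.8) ** 3
```

Under this estimate the longer cutoff gives each atom several times more neighbors, which is consistent with the larger Pair times in the log.lj.5.0 files.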
--- a/examples/accelerate/in.phosphate
+++ b/examples/accelerate/in.phosphate
@ -0,0 +1,33 @@
+# GI-System
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+variable	t index 100
+
+units metal
+atom_style      charge 
+
+read_data 	data.phosphate
+
+replicate	$x $y $z
+
+pair_style      lj/cut/coul/long 15.0
+
+pair_coeff 1 1  0.0 0.29
+pair_coeff 1 2  0.0 0.29
+pair_coeff 1 3  0.000668 2.5738064
+pair_coeff 2 2  0.0 0.29
+pair_coeff 2 3  0.004251 1.91988674
+pair_coeff 3 3  0.012185 2.91706967
+
+kspace_style    pppm 1e-5
+
+neighbor	2.0 bin
+
+thermo          100
+timestep        0.001
+
+fix 		1 all npt temp 400 400 0.01 iso 1000.0 1000.0 1.0
+
+run 		$t
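The "replicate $x $y $z" command in in.phosphate scales the system read from data.phosphate, so the atom count and box size grow with the x, y, z variables. The sketch below checks that arithmetic against the numbers printed in the phosphate log files in this commit (10950 atoms before replication, 295650 after "replicate 3 3 3"); the function names are hypothetical.

```python
BASE_ATOMS = 10950  # atoms read from data.phosphate, as printed in the logs


def replicated_atoms(x=1, y=1, z=1):
    """Atom count after 'replicate x y z'."""
    return BASE_ATOMS * x * y * z


def replicated_upper_bound(lo, hi, n):
    """Upper box bound after replicating n times along one dimension."""
    return lo + n * (hi - lo)


print(replicated_atoms())          # default run: 10950 atoms (about 11K)
print(replicated_atoms(3, 3, 3))   # 295650 atoms, as in the logs
print(round(replicated_upper_bound(33.0201, 86.9799, 3), 3))
```

The last value agrees with the upper box bound of 194.899 that the logs report after replication.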
--- a/examples/accelerate/in.rhodo
+++ b/examples/accelerate/in.rhodo
@ -0,0 +1,34 @@
+# Rhodopsin model
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+variable	t index 100
+
+units           real  
+neigh_modify    delay 5 every 1   
+
+atom_style      full  
+bond_style      harmonic 
+angle_style     charmm 
+dihedral_style  charmm 
+improper_style  harmonic 
+pair_style      lj/charmm/coul/long 8.0 10.0 
+pair_modify     mix arithmetic 
+kspace_style    pppm 1e-4 
+
+read_data       ../../bench/data.rhodo
+
+replicate	$x $y $z
+
+fix             1 all shake 0.0001 5 0 m 1.0 a 232
+fix             2 all npt temp 300.0 300.0 100.0 &
+		z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1
+
+special_bonds   charmm
+ 
+thermo          100
+thermo_style    multi 
+timestep        2.0
+
+run	        $t
--- a/examples/accelerate/log.lj.1Feb14.gpu.1
+++ b/examples/accelerate/log.lj.1Feb14.gpu.1
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# 3d Lennard-Jones melt
+
+newton          off
+package 	gpu force/neigh 0 1 1
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+variable	xx equal 20*$x
+variable	xx equal 20*2
+variable	yy equal 20*$y
+variable	yy equal 20*2
+variable	zz equal 20*$z
+variable	zz equal 20*2
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 40 0 ${yy} 0 ${zz}
+region		box block 0 40 0 40 0 ${zz}
+region		box block 0 40 0 40 0 40
+create_box	1 box
+Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
+  1 by 1 by 1 MPI processor grid
+create_atoms	1 box
+Created 256000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut/gpu 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo 		100
+run		1000
+Memory usage per processor = 46.8462 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733683            0   -4.6133768   -5.0196737 
+     100   0.75865617    -5.760326            0   -4.6223462   0.19586079 
+     200   0.75643086   -5.7572859            0   -4.6226441   0.22641241 
+     300   0.74927423   -5.7463997            0   -4.6224927   0.29737707 
+     400   0.74049393   -5.7329259            0   -4.6221893    0.3776681 
+     500   0.73092107   -5.7182622            0   -4.6218849   0.46900655 
+     600   0.72320925   -5.7064076            0   -4.6215979   0.53444495 
+     700   0.71560947   -5.6946702            0   -4.6212602   0.59905402 
+     800   0.71306623   -5.6906095            0   -4.6210143   0.62859381 
+     900   0.70675364   -5.6807352            0   -4.6206089   0.68471945 
+    1000    0.7044073   -5.6771664            0   -4.6205596   0.70033364 
+Loop time of 21.016 on 1 procs for 1000 steps with 256000 atoms
+
+Pair  time (%) = 13.4638 (64.0646)
+Neigh time (%) = 6.74725e-05 (0.000321052)
+Comm  time (%) = 1.09447 (5.20779)
+Outpt time (%) = 0.0103211 (0.0491108)
+Other time (%) = 6.44732 (30.6781)
+
+Nlocal:    256000 ave 256000 max 256000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    69917 ave 69917 max 69917 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 50
+Dangerous builds = 0
+
+Please see the log.cite file for references relevant to this simulation
+
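The timing breakdown at the end of the log above can be checked mechanically: the per-category times should add up to the loop time, and each percentage is the category time divided by the loop time. The following sketch parses the relevant lines (copied from the log with the leading diff "+" removed) and recomputes both; the regular expressions are written for this log layout only and are an assumption, not a general LAMMPS log parser.

```python
import re

LOG = """\
Loop time of 21.016 on 1 procs for 1000 steps with 256000 atoms

Pair  time (%) = 13.4638 (64.0646)
Neigh time (%) = 6.74725e-05 (0.000321052)
Comm  time (%) = 1.09447 (5.20779)
Outpt time (%) = 0.0103211 (0.0491108)
Other time (%) = 6.44732 (30.6781)
"""


def parse_breakdown(log):
    """Return (loop_seconds, {category: (seconds, percent)})."""
    loop = float(re.search(r"Loop time of (\S+) on", log).group(1))
    rows = re.findall(r"^(\w+)\s+time \(%\) = (\S+) \(([^)]+)\)", log, re.M)
    return loop, {name: (float(sec), float(pct)) for name, sec, pct in rows}


loop, breakdown = parse_breakdown(LOG)
total = sum(sec for sec, _ in breakdown.values())
print(f"sum of categories = {total:.4f} s, loop time = {loop} s")

for name, (sec, pct) in breakdown.items():
    recomputed = 100.0 * sec / loop
    print(f"{name:5s} logged {pct:10.6f}%  recomputed {recomputed:10.6f}%")
```

In this run the Pair and Other categories dominate, while the Neigh time is negligible.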
--- a/examples/accelerate/log.lj.1Feb14.gpu.4
+++ b/examples/accelerate/log.lj.1Feb14.gpu.4
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# 3d Lennard-Jones melt
+
+newton          off
+package 	gpu force/neigh 0 1 1
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+variable	xx equal 20*$x
+variable	xx equal 20*2
+variable	yy equal 20*$y
+variable	yy equal 20*2
+variable	zz equal 20*$z
+variable	zz equal 20*2
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 40 0 ${yy} 0 ${zz}
+region		box block 0 40 0 40 0 ${zz}
+region		box block 0 40 0 40 0 40
+create_box	1 box
+Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
+  1 by 2 by 2 MPI processor grid
+create_atoms	1 box
+Created 256000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut/gpu 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo 		100
+run		1000
+Memory usage per processor = 14.5208 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733683            0   -4.6133768   -5.0196737 
+     100   0.75865617    -5.760326            0   -4.6223462   0.19586079 
+     200   0.75643087   -5.7572859            0   -4.6226441    0.2264124 
+     300   0.74927423   -5.7463997            0   -4.6224927   0.29737713 
+     400    0.7404939   -5.7329258            0   -4.6221893   0.37766836 
+     500   0.73092104   -5.7182626            0   -4.6218853   0.46900587 
+     600   0.72320865   -5.7064076            0   -4.6215989   0.53444677 
+     700   0.71560468   -5.6946635            0   -4.6212607   0.59907258 
+     800    0.7130474   -5.6905859            0    -4.621019   0.62875333 
+     900   0.70683795    -5.680864            0   -4.6206112    0.6839564 
+    1000   0.70454326   -5.6773491            0   -4.6205384   0.69975744 
+Loop time of 8.72938 on 4 procs for 1000 steps with 256000 atoms
+
+Pair  time (%) = 5.30046 (60.7198)
+Neigh time (%) = 5.78761e-05 (0.000663004)
+Comm  time (%) = 1.62433 (18.6076)
+Outpt time (%) = 0.0129588 (0.14845)
+Other time (%) = 1.79157 (20.5235)
+
+Nlocal:    64000 ave 64066 max 63924 min
+Histogram: 1 0 1 0 0 0 0 0 0 2
+Nghost:    30535 ave 30559 max 30518 min
+Histogram: 1 0 1 0 1 0 0 0 0 1
+Neighs:    0 ave 0 max 0 min
+Histogram: 4 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 50
+Dangerous builds = 0
+
+Please see the log.cite file for references relevant to this simulation
+
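The two log.lj.1Feb14.gpu logs above report loop times for the same 256000-atom, 1000-step problem on 1 and 4 MPI tasks, so a speedup can be read off directly. The sketch below extracts the "Loop time" lines and computes the speedup and throughput; the regular expression is an assumption tailored to the log lines in this commit, and the calculation says nothing about how many GPUs each run used.

```python
import re

LOOP_RE = re.compile(
    r"Loop time of (\S+) on (\d+) procs(?: \(.*?\))? "
    r"for (\d+) steps with (\d+) atoms"
)


def parse_loop_line(line):
    """Return (seconds, procs, steps, atoms) from a 'Loop time' log line."""
    seconds, procs, steps, atoms = LOOP_RE.search(line).groups()
    return float(seconds), int(procs), int(steps), int(atoms)


one = parse_loop_line(
    "Loop time of 21.016 on 1 procs for 1000 steps with 256000 atoms")
four = parse_loop_line(
    "Loop time of 8.72938 on 4 procs for 1000 steps with 256000 atoms")

speedup = one[0] / four[0]
print(f"speedup on {four[1]} MPI tasks: {speedup:.4f}")
print(f"speedup per MPI task: {speedup / four[1]:.3f}")

# Throughput in atom-steps per second for each run.
for seconds, procs, steps, atoms in (one, four):
    print(procs, round(atoms * steps / seconds))
```

The optional parenthesized group in the pattern also accepts the KOKKOS form of the line, e.g. "on 12 procs (2 MPI x 6 OpenMP) for 100 steps".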
--- a/examples/accelerate/log.lj.1Feb14.kokkos.cuda.1
+++ b/examples/accelerate/log.lj.1Feb14.kokkos.cuda.1
@ -0,0 +1,68 @@
+LAMMPS (27 May 2014)
+KOKKOS mode is enabled (../lammps.cpp:468)
+  using 6 OpenMP thread(s) per MPI task
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+
+variable	xx equal 20*$x
+variable	xx equal 20*1
+variable	yy equal 20*$y
+variable	yy equal 20*1
+variable	zz equal 20*$z
+variable	zz equal 20*1
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 20 0 ${yy} 0 ${zz}
+region		box block 0 20 0 20 0 ${zz}
+region		box block 0 20 0 20 0 20
+create_box	1 box
+Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919)
+  1 by 1 by 1 MPI processor grid
+create_atoms	1 box
+Created 32000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+run		100
+Memory usage per processor = 16.9509 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733681            0   -4.6134356   -5.0197073 
+     100    0.7574531   -5.7585055            0   -4.6223613   0.20726105 
+Loop time of 0.57192 on 6 procs (1 MPI x 6 OpenMP) for 100 steps with 32000 atoms
+
+Pair  time (%) = 0.205416 (35.917)
+Neigh time (%) = 0.112468 (19.665)
+Comm  time (%) = 0.174223 (30.4629)
+Outpt time (%) = 0.000159025 (0.0278055)
+Other time (%) = 0.0796535 (13.9274)
+
+Nlocal:    32000 ave 32000 max 32000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    19657 ave 19657 max 19657 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+FullNghs:  2.40567e+06 ave 2.40567e+06 max 2.40567e+06 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 2405666
+Ave neighs/atom = 75.1771
+Neighbor list builds = 5
+Dangerous builds = 0
--- a/examples/accelerate/log.lj.1Feb14.kokkos.cuda.2
+++ b/examples/accelerate/log.lj.1Feb14.kokkos.cuda.2
@ -0,0 +1,68 @@
+LAMMPS (27 May 2014)
+KOKKOS mode is enabled (../lammps.cpp:468)
+  using 6 OpenMP thread(s) per MPI task
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+
+variable	xx equal 20*$x
+variable	xx equal 20*1
+variable	yy equal 20*$y
+variable	yy equal 20*1
+variable	zz equal 20*$z
+variable	zz equal 20*1
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 20 0 ${yy} 0 ${zz}
+region		box block 0 20 0 20 0 ${zz}
+region		box block 0 20 0 20 0 20
+create_box	1 box
+Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919)
+  1 by 1 by 2 MPI processor grid
+create_atoms	1 box
+Created 32000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+run		100
+Memory usage per processor = 8.95027 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733681            0   -4.6134356   -5.0197073 
+     100    0.7574531   -5.7585055            0   -4.6223613   0.20726105 
+Loop time of 0.689608 on 12 procs (2 MPI x 6 OpenMP) for 100 steps with 32000 atoms
+
+Pair  time (%) = 0.210953 (30.5903)
+Neigh time (%) = 0.122991 (17.8349)
+Comm  time (%) = 0.25264 (36.6353)
+Outpt time (%) = 0.000259042 (0.0375636)
+Other time (%) = 0.102765 (14.9019)
+
+Nlocal:    16000 ave 16001 max 15999 min
+Histogram: 1 0 0 0 0 0 0 0 0 1
+Nghost:    13632.5 ave 13635 max 13630 min
+Histogram: 1 0 0 0 0 0 0 0 0 1
+Neighs:    0 ave 0 max 0 min
+Histogram: 2 0 0 0 0 0 0 0 0 0
+FullNghs:  1.20283e+06 ave 1.20347e+06 max 1.2022e+06 min
+Histogram: 1 0 0 0 0 0 0 0 0 1
+
+Total # of neighbors = 2405666
+Ave neighs/atom = 75.1771
+Neighbor list builds = 5
+Dangerous builds = 0
--- a/examples/accelerate/log.lj.1Feb14.kokkos.omp.1
+++ b/examples/accelerate/log.lj.1Feb14.kokkos.omp.1
@ -0,0 +1,68 @@
+LAMMPS (27 May 2014)
+KOKKOS mode is enabled (../lammps.cpp:468)
+  using 1 OpenMP thread(s) per MPI task
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+
+variable	xx equal 20*$x
+variable	xx equal 20*1
+variable	yy equal 20*$y
+variable	yy equal 20*1
+variable	zz equal 20*$z
+variable	zz equal 20*1
+
+package         kokkos neigh half
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 20 0 ${yy} 0 ${zz}
+region		box block 0 20 0 20 0 ${zz}
+region		box block 0 20 0 20 0 20
+create_box	1 box
+Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919)
+  1 by 1 by 1 MPI processor grid
+create_atoms	1 box
+Created 32000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+run		100
+Memory usage per processor = 7.79551 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733681            0   -4.6134356   -5.0197073 
+     100    0.7574531   -5.7585055            0   -4.6223613   0.20726105 
+Loop time of 2.29105 on 1 procs (1 MPI x 1 OpenMP) for 100 steps with 32000 atoms
+
+Pair  time (%) = 1.82425 (79.6249)
+Neigh time (%) = 0.338632 (14.7806)
+Comm  time (%) = 0.0366232 (1.59853)
+Outpt time (%) = 0.000144005 (0.00628553)
+Other time (%) = 0.0914049 (3.98965)
+
+Nlocal:    32000 ave 32000 max 32000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    19657 ave 19657 max 19657 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    1.20283e+06 ave 1.20283e+06 max 1.20283e+06 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 1202833
+Ave neighs/atom = 37.5885
+Neighbor list builds = 5
+Dangerous builds = 0
--- a/examples/accelerate/log.lj.1Feb14.kokkos.omp.4
+++ b/examples/accelerate/log.lj.1Feb14.kokkos.omp.4
@ -0,0 +1,68 @@
+LAMMPS (27 May 2014)
+KOKKOS mode is enabled (../lammps.cpp:468)
+  using 4 OpenMP thread(s) per MPI task
+# 3d Lennard-Jones melt
+
+variable	x index 1
+variable	y index 1
+variable	z index 1
+
+variable	xx equal 20*$x
+variable	xx equal 20*1
+variable	yy equal 20*$y
+variable	yy equal 20*1
+variable	zz equal 20*$z
+variable	zz equal 20*1
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 20 0 ${yy} 0 ${zz}
+region		box block 0 20 0 20 0 ${zz}
+region		box block 0 20 0 20 0 20
+create_box	1 box
+Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919)
+  1 by 1 by 1 MPI processor grid
+create_atoms	1 box
+Created 32000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut 2.5
+pair_coeff	1 1 1.0 1.0 2.5
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+run		100
+Memory usage per processor = 13.2888 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -6.7733681            0   -4.6134356   -5.0197073 
+     100    0.7574531   -5.7585055            0   -4.6223613   0.20726105 
+Loop time of 0.983697 on 4 procs (1 MPI x 4 OpenMP) for 100 steps with 32000 atoms
+
+Pair  time (%) = 0.767155 (77.9869)
+Neigh time (%) = 0.14734 (14.9782)
+Comm  time (%) = 0.041466 (4.21532)
+Outpt time (%) = 0.000172138 (0.0174991)
+Other time (%) = 0.0275636 (2.80204)
+
+Nlocal:    32000 ave 32000 max 32000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    19657 ave 19657 max 19657 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+FullNghs:  2.40567e+06 ave 2.40567e+06 max 2.40567e+06 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 2405666
+Ave neighs/atom = 75.1771
+Neighbor list builds = 5
+Dangerous builds = 0
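The KOKKOS OMP logs above show the effect of "package kokkos neigh half": the 1-thread run reports its pairs under Neighs, while the 4-thread run reports them under FullNghs. A full list stores each pair twice, so its total should be twice the half-list total. The sketch below checks that relation and the "Ave neighs/atom" values, using only numbers copied from the two logs.

```python
ATOMS = 32000
HALF_TOTAL = 1202833   # "Total # of neighbors" in log.lj.1Feb14.kokkos.omp.1
FULL_TOTAL = 2405666   # "Total # of neighbors" in log.lj.1Feb14.kokkos.omp.4


def ave_neighs_per_atom(total, atoms=ATOMS):
    """Average neighbor-list entries per atom."""
    return total / atoms


print(FULL_TOTAL == 2 * HALF_TOTAL)                 # each pair is stored twice
print(round(ave_neighs_per_atom(HALF_TOTAL), 4))    # half list
print(round(ave_neighs_per_atom(FULL_TOTAL), 4))    # full list
```

The rounded averages agree with the 37.5885 and 75.1771 values printed in the logs.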
--- a/examples/accelerate/log.lj.5.0.1Feb14.gpu.1
+++ b/examples/accelerate/log.lj.5.0.1Feb14.gpu.1
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# 3d Lennard-Jones melt
+
+newton          off
+package 	gpu force/neigh 0 1 1 threads_per_atom 8
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+variable	xx equal 20*$x
+variable	xx equal 20*2
+variable	yy equal 20*$y
+variable	yy equal 20*2
+variable	zz equal 20*$z
+variable	zz equal 20*2
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 40 0 ${yy} 0 ${zz}
+region		box block 0 40 0 40 0 ${zz}
+region		box block 0 40 0 40 0 40
+create_box	1 box
+Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
+  1 by 1 by 1 MPI processor grid
+create_atoms	1 box
+Created 256000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut/gpu 5.0
+pair_coeff	1 1 1.0 1.0 5.0
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo 		100
+run		1000
+Memory usage per processor = 58.5717 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -7.1616931            0   -5.0017016   -5.6743465 
+     100   0.75998441   -6.1430228            0   -5.0030506  -0.43702263 
+     200   0.75772859   -6.1397321            0   -5.0031437  -0.40563811 
+     300   0.75030002   -6.1286578            0   -5.0032122  -0.33104717 
+     400   0.73999054   -6.1132463            0   -5.0032649  -0.24001424 
+     500   0.73224838   -6.1016938            0   -5.0033255  -0.16524979 
+     600   0.72455889   -6.0902001            0    -5.003366 -0.099949772 
+     700   0.71911385   -6.0820798            0   -5.0034133 -0.046759186 
+     800   0.71253787   -6.0722342            0   -5.0034316 0.0019671065 
+     900   0.70835425   -6.0659819            0   -5.0034546  0.037482543 
+    1000   0.70648171   -6.0631852            0   -5.0034668  0.057159495 
+Loop time of 53.1575 on 1 procs for 1000 steps with 256000 atoms
+
+Pair  time (%) = 45.4859 (85.5682)
+Neigh time (%) = 7.9155e-05 (0.000148907)
+Comm  time (%) = 1.40304 (2.63941)
+Outpt time (%) = 0.00999498 (0.0188026)
+Other time (%) = 6.25847 (11.7734)
+
+Nlocal:    256000 ave 256000 max 256000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    141542 ave 141542 max 141542 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 50
+Dangerous builds = 0
+
+Please see the log.cite file for references relevant to this simulation
+
--- a/examples/accelerate/log.lj.5.0.1Feb14.gpu.4
+++ b/examples/accelerate/log.lj.5.0.1Feb14.gpu.4
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# 3d Lennard-Jones melt
+
+newton          off
+package 	gpu force/neigh 0 1 1 threads_per_atom 8
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+variable	xx equal 20*$x
+variable	xx equal 20*2
+variable	yy equal 20*$y
+variable	yy equal 20*2
+variable	zz equal 20*$z
+variable	zz equal 20*2
+
+units		lj
+atom_style	atomic
+
+lattice		fcc 0.8442
+Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
+region		box block 0 ${xx} 0 ${yy} 0 ${zz}
+region		box block 0 40 0 ${yy} 0 ${zz}
+region		box block 0 40 0 40 0 ${zz}
+region		box block 0 40 0 40 0 40
+create_box	1 box
+Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
+  1 by 2 by 2 MPI processor grid
+create_atoms	1 box
+Created 256000 atoms
+mass		1 1.0
+
+velocity	all create 1.44 87287 loop geom
+
+pair_style	lj/cut/gpu 5.0
+pair_coeff	1 1 1.0 1.0 5.0
+
+neighbor	0.3 bin
+neigh_modify	delay 0 every 20 check no
+
+fix		1 all nve
+
+thermo 		100
+run		1000
+Memory usage per processor = 20.382 Mbytes
+Step Temp E_pair E_mol TotEng Press 
+       0         1.44   -7.1616931            0   -5.0017016   -5.6743465 
+     100   0.75998441   -6.1430228            0   -5.0030506  -0.43702263 
+     200   0.75772859   -6.1397321            0   -5.0031437  -0.40563811 
+     300   0.75030002   -6.1286578            0   -5.0032122  -0.33104718 
+     400   0.73999055   -6.1132463            0   -5.0032649  -0.24001425 
+     500   0.73224835   -6.1016938            0   -5.0033256  -0.16524973 
+     600   0.72455878      -6.0902            0   -5.0033661 -0.099949172 
+     700   0.71911606   -6.0820833            0   -5.0034134 -0.046771469 
+     800   0.71253754   -6.0722337            0   -5.0034316 0.0019725827 
+     900   0.70832904   -6.0659437            0   -5.0034543   0.03758241 
+    1000   0.70634002    -6.062973            0   -5.0034671  0.057951142 
+Loop time of 26.0448 on 4 procs for 1000 steps with 256000 atoms
+
+Pair  time (%) = 18.6673 (71.674)
+Neigh time (%) = 6.55651e-05 (0.00025174)
+Comm  time (%) = 5.797 (22.2578)
+Outpt time (%) = 0.0719919 (0.276416)
+Other time (%) = 1.50839 (5.79152)
+
+Nlocal:    64000 ave 64092 max 63823 min
+Histogram: 1 0 0 0 0 0 1 0 0 2
+Nghost:    64384.2 ave 64490 max 64211 min
+Histogram: 1 0 0 0 0 0 1 0 1 1
+Neighs:    0 ave 0 max 0 min
+Histogram: 4 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 50
+Dangerous builds = 0
+
+Please see the log.cite file for references relevant to this simulation
+
--- a/examples/accelerate/log.phosphate.1Feb14.gpu.1
+++ b/examples/accelerate/log.phosphate.1Feb14.gpu.1
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# GI-System
+
+units metal
+newton off
+package		gpu force/neigh 0 1 1
+
+atom_style      charge
+read_data 	data.phosphate
+  orthogonal box = (33.0201 33.0201 33.0201) to (86.9799 86.9799 86.9799)
+  1 by 1 by 1 MPI processor grid
+  reading atoms ...
+  10950 atoms
+  reading velocities ...
+  10950 velocities
+
+replicate 	3 3 3
+  orthogonal box = (33.0201 33.0201 33.0201) to (194.899 194.899 194.899)
+  1 by 1 by 1 MPI processor grid
+  295650 atoms
+
+pair_style      lj/cut/coul/long/gpu 15.0
+
+pair_coeff 1 1  0.0 0.29
+pair_coeff 1 2  0.0 0.29
+pair_coeff 1 3  0.000668 2.5738064
+pair_coeff 2 2  0.0 0.29
+pair_coeff 2 3  0.004251 1.91988674
+pair_coeff 3 3  0.012185 2.91706967
+
+kspace_style    pppm/gpu 1e-5
+
+neighbor	2.0 bin
+
+thermo 100
+
+timestep 0.001
+
+fix 		1 all npt temp 400 400 0.01 iso 1000.0 1000.0 1.0
+run 		200
+PPPM initialization ...
+  G vector (1/distance) = 0.210051
+  grid = 108 108 108
+  stencil order = 5
+  estimated absolute RMS force accuracy = 0.000178801
+  estimated relative force accuracy = 1.24171e-05
+  using double precision FFTs
+  3d grid and FFT values/proc = 1520875 1259712
+Memory usage per processor = 266.927 Mbytes
+Step Temp E_pair E_mol TotEng Press Volume 
+       0    400.30257   -2381941.6            0   -2366643.8   -449.96842    4242016.4 
+     100    411.69681   -2392428.5            0   -2376695.3     7046.698    4308883.5 
+     200    401.28392   -2394152.5            0   -2378817.2    3243.2685    4334284.4 
+Loop time of 154.943 on 1 procs for 200 steps with 295650 atoms
+
+Pair  time (%) = 12.0178 (7.75625)
+Kspce time (%) = 80.3771 (51.8753)
+Neigh time (%) = 0.0138304 (0.00892614)
+Comm  time (%) = 0.348981 (0.225232)
+Outpt time (%) = 0.00180006 (0.00116176)
+Other time (%) = 62.1834 (40.1331)
+
+FFT time (% of Kspce) = 56.9885 (70.9013)
+FFT Gflps 3d (1d only) = 1.24196 3.00739
+
+Nlocal:    295650 ave 295650 max 295650 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    226982 ave 226982 max 226982 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 6
+Dangerous builds = 0
+unfix 		1
+
+Please see the log.cite file for references relevant to this simulation
+
--- a/examples/accelerate/log.phosphate.1Feb14.gpu.4
+++ b/examples/accelerate/log.phosphate.1Feb14.gpu.4
@ -0,0 +1,80 @@
+LAMMPS (1 Feb 2014)
+# GI-System
+
+units metal
+newton off
+package		gpu force/neigh 0 1 1
+
+atom_style      charge
+read_data 	data.phosphate
+  orthogonal box = (33.0201 33.0201 33.0201) to (86.9799 86.9799 86.9799)
+  1 by 2 by 2 MPI processor grid
+  reading atoms ...
+  10950 atoms
+  reading velocities ...
+  10950 velocities
+
+replicate 	3 3 3
+  orthogonal box = (33.0201 33.0201 33.0201) to (194.899 194.899 194.899)
+  2 by 1 by 2 MPI processor grid
+  295650 atoms
+
+pair_style      lj/cut/coul/long/gpu 15.0
+
+pair_coeff 1 1  0.0 0.29
+pair_coeff 1 2  0.0 0.29
+pair_coeff 1 3  0.000668 2.5738064
+pair_coeff 2 2  0.0 0.29
+pair_coeff 2 3  0.004251 1.91988674
+pair_coeff 3 3  0.012185 2.91706967
+
+kspace_style    pppm/gpu 1e-5
+
+neighbor	2.0 bin
+
+thermo 100
+
+timestep 0.001
+
+fix 		1 all npt temp 400 400 0.01 iso 1000.0 1000.0 1.0
+run 		200
+PPPM initialization ...
+  G vector (1/distance) = 0.210051
+  grid = 108 108 108
+  stencil order = 5
+  estimated absolute RMS force accuracy = 0.000178801
+  estimated relative force accuracy = 1.24171e-05
+  using double precision FFTs
+  3d grid and FFT values/proc = 427915 314928
+Memory usage per processor = 80.0769 Mbytes
+Step Temp E_pair E_mol TotEng Press Volume 
+       0    400.30257   -2381941.6            0   -2366643.8   -449.96842    4242016.4 
+     100    411.69681   -2392428.5            0   -2376695.3     7046.698    4308883.5 
+     200    401.28392   -2394152.5            0   -2378817.2    3243.2685    4334284.4 
+Loop time of 56.1151 on 4 procs for 200 steps with 295650 atoms
+
+Pair  time (%) = 4.55937 (8.12503)
+Kspce time (%) = 34.5442 (61.5596)
+Neigh time (%) = 0.00624901 (0.0111361)
+Comm  time (%) = 0.470437 (0.838343)
+Outpt time (%) = 0.000446558 (0.000795789)
+Other time (%) = 16.5344 (29.4651)
+
+FFT time (% of Kspce) = 22.6526 (65.5758)
+FFT Gflps 3d (1d only) = 3.12448 11.5533
+
+Nlocal:    73912.5 ave 74223 max 73638 min
+Histogram: 1 1 0 0 0 0 0 1 0 1
+Nghost:    105257 ave 105797 max 104698 min
+Histogram: 1 0 0 1 0 0 1 0 0 1
+Neighs:    0 ave 0 max 0 min
+Histogram: 4 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Neighbor list builds = 6
+Dangerous builds = 0
+unfix 		1
+
+Please see the log.cite file for references relevant to this simulation
+
--- a/examples/accelerate/log.rhodo.1Feb14.gpu.1
+++ b/examples/accelerate/log.rhodo.1Feb14.gpu.1
@ -0,0 +1,135 @@
+LAMMPS (1 Feb 2014)
+# Rhodopsin model
+
+newton off
+package 	gpu force/neigh 0 1 1
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+units           real
+neigh_modify    delay 5 every 1
+
+atom_style      full
+bond_style      harmonic
+angle_style     charmm
+dihedral_style  charmm
+improper_style  harmonic
+pair_style      lj/charmm/coul/long/gpu 8.0 10.0
+pair_modify     mix arithmetic
+kspace_style    pppm/gpu 1e-4
+
+read_data       data.rhodo
+  orthogonal box = (-27.5 -38.5 -36.2676) to (27.5 38.5 36.2645)
+  1 by 1 by 1 MPI processor grid
+  reading atoms ...
+  32000 atoms
+  reading velocities ...
+  32000 velocities
+  scanning bonds ...
+  4 = max bonds/atom
+  scanning angles ...
+  18 = max angles/atom
+  scanning dihedrals ...
+  40 = max dihedrals/atom
+  scanning impropers ...
+  4 = max impropers/atom
+  reading bonds ...
+  27723 bonds
+  reading angles ...
+  40467 angles
+  reading dihedrals ...
+  56829 dihedrals
+  reading impropers ...
+  1034 impropers
+  4 = max # of 1-2 neighbors
+  12 = max # of 1-3 neighbors
+  24 = max # of 1-4 neighbors
+  26 = max # of special neighbors
+
+replicate	$x $y $z
+replicate	2 $y $z
+replicate	2 2 $z
+replicate	2 2 2
+  orthogonal box = (-27.5 -38.5 -36.2676) to (82.5 115.5 108.797)
+  1 by 1 by 1 MPI processor grid
+  256000 atoms
+  221784 bonds
+  323736 angles
+  454632 dihedrals
+  8272 impropers
+  4 = max # of 1-2 neighbors
+  12 = max # of 1-3 neighbors
+  24 = max # of 1-4 neighbors
+  26 = max # of special neighbors
+
+fix             1 all shake 0.0001 5 0 m 1.0 a 232
+  12936 = # of size 2 clusters
+  29064 = # of size 3 clusters
+  5976 = # of size 4 clusters
+  33864 = # of frozen angles
+fix             2 all npt temp 300.0 300.0 100.0 		z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1
+
+special_bonds   charmm
+
+thermo          100
+thermo_style    multi
+timestep        2.0
+
+run		200
+PPPM initialization ...
+  G vector (1/distance) = 0.245959
+  grid = 48 64 60
+  stencil order = 5
+  estimated absolute RMS force accuracy = 0.0410392
+  estimated relative force accuracy = 0.000123588
+  using double precision FFTs
+  3d grid and FFT values/proc = 237705 184320
+Memory usage per processor = 760.048 Mbytes
+---------------- Step        0 ----- CPU =      0.0000 (sec) ----------------
+TotEng   =    157024.0504 KinEng   =    172792.6155 Temp     =       301.1796 
+PotEng   =    -15768.5651 E_bond   =     28164.9917 E_angle  =    117224.0742 
+E_dihed  =     61174.8491 E_impro  =      3752.0273 E_vdwl   =     10108.6323 
+E_coul   =   1894295.6635 E_long   =  -2130488.8032 Press    =      9562.1557 
+Volume   =   2457390.7959 
+---------------- Step      100 ----- CPU =     36.3779 (sec) ----------------
+TotEng   =   -233301.6813 KinEng   =    123222.9259 Temp     =       214.7790 
+PotEng   =   -356524.6072 E_bond   =     13098.4672 E_angle  =     56766.9111 
+E_dihed  =     45556.8240 E_impro  =      1313.9378 E_vdwl   =    -40863.9278 
+E_coul   =   1705084.7672 E_long   =  -2137481.5867 Press    =     -1634.3912 
+Volume   =   2522232.6302 
+---------------- Step      200 ----- CPU =     70.7784 (sec) ----------------
+TotEng   =   -308342.0030 KinEng   =    108937.4160 Temp     =       189.8792 
+PotEng   =   -417279.4189 E_bond   =      9579.0134 E_angle  =     47373.6274 
+E_dihed  =     39847.4817 E_impro  =       967.6755 E_vdwl   =    -23635.2960 
+E_coul   =   1646633.4711 E_long   =  -2138045.3918 Press    =     -1185.9327 
+Volume   =   2554683.1533 
+Loop time of 70.7784 on 1 procs for 200 steps with 256000 atoms
+
+Pair  time (%) = 10.0374 (14.1815)
+Bond  time (%) = 27.2471 (38.4963)
+Kspce time (%) = 7.19169 (10.1608)
+Neigh time (%) = 5.43951 (7.68527)
+Comm  time (%) = 0.681534 (0.962912)
+Outpt time (%) = 0.00139809 (0.0019753)
+Other time (%) = 20.1798 (28.5112)
+
+FFT time (% of Kspce) = 5.17983 (72.0253)
+FFT Gflps 3d (1d only) = 1.72575 2.95071
+
+Nlocal:    256000 ave 256000 max 256000 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Nghost:    161662 ave 161662 max 161662 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+Neighs:    0 ave 0 max 0 min
+Histogram: 1 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Ave special neighs/atom = 7.43187
+Neighbor list builds = 31
+Dangerous builds = 12
+
+Please see the log.cite file for references relevant to this simulation
+
--- a/examples/accelerate/log.rhodo.1Feb14.gpu.4
+++ b/examples/accelerate/log.rhodo.1Feb14.gpu.4
@ -0,0 +1,135 @@
+LAMMPS (1 Feb 2014)
+# Rhodopsin model
+
+newton off
+package 	gpu force/neigh 0 1 1
+
+variable	x index 2
+variable	y index 2
+variable	z index 2
+
+units           real
+neigh_modify    delay 5 every 1
+
+atom_style      full
+bond_style      harmonic
+angle_style     charmm
+dihedral_style  charmm
+improper_style  harmonic
+pair_style      lj/charmm/coul/long/gpu 8.0 10.0
+pair_modify     mix arithmetic
+kspace_style    pppm/gpu 1e-4
+
+read_data       data.rhodo
+  orthogonal box = (-27.5 -38.5 -36.2676) to (27.5 38.5 36.2645)
+  1 by 2 by 2 MPI processor grid
+  reading atoms ...
+  32000 atoms
+  reading velocities ...
+  32000 velocities
+  scanning bonds ...
+  4 = max bonds/atom
+  scanning angles ...
+  18 = max angles/atom
+  scanning dihedrals ...
+  40 = max dihedrals/atom
+  scanning impropers ...
+  4 = max impropers/atom
+  reading bonds ...
+  27723 bonds
+  reading angles ...
+  40467 angles
+  reading dihedrals ...
+  56829 dihedrals
+  reading impropers ...
+  1034 impropers
+  4 = max # of 1-2 neighbors
+  12 = max # of 1-3 neighbors
+  24 = max # of 1-4 neighbors
+  26 = max # of special neighbors
+
+replicate	$x $y $z
+replicate	2 $y $z
+replicate	2 2 $z
+replicate	2 2 2
+  orthogonal box = (-27.5 -38.5 -36.2676) to (82.5 115.5 108.797)
+  1 by 2 by 2 MPI processor grid
+  256000 atoms
+  221784 bonds
+  323736 angles
+  454632 dihedrals
+  8272 impropers
+  4 = max # of 1-2 neighbors
+  12 = max # of 1-3 neighbors
+  24 = max # of 1-4 neighbors
+  26 = max # of special neighbors
+
+fix             1 all shake 0.0001 5 0 m 1.0 a 232
+  12936 = # of size 2 clusters
+  29064 = # of size 3 clusters
+  5976 = # of size 4 clusters
+  33864 = # of frozen angles
+fix             2 all npt temp 300.0 300.0 100.0 		z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1
+
+special_bonds   charmm
+
+thermo          100
+thermo_style    multi
+timestep        2.0
+
+run		200
+PPPM initialization ...
+  G vector (1/distance) = 0.245959
+  grid = 48 64 60
+  stencil order = 5
+  estimated absolute RMS force accuracy = 0.0410392
+  estimated relative force accuracy = 0.000123588
+  using double precision FFTs
+  3d grid and FFT values/proc = 68635 46080
+Memory usage per processor = 250.358 Mbytes
+---------------- Step        0 ----- CPU =      0.0000 (sec) ----------------
+TotEng   =    157024.0504 KinEng   =    172792.6155 Temp     =       301.1796 
+PotEng   =    -15768.5651 E_bond   =     28164.9917 E_angle  =    117224.0742 
+E_dihed  =     61174.8491 E_impro  =      3752.0273 E_vdwl   =     10108.6323 
+E_coul   =   1894295.6635 E_long   =  -2130488.8032 Press    =      9562.1557 
+Volume   =   2457390.7959 
+---------------- Step      100 ----- CPU =     12.3409 (sec) ----------------
+TotEng   =   -233301.6797 KinEng   =    123222.9259 Temp     =       214.7790 
+PotEng   =   -356524.6057 E_bond   =     13098.4672 E_angle  =     56766.9111 
+E_dihed  =     45556.8240 E_impro  =      1313.9378 E_vdwl   =    -40863.9278 
+E_coul   =   1705084.7688 E_long   =  -2137481.5867 Press    =     -1634.3910 
+Volume   =   2522232.6302 
+---------------- Step      200 ----- CPU =     23.6590 (sec) ----------------
+TotEng   =   -308341.9699 KinEng   =    108937.4196 Temp     =       189.8792 
+PotEng   =   -417279.3895 E_bond   =      9579.0134 E_angle  =     47373.6274 
+E_dihed  =     39847.4807 E_impro  =       967.6755 E_vdwl   =    -23635.2996 
+E_coul   =   1646633.5046 E_long   =  -2138045.3916 Press    =     -1185.9299 
+Volume   =   2554683.1519 
+Loop time of 23.6591 on 4 procs for 200 steps with 256000 atoms
+
+Pair  time (%) = 4.81669 (20.3587)
+Bond  time (%) = 6.52579 (27.5826)
+Kspce time (%) = 4.48765 (18.968)
+Neigh time (%) = 1.3238 (5.5953)
+Comm  time (%) = 0.490551 (2.07342)
+Outpt time (%) = 0.000454485 (0.00192098)
+Other time (%) = 6.01414 (25.42)
+
+FFT time (% of Kspce) = 1.77734 (39.6051)
+FFT Gflps 3d (1d only) = 5.02949 11.6654
+
+Nlocal:    64000 ave 64001 max 63999 min
+Histogram: 1 0 0 0 0 2 0 0 0 1
+Nghost:    70656.5 ave 70660 max 70654 min
+Histogram: 1 0 0 2 0 0 0 0 0 1
+Neighs:    0 ave 0 max 0 min
+Histogram: 4 0 0 0 0 0 0 0 0 0
+
+Total # of neighbors = 0
+Ave neighs/atom = 0
+Ave special neighs/atom = 7.43187
+Neighbor list builds = 31
+Dangerous builds = 12
+
+Please see the log.cite file for references relevant to this simulation
+