Merge pull request #725 from stanmoore1/kk_update

Update the Kokkos library in LAMMPS to v2.5.00
This commit is contained in:
Steve Plimpton 2018-01-08 09:12:51 -07:00 committed by GitHub
commit 450c689ae9
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
366 changed files with 11645 additions and 3763 deletions


@@ -1,5 +1,75 @@
# Change Log

## [2.5.00](https://github.com/kokkos/kokkos/tree/2.5.00) (2017-12-15)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.04.11...2.5.00)

**Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.5**

**Implemented enhancements:**
- Provide Makefile.kokkos logic for CMake and TriBITS [\#878](https://github.com/kokkos/kokkos/issues/878)
- Add Scatter View [\#825](https://github.com/kokkos/kokkos/issues/825)
- Drop gcc 4.7 and intel 14 from supported compiler list [\#603](https://github.com/kokkos/kokkos/issues/603)
- Enable construction of unmanaged view using common\_view\_alloc\_prop [\#1170](https://github.com/kokkos/kokkos/issues/1170)
- Unused Function Warning with XL [\#1267](https://github.com/kokkos/kokkos/issues/1267)
- Add memory pool parameter check [\#1218](https://github.com/kokkos/kokkos/issues/1218)
- CUDA9: Fix warning for unsupported long double [\#1189](https://github.com/kokkos/kokkos/issues/1189)
- CUDA9: fix warning on defaulted function marking [\#1188](https://github.com/kokkos/kokkos/issues/1188)
- CUDA9: fix warnings for deprecated warp level functions [\#1187](https://github.com/kokkos/kokkos/issues/1187)
- Add CUDA 9.0 nightly testing [\#1174](https://github.com/kokkos/kokkos/issues/1174)
- {OMPI,MPICH}\_CXX hack breaks nvcc\_wrapper use case [\#1166](https://github.com/kokkos/kokkos/issues/1166)
- KOKKOS\_HAVE\_CUDA\_LAMBDA became KOKKOS\_CUDA\_USE\_LAMBDA [\#1274](https://github.com/kokkos/kokkos/issues/1274)

**Fixed bugs:**
- MinMax Reducer with tagged operator doesn't compile [\#1251](https://github.com/kokkos/kokkos/issues/1251)
- Reducers for Tagged operators give wrong answer [\#1250](https://github.com/kokkos/kokkos/issues/1250)
- Kokkos not Compatible with Big Endian Machines? [\#1235](https://github.com/kokkos/kokkos/issues/1235)
- Parallel Scan hangs forever on BG/Q [\#1234](https://github.com/kokkos/kokkos/issues/1234)
- Threads backend doesn't compile with Clang on OS X [\#1232](https://github.com/kokkos/kokkos/issues/1232)
- $\(shell date\) needs quote [\#1264](https://github.com/kokkos/kokkos/issues/1264)
- Unqualified parallel\_for call conflicts with user-defined parallel\_for [\#1219](https://github.com/kokkos/kokkos/issues/1219)
- KokkosAlgorithms: CMake issue in unit tests [\#1212](https://github.com/kokkos/kokkos/issues/1212)
- Intel 18 Error: "simd pragma has been deprecated" [\#1210](https://github.com/kokkos/kokkos/issues/1210)
- Memory leak in Kokkos::initialize [\#1194](https://github.com/kokkos/kokkos/issues/1194)
- CUDA9: compiler error with static assert template arguments [\#1190](https://github.com/kokkos/kokkos/issues/1190)
- Kokkos::Serial::is\_initialized returns always true [\#1184](https://github.com/kokkos/kokkos/issues/1184)
- Triple nested parallelism still fails on bowman [\#1093](https://github.com/kokkos/kokkos/issues/1093)
- OpenMP openmp.range on Develop Runs Forever on POWER7+ with RHEL7 and GCC4.8.5 [\#995](https://github.com/kokkos/kokkos/issues/995)
- Rendezvous performance at global scope [\#985](https://github.com/kokkos/kokkos/issues/985)

## [2.04.11](https://github.com/kokkos/kokkos/tree/2.04.11) (2017-10-28)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.04.04...2.04.11)

**Implemented enhancements:**
- Add Subview pattern. [\#648](https://github.com/kokkos/kokkos/issues/648)
- Add Kokkos "global" is\_initialized [\#1060](https://github.com/kokkos/kokkos/issues/1060)
- Add create\_mirror\_view\_and\_copy [\#1161](https://github.com/kokkos/kokkos/issues/1161)
- Add KokkosConcepts SpaceAccessibility function [\#1092](https://github.com/kokkos/kokkos/issues/1092)
- Option to Disable Initialize Warnings [\#1142](https://github.com/kokkos/kokkos/issues/1142)
- Mature task-DAG capability [\#320](https://github.com/kokkos/kokkos/issues/320)
- Promote Work DAG from experimental [\#1126](https://github.com/kokkos/kokkos/issues/1126)
- Implement new WorkGraph push/pop [\#1108](https://github.com/kokkos/kokkos/issues/1108)
- Kokkos\_ENABLE\_Cuda\_Lambda should default ON [\#1101](https://github.com/kokkos/kokkos/issues/1101)
- Add multidimensional parallel for example and improve unit test [\#1064](https://github.com/kokkos/kokkos/issues/1064)
- Fix ROCm: Performance tests not building [\#1038](https://github.com/kokkos/kokkos/issues/1038)
- Make KOKKOS\_ALIGN\_SIZE a configure-time option [\#1004](https://github.com/kokkos/kokkos/issues/1004)
- Make alignment consistent [\#809](https://github.com/kokkos/kokkos/issues/809)
- Improve subview construction on Cuda backend [\#615](https://github.com/kokkos/kokkos/issues/615)

**Fixed bugs:**
- Kokkos::vector fixes for application [\#1134](https://github.com/kokkos/kokkos/issues/1134)
- DynamicView non-power of two value\_type [\#1177](https://github.com/kokkos/kokkos/issues/1177)
- Memory pool bug [\#1154](https://github.com/kokkos/kokkos/issues/1154)
- Cuda launch bounds performance regression bug [\#1140](https://github.com/kokkos/kokkos/issues/1140)
- Significant performance regression in LAMMPS after updating Kokkos [\#1139](https://github.com/kokkos/kokkos/issues/1139)
- CUDA compile error [\#1128](https://github.com/kokkos/kokkos/issues/1128)
- MDRangePolicy neg idx test failure in debug mode [\#1113](https://github.com/kokkos/kokkos/issues/1113)
- subview construction on Cuda backend [\#615](https://github.com/kokkos/kokkos/issues/615)

## [2.04.04](https://github.com/kokkos/kokkos/tree/2.04.04) (2017-09-11)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.04.00...2.04.04)


@@ -1,3 +1,5 @@
# Is this a build as part of Trilinos?
IF(COMMAND TRIBITS_PACKAGE_DECL)
SET(KOKKOS_HAS_TRILINOS ON CACHE BOOL "")
ELSE()
@@ -6,13 +8,57 @@ ENDIF()
IF(NOT KOKKOS_HAS_TRILINOS)
cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
project(Kokkos CXX)
INCLUDE(cmake/kokkos.cmake)
# Define Project Name if this is a standalone build
IF(NOT DEFINED ${PROJECT_NAME})
project(Kokkos CXX)
ENDIF()
# Basic initialization (Used in KOKKOS_SETTINGS)
set(KOKKOS_SRC_PATH ${Kokkos_SOURCE_DIR})
set(KOKKOS_PATH ${KOKKOS_SRC_PATH})
#------------ COMPILER AND FEATURE CHECKS ------------------------------------
include(${KOKKOS_SRC_PATH}/cmake/kokkos_functions.cmake)
set_kokkos_cxx_compiler()
set_kokkos_cxx_standard()
#------------ GET OPTIONS AND KOKKOS_SETTINGS --------------------------------
# Add Kokkos' modules to CMake's module path.
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Kokkos_SOURCE_DIR}/cmake/Modules/")
set(KOKKOS_CMAKE_VERBOSE True)
include(${KOKKOS_SRC_PATH}/cmake/kokkos_options.cmake)
include(${KOKKOS_SRC_PATH}/cmake/kokkos_settings.cmake)
#------------ GENERATE HEADER AND SOURCE FILES -------------------------------
execute_process(
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} generate_build_settings
WORKING_DIRECTORY "${Kokkos_BINARY_DIR}"
OUTPUT_FILE ${Kokkos_BINARY_DIR}/core_src_make.out
RESULT_VARIABLE res
)
include(${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake)
set_kokkos_srcs(KOKKOS_SRC ${KOKKOS_SRC})
#------------ NOW BUILD ------------------------------------------------------
include(${KOKKOS_SRC_PATH}/cmake/kokkos_build.cmake)
#------------ Add in Fake Tribits Handling to allow unit test builds- --------
include(${KOKKOS_SRC_PATH}/cmake/tribits.cmake)
TRIBITS_PACKAGE_DECL(Kokkos)
ADD_SUBDIRECTORY(core)
ADD_SUBDIRECTORY(containers)
ADD_SUBDIRECTORY(algorithms)
ELSE()
#------------------------------------------------------------------------------
#
# A) Forward delcare the package so that certain options are also defined for
# A) Forward declare the package so that certain options are also defined for
# subpackages
#
@@ -21,178 +67,28 @@ TRIBITS_PACKAGE_DECL(Kokkos) # ENABLE_SHADOWING_WARNINGS)
#------------------------------------------------------------------------------
#
# B) Define the common options for Kokkos first so they can be used by
# subpackages as well.
# B) Install Kokkos' build files
#
# If using the Makefile-generated files, then need to set things up.
# Here, assume that TriBITS has been run from ProjectCompilerPostConfig.cmake
# and already generated KokkosCore_config.h and kokkos_generated_settings.cmake
# in the previously defined Kokkos_GEN_DIR
# We need to copy them over to the correct place and source the cmake file
# mfh 01 Aug 2016: See Issue #61:
#
# https://github.com/kokkos/kokkos/issues/61
#
# Don't use TRIBITS_ADD_DEBUG_OPTION() here, because that defines
# HAVE_KOKKOS_DEBUG. We define KOKKOS_HAVE_DEBUG here instead,
# for compatibility with Kokkos' Makefile build system.
if(NOT KOKKOS_LEGACY_TRIBITS)
set(Kokkos_GEN_DIR ${CMAKE_BINARY_DIR})
file(COPY "${Kokkos_GEN_DIR}/KokkosCore_config.h"
DESTINATION "${CMAKE_CURRENT_BINARY_DIR}" USE_SOURCE_PERMISSIONS)
install(FILES "${Kokkos_GEN_DIR}/KokkosCore_config.h"
DESTINATION include)
file(COPY "${Kokkos_GEN_DIR}/kokkos_generated_settings.cmake"
DESTINATION "${CMAKE_CURRENT_BINARY_DIR}" USE_SOURCE_PERMISSIONS)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_DEBUG
KOKKOS_HAVE_DEBUG
"Enable run-time debug checks. These checks may be expensive, so they are disabled by default in a release build."
${${PROJECT_NAME}_ENABLE_DEBUG}
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_SIERRA_BUILD
KOKKOS_FOR_SIERRA
"Configure Kokkos for building within the Sierra build system."
OFF
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Cuda
KOKKOS_HAVE_CUDA
"Enable CUDA support in Kokkos."
"${TPL_ENABLE_CUDA}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Cuda_UVM
KOKKOS_USE_CUDA_UVM
"Enable CUDA Unified Virtual Memory as the default in Kokkos."
OFF
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Cuda_RDC
KOKKOS_HAVE_CUDA_RDC
"Enable CUDA Relocatable Device Code support in Kokkos."
OFF
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Cuda_Lambda
KOKKOS_HAVE_CUDA_LAMBDA
"Enable CUDA LAMBDA support in Kokkos."
OFF
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Pthread
KOKKOS_HAVE_PTHREAD
"Enable Pthread support in Kokkos."
OFF
)
ASSERT_DEFINED(TPL_ENABLE_Pthread)
IF(Kokkos_ENABLE_Pthread AND NOT TPL_ENABLE_Pthread)
MESSAGE(FATAL_ERROR "You set Kokkos_ENABLE_Pthread=ON, but Trilinos' support for Pthread(s) is not enabled (TPL_ENABLE_Pthread=OFF). This is not allowed. Please enable Pthreads in Trilinos before attempting to enable Kokkos' support for Pthreads.")
ENDIF()
IF(NOT TPL_ENABLE_Pthread)
ADD_DEFINITIONS(-DGTEST_HAS_PTHREAD=0)
ENDIF()
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_OpenMP
KOKKOS_HAVE_OPENMP
"Enable OpenMP support in Kokkos."
"${${PROJECT_NAME}_ENABLE_OpenMP}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_QTHREAD
KOKKOS_HAVE_QTHREADS
"Enable Qthreads support in Kokkos."
"${TPL_ENABLE_QTHREAD}"
)
# TODO: No longer an option in Kokkos. Needs to be removed.
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_CXX11
KOKKOS_HAVE_CXX11
"Enable C++11 support in Kokkos."
"${${PROJECT_NAME}_ENABLE_CXX11}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_HWLOC
KOKKOS_HAVE_HWLOC
"Enable HWLOC support in Kokkos."
"${TPL_ENABLE_HWLOC}"
)
# TODO: This is currently not used in Kokkos. Should it be removed?
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_MPI
KOKKOS_HAVE_MPI
"Enable MPI support in Kokkos."
"${TPL_ENABLE_MPI}"
)
# Set default value of Kokkos_ENABLE_Debug_Bounds_Check option
#
# CMake is case sensitive. The Kokkos_ENABLE_Debug_Bounds_Check
# option (defined below) is annoyingly not all caps, but we need to
# keep it that way for backwards compatibility. If users forget and
# try using an all-caps variable, then make it count by using the
# all-caps version as the default value of the original, not-all-caps
# option. Otherwise, the default value of this option comes from
# Kokkos_ENABLE_DEBUG (see Issue #367).
ASSERT_DEFINED(${PACKAGE_NAME}_ENABLE_DEBUG)
IF(DEFINED Kokkos_ENABLE_DEBUG_BOUNDS_CHECK)
IF(Kokkos_ENABLE_DEBUG_BOUNDS_CHECK)
SET(Kokkos_ENABLE_Debug_Bounds_Check_DEFAULT ON)
ELSE()
SET(Kokkos_ENABLE_Debug_Bounds_Check_DEFAULT "${${PACKAGE_NAME}_ENABLE_DEBUG}")
ENDIF()
ELSE()
SET(Kokkos_ENABLE_Debug_Bounds_Check_DEFAULT "${${PACKAGE_NAME}_ENABLE_DEBUG}")
ENDIF()
ASSERT_DEFINED(Kokkos_ENABLE_Debug_Bounds_Check_DEFAULT)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Debug_Bounds_Check
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK
"Enable Kokkos::View run-time bounds checking."
"${Kokkos_ENABLE_Debug_Bounds_Check_DEFAULT}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Debug_DualView_Modify_Check
KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
"Enable abort when Kokkos::DualView modified on host and device without sync."
"${Kokkos_ENABLE_DEBUG}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Profiling
KOKKOS_ENABLE_PROFILING
"Enable KokkosP profiling support for kernel data collections."
"${TPL_ENABLE_DLlib}"
)
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Profiling_Load_Print
KOKKOS_ENABLE_PROFILING_LOAD_PRINT
"Print to standard output which profiling library was loaded."
OFF
)
# placeholder for future device...
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Winthread
KOKKOS_HAVE_WINTHREAD
"Enable Winthread support in Kokkos."
"${TPL_ENABLE_Winthread}"
)
# TODO: No longer an option in Kokkos. Needs to be removed.
# use new/old View
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_USING_DEPRECATED_VIEW
KOKKOS_USING_DEPRECATED_VIEW
"Choose whether to use the old, deprecated Kokkos::View"
OFF
)
include(${CMAKE_CURRENT_BINARY_DIR}/kokkos_generated_settings.cmake)
# Sources come from makefile-generated kokkos_generated_settings.cmake file
# Enable using the individual sources if needed
set_kokkos_srcs(KOKKOS_SRC ${KOKKOS_SRC})
endif ()
#------------------------------------------------------------------------------
@@ -226,10 +122,6 @@ TRIBITS_PACKAGE_DEF()
TRIBITS_EXCLUDE_AUTOTOOLS_FILES()
TRIBITS_EXCLUDE_FILES(
classic/doc
classic/LinAlg/doc/CrsRefactorNotesMay2012
)
TRIBITS_PACKAGE_POSTPROCESS()
ENDIF()


@ -28,33 +28,39 @@ KOKKOS_OPTIONS ?= ""
# Options: force_uvm,use_ldg,rdc,enable_lambda
KOKKOS_CUDA_OPTIONS ?= "enable_lambda"
# Return a 1 if a string contains a substring and 0 if not
# Note: the search string should not be quoted with '"'
# Example: $(call kokkos_has_string,"hwloc,librt",hwloc)
# Will return a 1
kokkos_has_string=$(if $(findstring $2,$1),1,0)
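The `kokkos_has_string` helper turns `$(findstring)` into a 0/1 predicate evaluated entirely inside make, replacing the `echo | grep | wc -l` shell pipelines used below it. A rough shell analogue of the same predicate (the `has_string` name is hypothetical, for illustration only) behaves like this:

```shell
# Hypothetical shell analogue of kokkos_has_string: print 1 if the second
# argument is a substring of the first, else 0 -- mirroring
# $(if $(findstring $2,$1),1,0) in GNU make.
has_string() {
  case "$1" in
    *"$2"*) echo 1 ;;
    *)      echo 0 ;;
  esac
}

has_string "hwloc,librt" hwloc   # prints 1
has_string "hwloc,librt" cuda    # prints 0
```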
# Check for general settings.
KOKKOS_INTERNAL_ENABLE_DEBUG := $(strip $(shell echo $(KOKKOS_DEBUG) | grep "yes" | wc -l))
KOKKOS_INTERNAL_ENABLE_CXX11 := $(strip $(shell echo $(KOKKOS_CXX_STANDARD) | grep "c++11" | wc -l))
KOKKOS_INTERNAL_ENABLE_CXX1Z := $(strip $(shell echo $(KOKKOS_CXX_STANDARD) | grep "c++1z" | wc -l))
KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes)
KOKKOS_INTERNAL_ENABLE_CXX11 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++11)
KOKKOS_INTERNAL_ENABLE_CXX1Z := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1z)
# Check for external libraries.
KOKKOS_INTERNAL_USE_HWLOC := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "hwloc" | wc -l))
KOKKOS_INTERNAL_USE_LIBRT := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "librt" | wc -l))
KOKKOS_INTERNAL_USE_MEMKIND := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "experimental_memkind" | wc -l))
KOKKOS_INTERNAL_USE_HWLOC := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),hwloc)
KOKKOS_INTERNAL_USE_LIBRT := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),librt)
KOKKOS_INTERNAL_USE_MEMKIND := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),experimental_memkind)
# Check for advanced settings.
KOKKOS_INTERNAL_ENABLE_COMPILER_WARNINGS := $(strip $(shell echo $(KOKKOS_OPTIONS) | grep "compiler_warnings" | wc -l))
KOKKOS_INTERNAL_OPT_RANGE_AGGRESSIVE_VECTORIZATION := $(strip $(shell echo $(KOKKOS_OPTIONS) | grep "aggressive_vectorization" | wc -l))
KOKKOS_INTERNAL_DISABLE_PROFILING := $(strip $(shell echo $(KOKKOS_OPTIONS) | grep "disable_profiling" | wc -l))
KOKKOS_INTERNAL_DISABLE_DUALVIEW_MODIFY_CHECK := $(strip $(shell echo $(KOKKOS_OPTIONS) | grep "disable_dualview_modify_check" | wc -l))
KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_PRINT := $(strip $(shell echo $(KOKKOS_OPTIONS) | grep "enable_profile_load_print" | wc -l))
KOKKOS_INTERNAL_CUDA_USE_LDG := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "use_ldg" | wc -l))
KOKKOS_INTERNAL_CUDA_USE_UVM := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "force_uvm" | wc -l))
KOKKOS_INTERNAL_CUDA_USE_RELOC := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "rdc" | wc -l))
KOKKOS_INTERNAL_CUDA_USE_LAMBDA := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "enable_lambda" | wc -l))
KOKKOS_INTERNAL_ENABLE_COMPILER_WARNINGS := $(call kokkos_has_string,$(KOKKOS_OPTIONS),compiler_warnings)
KOKKOS_INTERNAL_OPT_RANGE_AGGRESSIVE_VECTORIZATION := $(call kokkos_has_string,$(KOKKOS_OPTIONS),aggressive_vectorization)
KOKKOS_INTERNAL_DISABLE_PROFILING := $(call kokkos_has_string,$(KOKKOS_OPTIONS),disable_profiling)
KOKKOS_INTERNAL_DISABLE_DUALVIEW_MODIFY_CHECK := $(call kokkos_has_string,$(KOKKOS_OPTIONS),disable_dualview_modify_check)
KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_PRINT := $(call kokkos_has_string,$(KOKKOS_OPTIONS),enable_profile_load_print)
KOKKOS_INTERNAL_CUDA_USE_LDG := $(call kokkos_has_string,$(KOKKOS_CUDA_OPTIONS),use_ldg)
KOKKOS_INTERNAL_CUDA_USE_UVM := $(call kokkos_has_string,$(KOKKOS_CUDA_OPTIONS),force_uvm)
KOKKOS_INTERNAL_CUDA_USE_RELOC := $(call kokkos_has_string,$(KOKKOS_CUDA_OPTIONS),rdc)
KOKKOS_INTERNAL_CUDA_USE_LAMBDA := $(call kokkos_has_string,$(KOKKOS_CUDA_OPTIONS),enable_lambda)
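Each of the old option checks above spawned a shell, echoed the variable, and counted grep's matching lines; the `kokkos_has_string` form is a plain substring test with no extra processes. The two styles agree on these comma-separated option strings, as this illustrative comparison shows (sample value only):

```shell
# Compare the old pipeline with a plain substring test on a sample
# KOKKOS_CUDA_OPTIONS value (illustrative only).
opts="force_uvm,enable_lambda"

# Old style: grep prints the matching line, wc -l counts it (1 or 0);
# tr strips the padding some wc implementations emit.
old=$(echo "$opts" | grep "enable_lambda" | wc -l | tr -d ' ')

# New style in spirit: a substring check, no subprocesses needed.
case "$opts" in *enable_lambda*) new=1 ;; *) new=0 ;; esac

echo "$old $new"   # prints "1 1"
```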
# Check for Kokkos Host Execution Spaces one of which must be on.
KOKKOS_INTERNAL_USE_OPENMP := $(strip $(shell echo $(subst OpenMPTarget,,$(KOKKOS_DEVICES)) | grep OpenMP | wc -l))
KOKKOS_INTERNAL_USE_PTHREADS := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Pthread | wc -l))
KOKKOS_INTERNAL_USE_QTHREADS := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Qthreads | wc -l))
KOKKOS_INTERNAL_USE_SERIAL := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Serial | wc -l))
KOKKOS_INTERNAL_USE_OPENMP := $(call kokkos_has_string,$(subst OpenMPTarget,,$(KOKKOS_DEVICES)),OpenMP)
KOKKOS_INTERNAL_USE_PTHREADS := $(call kokkos_has_string,$(KOKKOS_DEVICES),Pthread)
KOKKOS_INTERNAL_USE_QTHREADS := $(call kokkos_has_string,$(KOKKOS_DEVICES),Qthreads)
KOKKOS_INTERNAL_USE_SERIAL := $(call kokkos_has_string,$(KOKKOS_DEVICES),Serial)
ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 0)
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 0)
@@ -65,9 +71,9 @@ ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 0)
endif
# Check for other Execution Spaces.
KOKKOS_INTERNAL_USE_CUDA := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Cuda | wc -l))
KOKKOS_INTERNAL_USE_ROCM := $(strip $(shell echo $(KOKKOS_DEVICES) | grep ROCm | wc -l))
KOKKOS_INTERNAL_USE_OPENMPTARGET := $(strip $(shell echo $(KOKKOS_DEVICES) | grep OpenMPTarget | wc -l))
KOKKOS_INTERNAL_USE_CUDA := $(call kokkos_has_string,$(KOKKOS_DEVICES),Cuda)
KOKKOS_INTERNAL_USE_ROCM := $(call kokkos_has_string,$(KOKKOS_DEVICES),ROCm)
KOKKOS_INTERNAL_USE_OPENMPTARGET := $(call kokkos_has_string,$(KOKKOS_DEVICES),OpenMPTarget)
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
KOKKOS_INTERNAL_NVCC_PATH := $(shell which nvcc)
@@ -77,25 +83,20 @@ endif
# Check OS.
KOKKOS_OS := $(strip $(shell uname -s))
KOKKOS_INTERNAL_OS_CYGWIN := $(strip $(shell uname -s | grep CYGWIN | wc -l))
KOKKOS_INTERNAL_OS_LINUX := $(strip $(shell uname -s | grep Linux | wc -l))
KOKKOS_INTERNAL_OS_DARWIN := $(strip $(shell uname -s | grep Darwin | wc -l))
KOKKOS_INTERNAL_OS_CYGWIN := $(call kokkos_has_string,$(KOKKOS_OS),CYGWIN)
KOKKOS_INTERNAL_OS_LINUX := $(call kokkos_has_string,$(KOKKOS_OS),Linux)
KOKKOS_INTERNAL_OS_DARWIN := $(call kokkos_has_string,$(KOKKOS_OS),Darwin)
# Check compiler.
KOKKOS_INTERNAL_COMPILER_INTEL := $(strip $(shell $(CXX) --version 2>&1 | grep "Intel Corporation" | wc -l))
KOKKOS_INTERNAL_COMPILER_PGI := $(strip $(shell $(CXX) --version 2>&1 | grep PGI | wc -l))
KOKKOS_CXX_VERSION := $(strip $(shell $(CXX) --version 2>&1))
KOKKOS_INTERNAL_COMPILER_INTEL := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),Intel Corporation)
KOKKOS_INTERNAL_COMPILER_PGI := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),PGI)
KOKKOS_INTERNAL_COMPILER_XL := $(strip $(shell $(CXX) -qversion 2>&1 | grep XL | wc -l))
KOKKOS_INTERNAL_COMPILER_CRAY := $(strip $(shell $(CXX) -craype-verbose 2>&1 | grep "CC-" | wc -l))
KOKKOS_INTERNAL_COMPILER_NVCC := $(strip $(shell $(CXX) --version 2>&1 | grep nvcc | wc -l))
ifneq ($(OMPI_CXX),)
KOKKOS_INTERNAL_COMPILER_NVCC := $(strip $(shell $(OMPI_CXX) --version 2>&1 | grep nvcc | wc -l))
endif
ifneq ($(MPICH_CXX),)
KOKKOS_INTERNAL_COMPILER_NVCC := $(strip $(shell $(MPICH_CXX) --version 2>&1 | grep nvcc | wc -l))
endif
KOKKOS_INTERNAL_COMPILER_CLANG := $(strip $(shell $(CXX) --version 2>&1 | grep clang | wc -l))
KOKKOS_INTERNAL_COMPILER_APPLE_CLANG := $(strip $(shell $(CXX) --version 2>&1 | grep "apple-darwin" | wc -l))
KOKKOS_INTERNAL_COMPILER_HCC := $(strip $(shell $(CXX) --version 2>&1 | grep HCC | wc -l))
KOKKOS_INTERNAL_COMPILER_NVCC := $(strip $(shell export OMPI_CXX=$(OMPI_CXX); export MPICH_CXX=$(MPICH_CXX); $(CXX) --version 2>&1 | grep nvcc | wc -l))
KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),clang)
KOKKOS_INTERNAL_COMPILER_APPLE_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),apple-darwin)
KOKKOS_INTERNAL_COMPILER_HCC := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),HCC)
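Caching `$(CXX) --version` in `KOKKOS_CXX_VERSION` means the compiler is queried once, and every vendor check becomes a string test on the saved banner. A sketch of that pattern with a hard-coded banner (the real Makefile uses the live `$(CXX)` output, not this made-up string):

```shell
# Illustrative version banner; the Makefile captures this from
# "$(CXX) --version 2>&1" exactly once and reuses it for every check.
ver="icpc (ICC) 18.0.0, Copyright (C) Intel Corporation"

case "$ver" in *"Intel Corporation"*) intel=1 ;; *) intel=0 ;; esac
case "$ver" in *PGI*)                 pgi=1   ;; *) pgi=0   ;; esac
case "$ver" in *clang*)               clang=1 ;; *) clang=0 ;; esac

echo "$intel $pgi $clang"   # prints "1 0 0"
```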
ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 2)
KOKKOS_INTERNAL_COMPILER_CLANG = 1
@@ -209,47 +210,48 @@ endif
# Check for Kokkos Architecture settings.
# Intel based.
KOKKOS_INTERNAL_USE_ARCH_KNC := $(strip $(shell echo $(KOKKOS_ARCH) | grep KNC | wc -l))
KOKKOS_INTERNAL_USE_ARCH_WSM := $(strip $(shell echo $(KOKKOS_ARCH) | grep WSM | wc -l))
KOKKOS_INTERNAL_USE_ARCH_SNB := $(strip $(shell echo $(KOKKOS_ARCH) | grep SNB | wc -l))
KOKKOS_INTERNAL_USE_ARCH_HSW := $(strip $(shell echo $(KOKKOS_ARCH) | grep HSW | wc -l))
KOKKOS_INTERNAL_USE_ARCH_BDW := $(strip $(shell echo $(KOKKOS_ARCH) | grep BDW | wc -l))
KOKKOS_INTERNAL_USE_ARCH_SKX := $(strip $(shell echo $(KOKKOS_ARCH) | grep SKX | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KNL := $(strip $(shell echo $(KOKKOS_ARCH) | grep KNL | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KNC := $(call kokkos_has_string,$(KOKKOS_ARCH),KNC)
KOKKOS_INTERNAL_USE_ARCH_WSM := $(call kokkos_has_string,$(KOKKOS_ARCH),WSM)
KOKKOS_INTERNAL_USE_ARCH_SNB := $(call kokkos_has_string,$(KOKKOS_ARCH),SNB)
KOKKOS_INTERNAL_USE_ARCH_HSW := $(call kokkos_has_string,$(KOKKOS_ARCH),HSW)
KOKKOS_INTERNAL_USE_ARCH_BDW := $(call kokkos_has_string,$(KOKKOS_ARCH),BDW)
KOKKOS_INTERNAL_USE_ARCH_SKX := $(call kokkos_has_string,$(KOKKOS_ARCH),SKX)
KOKKOS_INTERNAL_USE_ARCH_KNL := $(call kokkos_has_string,$(KOKKOS_ARCH),KNL)
# NVIDIA based.
NVCC_WRAPPER := $(KOKKOS_PATH)/bin/nvcc_wrapper
KOKKOS_INTERNAL_USE_ARCH_KEPLER30 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler30 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KEPLER32 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler32 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler35 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KEPLER37 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler37 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell50 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_MAXWELL52 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell52 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_MAXWELL53 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell53 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_PASCAL61 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Pascal61 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_PASCAL60 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Pascal60 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL61) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL60) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc))
KOKKOS_INTERNAL_USE_ARCH_KEPLER30 := $(call kokkos_has_string,$(KOKKOS_ARCH),Kepler30)
KOKKOS_INTERNAL_USE_ARCH_KEPLER32 := $(call kokkos_has_string,$(KOKKOS_ARCH),Kepler32)
KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(call kokkos_has_string,$(KOKKOS_ARCH),Kepler35)
KOKKOS_INTERNAL_USE_ARCH_KEPLER37 := $(call kokkos_has_string,$(KOKKOS_ARCH),Kepler37)
KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(call kokkos_has_string,$(KOKKOS_ARCH),Maxwell50)
KOKKOS_INTERNAL_USE_ARCH_MAXWELL52 := $(call kokkos_has_string,$(KOKKOS_ARCH),Maxwell52)
KOKKOS_INTERNAL_USE_ARCH_MAXWELL53 := $(call kokkos_has_string,$(KOKKOS_ARCH),Maxwell53)
KOKKOS_INTERNAL_USE_ARCH_PASCAL61 := $(call kokkos_has_string,$(KOKKOS_ARCH),Pascal61)
KOKKOS_INTERNAL_USE_ARCH_PASCAL60 := $(call kokkos_has_string,$(KOKKOS_ARCH),Pascal60)
KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL61) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL60) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53))
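Summing the 0/1 architecture flags now goes through `expr` instead of piping an `echo` into `bc`, which drops the `bc` dependency. One caveat worth knowing: `expr` exits with status 1 when the result is 0, which is harmless inside make's `$(shell ...)` since only stdout is used. Standalone, the arithmetic looks like:

```shell
# Sum illustrative 0/1 flags the way the Makefile totals the NVIDIA
# architecture selections (the values here are made up).
kepler35=1; pascal60=0; maxwell50=0
expr "$kepler35" + "$pascal60" + "$maxwell50"   # prints 1
```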
#SEK: This seems like a bug to me
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_NVIDIA), 0)
KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler | wc -l))
KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL61) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL60) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc))
KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(call kokkos_has_string,$(KOKKOS_ARCH),Maxwell)
KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(call kokkos_has_string,$(KOKKOS_ARCH),Kepler)
KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \
+ $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL61) \
+ $(KOKKOS_INTERNAL_USE_ARCH_PASCAL60) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \
+ $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53))
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_NVIDIA), 1)
@@ -262,43 +264,43 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_NVIDIA), 1)
endif
endif
# ARM based.
KOKKOS_INTERNAL_USE_ARCH_ARMV80 := $(strip $(shell echo $(KOKKOS_ARCH) | grep ARMv80 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_ARMV81 := $(strip $(shell echo $(KOKKOS_ARCH) | grep ARMv81 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX := $(strip $(shell echo $(KOKKOS_ARCH) | grep ARMv8-ThunderX | wc -l))
KOKKOS_INTERNAL_USE_ARCH_ARMV80 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv80)
KOKKOS_INTERNAL_USE_ARCH_ARMV81 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv81)
KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8-ThunderX)
KOKKOS_INTERNAL_USE_ARCH_ARM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_ARMV80)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV81)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX) | bc))
# IBM based.
KOKKOS_INTERNAL_USE_ARCH_BGQ := $(strip $(shell echo $(KOKKOS_ARCH) | grep BGQ | wc -l))
KOKKOS_INTERNAL_USE_ARCH_POWER7 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power7 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_POWER8 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power8 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_POWER9 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power9 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_BGQ := $(call kokkos_has_string,$(KOKKOS_ARCH),BGQ)
KOKKOS_INTERNAL_USE_ARCH_POWER7 := $(call kokkos_has_string,$(KOKKOS_ARCH),Power7)
KOKKOS_INTERNAL_USE_ARCH_POWER8 := $(call kokkos_has_string,$(KOKKOS_ARCH),Power8)
KOKKOS_INTERNAL_USE_ARCH_POWER9 := $(call kokkos_has_string,$(KOKKOS_ARCH),Power9)
KOKKOS_INTERNAL_USE_ARCH_IBM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_BGQ)+$(KOKKOS_INTERNAL_USE_ARCH_POWER7)+$(KOKKOS_INTERNAL_USE_ARCH_POWER8)+$(KOKKOS_INTERNAL_USE_ARCH_POWER9) | bc))
# AMD based.
KOKKOS_INTERNAL_USE_ARCH_AMDAVX := $(strip $(shell echo $(KOKKOS_ARCH) | grep AMDAVX | wc -l))
KOKKOS_INTERNAL_USE_ARCH_RYZEN := $(strip $(shell echo $(KOKKOS_ARCH) | grep Ryzen | wc -l))
KOKKOS_INTERNAL_USE_ARCH_EPYC := $(strip $(shell echo $(KOKKOS_ARCH) | grep Epyc | wc -l))
KOKKOS_INTERNAL_USE_ARCH_KAVERI := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kaveri | wc -l))
KOKKOS_INTERNAL_USE_ARCH_CARRIZO := $(strip $(shell echo $(KOKKOS_ARCH) | grep Carrizo | wc -l))
KOKKOS_INTERNAL_USE_ARCH_FIJI := $(strip $(shell echo $(KOKKOS_ARCH) | grep Fiji | wc -l))
KOKKOS_INTERNAL_USE_ARCH_VEGA := $(strip $(shell echo $(KOKKOS_ARCH) | grep Vega | wc -l))
KOKKOS_INTERNAL_USE_ARCH_GFX901 := $(strip $(shell echo $(KOKKOS_ARCH) | grep gfx901 | wc -l))
KOKKOS_INTERNAL_USE_ARCH_AMDAVX := $(call kokkos_has_string,$(KOKKOS_ARCH),AMDAVX)
KOKKOS_INTERNAL_USE_ARCH_RYZEN := $(call kokkos_has_string,$(KOKKOS_ARCH),Ryzen)
KOKKOS_INTERNAL_USE_ARCH_EPYC := $(call kokkos_has_string,$(KOKKOS_ARCH),Epyc)
KOKKOS_INTERNAL_USE_ARCH_KAVERI := $(call kokkos_has_string,$(KOKKOS_ARCH),Kaveri)
KOKKOS_INTERNAL_USE_ARCH_CARRIZO := $(call kokkos_has_string,$(KOKKOS_ARCH),Carrizo)
KOKKOS_INTERNAL_USE_ARCH_FIJI := $(call kokkos_has_string,$(KOKKOS_ARCH),Fiji)
KOKKOS_INTERNAL_USE_ARCH_VEGA := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega)
KOKKOS_INTERNAL_USE_ARCH_GFX901 := $(call kokkos_has_string,$(KOKKOS_ARCH),gfx901)
# Any AVX?
KOKKOS_INTERNAL_USE_ARCH_SSE42 := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_WSM) | bc ))
KOKKOS_INTERNAL_USE_ARCH_AVX := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_SNB)+$(KOKKOS_INTERNAL_USE_ARCH_AMDAVX) | bc ))
KOKKOS_INTERNAL_USE_ARCH_AVX2 := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_HSW)+$(KOKKOS_INTERNAL_USE_ARCH_BDW) | bc ))
KOKKOS_INTERNAL_USE_ARCH_AVX512MIC := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KNL) | bc ))
KOKKOS_INTERNAL_USE_ARCH_AVX512XEON := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_SKX) | bc ))
KOKKOS_INTERNAL_USE_ARCH_SSE42 := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_WSM))
KOKKOS_INTERNAL_USE_ARCH_AVX := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_SNB) + $(KOKKOS_INTERNAL_USE_ARCH_AMDAVX))
KOKKOS_INTERNAL_USE_ARCH_AVX2 := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_HSW) + $(KOKKOS_INTERNAL_USE_ARCH_BDW))
KOKKOS_INTERNAL_USE_ARCH_AVX512MIC := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_KNL))
KOKKOS_INTERNAL_USE_ARCH_AVX512XEON := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_SKX))
# Decide what ISA level we are able to support.
KOKKOS_INTERNAL_USE_ISA_X86_64 := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_WSM)+$(KOKKOS_INTERNAL_USE_ARCH_SNB)+$(KOKKOS_INTERNAL_USE_ARCH_HSW)+$(KOKKOS_INTERNAL_USE_ARCH_BDW)+$(KOKKOS_INTERNAL_USE_ARCH_KNL)+$(KOKKOS_INTERNAL_USE_ARCH_SKX) | bc ))
KOKKOS_INTERNAL_USE_ISA_KNC := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KNC) | bc ))
KOKKOS_INTERNAL_USE_ISA_POWERPCLE := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_POWER8)+$(KOKKOS_INTERNAL_USE_ARCH_POWER9) | bc ))
KOKKOS_INTERNAL_USE_ISA_POWERPCBE := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_POWER7) | bc ))
KOKKOS_INTERNAL_USE_ISA_X86_64 := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_WSM) + $(KOKKOS_INTERNAL_USE_ARCH_SNB) + $(KOKKOS_INTERNAL_USE_ARCH_HSW) + $(KOKKOS_INTERNAL_USE_ARCH_BDW) + $(KOKKOS_INTERNAL_USE_ARCH_KNL) + $(KOKKOS_INTERNAL_USE_ARCH_SKX))
KOKKOS_INTERNAL_USE_ISA_KNC := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_KNC))
KOKKOS_INTERNAL_USE_ISA_POWERPCLE := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_POWER8) + $(KOKKOS_INTERNAL_USE_ARCH_POWER9))
KOKKOS_INTERNAL_USE_ISA_POWERPCBE := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_POWER7))
# Decide whether we can support transactional memory
KOKKOS_INTERNAL_USE_TM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_BDW)+$(KOKKOS_INTERNAL_USE_ARCH_SKX) | bc ))
KOKKOS_INTERNAL_USE_TM := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_BDW) + $(KOKKOS_INTERNAL_USE_ARCH_SKX))
# Incompatible flags?
KOKKOS_INTERNAL_USE_ARCH_MULTIHOST := $(strip $(shell echo "$(KOKKOS_INTERNAL_USE_ARCH_SSE42)+$(KOKKOS_INTERNAL_USE_ARCH_AVX)+$(KOKKOS_INTERNAL_USE_ARCH_AVX2)+$(KOKKOS_INTERNAL_USE_ARCH_AVX512MIC)+$(KOKKOS_INTERNAL_USE_ARCH_AVX512XEON)+$(KOKKOS_INTERNAL_USE_ARCH_KNC)+$(KOKKOS_INTERNAL_USE_ARCH_IBM)+$(KOKKOS_INTERNAL_USE_ARCH_ARM)>1" | bc ))
@@ -320,94 +322,100 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_COMPILER_WARNINGS), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_COMPILER_WARNINGS)
endif
KOKKOS_LIBS = -lkokkos -ldl
KOKKOS_LIBS = -ldl
KOKKOS_LDFLAGS = -L$(shell pwd)
KOKKOS_SRC =
KOKKOS_HEADERS =
# Generating the KokkosCore_config.h file.
KOKKOS_INTERNAL_CONFIG_TMP=KokkosCore_config.tmp
KOKKOS_CONFIG_HEADER=KokkosCore_config.h
# Functions for generating config header file
kokkos_append_header = $(shell echo $1 >> $(KOKKOS_INTERNAL_CONFIG_TMP))
# Do not append first line
tmp := $(shell echo "/* ---------------------------------------------" > KokkosCore_config.tmp)
tmp := $(shell echo "Makefile constructed configuration:" >> KokkosCore_config.tmp)
tmp := $(shell date >> KokkosCore_config.tmp)
tmp := $(shell echo "----------------------------------------------*/" >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,"Makefile constructed configuration:")
tmp := $(call kokkos_append_header,"$(shell date)")
tmp := $(call kokkos_append_header,"----------------------------------------------*/")
tmp := $(shell echo '\#if !defined(KOKKOS_MACROS_HPP) || defined(KOKKOS_CORE_CONFIG_H)' >> KokkosCore_config.tmp)
tmp := $(shell echo '\#error "Do not include KokkosCore_config.h directly; include Kokkos_Macros.hpp instead."' >> KokkosCore_config.tmp)
tmp := $(shell echo '\#else' >> KokkosCore_config.tmp)
tmp := $(shell echo '\#define KOKKOS_CORE_CONFIG_H' >> KokkosCore_config.tmp)
tmp := $(shell echo '\#endif' >> KokkosCore_config.tmp)
tmp := $(shell echo "/* Execution Spaces */" >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,'\#if !defined(KOKKOS_MACROS_HPP) || defined(KOKKOS_CORE_CONFIG_H)')
tmp := $(call kokkos_append_header,'\#error "Do not include $(KOKKOS_CONFIG_HEADER) directly; include Kokkos_Macros.hpp instead."')
tmp := $(call kokkos_append_header,'\#else')
tmp := $(call kokkos_append_header,'\#define KOKKOS_CORE_CONFIG_H')
tmp := $(call kokkos_append_header,'\#endif')
tmp := $(call kokkos_append_header,"/* Execution Spaces */")
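All of the `kokkos_append_header` calls above funnel through a single shell append into the temp file. A minimal standalone sketch of the pattern (the file names match the Makefile; the `append` helper is an illustrative shell stand-in for the make-level function):

```shell
TMP=KokkosCore_config.tmp
# Shell stand-in for the make-level kokkos_append_header function:
append() { echo "$1" >> "$TMP"; }

# The first line overwrites so stale content from a prior configure is dropped.
echo "/* Makefile constructed configuration: */" > "$TMP"
append "#if !defined(KOKKOS_MACROS_HPP) || defined(KOKKOS_CORE_CONFIG_H)"
append "#error \"Do not include KokkosCore_config.h directly; include Kokkos_Macros.hpp instead.\""
append "#else"
append "#define KOKKOS_CORE_CONFIG_H"
append "#endif"
append "#define KOKKOS_HAVE_OPENMP"
cat "$TMP"
```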
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
tmp := $(shell echo "\#define KOKKOS_HAVE_CUDA 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_CUDA")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
tmp := $(shell echo '\#define KOKKOS_ENABLE_ROCM 1' >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,'\#define KOKKOS_ENABLE_ROCM')
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMPTARGET), 1)
tmp := $(shell echo '\#define KOKKOS_ENABLE_OPENMPTARGET 1' >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,'\#define KOKKOS_ENABLE_OPENMPTARGET')
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
tmp := $(shell echo '\#define KOKKOS_HAVE_OPENMP 1' >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,'\#define KOKKOS_HAVE_OPENMP')
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
tmp := $(shell echo "\#define KOKKOS_HAVE_PTHREAD 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_PTHREAD")
endif
ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
tmp := $(shell echo "\#define KOKKOS_HAVE_QTHREADS 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_QTHREADS")
endif
ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
tmp := $(shell echo "\#define KOKKOS_HAVE_SERIAL 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_SERIAL")
endif
ifeq ($(KOKKOS_INTERNAL_USE_TM), 1)
tmp := $(shell echo "\#ifndef __CUDA_ARCH__" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ENABLE_TM" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#endif" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#ifndef __CUDA_ARCH__")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_TM")
tmp := $(call kokkos_append_header,"\#endif")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ISA_X86_64), 1)
tmp := $(shell echo "\#ifndef __CUDA_ARCH__" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_USE_ISA_X86_64" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#endif" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#ifndef __CUDA_ARCH__")
tmp := $(call kokkos_append_header,"\#define KOKKOS_USE_ISA_X86_64")
tmp := $(call kokkos_append_header,"\#endif")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ISA_KNC), 1)
tmp := $(shell echo "\#ifndef __CUDA_ARCH__" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_USE_ISA_KNC" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#endif" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#ifndef __CUDA_ARCH__")
tmp := $(call kokkos_append_header,"\#define KOKKOS_USE_ISA_KNC")
tmp := $(call kokkos_append_header,"\#endif")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ISA_POWERPCLE), 1)
tmp := $(shell echo "\#ifndef __CUDA_ARCH__" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_USE_ISA_POWERPCLE" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#endif" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#ifndef __CUDA_ARCH__")
tmp := $(call kokkos_append_header,"\#define KOKKOS_USE_ISA_POWERPCLE")
tmp := $(call kokkos_append_header,"\#endif")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ISA_POWERPCBE), 1)
tmp := $(shell echo "\#ifndef __CUDA_ARCH__" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_USE_ISA_POWERPCBE" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#endif" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#ifndef __CUDA_ARCH__")
tmp := $(call kokkos_append_header,"\#define KOKKOS_USE_ISA_POWERPCBE")
tmp := $(call kokkos_append_header,"\#endif")
endif
tmp := $(shell echo "/* General Settings */" >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,"/* General Settings */")
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
tmp := $(shell echo "\#define KOKKOS_HAVE_CXX11 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_CXX11")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Z), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Z_FLAG)
tmp := $(shell echo "\#define KOKKOS_HAVE_CXX11 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_HAVE_CXX1Z 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_CXX11")
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_CXX1Z")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
@@ -417,26 +425,26 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
KOKKOS_CXXFLAGS += -g
KOKKOS_LDFLAGS += -g -ldl
tmp := $(shell echo "\#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_HAVE_DEBUG 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK")
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_DEBUG")
ifeq ($(KOKKOS_INTERNAL_DISABLE_DUALVIEW_MODIFY_CHECK), 0)
tmp := $(shell echo "\#define KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK")
endif
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_PRINT), 1)
tmp := $(shell echo "\#define KOKKOS_ENABLE_PROFILING_LOAD_PRINT 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_PROFILING_LOAD_PRINT")
endif
ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1)
KOKKOS_CPPFLAGS += -I$(HWLOC_PATH)/include
KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib
KOKKOS_LIBS += -lhwloc
tmp := $(shell echo "\#define KOKKOS_HAVE_HWLOC 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_HWLOC")
endif
ifeq ($(KOKKOS_INTERNAL_USE_LIBRT), 1)
tmp := $(shell echo "\#define KOKKOS_USE_LIBRT 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_USE_LIBRT")
KOKKOS_LIBS += -lrt
endif
@@ -444,36 +452,36 @@ ifeq ($(KOKKOS_INTERNAL_USE_MEMKIND), 1)
KOKKOS_CPPFLAGS += -I$(MEMKIND_PATH)/include
KOKKOS_LDFLAGS += -L$(MEMKIND_PATH)/lib
KOKKOS_LIBS += -lmemkind -lnuma
tmp := $(shell echo "\#define KOKKOS_HAVE_HBWSPACE 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_HAVE_HBWSPACE")
endif
ifeq ($(KOKKOS_INTERNAL_DISABLE_PROFILING), 0)
tmp := $(shell echo "\#define KOKKOS_ENABLE_PROFILING" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_PROFILING")
endif
tmp := $(shell echo "/* Optimization Settings */" >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,"/* Optimization Settings */")
ifeq ($(KOKKOS_INTERNAL_OPT_RANGE_AGGRESSIVE_VECTORIZATION), 1)
tmp := $(shell echo "\#define KOKKOS_OPT_RANGE_AGGRESSIVE_VECTORIZATION 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_OPT_RANGE_AGGRESSIVE_VECTORIZATION")
endif
tmp := $(shell echo "/* Cuda Settings */" >> KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,"/* Cuda Settings */")
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
ifeq ($(KOKKOS_INTERNAL_CUDA_USE_LDG), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LDG_INTRINSIC 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_LDG_INTRINSIC")
else
ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LDG_INTRINSIC 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_LDG_INTRINSIC")
endif
endif
ifeq ($(KOKKOS_INTERNAL_CUDA_USE_UVM), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_UVM 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_UVM")
endif
ifeq ($(KOKKOS_INTERNAL_CUDA_USE_RELOC), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE")
KOKKOS_CXXFLAGS += --relocatable-device-code=true
KOKKOS_LDFLAGS += --relocatable-device-code=true
endif
@@ -481,7 +489,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
ifeq ($(KOKKOS_INTERNAL_CUDA_USE_LAMBDA), 1)
ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC), 1)
ifeq ($(shell test $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION) -gt 70; echo $$?),0)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LAMBDA 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_LAMBDA")
KOKKOS_CXXFLAGS += -expt-extended-lambda
else
$(warning Warning: Cuda Lambda support was requested but NVCC version is too low. This requires NVCC for Cuda version 7.5 or higher. Disabling Lambda support now.)
@@ -489,19 +497,19 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
endif
ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LAMBDA 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_USE_LAMBDA")
endif
endif
ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 1)
tmp := $(shell echo "\#define KOKKOS_CUDA_CLANG_WORKAROUND" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_CUDA_CLANG_WORKAROUND")
endif
endif
# Add Architecture flags.
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV80), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ARMV80 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ARMV80")
ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
KOKKOS_CXXFLAGS +=
@@ -518,7 +526,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV80), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV81), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ARMV81 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ARMV81")
ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
KOKKOS_CXXFLAGS +=
@@ -535,8 +543,8 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV81), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ARMV80 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_ARMV8_THUNDERX 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ARMV80")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ARMV8_THUNDERX")
ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
KOKKOS_CXXFLAGS +=
@@ -553,7 +561,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_SSE42), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_SSE42 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_SSE42")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -xSSE4.2
@@ -575,7 +583,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_SSE42), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_AVX 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_AVX")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -mavx
@@ -597,7 +605,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER7), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_POWER7 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_POWER7")
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
@@ -609,7 +617,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER7), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER8), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_POWER8 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_POWER8")
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
@@ -630,7 +638,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER8), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER9), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_POWER9 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_POWER9")
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
@@ -651,7 +659,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_POWER9), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_HSW), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_AVX2 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_AVX2")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -xCORE-AVX2
@@ -673,7 +681,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_HSW), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_BDW), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_AVX2 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_AVX2")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -xCORE-AVX2
@@ -695,7 +703,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_BDW), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX512MIC), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_AVX512MIC 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_AVX512MIC")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -xMIC-AVX512
@@ -716,7 +724,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX512MIC), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX512XEON), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_AVX512XEON 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_AVX512XEON")
ifeq ($(KOKKOS_INTERNAL_COMPILER_INTEL), 1)
KOKKOS_CXXFLAGS += -xCORE-AVX512
@@ -737,7 +745,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX512XEON), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KNC), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_KNC 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KNC")
KOKKOS_CXXFLAGS += -mmic
KOKKOS_LDFLAGS += -mmic
endif
@@ -753,48 +761,48 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER30), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER30 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER30")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_30
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER32), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER32 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER32")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_32
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER35), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER35 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER35")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_35
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER37), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_KEPLER37 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KEPLER37")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_37
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL50 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL50")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_50
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL52 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL52")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_52
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_MAXWELL53 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_MAXWELL53")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_53
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_PASCAL60), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_PASCAL 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_PASCAL60 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_PASCAL")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_PASCAL60")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_60
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_PASCAL61), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_PASCAL 1" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_PASCAL61 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_PASCAL")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_PASCAL61")
KOKKOS_INTERNAL_CUDA_ARCH_FLAG := $(KOKKOS_INTERNAL_CUDA_ARCH_FLAG)=sm_61
endif
@@ -811,28 +819,28 @@ endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
# Let's start with adding architecture defines
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KAVERI), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ROCM 701" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_KAVERI 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ROCM 701")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_KAVERI")
KOKKOS_INTERNAL_ROCM_ARCH_FLAG := --amdgpu-target=gfx701
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_CARRIZO), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ROCM 801" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_CARRIZO 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ROCM 801")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_CARRIZO")
KOKKOS_INTERNAL_ROCM_ARCH_FLAG := --amdgpu-target=gfx801
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_FIJI), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ROCM 803" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_FIJI 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ROCM 803")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_FIJI")
KOKKOS_INTERNAL_ROCM_ARCH_FLAG := --amdgpu-target=gfx803
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_VEGA), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ROCM 900" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_VEGA 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ROCM 900")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_VEGA")
KOKKOS_INTERNAL_ROCM_ARCH_FLAG := --amdgpu-target=gfx900
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_GFX901), 1)
tmp := $(shell echo "\#define KOKKOS_ARCH_ROCM 901" >> KokkosCore_config.tmp )
tmp := $(shell echo "\#define KOKKOS_ARCH_GFX901 1" >> KokkosCore_config.tmp )
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_ROCM 901")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ARCH_GFX901")
KOKKOS_INTERNAL_ROCM_ARCH_FLAG := --amdgpu-target=gfx901
endif
@@ -952,6 +960,10 @@ ifeq ($(KOKKOS_INTERNAL_OS_CYGWIN), 1)
KOKKOS_CXXFLAGS += -U__STRICT_ANSI__
endif
# Set KokkosExtraLibs and add -lkokkos to link line
KOKKOS_EXTRA_LIBS := ${KOKKOS_LIBS}
KOKKOS_LIBS := -lkokkos ${KOKKOS_LIBS}
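The split above lets a build system consume the third-party libraries separately from `-lkokkos` while keeping the combined variable backward compatible. A hypothetical link line showing the resulting ordering (the object file, path, and compiler invocation are placeholders, not part of the Makefile):

```shell
KOKKOS_EXTRA_LIBS="-ldl"
KOKKOS_LIBS="-lkokkos ${KOKKOS_EXTRA_LIBS}"
# -lkokkos precedes the libraries it depends on, as static link order requires.
echo "g++ app.o -L. ${KOKKOS_LIBS} -o app"
```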
# Setting up dependencies.
KokkosCore_config.h:
@@ -22,8 +22,8 @@ Kokkos_HostThreadTeam.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokk
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_HostThreadTeam.cpp
Kokkos_Spinwait.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Spinwait.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Spinwait.cpp
Kokkos_Rendezvous.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Rendezvous.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Rendezvous.cpp
Kokkos_HostBarrier.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_HostBarrier.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_HostBarrier.cpp
Kokkos_Profiling_Interface.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp
Kokkos_SharedAlloc.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_SharedAlloc.cpp
@@ -41,48 +41,44 @@ hcedwar(at)sandia.gov and crtrott(at)sandia.gov
============================================================================
Primary tested compilers on X86 are:
GCC 4.7.2
GCC 4.8.4
GCC 4.9.2
GCC 4.9.3
GCC 5.1.0
GCC 5.2.0
Intel 14.0.4
GCC 5.3.0
GCC 6.1.0
Intel 15.0.2
Intel 16.0.1
Intel 17.0.098
Intel 17.1.132
Intel 17.1.043
Intel 17.4.196
Intel 18.0.128
Clang 3.5.2
Clang 3.6.1
Clang 3.7.1
Clang 3.8.1
Clang 3.9.0
PGI 17.1
Clang 4.0.0
Clang 4.0.0 for CUDA (CUDA Toolkit 8.0.44)
PGI 17.10
NVCC 7.0 for CUDA (with gcc 4.8.4)
NVCC 7.5 for CUDA (with gcc 4.8.4)
NVCC 8.0.44 for CUDA (with gcc 5.3.0)
Primary tested compilers on Power 8 are:
GCC 5.4.0 (OpenMP,Serial)
IBM XL 13.1.3 (OpenMP, Serial) (There is a workaround in place to avoid a compiler bug)
IBM XL 13.1.5 (OpenMP, Serial) (There is a workaround in place to avoid a compiler bug)
NVCC 8.0.44 for CUDA (with gcc 5.4.0)
NVCC 9.0.103 for CUDA (with gcc 6.3.0)
Primary tested compilers on Intel KNL are:
GCC 6.2.0
Intel 16.2.181 (with gcc 4.7.2)
Intel 17.0.098 (with gcc 4.7.2)
Intel 17.1.132 (with gcc 4.9.3)
Intel 16.4.258 (with gcc 4.7.2)
Intel 17.2.174 (with gcc 4.9.3)
Intel 18.0.061 (beta) (with gcc 4.9.3)
Secondary tested compilers are:
CUDA 7.0 (with gcc 4.8.4)
CUDA 7.5 (with gcc 4.8.4)
CUDA 8.0 (with gcc 5.3.0 on X86 and gcc 5.4.0 on Power8)
CUDA/Clang 8.0 using Clang/Trunk compiler
Intel 18.0.128 (with gcc 4.9.3)
Other compilers working:
X86:
Cygwin 2.1.0 64bit with gcc 4.9.3
Limited testing of the following compilers on POWER7+ systems:
GCC 4.8.5 (on RHEL7.1 POWER7+)
Known non-working combinations:
Power8:
Pthreads backend
@@ -96,8 +92,8 @@ GCC: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits
-Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized
Intel: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized
Clang: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized
NVCC: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized
Secondary compilers are passing without -Werror.
Other compilers are tested occasionally, in particular when pushing from develop to
master branch, without -Werror and only for a select set of backends.
@@ -2,7 +2,9 @@
TRIBITS_SUBPACKAGE(Algorithms)
ADD_SUBDIRECTORY(src)
IF(KOKKOS_HAS_TRILINOS)
ADD_SUBDIRECTORY(src)
ENDIF()
TRIBITS_ADD_TEST_DIRECTORIES(unit_tests)
#TRIBITS_ADD_TEST_DIRECTORIES(performance_tests)
@@ -3,6 +3,32 @@ INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src )
IF(NOT KOKKOS_HAS_TRILINOS)
IF(KOKKOS_SEPARATE_LIBS)
set(TEST_LINK_TARGETS kokkoscore)
ELSE()
set(TEST_LINK_TARGETS kokkos)
ENDIF()
ENDIF()
SET(GTEST_SOURCE_DIR ${${PARENT_PACKAGE_NAME}_SOURCE_DIR}/tpls/gtest)
INCLUDE_DIRECTORIES(${GTEST_SOURCE_DIR})
# mfh 03 Nov 2017: The gtest library used here must have a different
# name than that of the gtest library built in KokkosCore. We can't
# just refer to the library in KokkosCore's tests, because it's
# possible to build only (e.g.,) KokkosAlgorithms tests, without
# building KokkosCore tests.
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DGTEST_HAS_PTHREAD=0")
TRIBITS_ADD_LIBRARY(
kokkosalgorithms_gtest
HEADERS ${GTEST_SOURCE_DIR}/gtest/gtest.h
SOURCES ${GTEST_SOURCE_DIR}/gtest/gtest-all.cc
TESTONLY
)
SET(SOURCES
UnitTestMain.cpp
TestCuda.cpp
@@ -34,5 +60,5 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
COMM serial mpi
NUM_MPI_PROCS 1
FAIL_REGULAR_EXPRESSION " FAILED "
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkosalgorithms_gtest ${TEST_LINK_TARGETS}
)
@@ -15,7 +15,8 @@ endif
CXXFLAGS = -O3
LINK ?= $(CXX)
LDFLAGS ?= -lpthread
LDFLAGS ?=
override LDFLAGS += -lpthread
include $(KOKKOS_PATH)/Makefile.kokkos
@@ -211,12 +211,15 @@ void test_dynamic_view_sort(unsigned int n )
const size_t upper_bound = 2 * n ;
const size_t total_alloc_size = n * sizeof(KeyType) * 1.2 ;
const size_t superblock_size = std::min(total_alloc_size, size_t(1000000));
typename KeyDynamicViewType::memory_pool
pool( memory_space()
, n * sizeof(KeyType) * 1.2
, 500 /* min block size in bytes */
, 30000 /* max block size in bytes */
, 1000000 /* min superblock size in bytes */
, superblock_size
);
KeyDynamicViewType keys("Keys",pool,upper_bound);
@@ -271,8 +274,10 @@ void test_sort(unsigned int N)
{
test_1D_sort<ExecutionSpace,KeyType>(N*N*N, true);
test_1D_sort<ExecutionSpace,KeyType>(N*N*N, false);
#if !defined(KOKKOS_ENABLE_ROCM)
test_3D_sort<ExecutionSpace,KeyType>(N);
test_dynamic_view_sort<ExecutionSpace,KeyType>(N*N);
#endif
}
}
@@ -0,0 +1,44 @@
KOKKOS_PATH = ${HOME}/kokkos
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "SNB"
EXE_NAME = "test"
SRC = $(wildcard *.cpp)
default: build
echo "Start Build"
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${KOKKOS_PATH}/config/nvcc_wrapper
EXE = ${EXE_NAME}.cuda
KOKKOS_CUDA_OPTIONS = "enable_lambda"
else
CXX = g++
EXE = ${EXE_NAME}.host
endif
CXXFLAGS = -O3
LINK = ${CXX}
LINKFLAGS = -O3
DEPFLAGS = -M
OBJ = $(SRC:.cpp=.o)
LIB =
include $(KOKKOS_PATH)/Makefile.kokkos
build: $(EXE)
$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean
rm -f *.o *.cuda *.host
# Compilation rules
%.o:%.cpp $(KOKKOS_CPP_DEPENDS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
@@ -0,0 +1,124 @@
#include<Kokkos_Core.hpp>
#include<impl/Kokkos_Timer.hpp>
#include<Kokkos_Random.hpp>
template<class Scalar>
double test_atomic(int L, int N, int M, int K, int R, Kokkos::View<const int**> offsets) {
Kokkos::View<Scalar*> output("Output",N);
Kokkos::Impl::Timer timer;
for(int r = 0; r<R; r++)
Kokkos::parallel_for(L, KOKKOS_LAMBDA (const int&i) {
Scalar s = 2;
for(int m=0;m<M;m++) {
for(int k=0;k<K;k++)
s=s*s+s;
const int idx = (i+offsets(i,m))%N;
Kokkos::atomic_add(&output(idx),s);
}
});
Kokkos::fence();
double time = timer.seconds();
return time;
}
template<class Scalar>
double test_no_atomic(int L, int N, int M, int K, int R, Kokkos::View<const int**> offsets) {
Kokkos::View<Scalar*> output("Output",N);
Kokkos::Impl::Timer timer;
for(int r = 0; r<R; r++)
Kokkos::parallel_for(L, KOKKOS_LAMBDA (const int&i) {
Scalar s = 2;
for(int m=0;m<M;m++) {
for(int k=0;k<K;k++)
s=s*s+s;
const int idx = (i+offsets(i,m))%N;
output(idx) += s;
}
});
Kokkos::fence();
double time = timer.seconds();
return time;
}
int main(int argc, char* argv[]) {
Kokkos::initialize(argc,argv);
{
if(argc<8) {
printf("Arguments: L N M D K R T\n");
printf(" L: Number of iterations to run\n");
printf(" N: Length of array to do atomics into\n");
printf(" M: Number of atomics per iteration to do\n");
printf(" D: Distance from index i to do atomics into (randomly)\n");
printf(" K: Number of FMAD per atomic\n");
printf(" R: Number of repeats of the experiments\n");
printf(" T: Type of atomic\n");
printf(" 1 - int\n");
printf(" 2 - long\n");
printf(" 3 - float\n");
printf(" 4 - double\n");
printf(" 5 - complex<double>\n");
printf("Example Input GPU:\n");
printf(" Histogram : 1000000 1000 1 1000 1 10 1\n");
printf(" MD Force : 100000 100000 100 1000 20 10 4\n");
printf(" Matrix Assembly : 100000 1000000 50 1000 20 10 4\n");
Kokkos::finalize();
return 0;
}
int L = atoi(argv[1]);
int N = atoi(argv[2]);
int M = atoi(argv[3]);
int D = atoi(argv[4]);
int K = atoi(argv[5]);
int R = atoi(argv[6]);
int type = atoi(argv[7]);
Kokkos::View<int**> offsets("Offsets",L,M);
Kokkos::Random_XorShift64_Pool<> pool(12371);
Kokkos::fill_random(offsets,pool,D);
double time = 0;
if(type==1)
time = test_atomic<int>(L,N,M,K,R,offsets);
if(type==2)
time = test_atomic<long>(L,N,M,K,R,offsets);
if(type==3)
time = test_atomic<float>(L,N,M,K,R,offsets);
if(type==4)
time = test_atomic<double>(L,N,M,K,R,offsets);
if(type==5)
time = test_atomic<Kokkos::complex<double> >(L,N,M,K,R,offsets);
double time2 = 1;
if(type==1)
time2 = test_no_atomic<int>(L,N,M,K,R,offsets);
if(type==2)
time2 = test_no_atomic<long>(L,N,M,K,R,offsets);
if(type==3)
time2 = test_no_atomic<float>(L,N,M,K,R,offsets);
if(type==4)
time2 = test_no_atomic<double>(L,N,M,K,R,offsets);
if(type==5)
time2 = test_no_atomic<Kokkos::complex<double> >(L,N,M,K,R,offsets);
int size = 0;
if(type==1) size = sizeof(int);
if(type==2) size = sizeof(long);
if(type==3) size = sizeof(float);
if(type==4) size = sizeof(double);
if(type==5) size = sizeof(Kokkos::complex<double>);
printf("%i\n",size);
printf("Time: %s %i %i %i %i %i %i (t_atomic: %e t_nonatomic: %e ratio: %lf )( GUpdates/s: %lf GB/s: %lf )\n",
(type==1)?"int": (
(type==2)?"long": (
(type==3)?"float": (
(type==4)?"double":"complex"))),
L,N,M,D,K,R,time,time2,time/time2,
1.e-9*L*R*M/time, 1.0*L*R*M*2*size/time/1024/1024/1024);
}
Kokkos::finalize();
}
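The final printf above derives two throughput figures: GUpdates/s scales the L*R*M atomic updates by 1e-9, and GB/s assumes one read plus one write of `size` bytes per update. A standalone sketch of the same arithmetic, using made-up sample values (the element size of 8 corresponds to the `double` case):

```shell
# Recompute the benchmark's reported metrics from sample inputs.
# L iterations, R repeats, M atomics per iteration; SIZE and TIME are
# assumed values, not measured output.
L=100000; R=10; M=100; SIZE=8; TIME=2.0
GUPDATES=$(awk -v l=$L -v r=$R -v m=$M -v t=$TIME \
  'BEGIN{printf "%.3f", 1e-9*l*r*m/t}')
GBS=$(awk -v l=$L -v r=$R -v m=$M -v s=$SIZE -v t=$TIME \
  'BEGIN{printf "%.3f", l*r*m*2*s/t/1024/1024/1024}')
echo "GUpdates/s: ${GUPDATES} GB/s: ${GBS}"
```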


@ -0,0 +1,84 @@
#!/bin/bash
# ---- Default Settings -----
# Paths
KOKKOS_PATH=${PWD}/kokkos
KOKKOS_KERNELS_PATH=${PWD}/kokkos-kernels
MINIMD_PATH=${PWD}/miniMD/kokkos
MINIFE_PATH=${PWD}/miniFE/kokkos
# Kokkos Configure Options
KOKKOS_DEVICES=OpenMP
KOKKOS_ARCH=SNB
# Compiler Options
CXX=mpicxx
OPT_FLAG="-O3"
while [[ $# -gt 0 ]]
do
key="$1"
case $key in
--kokkos-path*)
KOKKOS_PATH="${key#*=}"
;;
--kokkos-kernels-path*)
KOKKOS_KERNELS_PATH="${key#*=}"
;;
--minimd-path*)
MINIMD_PATH="${key#*=}"
;;
--minife-path*)
MINIFE_PATH="${key#*=}"
;;
--device-list*)
KOKKOS_DEVICES="${key#*=}"
;;
--arch*)
KOKKOS_ARCH="${key#*=}"
;;
--opt-flag*)
OPT_FLAG="${key#*=}"
;;
--compiler*)
CXX="${key#*=}"
;;
--with-cuda-options*)
KOKKOS_CUDA_OPTIONS="--with-cuda-options=${key#*=}"
;;
--help*)
PRINT_HELP=True
;;
*)
# args, just append
ARGS="$ARGS $1"
;;
esac
shift
done
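The loop above relies on the bash `${key#*=}` expansion, which strips everything up to and including the first `=` in a `--flag=value` argument. A minimal standalone sketch of the idiom (flag names and defaults here are illustrative):

```shell
# Parse --flag=value arguments with ${key#*=}, falling back to defaults.
parse_demo() {
  local compiler="g++" arch="SNB"   # sample defaults, as in the script above
  local key
  for key in "$@"; do
    case $key in
      --compiler=*) compiler="${key#*=}" ;;
      --arch=*)     arch="${key#*=}" ;;
    esac
  done
  echo "${compiler} ${arch}"
}
parse_demo --compiler=mpicxx --arch=Kepler35
```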
mkdir -p build
# Build BytesAndFlops
mkdir -p build/bytes_and_flops
cd build/bytes_and_flops
make KOKKOS_ARCH=${KOKKOS_ARCH} KOKKOS_DEVICES=${KOKKOS_DEVICES} CXX=${CXX} KOKKOS_PATH=${KOKKOS_PATH} \
CXXFLAGS=${OPT_FLAG} -f ${KOKKOS_PATH}/benchmarks/bytes_and_flops/Makefile -j 16
cd ../..
mkdir -p build/miniMD
cd build/miniMD
make KOKKOS_ARCH=${KOKKOS_ARCH} KOKKOS_DEVICES=${KOKKOS_DEVICES} CXX=${CXX} KOKKOS_PATH=${KOKKOS_PATH} \
CXXFLAGS=${OPT_FLAG} -f ${MINIMD_PATH}/Makefile -j 16
cd ../../
mkdir -p build/miniFE
cd build/miniFE
make KOKKOS_ARCH=${KOKKOS_ARCH} KOKKOS_DEVICES=${KOKKOS_DEVICES} CXX=${CXX} KOKKOS_PATH=${KOKKOS_PATH} \
CXXFLAGS=${OPT_FLAG} -f ${MINIFE_PATH}/src/Makefile -j 16
cd ../../


@ -0,0 +1,37 @@
#!/bin/bash
# Kokkos
if [ ! -d "kokkos" ]; then
git clone https://github.com/kokkos/kokkos
fi
cd kokkos
git checkout develop
git pull
cd ..
# KokkosKernels
if [ ! -d "kokkos-kernels" ]; then
git clone https://github.com/kokkos/kokkos-kernels
fi
cd kokkos-kernels
git pull
cd ..
# MiniMD
if [ ! -d "miniMD" ]; then
git clone https://github.com/mantevo/miniMD
fi
cd miniMD
git pull
cd ..
# MiniFE
if [ ! -d "miniFE" ]; then
git clone https://github.com/mantevo/miniFE
fi
cd miniFE
git pull
cd ..


@ -0,0 +1,14 @@
#!/bin/bash
SCRIPT_PATH=$1
KOKKOS_DEVICES=$2
KOKKOS_ARCH=$3
COMPILER=$4
if [[ $# -lt 4 ]]; then
echo "Usage: ./run_benchmark.bash PATH_TO_SCRIPTS KOKKOS_DEVICES KOKKOS_ARCH COMPILER"
else
${SCRIPT_PATH}/checkout_repos.bash
${SCRIPT_PATH}/build_code.bash --arch=${KOKKOS_ARCH} --device-list=${KOKKOS_DEVICES} --compiler=${COMPILER}
${SCRIPT_PATH}/run_tests.bash
fi


@ -0,0 +1,44 @@
#!/bin/bash
# BytesAndFlops
cd build/bytes_and_flops
USE_CUDA=`grep "_CUDA 1" KokkosCore_config.h | wc -l`
if [[ ${USE_CUDA} -gt 0 ]]; then
BAF_EXE=bytes_and_flops.cuda
TEAM_SIZE=256
else
BAF_EXE=bytes_and_flops.host
TEAM_SIZE=1
fi
BAF_PERF_1=`./${BAF_EXE} 2 100000 1024 1 1 1 1 ${TEAM_SIZE} 6000 | awk '{print $12/174.5}'`
BAF_PERF_2=`./${BAF_EXE} 2 100000 1024 16 1 8 64 ${TEAM_SIZE} 6000 | awk '{print $14/1142.65}'`
echo "BytesAndFlops: ${BAF_PERF_1} ${BAF_PERF_2}"
cd ../..
# MiniMD
cd build/miniMD
cp ../../miniMD/kokkos/Cu_u6.eam ./
MD_PERF_1=`./miniMD --half_neigh 0 -s 60 --ntypes 1 -t ${OMP_NUM_THREADS} -i ../../miniMD/kokkos/in.eam.miniMD | grep PERF_SUMMARY | awk '{print $10/21163341}'`
MD_PERF_2=`./miniMD --half_neigh 0 -s 20 --ntypes 1 -t ${OMP_NUM_THREADS} -i ../../miniMD/kokkos/in.eam.miniMD | grep PERF_SUMMARY | awk '{print $10/13393417}'`
echo "MiniMD: ${MD_PERF_1} ${MD_PERF_2}"
cd ../..
# MiniFE
cd build/miniFE
rm -f *.yaml
./miniFE.x -nx 100 &> /dev/null
FE_PERF_1=`grep "CG Mflop" *.yaml | awk '{print $4/14174}'`
rm -f *.yaml
./miniFE.x -nx 50 &> /dev/null
FE_PERF_2=`grep "CG Mflop" *.yaml | awk '{print $4/11897}'`
cd ../..
echo "MiniFE: ${FE_PERF_1} ${FE_PERF_2}"
PERF_RESULT=`echo "${BAF_PERF_1} ${BAF_PERF_2} ${MD_PERF_1} ${MD_PERF_2} ${FE_PERF_1} ${FE_PERF_2}" | awk '{print ($1+$2+$3+$4+$5+$6)/6}'`
echo "Total Result: " ${PERF_RESULT}
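The total result above is the arithmetic mean of the six normalized scores. The hard-coded `($1+...+$6)/6` can be written generically for any whitespace-separated list; the sample scores below are invented:

```shell
# Average a whitespace-separated list of numbers with awk (generalizes the
# six-term sum used above to any field count NF).
SCORES="0.98 1.02 0.95 1.05 1.00 1.00"
MEAN=$(echo "${SCORES}" | awk '{s=0; for(i=1;i<=NF;i++) s+=$i; print s/NF}')
echo "Total Result: ${MEAN}"
```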


@ -1,7 +1,18 @@
KOKKOS_PATH = ${HOME}/kokkos
SRC = $(wildcard *.cpp)
KOKKOS_DEVICES=Cuda
KOKKOS_CUDA_OPTIONS=enable_lambda
KOKKOS_ARCH = "SNB,Kepler35"
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build
echo "Start Build"
@ -9,22 +20,19 @@ default: build
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
EXE = bytes_and_flops.cuda
KOKKOS_DEVICES = "Cuda,OpenMP"
KOKKOS_ARCH = "SNB,Kepler35"
else
CXX = g++
EXE = bytes_and_flops.host
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "SNB"
endif
CXXFLAGS = -O3 -g
CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
DEPFLAGS = -M
LINK = ${CXX}
LINKFLAGS =
OBJ = $(SRC:.cpp=.o)
OBJ = $(notdir $(SRC:.cpp=.o))
LIB =
include $(KOKKOS_PATH)/Makefile.kokkos
@ -39,5 +47,5 @@ clean: kokkos-clean
# Compilation rules
%.o:%.cpp $(KOKKOS_CPP_DEPENDS) bench.hpp bench_unroll_stride.hpp bench_stride.hpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<
%.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)


@ -69,11 +69,11 @@ void test_policy(int team_range, int thread_range, int vector_range,
int team_size, int vector_size, int test_type,
ViewType1 &v1, ViewType2 &v2, ViewType3 &v3,
double &result, double &result_expect, double &time) {
typedef Kokkos::TeamPolicy<ScheduleType,IndexType> t_policy;
typedef typename t_policy::member_type t_team;
Kokkos::Timer timer;
for(int orep = 0; orep<outer_repeat; orep++) {
if (test_type == 100) {
@ -95,7 +95,7 @@ void test_policy(int team_range, int thread_range, int vector_range,
v2( idx, t ) = t;
// prevent compiler optimizing loop away
});
}
}
});
}
if (test_type == 111) {
@ -178,12 +178,13 @@ void test_policy(int team_range, int thread_range, int vector_range,
for (int tr = 0; tr<thread_repeat; ++tr) {
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,thread_range), [&] (const int t, double &lval) {
double vector_result = 0.0;
for (int vr = 0; vr<inner_repeat; ++vr)
for (int vr = 0; vr<inner_repeat; ++vr) {
vector_result = 0.0;
Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,vector_range), [&] (const int vi, double &vval) {
vval += 1;
}, vector_result);
lval += vector_result;
}
}, team_result);
}
v1(idx) = team_result;
@ -191,7 +192,7 @@ void test_policy(int team_range, int thread_range, int vector_range,
});
}
if (test_type == 200) {
Kokkos::parallel_reduce("200 outer reduce", t_policy(team_range,team_size),
Kokkos::parallel_reduce("200 outer reduce", t_policy(team_range,team_size),
KOKKOS_LAMBDA (const t_team& team, double& lval) {
lval+=team.team_size()*team.league_rank() + team.team_rank();
},result);
@ -315,7 +316,7 @@ void test_policy(int team_range, int thread_range, int vector_range,
// parallel_for RangePolicy: range = team_size*team_range
if (test_type == 300) {
Kokkos::parallel_for("300 outer for", team_size*team_range,
Kokkos::parallel_for("300 outer for", team_size*team_range,
KOKKOS_LAMBDA (const int idx) {
v1(idx) = idx;
// prevent compiler from optimizing away the loop
@ -323,7 +324,7 @@ void test_policy(int team_range, int thread_range, int vector_range,
}
// parallel_reduce RangePolicy: range = team_size*team_range
if (test_type == 400) {
Kokkos::parallel_reduce("400 outer reduce", team_size*team_range,
Kokkos::parallel_reduce("400 outer reduce", team_size*team_range,
KOKKOS_LAMBDA (const int idx, double& val) {
val += idx;
}, result);
@ -331,7 +332,7 @@ void test_policy(int team_range, int thread_range, int vector_range,
}
// parallel_scan RangePolicy: range = team_size*team_range
if (test_type == 500) {
Kokkos::parallel_scan("500 outer scan", team_size*team_range,
Kokkos::parallel_scan("500 outer scan", team_size*team_range,
ParallelScanFunctor<ViewType1>(v1)
#if 0
// This does not compile with pre Cuda 8.0 - see Github Issue #913 for explanation


@ -26,6 +26,7 @@ fi
# Get parent cpuset
HPCBIND_HWLOC_PARENT_CPUSET=""
if [[ ${HPCBIND_HAS_HWLOC} -eq 1 ]]; then
HPCBIND_HWLOC_VERSION="$(hwloc-ls --version | cut -d ' ' -f 2)"
MY_PID="$BASHPID"
HPCBIND_HWLOC_PARENT_CPUSET="$(hwloc-ps -a --cpuset | grep ${MY_PID} | cut -f 2)"
fi
@ -45,8 +46,11 @@ declare -i NUM_GPUS=0
HPCBIND_VISIBLE_GPUS=""
if [[ ${HPCBIND_HAS_NVIDIA} -eq 1 ]]; then
NUM_GPUS=$(nvidia-smi -L | wc -l);
GPU_LIST="$( seq 0 $((NUM_GPUS-1)) )"
HPCBIND_VISIBLE_GPUS=${CUDA_VISIBLE_DEVICES:-${GPU_LIST}}
HPCBIND_HAS_NVIDIA=$((!$?))
if [[ ${HPCBIND_HAS_NVIDIA} -eq 1 ]]; then
GPU_LIST="$( seq 0 $((NUM_GPUS-1)) )"
HPCBIND_VISIBLE_GPUS=${CUDA_VISIBLE_DEVICES:-${GPU_LIST}}
fi
fi
declare -i HPCBIND_ENABLE_GPU_MAPPING=$((NUM_GPUS > 0))
@ -57,33 +61,38 @@ declare -i HPCBIND_ENABLE_GPU_MAPPING=$((NUM_GPUS > 0))
# supports sbatch, bsub, aprun
################################################################################
HPCBIND_QUEUE_NAME=""
declare -i HPCBIND_QUEUE_INDEX=0
declare -i HPCBIND_QUEUE_RANK=0
declare -i HPCBIND_QUEUE_SIZE=0
declare -i HPCBIND_QUEUE_MAPPING=0
if [[ ! -z "${PMI_RANK}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="mpich"
HPCBIND_QUEUE_INDEX=${PMI_RANK}
HPCBIND_QUEUE_RANK=${PMI_RANK}
HPCBIND_QUEUE_SIZE=${PMI_SIZE}
elif [[ ! -z "${OMPI_COMM_WORLD_RANK}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="openmpi"
HPCBIND_QUEUE_INDEX=${OMPI_COMM_WORLD_RANK}
HPCBIND_QUEUE_RANK=${OMPI_COMM_WORLD_RANK}
HPCBIND_QUEUE_SIZE=${OMPI_COMM_WORLD_SIZE}
elif [[ ! -z "${MV2_COMM_WORLD_RANK}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="mvapich2"
HPCBIND_QUEUE_INDEX=${MV2_COMM_WORLD_RANK}
HPCBIND_QUEUE_RANK=${MV2_COMM_WORLD_RANK}
HPCBIND_QUEUE_SIZE=${MV2_COMM_WORLD_SIZE}
elif [[ ! -z "${SLURM_LOCAL_ID}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="slurm"
HPCBIND_QUEUE_INDEX=${SLURM_LOCAL_ID}
elif [[ ! -z "${LBS_JOBINDEX}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="bsub"
HPCBIND_QUEUE_INDEX=${LBS_JOBINDEX}
HPCBIND_QUEUE_RANK=${SLURM_PROCID}
HPCBIND_QUEUE_SIZE=${SLURM_NPROCS}
elif [[ ! -z "${ALPS_APP_PE}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="aprun"
HPCBIND_QUEUE_INDEX=${ALPS_APP_PE}
HPCBIND_QUEUE_RANK=${ALPS_APP_PE}
elif [[ ! -z "${LBS_JOBINDEX}" ]]; then
HPCBIND_QUEUE_MAPPING=1
HPCBIND_QUEUE_NAME="bsub"
HPCBIND_QUEUE_RANK=${LBS_JOBINDEX}
fi
################################################################################
@ -113,8 +122,8 @@ function show_help {
echo " --no-gpu-mapping Do not set CUDA_VISIBLE_DEVICES"
echo " --openmp=M.m Set env variables for the given OpenMP version"
echo " Default: 4.0"
echo " --openmp-percent=N Integer percentage of cpuset to use for OpenMP"
echo " threads Default: 100"
echo " --openmp-ratio=N/D Ratio of the cpuset to use for OpenMP"
echo " Default: 1"
echo " --openmp-places=<Op> Op=threads|cores|sockets. Default: threads"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --force-openmp-num-threads=N"
@ -123,8 +132,8 @@ function show_help {
echo " Override logic for selecting OMP_PROC_BIND"
echo " --no-openmp-nested Set OMP_NESTED to false"
echo " --output-prefix=<P> Save the output to files of the form"
echo " P-N.log, P-N.out and P-N.err where P is the prefix"
echo " and N is the queue index or mpi rank (no spaces)"
echo " P.hpcbind.N, P.stdout.N and P.stderr.N where P is "
echo " the prefix and N is the rank (no spaces)"
echo " --output-mode=<Op> How console output should be handled."
echo " Options are all, rank0, and none. Default: rank0"
echo " --lstopo Show bindings in lstopo"
@ -132,20 +141,27 @@ function show_help {
echo " -h|--help Show this message"
echo ""
echo "Sample Usage:"
echo ""
echo " Split the current process cpuset into 4 and use the 3rd partition"
echo " ${cmd} --distribute=4 --distribute-partition=2 -v -- command ..."
echo ""
echo " Launch 16 jobs over 4 nodes with 4 jobs per node using only the even pus"
echo " and save the output to rank specific files"
echo " mpiexec -N 16 -npernode 4 ${cmd} --whole-system --proc-bind=pu:even \\"
echo " --distribute=4 -v --output-prefix=output -- command ..."
echo ""
echo " Bind the process to all even cores"
echo " ${cmd} --proc-bind=core:even -v -- command ..."
echo ""
echo " Bind the the even cores of socket 0 and the odd cores of socket 1"
echo " ${cmd} --proc-bind='socket:0.core:even socket:1.core:odd' -v -- command ..."
echo ""
echo " Skip GPU 0 when mapping visible devices"
echo " ${cmd} --distribute=4 --distribute-partition=0 --visible-gpus=1,2 -v -- command ..."
echo ""
echo " Display the current bindings"
echo " ${cmd} --proc-bind=numa:0 -- command"
echo ""
echo " Display the current bindings using lstopo"
echo " ${cmd} --proc-bind=numa:0.core:odd --lstopo"
echo ""
@ -167,12 +183,13 @@ declare -i HPCBIND_DISTRIBUTE=1
declare -i HPCBIND_PARTITION=-1
HPCBIND_PROC_BIND="all"
HPCBIND_OPENMP_VERSION=4.0
declare -i HPCBIND_OPENMP_PERCENT=100
declare -i HPCBIND_OPENMP_RATIO_NUMERATOR=1
declare -i HPCBIND_OPENMP_RATIO_DENOMINATOR=1
HPCBIND_OPENMP_PLACES=${OMP_PLACES:-threads}
declare -i HPCBIND_OPENMP_PROC_BIND=1
declare -i HPCBIND_OPENMP_FORCE_NUM_THREADS=-1
HPCBIND_OPENMP_FORCE_NUM_THREADS=""
HPCBIND_OPENMP_FORCE_PROC_BIND=""
HPCBIND_OPENMP_NESTED=${OMP_NESTED:-true}
declare -i HPCBIND_OPENMP_NESTED=1
declare -i HPCBIND_VERBOSE=0
declare -i HPCBIND_LSTOPO=0
@ -199,6 +216,9 @@ for i in "$@"; do
;;
--distribute=*)
HPCBIND_DISTRIBUTE="${i#*=}"
if [[ ${HPCBIND_DISTRIBUTE} -le 0 ]]; then
HPCBIND_DISTRIBUTE=1
fi
shift
;;
# which partition to use
@ -222,8 +242,18 @@ for i in "$@"; do
HPCBIND_OPENMP_VERSION="${i#*=}"
shift
;;
--openmp-percent=*)
HPCBIND_OPENMP_PERCENT="${i#*=}"
--openmp-ratio=*)
IFS=/ read HPCBIND_OPENMP_RATIO_NUMERATOR HPCBIND_OPENMP_RATIO_DENOMINATOR <<< "${i#*=}"
if [[ ${HPCBIND_OPENMP_RATIO_NUMERATOR} -le 0 ]]; then
HPCBIND_OPENMP_RATIO_NUMERATOR=1
fi
if [[ ${HPCBIND_OPENMP_RATIO_DENOMINATOR} -le 0 ]]; then
HPCBIND_OPENMP_RATIO_DENOMINATOR=1
fi
if [[ ${HPCBIND_OPENMP_RATIO_NUMERATOR} -gt ${HPCBIND_OPENMP_RATIO_DENOMINATOR} ]]; then
HPCBIND_OPENMP_RATIO_NUMERATOR=1
HPCBIND_OPENMP_RATIO_DENOMINATOR=1
fi
shift
;;
--openmp-places=*)
@ -243,7 +273,7 @@ for i in "$@"; do
shift
;;
--no-openmp-nested)
HPCBIND_OPENMP_NESTED="false"
HPCBIND_OPENMP_NESTED=0
shift
;;
--output-prefix=*)
@ -292,7 +322,7 @@ if [[ "${HPCBIND_OUTPUT_MODE}" == "none" ]]; then
HPCBIND_TEE=0
elif [[ "${HPCBIND_OUTPUT_MODE}" == "all" ]]; then
HPCBIND_TEE=1
elif [[ ${HPCBIND_QUEUE_INDEX} -eq 0 ]]; then
elif [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
#default to rank0 printing to screen
HPCBIND_TEE=1
fi
@ -303,9 +333,18 @@ if [[ "${HPCBIND_OUTPUT_PREFIX}" == "" ]]; then
HPCBIND_ERR=/dev/null
HPCBIND_OUT=/dev/null
else
HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}-${HPCBIND_QUEUE_INDEX}.hpc.log"
HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}-${HPCBIND_QUEUE_INDEX}.err"
HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}-${HPCBIND_QUEUE_INDEX}.out"
if [[ ${HPCBIND_QUEUE_SIZE} -gt 0 ]]; then
HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})
HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
else
HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_QUEUE_RANK}"
HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_QUEUE_RANK}"
HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_QUEUE_RANK}"
fi
> ${HPCBIND_LOG}
fi
@ -333,27 +372,12 @@ if [[ ${HPCBIND_ENABLE_GPU_MAPPING} -eq 1 ]]; then
NUM_GPUS=${#HPCBIND_VISIBLE_GPUS[@]}
fi
################################################################################
# Check OpenMP percent
################################################################################
if [[ ${HPCBIND_OPENMP_PERCENT} -lt 1 ]]; then
HPCBIND_OPENMP_PERCENT=1
elif [[ ${HPCBIND_OPENMP_PERCENT} -gt 100 ]]; then
HPCBIND_OPENMP_PERCENT=100
fi
################################################################################
# Check distribute
################################################################################
if [[ ${HPCBIND_DISTRIBUTE} -le 0 ]]; then
HPCBIND_DISTRIBUTE=1
fi
################################################################################
#choose the correct partition
################################################################################
if [[ ${HPCBIND_PARTITION} -lt 0 && ${HPCBIND_QUEUE_MAPPING} -eq 1 ]]; then
HPCBIND_PARTITION=${HPCBIND_QUEUE_INDEX}
HPCBIND_PARTITION=${HPCBIND_QUEUE_RANK}
elif [[ ${HPCBIND_PARTITION} -lt 0 ]]; then
HPCBIND_PARTITION=0
fi
@ -381,23 +405,40 @@ if [[ ${HPCBIND_ENABLE_HWLOC_BIND} -eq 1 ]]; then
else
HPCBIND_HWLOC_CPUSET="${BINDING}"
fi
HPCBIND_NUM_PUS=$(hwloc-ls --restrict ${HPCBIND_HWLOC_CPUSET} --only pu | wc -l)
HPCBIND_NUM_PUS=$(hwloc-calc -q -N pu ${HPCBIND_HWLOC_CPUSET} )
if [ $? -ne 0 ]; then
HPCBIND_NUM_PUS=1
fi
HPCBIND_NUM_CORES=$(hwloc-calc -q -N core ${HPCBIND_HWLOC_CPUSET} )
if [ $? -ne 0 ]; then
HPCBIND_NUM_CORES=1
fi
HPCBIND_NUM_NUMAS=$(hwloc-calc -q -N numa ${HPCBIND_HWLOC_CPUSET} )
if [ $? -ne 0 ]; then
HPCBIND_NUM_NUMAS=1
fi
HPCBIND_NUM_SOCKETS=$(hwloc-calc -q -N socket ${HPCBIND_HWLOC_CPUSET} )
if [ $? -ne 0 ]; then
HPCBIND_NUM_SOCKETS=1
fi
else
HPCBIND_NUM_PUS=$(cat /proc/cpuinfo | grep -c processor)
HPCBIND_NUM_CORES=${HPCBIND_NUM_PUS}
HPCBIND_NUM_NUMAS=1
HPCBIND_NUM_SOCKETS=1
fi
declare -i HPCBIND_OPENMP_NUM_THREADS=$((HPCBIND_NUM_PUS * HPCBIND_OPENMP_PERCENT))
HPCBIND_OPENMP_NUM_THREADS=$((HPCBIND_OPENMP_NUM_THREADS / 100))
if [[ ${HPCBIND_OPENMP_NUM_THREADS} -lt 1 ]]; then
HPCBIND_OPENMP_NUM_THREADS=1
elif [[ ${HPCBIND_OPENMP_NUM_THREADS} -gt ${HPCBIND_NUM_PUS} ]]; then
HPCBIND_OPENMP_NUM_THREADS=${HPCBIND_NUM_PUS}
fi
if [[ ${HPCBIND_OPENMP_FORCE_NUM_THREADS} -gt 0 ]]; then
if [[ ${HPCBIND_OPENMP_FORCE_NUM_THREADS} != "" ]]; then
HPCBIND_OPENMP_NUM_THREADS=${HPCBIND_OPENMP_FORCE_NUM_THREADS}
else
declare -i HPCBIND_OPENMP_NUM_THREADS=$((HPCBIND_NUM_PUS * HPCBIND_OPENMP_RATIO_NUMERATOR / HPCBIND_OPENMP_RATIO_DENOMINATOR))
if [[ ${HPCBIND_OPENMP_NUM_THREADS} -lt 1 ]]; then
HPCBIND_OPENMP_NUM_THREADS=1
elif [[ ${HPCBIND_OPENMP_NUM_THREADS} -gt ${HPCBIND_NUM_PUS} ]]; then
HPCBIND_OPENMP_NUM_THREADS=${HPCBIND_NUM_PUS}
fi
fi
################################################################################
@ -405,7 +446,11 @@ fi
################################################################################
# set OMP_NUM_THREADS
export OMP_NUM_THREADS=${HPCBIND_OPENMP_NUM_THREADS}
if [[ ${HPCBIND_OPENMP_NESTED} -eq 1 ]]; then
export OMP_NUM_THREADS="${HPCBIND_OPENMP_NUM_THREADS},1"
else
export OMP_NUM_THREADS=${HPCBIND_OPENMP_NUM_THREADS}
fi
# set OMP_PROC_BIND and OMP_PLACES
if [[ ${HPCBIND_OPENMP_PROC_BIND} -eq 1 ]]; then
@ -413,7 +458,11 @@ if [[ ${HPCBIND_OPENMP_PROC_BIND} -eq 1 ]]; then
#default proc bind logic
if [[ "${HPCBIND_OPENMP_VERSION}" == "4.0" || "${HPCBIND_OPENMP_VERSION}" > "4.0" ]]; then
export OMP_PLACES="${HPCBIND_OPENMP_PLACES}"
export OMP_PROC_BIND="spread"
if [[ ${HPCBIND_OPENMP_NESTED} -eq 1 ]]; then
export OMP_PROC_BIND="spread,spread"
else
export OMP_PROC_BIND="spread"
fi
else
export OMP_PROC_BIND="true"
unset OMP_PLACES
@ -429,9 +478,17 @@ else
unset OMP_PROC_BIND
fi
# set OMP_NESTED
export OMP_NESTED=${HPCBIND_OPENMP_NESTED}
# set up hot teams (intel specific)
if [[ ${HPCBIND_OPENMP_NESTED} -eq 1 ]]; then
export OMP_NESTED="true"
export OMP_MAX_ACTIVE_LEVELS=2
export KMP_HOT_TEAMS=1
export KMP_HOT_TEAMS_MAX_LEVEL=2
else
export OMP_NESTED="false"
fi
# set OMP_NESTED
################################################################################
# Set CUDA environment variables
@ -442,7 +499,7 @@ if [[ ${HPCBIND_ENABLE_GPU_MAPPING} -eq 1 ]]; then
declare -i GPU_ID=$((HPCBIND_PARTITION % NUM_GPUS))
export CUDA_VISIBLE_DEVICES="${HPCBIND_VISIBLE_GPUS[${GPU_ID}]}"
else
declare -i MY_TASK_ID=$((HPCBIND_QUEUE_INDEX * HPCBIND_DISTRIBUTE + HPCBIND_PARTITION))
declare -i MY_TASK_ID=$((HPCBIND_QUEUE_RANK * HPCBIND_DISTRIBUTE + HPCBIND_PARTITION))
declare -i GPU_ID=$((MY_TASK_ID % NUM_GPUS))
export CUDA_VISIBLE_DEVICES="${HPCBIND_VISIBLE_GPUS[${GPU_ID}]}"
fi
@ -451,12 +508,17 @@ fi
################################################################################
# Set hpcbind environment variables
################################################################################
export HPCBIND_HWLOC_VERSION=${HPCBIND_HWLOC_VERSION}
export HPCBIND_HAS_HWLOC=${HPCBIND_HAS_HWLOC}
export HPCBIND_HAS_NVIDIA=${HPCBIND_HAS_NVIDIA}
export HPCBIND_NUM_PUS=${HPCBIND_NUM_PUS}
export HPCBIND_NUM_CORES=${HPCBIND_NUM_CORES}
export HPCBIND_NUM_NUMAS=${HPCBIND_NUM_NUMAS}
export HPCBIND_NUM_SOCKETS=${HPCBIND_NUM_SOCKETS}
export HPCBIND_HWLOC_CPUSET="${HPCBIND_HWLOC_CPUSET}"
export HPCBIND_HWLOC_DISTRIBUTE=${HPCBIND_DISTRIBUTE}
export HPCBIND_HWLOC_DISTRIBUTE_PARTITION=${HPCBIND_PARTITION}
export HPCBIND_OPENMP_RATIO="${HPCBIND_OPENMP_RATIO_NUMERATOR}/${HPCBIND_OPENMP_RATIO_DENOMINATOR}"
if [[ "${HPCBIND_HWLOC_PARENT_CPUSET}" == "" ]]; then
export HPCBIND_HWLOC_PARENT_CPUSET="all"
else
@ -467,7 +529,8 @@ export HPCBIND_NVIDIA_ENABLE_GPU_MAPPING=${HPCBIND_ENABLE_GPU_MAPPING}
export HPCBIND_NVIDIA_VISIBLE_GPUS=$(echo "${HPCBIND_VISIBLE_GPUS[*]}" | tr ' ' ',')
export HPCBIND_OPENMP_VERSION="${HPCBIND_OPENMP_VERSION}"
if [[ "${HPCBIND_QUEUE_NAME}" != "" ]]; then
export HPCBIND_QUEUE_INDEX=${HPCBIND_QUEUE_INDEX}
export HPCBIND_QUEUE_RANK=${HPCBIND_QUEUE_RANK}
export HPCBIND_QUEUE_SIZE=${HPCBIND_QUEUE_SIZE}
export HPCBIND_QUEUE_NAME="${HPCBIND_QUEUE_NAME}"
export HPCBIND_QUEUE_MAPPING=${HPCBIND_QUEUE_MAPPING}
fi
@ -487,10 +550,16 @@ if [[ ${HPCBIND_TEE} -eq 0 || ${HPCBIND_VERBOSE} -eq 0 ]]; then
echo "${TMP_ENV}" | grep -E "^CUDA_" >> ${HPCBIND_LOG}
echo "[OPENMP]" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^OMP_" >> ${HPCBIND_LOG}
echo "[GOMP] (gcc, g++, and gfortran)" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^GOMP_" >> ${HPCBIND_LOG}
echo "[KMP] (icc, icpc, and ifort)" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^KMP_" >> ${HPCBIND_LOG}
echo "[XLSMPOPTS] (xlc, xlc++, and xlf)" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^XLSMPOPTS" >> ${HPCBIND_LOG}
if [[ ${HPCBIND_HAS_HWLOC} -eq 1 ]]; then
echo "[BINDINGS]" >> ${HPCBIND_LOG}
hwloc-ls --restrict "${HPCBIND_HWLOC_CPUSET}" --only pu >> ${HPCBIND_LOG}
hwloc-ls --restrict "${HPCBIND_HWLOC_CPUSET}" >> ${HPCBIND_LOG}
else
echo "Unable to show bindings, hwloc not available." >> ${HPCBIND_LOG}
fi
@ -503,10 +572,16 @@ else
echo "${TMP_ENV}" | grep -E "^CUDA_" > >(tee -a ${HPCBIND_LOG})
echo "[OPENMP]" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^OMP_" > >(tee -a ${HPCBIND_LOG})
echo "[GOMP] (gcc, g++, and gfortran)" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^GOMP_" > >(tee -a ${HPCBIND_LOG})
echo "[KMP] (icc, icpc, and ifort)" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^KMP_" > >(tee -a ${HPCBIND_LOG})
echo "[XLSMPOPTS] (xlc, xlc++, and xlf)" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^XLSMPOPTS" > >(tee -a ${HPCBIND_LOG})
if [[ ${HPCBIND_HAS_HWLOC} -eq 1 ]]; then
echo "[BINDINGS]" > >(tee -a ${HPCBIND_LOG})
hwloc-ls --restrict "${HPCBIND_HWLOC_CPUSET}" --only pu > >(tee -a ${HPCBIND_LOG})
hwloc-ls --restrict "${HPCBIND_HWLOC_CPUSET}" --no-io --no-bridges > >(tee -a ${HPCBIND_LOG})
else
echo "Unable to show bindings, hwloc not available." > >(tee -a ${HPCBIND_LOG})
fi


@ -39,6 +39,12 @@ cuda_args=""
# Arguments for both NVCC and Host compiler
shared_args=""
# Argument -c
compile_arg=""
# Argument -o <obj>
output_arg=""
# Linker arguments
xlinker_args=""
@ -66,6 +72,7 @@ dry_run=0
# Skip NVCC compilation and use host compiler directly
host_only=0
host_only_args=""
# Enable workaround for CUDA 6.5 for pragma ident
replace_pragma_ident=0
@ -81,6 +88,11 @@ optimization_applied=0
# Check if we have -std=c++X or --std=c++X already
stdcxx_applied=0
# Run nvcc a second time to generate dependencies if needed
depfile_separate=0
depfile_output_arg=""
depfile_target_arg=""
#echo "Arguments: $# $@"
while [ $# -gt 0 ]
@ -112,12 +124,31 @@ do
fi
;;
#Handle shared args (valid for both nvcc and the host compiler)
-D*|-c|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
-D*|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
shared_args="$shared_args $1"
;;
#Handle shared args that have an argument
-o|-MT)
shared_args="$shared_args $1 $2"
#Handle compilation argument
-c)
compile_arg="$1"
;;
#Handle output argument
-o)
output_arg="$output_arg $1 $2"
shift
;;
# Handle depfile arguments. We map them to a separate call to nvcc.
-MD|-MMD)
depfile_separate=1
host_only_args="$host_only_args $1"
;;
-MF)
depfile_output_arg="-o $2"
host_only_args="$host_only_args $1 $2"
shift
;;
-MT)
depfile_target_arg="$1 $2"
host_only_args="$host_only_args $1 $2"
shift
;;
#Handle known nvcc args
@ -242,7 +273,7 @@ if [ $first_xcompiler_arg -eq 0 ]; then
fi
#Compose host only command
host_command="$host_compiler $shared_args $xcompiler_args $host_linker_args $shared_versioned_libraries_host"
host_command="$host_compiler $shared_args $host_only_args $compile_arg $output_arg $xcompiler_args $host_linker_args $shared_versioned_libraries_host"
#nvcc does not accept '#pragma ident SOME_MACRO_STRING' but it does accept '#ident SOME_MACRO_STRING'
if [ $replace_pragma_ident -eq 1 ]; then
@ -274,10 +305,21 @@ else
host_command="$host_command $object_files"
fi
if [ $depfile_separate -eq 1 ]; then
# run nvcc a second time to generate dependencies (without compiling)
nvcc_depfile_command="$nvcc_command -M $depfile_target_arg $depfile_output_arg"
else
nvcc_depfile_command=""
fi
nvcc_command="$nvcc_command $compile_arg $output_arg"
#Print command for dryrun
if [ $dry_run -eq 1 ]; then
if [ $host_only -eq 1 ]; then
echo $host_command
elif [ -n "$nvcc_depfile_command" ]; then
echo $nvcc_command "&&" $nvcc_depfile_command
else
echo $nvcc_command
fi
@ -287,6 +329,8 @@ fi
#Run compilation command
if [ $host_only -eq 1 ]; then
$host_command
elif [ -n "$nvcc_depfile_command" ]; then
$nvcc_command && $nvcc_depfile_command
else
$nvcc_command
fi
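The depfile logic above turns `-MD`/`-MMD` into a second nvcc pass that runs with `-M` only, remapping `-MF` to that pass's `-o` and forwarding `-MT`. A toy re-creation of the argument scan and command composition (nothing is executed; the sample argument list and file names are invented):

```shell
# Scan a sample compiler invocation, peel off the depfile flags, and
# compose the two-pass nvcc command the wrapper would run.
depfile_separate=0; depfile_output_arg=""; depfile_target_arg=""; other_args=""
set -- -c kernel.cu -MD -MF kernel.d -MT kernel.o   # sample invocation
while [ $# -gt 0 ]; do
  case $1 in
    -MD|-MMD) depfile_separate=1 ;;
    -MF) depfile_output_arg="-o $2"; shift ;;
    -MT) depfile_target_arg="$1 $2"; shift ;;
    *)   other_args="$other_args $1" ;;
  esac
  shift
done
RESULT="nvcc$other_args"
if [ $depfile_separate -eq 1 ]; then
  RESULT="$RESULT && nvcc$other_args -M $depfile_target_arg $depfile_output_arg"
fi
echo "$RESULT"
```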


@ -0,0 +1,8 @@
ifndef KOKKOS_PATH
MAKEFILE_PATH := $(abspath $(lastword $(MAKEFILE_LIST)))
KOKKOS_PATH = $(subst Makefile,,$(MAKEFILE_PATH))..
endif
include $(KOKKOS_PATH)/Makefile.kokkos
include $(KOKKOS_PATH)/core/src/Makefile.generate_header_lists
include $(KOKKOS_PATH)/core/src/Makefile.generate_build_files

File diff suppressed because it is too large


@ -0,0 +1,219 @@
# kokkos_generated_settings.cmake includes the kokkos library itself in KOKKOS_LIBS
# which we do not want to use for the cmake builds so clean this up
string(REGEX REPLACE "-lkokkos" "" KOKKOS_LIBS ${KOKKOS_LIBS})
############################ Detect if submodule ###############################
#
# With thanks to StackOverflow:
# http://stackoverflow.com/questions/25199677/how-to-detect-if-current-scope-has-a-parent-in-cmake
#
get_directory_property(HAS_PARENT PARENT_DIRECTORY)
if(HAS_PARENT)
message(STATUS "Submodule build")
SET(KOKKOS_HEADER_DIR "include/kokkos")
else()
message(STATUS "Standalone build")
SET(KOKKOS_HEADER_DIR "include")
endif()
################################ Handle the actual build #######################
SET(INSTALL_LIB_DIR lib CACHE PATH "Installation directory for libraries")
SET(INSTALL_BIN_DIR bin CACHE PATH "Installation directory for executables")
SET(INSTALL_INCLUDE_DIR ${KOKKOS_HEADER_DIR} CACHE PATH
"Installation directory for header files")
IF(WIN32 AND NOT CYGWIN)
SET(DEF_INSTALL_CMAKE_DIR CMake)
ELSE()
SET(DEF_INSTALL_CMAKE_DIR lib/CMake/Kokkos)
ENDIF()
SET(INSTALL_CMAKE_DIR ${DEF_INSTALL_CMAKE_DIR} CACHE PATH
"Installation directory for CMake files")
# Make relative paths absolute (needed later on)
FOREACH(p LIB BIN INCLUDE CMAKE)
SET(var INSTALL_${p}_DIR)
IF(NOT IS_ABSOLUTE "${${var}}")
SET(${var} "${CMAKE_INSTALL_PREFIX}/${${var}}")
ENDIF()
ENDFOREACH()
# set up include-directories
SET (Kokkos_INCLUDE_DIRS
${Kokkos_SOURCE_DIR}/core/src
${Kokkos_SOURCE_DIR}/containers/src
${Kokkos_SOURCE_DIR}/algorithms/src
${Kokkos_BINARY_DIR} # to find KokkosCore_config.h
${KOKKOS_INCLUDE_DIRS}
)
# pass include dirs back to parent scope
if(HAS_PARENT)
SET(Kokkos_INCLUDE_DIRS_RET ${Kokkos_INCLUDE_DIRS} PARENT_SCOPE)
else()
SET(Kokkos_INCLUDE_DIRS_RET ${Kokkos_INCLUDE_DIRS})
endif()
INCLUDE_DIRECTORIES(${Kokkos_INCLUDE_DIRS})
IF(KOKKOS_SEPARATE_LIBS)
# Sources come from makefile-generated kokkos_generated_settings.cmake file
# Separate libs need to separate the sources
set_kokkos_srcs(KOKKOS_SRC ${KOKKOS_SRC})
# kokkoscore
ADD_LIBRARY(
kokkoscore
${KOKKOS_CORE_SRCS}
)
target_compile_options(
kokkoscore
PUBLIC $<$<COMPILE_LANGUAGE:CXX>:${KOKKOS_CXX_FLAGS}>
)
# Install the kokkoscore library
INSTALL (TARGETS kokkoscore
EXPORT KokkosTargets
ARCHIVE DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/bin
)
TARGET_LINK_LIBRARIES(
kokkoscore
${KOKKOS_LD_FLAGS}
${KOKKOS_EXTRA_LIBS_LIST}
)
# kokkoscontainers
if (DEFINED KOKKOS_CONTAINERS_SRCS)
ADD_LIBRARY(
kokkoscontainers
${KOKKOS_CONTAINERS_SRCS}
)
endif()
TARGET_LINK_LIBRARIES(
kokkoscontainers
kokkoscore
)
# Install the kokkoscontainers library
INSTALL (TARGETS kokkoscontainers
EXPORT KokkosTargets
ARCHIVE DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/bin)
# kokkosalgorithms - Build as interface library since no source files.
ADD_LIBRARY(
kokkosalgorithms
INTERFACE
)
target_include_directories(
kokkosalgorithms
INTERFACE ${Kokkos_SOURCE_DIR}/algorithms/src
)
TARGET_LINK_LIBRARIES(
kokkosalgorithms
INTERFACE kokkoscore
)
# Install the kokkosalgorithms library
INSTALL (TARGETS kokkosalgorithms
ARCHIVE DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/bin)
SET (Kokkos_LIBRARIES_NAMES kokkoscore kokkoscontainers kokkosalgorithms)
ELSE()
# kokkos
ADD_LIBRARY(
kokkos
${KOKKOS_CORE_SRCS}
${KOKKOS_CONTAINERS_SRCS}
)
target_compile_options(
kokkos
PUBLIC $<$<COMPILE_LANGUAGE:CXX>:${KOKKOS_CXX_FLAGS}>
)
TARGET_LINK_LIBRARIES(
kokkos
${KOKKOS_LD_FLAGS}
${KOKKOS_EXTRA_LIBS_LIST}
)
# Install the kokkos library
INSTALL (TARGETS kokkos
EXPORT KokkosTargets
ARCHIVE DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/lib
RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/bin)
SET (Kokkos_LIBRARIES_NAMES kokkos)
endif() # KOKKOS_SEPARATE_LIBS
# Install the kokkos headers
INSTALL (DIRECTORY
EXPORT KokkosTargets
${Kokkos_SOURCE_DIR}/core/src/
DESTINATION ${KOKKOS_HEADER_DIR}
FILES_MATCHING PATTERN "*.hpp"
)
INSTALL (DIRECTORY
EXPORT KokkosTargets
${Kokkos_SOURCE_DIR}/containers/src/
DESTINATION ${KOKKOS_HEADER_DIR}
FILES_MATCHING PATTERN "*.hpp"
)
INSTALL (DIRECTORY
EXPORT KokkosTargets
${Kokkos_SOURCE_DIR}/algorithms/src/
DESTINATION ${KOKKOS_HEADER_DIR}
FILES_MATCHING PATTERN "*.hpp"
)
INSTALL (FILES
${Kokkos_BINARY_DIR}/KokkosCore_config.h
DESTINATION ${KOKKOS_HEADER_DIR}
)
# Add all targets to the build-tree export set
export(TARGETS ${Kokkos_LIBRARIES_NAMES}
FILE "${Kokkos_BINARY_DIR}/KokkosTargets.cmake")
# Export the package for use from the build-tree
# (this registers the build-tree with a global CMake-registry)
export(PACKAGE Kokkos)
# Create the KokkosConfig.cmake and KokkosConfigVersion files
file(RELATIVE_PATH REL_INCLUDE_DIR "${INSTALL_CMAKE_DIR}"
"${INSTALL_INCLUDE_DIR}")
# ... for the build tree
set(CONF_INCLUDE_DIRS "${Kokkos_SOURCE_DIR}" "${Kokkos_BINARY_DIR}")
configure_file(${Kokkos_SOURCE_DIR}/cmake/KokkosConfig.cmake.in
"${Kokkos_BINARY_DIR}/KokkosConfig.cmake" @ONLY)
# ... for the install tree
set(CONF_INCLUDE_DIRS "\${Kokkos_CMAKE_DIR}/${REL_INCLUDE_DIR}")
configure_file(${Kokkos_SOURCE_DIR}/cmake/KokkosConfig.cmake.in
"${Kokkos_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/KokkosConfig.cmake" @ONLY)
# Install the KokkosConfig.cmake file
install(FILES
"${Kokkos_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/KokkosConfig.cmake"
DESTINATION "${INSTALL_CMAKE_DIR}")
#This seems not to do anything?
#message(STATUS "KokkosTargets: " ${KokkosTargets})
# Install the export set for use with the install-tree
INSTALL(EXPORT KokkosTargets DESTINATION
"${INSTALL_CMAKE_DIR}")


@@ -0,0 +1,345 @@
################################### FUNCTIONS ##################################
# List of functions
# set_kokkos_cxx_compiler
# set_kokkos_cxx_standard
# set_kokkos_srcs
#-------------------------------------------------------------------------------
# function(set_kokkos_cxx_compiler)
# Sets the following compiler variables that are analogous to the CMAKE_*
# versions. We add the ability to detect NVCC (really nvcc_wrapper).
# KOKKOS_CXX_COMPILER
# KOKKOS_CXX_COMPILER_ID
# KOKKOS_CXX_COMPILER_VERSION
#
# Inputs:
# KOKKOS_ENABLE_CUDA
# CMAKE_CXX_COMPILER
# CMAKE_CXX_COMPILER_ID
# CMAKE_CXX_COMPILER_VERSION
#
# Also verifies the compiler version meets the minimum required by Kokkos.
function(set_kokkos_cxx_compiler)
# Since CMake doesn't recognize the nvcc compiler until 3.8, we use our own
# version of the CMake variables and detect nvcc ourselves. Initially set to
# the CMake variable values.
set(INTERNAL_CXX_COMPILER ${CMAKE_CXX_COMPILER})
set(INTERNAL_CXX_COMPILER_ID ${CMAKE_CXX_COMPILER_ID})
set(INTERNAL_CXX_COMPILER_VERSION ${CMAKE_CXX_COMPILER_VERSION})
# Check if the compiler is nvcc (which really means nvcc_wrapper).
execute_process(COMMAND ${INTERNAL_CXX_COMPILER} --version
COMMAND grep nvcc
COMMAND wc -l
OUTPUT_VARIABLE INTERNAL_HAVE_COMPILER_NVCC
OUTPUT_STRIP_TRAILING_WHITESPACE)
string(REGEX REPLACE "^ +" ""
INTERNAL_HAVE_COMPILER_NVCC ${INTERNAL_HAVE_COMPILER_NVCC})
if(INTERNAL_HAVE_COMPILER_NVCC)
# Set the compiler id to nvcc. We use the value used by CMake 3.8.
set(INTERNAL_CXX_COMPILER_ID NVIDIA)
# Set nvcc's compiler version.
execute_process(COMMAND ${INTERNAL_CXX_COMPILER} --version
COMMAND grep release
OUTPUT_VARIABLE INTERNAL_CXX_COMPILER_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE)
string(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+$"
INTERNAL_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION})
endif()
# Enforce the minimum compilers supported by Kokkos.
set(KOKKOS_MESSAGE_TEXT "Compiler not supported by Kokkos. Required compiler versions:")
set(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Clang 3.5.2 or higher")
set(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n GCC 4.8.4 or higher")
set(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Intel 15.0.2 or higher")
set(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n NVCC 7.0.28 or higher")
set(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n PGI 17.1 or higher\n")
if(INTERNAL_CXX_COMPILER_ID STREQUAL Clang)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 3.5.2)
message(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
endif()
elseif(INTERNAL_CXX_COMPILER_ID STREQUAL GNU)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 4.8.4)
message(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
endif()
elseif(INTERNAL_CXX_COMPILER_ID STREQUAL Intel)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 15.0.2)
message(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
endif()
elseif(INTERNAL_CXX_COMPILER_ID STREQUAL NVIDIA)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 7.0.28)
message(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
endif()
elseif(INTERNAL_CXX_COMPILER_ID STREQUAL PGI)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 17.1)
message(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
endif()
endif()
# Enforce that extensions are turned off for nvcc_wrapper.
if(INTERNAL_CXX_COMPILER_ID STREQUAL NVIDIA)
if(DEFINED CMAKE_CXX_EXTENSIONS AND CMAKE_CXX_EXTENSIONS STREQUAL ON)
message(FATAL_ERROR "NVCC doesn't support C++ extensions. Set CMAKE_CXX_EXTENSIONS to OFF in your CMakeLists.txt.")
endif()
endif()
if(KOKKOS_ENABLE_CUDA)
# Enforce that the compiler can compile CUDA code.
if(INTERNAL_CXX_COMPILER_ID STREQUAL Clang)
if(INTERNAL_CXX_COMPILER_VERSION VERSION_LESS 4.0.0)
message(FATAL_ERROR "Compiling CUDA code directly with Clang requires version 4.0.0 or higher.")
endif()
elseif(NOT INTERNAL_CXX_COMPILER_ID STREQUAL NVIDIA)
message(FATAL_ERROR "Invalid compiler for CUDA. The compiler must be nvcc_wrapper or Clang.")
endif()
endif()
set(KOKKOS_CXX_COMPILER ${INTERNAL_CXX_COMPILER} PARENT_SCOPE)
set(KOKKOS_CXX_COMPILER_ID ${INTERNAL_CXX_COMPILER_ID} PARENT_SCOPE)
set(KOKKOS_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION} PARENT_SCOPE)
endfunction()
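# Illustrative usage (not part of the original file): after the call, the
# KOKKOS_CXX_COMPILER* variables are available in the caller's scope, e.g.:
#   set_kokkos_cxx_compiler()
#   message(STATUS "Using ${KOKKOS_CXX_COMPILER_ID} ${KOKKOS_CXX_COMPILER_VERSION}")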
#-------------------------------------------------------------------------------
# function(set_kokkos_cxx_standard)
# Transitively enforces that the appropriate CXX standard compile flags (C++11
# or above) are added to targets that use the Kokkos library. Compile features
# are used if possible. Otherwise, the appropriate flags are added to
# KOKKOS_CXX_FLAGS. Values set by the user to CMAKE_CXX_STANDARD and
# CMAKE_CXX_EXTENSIONS are honored.
#
# Outputs:
# KOKKOS_CXX11_FEATURES
# KOKKOS_CXX_FLAGS
#
# Inputs:
# KOKKOS_CXX_COMPILER
# KOKKOS_CXX_COMPILER_ID
# KOKKOS_CXX_COMPILER_VERSION
#
function(set_kokkos_cxx_standard)
# The following table lists the versions of CMake that support CXX_STANDARD
# and the CXX compile features for different compilers. The versions are
# based on CMake documentation, looking at CMake code, and verifying by
# testing with specific CMake versions.
#
# COMPILER CXX_STANDARD Compile Features
# ---------------------------------------------------------------
# Clang 3.1 3.1
# GNU 3.1 3.2
# AppleClang 3.2 3.2
# Intel 3.6 3.6
# Cray No No
# PGI No No
# XL No No
#
# For compiling CUDA code using nvcc_wrapper, we will use the host compiler's
# flags for turning on C++11. Since for compiler ID and versioning purposes
# CMake recognizes the host compiler when calling nvcc_wrapper, this just
# works. Both NVCC and nvcc_wrapper only recognize '-std=c++11' which means
# that we can only use host compilers for CUDA builds that use those flags.
# It also means that extensions (gnu++11) can't be turned on for CUDA builds.
# Check if we can use compile features.
if(NOT KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA)
if(CMAKE_CXX_COMPILER_ID STREQUAL Clang)
if(NOT CMAKE_VERSION VERSION_LESS 3.1)
set(INTERNAL_USE_COMPILE_FEATURES ON)
endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL AppleClang OR CMAKE_CXX_COMPILER_ID STREQUAL GNU)
if(NOT CMAKE_VERSION VERSION_LESS 3.2)
set(INTERNAL_USE_COMPILE_FEATURES ON)
endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL Intel)
if(NOT CMAKE_VERSION VERSION_LESS 3.6)
set(INTERNAL_USE_COMPILE_FEATURES ON)
endif()
endif()
endif()
if(INTERNAL_USE_COMPILE_FEATURES)
# Use the compile features aspect of CMake to transitively cause C++ flags
# to populate to user code.
# I'm using a hack by requiring features that I know force the lowest version
# of the compilers we want to support. Clang 3.3 and later support all of
# the C++11 standard. With CMake 3.8 and higher, we could switch to using
# cxx_std_11.
set(KOKKOS_CXX11_FEATURES
cxx_nonstatic_member_init # Forces GCC 4.7 or later and Intel 14.0 or later.
PARENT_SCOPE
)
else()
# CXX compile features are not yet implemented for this combination of
# compiler and version of CMake.
if(CMAKE_CXX_COMPILER_ID STREQUAL AppleClang)
# Versions of CMAKE before 3.2 don't support CXX_STANDARD or C++ compile
# features for the AppleClang compiler. Set compiler flags transitively
# here such that they trickle down to a call to target_compile_options().
# The following two blocks of code were copied from
# /Modules/Compiler/AppleClang-CXX.cmake from CMake 3.7.2 and then
# modified.
if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 4.0)
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "-std=c++11")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "-std=gnu++11")
endif()
if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 6.1)
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "-std=c++14")
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "-std=gnu++14")
elseif(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 5.1)
# AppleClang 5.0 knows this flag, but does not set a __cplusplus macro
# greater than 201103L.
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "-std=c++1y")
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "-std=gnu++1y")
endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL Intel)
# Versions of CMAKE before 3.6 don't support CXX_STANDARD or C++ compile
# features for the Intel compiler. Set compiler flags transitively here
# such that they trickle down to a call to target_compile_options().
# The following three blocks of code were copied from
# /Modules/Compiler/Intel-CXX.cmake from CMake 3.7.2 and then modified.
if("x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC")
set(_std -Qstd)
set(_ext c++)
else()
set(_std -std)
set(_ext gnu++)
endif()
if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 15.0.2)
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "${_std}=c++14")
# TODO: There is no gnu++14 value supported; figure out what to do.
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "${_std}=c++14")
elseif(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 15.0.0)
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "${_std}=c++1y")
# TODO: There is no gnu++14 value supported; figure out what to do.
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "${_std}=c++1y")
endif()
if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 13.0)
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "${_std}=c++11")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "${_std}=${_ext}11")
elseif(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 12.1)
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "${_std}=c++0x")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "${_std}=${_ext}0x")
endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL Cray)
# CMAKE doesn't support CXX_STANDARD or C++ compile features for the Cray
# compiler. Set compiler options transitively here such that they trickle
# down to a call to target_compile_options().
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "-hstd=c++11")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "-hstd=c++11")
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "-hstd=c++11")
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "-hstd=c++11")
elseif(CMAKE_CXX_COMPILER_ID STREQUAL PGI)
# CMAKE doesn't support CXX_STANDARD or C++ compile features for the PGI
# compiler. Set compiler options transitively here such that they trickle
# down to a call to target_compile_options().
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "--c++11")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "--c++11")
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "--c++11")
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "--c++11")
elseif(CMAKE_CXX_COMPILER_ID STREQUAL XL)
# CMAKE doesn't support CXX_STANDARD or C++ compile features for the XL
# compiler. Set compiler options transitively here such that they trickle
# down to a call to target_compile_options().
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION "-std=c++11")
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION "-std=c++11")
set(INTERNAL_CXX14_STANDARD_COMPILE_OPTION "-std=c++11")
set(INTERNAL_CXX14_EXTENSION_COMPILE_OPTION "-std=c++11")
else()
# Assume GNU. CMAKE_CXX_STANDARD is handled correctly by CMake 3.1 and
# above for this compiler. If the user explicitly requests a C++
# standard, CMake takes care of it. If not, transitively require C++11.
if(NOT CMAKE_CXX_STANDARD)
set(INTERNAL_CXX11_STANDARD_COMPILE_OPTION ${CMAKE_CXX11_STANDARD_COMPILE_OPTION})
set(INTERNAL_CXX11_EXTENSION_COMPILE_OPTION ${CMAKE_CXX11_EXTENSION_COMPILE_OPTION})
endif()
endif()
# Set the C++ standard info for Kokkos respecting user set values for
# CMAKE_CXX_STANDARD and CMAKE_CXX_EXTENSIONS.
# Only use cxx extension if explicitly requested
if(CMAKE_CXX_STANDARD EQUAL 14)
if(DEFINED CMAKE_CXX_EXTENSIONS AND CMAKE_CXX_EXTENSIONS STREQUAL ON)
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX14_EXTENSION_COMPILE_OPTION})
else()
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX14_STANDARD_COMPILE_OPTION})
endif()
elseif(CMAKE_CXX_STANDARD EQUAL 11)
if(DEFINED CMAKE_CXX_EXTENSIONS AND CMAKE_CXX_EXTENSIONS STREQUAL ON)
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX11_EXTENSION_COMPILE_OPTION})
else()
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX11_STANDARD_COMPILE_OPTION})
endif()
else()
# The user didn't explicitly request a standard, transitively require
# C++11 respecting CMAKE_CXX_EXTENSIONS.
if(DEFINED CMAKE_CXX_EXTENSIONS AND CMAKE_CXX_EXTENSIONS STREQUAL ON)
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX11_EXTENSION_COMPILE_OPTION})
else()
set(INTERNAL_CXX_FLAGS ${INTERNAL_CXX11_STANDARD_COMPILE_OPTION})
endif()
endif()
set(KOKKOS_CXX_FLAGS ${INTERNAL_CXX_FLAGS} PARENT_SCOPE)
endif()
endfunction()
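# Illustrative outcomes (not part of the original file): with GCC and a CMake
# new enough for compile features, KOKKOS_CXX11_FEATURES is set to
# cxx_nonstatic_member_init and KOKKOS_CXX_FLAGS is left empty; with PGI and
# CMAKE_CXX_STANDARD unset, KOKKOS_CXX_FLAGS becomes --c++11 instead.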
#-------------------------------------------------------------------------------
# function(set_kokkos_srcs)
# Takes a list of Kokkos sources (e.g., KOKKOS_SRC, which Makefile.kokkos
# writes into kokkos_generated_settings.cmake) and sorts the files into the
# subpackages used for separate libraries: core and containers (algorithms is
# header-only).
#
# Inputs:
# KOKKOS_SRC
#
# Outputs:
# KOKKOS_CORE_SRCS
# KOKKOS_CONTAINERS_SRCS
#
function(set_kokkos_srcs)
set(opts ) # no-value args
set(oneValArgs )
set(multValArgs KOKKOS_SRC) # e.g., lists
cmake_parse_arguments(IN "${opts}" "${oneValArgs}" "${multValArgs}" ${ARGN})
foreach(sfile ${IN_KOKKOS_SRC})
string(REPLACE "${CMAKE_CURRENT_SOURCE_DIR}/" "" stripfile "${sfile}")
string(REPLACE "/" ";" striplist "${stripfile}")
list(GET striplist 0 firstdir)
if(${firstdir} STREQUAL "core")
list(APPEND KOKKOS_CORE_SRCS ${sfile})
else()
list(APPEND KOKKOS_CONTAINERS_SRCS ${sfile})
endif()
endforeach()
set(KOKKOS_CORE_SRCS ${KOKKOS_CORE_SRCS} PARENT_SCOPE)
set(KOKKOS_CONTAINERS_SRCS ${KOKKOS_CONTAINERS_SRCS} PARENT_SCOPE)
return()
endfunction()
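# Illustrative usage (not part of the original file), assuming KOKKOS_SRC was
# read from kokkos_generated_settings.cmake:
#   set_kokkos_srcs(KOKKOS_SRC ${KOKKOS_SRC})
#   # KOKKOS_CORE_SRCS and KOKKOS_CONTAINERS_SRCS are now set in this scope.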
# Setting a default value if it is not already set
macro(set_kokkos_default_default VARIABLE DEFAULT)
IF( "${KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT}" STREQUAL "" )
IF( "${KOKKOS_ENABLE_${VARIABLE}}" STREQUAL "" )
set(KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT ${DEFAULT})
# MESSAGE(WARNING "Set: KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT to ${KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT}")
ELSE()
set(KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT ${KOKKOS_ENABLE_${VARIABLE}})
# MESSAGE(WARNING "Set: KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT to ${KOKKOS_INTERNAL_ENABLE_${VARIABLE}_DEFAULT}")
ENDIF()
ENDIF()
UNSET(KOKKOS_ENABLE_${VARIABLE} CACHE)
endmacro()
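# Illustrative usage (not part of the original file): if neither
# KOKKOS_INTERNAL_ENABLE_OPENMP_DEFAULT nor KOKKOS_ENABLE_OPENMP is set, then
#   set_kokkos_default_default(OPENMP OFF)
# leaves KOKKOS_INTERNAL_ENABLE_OPENMP_DEFAULT=OFF and clears the
# KOKKOS_ENABLE_OPENMP cache entry.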


@@ -0,0 +1,365 @@
########################## NOTES ###############################################
# List the options for configuring Kokkos with the CMake build.
# These options then get mapped onto the KOKKOS_SETTINGS environment variable
# by kokkos_settings.cmake. This file is kept separate so that other packages
# (e.g., TriBITS) can override these variables.
########################## AVAILABLE OPTIONS ###################################
# Use lists for documentation, verification, and programming convenience
# All CMake options of the type KOKKOS_ENABLE_*
set(KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST)
list(APPEND KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST
Serial
OpenMP
Pthread
Qthread
Cuda
ROCm
HWLOC
MEMKIND
LIBRT
Cuda_Lambda
Cuda_Relocatable_Device_Code
Cuda_UVM
Cuda_LDG_Intrinsic
Debug
Debug_DualView_Modify_Check
Debug_Bounds_Check
Compiler_Warnings
Profiling
Profiling_Load_Print
Aggressive_Vectorization
)
#-------------------------------------------------------------------------------
#------------------------------- Recognize CamelCase Options ---------------------------
#-------------------------------------------------------------------------------
foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
string(TOUPPER ${opt} OPT )
IF(DEFINED Kokkos_ENABLE_${opt})
IF(DEFINED KOKKOS_ENABLE_${OPT})
IF(NOT ("${KOKKOS_ENABLE_${OPT}}" STREQUAL "${Kokkos_ENABLE_${opt}}"))
IF(DEFINED KOKKOS_ENABLE_${OPT}_INTERNAL)
MESSAGE(WARNING "Defined both Kokkos_ENABLE_${opt}=[${Kokkos_ENABLE_${opt}}] and KOKKOS_ENABLE_${OPT}=[${KOKKOS_ENABLE_${OPT}}] and they differ! Could be caused by old CMakeCache Variable. Run CMake again and warning should disappear. If not you are truly setting both variables.")
IF(NOT ("${Kokkos_ENABLE_${opt}}" STREQUAL "${KOKKOS_ENABLE_${OPT}_INTERNAL}"))
UNSET(KOKKOS_ENABLE_${OPT} CACHE)
SET(KOKKOS_ENABLE_${OPT} ${Kokkos_ENABLE_${opt}})
MESSAGE(WARNING "SET BOTH VARIABLES KOKKOS_ENABLE_${OPT}: ${KOKKOS_ENABLE_${OPT}}")
ELSE()
SET(Kokkos_ENABLE_${opt} ${KOKKOS_ENABLE_${OPT}})
ENDIF()
ELSE()
MESSAGE(FATAL_ERROR "Defined both Kokkos_ENABLE_${opt}=[${Kokkos_ENABLE_${opt}}] and KOKKOS_ENABLE_${OPT}=[${KOKKOS_ENABLE_${OPT}}] and they differ!")
ENDIF()
ENDIF()
ELSE()
SET(KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT ${Kokkos_ENABLE_${opt}})
ENDIF()
ENDIF()
endforeach()
IF(DEFINED Kokkos_Arch)
IF(DEFINED KOKKOS_ARCH)
IF(NOT (${KOKKOS_ARCH} STREQUAL "${Kokkos_Arch}"))
MESSAGE(FATAL_ERROR "Defined both Kokkos_Arch and KOKKOS_ARCH and they differ!")
ENDIF()
ELSE()
SET(KOKKOS_ARCH ${Kokkos_Arch})
ENDIF()
ENDIF()
#-------------------------------------------------------------------------------
# List of possible host architectures.
#-------------------------------------------------------------------------------
set(KOKKOS_ARCH_LIST)
list(APPEND KOKKOS_ARCH_LIST
None # No architecture optimization
AMDAVX # (HOST) AMD chip
ARMv80 # (HOST) ARMv8.0 Compatible CPU
ARMv81 # (HOST) ARMv8.1 Compatible CPU
ARMv8-ThunderX # (HOST) ARMv8 Cavium ThunderX CPU
WSM # (HOST) Intel Westmere CPU
SNB # (HOST) Intel Sandy/Ivy Bridge CPUs
HSW # (HOST) Intel Haswell CPUs
BDW # (HOST) Intel Broadwell Xeon E-class CPUs
SKX # (HOST) Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
KNC # (HOST) Intel Knights Corner Xeon Phi
KNL # (HOST) Intel Knights Landing Xeon Phi
BGQ # (HOST) IBM Blue Gene Q
Power7 # (HOST) IBM POWER7 CPUs
Power8 # (HOST) IBM POWER8 CPUs
Power9 # (HOST) IBM POWER9 CPUs
Kepler # (GPU) NVIDIA Kepler default (generation CC 3.5)
Kepler30 # (GPU) NVIDIA Kepler generation CC 3.0
Kepler32 # (GPU) NVIDIA Kepler generation CC 3.2
Kepler35 # (GPU) NVIDIA Kepler generation CC 3.5
Kepler37 # (GPU) NVIDIA Kepler generation CC 3.7
Maxwell # (GPU) NVIDIA Maxwell default (generation CC 5.0)
Maxwell50 # (GPU) NVIDIA Maxwell generation CC 5.0
Maxwell52 # (GPU) NVIDIA Maxwell generation CC 5.2
Maxwell53 # (GPU) NVIDIA Maxwell generation CC 5.3
Pascal60 # (GPU) NVIDIA Pascal generation CC 6.0
Pascal61 # (GPU) NVIDIA Pascal generation CC 6.1
)
# List of possible device architectures.
# The case and spelling here needs to match Makefile.kokkos
set(KOKKOS_DEVICES_LIST)
# Options: Cuda,ROCm,OpenMP,Pthread,Qthreads,Serial
list(APPEND KOKKOS_DEVICES_LIST
Cuda # NVIDIA GPU -- see below
OpenMP # OpenMP
Pthread # pthread
Qthreads # qthreads
Serial # serial
ROCm # AMD GPU
)
# List of possible TPLs for Kokkos
# From Makefile.kokkos: Options: hwloc,librt,experimental_memkind
set(KOKKOS_USE_TPLS_LIST)
list(APPEND KOKKOS_USE_TPLS_LIST
HWLOC # hwloc
LIBRT # librt
MEMKIND # experimental_memkind
)
# Map of cmake variables to Makefile variables
set(KOKKOS_INTERNAL_HWLOC hwloc)
set(KOKKOS_INTERNAL_LIBRT librt)
set(KOKKOS_INTERNAL_MEMKIND experimental_memkind)
# List of possible Advanced options
set(KOKKOS_OPTIONS_LIST)
list(APPEND KOKKOS_OPTIONS_LIST
AGGRESSIVE_VECTORIZATION
DISABLE_PROFILING
DISABLE_DUALVIEW_MODIFY_CHECK
ENABLE_PROFILE_LOAD_PRINT
)
# Map of cmake variables to Makefile variables
set(KOKKOS_INTERNAL_LDG_INTRINSIC use_ldg)
set(KOKKOS_INTERNAL_UVM force_uvm)
set(KOKKOS_INTERNAL_RELOCATABLE_DEVICE_CODE rdc)
#-------------------------------------------------------------------------------
# List of possible Options for CUDA
#-------------------------------------------------------------------------------
# From Makefile.kokkos: Options: use_ldg,force_uvm,rdc
set(KOKKOS_CUDA_OPTIONS_LIST)
list(APPEND KOKKOS_CUDA_OPTIONS_LIST
LDG_INTRINSIC # use_ldg
UVM # force_uvm
RELOCATABLE_DEVICE_CODE # rdc
LAMBDA # enable_lambda
)
# Map of cmake variables to Makefile variables
set(KOKKOS_INTERNAL_LDG_INTRINSIC use_ldg)
set(KOKKOS_INTERNAL_UVM force_uvm)
set(KOKKOS_INTERNAL_RELOCATABLE_DEVICE_CODE rdc)
set(KOKKOS_INTERNAL_LAMBDA enable_lambda)
#-------------------------------------------------------------------------------
#------------------------------- Create doc strings ----------------------------
#-------------------------------------------------------------------------------
set(tmpr "\n ")
string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_ARCH_DOCSTR "${KOKKOS_ARCH_LIST}")
# This would be useful, but we use Foo_ENABLE mechanisms
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_DEVICES_DOCSTR "${KOKKOS_DEVICES_LIST}")
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_USE_TPLS_DOCSTR "${KOKKOS_USE_TPLS_LIST}")
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_CUDA_OPTIONS_DOCSTR "${KOKKOS_CUDA_OPTIONS_LIST}")
#-------------------------------------------------------------------------------
#------------------------------- GENERAL OPTIONS -------------------------------
#-------------------------------------------------------------------------------
# Setting this variable to a value other than "None" can improve host
# performance by turning on architecture specific code.
# NOT SET is used to determine if the option is passed in. It is reset to
# default "None" down below.
set(KOKKOS_ARCH "NOT_SET" CACHE STRING
"Optimize for specific host architecture. Options are: ${KOKKOS_INTERNAL_ARCH_DOCSTR}")
# Whether to build separate libraries or not
set(KOKKOS_SEPARATE_LIBS OFF CACHE BOOL "OFF = kokkos. ON = kokkoscore, kokkoscontainers, and kokkosalgorithms.")
# Qthreads options.
set(KOKKOS_QTHREADS_DIR "" CACHE PATH "Location of Qthreads library.")
#-------------------------------------------------------------------------------
#------------------------------- KOKKOS_DEVICES --------------------------------
#-------------------------------------------------------------------------------
# Figure out default settings
IF(Trilinos_ENABLE_Kokkos)
set_kokkos_default_default(SERIAL ON)
set_kokkos_default_default(PTHREAD OFF)
IF(TPL_ENABLE_QTHREAD)
set_kokkos_default_default(QTHREADS ${TPL_ENABLE_QTHREAD})
ELSE()
set_kokkos_default_default(QTHREADS OFF)
ENDIF()
IF(Trilinos_ENABLE_OpenMP)
set_kokkos_default_default(OPENMP ${Trilinos_ENABLE_OpenMP})
ELSE()
set_kokkos_default_default(OPENMP OFF)
ENDIF()
IF(TPL_ENABLE_CUDA)
set_kokkos_default_default(CUDA ${TPL_ENABLE_CUDA})
ELSE()
set_kokkos_default_default(CUDA OFF)
ENDIF()
set_kokkos_default_default(ROCM OFF)
ELSE()
set_kokkos_default_default(SERIAL ON)
set_kokkos_default_default(OPENMP OFF)
set_kokkos_default_default(PTHREAD OFF)
set_kokkos_default_default(QTHREADS OFF)
set_kokkos_default_default(CUDA OFF)
set_kokkos_default_default(ROCM OFF)
ENDIF()
# Set which Kokkos backend to use.
# These are the actual options that define the settings.
set(KOKKOS_ENABLE_SERIAL ${KOKKOS_INTERNAL_ENABLE_SERIAL_DEFAULT} CACHE BOOL "Whether to enable the Kokkos::Serial device. This device executes \"parallel\" kernels sequentially on a single CPU thread. It is enabled by default. If you disable this device, please enable at least one other CPU device, such as Kokkos::OpenMP or Kokkos::Threads.")
set(KOKKOS_ENABLE_OPENMP ${KOKKOS_INTERNAL_ENABLE_OPENMP_DEFAULT} CACHE BOOL "Enable OpenMP support in Kokkos." FORCE)
set(KOKKOS_ENABLE_PTHREAD ${KOKKOS_INTERNAL_ENABLE_PTHREAD_DEFAULT} CACHE BOOL "Enable Pthread support in Kokkos.")
set(KOKKOS_ENABLE_QTHREADS ${KOKKOS_INTERNAL_ENABLE_QTHREADS_DEFAULT} CACHE BOOL "Enable Qthreads support in Kokkos.")
set(KOKKOS_ENABLE_CUDA ${KOKKOS_INTERNAL_ENABLE_CUDA_DEFAULT} CACHE BOOL "Enable CUDA support in Kokkos.")
set(KOKKOS_ENABLE_ROCM ${KOKKOS_INTERNAL_ENABLE_ROCM_DEFAULT} CACHE BOOL "Enable ROCm support in Kokkos.")
#-------------------------------------------------------------------------------
#------------------------------- KOKKOS DEBUG and PROFILING --------------------
#-------------------------------------------------------------------------------
# Debug related options enable compiler warnings
set_kokkos_default_default(DEBUG OFF)
set(KOKKOS_ENABLE_DEBUG ${KOKKOS_INTERNAL_ENABLE_DEBUG_DEFAULT} CACHE BOOL "Enable Kokkos Debug.")
# From Makefile.kokkos: Advanced Options:
#compiler_warnings, aggressive_vectorization, disable_profiling, disable_dualview_modify_check, enable_profile_load_print
set_kokkos_default_default(COMPILER_WARNINGS OFF)
set(KOKKOS_ENABLE_COMPILER_WARNINGS ${KOKKOS_INTERNAL_ENABLE_COMPILER_WARNINGS_DEFAULT} CACHE BOOL "Enable compiler warnings.")
set_kokkos_default_default(DEBUG_DUALVIEW_MODIFY_CHECK OFF)
set(KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK ${KOKKOS_INTERNAL_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK_DEFAULT} CACHE BOOL "Enable dualview modify check.")
# Enable aggressive vectorization.
set_kokkos_default_default(AGGRESSIVE_VECTORIZATION OFF)
set(KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION ${KOKKOS_INTERNAL_ENABLE_AGGRESSIVE_VECTORIZATION_DEFAULT} CACHE BOOL "Enable aggressive vectorization.")
# Enable profiling.
set_kokkos_default_default(PROFILING ON)
set(KOKKOS_ENABLE_PROFILING ${KOKKOS_INTERNAL_ENABLE_PROFILING_DEFAULT} CACHE BOOL "Enable profiling.")
set_kokkos_default_default(PROFILING_LOAD_PRINT OFF)
set(KOKKOS_ENABLE_PROFILING_LOAD_PRINT ${KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_PRINT_DEFAULT} CACHE BOOL "Enable profile load print.")
#-------------------------------------------------------------------------------
#------------------------------- KOKKOS_USE_TPLS -------------------------------
#-------------------------------------------------------------------------------
# Enable hwloc library.
# Figure out default:
IF(Trilinos_ENABLE_Kokkos AND TPL_ENABLE_HWLOC)
set_kokkos_default_default(HWLOC ON)
ELSE()
set_kokkos_default_default(HWLOC OFF)
ENDIF()
set(KOKKOS_ENABLE_HWLOC ${KOKKOS_INTERNAL_ENABLE_HWLOC_DEFAULT} CACHE BOOL "Enable hwloc for better process placement.")
set(KOKKOS_HWLOC_DIR "" CACHE PATH "Location of hwloc library. (kokkos tpl)")
# Enable memkind library.
set_kokkos_default_default(MEMKIND OFF)
set(KOKKOS_ENABLE_MEMKIND ${KOKKOS_INTERNAL_ENABLE_MEMKIND_DEFAULT} CACHE BOOL "Enable memkind. (kokkos tpl)")
set(KOKKOS_MEMKIND_DIR "" CACHE PATH "Location of memkind library. (kokkos tpl)")
# Enable rt library.
IF(Trilinos_ENABLE_Kokkos)
IF(DEFINED TPL_ENABLE_LIBRT)
set_kokkos_default_default(LIBRT ${TPL_ENABLE_LIBRT})
ELSE()
set_kokkos_default_default(LIBRT OFF)
ENDIF()
ELSE()
set_kokkos_default_default(LIBRT ON)
ENDIF()
set(KOKKOS_ENABLE_LIBRT ${KOKKOS_INTERNAL_ENABLE_LIBRT_DEFAULT} CACHE BOOL "Enable librt for more precise timer. (kokkos tpl)")
#-------------------------------------------------------------------------------
#------------------------------- KOKKOS_CUDA_OPTIONS ---------------------------
#-------------------------------------------------------------------------------
# CUDA options.
# Set Defaults
set_kokkos_default_default(CUDA_LDG_INTRINSIC OFF)
set_kokkos_default_default(CUDA_UVM OFF)
set_kokkos_default_default(CUDA_RELOCATABLE_DEVICE_CODE OFF)
IF(Trilinos_ENABLE_Kokkos)
IF(KOKKOS_ENABLE_CUDA)
find_package(CUDA)
ENDIF()
IF (DEFINED CUDA_VERSION)
IF (CUDA_VERSION VERSION_GREATER "7.0")
set_kokkos_default_default(CUDA_LAMBDA ON)
ELSE()
set_kokkos_default_default(CUDA_LAMBDA OFF)
ENDIF()
ENDIF()
ELSE()
set_kokkos_default_default(CUDA_LAMBDA OFF)
ENDIF()
# Set actual options
set(KOKKOS_CUDA_DIR "" CACHE PATH "Location of CUDA library. Defaults to where nvcc installed.")
set(KOKKOS_ENABLE_CUDA_LDG_INTRINSIC ${KOKKOS_INTERNAL_ENABLE_CUDA_LDG_INTRINSIC_DEFAULT} CACHE BOOL "Enable CUDA LDG. (cuda option)")
set(KOKKOS_ENABLE_CUDA_UVM ${KOKKOS_INTERNAL_ENABLE_CUDA_UVM_DEFAULT} CACHE BOOL "Enable CUDA unified virtual memory.")
set(KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE ${KOKKOS_INTERNAL_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE_DEFAULT} CACHE BOOL "Enable relocatable device code for CUDA. (cuda option)")
set(KOKKOS_ENABLE_CUDA_LAMBDA ${KOKKOS_INTERNAL_ENABLE_CUDA_LAMBDA_DEFAULT} CACHE BOOL "Enable lambdas for CUDA. (cuda option)")
#-------------------------------------------------------------------------------
#----------------------- HOST ARCH AND LEGACY TRIBITS --------------------------
#-------------------------------------------------------------------------------
# This defines the previous legacy TriBITS builds.
set(KOKKOS_LEGACY_TRIBITS False)
IF ("${KOKKOS_ARCH}" STREQUAL "NOT_SET")
set(KOKKOS_ARCH "None")
IF(KOKKOS_HAS_TRILINOS)
set(KOKKOS_LEGACY_TRIBITS True)
ENDIF()
ENDIF()
IF (KOKKOS_HAS_TRILINOS)
IF (KOKKOS_LEGACY_TRIBITS)
message(STATUS "Using the legacy tribits build because KOKKOS_ARCH not set")
ELSE()
message(STATUS "NOT using the legacy tribits build because KOKKOS_ARCH *is* set")
ENDIF()
ENDIF()
#-------------------------------------------------------------------------------
#----------------------- Set CamelCase Options if they are not yet set ---------
#-------------------------------------------------------------------------------
foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
string(TOUPPER ${opt} OPT )
UNSET(KOKKOS_ENABLE_${OPT}_INTERNAL CACHE)
SET(KOKKOS_ENABLE_${OPT}_INTERNAL ${KOKKOS_ENABLE_${OPT}} CACHE BOOL INTERNAL)
IF(DEFINED KOKKOS_ENABLE_${OPT})
UNSET(Kokkos_ENABLE_${opt} CACHE)
SET(Kokkos_ENABLE_${opt} ${KOKKOS_ENABLE_${OPT}} CACHE BOOL "CamelCase Compatibility setting for KOKKOS_ENABLE_${OPT}")
ENDIF()
endforeach()
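# Illustrative mapping (not part of the original file): configuring with
#   cmake -DKokkos_ENABLE_OpenMP=ON ...
# leaves both Kokkos_ENABLE_OpenMP=ON and KOKKOS_ENABLE_OPENMP=ON defined
# after this loop.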


@@ -0,0 +1,257 @@
########################## NOTES ###############################################
# This file's goal is to take the CMake options found in kokkos_options.cmake,
# but possibly set from elsewhere
# (see: trilinos/cmake/ProjectCompilerPostConfig.cmake),
# using CMake idioms and map them onto the KOKKOS_SETTINGS variable that gets
# passed to the kokkos makefile configuration:
# make -f ${CMAKE_SOURCE_DIR}/core/src/Makefile ${KOKKOS_SETTINGS} build-makefile-cmake-kokkos
# that generates KokkosCore_config.h and kokkos_generated_settings.cmake
# To understand how to form KOKKOS_SETTINGS, see
# <KOKKOS_PATH>/Makefile.kokkos
#-------------------------------------------------------------------------------
#------------------------------- GENERAL OPTIONS -------------------------------
#-------------------------------------------------------------------------------
# Ensure that KOKKOS_ARCH is in the ARCH_LIST
foreach(arch ${KOKKOS_ARCH})
list(FIND KOKKOS_ARCH_LIST ${arch} indx)
if (indx EQUAL -1)
message(FATAL_ERROR "${arch} is not an accepted value for KOKKOS_ARCH."
" Please pick from these choices: ${KOKKOS_INTERNAL_ARCH_DOCSTR}")
endif ()
endforeach()
# KOKKOS_SETTINGS uses KOKKOS_ARCH
string(REPLACE ";" "," KOKKOS_ARCH "${KOKKOS_ARCH}")
set(KOKKOS_ARCH ${KOKKOS_ARCH})
# From Makefile.kokkos: Options: yes,no
if(${KOKKOS_ENABLE_DEBUG})
set(KOKKOS_DEBUG yes)
else()
set(KOKKOS_DEBUG no)
endif()
#------------------------------- KOKKOS_DEVICES --------------------------------
# Can have multiple devices
set(KOKKOS_DEVICESl)
foreach(devopt ${KOKKOS_DEVICES_LIST})
string(TOUPPER ${devopt} devoptuc)
if (${KOKKOS_ENABLE_${devoptuc}})
list(APPEND KOKKOS_DEVICESl ${devopt})
endif ()
endforeach()
# List needs to be comma-delimited
string(REPLACE ";" "," KOKKOS_DEVICES "${KOKKOS_DEVICESl}")
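# Illustrative result (not part of the original file): with Serial and OpenMP
# enabled, KOKKOS_DEVICES becomes "OpenMP,Serial" (following the order of
# KOKKOS_DEVICES_LIST).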
#------------------------------- KOKKOS_OPTIONS --------------------------------
# From Makefile.kokkos: Options: aggressive_vectorization,disable_profiling
#compiler_warnings, aggressive_vectorization, disable_profiling, disable_dualview_modify_check, enable_profile_load_print
set(KOKKOS_OPTIONSl)
if(${KOKKOS_ENABLE_COMPILER_WARNINGS})
list(APPEND KOKKOS_OPTIONSl compiler_warnings)
endif()
if(${KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION})
list(APPEND KOKKOS_OPTIONSl aggressive_vectorization)
endif()
if(NOT ${KOKKOS_ENABLE_PROFILING})
list(APPEND KOKKOS_OPTIONSl disable_profiling)
endif()
if(NOT ${KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK})
list(APPEND KOKKOS_OPTIONSl disable_dualview_modify_check)
endif()
if(${KOKKOS_ENABLE_PROFILING_LOAD_PRINT})
list(APPEND KOKKOS_OPTIONSl enable_profile_load_print)
endif()
# List needs to be comma-delimited
string(REPLACE ";" "," KOKKOS_OPTIONS "${KOKKOS_OPTIONSl}")
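The conditional accumulation and comma-join above can be illustrated with a small shell sketch (the toggle names and values here are hypothetical stand-ins for the CMake variables, not the real configure path):

```shell
# Hypothetical sketch of the option accumulation: append an option only
# when its feature toggle is set, then join the list with commas as
# Makefile.kokkos expects (e.g. KOKKOS_OPTIONS=compiler_warnings,...).
ENABLE_COMPILER_WARNINGS=yes
ENABLE_PROFILING=no
opts=""
if [ "$ENABLE_COMPILER_WARNINGS" = yes ]; then opts="$opts compiler_warnings"; fi
if [ "$ENABLE_PROFILING" != yes ]; then opts="$opts disable_profiling"; fi
echo "$opts" | sed 's/^ //; s/ /,/g'   # prints compiler_warnings,disable_profiling
```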
#------------------------------- KOKKOS_USE_TPLS -------------------------------
# Construct the Makefile options
set(KOKKOS_USE_TPLSl)
foreach(tplopt ${KOKKOS_USE_TPLS_LIST})
if (${KOKKOS_ENABLE_${tplopt}})
list(APPEND KOKKOS_USE_TPLSl ${KOKKOS_INTERNAL_${tplopt}})
endif ()
endforeach()
# List needs to be comma-delimited
string(REPLACE ";" "," KOKKOS_USE_TPLS "${KOKKOS_USE_TPLSl}")
#------------------------------- KOKKOS_CUDA_OPTIONS ---------------------------
# Construct the Makefile options
set(KOKKOS_CUDA_OPTIONSl)
foreach(cudaopt ${KOKKOS_CUDA_OPTIONS_LIST})
if (${KOKKOS_ENABLE_CUDA_${cudaopt}})
list(APPEND KOKKOS_CUDA_OPTIONSl ${KOKKOS_INTERNAL_${cudaopt}})
endif ()
endforeach()
# List needs to be comma-delimited
string(REPLACE ";" "," KOKKOS_CUDA_OPTIONS "${KOKKOS_CUDA_OPTIONSl}")
#------------------------------- PATH VARIABLES --------------------------------
# We want the Makefile to use the same executables specified here, which means
# modifying PATH so the $(shell ...) commands in the Makefile see the right
# executables. Also, the Makefiles use a FOO_PATH naming scheme for -I/-L construction.
#TODO: Makefile.kokkos allows this to be overwritten? ROCM_HCC_PATH
set(KOKKOS_INTERNAL_PATHS)
set(addpathl)
foreach(kvar CUDA QTHREADS ${KOKKOS_USE_TPLS_LIST})
if(${KOKKOS_ENABLE_${kvar}})
if(DEFINED KOKKOS_${kvar}_DIR)
set(KOKKOS_INTERNAL_PATHS "${KOKKOS_INTERNAL_PATHS} ${kvar}_PATH=${KOKKOS_${kvar}_DIR}")
if(IS_DIRECTORY ${KOKKOS_${kvar}_DIR}/bin)
list(APPEND addpathl ${KOKKOS_${kvar}_DIR}/bin)
endif()
endif()
endif()
endforeach()
# The PATH environment variable is colon-delimited
string(REPLACE ";" ":" KOKKOS_INTERNAL_ADDTOPATH "${addpathl}")
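The re-join above exists because CMake lists use `;` separators while PATH entries must be `:` separated. A minimal shell sketch of the same transformation (the directories are hypothetical examples):

```shell
# Re-join a ';'-separated CMake-style list into a ':'-separated PATH
# fragment, mirroring the string(REPLACE ...) call above.
dirs="/opt/cuda/bin;/opt/qthreads/bin"   # hypothetical directories
echo "${dirs//;/:}"                      # prints /opt/cuda/bin:/opt/qthreads/bin
```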
######################### SET KOKKOS_SETTINGS ##################################
# Set the KOKKOS_SETTINGS String -- this is the primary communication with the
# makefile configuration. See Makefile.kokkos
set(KOKKOS_SETTINGS KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_PATH=${KOKKOS_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_INSTALL_PATH=${CMAKE_INSTALL_PREFIX})
# Form of KOKKOS_foo=$KOKKOS_foo
foreach(kvar ARCH;DEVICES;DEBUG;OPTIONS;CUDA_OPTIONS;USE_TPLS)
set(KOKKOS_VAR KOKKOS_${kvar})
if(DEFINED KOKKOS_${kvar})
if (NOT "${${KOKKOS_VAR}}" STREQUAL "")
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} ${KOKKOS_VAR}=${${KOKKOS_VAR}})
endif()
endif()
endforeach()
# Form of VAR=VAL
#TODO: Makefile supports MPICH_CXX, OMPI_CXX as well
foreach(ovar CXX;CXXFLAGS;LDFLAGS)
if(DEFINED ${ovar})
if (NOT "${${ovar}}" STREQUAL "")
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} ${ovar}=${${ovar}})
endif()
endif()
endforeach()
# Finally, do the paths
if (NOT "${KOKKOS_INTERNAL_PATHS}" STREQUAL "")
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} ${KOKKOS_INTERNAL_PATHS})
endif()
if (NOT "${KOKKOS_INTERNAL_ADDTOPATH}" STREQUAL "")
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} PATH=${KOKKOS_INTERNAL_ADDTOPATH}:\${PATH})
endif()
# Final form that gets passed to make
set(KOKKOS_SETTINGS env ${KOKKOS_SETTINGS})
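The `env` prefix means the variables apply only to the spawned `make` process, not to the calling shell. A sketch of that behavior (GREETING is a hypothetical placeholder variable and `sh -c` stands in for `make`):

```shell
# Variables passed through `env` are visible only to the child command;
# the parent environment is left untouched.
env GREETING=hello sh -c 'echo "$GREETING"'   # prints hello
```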
############################ PRINT CONFIGURE STATUS ############################
if(KOKKOS_CMAKE_VERBOSE)
message(STATUS "")
message(STATUS "****************** Kokkos Settings ******************")
message(STATUS "Execution Spaces")
if(KOKKOS_ENABLE_CUDA)
message(STATUS " Device Parallel: Cuda")
else()
message(STATUS " Device Parallel: None")
endif()
if(KOKKOS_ENABLE_OPENMP)
message(STATUS " Host Parallel: OpenMP")
elseif(KOKKOS_ENABLE_PTHREAD)
message(STATUS " Host Parallel: Pthread")
elseif(KOKKOS_ENABLE_QTHREADS)
message(STATUS " Host Parallel: Qthreads")
else()
message(STATUS " Host Parallel: None")
endif()
if(KOKKOS_ENABLE_SERIAL)
message(STATUS " Host Serial: Serial")
else()
message(STATUS " Host Serial: None")
endif()
message(STATUS "")
message(STATUS "Architectures:")
message(STATUS " ${KOKKOS_ARCH}")
message(STATUS "")
message(STATUS "Enabled options")
if(KOKKOS_SEPARATE_LIBS)
message(STATUS " KOKKOS_SEPARATE_LIBS")
endif()
if(KOKKOS_ENABLE_HWLOC)
message(STATUS " KOKKOS_ENABLE_HWLOC")
endif()
if(KOKKOS_ENABLE_MEMKIND)
message(STATUS " KOKKOS_ENABLE_MEMKIND")
endif()
if(KOKKOS_ENABLE_DEBUG)
message(STATUS " KOKKOS_ENABLE_DEBUG")
endif()
if(KOKKOS_ENABLE_PROFILING)
message(STATUS " KOKKOS_ENABLE_PROFILING")
endif()
if(KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION)
message(STATUS " KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION")
endif()
if(KOKKOS_ENABLE_CUDA)
if(KOKKOS_ENABLE_CUDA_LDG_INTRINSIC)
message(STATUS " KOKKOS_ENABLE_CUDA_LDG_INTRINSIC")
endif()
if(KOKKOS_ENABLE_CUDA_UVM)
message(STATUS " KOKKOS_ENABLE_CUDA_UVM")
endif()
if(KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE)
message(STATUS " KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE")
endif()
if(KOKKOS_ENABLE_CUDA_LAMBDA)
message(STATUS " KOKKOS_ENABLE_CUDA_LAMBDA")
endif()
if(KOKKOS_CUDA_DIR)
message(STATUS " KOKKOS_CUDA_DIR: ${KOKKOS_CUDA_DIR}")
endif()
endif()
if(KOKKOS_QTHREADS_DIR)
message(STATUS " KOKKOS_QTHREADS_DIR: ${KOKKOS_QTHREADS_DIR}")
endif()
if(KOKKOS_HWLOC_DIR)
message(STATUS " KOKKOS_HWLOC_DIR: ${KOKKOS_HWLOC_DIR}")
endif()
if(KOKKOS_MEMKIND_DIR)
message(STATUS " KOKKOS_MEMKIND_DIR: ${KOKKOS_MEMKIND_DIR}")
endif()
message(STATUS "")
message(STATUS "Final kokkos settings variable:")
message(STATUS " ${KOKKOS_SETTINGS}")
message(STATUS "*****************************************************")
message(STATUS "")
endif()

View File

@ -3,10 +3,6 @@ INCLUDE(CTest)
cmake_policy(SET CMP0054 NEW)
IF(NOT DEFINED ${PROJECT_NAME})
project(KokkosCMake)
ENDIF()
MESSAGE(WARNING "The project name is: ${PROJECT_NAME}")
IF(NOT DEFINED ${PROJECT_NAME}_ENABLE_OpenMP)
@ -46,26 +42,26 @@ MACRO(PREPEND_GLOBAL_SET VARNAME)
GLOBAL_SET(${VARNAME} ${ARGN} ${${VARNAME}})
ENDMACRO()
FUNCTION(REMOVE_GLOBAL_DUPLICATES VARNAME)
ASSERT_DEFINED(${VARNAME})
IF (${VARNAME})
SET(TMP ${${VARNAME}})
LIST(REMOVE_DUPLICATES TMP)
GLOBAL_SET(${VARNAME} ${TMP})
ENDIF()
ENDFUNCTION()
#FUNCTION(REMOVE_GLOBAL_DUPLICATES VARNAME)
# ASSERT_DEFINED(${VARNAME})
# IF (${VARNAME})
# SET(TMP ${${VARNAME}})
# LIST(REMOVE_DUPLICATES TMP)
# GLOBAL_SET(${VARNAME} ${TMP})
# ENDIF()
#ENDFUNCTION()
MACRO(TRIBITS_ADD_OPTION_AND_DEFINE USER_OPTION_NAME MACRO_DEFINE_NAME DOCSTRING DEFAULT_VALUE)
MESSAGE(STATUS "TRIBITS_ADD_OPTION_AND_DEFINE: '${USER_OPTION_NAME}' '${MACRO_DEFINE_NAME}' '${DEFAULT_VALUE}'")
SET( ${USER_OPTION_NAME} "${DEFAULT_VALUE}" CACHE BOOL "${DOCSTRING}" )
IF(NOT ${MACRO_DEFINE_NAME} STREQUAL "")
IF(${USER_OPTION_NAME})
GLOBAL_SET(${MACRO_DEFINE_NAME} ON)
ELSE()
GLOBAL_SET(${MACRO_DEFINE_NAME} OFF)
ENDIF()
ENDIF()
ENDMACRO()
#MACRO(TRIBITS_ADD_OPTION_AND_DEFINE USER_OPTION_NAME MACRO_DEFINE_NAME DOCSTRING DEFAULT_VALUE)
# MESSAGE(STATUS "TRIBITS_ADD_OPTION_AND_DEFINE: '${USER_OPTION_NAME}' '${MACRO_DEFINE_NAME}' '${DEFAULT_VALUE}'")
# SET( ${USER_OPTION_NAME} "${DEFAULT_VALUE}" CACHE BOOL "${DOCSTRING}" )
# IF(NOT ${MACRO_DEFINE_NAME} STREQUAL "")
# IF(${USER_OPTION_NAME})
# GLOBAL_SET(${MACRO_DEFINE_NAME} ON)
# ELSE()
# GLOBAL_SET(${MACRO_DEFINE_NAME} OFF)
# ENDIF()
# ENDIF()
#ENDMACRO()
FUNCTION(TRIBITS_CONFIGURE_FILE PACKAGE_NAME_CONFIG_FILE)
@ -77,17 +73,20 @@ FUNCTION(TRIBITS_CONFIGURE_FILE PACKAGE_NAME_CONFIG_FILE)
ENDFUNCTION()
MACRO(TRIBITS_ADD_DEBUG_OPTION)
TRIBITS_ADD_OPTION_AND_DEFINE(
${PROJECT_NAME}_ENABLE_DEBUG
HAVE_${PROJECT_NAME_UC}_DEBUG
"Enable a host of runtime debug checking."
OFF
)
ENDMACRO()
#MACRO(TRIBITS_ADD_DEBUG_OPTION)
# TRIBITS_ADD_OPTION_AND_DEFINE(
# ${PROJECT_NAME}_ENABLE_DEBUG
# HAVE_${PROJECT_NAME_UC}_DEBUG
# "Enable a host of runtime debug checking."
# OFF
# )
#ENDMACRO()
MACRO(TRIBITS_ADD_TEST_DIRECTORIES)
message(STATUS "ProjectName: " ${PROJECT_NAME})
message(STATUS "Tests: " ${${PROJECT_NAME}_ENABLE_TESTS})
IF(${${PROJECT_NAME}_ENABLE_TESTS})
FOREACH(TEST_DIR ${ARGN})
ADD_SUBDIRECTORY(${TEST_DIR})
@ -387,17 +386,17 @@ FUNCTION(TRIBITS_TPL_FIND_INCLUDE_DIRS_AND_LIBRARIES TPL_NAME)
ENDFUNCTION()
MACRO(TRIBITS_PROCESS_TPL_DEP_FILE TPL_FILE)
GET_FILENAME_COMPONENT(TPL_NAME ${TPL_FILE} NAME_WE)
INCLUDE("${TPL_FILE}")
IF(TARGET TPL_LIB_${TPL_NAME})
MESSAGE(STATUS "Found tpl library: ${TPL_NAME}")
SET(TPL_ENABLE_${TPL_NAME} TRUE)
ELSE()
MESSAGE(STATUS "Tpl library not found: ${TPL_NAME}")
SET(TPL_ENABLE_${TPL_NAME} FALSE)
ENDIF()
ENDMACRO()
#MACRO(TRIBITS_PROCESS_TPL_DEP_FILE TPL_FILE)
# GET_FILENAME_COMPONENT(TPL_NAME ${TPL_FILE} NAME_WE)
# INCLUDE("${TPL_FILE}")
# IF(TARGET TPL_LIB_${TPL_NAME})
# MESSAGE(STATUS "Found tpl library: ${TPL_NAME}")
# SET(TPL_ENABLE_${TPL_NAME} TRUE)
# ELSE()
# MESSAGE(STATUS "Tpl library not found: ${TPL_NAME}")
# SET(TPL_ENABLE_${TPL_NAME} FALSE)
# ENDIF()
#ENDMACRO()
MACRO(PREPEND_TARGET_SET VARNAME TARGET_NAME TYPE)
IF(TYPE STREQUAL "REQUIRED")
@ -475,6 +474,7 @@ MACRO(TRIBITS_SUBPACKAGE NAME)
SET(PARENT_PACKAGE_NAME ${PACKAGE_NAME})
SET(PACKAGE_NAME ${PACKAGE_NAME}${NAME})
STRING(TOUPPER ${PACKAGE_NAME} PACKAGE_NAME_UC)
SET(${PACKAGE_NAME}_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
ADD_INTERFACE_LIBRARY(PACKAGE_${PACKAGE_NAME})
@ -494,11 +494,11 @@ MACRO(TRIBITS_PACKAGE_DECL NAME)
SET(${PACKAGE_NAME}_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
STRING(TOUPPER ${PACKAGE_NAME} PACKAGE_NAME_UC)
SET(TRIBITS_DEPS_DIR "${CMAKE_SOURCE_DIR}/cmake/deps")
FILE(GLOB TPLS_FILES "${TRIBITS_DEPS_DIR}/*.cmake")
FOREACH(TPL_FILE ${TPLS_FILES})
TRIBITS_PROCESS_TPL_DEP_FILE(${TPL_FILE})
ENDFOREACH()
#SET(TRIBITS_DEPS_DIR "${CMAKE_SOURCE_DIR}/cmake/deps")
#FILE(GLOB TPLS_FILES "${TRIBITS_DEPS_DIR}/*.cmake")
#FOREACH(TPL_FILE ${TPLS_FILES})
# TRIBITS_PROCESS_TPL_DEP_FILE(${TPL_FILE})
#ENDFOREACH()
ENDMACRO()

View File

@ -10,3 +10,5 @@ tag: 2.03.05 date: 05:27:2017 master: 36b92f43 develop: 79073186
tag: 2.03.13 date: 07:27:2017 master: da314444 develop: 29ccb58a
tag: 2.04.00 date: 08:16:2017 master: 54eb75c0 develop: 32fb8ee1
tag: 2.04.04 date: 09:11:2017 master: 2b7e9c20 develop: 51e7b25a
tag: 2.04.11 date: 10:28:2017 master: 54a1330a develop: ed36c017
tag: 2.5.00 date: 12:15:2017 master: dfe685f4 develop: ec7ad6d8

View File

@ -39,6 +39,12 @@ cuda_args=""
# Arguments for both NVCC and Host compiler
shared_args=""
# Argument -c
compile_arg=""
# Argument -o <obj>
output_arg=""
# Linker arguments
xlinker_args=""
@ -66,6 +72,7 @@ dry_run=0
# Skip NVCC compilation and use host compiler directly
host_only=0
host_only_args=""
# Enable workaround for CUDA 6.5 for pragma ident
replace_pragma_ident=0
@ -78,6 +85,14 @@ temp_dir=${TMPDIR:-/tmp}
# Check if we have an optimization argument already
optimization_applied=0
# Check if we have -std=c++X or --std=c++X already
stdcxx_applied=0
# Run nvcc a second time to generate dependencies if needed
depfile_separate=0
depfile_output_arg=""
depfile_target_arg=""
#echo "Arguments: $# $@"
while [ $# -gt 0 ]
@ -109,12 +124,31 @@ do
fi
;;
#Handle shared args (valid for both nvcc and the host compiler)
-D*|-c|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
-D*|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
shared_args="$shared_args $1"
;;
#Handle shared args that have an argument
-o|-MT)
shared_args="$shared_args $1 $2"
#Handle compilation argument
-c)
compile_arg="$1"
;;
#Handle output argument
-o)
output_arg="$output_arg $1 $2"
shift
;;
# Handle depfile arguments. We map them to a separate call to nvcc.
-MD|-MMD)
depfile_separate=1
host_only_args="$host_only_args $1"
;;
-MF)
depfile_output_arg="-o $2"
host_only_args="$host_only_args $1 $2"
shift
;;
-MT)
depfile_target_arg="$1 $2"
host_only_args="$host_only_args $1 $2"
shift
;;
#Handle known nvcc args
@ -130,16 +164,25 @@ do
cuda_args="$cuda_args $1 $2"
shift
;;
#Handle c++11 setting
--std=c++11|-std=c++11)
shared_args="$shared_args $1"
#Handle c++11
--std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1z|-std=c++1z)
if [ $stdcxx_applied -eq 1 ]; then
echo "nvcc_wrapper - *warning* you have set multiple standard flags (-std=c++1* or --std=c++1*), only the first is used because nvcc can only accept a single std setting"
else
shared_args="$shared_args $1"
stdcxx_applied=1
fi
;;
#strip off -std=c++98 due to nvcc warnings; TriBITS may place both -std=c++11 and -std=c++98
-std=c++98|--std=c++98)
;;
#strip off -pedantic because it produces endless warnings about the #LINE directives added by the preprocessor
-pedantic|-Wpedantic|-ansi)
;;
#strip off -Woverloaded-virtual to avoid "cc1: warning: command line option -Woverloaded-virtual is valid for C++/ObjC++ but not for C"
-Woverloaded-virtual)
;;
#strip -Xcompiler because we add it
-Xcompiler)
if [ $first_xcompiler_arg -eq 1 ]; then
@ -190,7 +233,7 @@ do
object_files_xlinker="$object_files_xlinker -Xlinker $1"
;;
#Handle object files which always need to use "-Xlinker": -x cu applies to all input files, so give them to linker, except if only linking
*.dylib)
@*|*.dylib)
object_files="$object_files -Xlinker $1"
object_files_xlinker="$object_files_xlinker -Xlinker $1"
;;
@ -230,7 +273,7 @@ if [ $first_xcompiler_arg -eq 0 ]; then
fi
#Compose host only command
host_command="$host_compiler $shared_args $xcompiler_args $host_linker_args $shared_versioned_libraries_host"
host_command="$host_compiler $shared_args $host_only_args $compile_arg $output_arg $xcompiler_args $host_linker_args $shared_versioned_libraries_host"
#nvcc does not accept '#pragma ident SOME_MACRO_STRING' but it does accept '#ident SOME_MACRO_STRING'
if [ $replace_pragma_ident -eq 1 ]; then
@ -262,10 +305,21 @@ else
host_command="$host_command $object_files"
fi
if [ $depfile_separate -eq 1 ]; then
# run nvcc a second time to generate dependencies (without compiling)
nvcc_depfile_command="$nvcc_command -M $depfile_target_arg $depfile_output_arg"
else
nvcc_depfile_command=""
fi
nvcc_command="$nvcc_command $compile_arg $output_arg"
#Print command for dryrun
if [ $dry_run -eq 1 ]; then
if [ $host_only -eq 1 ]; then
echo $host_command
elif [ -n "$nvcc_depfile_command" ]; then
echo $nvcc_command "&&" $nvcc_depfile_command
else
echo $nvcc_command
fi
@ -275,6 +329,8 @@ fi
#Run compilation command
if [ $host_only -eq 1 ]; then
$host_command
elif [ -n "$nvcc_depfile_command" ]; then
$nvcc_command && $nvcc_depfile_command
else
$nvcc_command
fi
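The first-wins handling of `-std=c++1*` flags in the wrapper can be sketched in isolation (the argument list below is hypothetical):

```shell
# Keep only the first -std=c++1* flag, mirroring the wrapper's
# stdcxx_applied guard: nvcc accepts a single std setting.
applied=0
kept=""
for arg in -std=c++11 -std=c++14 -O2; do   # hypothetical arguments
  case "$arg" in
    -std=c++1*|--std=c++1*)
      if [ "$applied" -eq 0 ]; then
        kept="$kept $arg"
        applied=1
      fi
      ;;
    *) kept="$kept $arg" ;;
  esac
done
echo $kept   # prints -std=c++11 -O2 (the c++14 flag is dropped)
```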

View File

@ -16,12 +16,12 @@ if [[ "$HOSTNAME" =~ (white|ride).* ]]; then
MACHINE=white
elif [[ "$HOSTNAME" =~ .*bowman.* ]]; then
MACHINE=bowman
elif [[ "$HOSTNAME" =~ node.* ]]; then # Warning: very generic name
elif [[ "$HOSTNAME" =~ n.* ]]; then # Warning: very generic name
if [[ "$PROCESSOR" = "aarch64" ]]; then
MACHINE=sullivan
else
MACHINE=shepard
fi
elif [[ "$HOSTNAME" =~ node.* ]]; then # Warning: very generic name
MACHINE=shepard
elif [[ "$HOSTNAME" =~ apollo ]]; then
MACHINE=apollo
elif [[ "$HOSTNAME" =~ sullivan ]]; then
@ -45,7 +45,8 @@ GCC_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits
IBM_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized"
CLANG_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized"
INTEL_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized"
CUDA_WARNING_FLAGS=""
CUDA_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized"
PGI_WARNING_FLAGS=""
# Default. Machine specific can override.
DEBUG=False
@ -61,6 +62,8 @@ SPOT_CHECK=False
PRINT_HELP=False
OPT_FLAG=""
CXX_FLAGS_EXTRA=""
LD_FLAGS_EXTRA=""
KOKKOS_OPTIONS=""
#
@ -111,6 +114,12 @@ do
--with-cuda-options*)
KOKKOS_CUDA_OPTIONS="--with-cuda-options=${key#*=}"
;;
--cxxflags-extra*)
CXX_FLAGS_EXTRA="${key#*=}"
;;
--ldflags-extra*)
LD_FLAGS_EXTRA="${key#*=}"
;;
--help*)
PRINT_HELP=True
;;
@ -150,20 +159,18 @@ if [ "$MACHINE" = "sems" ]; then
if [ "$SPOT_CHECK" = "True" ]; then
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/4.7.2 $BASE_MODULE_LIST "OpenMP,Pthread" g++ $GCC_WARNING_FLAGS"
"gcc/5.1.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
COMPILERS=("gcc/5.3.0 $BASE_MODULE_LIST "OpenMP" g++ $GCC_WARNING_FLAGS"
"gcc/6.1.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/17.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
"clang/3.9.0 $BASE_MODULE_LIST "Pthread_Serial" clang++ $CLANG_WARNING_FLAGS"
"cuda/8.0.44 $CUDA8_MODULE_LIST "Cuda_OpenMP" $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
else
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/4.7.2 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/4.8.4 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
COMPILERS=("gcc/4.8.4 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/4.9.3 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/5.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/6.1.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"intel/14.0.4 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/15.0.2 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/16.0.3 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
@ -184,6 +191,7 @@ elif [ "$MACHINE" = "white" ]; then
BASE_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>"
IBM_MODULE_LIST="<COMPILER_NAME>/xl/<COMPILER_VERSION>"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/5.4.0"
CUDA_MODULE_LIST2="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.3.0,ibm/xl/13.1.6-BETA"
# Don't do pthread on white.
GCC_BUILD_LIST="OpenMP,Serial,OpenMP_Serial"
@ -192,6 +200,7 @@ elif [ "$MACHINE" = "white" ]; then
COMPILERS=("gcc/5.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/13.1.3 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/8.0.44 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST2 $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
if [ -z "$ARCH_FLAG" ]; then
@ -210,8 +219,9 @@ elif [ "$MACHINE" = "bowman" ]; then
OLD_INTEL_BUILD_LIST="Pthread,Serial,Pthread_Serial"
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("intel/16.2.181 $BASE_MODULE_LIST $OLD_INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/17.0.098 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
COMPILERS=("intel/16.4.258 $BASE_MODULE_LIST $OLD_INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/17.2.174 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/18.0.128 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
)
if [ -z "$ARCH_FLAG" ]; then
@ -241,13 +251,13 @@ elif [ "$MACHINE" = "shepard" ]; then
SKIP_HWLOC=True
export SLURM_TASKS_PER_NODE=32
BASE_MODULE_LIST="<COMPILER_NAME>/compilers/<COMPILER_VERSION>"
OLD_INTEL_BUILD_LIST="Pthread,Serial,Pthread_Serial"
BASE_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>"
BASE_MODULE_LIST_INTEL="<COMPILER_NAME>/compilers/<COMPILER_VERSION>"
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("intel/16.2.181 $BASE_MODULE_LIST $OLD_INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/17.0.098 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
COMPILERS=("intel/17.4.196 $BASE_MODULE_LIST_INTEL $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/18.0.128 $BASE_MODULE_LIST_INTEL $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"pgi/17.10.0 $BASE_MODULE_LIST $GCC_BUILD_LIST pgc++ $PGI_WARNING_FLAGS"
)
if [ -z "$ARCH_FLAG" ]; then
@ -280,7 +290,7 @@ elif [ "$MACHINE" = "apollo" ]; then
if [ "$SPOT_CHECK" = "True" ]; then
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/4.7.2 $BASE_MODULE_LIST "OpenMP,Pthread" g++ $GCC_WARNING_FLAGS"
COMPILERS=("gcc/4.8.4 $BASE_MODULE_LIST "OpenMP,Pthread" g++ $GCC_WARNING_FLAGS"
"gcc/5.1.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
"clang/3.9.0 $BASE_MODULE_LIST "Pthread_Serial" clang++ $CLANG_WARNING_FLAGS"
@ -292,14 +302,13 @@ elif [ "$MACHINE" = "apollo" ]; then
COMPILERS=("cuda/8.0.44 $CUDA8_MODULE_LIST $BUILD_LIST_CUDA_NVCC $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"clang/4.0.0 $CLANG_MODULE_LIST $BUILD_LIST_CUDA_CLANG clang++ $CUDA_WARNING_FLAGS"
"clang/3.9.0 $CLANG_MODULE_LIST $BUILD_LIST_CLANG clang++ $CLANG_WARNING_FLAGS"
"gcc/4.7.2 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/4.8.4 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/4.9.2 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/4.9.3 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/5.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"gcc/6.1.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"intel/14.0.4 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/15.0.2 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"intel/17.0.1 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS"
"clang/3.5.2 $BASE_MODULE_LIST $CLANG_BUILD_LIST clang++ $CLANG_WARNING_FLAGS"
"clang/3.6.1 $BASE_MODULE_LIST $CLANG_BUILD_LIST clang++ $CLANG_WARNING_FLAGS"
"cuda/7.0.28 $CUDA_MODULE_LIST $CUDA_BUILD_LIST $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
@ -336,6 +345,8 @@ if [ "$PRINT_HELP" = "True" ]; then
echo "--dry-run: Just print what would be executed"
echo "--build-only: Just do builds, don't run anything"
echo "--opt-flag=FLAG: Optimization flag (default: -O3)"
echo "--cxxflags-extra=FLAGS: Extra flags to be added to CXX_FLAGS"
echo "--ldflags-extra=FLAGS: Extra flags to be added to LD_FLAGS"
echo "--arch=ARCHITECTURE: overwrite architecture flags"
echo "--with-cuda-options=OPT: set KOKKOS_CUDA_OPTIONS"
echo "--build-list=BUILD,BUILD,BUILD..."
@ -361,14 +372,14 @@ if [ "$PRINT_HELP" = "True" ]; then
echo " Run all gcc tests"
echo " % test_all_sandia gcc"
echo ""
echo " Run all gcc/4.7.2 and all intel tests"
echo " % test_all_sandia gcc/4.7.2 intel"
echo " Run all gcc/4.8.4 and all intel tests"
echo " % test_all_sandia gcc/4.8.4 intel"
echo ""
echo " Run all tests in debug"
echo " % test_all_sandia --debug"
echo ""
echo " Run gcc/4.7.2 and only do OpenMP and OpenMP_Serial builds"
echo " % test_all_sandia gcc/4.7.2 --build-list=OpenMP,OpenMP_Serial"
echo " Run gcc/4.8.4 and only do OpenMP and OpenMP_Serial builds"
echo " % test_all_sandia gcc/4.8.4 --build-list=OpenMP,OpenMP_Serial"
echo ""
echo "If you want to kill the tests, do:"
echo " hit ctrl-z"
@ -566,10 +577,15 @@ single_build_and_test() {
if [[ "$build_type" = *debug* ]]; then
local extra_args="$extra_args --debug"
local cxxflags="-g $compiler_warning_flags"
local ldflags="-g"
else
local cxxflags="$OPT_FLAG $compiler_warning_flags"
local ldflags="${OPT_FLAG}"
fi
local cxxflags="${cxxflags} ${CXX_FLAGS_EXTRA}"
local ldflags="${ldflags} ${LD_FLAGS_EXTRA}"
if [[ "$KOKKOS_CUDA_OPTIONS" != "" ]]; then
local extra_args="$extra_args $KOKKOS_CUDA_OPTIONS"
fi
@ -586,7 +602,7 @@ single_build_and_test() {
run_cmd ls fake_problem >& ${desc}.configure.log || { report_and_log_test_result 1 $desc configure && return 0; }
fi
else
run_cmd ${KOKKOS_PATH}/generate_makefile.bash --with-devices=$build $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" $extra_args &>> ${desc}.configure.log || { report_and_log_test_result 1 ${desc} configure && return 0; }
run_cmd ${KOKKOS_PATH}/generate_makefile.bash --with-devices=$build $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --ldflags=\"$ldflags\" $extra_args &>> ${desc}.configure.log || { report_and_log_test_result 1 ${desc} configure && return 0; }
local -i build_start_time=$(date +%s)
run_cmd make -j 32 build-test >& ${desc}.build.log || { report_and_log_test_result 1 ${desc} build && return 0; }
local -i build_end_time=$(date +%s)

View File

@ -1,6 +1,6 @@
#!/bin/bash -el
ulimit -c 0
module load devpack/openmpi/1.10.0/intel/16.1.056/cuda/none
module load devpack/openmpi/2.1.1/intel/17.4.196/cuda/none
KOKKOS_BRANCH=$1
TRILINOS_UPDATE_BRANCH=$2

View File

@ -1,6 +1,6 @@
#!/bin/bash -el
ulimit -c 0
module load devpack/openmpi/1.10.0/intel/16.1.056/cuda/none
module load devpack/openmpi/2.1.1/intel/17.4.196/cuda/none
KOKKOS_BRANCH=$1
TRILINOS_UPDATE_BRANCH=$2

View File

@ -2,7 +2,10 @@
TRIBITS_SUBPACKAGE(Containers)
ADD_SUBDIRECTORY(src)
IF(KOKKOS_HAS_TRILINOS)
ADD_SUBDIRECTORY(src)
ENDIF()
TRIBITS_ADD_TEST_DIRECTORIES(unit_tests)
TRIBITS_ADD_TEST_DIRECTORIES(performance_tests)

View File

@ -3,6 +3,14 @@ INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src )
IF(NOT KOKKOS_HAS_TRILINOS)
IF(KOKKOS_SEPARATE_LIBS)
set(TEST_LINK_TARGETS kokkoscore)
ELSE()
set(TEST_LINK_TARGETS kokkos)
ENDIF()
ENDIF()
SET(SOURCES
TestMain.cpp
TestCuda.cpp
@ -24,7 +32,7 @@ TRIBITS_ADD_EXECUTABLE(
PerfTestExec
SOURCES ${SOURCES}
COMM serial mpi
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
TRIBITS_ADD_TEST(

View File

@ -15,7 +15,8 @@ endif
CXXFLAGS = -O3
LINK ?= $(CXX)
LDFLAGS ?= -lpthread
LDFLAGS ?=
override LDFLAGS += -lpthread
include $(KOKKOS_PATH)/Makefile.kokkos
@ -30,6 +31,12 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
TEST_TARGETS += test-cuda
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_ROCM = TestROCm.o TestMain.o gtest-all.o
TARGETS += KokkosContainers_PerformanceTest_ROCm
TEST_TARGETS += test-rocm
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
OBJ_THREADS = TestThreads.o TestMain.o gtest-all.o
TARGETS += KokkosContainers_PerformanceTest_Threads
@ -45,6 +52,9 @@ endif
KokkosContainers_PerformanceTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Cuda
KokkosContainers_PerformanceTest_ROCm: $(OBJ_ROCM) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_ROCM) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_ROCm
KokkosContainers_PerformanceTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Threads
@ -54,6 +64,9 @@ KokkosContainers_PerformanceTest_OpenMP: $(OBJ_OPENMP) $(KOKKOS_LINK_DEPENDS)
test-cuda: KokkosContainers_PerformanceTest_Cuda
./KokkosContainers_PerformanceTest_Cuda
test-rocm: KokkosContainers_PerformanceTest_ROCm
./KokkosContainers_PerformanceTest_ROCm
test-threads: KokkosContainers_PerformanceTest_Threads
./KokkosContainers_PerformanceTest_Threads

View File

@ -180,8 +180,8 @@ void test_dynrankview_op_perf( const int par_size )
typedef DeviceType execution_space;
typedef typename execution_space::size_type size_type;
const size_type dim2 = 90;
const size_type dim3 = 30;
const size_type dim_2 = 90;
const size_type dim_3 = 30;
double elapsed_time_view = 0;
double elapsed_time_compview = 0;
@ -191,7 +191,7 @@ void test_dynrankview_op_perf( const int par_size )
double elapsed_time_compdrview = 0;
Kokkos::Timer timer;
{
Kokkos::View<double***,DeviceType> testview("testview",par_size,dim2,dim3);
Kokkos::View<double***,DeviceType> testview("testview",par_size,dim_2,dim_3);
typedef InitViewFunctor<DeviceType> FunctorType;
timer.reset();
@ -220,7 +220,7 @@ void test_dynrankview_op_perf( const int par_size )
std::cout << " Strided View time (init only): " << elapsed_time_strideview << std::endl;
}
{
Kokkos::View<double*******,DeviceType> testview("testview",par_size,dim2,dim3,1,1,1,1);
Kokkos::View<double*******,DeviceType> testview("testview",par_size,dim_2,dim_3,1,1,1,1);
typedef InitViewRank7Functor<DeviceType> FunctorType;
timer.reset();
@ -231,7 +231,7 @@ void test_dynrankview_op_perf( const int par_size )
std::cout << " View Rank7 time (init only): " << elapsed_time_view_rank7 << std::endl;
}
{
Kokkos::DynRankView<double,DeviceType> testdrview("testdrview",par_size,dim2,dim3);
Kokkos::DynRankView<double,DeviceType> testdrview("testdrview",par_size,dim_2,dim_3);
typedef InitDynRankViewFunctor<DeviceType> FunctorType;
timer.reset();

View File

@ -54,6 +54,7 @@
#include <TestUnorderedMapPerformance.hpp>
#include <TestDynRankView.hpp>
#include <TestScatterView.hpp>
#include <iomanip>
#include <sstream>
@ -122,6 +123,18 @@ TEST_F( openmp, unordered_map_performance_far)
Perf::run_performance_tests<Kokkos::OpenMP,false>(base_file_name.str());
}
TEST_F( openmp, scatter_view)
{
std::cout << "ScatterView data-duplicated test:\n";
Perf::test_scatter_view<Kokkos::OpenMP, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(10, 1000 * 1000);
//std::cout << "ScatterView atomics test:\n";
//Perf::test_scatter_view<Kokkos::OpenMP, Kokkos::LayoutRight,
// Kokkos::Experimental::ScatterNonDuplicated,
// Kokkos::Experimental::ScatterAtomic>(10, 1000 * 1000);
}
} // namespace test
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTOPENMP_PREVENT_EMPTY_LINK_ERROR() {}

View File

@ -0,0 +1,113 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Macros.hpp>
#if defined( KOKKOS_ENABLE_ROCM )
#include <cstdint>
#include <string>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <fstream>
#include <gtest/gtest.h>
#include <Kokkos_Core.hpp>
#include <TestDynRankView.hpp>
#include <Kokkos_UnorderedMap.hpp>
#include <TestGlobal2LocalIds.hpp>
#include <TestUnorderedMapPerformance.hpp>
namespace Performance {
class rocm : public ::testing::Test {
protected:
static void SetUpTestCase()
{
std::cout << std::setprecision(5) << std::scientific;
Kokkos::HostSpace::execution_space::initialize();
Kokkos::Experimental::ROCm::initialize( Kokkos::Experimental::ROCm::SelectDevice(0) );
}
static void TearDownTestCase()
{
Kokkos::Experimental::ROCm::finalize();
Kokkos::HostSpace::execution_space::finalize();
}
};
#if 0
// issue 1089
TEST_F( rocm, dynrankview_perf )
{
std::cout << "ROCm" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::Experimental::ROCm>( 40960 );
}
TEST_F( rocm, global_2_local)
{
std::cout << "ROCm" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i=Performance::begin_id_size; i<=Performance::end_id_size; i *= Performance::id_step)
test_global_to_local_ids<Kokkos::Experimental::ROCm>(i);
}
#endif
TEST_F( rocm, unordered_map_performance_near)
{
Perf::run_performance_tests<Kokkos::Experimental::ROCm,true>("rocm-near");
}
TEST_F( rocm, unordered_map_performance_far)
{
Perf::run_performance_tests<Kokkos::Experimental::ROCm,false>("rocm-far");
}
}
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTROCM_PREVENT_EMPTY_LINK_ERROR() {}
#endif /* #if defined( KOKKOS_ENABLE_ROCM ) */

View File

@ -0,0 +1,113 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_TEST_SCATTER_VIEW_HPP
#define KOKKOS_TEST_SCATTER_VIEW_HPP
#include <Kokkos_ScatterView.hpp>
#include <impl/Kokkos_Timer.hpp>
namespace Perf {
template <typename ExecSpace, typename Layout, int duplication, int contribution>
void test_scatter_view(int m, int n)
{
Kokkos::View<double *[3], Layout, ExecSpace> original_view("original_view", n);
{
auto scatter_view = Kokkos::Experimental::create_scatter_view
< Kokkos::Experimental::ScatterSum
, duplication
, contribution
> (original_view);
Kokkos::Experimental::UniqueToken<
ExecSpace, Kokkos::Experimental::UniqueTokenScope::Global>
unique_token{ExecSpace()};
//auto internal_view = scatter_view.internal_view;
auto policy = Kokkos::RangePolicy<ExecSpace, int>(0, n);
for (int foo = 0; foo < 5; ++foo) {
{
auto num_threads = unique_token.size();
std::cout << "num_threads " << num_threads << '\n';
Kokkos::View<double **[3], Layout, ExecSpace> hand_coded_duplicate_view("hand_coded_duplicate", num_threads, n);
auto f2 = KOKKOS_LAMBDA(int i) {
auto thread_id = unique_token.acquire();
for (int j = 0; j < 10; ++j) {
auto k = (i + j) % n;
hand_coded_duplicate_view(thread_id, k, 0) += 4.2;
hand_coded_duplicate_view(thread_id, k, 1) += 2.0;
hand_coded_duplicate_view(thread_id, k, 2) += 1.0;
}
unique_token.release(thread_id); // return the slot so later iterates can reuse it
};
Kokkos::Timer timer;
timer.reset();
for (int k = 0; k < m; ++k) {
Kokkos::parallel_for(policy, f2, "hand_coded_duplicate_scatter_view_test");
}
auto t = timer.seconds();
std::cout << "hand-coded test took " << t << " seconds\n";
}
{
auto f = KOKKOS_LAMBDA(int i) {
auto scatter_access = scatter_view.access();
for (int j = 0; j < 10; ++j) {
auto k = (i + j) % n;
scatter_access(k, 0) += 4.2;
scatter_access(k, 1) += 2.0;
scatter_access(k, 2) += 1.0;
}
};
Kokkos::Timer timer;
timer.reset();
for (int k = 0; k < m; ++k) {
Kokkos::parallel_for(policy, f, "scatter_view_test");
}
auto t = timer.seconds();
std::cout << "test took " << t << " seconds\n";
}
}
}
}
}
#endif

View File

@ -6,26 +6,42 @@ INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR})
#-----------------------------------------------------------------------------
SET(HEADERS "")
SET(SOURCES "")
SET(HEADERS_IMPL "")
FILE(GLOB HEADERS *.hpp)
FILE(GLOB HEADERS_IMPL impl/*.hpp)
FILE(GLOB SOURCES impl/*.cpp)
SET(TRILINOS_INCDIR ${CMAKE_INSTALL_PREFIX}/${${PROJECT_NAME}_INSTALL_INCLUDE_DIR})
INSTALL(FILES ${HEADERS_IMPL} DESTINATION ${TRILINOS_INCDIR}/impl/)
if(KOKKOS_LEGACY_TRIBITS)
TRIBITS_ADD_LIBRARY(
kokkoscontainers
HEADERS ${HEADERS}
NOINSTALLHEADERS ${HEADERS_IMPL}
SOURCES ${SOURCES}
DEPLIBS
)
SET(HEADERS "")
SET(SOURCES "")
SET(HEADERS_IMPL "")
FILE(GLOB HEADERS *.hpp)
FILE(GLOB HEADERS_IMPL impl/*.hpp)
FILE(GLOB SOURCES impl/*.cpp)
INSTALL(FILES ${HEADERS_IMPL} DESTINATION ${TRILINOS_INCDIR}/impl/)
TRIBITS_ADD_LIBRARY(
kokkoscontainers
HEADERS ${HEADERS}
NOINSTALLHEADERS ${HEADERS_IMPL}
SOURCES ${SOURCES}
DEPLIBS
)
else()
INSTALL (
DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/"
DESTINATION ${TRILINOS_INCDIR}
FILES_MATCHING PATTERN "*.hpp"
)
TRIBITS_ADD_LIBRARY(
kokkoscontainers
SOURCES ${KOKKOS_CONTAINERS_SRCS}
DEPLIBS
)
endif()
#-----------------------------------------------------------------------------

View File

@ -72,8 +72,10 @@ private:
, "DynamicView must be rank-one" );
static_assert( std::is_trivial< typename traits::value_type >::value &&
std::is_same< typename traits::specialize , void >::value &&
Kokkos::Impl::is_power_of_two
<sizeof(typename traits::value_type)>::value
, "DynamicView must have trivial value_type and sizeof(value_type) is a power-of-two");
template< class Space , bool = Kokkos::Impl::MemorySpaceAccess< Space , typename traits::memory_space >::accessible > struct verify_space

View File

@ -0,0 +1,999 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
/// \file Kokkos_ScatterView.hpp
/// \brief Declaration and definition of Kokkos::ScatterView.
///
/// This header file declares and defines Kokkos::ScatterView and its
/// related nonmember functions.
#ifndef KOKKOS_SCATTER_VIEW_HPP
#define KOKKOS_SCATTER_VIEW_HPP
#include <Kokkos_Core.hpp>
#include <utility>
namespace Kokkos {
namespace Experimental {
//TODO: replace this enum with the Kokkos::Sum, etc reducers for parallel_reduce
enum : int {
ScatterSum,
};
enum : int {
ScatterNonDuplicated = 0,
ScatterDuplicated = 1
};
enum : int {
ScatterNonAtomic = 0,
ScatterAtomic = 1
};
}} // Kokkos::Experimental
namespace Kokkos {
namespace Impl {
namespace Experimental {
template <typename ExecSpace>
struct DefaultDuplication;
template <typename ExecSpace, int duplication>
struct DefaultContribution;
#ifdef KOKKOS_ENABLE_SERIAL
template <>
struct DefaultDuplication<Kokkos::Serial> {
enum : int { value = Kokkos::Experimental::ScatterNonDuplicated };
};
template <>
struct DefaultContribution<Kokkos::Serial, Kokkos::Experimental::ScatterNonDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterNonAtomic };
};
template <>
struct DefaultContribution<Kokkos::Serial, Kokkos::Experimental::ScatterDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterNonAtomic };
};
#endif
#ifdef KOKKOS_ENABLE_OPENMP
template <>
struct DefaultDuplication<Kokkos::OpenMP> {
enum : int { value = Kokkos::Experimental::ScatterDuplicated };
};
template <>
struct DefaultContribution<Kokkos::OpenMP, Kokkos::Experimental::ScatterNonDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterAtomic };
};
template <>
struct DefaultContribution<Kokkos::OpenMP, Kokkos::Experimental::ScatterDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterNonAtomic };
};
#endif
#ifdef KOKKOS_ENABLE_THREADS
template <>
struct DefaultDuplication<Kokkos::Threads> {
enum : int { value = Kokkos::Experimental::ScatterDuplicated };
};
template <>
struct DefaultContribution<Kokkos::Threads, Kokkos::Experimental::ScatterNonDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterAtomic };
};
template <>
struct DefaultContribution<Kokkos::Threads, Kokkos::Experimental::ScatterDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterNonAtomic };
};
#endif
#ifdef KOKKOS_ENABLE_CUDA
template <>
struct DefaultDuplication<Kokkos::Cuda> {
enum : int { value = Kokkos::Experimental::ScatterNonDuplicated };
};
template <>
struct DefaultContribution<Kokkos::Cuda, Kokkos::Experimental::ScatterNonDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterAtomic };
};
template <>
struct DefaultContribution<Kokkos::Cuda, Kokkos::Experimental::ScatterDuplicated> {
enum : int { value = Kokkos::Experimental::ScatterAtomic };
};
#endif
/* ScatterValue is the object returned by the access operator() of ScatterAccess.
Much like the reference returned by an atomic View, it wraps Kokkos::atomic_add
in convenient operator+=, operator-=, etc. */
template <typename ValueType, int Op, int contribution>
struct ScatterValue;
template <typename ValueType>
struct ScatterValue<ValueType, Kokkos::Experimental::ScatterSum, Kokkos::Experimental::ScatterNonAtomic> {
public:
KOKKOS_FORCEINLINE_FUNCTION ScatterValue(ValueType& value_in) : value( value_in ) {}
KOKKOS_FORCEINLINE_FUNCTION ScatterValue(ScatterValue&& other) : value( other.value ) {}
KOKKOS_FORCEINLINE_FUNCTION void operator+=(ValueType const& rhs) {
value += rhs;
}
KOKKOS_FORCEINLINE_FUNCTION void operator-=(ValueType const& rhs) {
value -= rhs;
}
private:
ValueType& value;
};
template <typename ValueType>
struct ScatterValue<ValueType, Kokkos::Experimental::ScatterSum, Kokkos::Experimental::ScatterAtomic> {
public:
KOKKOS_FORCEINLINE_FUNCTION ScatterValue(ValueType& value_in) : value( value_in ) {}
KOKKOS_FORCEINLINE_FUNCTION void operator+=(ValueType const& rhs) {
Kokkos::atomic_add(&value, rhs);
}
KOKKOS_FORCEINLINE_FUNCTION void operator-=(ValueType const& rhs) {
Kokkos::atomic_add(&value, -rhs);
}
private:
ValueType& value;
};
/* DuplicatedDataType, given a View DataType, creates a new DataType
that has one extra runtime dimension, which becomes the largest-stride dimension.
For LayoutLeft this new dimension is the last one, and since the DataType grammar
requires runtime dimensions to precede compile-time ones, any existing
compile-time dimensions must also be converted to runtime dimensions. */
template <typename T, typename Layout>
struct DuplicatedDataType;
template <typename T>
struct DuplicatedDataType<T, Kokkos::LayoutRight> {
typedef T* value_type; // For LayoutRight, add a star all the way on the left
};
template <typename T, size_t N>
struct DuplicatedDataType<T[N], Kokkos::LayoutRight> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutRight>::value_type value_type[N];
};
template <typename T>
struct DuplicatedDataType<T[], Kokkos::LayoutRight> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutRight>::value_type value_type[];
};
template <typename T>
struct DuplicatedDataType<T*, Kokkos::LayoutRight> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutRight>::value_type* value_type;
};
template <typename T>
struct DuplicatedDataType<T, Kokkos::LayoutLeft> {
typedef T* value_type;
};
template <typename T, size_t N>
struct DuplicatedDataType<T[N], Kokkos::LayoutLeft> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutLeft>::value_type* value_type;
};
template <typename T>
struct DuplicatedDataType<T[], Kokkos::LayoutLeft> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutLeft>::value_type* value_type;
};
template <typename T>
struct DuplicatedDataType<T*, Kokkos::LayoutLeft> {
typedef typename DuplicatedDataType<T, Kokkos::LayoutLeft>::value_type* value_type;
};
/* Slice is just responsible for stuffing the correct number of Kokkos::ALL
arguments on the correct side of the index in a call to subview() to get a
subview where the index specified is the largest-stride one. */
template <typename Layout, int rank, typename V, typename ... Args>
struct Slice {
typedef Slice<Layout, rank - 1, V, Kokkos::Impl::ALL_t, Args...> next;
typedef typename next::value_type value_type;
static
value_type get(V const& src, const size_t i, Args ... args) {
return next::get(src, i, Kokkos::ALL, args...);
}
};
template <typename V, typename ... Args>
struct Slice<Kokkos::LayoutRight, 1, V, Args...> {
typedef typename Kokkos::Impl::ViewMapping
< void
, V
, const size_t
, Args ...
>::type value_type;
static
value_type get(V const& src, const size_t i, Args ... args) {
return Kokkos::subview(src, i, args...);
}
};
template <typename V, typename ... Args>
struct Slice<Kokkos::LayoutLeft, 1, V, Args...> {
typedef typename Kokkos::Impl::ViewMapping
< void
, V
, Args ...
, const size_t
>::type value_type;
static
value_type get(V const& src, const size_t i, Args ... args) {
return Kokkos::subview(src, args..., i);
}
};
template <typename ExecSpace, typename ValueType, int Op>
struct ReduceDuplicates;
template <typename ExecSpace, typename ValueType, int Op>
struct ReduceDuplicatesBase {
typedef ReduceDuplicates<ExecSpace, ValueType, Op> Derived;
ValueType const* src;
ValueType* dst;
size_t stride;
size_t start;
size_t n;
ReduceDuplicatesBase(ValueType const* src_in, ValueType* dest_in, size_t stride_in, size_t start_in, size_t n_in, std::string const& name)
: src(src_in)
, dst(dest_in)
, stride(stride_in)
, start(start_in)
, n(n_in)
{
#if defined(KOKKOS_ENABLE_PROFILING)
uint64_t kpID = 0;
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::beginParallelFor(std::string("reduce_") + name, 0, &kpID);
}
#endif
typedef RangePolicy<ExecSpace, size_t> policy_type;
typedef Kokkos::Impl::ParallelFor<Derived, policy_type> closure_type;
const closure_type closure(*(static_cast<Derived*>(this)), policy_type(0, stride));
closure.execute();
#if defined(KOKKOS_ENABLE_PROFILING)
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::endParallelFor(kpID);
}
#endif
}
};
template <typename ExecSpace, typename ValueType>
struct ReduceDuplicates<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum> :
public ReduceDuplicatesBase<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum>
{
typedef ReduceDuplicatesBase<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum> Base;
ReduceDuplicates(ValueType const* src_in, ValueType* dst_in, size_t stride_in, size_t start_in, size_t n_in, std::string const& name):
Base(src_in, dst_in, stride_in, start_in, n_in, name)
{}
KOKKOS_FORCEINLINE_FUNCTION void operator()(size_t i) const {
for (size_t j = Base::start; j < Base::n; ++j) {
Base::dst[i] += Base::src[i + Base::stride * j];
}
}
};
template <typename ExecSpace, typename ValueType, int Op>
struct ResetDuplicates;
template <typename ExecSpace, typename ValueType, int Op>
struct ResetDuplicatesBase {
typedef ResetDuplicates<ExecSpace, ValueType, Op> Derived;
ValueType* data;
ResetDuplicatesBase(ValueType* data_in, size_t size_in, std::string const& name)
: data(data_in)
{
#if defined(KOKKOS_ENABLE_PROFILING)
uint64_t kpID = 0;
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::beginParallelFor(std::string("reset_") + name, 0, &kpID);
}
#endif
typedef RangePolicy<ExecSpace, size_t> policy_type;
typedef Kokkos::Impl::ParallelFor<Derived, policy_type> closure_type;
const closure_type closure(*(static_cast<Derived*>(this)), policy_type(0, size_in));
closure.execute();
#if defined(KOKKOS_ENABLE_PROFILING)
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::endParallelFor(kpID);
}
#endif
}
};
template <typename ExecSpace, typename ValueType>
struct ResetDuplicates<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum> :
public ResetDuplicatesBase<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum>
{
typedef ResetDuplicatesBase<ExecSpace, ValueType, Kokkos::Experimental::ScatterSum> Base;
ResetDuplicates(ValueType* data_in, size_t size_in, std::string const& name):
Base(data_in, size_in, name)
{}
KOKKOS_FORCEINLINE_FUNCTION void operator()(size_t i) const {
Base::data[i] = Kokkos::reduction_identity<ValueType>::sum();
}
};
}}} // Kokkos::Impl::Experimental
namespace Kokkos {
namespace Experimental {
template <typename DataType
,typename Layout = Kokkos::DefaultExecutionSpace::array_layout
,typename ExecSpace = Kokkos::DefaultExecutionSpace
,int Op = ScatterSum
,int duplication = Kokkos::Impl::Experimental::DefaultDuplication<ExecSpace>::value
,int contribution = Kokkos::Impl::Experimental::DefaultContribution<ExecSpace, duplication>::value
>
class ScatterView;
template <typename DataType
,int Op
,typename ExecSpace
,typename Layout
,int duplication
,int contribution
,int override_contribution
>
class ScatterAccess;
// non-duplicated implementation
template <typename DataType
,int Op
,typename ExecSpace
,typename Layout
,int contribution
>
class ScatterView<DataType
,Layout
,ExecSpace
,Op
,ScatterNonDuplicated
,contribution>
{
public:
typedef Kokkos::View<DataType, Layout, ExecSpace> original_view_type;
typedef typename original_view_type::value_type original_value_type;
typedef typename original_view_type::reference_type original_reference_type;
friend class ScatterAccess<DataType, Op, ExecSpace, Layout, ScatterNonDuplicated, contribution, ScatterNonAtomic>;
friend class ScatterAccess<DataType, Op, ExecSpace, Layout, ScatterNonDuplicated, contribution, ScatterAtomic>;
ScatterView()
{
}
template <typename RT, typename ... RP>
ScatterView(View<RT, RP...> const& original_view)
: internal_view(original_view)
{
}
template <typename ... Dims>
ScatterView(std::string const& name, Dims ... dims)
: internal_view(name, dims ...)
{
}
template <int override_contrib = contribution>
KOKKOS_FORCEINLINE_FUNCTION
ScatterAccess<DataType, Op, ExecSpace, Layout, ScatterNonDuplicated, contribution, override_contrib>
access() const {
return ScatterAccess<DataType, Op, ExecSpace, Layout, ScatterNonDuplicated, contribution, override_contrib>{*this};
}
original_view_type subview() const {
return internal_view;
}
template <typename DT, typename ... RP>
void contribute_into(View<DT, RP...> const& dest) const
{
typedef View<DT, RP...> dest_type;
static_assert(std::is_same<
typename dest_type::array_layout,
Layout>::value,
"ScatterView contribute destination has different layout");
static_assert(Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
typename ExecSpace::memory_space,
typename dest_type::memory_space>::value,
"ScatterView contribute destination memory space not accessible");
if (dest.data() == internal_view.data()) return;
Kokkos::Impl::Experimental::ReduceDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
dest.data(),
internal_view.size(), // stride: visit every entry of the single copy once
0,
1,
internal_view.label());
}
void reset() {
Kokkos::Impl::Experimental::ResetDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
internal_view.size(),
internal_view.label());
}
template <typename DT, typename ... RP>
void reset_except(View<DT, RP...> const& view) {
if (view.data() != internal_view.data()) reset();
}
void resize(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0,
const size_t n7 = 0) {
::Kokkos::resize(internal_view,n0,n1,n2,n3,n4,n5,n6,n7);
}
void realloc(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0,
const size_t n7 = 0) {
::Kokkos::realloc(internal_view,n0,n1,n2,n3,n4,n5,n6,n7);
}
protected:
template <typename ... Args>
KOKKOS_FORCEINLINE_FUNCTION
original_reference_type at(Args ... args) const {
return internal_view(args...);
}
private:
typedef original_view_type internal_view_type;
internal_view_type internal_view;
};
template <typename DataType
,int Op
,typename ExecSpace
,typename Layout
,int contribution
,int override_contribution
>
class ScatterAccess<DataType
,Op
,ExecSpace
,Layout
,ScatterNonDuplicated
,contribution
,override_contribution>
{
public:
typedef ScatterView<DataType, Layout, ExecSpace, Op, ScatterNonDuplicated, contribution> view_type;
typedef typename view_type::original_value_type original_value_type;
typedef Kokkos::Impl::Experimental::ScatterValue<
original_value_type, Op, override_contribution> value_type;
KOKKOS_INLINE_FUNCTION
ScatterAccess(view_type const& view_in)
: view(view_in)
{
}
template <typename ... Args>
KOKKOS_FORCEINLINE_FUNCTION
value_type operator()(Args ... args) const {
return view.at(args...);
}
template <typename Arg>
KOKKOS_FORCEINLINE_FUNCTION
typename std::enable_if<view_type::original_view_type::rank == 1 &&
std::is_integral<Arg>::value, value_type>::type
operator[](Arg arg) const {
return view.at(arg);
}
private:
view_type const& view;
};
// duplicated implementation
// LayoutLeft and LayoutRight are different enough that we'll just specialize each
template <typename DataType
,int Op
,typename ExecSpace
,int contribution
>
class ScatterView<DataType
,Kokkos::LayoutRight
,ExecSpace
,Op
,ScatterDuplicated
,contribution>
{
public:
typedef Kokkos::View<DataType, Kokkos::LayoutRight, ExecSpace> original_view_type;
typedef typename original_view_type::value_type original_value_type;
typedef typename original_view_type::reference_type original_reference_type;
friend class ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutRight, ScatterDuplicated, contribution, ScatterNonAtomic>;
friend class ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutRight, ScatterDuplicated, contribution, ScatterAtomic>;
typedef typename Kokkos::Impl::Experimental::DuplicatedDataType<DataType, Kokkos::LayoutRight> data_type_info;
typedef typename data_type_info::value_type internal_data_type;
typedef Kokkos::View<internal_data_type, Kokkos::LayoutRight, ExecSpace> internal_view_type;
ScatterView()
{
}
template <typename RT, typename ... RP >
ScatterView(View<RT, RP...> const& original_view)
: unique_token()
, internal_view(Kokkos::ViewAllocateWithoutInitializing(
std::string("duplicated_") + original_view.label()),
unique_token.size(),
original_view.extent(0),
original_view.extent(1),
original_view.extent(2),
original_view.extent(3),
original_view.extent(4),
original_view.extent(5),
original_view.extent(6))
{
reset();
}
template <typename ... Dims>
ScatterView(std::string const& name, Dims ... dims)
: internal_view(Kokkos::ViewAllocateWithoutInitializing(name), unique_token.size(), dims ...)
{
reset();
}
template <int override_contribution = contribution>
inline
ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutRight, ScatterDuplicated, contribution, override_contribution>
access() const {
return ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutRight, ScatterDuplicated, contribution, override_contribution>{*this};
}
typename Kokkos::Impl::Experimental::Slice<
Kokkos::LayoutRight, internal_view_type::rank, internal_view_type>::value_type
subview() const
{
return Kokkos::Impl::Experimental::Slice<
Kokkos::LayoutRight, internal_view_type::rank, internal_view_type>::get(internal_view, 0);
}
template <typename DT, typename ... RP>
void contribute_into(View<DT, RP...> const& dest) const
{
typedef View<DT, RP...> dest_type;
static_assert(std::is_same<
typename dest_type::array_layout,
Kokkos::LayoutRight>::value,
"ScatterView contribute destination has different layout");
static_assert(Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
typename ExecSpace::memory_space,
typename dest_type::memory_space>::value,
"ScatterView contribute destination memory space not accessible");
size_t strides[8];
internal_view.stride(strides);
bool is_equal = (dest.data() == internal_view.data());
size_t start = is_equal ? 1 : 0;
Kokkos::Impl::Experimental::ReduceDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
dest.data(),
strides[0],
start,
internal_view.extent(0),
internal_view.label());
}
void reset() {
Kokkos::Impl::Experimental::ResetDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
internal_view.size(),
internal_view.label());
}
template <typename DT, typename ... RP>
void reset_except(View<DT, RP...> const& view) {
if (view.data() != internal_view.data()) {
reset();
return;
}
Kokkos::Impl::Experimental::ResetDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data() + view.size(),
internal_view.size() - view.size(),
internal_view.label());
}
void resize(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0) {
::Kokkos::resize(internal_view,unique_token.size(),n0,n1,n2,n3,n4,n5,n6);
}
void realloc(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0) {
::Kokkos::realloc(internal_view,unique_token.size(),n0,n1,n2,n3,n4,n5,n6);
}
protected:
template <typename ... Args>
KOKKOS_FORCEINLINE_FUNCTION
original_reference_type at(int rank, Args ... args) const {
return internal_view(rank, args...);
}
protected:
typedef Kokkos::Experimental::UniqueToken<
ExecSpace, Kokkos::Experimental::UniqueTokenScope::Global> unique_token_type;
unique_token_type unique_token;
internal_view_type internal_view;
};
template <typename DataType
,int Op
,typename ExecSpace
,int contribution
>
class ScatterView<DataType
,Kokkos::LayoutLeft
,ExecSpace
,Op
,ScatterDuplicated
,contribution>
{
public:
typedef Kokkos::View<DataType, Kokkos::LayoutLeft, ExecSpace> original_view_type;
typedef typename original_view_type::value_type original_value_type;
typedef typename original_view_type::reference_type original_reference_type;
friend class ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutLeft, ScatterDuplicated, contribution, ScatterNonAtomic>;
friend class ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutLeft, ScatterDuplicated, contribution, ScatterAtomic>;
typedef typename Kokkos::Impl::Experimental::DuplicatedDataType<DataType, Kokkos::LayoutLeft> data_type_info;
typedef typename data_type_info::value_type internal_data_type;
typedef Kokkos::View<internal_data_type, Kokkos::LayoutLeft, ExecSpace> internal_view_type;
ScatterView()
{
}
template <typename RT, typename ... RP >
ScatterView(View<RT, RP...> const& original_view)
: unique_token()
{
size_t arg_N[8] = {
original_view.extent(0),
original_view.extent(1),
original_view.extent(2),
original_view.extent(3),
original_view.extent(4),
original_view.extent(5),
original_view.extent(6),
0
};
arg_N[internal_view_type::rank - 1] = unique_token.size();
internal_view = internal_view_type(
Kokkos::ViewAllocateWithoutInitializing(
std::string("duplicated_") + original_view.label()),
arg_N[0], arg_N[1], arg_N[2], arg_N[3],
arg_N[4], arg_N[5], arg_N[6], arg_N[7]);
reset();
}
template <typename ... Dims>
ScatterView(std::string const& name, Dims ... dims)
: internal_view(Kokkos::ViewAllocateWithoutInitializing(name), dims ..., unique_token.size())
{
reset();
}
template <int override_contribution = contribution>
inline
ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutLeft, ScatterDuplicated, contribution, override_contribution>
access() const {
return ScatterAccess<DataType, Op, ExecSpace, Kokkos::LayoutLeft, ScatterDuplicated, contribution, override_contribution>{*this};
}
typename Kokkos::Impl::Experimental::Slice<
Kokkos::LayoutLeft, internal_view_type::rank, internal_view_type>::value_type
subview() const
{
return Kokkos::Impl::Experimental::Slice<
Kokkos::LayoutLeft, internal_view_type::rank, internal_view_type>::get(internal_view, 0);
}
template <typename ... RP>
void contribute_into(View<DataType, RP...> const& dest) const
{
typedef View<DataType, RP...> dest_type;
static_assert(std::is_same<
typename dest_type::array_layout,
Kokkos::LayoutLeft>::value,
"ScatterView contribute destination has different layout");
static_assert(Kokkos::Impl::VerifyExecutionCanAccessMemorySpace<
typename ExecSpace::memory_space,
typename dest_type::memory_space>::value,
"ScatterView contribute destination memory space not accessible");
size_t strides[8];
internal_view.stride(strides);
size_t stride = strides[internal_view_type::rank - 1];
auto extent = internal_view.extent(
internal_view_type::rank - 1);
bool is_equal = (dest.data() == internal_view.data());
size_t start = is_equal ? 1 : 0;
Kokkos::Impl::Experimental::ReduceDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
dest.data(),
stride,
start,
extent,
internal_view.label());
}
void reset() {
Kokkos::Impl::Experimental::ResetDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data(),
internal_view.size(),
internal_view.label());
}
template <typename DT, typename ... RP>
void reset_except(View<DT, RP...> const& view) {
if (view.data() != internal_view.data()) {
reset();
return;
}
Kokkos::Impl::Experimental::ResetDuplicates<ExecSpace, original_value_type, Op>(
internal_view.data() + view.size(),
internal_view.size() - view.size(),
internal_view.label());
}
void resize(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0) {
size_t arg_N[8] = {n0,n1,n2,n3,n4,n5,n6,0};
const int i = internal_view.rank-1;
arg_N[i] = unique_token.size();
::Kokkos::resize(internal_view,
arg_N[0], arg_N[1], arg_N[2], arg_N[3],
arg_N[4], arg_N[5], arg_N[6], arg_N[7]);
}
void realloc(const size_t n0 = 0,
const size_t n1 = 0,
const size_t n2 = 0,
const size_t n3 = 0,
const size_t n4 = 0,
const size_t n5 = 0,
const size_t n6 = 0) {
size_t arg_N[8] = {n0,n1,n2,n3,n4,n5,n6,0};
const int i = internal_view.rank-1;
arg_N[i] = unique_token.size();
::Kokkos::realloc(internal_view,
arg_N[0], arg_N[1], arg_N[2], arg_N[3],
arg_N[4], arg_N[5], arg_N[6], arg_N[7]);
}
protected:
template <typename ... Args>
inline original_reference_type at(int thread_id, Args ... args) const {
return internal_view(args..., thread_id);
}
protected:
typedef Kokkos::Experimental::UniqueToken<
ExecSpace, Kokkos::Experimental::UniqueTokenScope::Global> unique_token_type;
unique_token_type unique_token;
internal_view_type internal_view;
};
/* This object has to be separate in order to store the thread ID, which cannot
be obtained until one is inside a parallel construct, and may be relatively
expensive to obtain at every contribution
(calls a non-inlined function, looks up a thread-local variable).
Due to the expense, it is sensible to query it at most once per parallel iterate
(ideally once per thread, but parallel_for doesn't expose that)
and then store it in a stack variable.
ScatterAccess serves as a non-const object on the stack which can store the thread ID */
template <typename DataType
,int Op
,typename ExecSpace
,typename Layout
,int contribution
,int override_contribution
>
class ScatterAccess<DataType
,Op
,ExecSpace
,Layout
,ScatterDuplicated
,contribution
,override_contribution>
{
public:
typedef ScatterView<DataType, Layout, ExecSpace, Op, ScatterDuplicated, contribution> view_type;
typedef typename view_type::original_value_type original_value_type;
typedef Kokkos::Impl::Experimental::ScatterValue<
original_value_type, Op, override_contribution> value_type;
inline ScatterAccess(view_type const& view_in)
: view(view_in)
, thread_id(view_in.unique_token.acquire()) {
}
inline ~ScatterAccess() {
if (thread_id != ~thread_id_type(0)) view.unique_token.release(thread_id);
}
template <typename ... Args>
KOKKOS_FORCEINLINE_FUNCTION
value_type operator()(Args ... args) const {
return view.at(thread_id, args...);
}
template <typename Arg>
KOKKOS_FORCEINLINE_FUNCTION
typename std::enable_if<view_type::original_view_type::rank == 1 &&
std::is_integral<Arg>::value, value_type>::type
operator[](Arg arg) const {
return view.at(thread_id, arg);
}
private:
view_type const& view;
// simplify RAII by disallowing copies
ScatterAccess(ScatterAccess const& other) = delete;
ScatterAccess& operator=(ScatterAccess const& other) = delete;
ScatterAccess& operator=(ScatterAccess&& other) = delete;
public:
// moves do need to be allowed, though, for the common
//   auto b = a.access();
// pattern: that assignment turns into a move-constructor call
inline ScatterAccess(ScatterAccess&& other)
: view(other.view)
, thread_id(other.thread_id)
{
other.thread_id = ~thread_id_type(0);
}
private:
typedef typename view_type::unique_token_type unique_token_type;
typedef typename unique_token_type::size_type thread_id_type;
thread_id_type thread_id;
};
template <int Op = Kokkos::Experimental::ScatterSum,
int duplication = -1,
int contribution = -1,
typename RT, typename ... RP>
ScatterView
< RT
, typename ViewTraits<RT, RP...>::array_layout
, typename ViewTraits<RT, RP...>::execution_space
, Op
/* The expressions below only compute defaults when none were specified; they are
   awkward because the view type comes after the duplication/contribution
   settings in the template parameter list and so cannot be used in their defaults */
, duplication == -1 ? Kokkos::Impl::Experimental::DefaultDuplication<typename ViewTraits<RT, RP...>::execution_space>::value : duplication
, contribution == -1 ?
Kokkos::Impl::Experimental::DefaultContribution<
typename ViewTraits<RT, RP...>::execution_space,
(duplication == -1 ?
Kokkos::Impl::Experimental::DefaultDuplication<
typename ViewTraits<RT, RP...>::execution_space
>::value
: duplication
)
>::value
: contribution
>
create_scatter_view(View<RT, RP...> const& original_view) {
return original_view; // implicit ScatterView constructor call
}
}} // namespace Kokkos::Experimental
namespace Kokkos {
namespace Experimental {
template <typename DT1, typename DT2, typename LY, typename ES, int OP, int CT, int DP, typename ... VP>
void
contribute(View<DT1, VP...>& dest, Kokkos::Experimental::ScatterView<DT2, LY, ES, OP, CT, DP> const& src)
{
src.contribute_into(dest);
}
}} // namespace Kokkos::Experimental
namespace Kokkos {
template <typename DT, typename LY, typename ES, int OP, int CT, int DP, typename ... IS>
void
realloc(Kokkos::Experimental::ScatterView<DT, LY, ES, OP, CT, DP>& scatter_view, IS ... is)
{
scatter_view.realloc(is ...);
}
template <typename DT, typename LY, typename ES, int OP, int CT, int DP, typename ... IS>
void
resize(Kokkos::Experimental::ScatterView<DT, LY, ES, OP, CT, DP>& scatter_view, IS ... is)
{
scatter_view.resize(is ...);
}
} // namespace Kokkos
#endif

View File

@@ -517,7 +517,7 @@ public:
size_type find_attempts = 0;
enum { bounded_find_attempts = 32u };
enum : unsigned { bounded_find_attempts = 32u };
const size_type max_attempts = (m_bounded_insert && (bounded_find_attempts < m_available_indexes.max_hint()) ) ?
bounded_find_attempts :
m_available_indexes.max_hint();

View File

@@ -56,11 +56,12 @@
template< class Scalar, class Arg1Type = void>
class vector : public DualView<Scalar*,LayoutLeft,Arg1Type> {
public:
typedef Scalar value_type;
typedef Scalar* pointer;
typedef const Scalar* const_pointer;
typedef Scalar* reference;
typedef const Scalar* const_reference;
typedef Scalar& reference;
typedef const Scalar& const_reference;
typedef Scalar* iterator;
typedef const Scalar* const_iterator;
@@ -73,11 +74,11 @@ private:
public:
#ifdef KOKKOS_ENABLE_CUDA_UVM
KOKKOS_INLINE_FUNCTION Scalar& operator() (int i) const {return DV::h_view(i);};
KOKKOS_INLINE_FUNCTION Scalar& operator[] (int i) const {return DV::h_view(i);};
KOKKOS_INLINE_FUNCTION reference operator() (int i) const {return DV::h_view(i);};
KOKKOS_INLINE_FUNCTION reference operator[] (int i) const {return DV::h_view(i);};
#else
inline Scalar& operator() (int i) const {return DV::h_view(i);};
inline Scalar& operator[] (int i) const {return DV::h_view(i);};
inline reference operator() (int i) const {return DV::h_view(i);};
inline reference operator[] (int i) const {return DV::h_view(i);};
#endif
/* Member functions which behave like std::vector functions */
@@ -86,7 +87,7 @@ public:
_size = 0;
_extra_storage = 1.1;
DV::modified_host() = 1;
};
}
vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Arg1Type>("Vector",size_t(n*(1.1))) {
@@ -146,25 +147,32 @@ public:
DV::h_view(_size) = val;
_size++;
};
}
void pop_back() {
_size--;
};
}
void clear() {
_size = 0;
}
size_type size() const {return _size;};
size_type size() const {return _size;}
size_type max_size() const {return 2000000000;}
size_type capacity() const {return DV::capacity();};
bool empty() const {return _size==0;};
size_type capacity() const {return DV::capacity();}
bool empty() const {return _size==0;}
iterator begin() const {return &DV::h_view(0);};
iterator begin() const {return &DV::h_view(0);}
iterator end() const {return &DV::h_view(_size);};
iterator end() const {return &DV::h_view(_size);}
reference front() {return DV::h_view(0);}
reference back() {return DV::h_view(_size - 1);}
const_reference front() const {return DV::h_view(0);}
const_reference back() const {return DV::h_view(_size - 1);}
/* std::algorithms which originally work with iterators; here they are implemented as member functions */

View File

@@ -3,7 +3,13 @@ INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src )
SET(LIBRARIES kokkoscore)
IF(NOT KOKKOS_HAS_TRILINOS)
IF(KOKKOS_SEPARATE_LIBS)
set(TEST_LINK_TARGETS kokkoscore)
ELSE()
set(TEST_LINK_TARGETS kokkos)
ENDIF()
ENDIF()
IF(Kokkos_ENABLE_Pthread)
TRIBITS_ADD_EXECUTABLE_AND_TEST(
@@ -12,7 +18,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
COMM serial mpi
NUM_MPI_PROCS 1
FAIL_REGULAR_EXPRESSION " FAILED "
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
ENDIF()
@@ -23,7 +29,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
COMM serial mpi
NUM_MPI_PROCS 1
FAIL_REGULAR_EXPRESSION " FAILED "
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
ENDIF()
@@ -34,7 +40,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
COMM serial mpi
NUM_MPI_PROCS 1
FAIL_REGULAR_EXPRESSION " FAILED "
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
ENDIF()
@@ -45,7 +51,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
COMM serial mpi
NUM_MPI_PROCS 1
FAIL_REGULAR_EXPRESSION " FAILED "
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
ENDIF()

View File

@@ -15,7 +15,8 @@ endif
CXXFLAGS = -O3
LINK ?= $(CXX)
LDFLAGS ?= -lpthread
LDFLAGS ?=
override LDFLAGS += -lpthread
include $(KOKKOS_PATH)/Makefile.kokkos
@@ -30,6 +31,12 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
TEST_TARGETS += test-cuda
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_ROCM = TestROCm.o UnitTestMain.o gtest-all.o
TARGETS += KokkosContainers_UnitTest_ROCm
TEST_TARGETS += test-rocm
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
OBJ_THREADS = TestThreads.o UnitTestMain.o gtest-all.o
TARGETS += KokkosContainers_UnitTest_Threads
@@ -51,6 +58,9 @@ endif
KokkosContainers_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosContainers_UnitTest_Cuda
KokkosContainers_UnitTest_ROCm: $(OBJ_ROCM) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_ROCM) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosContainers_UnitTest_ROCm
KokkosContainers_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosContainers_UnitTest_Threads
@@ -63,6 +73,9 @@ KokkosContainers_UnitTest_Serial: $(OBJ_SERIAL) $(KOKKOS_LINK_DEPENDS)
test-cuda: KokkosContainers_UnitTest_Cuda
./KokkosContainers_UnitTest_Cuda
test-rocm: KokkosContainers_UnitTest_ROCm
./KokkosContainers_UnitTest_ROCm
test-threads: KokkosContainers_UnitTest_Threads
./KokkosContainers_UnitTest_Threads

View File

@@ -62,6 +62,7 @@
#include <TestVector.hpp>
#include <TestDualView.hpp>
#include <TestDynamicView.hpp>
#include <TestScatterView.hpp>
#include <Kokkos_DynRankView.hpp>
#include <TestDynViewAPI.hpp>
@@ -201,10 +202,18 @@ void cuda_test_bitset()
cuda_test_dualview_combinations(size); \
}
#define CUDA_SCATTERVIEW_TEST( size ) \
TEST_F( cuda, scatterview_##size##x) { \
test_scatter_view<Kokkos::Cuda>(size); \
}
CUDA_DUALVIEW_COMBINE_TEST( 10 )
CUDA_VECTOR_COMBINE_TEST( 10 )
CUDA_VECTOR_COMBINE_TEST( 3057 )
CUDA_SCATTERVIEW_TEST( 10 )
CUDA_SCATTERVIEW_TEST( 1000000 )
CUDA_INSERT_TEST(close, 100000, 90000, 100, 500)
CUDA_INSERT_TEST(far, 100000, 90000, 100, 500)

View File

@@ -131,11 +131,14 @@ struct TestDynamicView
// printf("TestDynamicView::run(%d) construct memory pool\n",arg_total_size);
const size_t total_alloc_size = arg_total_size * sizeof(Scalar) * 1.2 ;
const size_t superblock = std::min( total_alloc_size , size_t(1000000) );
memory_pool_type pool( memory_space()
, arg_total_size * sizeof(Scalar) * 1.2
, total_alloc_size
, 500 /* min block size in bytes */
, 30000 /* max block size in bytes */
, 1000000 /* min superblock size in bytes */
, superblock
);
// printf("TestDynamicView::run(%d) construct dynamic view\n",arg_total_size);

View File

@@ -63,6 +63,8 @@
#include <Kokkos_DynRankView.hpp>
#include <TestDynViewAPI.hpp>
#include <TestScatterView.hpp>
#include <Kokkos_ErrorReporter.hpp>
#include <TestErrorReporter.hpp>
@@ -152,6 +154,11 @@ TEST_F( openmp , staticcrsgraph )
test_dualview_combinations<int,Kokkos::OpenMP>(size); \
}
#define OPENMP_SCATTERVIEW_TEST( size ) \
TEST_F( openmp, scatterview_##size##x) { \
test_scatter_view<Kokkos::OpenMP>(size); \
}
OPENMP_INSERT_TEST(close, 100000, 90000, 100, 500, true)
OPENMP_INSERT_TEST(far, 100000, 90000, 100, 500, false)
OPENMP_FAILED_INSERT_TEST( 10000, 1000 )
@@ -161,6 +168,10 @@ OPENMP_VECTOR_COMBINE_TEST( 10 )
OPENMP_VECTOR_COMBINE_TEST( 3057 )
OPENMP_DUALVIEW_COMBINE_TEST( 10 )
OPENMP_SCATTERVIEW_TEST( 10 )
OPENMP_SCATTERVIEW_TEST( 1000000 )
#undef OPENMP_INSERT_TEST
#undef OPENMP_FAILED_INSERT_TEST
#undef OPENMP_ASSIGNEMENT_TEST

View File

@@ -0,0 +1,263 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#include <Kokkos_Macros.hpp>
#ifdef KOKKOS_ENABLE_ROCM
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <gtest/gtest.h>
#include <Kokkos_Core.hpp>
#include <Kokkos_Bitset.hpp>
#include <Kokkos_UnorderedMap.hpp>
#include <Kokkos_Vector.hpp>
#include <TestBitset.hpp>
#include <TestUnorderedMap.hpp>
#include <TestStaticCrsGraph.hpp>
#include <TestVector.hpp>
#include <TestDualView.hpp>
#include <TestDynamicView.hpp>
#include <Kokkos_DynRankView.hpp>
#include <TestDynViewAPI.hpp>
#include <Kokkos_ErrorReporter.hpp>
#include <TestErrorReporter.hpp>
#include <TestViewCtorPropEmbeddedDim.hpp>
//----------------------------------------------------------------------------
namespace Test {
class rocm : public ::testing::Test {
protected:
static void SetUpTestCase()
{
std::cout << std::setprecision(5) << std::scientific;
Kokkos::HostSpace::execution_space::initialize();
Kokkos::Experimental::ROCm::initialize( Kokkos::Experimental::ROCm::SelectDevice(0) );
}
static void TearDownTestCase()
{
Kokkos::Experimental::ROCm::finalize();
Kokkos::HostSpace::execution_space::finalize();
}
};
#if !defined(KOKKOS_ENABLE_ROCM)
//issue 964
TEST_F( rocm , dyn_view_api) {
TestDynViewAPI< double , Kokkos::Experimental::ROCm >();
}
#endif
TEST_F( rocm, viewctorprop_embedded_dim ) {
TestViewCtorProp_EmbeddedDim< Kokkos::Experimental::ROCm >::test_vcpt( 2, 3 );
}
TEST_F( rocm , staticcrsgraph )
{
TestStaticCrsGraph::run_test_graph< Kokkos::Experimental::ROCm >();
TestStaticCrsGraph::run_test_graph2< Kokkos::Experimental::ROCm >();
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(1, 0);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(1, 1000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(1, 10000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(1, 100000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(3, 0);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(3, 1000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(3, 10000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(3, 100000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(75, 0);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(75, 1000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(75, 10000);
TestStaticCrsGraph::run_test_graph3< Kokkos::Experimental::ROCm >(75, 100000);
}
#if !defined(KOKKOS_ENABLE_ROCM)
// issue 1089
// same as 130203 (MemPool, static member function link issue)
void rocm_test_insert_close( uint32_t num_nodes
, uint32_t num_inserts
, uint32_t num_duplicates
)
{
test_insert< Kokkos::Experimental::ROCm >( num_nodes, num_inserts, num_duplicates, true);
}
// hcc link error , Referencing function in another module!
void rocm_test_insert_far( uint32_t num_nodes
, uint32_t num_inserts
, uint32_t num_duplicates
)
{
test_insert< Kokkos::Experimental::ROCm >( num_nodes, num_inserts, num_duplicates, false);
}
void rocm_test_failed_insert( uint32_t num_nodes )
{
test_failed_insert< Kokkos::Experimental::ROCm >( num_nodes );
}
void rocm_test_deep_copy( uint32_t num_nodes )
{
test_deep_copy< Kokkos::Experimental::ROCm >( num_nodes );
}
void rocm_test_vector_combinations(unsigned int size)
{
test_vector_combinations<int,Kokkos::Experimental::ROCm>(size);
}
void rocm_test_dualview_combinations(unsigned int size)
{
test_dualview_combinations<int,Kokkos::Experimental::ROCm>(size);
}
void rocm_test_bitset()
{
test_bitset<Kokkos::Experimental::ROCm>();
}
/*TEST_F( rocm, bitset )
{
rocm_test_bitset();
}*/
#define ROCM_INSERT_TEST( name, num_nodes, num_inserts, num_duplicates, repeat ) \
TEST_F( rocm, UnorderedMap_insert_##name##_##num_nodes##_##num_inserts##_##num_duplicates##_##repeat##x) { \
for (int i=0; i<repeat; ++i) \
rocm_test_insert_##name(num_nodes,num_inserts,num_duplicates); \
}
#define ROCM_FAILED_INSERT_TEST( num_nodes, repeat ) \
TEST_F( rocm, UnorderedMap_failed_insert_##num_nodes##_##repeat##x) { \
for (int i=0; i<repeat; ++i) \
rocm_test_failed_insert(num_nodes); \
}
#define ROCM_ASSIGNEMENT_TEST( num_nodes, repeat ) \
TEST_F( rocm, UnorderedMap_assignment_operators_##num_nodes##_##repeat##x) { \
for (int i=0; i<repeat; ++i) \
rocm_test_assignment_operators(num_nodes); \
}
#define ROCM_DEEP_COPY( num_nodes, repeat ) \
TEST_F( rocm, UnorderedMap_deep_copy##num_nodes##_##repeat##x) { \
for (int i=0; i<repeat; ++i) \
rocm_test_deep_copy(num_nodes); \
}
#define ROCM_VECTOR_COMBINE_TEST( size ) \
TEST_F( rocm, vector_combination##size##x) { \
rocm_test_vector_combinations(size); \
}
#define ROCM_DUALVIEW_COMBINE_TEST( size ) \
TEST_F( rocm, dualview_combination##size##x) { \
rocm_test_dualview_combinations(size); \
}
//ROCM_DUALVIEW_COMBINE_TEST( 10 )
//ROCM_VECTOR_COMBINE_TEST( 10 )
//ROCM_VECTOR_COMBINE_TEST( 3057 )
//ROCM_INSERT_TEST(close, 100000, 90000, 100, 500)
//ROCM_INSERT_TEST(far, 100000, 90000, 100, 500)
//ROCM_DEEP_COPY( 10000, 1 )
//ROCM_FAILED_INSERT_TEST( 10000, 1000 )
#undef ROCM_INSERT_TEST
#undef ROCM_FAILED_INSERT_TEST
#undef ROCM_ASSIGNEMENT_TEST
#undef ROCM_DEEP_COPY
#undef ROCM_VECTOR_COMBINE_TEST
#undef ROCM_DUALVIEW_COMBINE_TEST
#endif
#if !defined(KOKKOS_ENABLE_ROCM)
//static member function issue
TEST_F( rocm , dynamic_view )
{
// typedef TestDynamicView< double , Kokkos::ROCmUVMSpace >
typedef TestDynamicView< double , Kokkos::Experimental::ROCmSpace >
TestDynView ;
for ( int i = 0 ; i < 10 ; ++i ) {
TestDynView::run( 100000 + 100 * i );
}
}
#endif
#if defined(KOKKOS_CLASS_LAMBDA)
TEST_F(rocm, ErrorReporterViaLambda)
{
TestErrorReporter<ErrorReporterDriverUseLambda<Kokkos::Experimental::ROCm>>();
}
#endif
TEST_F(rocm, ErrorReporter)
{
TestErrorReporter<ErrorReporterDriver<Kokkos::Experimental::ROCm>>();
}
}
#else
void KOKKOS_CONTAINERS_UNIT_TESTS_TESTROCM_PREVENT_EMPTY_LINK_ERROR() {}
#endif /* #ifdef KOKKOS_ENABLE_ROCM */

View File

@@ -0,0 +1,156 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_TEST_SCATTER_VIEW_HPP
#define KOKKOS_TEST_SCATTER_VIEW_HPP
#include <Kokkos_ScatterView.hpp>
namespace Test {
template <typename ExecSpace, typename Layout, int duplication, int contribution>
void test_scatter_view_config(int n)
{
Kokkos::View<double *[3], Layout, ExecSpace> original_view("original_view", n);
{
auto scatter_view = Kokkos::Experimental::create_scatter_view
< Kokkos::Experimental::ScatterSum
, duplication
, contribution
> (original_view);
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
auto policy = Kokkos::RangePolicy<ExecSpace, int>(0, n);
auto f = KOKKOS_LAMBDA(int i) {
auto scatter_access = scatter_view.access();
auto scatter_access_atomic = scatter_view.template access<Kokkos::Experimental::ScatterAtomic>();
for (int j = 0; j < 10; ++j) {
auto k = (i + j) % n;
scatter_access(k, 0) += 4.2;
scatter_access_atomic(k, 1) += 2.0;
scatter_access(k, 2) += 1.0;
}
};
Kokkos::parallel_for(policy, f, "scatter_view_test");
#endif
Kokkos::Experimental::contribute(original_view, scatter_view);
scatter_view.reset_except(original_view);
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
Kokkos::parallel_for(policy, f, "scatter_view_test");
#endif
Kokkos::Experimental::contribute(original_view, scatter_view);
}
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
auto host_view = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), original_view);
for (typename decltype(host_view)::size_type i = 0; i < host_view.dimension_0(); ++i) {
auto val0 = host_view(i, 0);
auto val1 = host_view(i, 1);
auto val2 = host_view(i, 2);
EXPECT_TRUE(std::fabs((val0 - 84.0) / 84.0) < 1e-15);
EXPECT_TRUE(std::fabs((val1 - 40.0) / 40.0) < 1e-15);
EXPECT_TRUE(std::fabs((val2 - 20.0) / 20.0) < 1e-15);
}
#endif
{
Kokkos::Experimental::ScatterView
< double*[3]
, Layout
, ExecSpace
, Kokkos::Experimental::ScatterSum
, duplication
, contribution
>
persistent_view("persistent", n);
auto result_view = persistent_view.subview();
contribute(result_view, persistent_view);
}
}
template <typename ExecSpace>
struct TestDuplicatedScatterView {
TestDuplicatedScatterView(int n) {
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
}
};
#ifdef KOKKOS_ENABLE_CUDA
// disable duplicated instantiation with CUDA until
// UniqueToken can support it
template <>
struct TestDuplicatedScatterView<Kokkos::Cuda> {
TestDuplicatedScatterView(int) {
}
};
#endif
template <typename ExecSpace>
void test_scatter_view(int n)
{
// all of these configurations should compile okay, but only some of them are
// correct and/or sensible in terms of memory use
Kokkos::Experimental::UniqueToken<ExecSpace> unique_token{ExecSpace()};
// using neither atomics nor duplication is only sensible if the execution
// space runs essentially serially (it doesn't have to be Kokkos::Serial,
// though; we also test OpenMP with one thread, since LAMMPS relies on that)
if (unique_token.size() == 1) {
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
}
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
TestDuplicatedScatterView<ExecSpace> duptest(n);
}
} // namespace Test
#endif // KOKKOS_TEST_SCATTER_VIEW_HPP

View File

@@ -58,6 +58,7 @@
#include <TestVector.hpp>
#include <TestDualView.hpp>
#include <TestDynamicView.hpp>
#include <TestScatterView.hpp>
#include <iomanip>
@@ -148,6 +149,11 @@ TEST_F( serial, bitset )
test_dualview_combinations<int,Kokkos::Serial>(size); \
}
#define SERIAL_SCATTERVIEW_TEST( size ) \
TEST_F( serial, scatterview_##size##x) { \
test_scatter_view<Kokkos::Serial>(size); \
}
SERIAL_INSERT_TEST(close, 100000, 90000, 100, 500, true)
SERIAL_INSERT_TEST(far, 100000, 90000, 100, 500, false)
SERIAL_FAILED_INSERT_TEST( 10000, 1000 )
@@ -157,6 +163,10 @@ SERIAL_VECTOR_COMBINE_TEST( 10 )
SERIAL_VECTOR_COMBINE_TEST( 3057 )
SERIAL_DUALVIEW_COMBINE_TEST( 10 )
SERIAL_SCATTERVIEW_TEST( 10 )
SERIAL_SCATTERVIEW_TEST( 1000000 )
#undef SERIAL_INSERT_TEST
#undef SERIAL_FAILED_INSERT_TEST
#undef SERIAL_ASSIGNEMENT_TEST

View File

@@ -71,7 +71,7 @@ void run_test_graph()
}
dx = Kokkos::create_staticcrsgraph<dView>( "dx" , graph );
hx = Kokkos::create_mirror( dx );
hx = Kokkos::create_mirror( dx );
ASSERT_EQ( hx.row_map.dimension_0() - 1 , LENGTH );
@@ -83,6 +83,16 @@ void run_test_graph()
ASSERT_EQ( (int) hx.entries( j + begin ) , graph[i][j] );
}
}
// Test row view access
for ( size_t i = 0 ; i < LENGTH ; ++i ) {
auto rowView = hx.rowConst(i);
ASSERT_EQ( rowView.length, graph[i].size() );
for ( size_t j = 0 ; j < rowView.length ; ++j ) {
ASSERT_EQ( rowView.colidx( j ) , graph[i][j] );
ASSERT_EQ( rowView( j ) , graph[i][j] );
}
}
}
template< class Space >
@@ -182,5 +192,6 @@ void run_test_graph3(size_t B, size_t N)
ASSERT_FALSE((ne>2*((hx.row_map(hx.numRows())+C*hx.numRows())/B))&&(hx.row_block_offsets(i+1)>hx.row_block_offsets(i)+1));
}
}
} /* namespace TestStaticCrsGraph */

View File

@@ -2,7 +2,9 @@
TRIBITS_SUBPACKAGE(Core)
ADD_SUBDIRECTORY(src)
IF(KOKKOS_HAS_TRILINOS)
ADD_SUBDIRECTORY(src)
ENDIF()
TRIBITS_ADD_TEST_DIRECTORIES(unit_test)
TRIBITS_ADD_TEST_DIRECTORIES(perf_test)

View File

@@ -2,6 +2,14 @@
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
IF(NOT KOKKOS_HAS_TRILINOS)
IF(KOKKOS_SEPARATE_LIBS)
set(TEST_LINK_TARGETS kokkoscore)
ELSE()
set(TEST_LINK_TARGETS kokkos)
ENDIF()
ENDIF()
# warning: PerfTest_CustomReduction.cpp uses
# ../../algorithms/src/Kokkos_Random.hpp
# we'll just allow it to be included, but note
@@ -23,7 +31,7 @@ TRIBITS_ADD_EXECUTABLE(
PerfTestExec
SOURCES ${SOURCES}
COMM serial mpi
TESTONLYLIBS kokkos_gtest
TESTONLYLIBS kokkos_gtest ${TEST_LINK_TARGETS}
)
TRIBITS_ADD_TEST(

View File

@@ -17,7 +17,8 @@ endif
CXXFLAGS = -O3
#CXXFLAGS += -DGENERIC_REDUCER
LINK ?= $(CXX)
LDFLAGS ?= -lpthread
LDFLAGS ?=
override LDFLAGS += -lpthread
include $(KOKKOS_PATH)/Makefile.kokkos
@@ -43,6 +44,7 @@ TEST_TARGETS += test-atomic
#
ifneq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_MEMPOOL = test_mempool.o
TARGETS += KokkosCore_PerformanceTest_Mempool
TEST_TARGETS += test-mempool
@@ -52,6 +54,7 @@ TEST_TARGETS += test-mempool
OBJ_TASKDAG = test_taskdag.o
TARGETS += KokkosCore_PerformanceTest_TaskDAG
TEST_TARGETS += test-taskdag
endif
#

View File

@@ -1,15 +1,4 @@
TRIBITS_ADD_OPTION_AND_DEFINE(
Kokkos_ENABLE_Serial
KOKKOS_HAVE_SERIAL
"Whether to enable the Kokkos::Serial device. This device executes \"parallel\" kernels sequentially on a single CPU thread. It is enabled by default. If you disable this device, please enable at least one other CPU device, such as Kokkos::OpenMP or Kokkos::Threads."
ON
)
ASSERT_DEFINED(${PROJECT_NAME}_ENABLE_CXX11)
ASSERT_DEFINED(${PACKAGE_NAME}_ENABLE_CUDA)
TRIBITS_CONFIGURE_FILE(${PACKAGE_NAME}_config.h)
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR})
@@ -20,68 +9,90 @@ SET(TRILINOS_INCDIR ${CMAKE_INSTALL_PREFIX}/${${PROJECT_NAME}_INSTALL_INCLUDE_DI
#-----------------------------------------------------------------------------
SET(HEADERS_PUBLIC "")
SET(HEADERS_PRIVATE "")
SET(SOURCES "")
IF(KOKKOS_LEGACY_TRIBITS)
FILE(GLOB HEADERS_PUBLIC Kokkos*.hpp)
LIST( APPEND HEADERS_PUBLIC ${CMAKE_CURRENT_BINARY_DIR}/${PACKAGE_NAME}_config.h )
ASSERT_DEFINED(${PROJECT_NAME}_ENABLE_CXX11)
ASSERT_DEFINED(${PACKAGE_NAME}_ENABLE_CUDA)
SET(HEADERS_PUBLIC "")
SET(HEADERS_PRIVATE "")
SET(SOURCES "")
FILE(GLOB HEADERS_PUBLIC Kokkos*.hpp)
LIST( APPEND HEADERS_PUBLIC ${CMAKE_BINARY_DIR}/${PACKAGE_NAME}_config.h )
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_IMPL impl/*.hpp)
FILE(GLOB SOURCES_IMPL impl/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_IMPL} )
LIST(APPEND SOURCES ${SOURCES_IMPL} )
INSTALL(FILES ${HEADERS_IMPL} DESTINATION ${TRILINOS_INCDIR}/impl/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_THREADS Threads/*.hpp)
FILE(GLOB SOURCES_THREADS Threads/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_THREADS} )
LIST(APPEND SOURCES ${SOURCES_THREADS} )
INSTALL(FILES ${HEADERS_THREADS} DESTINATION ${TRILINOS_INCDIR}/Threads/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_OPENMP OpenMP/*.hpp)
FILE(GLOB SOURCES_OPENMP OpenMP/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_OPENMP} )
LIST(APPEND SOURCES ${SOURCES_OPENMP} )
INSTALL(FILES ${HEADERS_OPENMP} DESTINATION ${TRILINOS_INCDIR}/OpenMP/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_CUDA Cuda/*.hpp)
FILE(GLOB SOURCES_CUDA Cuda/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_CUDA} )
LIST(APPEND SOURCES ${SOURCES_CUDA} )
INSTALL(FILES ${HEADERS_CUDA} DESTINATION ${TRILINOS_INCDIR}/Cuda/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_QTHREADS Qthreads/*.hpp)
FILE(GLOB SOURCES_QTHREADS Qthreads/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_QTHREADS} )
LIST(APPEND SOURCES ${SOURCES_QTHREADS} )
INSTALL(FILES ${HEADERS_QTHREADS} DESTINATION ${TRILINOS_INCDIR}/Qthreads/)
TRIBITS_ADD_LIBRARY(
kokkoscore
HEADERS ${HEADERS_PUBLIC}
NOINSTALLHEADERS ${HEADERS_PRIVATE}
SOURCES ${SOURCES}
DEPLIBS
)
#-----------------------------------------------------------------------------
# In the new build system, sources are calculated by Makefile.kokkos
else()
FILE(GLOB HEADERS_IMPL impl/*.hpp)
FILE(GLOB SOURCES_IMPL impl/*.cpp)
INSTALL (DIRECTORY
"${CMAKE_CURRENT_SOURCE_DIR}/"
DESTINATION ${TRILINOS_INCDIR}
FILES_MATCHING PATTERN "*.hpp"
)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_IMPL} )
LIST(APPEND SOURCES ${SOURCES_IMPL} )
INSTALL(FILES ${HEADERS_IMPL} DESTINATION ${TRILINOS_INCDIR}/impl/)
TRIBITS_ADD_LIBRARY(
kokkoscore
SOURCES ${KOKKOS_CORE_SRCS}
DEPLIBS
)
endif()
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_THREADS Threads/*.hpp)
FILE(GLOB SOURCES_THREADS Threads/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_THREADS} )
LIST(APPEND SOURCES ${SOURCES_THREADS} )
INSTALL(FILES ${HEADERS_THREADS} DESTINATION ${TRILINOS_INCDIR}/Threads/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_OPENMP OpenMP/*.hpp)
FILE(GLOB SOURCES_OPENMP OpenMP/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_OPENMP} )
LIST(APPEND SOURCES ${SOURCES_OPENMP} )
INSTALL(FILES ${HEADERS_OPENMP} DESTINATION ${TRILINOS_INCDIR}/OpenMP/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_CUDA Cuda/*.hpp)
FILE(GLOB SOURCES_CUDA Cuda/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_CUDA} )
LIST(APPEND SOURCES ${SOURCES_CUDA} )
INSTALL(FILES ${HEADERS_CUDA} DESTINATION ${TRILINOS_INCDIR}/Cuda/)
#-----------------------------------------------------------------------------
FILE(GLOB HEADERS_QTHREADS Qthreads/*.hpp)
FILE(GLOB SOURCES_QTHREADS Qthreads/*.cpp)
LIST(APPEND HEADERS_PRIVATE ${HEADERS_QTHREADS} )
LIST(APPEND SOURCES ${SOURCES_QTHREADS} )
INSTALL(FILES ${HEADERS_QTHREADS} DESTINATION ${TRILINOS_INCDIR}/Qthreads/)
#-----------------------------------------------------------------------------
TRIBITS_ADD_LIBRARY(
kokkoscore
HEADERS ${HEADERS_PUBLIC}
NOINSTALLHEADERS ${HEADERS_PRIVATE}
SOURCES ${SOURCES}
DEPLIBS
)
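The per-backend pattern above (glob headers and sources, append to the private header and source lists, install the headers) is repeated verbatim for Threads, OpenMP, Cuda, and Qthreads. A hypothetical helper function could factor it out; `TRILINOS_INCDIR`, `HEADERS_PRIVATE`, and `SOURCES` are assumed to exist as in the listing above, and the function name is illustrative, not part of the actual build:

```cmake
# Hypothetical helper condensing the repeated per-backend block above.
function(kokkos_add_backend DIR)
  file(GLOB _hdrs ${DIR}/*.hpp)
  file(GLOB _srcs ${DIR}/*.cpp)
  list(APPEND HEADERS_PRIVATE ${_hdrs})
  list(APPEND SOURCES ${_srcs})
  # list(APPEND ...) works on a local copy; export the result to the caller.
  set(HEADERS_PRIVATE ${HEADERS_PRIVATE} PARENT_SCOPE)
  set(SOURCES ${SOURCES} PARENT_SCOPE)
  install(FILES ${_hdrs} DESTINATION ${TRILINOS_INCDIR}/${DIR}/)
endfunction()

kokkos_add_backend(Threads)
kokkos_add_backend(OpenMP)
```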


@@ -63,7 +63,7 @@
#include <typeinfo>
#endif
namespace Kokkos { namespace Experimental { namespace Impl {
namespace Kokkos { namespace Impl {
// ------------------------------------------------------------------ //
@@ -110,21 +110,12 @@ struct apply_impl<2,RP,Functor,void >
{
// LL
if (RP::inner_direction == RP::Left) {
/*
index_type offset_1 = blockIdx.y*m_rp.m_tile[1] + threadIdx.y;
index_type offset_0 = blockIdx.x*m_rp.m_tile[0] + threadIdx.x;
for ( index_type j = offset_1; j < m_rp.m_upper[1], threadIdx.y < m_rp.m_tile[1]; j += (gridDim.y*m_rp.m_tile[1]) ) {
for ( index_type i = offset_0; i < m_rp.m_upper[0], threadIdx.x < m_rp.m_tile[0]; i += (gridDim.x*m_rp.m_tile[0]) ) {
m_func(i, j);
} }
*/
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
m_func(offset_0 , offset_1);
}
@@ -134,21 +125,12 @@ struct apply_impl<2,RP,Functor,void >
}
// LR
else {
/*
index_type offset_1 = blockIdx.y*m_rp.m_tile[1] + threadIdx.y;
index_type offset_0 = blockIdx.x*m_rp.m_tile[0] + threadIdx.x;
for ( index_type i = offset_0; i < m_rp.m_upper[0], threadIdx.x < m_rp.m_tile[0]; i += (gridDim.x*m_rp.m_tile[0]) ) {
for ( index_type j = offset_1; j < m_rp.m_upper[1], threadIdx.y < m_rp.m_tile[1]; j += (gridDim.y*m_rp.m_tile[1]) ) {
m_func(i, j);
} }
*/
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
m_func(offset_0 , offset_1);
}
@@ -182,21 +164,12 @@ struct apply_impl<2,RP,Functor,Tag>
{
if (RP::inner_direction == RP::Left) {
// Loop over size maxnumblocks until full range covered
/*
index_type offset_1 = blockIdx.y*m_rp.m_tile[1] + threadIdx.y;
index_type offset_0 = blockIdx.x*m_rp.m_tile[0] + threadIdx.x;
for ( index_type j = offset_1; j < m_rp.m_upper[1], threadIdx.y < m_rp.m_tile[1]; j += (gridDim.y*m_rp.m_tile[1]) ) {
for ( index_type i = offset_0; i < m_rp.m_upper[0], threadIdx.x < m_rp.m_tile[0]; i += (gridDim.x*m_rp.m_tile[0]) ) {
m_func(Tag(), i, j);
} }
*/
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
m_func(Tag(), offset_0 , offset_1);
}
@@ -205,21 +178,12 @@ struct apply_impl<2,RP,Functor,Tag>
}
}
else {
/*
index_type offset_1 = blockIdx.y*m_rp.m_tile[1] + threadIdx.y;
index_type offset_0 = blockIdx.x*m_rp.m_tile[0] + threadIdx.x;
for ( index_type i = offset_0; i < m_rp.m_upper[0], threadIdx.x < m_rp.m_tile[0]; i += (gridDim.x*m_rp.m_tile[0]) ) {
for ( index_type j = offset_1; j < m_rp.m_upper[1], threadIdx.y < m_rp.m_tile[1]; j += (gridDim.y*m_rp.m_tile[1]) ) {
m_func(Tag(), i, j);
} }
*/
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
m_func(Tag(), offset_0 , offset_1);
}
@@ -255,15 +219,15 @@ struct apply_impl<3,RP,Functor,void >
// LL
if (RP::inner_direction == RP::Left) {
for ( index_type tile_id2 = blockIdx.z; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.z ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.z;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.z < m_rp.m_tile[2] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
m_func(offset_0 , offset_1 , offset_2);
}
@@ -276,15 +240,15 @@ struct apply_impl<3,RP,Functor,void >
// LR
else {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id2 = blockIdx.z; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.z ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.z;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.z < m_rp.m_tile[2] ) {
m_func(offset_0 , offset_1 , offset_2);
}
@@ -319,15 +283,15 @@ struct apply_impl<3,RP,Functor,Tag>
{
if (RP::inner_direction == RP::Left) {
for ( index_type tile_id2 = blockIdx.z; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.z ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.z;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.z < m_rp.m_tile[2] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
m_func(Tag(), offset_0 , offset_1 , offset_2);
}
@@ -339,15 +303,15 @@ struct apply_impl<3,RP,Functor,Tag>
}
else {
for ( index_type tile_id0 = blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + threadIdx.x;
const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && threadIdx.x < m_rp.m_tile[0] ) {
for ( index_type tile_id1 = blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + threadIdx.y;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && threadIdx.y < m_rp.m_tile[1] ) {
for ( index_type tile_id2 = blockIdx.z; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.z ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.z;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.z < m_rp.m_tile[2] ) {
m_func(Tag(), offset_0 , offset_1 , offset_2);
}
@@ -398,19 +362,19 @@ struct apply_impl<4,RP,Functor,void >
const index_type thr_id1 = threadIdx.x / m_rp.m_tile[0];
for ( index_type tile_id3 = blockIdx.z; tile_id3 < m_rp.m_tile_end[3]; tile_id3 += gridDim.z ) {
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + threadIdx.z;
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && threadIdx.z < m_rp.m_tile[3] ) {
for ( index_type tile_id2 = blockIdx.y; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.y ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.y;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.y < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1 ; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0 ; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3);
}
@@ -436,19 +400,19 @@ struct apply_impl<4,RP,Functor,void >
const index_type thr_id1 = threadIdx.x % m_rp.m_tile[1];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type tile_id2 = blockIdx.y; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.y ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.y;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.y < m_rp.m_tile[2] ) {
for ( index_type tile_id3 = blockIdx.z; tile_id3 < m_rp.m_tile_end[3]; tile_id3 += gridDim.z ) {
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + threadIdx.z;
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && threadIdx.z < m_rp.m_tile[3] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3);
}
@@ -498,19 +462,19 @@ struct apply_impl<4,RP,Functor,Tag>
const index_type thr_id1 = threadIdx.x / m_rp.m_tile[0];
for ( index_type tile_id3 = blockIdx.z; tile_id3 < m_rp.m_tile_end[3]; tile_id3 += gridDim.z ) {
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + threadIdx.z;
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && threadIdx.z < m_rp.m_tile[3] ) {
for ( index_type tile_id2 = blockIdx.y; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.y ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.y;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.y < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(Tag(), offset_0 , offset_1 , offset_2 , offset_3);
}
@@ -535,19 +499,19 @@ struct apply_impl<4,RP,Functor,Tag>
const index_type thr_id1 = threadIdx.x % m_rp.m_tile[1];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = tile_id1*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type tile_id2 = blockIdx.y; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.y ) {
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + threadIdx.y;
const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.y + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && threadIdx.y < m_rp.m_tile[2] ) {
for ( index_type tile_id3 = blockIdx.z; tile_id3 < m_rp.m_tile_end[3]; tile_id3 += gridDim.z ) {
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + threadIdx.z;
const index_type offset_3 = tile_id3*m_rp.m_tile[3] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && threadIdx.z < m_rp.m_tile[3] ) {
m_func(Tag() , offset_0 , offset_1 , offset_2 , offset_3);
}
@@ -612,23 +576,23 @@ struct apply_impl<5,RP,Functor,void >
const index_type thr_id3 = threadIdx.y / m_rp.m_tile[2];
for ( index_type tile_id4 = blockIdx.z; tile_id4 < m_rp.m_tile_end[4]; tile_id4 += gridDim.z ) {
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + threadIdx.z;
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && threadIdx.z < m_rp.m_tile[4] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1 ; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0 ; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3, offset_4);
}
@@ -667,23 +631,23 @@ struct apply_impl<5,RP,Functor,void >
const index_type thr_id3 = threadIdx.y % m_rp.m_tile[3];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type tile_id4 = blockIdx.z; tile_id4 < m_rp.m_tile_end[4]; tile_id4 += gridDim.z ) {
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + threadIdx.z;
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && threadIdx.z < m_rp.m_tile[4] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3 , offset_4);
}
@@ -747,23 +711,23 @@ struct apply_impl<5,RP,Functor,Tag>
const index_type thr_id3 = threadIdx.y / m_rp.m_tile[2];
for ( index_type tile_id4 = blockIdx.z; tile_id4 < m_rp.m_tile_end[4]; tile_id4 += gridDim.z ) {
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + threadIdx.z;
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && threadIdx.z < m_rp.m_tile[4] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1 ; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0 ; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(Tag() , offset_0 , offset_1 , offset_2 , offset_3, offset_4);
}
@@ -802,23 +766,23 @@ struct apply_impl<5,RP,Functor,Tag>
const index_type thr_id3 = threadIdx.y % m_rp.m_tile[3];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type tile_id4 = blockIdx.z; tile_id4 < m_rp.m_tile_end[4]; tile_id4 += gridDim.z ) {
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + threadIdx.z;
const index_type offset_4 = tile_id4*m_rp.m_tile[4] + (index_type)threadIdx.z + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && threadIdx.z < m_rp.m_tile[4] ) {
m_func(Tag() , offset_0 , offset_1 , offset_2 , offset_3 , offset_4);
}
@@ -895,27 +859,27 @@ struct apply_impl<6,RP,Functor,void >
const index_type thr_id5 = threadIdx.z / m_rp.m_tile[4];
for ( index_type n = tile_id5; n < m_rp.m_tile_end[5]; n += numbl5 ) {
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5;
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5 + (index_type)m_rp.m_lower[5];
if ( offset_5 < m_rp.m_upper[5] && thr_id5 < m_rp.m_tile[5] ) {
for ( index_type m = tile_id4; m < m_rp.m_tile_end[4]; m += numbl4 ) {
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4;
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4 + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && thr_id4 < m_rp.m_tile[4] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1 ; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0 ; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3, offset_4, offset_5);
}
@@ -967,27 +931,27 @@ struct apply_impl<6,RP,Functor,void >
const index_type thr_id5 = threadIdx.z % m_rp.m_tile[5];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type m = tile_id4; m < m_rp.m_tile_end[4]; m += numbl4 ) {
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4;
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4 + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && thr_id4 < m_rp.m_tile[4] ) {
for ( index_type n = tile_id5; n < m_rp.m_tile_end[5]; n += numbl5 ) {
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5;
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5 + (index_type)m_rp.m_lower[5];
if ( offset_5 < m_rp.m_upper[5] && thr_id5 < m_rp.m_tile[5] ) {
m_func(offset_0 , offset_1 , offset_2 , offset_3 , offset_4 , offset_5);
}
@@ -1064,27 +1028,27 @@ struct apply_impl<6,RP,Functor,Tag>
const index_type thr_id5 = threadIdx.z / m_rp.m_tile[4];
for ( index_type n = tile_id5; n < m_rp.m_tile_end[5]; n += numbl5 ) {
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5;
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5 + (index_type)m_rp.m_lower[5];
if ( offset_5 < m_rp.m_upper[5] && thr_id5 < m_rp.m_tile[5] ) {
for ( index_type m = tile_id4; m < m_rp.m_tile_end[4]; m += numbl4 ) {
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4;
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4 + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && thr_id4 < m_rp.m_tile[4] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type j = tile_id1 ; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type i = tile_id0 ; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
m_func(Tag() , offset_0 , offset_1 , offset_2 , offset_3, offset_4, offset_5);
}
@@ -1136,27 +1100,27 @@ struct apply_impl<6,RP,Functor,Tag>
const index_type thr_id5 = threadIdx.z % m_rp.m_tile[5];
for ( index_type i = tile_id0; i < m_rp.m_tile_end[0]; i += numbl0 ) {
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0;
const index_type offset_0 = i*m_rp.m_tile[0] + thr_id0 + (index_type)m_rp.m_lower[0];
if ( offset_0 < m_rp.m_upper[0] && thr_id0 < m_rp.m_tile[0] ) {
for ( index_type j = tile_id1; j < m_rp.m_tile_end[1]; j += numbl1 ) {
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1;
const index_type offset_1 = j*m_rp.m_tile[1] + thr_id1 + (index_type)m_rp.m_lower[1];
if ( offset_1 < m_rp.m_upper[1] && thr_id1 < m_rp.m_tile[1] ) {
for ( index_type k = tile_id2; k < m_rp.m_tile_end[2]; k += numbl2 ) {
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2;
const index_type offset_2 = k*m_rp.m_tile[2] + thr_id2 + (index_type)m_rp.m_lower[2];
if ( offset_2 < m_rp.m_upper[2] && thr_id2 < m_rp.m_tile[2] ) {
for ( index_type l = tile_id3; l < m_rp.m_tile_end[3]; l += numbl3 ) {
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3;
const index_type offset_3 = l*m_rp.m_tile[3] + thr_id3 + (index_type)m_rp.m_lower[3];
if ( offset_3 < m_rp.m_upper[3] && thr_id3 < m_rp.m_tile[3] ) {
for ( index_type m = tile_id4; m < m_rp.m_tile_end[4]; m += numbl4 ) {
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4;
const index_type offset_4 = m*m_rp.m_tile[4] + thr_id4 + (index_type)m_rp.m_lower[4];
if ( offset_4 < m_rp.m_upper[4] && thr_id4 < m_rp.m_tile[4] ) {
for ( index_type n = tile_id5; n < m_rp.m_tile_end[5]; n += numbl5 ) {
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5;
const index_type offset_5 = n*m_rp.m_tile[5] + thr_id5 + (index_type)m_rp.m_lower[5];
if ( offset_5 < m_rp.m_upper[5] && thr_id5 < m_rp.m_tile[5] ) {
m_func(Tag() , offset_0 , offset_1 , offset_2 , offset_3 , offset_4 , offset_5);
}
@@ -1292,7 +1256,7 @@ protected:
const Functor m_func;
};
} } } //end namespace Kokkos::Experimental::Impl
} } //end namespace Kokkos::Impl
#endif
#endif


@@ -63,7 +63,7 @@
#include <typeinfo>
#endif
namespace Kokkos { namespace Experimental { namespace Impl {
namespace Kokkos { namespace Impl {
namespace Refactor {
@@ -2709,7 +2709,7 @@ private:
// ----------------------------------------------------------------------------------
} } } //end namespace Kokkos::Experimental::Impl
} } //end namespace Kokkos::Impl
#endif
#endif


@@ -164,7 +164,7 @@ static void cuda_parallel_launch_constant_memory()
template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
__global__
//__launch_bounds__(maxTperB, minBperSM)
__launch_bounds__(maxTperB, minBperSM)
static void cuda_parallel_launch_constant_memory()
{
const DriverType & driver =
@@ -182,7 +182,7 @@ static void cuda_parallel_launch_local_memory( const DriverType driver )
template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
__global__
//__launch_bounds__(maxTperB, minBperSM)
__launch_bounds__(maxTperB, minBperSM)
static void cuda_parallel_launch_local_memory( const DriverType driver )
{
driver();
@@ -193,9 +193,14 @@ template < class DriverType
, bool Large = ( CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType) ) >
struct CudaParallelLaunch ;
template < class DriverType, class LaunchBounds >
struct CudaParallelLaunch< DriverType, LaunchBounds, true > {
template < class DriverType
, unsigned int MaxThreadsPerBlock
, unsigned int MinBlocksPerSM >
struct CudaParallelLaunch< DriverType
, Kokkos::LaunchBounds< MaxThreadsPerBlock
, MinBlocksPerSM >
, true >
{
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
@@ -216,21 +221,28 @@ struct CudaParallelLaunch< DriverType, LaunchBounds, true > {
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
#ifndef KOKKOS_ARCH_KEPLER //On Kepler the L1 has no benefit since it doesn't cache reads
else if ( shmem ) {
CUDA_SAFE_CALL( cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM > , cudaFuncCachePreferShared ) );
} else {
CUDA_SAFE_CALL( cudaFuncSetCacheConfig( cuda_parallel_launch_constant_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM > , cudaFuncCachePreferL1 ) );
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
CUDA_SAFE_CALL(
cudaFuncSetCacheConfig
( cuda_parallel_launch_constant_memory
< DriverType, MaxThreadsPerBlock, MinBlocksPerSM >
, ( shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1 )
) );
}
#endif
// Copy functor to constant memory on the device
cudaMemcpyToSymbol( kokkos_impl_cuda_constant_memory_buffer , & driver , sizeof(DriverType) );
cudaMemcpyToSymbol(
kokkos_impl_cuda_constant_memory_buffer, &driver, sizeof(DriverType) );
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device
cuda_parallel_launch_constant_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM ><<< grid , block , shmem , stream >>>();
cuda_parallel_launch_constant_memory
< DriverType, MaxThreadsPerBlock, MinBlocksPerSM >
<<< grid , block , shmem , stream >>>();
#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
CUDA_SAFE_CALL( cudaGetLastError() );
@@ -240,9 +252,11 @@ struct CudaParallelLaunch< DriverType, LaunchBounds, true > {
}
};
template < class DriverType, class LaunchBounds >
struct CudaParallelLaunch< DriverType, LaunchBounds, false > {
template < class DriverType >
struct CudaParallelLaunch< DriverType
, Kokkos::LaunchBounds<>
, true >
{
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
@@ -252,20 +266,136 @@ struct CudaParallelLaunch< DriverType, LaunchBounds, false > {
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: Functor is too large") );
}
// Fence before changing settings and copying closure
Kokkos::Cuda::fence();
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
#ifndef KOKKOS_ARCH_KEPLER //On Kepler the L1 has no benefit since it doesn't cache reads
else if ( shmem ) {
CUDA_SAFE_CALL( cudaFuncSetCacheConfig( cuda_parallel_launch_local_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM > , cudaFuncCachePreferShared ) );
} else {
CUDA_SAFE_CALL( cudaFuncSetCacheConfig( cuda_parallel_launch_local_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM > , cudaFuncCachePreferL1 ) );
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
CUDA_SAFE_CALL(
cudaFuncSetCacheConfig
( cuda_parallel_launch_constant_memory< DriverType >
, ( shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1 )
) );
}
#endif
// Copy functor to constant memory on the device
cudaMemcpyToSymbol(
kokkos_impl_cuda_constant_memory_buffer, &driver, sizeof(DriverType) );
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device
cuda_parallel_launch_constant_memory< DriverType >
<<< grid , block , shmem , stream >>>();
#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
CUDA_SAFE_CALL( cudaGetLastError() );
Kokkos::Cuda::fence();
#endif
}
}
};
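Both constant-memory specializations above follow the same protocol: reject functors larger than the constant buffer, copy the closure to `kokkos_impl_cuda_constant_memory_buffer` via `cudaMemcpyToSymbol`, then launch a kernel that reads the functor back from constant memory. The size-based choice between the two launch paths can be sketched on the host as follows (the threshold value here is illustrative; Kokkos uses `CudaTraits::ConstantMemoryUseThreshold`):

```cpp
#include <cstddef>
#include <string>

// Illustrative threshold only, standing in for
// CudaTraits::ConstantMemoryUseThreshold.
constexpr std::size_t kConstantMemoryUseThreshold = 512;

// Mirrors the `Large` boolean selecting the CudaParallelLaunch
// specialization: small closures travel as kernel arguments, large
// ones are staged in the constant-memory buffer.
template <class Driver>
std::string launch_path() {
  return sizeof(Driver) > kConstantMemoryUseThreshold
             ? "constant-memory staging"
             : "kernel-argument (local memory)";
}

struct SmallFunctor { int a; };
struct BigFunctor   { char payload[4096]; };
```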
template < class DriverType
, unsigned int MaxThreadsPerBlock
, unsigned int MinBlocksPerSM >
struct CudaParallelLaunch< DriverType
, Kokkos::LaunchBounds< MaxThreadsPerBlock
, MinBlocksPerSM >
, false >
{
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
, const dim3 & block
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: Functor is too large") );
}
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
CUDA_SAFE_CALL(
cudaFuncSetCacheConfig
( cuda_parallel_launch_local_memory
< DriverType, MaxThreadsPerBlock, MinBlocksPerSM >
, ( shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1 )
) );
}
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
cuda_parallel_launch_local_memory< DriverType, LaunchBounds::maxTperB, LaunchBounds::minBperSM ><<< grid , block , shmem , stream >>>( driver );
// Invoke the driver function on the device
cuda_parallel_launch_local_memory
< DriverType, MaxThreadsPerBlock, MinBlocksPerSM >
<<< grid , block , shmem , stream >>>( driver );
#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
CUDA_SAFE_CALL( cudaGetLastError() );
Kokkos::Cuda::fence();
#endif
}
}
};
template < class DriverType >
struct CudaParallelLaunch< DriverType
, Kokkos::LaunchBounds<>
, false >
{
inline
CudaParallelLaunch( const DriverType & driver
, const dim3 & grid
, const dim3 & block
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: Functor is too large") );
}
if ( CudaTraits::SharedMemoryCapacity < shmem ) {
Kokkos::Impl::throw_runtime_exception( std::string("CudaParallelLaunch FAILED: shared memory request is too large") );
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
CUDA_SAFE_CALL(
cudaFuncSetCacheConfig
( cuda_parallel_launch_local_memory< DriverType >
, ( shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1 )
) );
}
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device
cuda_parallel_launch_local_memory< DriverType >
<<< grid , block , shmem , stream >>>( driver );
#if defined( KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK )
CUDA_SAFE_CALL( cudaGetLastError() );
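The refactor above replaces a generic `LaunchBounds` template parameter with partial specializations: one for an explicit `Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>` and one for the unbounded default `Kokkos::LaunchBounds<>`, so the kernel can be annotated with `__launch_bounds__` only when bounds were actually requested. A stand-alone sketch of that dispatch (the types here are simplified stand-ins, not the Kokkos definitions):

```cpp
#include <string>

// Simplified stand-in for Kokkos::LaunchBounds: defaults mean "no bounds".
template <unsigned MaxThreadsPerBlock = 0, unsigned MinBlocksPerSM = 0>
struct LaunchBounds {};

// Primary template, specialized below on the bounds type.
template <class Driver, class Bounds>
struct Launch;

// Explicit bounds: the kernel variant would carry __launch_bounds__.
template <class Driver, unsigned MaxT, unsigned MinB>
struct Launch<Driver, LaunchBounds<MaxT, MinB>> {
  static std::string kind() { return "bounded"; }
};

// Default bounds (more specialized, so it wins for LaunchBounds<>):
// launch an unannotated kernel variant.
template <class Driver>
struct Launch<Driver, LaunchBounds<>> {
  static std::string kind() { return "default"; }
};
```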

View File

@@ -366,7 +366,7 @@ SharedAllocationRecord< Kokkos::CudaSpace , void >::
if(Kokkos::Profiling::profileLibraryLoaded()) {
SharedAllocationHeader header ;
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>::DeepCopy( & header , RecordBase::m_alloc_ptr , sizeof(SharedAllocationHeader) );
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>( & header , RecordBase::m_alloc_ptr , sizeof(SharedAllocationHeader) );
Kokkos::Profiling::deallocateData(
Kokkos::Profiling::SpaceHandle(Kokkos::CudaSpace::name()),header.m_label,
@@ -446,7 +446,7 @@ SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
);
// Copy to device memory
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>::DeepCopy( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
}
SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
@@ -655,7 +655,7 @@ SharedAllocationRecord< Kokkos::CudaSpace , void >::get_record( void * alloc_ptr
Header const * const head_cuda = alloc_ptr ? Header::get_header( alloc_ptr ) : (Header*) 0 ;
if ( alloc_ptr ) {
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , head_cuda , sizeof(SharedAllocationHeader) );
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>( & head , head_cuda , sizeof(SharedAllocationHeader) );
}
RecordCuda * const record = alloc_ptr ? static_cast< RecordCuda * >( head.m_record ) : (RecordCuda *) 0 ;
@@ -713,7 +713,7 @@ SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::get_record( void *
// Iterate records to print orphaned memory ...
void
SharedAllocationRecord< Kokkos::CudaSpace , void >::
print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail )
print_records( std::ostream & s , const Kokkos::CudaSpace & , bool detail )
{
SharedAllocationRecord< void , void > * r = & s_root_record ;
@@ -724,7 +724,7 @@ print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail
if ( detail ) {
do {
if ( r->m_alloc_ptr ) {
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
}
else {
head.m_label[0] = 0 ;
@@ -751,7 +751,7 @@ print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail
, reinterpret_cast<uintptr_t>( r->m_dealloc )
, head.m_label
);
std::cout << buffer ;
s << buffer ;
r = r->m_next ;
} while ( r != & s_root_record );
}
@@ -759,7 +759,7 @@ print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail
do {
if ( r->m_alloc_ptr ) {
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>::DeepCopy( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
Kokkos::Impl::DeepCopy<HostSpace,CudaSpace>( & head , r->m_alloc_ptr , sizeof(SharedAllocationHeader) );
//Formatting dependent on sizeof(uintptr_t)
const char * format_string;
@@ -781,7 +781,7 @@ print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail
else {
snprintf( buffer , 256 , "Cuda [ 0 + 0 ]\n" );
}
std::cout << buffer ;
s << buffer ;
r = r->m_next ;
} while ( r != & s_root_record );
}
@@ -789,14 +789,14 @@ print_records( std::ostream & s , const Kokkos::CudaSpace & space , bool detail
void
SharedAllocationRecord< Kokkos::CudaUVMSpace , void >::
print_records( std::ostream & s , const Kokkos::CudaUVMSpace & space , bool detail )
print_records( std::ostream & s , const Kokkos::CudaUVMSpace & , bool detail )
{
SharedAllocationRecord< void , void >::print_host_accessible_records( s , "CudaUVM" , & s_root_record , detail );
}
void
SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
print_records( std::ostream & s , const Kokkos::CudaHostPinnedSpace & space , bool detail )
print_records( std::ostream & s , const Kokkos::CudaHostPinnedSpace & , bool detail )
{
SharedAllocationRecord< void , void >::print_host_accessible_records( s , "CudaHostPinned" , & s_root_record , detail );
}
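Two recurring fixes in this file: streaming diagnostics to the caller's `s` instead of `std::cout`, and dropping now-unused parameter names (`space`) to silence warnings. Routing output through the supplied stream also makes the report capturable; a minimal analogue:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Write to the caller-supplied stream (as the fixed print_records does),
// rather than hard-coding std::cout.
void print_record(std::ostream& s, const std::string& label) {
  s << "record: " << label << '\n';
}

// Callers can now capture the report in memory for inspection.
std::string capture(const std::string& label) {
  std::ostringstream oss;
  print_record(oss, label);
  return oss.str();
}
```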

View File

@@ -421,7 +421,7 @@ void CudaInternal::initialize( int cuda_device_id , int stream_count )
std::string msg = ss.str();
Kokkos::abort( msg.c_str() );
}
if ( compiled_major != cudaProp.major || compiled_minor != cudaProp.minor ) {
if ( Kokkos::show_warnings() && (compiled_major != cudaProp.major || compiled_minor != cudaProp.minor) ) {
std::cerr << "Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability "
<< compiled_major << "." << compiled_minor
<< " on device with compute capability "
@@ -467,7 +467,7 @@ void CudaInternal::initialize( int cuda_device_id , int stream_count )
m_scratchUnifiedSupported = cudaProp.unifiedAddressing ;
if ( ! m_scratchUnifiedSupported ) {
if ( Kokkos::show_warnings() && ! m_scratchUnifiedSupported ) {
std::cout << "Kokkos::Cuda device "
<< cudaProp.name << " capability "
<< cudaProp.major << "." << cudaProp.minor
@@ -545,7 +545,7 @@ void CudaInternal::initialize( int cuda_device_id , int stream_count )
}
#ifdef KOKKOS_ENABLE_CUDA_UVM
if(!cuda_launch_blocking()) {
if( Kokkos::show_warnings() && !cuda_launch_blocking() ) {
std::cout << "Kokkos::Cuda::initialize WARNING: Cuda is allocating into UVMSpace by default" << std::endl;
std::cout << " without setting CUDA_LAUNCH_BLOCKING=1." << std::endl;
std::cout << " The code must call Cuda::fence() after each kernel" << std::endl;
@@ -561,7 +561,7 @@ void CudaInternal::initialize( int cuda_device_id , int stream_count )
bool visible_devices_one=true;
if (env_visible_devices == 0) visible_devices_one=false;
if(!visible_devices_one && !force_device_alloc) {
if( Kokkos::show_warnings() && (!visible_devices_one && !force_device_alloc) ) {
std::cout << "Kokkos::Cuda::initialize WARNING: Cuda is allocating into UVMSpace by default" << std::endl;
std::cout << " without setting CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 or " << std::endl;
std::cout << " setting CUDA_VISIBLE_DEVICES." << std::endl;
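Each diagnostic in `initialize()` above is now gated on `Kokkos::show_warnings()`, a process-wide toggle introduced in this release so start-up warnings can be silenced. A minimal analogue of that gating (the toggle here is a plain global, not the Kokkos implementation):

```cpp
#include <sstream>
#include <string>

namespace demo {
// Stand-in for Kokkos::show_warnings(): a process-wide toggle.
bool g_show_warnings = true;
bool show_warnings() { return g_show_warnings; }

// Emit a warning only when both the toggle and the condition hold,
// mirroring the `show_warnings() && <condition>` checks above.
std::string maybe_warn(bool condition, const std::string& msg) {
  std::ostringstream out;
  if (show_warnings() && condition) out << "WARNING: " << msg << '\n';
  return out.str();
}
}  // namespace demo
```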

View File

@@ -381,12 +381,12 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, Kokkos::Cuda
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > Policy ;
typedef Kokkos::MDRangePolicy< Traits ... > Policy ;
using RP = Policy;
typedef typename Policy::array_index_type array_index_type;
typedef typename Policy::index_type index_type;
@@ -402,7 +402,7 @@ public:
__device__
void operator()(void) const
{
Kokkos::Experimental::Impl::Refactor::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag>(m_rp,m_functor).exec_range();
Kokkos::Impl::Refactor::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag>(m_rp,m_functor).exec_range();
}
@@ -648,10 +648,11 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
public:
@@ -721,7 +722,7 @@ public:
}
// Reduce with final value at blockDim.y - 1 location.
if ( cuda_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTag>(
if ( cuda_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTagFwd>(
ReducerConditional::select(m_functor , m_reducer) , blockIdx.x , gridDim.x ,
kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags ) ) {
@@ -731,7 +732,7 @@ public:
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); }
@@ -766,11 +767,11 @@ public:
value_type init;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &init);
if(Impl::cuda_inter_block_reduction<ReducerTypeFwd,ValueJoin,WorkTag>
if(Impl::cuda_inter_block_reduction<ReducerTypeFwd,ValueJoin,WorkTagFwd>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,max_active_thread)) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
*result = value;
}
}
@@ -858,14 +859,14 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ReducerType, class ... Traits >
class ParallelReduce< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, ReducerType
, Kokkos::Cuda
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > Policy ;
typedef Kokkos::MDRangePolicy< Traits ... > Policy ;
typedef typename Policy::array_index_type array_index_type;
typedef typename Policy::index_type index_type;
@@ -875,10 +876,11 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
public:
@@ -898,7 +900,7 @@ public:
size_type * m_scratch_flags ;
size_type * m_unified_space ;
typedef typename Kokkos::Experimental::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
// Shall we use the shfl based reduction or not (only use it for static-sized types larger than 128 bits)
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
@@ -913,7 +915,7 @@ public:
void
exec_range( reference_type update ) const
{
Kokkos::Experimental::Impl::Reduce::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag, reference_type>(m_policy, m_functor, update).exec_range();
Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag, reference_type>(m_policy, m_functor, update).exec_range();
}
inline
@@ -942,7 +944,7 @@ public:
// Reduce with final value at blockDim.y - 1 location.
// Problem: non power-of-two blockDim
if ( cuda_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTag>(
if ( cuda_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTagFwd>(
ReducerConditional::select(m_functor , m_reducer) , blockIdx.x , gridDim.x ,
kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags ) ) {
@@ -951,7 +953,7 @@ public:
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); }
@@ -983,11 +985,11 @@ public:
value_type init;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &init);
if(Impl::cuda_inter_block_reduction<ReducerTypeFwd,ValueJoin,WorkTag>
if(Impl::cuda_inter_block_reduction<ReducerTypeFwd,ValueJoin,WorkTagFwd>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,max_active_thread)) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
*result = value;
}
}
@@ -1100,10 +1102,11 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
@@ -1222,7 +1225,7 @@ public:
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
}
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); }
@@ -1260,7 +1263,7 @@ public:
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
*result = value;
}
}
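The recurring change in this file swaps `WorkTag` for `WorkTagFwd` in the `ValueTraits`/`ValueInit`/`ValueJoin`/`FunctorFinal` helpers. `WorkTagFwd` is the tag forwarded to the reducer machinery: it stays `WorkTag` when the functor reduces itself, but collapses to `void` when a separate `ReducerType` is supplied, so the reducer's untagged hooks are selected (the fix behind changelog items #1250/#1251 on tagged reducers). A self-contained sketch of that type selection, with simplified stand-ins for the Kokkos helpers:

```cpp
#include <type_traits>

struct InvalidType {};  // marker: "no separate reducer supplied"

// Simplified stand-in for Kokkos::Impl::if_c.
template <bool Cond, class T, class F>
struct if_c { using type = T; };
template <class T, class F>
struct if_c<false, T, F> { using type = F; };

// Keep the functor's WorkTag when it is its own reducer; forward void
// when a separate reducer is given, so its untagged init/join/final run.
template <class WorkTag, class ReducerType>
using WorkTagFwd =
    typename if_c<std::is_same<InvalidType, ReducerType>::value,
                  WorkTag, void>::type;

struct MyTag {};
struct MySum {};  // hypothetical separate reducer

static_assert(std::is_same<WorkTagFwd<MyTag, InvalidType>, MyTag>::value,
              "functor-as-reducer keeps its tag");
static_assert(std::is_same<WorkTagFwd<MyTag, MySum>, void>::value,
              "separate reducer gets the untagged hooks");
```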

View File

@@ -69,7 +69,7 @@ void cuda_shfl( T & out , T const & in , int lane ,
typename std::enable_if< sizeof(int) == sizeof(T) , int >::type width )
{
*reinterpret_cast<int*>(&out) =
__shfl( *reinterpret_cast<int const *>(&in) , lane , width );
KOKKOS_IMPL_CUDA_SHFL( *reinterpret_cast<int const *>(&in) , lane , width );
}
template< typename T >
@@ -83,7 +83,7 @@ void cuda_shfl( T & out , T const & in , int lane ,
for ( int i = 0 ; i < N ; ++i ) {
reinterpret_cast<int*>(&out)[i] =
__shfl( reinterpret_cast<int const *>(&in)[i] , lane , width );
KOKKOS_IMPL_CUDA_SHFL( reinterpret_cast<int const *>(&in)[i] , lane , width );
}
}
@@ -95,7 +95,7 @@ void cuda_shfl_down( T & out , T const & in , int delta ,
typename std::enable_if< sizeof(int) == sizeof(T) , int >::type width )
{
*reinterpret_cast<int*>(&out) =
__shfl_down( *reinterpret_cast<int const *>(&in) , delta , width );
KOKKOS_IMPL_CUDA_SHFL_DOWN( *reinterpret_cast<int const *>(&in) , delta , width );
}
template< typename T >
@@ -109,7 +109,7 @@ void cuda_shfl_down( T & out , T const & in , int delta ,
for ( int i = 0 ; i < N ; ++i ) {
reinterpret_cast<int*>(&out)[i] =
__shfl_down( reinterpret_cast<int const *>(&in)[i] , delta , width );
KOKKOS_IMPL_CUDA_SHFL_DOWN( reinterpret_cast<int const *>(&in)[i] , delta , width );
}
}
@@ -121,7 +121,7 @@ void cuda_shfl_up( T & out , T const & in , int delta ,
typename std::enable_if< sizeof(int) == sizeof(T) , int >::type width )
{
*reinterpret_cast<int*>(&out) =
__shfl_up( *reinterpret_cast<int const *>(&in) , delta , width );
KOKKOS_IMPL_CUDA_SHFL_UP( *reinterpret_cast<int const *>(&in) , delta , width );
}
template< typename T >
@@ -135,7 +135,7 @@ void cuda_shfl_up( T & out , T const & in , int delta ,
for ( int i = 0 ; i < N ; ++i ) {
reinterpret_cast<int*>(&out)[i] =
__shfl_up( reinterpret_cast<int const *>(&in)[i] , delta , width );
KOKKOS_IMPL_CUDA_SHFL_UP( reinterpret_cast<int const *>(&in)[i] , delta , width );
}
}
@@ -268,31 +268,31 @@ bool cuda_inter_block_reduction( typename FunctorValueTraits< FunctorType , ArgT
if( id + 1 < int(gridDim.x) )
join(value, tmp);
}
int active = __ballot(1);
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
}
}
//The last block has in its thread=0 the global reduction value through "value"
@@ -432,31 +432,31 @@ cuda_inter_block_reduction( const ReducerType& reducer,
if( id + 1 < int(gridDim.x) )
reducer.join(value, tmp);
}
int active = __ballot(1);
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += __ballot(1);
active += KOKKOS_IMPL_CUDA_BALLOT(1);
}
}
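The offsets 1, 2, 4, 8, 16 above implement a log2(32)-step warp tree reduction: at each step every lane folds in the value held `offset` lanes above it, with the ballot calls keeping the warp converged between steps. A host-side model of the shuffle-down reduction (a sequential stand-in for the parallel warp):

```cpp
#include <array>

// Model of a 32-lane shfl_down tree reduction: in each step, lane i
// adds the value from lane i + offset. The snapshot models the fact
// that all lanes read their neighbours' pre-step values simultaneously.
int warp_reduce_sum(std::array<int, 32> lanes) {
  for (int offset = 1; offset < 32; offset <<= 1) {
    const std::array<int, 32> snap = lanes;
    for (int i = 0; i + offset < 32; ++i)
      lanes[i] = snap[i] + snap[i + offset];
  }
  return lanes[0];  // lane 0 ends up holding the full sum
}
```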

View File

@@ -73,16 +73,16 @@ public:
KOKKOS_INLINE_FUNCTION
UniqueToken() : m_buffer(0), m_count(0) {}
KOKKOS_INLINE_FUNCTION
KOKKOS_FUNCTION_DEFAULTED
UniqueToken( const UniqueToken & ) = default;
KOKKOS_INLINE_FUNCTION
KOKKOS_FUNCTION_DEFAULTED
UniqueToken( UniqueToken && ) = default;
KOKKOS_INLINE_FUNCTION
KOKKOS_FUNCTION_DEFAULTED
UniqueToken & operator=( const UniqueToken & ) = default ;
KOKKOS_INLINE_FUNCTION
KOKKOS_FUNCTION_DEFAULTED
UniqueToken & operator=( UniqueToken && ) = default ;
/// \brief upper bound for acquired values, i.e. 0 <= value < size()

View File

@@ -47,7 +47,7 @@
#ifdef KOKKOS_ENABLE_CUDA
#include <Kokkos_Cuda.hpp>
#include <Cuda/Kokkos_Cuda_Version_9_8_Compatibility.hpp>
namespace Kokkos {
@@ -91,12 +91,12 @@ namespace Impl {
KOKKOS_INLINE_FUNCTION
int shfl(const int &val, const int& srcLane, const int& width ) {
return __shfl(val,srcLane,width);
return KOKKOS_IMPL_CUDA_SHFL(val,srcLane,width);
}
KOKKOS_INLINE_FUNCTION
float shfl(const float &val, const int& srcLane, const int& width ) {
return __shfl(val,srcLane,width);
return KOKKOS_IMPL_CUDA_SHFL(val,srcLane,width);
}
template<typename Scalar>
@@ -105,7 +105,7 @@ namespace Impl {
) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl(tmp,srcLane,width);
tmp = KOKKOS_IMPL_CUDA_SHFL(tmp,srcLane,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
@@ -113,8 +113,8 @@ namespace Impl {
double shfl(const double &val, const int& srcLane, const int& width) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl(lo,srcLane,width);
hi = __shfl(hi,srcLane,width);
lo = KOKKOS_IMPL_CUDA_SHFL(lo,srcLane,width);
hi = KOKKOS_IMPL_CUDA_SHFL(hi,srcLane,width);
return __hiloint2double(hi,lo);
}
@@ -123,8 +123,8 @@ namespace Impl {
Scalar shfl(const Scalar &val, const int& srcLane, const typename Impl::enable_if< (sizeof(Scalar) == 8) ,int>::type& width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl(lo,srcLane,width);
hi = __shfl(hi,srcLane,width);
lo = KOKKOS_IMPL_CUDA_SHFL(lo,srcLane,width);
hi = KOKKOS_IMPL_CUDA_SHFL(hi,srcLane,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
@@ -137,18 +137,18 @@ namespace Impl {
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl(s_val.fval[i],srcLane,width);
r_val.fval[i] = KOKKOS_IMPL_CUDA_SHFL(s_val.fval[i],srcLane,width);
return r_val.value();
}
KOKKOS_INLINE_FUNCTION
int shfl_down(const int &val, const int& delta, const int& width) {
return __shfl_down(val,delta,width);
return KOKKOS_IMPL_CUDA_SHFL_DOWN(val,delta,width);
}
KOKKOS_INLINE_FUNCTION
float shfl_down(const float &val, const int& delta, const int& width) {
return __shfl_down(val,delta,width);
return KOKKOS_IMPL_CUDA_SHFL_DOWN(val,delta,width);
}
template<typename Scalar>
@@ -156,7 +156,7 @@ namespace Impl {
Scalar shfl_down(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 4) , int >::type & width) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl_down(tmp,delta,width);
tmp = KOKKOS_IMPL_CUDA_SHFL_DOWN(tmp,delta,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
@@ -164,8 +164,8 @@ namespace Impl {
double shfl_down(const double &val, const int& delta, const int& width) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl_down(lo,delta,width);
hi = __shfl_down(hi,delta,width);
lo = KOKKOS_IMPL_CUDA_SHFL_DOWN(lo,delta,width);
hi = KOKKOS_IMPL_CUDA_SHFL_DOWN(hi,delta,width);
return __hiloint2double(hi,lo);
}
@@ -174,8 +174,8 @@ namespace Impl {
Scalar shfl_down(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 8) , int >::type & width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl_down(lo,delta,width);
hi = __shfl_down(hi,delta,width);
lo = KOKKOS_IMPL_CUDA_SHFL_DOWN(lo,delta,width);
hi = KOKKOS_IMPL_CUDA_SHFL_DOWN(hi,delta,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
@@ -188,18 +188,18 @@ namespace Impl {
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl_down(s_val.fval[i],delta,width);
r_val.fval[i] = KOKKOS_IMPL_CUDA_SHFL_DOWN(s_val.fval[i],delta,width);
return r_val.value();
}
KOKKOS_INLINE_FUNCTION
int shfl_up(const int &val, const int& delta, const int& width ) {
return __shfl_up(val,delta,width);
return KOKKOS_IMPL_CUDA_SHFL_UP(val,delta,width);
}
KOKKOS_INLINE_FUNCTION
float shfl_up(const float &val, const int& delta, const int& width ) {
return __shfl_up(val,delta,width);
return KOKKOS_IMPL_CUDA_SHFL_UP(val,delta,width);
}
template<typename Scalar>
@@ -207,7 +207,7 @@ namespace Impl {
Scalar shfl_up(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 4) , int >::type & width) {
Scalar tmp1 = val;
float tmp = *reinterpret_cast<float*>(&tmp1);
tmp = __shfl_up(tmp,delta,width);
tmp = KOKKOS_IMPL_CUDA_SHFL_UP(tmp,delta,width);
return *reinterpret_cast<Scalar*>(&tmp);
}
@@ -215,8 +215,8 @@ namespace Impl {
double shfl_up(const double &val, const int& delta, const int& width ) {
int lo = __double2loint(val);
int hi = __double2hiint(val);
lo = __shfl_up(lo,delta,width);
hi = __shfl_up(hi,delta,width);
lo = KOKKOS_IMPL_CUDA_SHFL_UP(lo,delta,width);
hi = KOKKOS_IMPL_CUDA_SHFL_UP(hi,delta,width);
return __hiloint2double(hi,lo);
}
@@ -225,8 +225,8 @@ namespace Impl {
Scalar shfl_up(const Scalar &val, const int& delta, const typename Impl::enable_if< (sizeof(Scalar) == 8) , int >::type & width) {
int lo = __double2loint(*reinterpret_cast<const double*>(&val));
int hi = __double2hiint(*reinterpret_cast<const double*>(&val));
lo = __shfl_up(lo,delta,width);
hi = __shfl_up(hi,delta,width);
lo = KOKKOS_IMPL_CUDA_SHFL_UP(lo,delta,width);
hi = KOKKOS_IMPL_CUDA_SHFL_UP(hi,delta,width);
const double tmp = __hiloint2double(hi,lo);
return *(reinterpret_cast<const Scalar*>(&tmp));
}
@ -239,7 +239,7 @@ namespace Impl {
s_val = val;
for(int i = 0; i<s_val.n; i++)
r_val.fval[i] = __shfl_up(s_val.fval[i],delta,width);
r_val.fval[i] = KOKKOS_IMPL_CUDA_SHFL_UP(s_val.fval[i],delta,width);
return r_val.value();
}


@ -0,0 +1,12 @@
#include<Kokkos_Macros.hpp>
#if ( CUDA_VERSION < 9000 )
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) __shfl_down(x,y,z)
#else
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot_sync(0xffffffff,x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl_sync(0xffffffff,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up_sync(0xffffffff,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) __shfl_down_sync(0xffffffff,x,y,z)
#endif
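The macros above exist because CUDA 9 deprecated the unsynchronized warp intrinsics in favor of `*_sync` variants that take an explicit participation mask (`0xffffffff` names all 32 lanes). The data movement itself is unchanged; a host-side model of `__shfl_down` semantics (illustrative only, the real intrinsic exchanges registers between threads of a warp):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of __shfl_down within a warp: lane i reads the value held by lane
// i + delta, restricted to groups of `width` lanes; a lane whose source
// falls outside its group keeps its own value.
std::vector<int> model_shfl_down(const std::vector<int>& lanes,
                                 unsigned delta, unsigned width) {
  std::vector<int> out(lanes);
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    std::size_t group = i / width;      // lanes are partitioned by width
    std::size_t src = i + delta;
    if (src < (group + 1) * width && src < lanes.size()) out[i] = lanes[src];
  }
  return out;
}
```

With `width = 4` and `delta = 1`, lanes 3 and 7 sit at a group boundary and keep their own values, which is exactly the edge case reduction loops rely on.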


@ -127,11 +127,11 @@ struct CudaTextureFetch {
template< class CudaMemorySpace >
inline explicit
CudaTextureFetch( const ValueType * const arg_ptr
, Kokkos::Experimental::Impl::SharedAllocationRecord< CudaMemorySpace , void > & record
, Kokkos::Impl::SharedAllocationRecord< CudaMemorySpace , void > * record
)
: m_obj( record.template attach_texture_object< AliasType >() )
: m_obj( record->template attach_texture_object< AliasType >() )
, m_ptr( arg_ptr )
, m_offset( record.attach_texture_object_offset( reinterpret_cast<const AliasType*>( arg_ptr ) ) )
, m_offset( record->attach_texture_object_offset( reinterpret_cast<const AliasType*>( arg_ptr ) ) )
{}
// Texture object spans the entire allocation.
@ -199,8 +199,8 @@ struct CudaLDGFetch {
template< class CudaMemorySpace >
inline explicit
CudaLDGFetch( const ValueType * const arg_ptr
, Kokkos::Experimental::Impl::SharedAllocationRecord< CudaMemorySpace , void > const &
)
, Kokkos::Impl::SharedAllocationRecord<CudaMemorySpace,void>*
)
: m_ptr( arg_ptr )
{}
@ -285,7 +285,21 @@ public:
// Assignment of texture = non-texture requires creation of a texture object
// which can only occur on the host. In addition, 'get_record' is only valid
// if called in a host execution space
return handle_type( arg_data_ptr , arg_tracker.template get_record< typename Traits::memory_space >() );
typedef typename Traits::memory_space memory_space ;
typedef typename Impl::SharedAllocationRecord<memory_space,void> record ;
record * const r = arg_tracker.template get_record< memory_space >();
#if ! defined( KOKKOS_ENABLE_CUDA_LDG_INTRINSIC )
if ( 0 == r ) {
Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
}
#endif
return handle_type( arg_data_ptr , r );
#else
Kokkos::Impl::cuda_abort("Cannot create Cuda texture object from within a Cuda kernel");
return handle_type();


@ -48,50 +48,52 @@ namespace Kokkos {
namespace Impl {
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType ,
Kokkos::Experimental::WorkGraphPolicy< Traits ... > ,
Kokkos::Cuda
class ParallelFor< FunctorType
, Kokkos::WorkGraphPolicy< Traits ... >
, Kokkos::Cuda
>
: public Kokkos::Impl::Experimental::
WorkGraphExec< FunctorType,
Kokkos::Cuda,
Traits ...
>
{
public:
typedef Kokkos::Experimental::WorkGraphPolicy< Traits ... > Policy ;
typedef Kokkos::Impl::Experimental::
WorkGraphExec<FunctorType, Kokkos::Cuda, Traits ... > Base ;
typedef Kokkos::WorkGraphPolicy< Traits ... > Policy ;
typedef ParallelFor<FunctorType, Policy, Kokkos::Cuda> Self ;
private:
template< class TagType >
__device__
typename std::enable_if< std::is_same< TagType , void >::value >::type
exec_one(const typename Policy::member_type& i) const {
Base::m_functor( i );
}
Policy m_policy ;
FunctorType m_functor ;
template< class TagType >
__device__
__device__ inline
typename std::enable_if< std::is_same< TagType , void >::value >::type
exec_one( const std::int32_t w ) const noexcept
{ m_functor( w ); }
template< class TagType >
__device__ inline
typename std::enable_if< ! std::is_same< TagType , void >::value >::type
exec_one(const typename Policy::member_type& i) const {
const TagType t{} ;
Base::m_functor( t , i );
}
exec_one( const std::int32_t w ) const noexcept
{ const TagType t{} ; m_functor( t , w ); }
public:
__device__
inline
void operator()() const {
for (std::int32_t i; (-1 != (i = Base::before_work())); ) {
exec_one< typename Policy::work_tag >( i );
Base::after_work(i);
__device__ inline
void operator()() const noexcept
{
if ( 0 == ( threadIdx.y % 16 ) ) {
// Spin until COMPLETED_TOKEN.
// END_TOKEN indicates no work is currently available.
for ( std::int32_t w = Policy::END_TOKEN ;
Policy::COMPLETED_TOKEN != ( w = m_policy.pop_work() ) ; ) {
if ( Policy::END_TOKEN != w ) {
exec_one< typename Policy::work_tag >( w );
m_policy.completed_work(w);
}
}
}
}
}
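The new `operator()` above drives a pop/complete loop around two sentinel tokens: `END_TOKEN` means "no work ready right now, keep spinning", and `COMPLETED_TOKEN` means "the whole graph is done, exit". A single-threaded sketch of that driver (the queue type and names are simplified stand-ins for WorkGraphPolicy's internal state, not the real implementation):

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct ToyWorkQueue {
  static constexpr int END_TOKEN = -1;       // nothing ready yet
  static constexpr int COMPLETED_TOKEN = -2; // all work finished
  std::deque<int> ready;                     // work items whose deps are met
  int remaining;                             // items not yet completed

  int pop_work() {
    if (remaining == 0) return COMPLETED_TOKEN;
    if (ready.empty()) return END_TOKEN;     // caller spins and retries
    int w = ready.front();
    ready.pop_front();
    return w;
  }
  void completed_work(int) { --remaining; }
};

// The same loop shape as the ParallelFor body above.
template <class F>
void drive(ToyWorkQueue& q, F&& exec_one) {
  for (int w = ToyWorkQueue::END_TOKEN;
       ToyWorkQueue::COMPLETED_TOKEN != (w = q.pop_work()); ) {
    if (ToyWorkQueue::END_TOKEN != w) {
      exec_one(w);
      q.completed_work(w);
    }
  }
}
```

In the real kernel only one lane per 16 (`threadIdx.y % 16 == 0`) runs this loop, and completing an item may make successors in the graph become ready.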
inline
void execute()
@ -108,9 +110,9 @@ public:
inline
ParallelFor( const FunctorType & arg_functor
, const Policy & arg_policy )
: Base( arg_functor, arg_policy )
{
}
: m_policy( arg_policy )
, m_functor( arg_functor )
{}
};
} // namespace Impl


@ -55,7 +55,7 @@
#include <Cuda/KokkosExp_Cuda_IterateTile_Refactor.hpp>
#endif
namespace Kokkos { namespace Experimental {
namespace Kokkos {
// ------------------------------------------------------------------ //
@ -331,11 +331,23 @@ struct MDRangePolicy
}
};
} // namespace Kokkos
// For backward compatibility
namespace Kokkos { namespace Experimental {
using Kokkos::MDRangePolicy;
using Kokkos::Rank;
using Kokkos::Iterate;
} } // end Kokkos::Experimental
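The backward-compatibility shim above follows a standard pattern: after a type moves from an `Experimental` namespace to its parent, using-declarations keep the old qualified spelling valid and referring to the same type. A minimal replica (`Lib`/`Widget` are illustrative names):

```cpp
#include <cassert>
#include <type_traits>

namespace Lib {
  template <int N> struct Widget { static constexpr int rank = N; };
}

// Old spelling Lib::Experimental::Widget still compiles and is the
// *same* type, not a copy, so overloads and specializations keep working.
namespace Lib { namespace Experimental {
  using Lib::Widget;
}}
```

Because a using-declaration introduces an alias to the original entity, `std::is_same` on the two spellings holds, which is what lets existing user code compile unchanged.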
// ------------------------------------------------------------------ //
// ------------------------------------------------------------------ //
//md_parallel_for - deprecated use parallel_for
// ------------------------------------------------------------------ //
namespace Kokkos { namespace Experimental {
template <typename MDRange, typename Functor, typename Enable = void>
void md_parallel_for( MDRange const& range
, Functor const& f
@ -347,7 +359,7 @@ void md_parallel_for( MDRange const& range
) >::type* = 0
)
{
Impl::MDFunctor<MDRange, Functor, void> g(range, f);
Kokkos::Impl::Experimental::MDFunctor<MDRange, Functor, void> g(range, f);
using range_policy = typename MDRange::impl_range_policy;
@ -365,7 +377,7 @@ void md_parallel_for( const std::string& str
) >::type* = 0
)
{
Impl::MDFunctor<MDRange, Functor, void> g(range, f);
Kokkos::Impl::Experimental::MDFunctor<MDRange, Functor, void> g(range, f);
using range_policy = typename MDRange::impl_range_policy;
@ -385,7 +397,7 @@ void md_parallel_for( const std::string& str
) >::type* = 0
)
{
Impl::DeviceIterateTile<MDRange, Functor, typename MDRange::work_tag> closure(range, f);
Kokkos::Impl::DeviceIterateTile<MDRange, Functor, typename MDRange::work_tag> closure(range, f);
closure.execute();
}
@ -400,7 +412,7 @@ void md_parallel_for( MDRange const& range
) >::type* = 0
)
{
Impl::DeviceIterateTile<MDRange, Functor, typename MDRange::work_tag> closure(range, f);
Kokkos::Impl::DeviceIterateTile<MDRange, Functor, typename MDRange::work_tag> closure(range, f);
closure.execute();
}
#endif
@ -421,7 +433,7 @@ void md_parallel_reduce( MDRange const& range
) >::type* = 0
)
{
Impl::MDFunctor<MDRange, Functor, ValueType> g(range, f);
Kokkos::Impl::Experimental::MDFunctor<MDRange, Functor, ValueType> g(range, f);
using range_policy = typename MDRange::impl_range_policy;
Kokkos::parallel_reduce( str, range_policy(0, range.m_num_tiles).set_chunk_size(1), g, v );
@ -439,7 +451,7 @@ void md_parallel_reduce( const std::string& str
) >::type* = 0
)
{
Impl::MDFunctor<MDRange, Functor, ValueType> g(range, f);
Kokkos::Impl::Experimental::MDFunctor<MDRange, Functor, ValueType> g(range, f);
using range_policy = typename MDRange::impl_range_policy;
@ -448,7 +460,7 @@ void md_parallel_reduce( const std::string& str
// Cuda - md_parallel_reduce not implemented - use parallel_reduce
}} // namespace Kokkos::Experimental
} } // namespace Kokkos::Experimental
#endif //KOKKOS_CORE_EXP_MD_RANGE_POLICY_HPP


@ -81,10 +81,10 @@ struct IndexType
/**\brief Specify Launch Bounds for CUDA execution.
*
* The "best" defaults may be architecture specific.
* If no launch bounds specified then do not set launch bounds.
*/
template< unsigned int maxT = 1024 /* Max threads per block */
, unsigned int minB = 1 /* Min blocks per SM */
template< unsigned int maxT = 0 /* Max threads per block */
, unsigned int minB = 0 /* Min blocks per SM */
>
struct LaunchBounds
{
@ -280,6 +280,9 @@ struct MemorySpaceAccess {
enum { deepcopy = assignable };
};
}} // namespace Kokkos::Impl
namespace Kokkos {
/**\brief Can AccessSpace access MemorySpace ?
*
@ -358,6 +361,13 @@ public:
>::type space ;
};
} // namespace Kokkos
namespace Kokkos {
namespace Impl {
using Kokkos::SpaceAccessibility ; // For backward compatibility
}} // namespace Kokkos::Impl
//----------------------------------------------------------------------------


@ -99,13 +99,17 @@ struct InitArguments {
int num_threads;
int num_numa;
int device_id;
bool disable_warnings;
InitArguments( int nt = -1
, int nn = -1
, int dv = -1)
: num_threads( nt )
, num_numa( nn )
, device_id( dv )
, int dv = -1
, bool dw = false
)
: num_threads{ nt }
, num_numa{ nn }
, device_id{ dv }
, disable_warnings{ dw }
{}
};
@ -113,6 +117,10 @@ void initialize(int& narg, char* arg[]);
void initialize(const InitArguments& args = InitArguments());
bool is_initialized() noexcept;
bool show_warnings() noexcept;
/** \brief Finalize the spaces that were initialized via Kokkos::initialize */
void finalize();
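The `InitArguments` change above adds a trailing `disable_warnings` field with a default, so every existing call site stays valid. A standalone replica of the shape (not the real Kokkos type):

```cpp
#include <cassert>

// Adding a defaulted trailing parameter is source-compatible: old
// constructor calls with 0-3 arguments continue to compile unchanged.
struct Args {
  int num_threads;
  int num_numa;
  int device_id;
  bool disable_warnings;

  Args(int nt = -1, int nn = -1, int dv = -1, bool dw = false)
    : num_threads{nt}
    , num_numa{nn}
    , device_id{dv}
    , disable_warnings{dw}
  {}
};
```

The brace-initializers in the member list (switched from parentheses in the diff) also reject narrowing conversions at compile time.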


@ -45,7 +45,6 @@
#define KOKKOS_CRS_HPP
namespace Kokkos {
namespace Experimental {
/// \class Crs
/// \brief Compressed row storage array.
@ -164,7 +163,7 @@ void transpose_crs(
Crs<DataType, Arg1Type, Arg2Type, SizeType>& out,
Crs<DataType, Arg1Type, Arg2Type, SizeType> const& in);
}} // namespace Kokkos::Experimental
} // namespace Kokkos
/*--------------------------------------------------------------------------*/
@ -172,7 +171,6 @@ void transpose_crs(
namespace Kokkos {
namespace Impl {
namespace Experimental {
template <class InCrs, class OutCounts>
class GetCrsTransposeCounts {
@ -277,14 +275,13 @@ class FillCrsTransposeEntries {
}
};
}}} // namespace Kokkos::Impl::Experimental
}} // namespace Kokkos::Impl
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Experimental {
template< class OutCounts,
class DataType,
@ -297,8 +294,7 @@ void get_crs_transpose_counts(
std::string const& name) {
using InCrs = Crs<DataType, Arg1Type, Arg2Type, SizeType>;
out = OutCounts(name, in.numRows());
Kokkos::Impl::Experimental::
GetCrsTransposeCounts<InCrs, OutCounts> functor(in, out);
Kokkos::Impl::GetCrsTransposeCounts<InCrs, OutCounts> functor(in, out);
}
template< class OutRowMap,
@ -308,8 +304,7 @@ typename OutRowMap::value_type get_crs_row_map_from_counts(
InCounts const& in,
std::string const& name) {
out = OutRowMap(ViewAllocateWithoutInitializing(name), in.size() + 1);
Kokkos::Impl::Experimental::
CrsRowMapFromCounts<InCounts, OutRowMap> functor(in, out);
Kokkos::Impl::CrsRowMapFromCounts<InCounts, OutRowMap> functor(in, out);
return functor.execute();
}
@ -326,32 +321,37 @@ void transpose_crs(
typedef View<SizeType*, memory_space> counts_type ;
{
counts_type counts;
Kokkos::Experimental::get_crs_transpose_counts(counts, in);
Kokkos::Experimental::get_crs_row_map_from_counts(out.row_map, counts,
Kokkos::get_crs_transpose_counts(counts, in);
Kokkos::get_crs_row_map_from_counts(out.row_map, counts,
"tranpose_row_map");
}
out.entries = decltype(out.entries)("transpose_entries", in.entries.size());
Kokkos::Impl::Experimental::
Kokkos::Impl::
FillCrsTransposeEntries<crs_type, crs_type> entries_functor(in, out);
}
template< class CrsType,
class Functor>
struct CountAndFill {
class Functor,
class ExecutionSpace = typename CrsType::execution_space>
struct CountAndFillBase;
template< class CrsType,
class Functor,
class ExecutionSpace>
struct CountAndFillBase {
using data_type = typename CrsType::size_type;
using size_type = typename CrsType::size_type;
using row_map_type = typename CrsType::row_map_type;
using entries_type = typename CrsType::entries_type;
using counts_type = row_map_type;
CrsType m_crs;
Functor m_functor;
counts_type m_counts;
struct Count {};
KOKKOS_INLINE_FUNCTION void operator()(Count, size_type i) const {
inline void operator()(Count, size_type i) const {
m_counts(i) = m_functor(i, nullptr);
}
struct Fill {};
KOKKOS_INLINE_FUNCTION void operator()(Fill, size_type i) const {
inline void operator()(Fill, size_type i) const {
auto j = m_crs.row_map(i);
/* we don't want to access entries(entries.size()), even if it's just to get its
address and never use it.
@ -363,13 +363,63 @@ struct CountAndFill {
nullptr : (&(m_crs.entries(j)));
m_functor(i, fill);
}
using self_type = CountAndFill<CrsType, Functor>;
CountAndFill(CrsType& crs, size_type nrows, Functor const& f):
CountAndFillBase(CrsType& crs, Functor const& f):
m_crs(crs),
m_functor(f)
{}
};
#if defined( KOKKOS_ENABLE_CUDA )
template< class CrsType,
class Functor>
struct CountAndFillBase<CrsType, Functor, Kokkos::Cuda> {
using data_type = typename CrsType::size_type;
using size_type = typename CrsType::size_type;
using row_map_type = typename CrsType::row_map_type;
using counts_type = row_map_type;
CrsType m_crs;
Functor m_functor;
counts_type m_counts;
struct Count {};
__device__ inline void operator()(Count, size_type i) const {
m_counts(i) = m_functor(i, nullptr);
}
struct Fill {};
__device__ inline void operator()(Fill, size_type i) const {
auto j = m_crs.row_map(i);
/* we don't want to access entries(entries.size()), even if it's just to get its
address and never use it.
this can happen when row (i) is empty and all rows after it are also empty.
we could compare to row_map(i + 1), but that is a read from global memory,
whereas dimension_0() should be part of the View in registers (or constant memory) */
data_type* fill =
(j == static_cast<decltype(j)>(m_crs.entries.dimension_0())) ?
nullptr : (&(m_crs.entries(j)));
m_functor(i, fill);
}
CountAndFillBase(CrsType& crs, Functor const& f):
m_crs(crs),
m_functor(f)
{}
};
#endif
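The `CountAndFillBase` refactor above writes the shared logic once in a derived class and selects the `operator()` body through a partial specialization of the base on the execution space, so the Cuda build can use `__device__`-qualified operators. A simplified host-only sketch of that dispatch (tags and names are illustrative; a different return string stands in for the differently-qualified body):

```cpp
#include <cassert>
#include <string>

struct Host {};
struct Device {};

// Primary base: the "default" operator body.
template <class ExecSpace>
struct WorkBase {
  std::string run() const { return "host path"; }
};

// Specialized base: swapped-in body for the Device space (in the real
// code this is where the __device__ qualifier appears).
template <>
struct WorkBase<Device> {
  std::string run() const { return "device path"; }
};

// Derived class holds the shared logic and inherits the right body.
template <class ExecSpace>
struct Work : WorkBase<ExecSpace> {
  std::string describe() const { return "via " + this->run(); }
};
```

Note `this->run()`: in a template, members of a dependent base must be reached through `this->` (or a using-declaration), which is also why the diff switches `m_counts` to `this->m_counts` in `CountAndFill`.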
template< class CrsType,
class Functor>
struct CountAndFill : public CountAndFillBase<CrsType, Functor> {
using base_type = CountAndFillBase<CrsType, Functor>;
using typename base_type::data_type;
using typename base_type::size_type;
using typename base_type::counts_type;
using typename base_type::Count;
using typename base_type::Fill;
using entries_type = typename CrsType::entries_type;
using self_type = CountAndFill<CrsType, Functor>;
CountAndFill(CrsType& crs, size_type nrows, Functor const& f):
base_type(crs, f)
{
using execution_space = typename CrsType::execution_space;
m_counts = counts_type("counts", nrows);
this->m_counts = counts_type("counts", nrows);
{
using count_policy_type = RangePolicy<size_type, execution_space, Count>;
using count_closure_type =
@ -377,10 +427,10 @@ struct CountAndFill {
const count_closure_type closure(*this, count_policy_type(0, nrows));
closure.execute();
}
auto nentries = Kokkos::Experimental::
get_crs_row_map_from_counts(m_crs.row_map, m_counts);
m_counts = counts_type();
m_crs.entries = entries_type("entries", nentries);
auto nentries = Kokkos::
get_crs_row_map_from_counts(this->m_crs.row_map, this->m_counts);
this->m_counts = counts_type();
this->m_crs.entries = entries_type("entries", nentries);
{
using fill_policy_type = RangePolicy<size_type, execution_space, Fill>;
using fill_closure_type =
@ -388,7 +438,7 @@ struct CountAndFill {
const fill_closure_type closure(*this, fill_policy_type(0, nrows));
closure.execute();
}
crs = m_crs;
crs = this->m_crs;
}
};
@ -398,9 +448,9 @@ void count_and_fill_crs(
CrsType& crs,
typename CrsType::size_type nrows,
Functor const& f) {
Kokkos::Experimental::CountAndFill<CrsType, Functor>(crs, nrows, f);
Kokkos::CountAndFill<CrsType, Functor>(crs, nrows, f);
}
}} // namespace Kokkos::Experimental
} // namespace Kokkos
#endif /* #define KOKKOS_CRS_HPP */


@ -379,12 +379,13 @@ Impl::PerThreadValue PerThread(const int& arg);
* uses variadic templates. Each and any of the template arguments can
* be omitted.
*
* Possible Template arguments and there default values:
* Possible Template arguments and their default values:
* ExecutionSpace (DefaultExecutionSpace): where to execute code. Must be enabled.
* WorkTag (none): Tag which is used as the first argument for the functor operator.
* Schedule<Type> (Schedule<Static>): Scheduling Policy (Dynamic, or Static).
* IndexType<Type> (IndexType<ExecutionSpace::size_type>: Integer Index type used to iterate over the Index space.
* LaunchBounds<int,int> (LaunchBounds<1024,1>: Launch Bounds for CUDA compilation.
* LaunchBounds<unsigned,unsigned> Launch Bounds for CUDA compilation,
* default of LaunchBounds<0,0> indicates no launch bounds specified.
*/
template< class ... Properties>
class TeamPolicy: public
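The `LaunchBounds` change above replaces defaults of `<1024, 1>` with `<0, 0>`, with zero acting as a sentinel for "no bound requested" so the backend can skip emitting `__launch_bounds__` entirely. A minimal replica of that sentinel-default idiom (names are illustrative):

```cpp
#include <cassert>

// Zero defaults mean "unspecified"; any nonzero value is an explicit
// request the backend should honor.
template <unsigned MaxT = 0 /* max threads per block */,
          unsigned MinB = 0 /* min blocks per SM */>
struct Bounds {
  static constexpr unsigned max_threads = MaxT;
  static constexpr unsigned min_blocks  = MinB;
  static constexpr bool specified = (MaxT != 0) || (MinB != 0);
};
```

Checking `specified` at compile time lets the launch machinery distinguish "the user asked for 1024 threads" from "the user said nothing", which the old `<1024, 1>` default could not.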


@ -251,7 +251,7 @@
#endif
#endif
#if defined( __PGIC__ ) && !defined( __GNUC__ )
#if defined( __PGIC__ )
#define KOKKOS_COMPILER_PGI __PGIC__*100+__PGIC_MINOR__*10+__PGIC_PATCHLEVEL__
#if ( 1540 > KOKKOS_COMPILER_PGI )
@ -268,24 +268,22 @@
#define KOKKOS_ENABLE_PRAGMA_UNROLL 1
#define KOKKOS_ENABLE_PRAGMA_LOOPCOUNT 1
#define KOKKOS_ENABLE_PRAGMA_VECTOR 1
#define KOKKOS_ENABLE_PRAGMA_SIMD 1
#if ( 1800 > KOKKOS_COMPILER_INTEL )
#define KOKKOS_ENABLE_PRAGMA_SIMD 1
#endif
#if ( __INTEL_COMPILER > 1400 )
#define KOKKOS_ENABLE_PRAGMA_IVDEP 1
#endif
#if ! defined( KOKKOS_MEMORY_ALIGNMENT )
#define KOKKOS_MEMORY_ALIGNMENT 64
#endif
#define KOKKOS_RESTRICT __restrict__
#ifndef KOKKOS_ALIGN
#define KOKKOS_ALIGN(size) __attribute__((aligned(size)))
#endif
#ifndef KOKKOS_ALIGN_PTR
#define KOKKOS_ALIGN_PTR(size) __attribute__((align_value(size)))
#endif
#ifndef KOKKOS_ALIGN_SIZE
#define KOKKOS_ALIGN_SIZE 64
#ifndef KOKKOS_IMPL_ALIGN_PTR
#define KOKKOS_IMPL_ALIGN_PTR(size) __attribute__((align_value(size)))
#endif
#if ( 1400 > KOKKOS_COMPILER_INTEL )
@ -351,6 +349,11 @@
#if !defined( KOKKOS_FORCEINLINE_FUNCTION )
#define KOKKOS_FORCEINLINE_FUNCTION inline __attribute__((always_inline))
#endif
#if !defined( KOKKOS_IMPL_ALIGN_PTR )
#define KOKKOS_IMPL_ALIGN_PTR(size) __attribute__((aligned(size)))
#endif
#endif
//----------------------------------------------------------------------------
@ -426,16 +429,16 @@
//----------------------------------------------------------------------------
// Define Macro for alignment:
#if !defined KOKKOS_ALIGN_SIZE
#define KOKKOS_ALIGN_SIZE 16
#if ! defined( KOKKOS_MEMORY_ALIGNMENT )
#define KOKKOS_MEMORY_ALIGNMENT 16
#endif
#if !defined( KOKKOS_ALIGN )
#define KOKKOS_ALIGN(size) __attribute__((aligned(size)))
#if ! defined( KOKKOS_MEMORY_ALIGNMENT_THRESHOLD )
#define KOKKOS_MEMORY_ALIGNMENT_THRESHOLD 4
#endif
#if !defined( KOKKOS_ALIGN_PTR )
#define KOKKOS_ALIGN_PTR(size) __attribute__((aligned(size)))
#if !defined( KOKKOS_IMPL_ALIGN_PTR )
#define KOKKOS_IMPL_ALIGN_PTR(size) /* */
#endif
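What the `KOKKOS_MEMORY_ALIGNMENT`-style macros above ultimately request is storage whose address is a multiple of the alignment; `__attribute__((aligned(n)))` and C++11 `alignas` express the same thing. A small sketch (64 matches the Intel branch's default above; the struct name is illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// alignas(64) over-aligns the type to a cache-line-sized boundary,
// which is what vectorized loads and stores want.
struct alignas(64) AlignedBlock {
  double data[8];
};

// Check whether a pointer is aligned to `a` bytes.
inline bool is_aligned(const void* p, std::size_t a) {
  return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}
```

`alignof(AlignedBlock)` reports the requested alignment, and the compiler must honor it even for stack objects.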
//----------------------------------------------------------------------------
@ -510,5 +513,11 @@
#define KOKKOS_ENABLE_TASKDAG
#endif
#if defined ( KOKKOS_ENABLE_CUDA )
#if ( 9000 <= CUDA_VERSION )
#define KOKKOS_IMPL_CUDA_VERSION_9_WORKAROUND
#endif
#endif
#endif // #ifndef KOKKOS_MACROS_HPP


@ -51,6 +51,27 @@
#include <impl/Kokkos_Error.hpp>
#include <impl/Kokkos_SharedAlloc.hpp>
namespace Kokkos {
namespace Impl {
/* Report violation of size constraints:
* min_block_alloc_size <= max_block_alloc_size
* max_block_alloc_size <= min_superblock_size
* min_superblock_size <= max_superblock_size
* min_superblock_size <= min_total_alloc_size
* min_superblock_size <= min_block_alloc_size *
* max_block_per_superblock
*/
void memory_pool_bounds_verification
( size_t min_block_alloc_size
, size_t max_block_alloc_size
, size_t min_superblock_size
, size_t max_superblock_size
, size_t max_block_per_superblock
, size_t min_total_alloc_size
);
}
}
namespace Kokkos {
template< typename DeviceType >
@ -111,6 +132,10 @@ private:
public:
/**\brief The maximum size of a superblock and block */
enum : uint32_t { max_superblock_size = 1LU << 31 /* 2 gigabytes */ };
enum : uint32_t { max_block_per_superblock = max_bit_count };
//--------------------------------------------------------------------------
KOKKOS_INLINE_FUNCTION
@ -206,7 +231,7 @@ public:
const uint32_t * sb_state_ptr = sb_state_array ;
s << "pool_size(" << ( size_t(m_sb_count) << m_sb_size_lg2 ) << ")"
<< " superblock_size(" << ( 1 << m_sb_size_lg2 ) << ")" << std::endl ;
<< " superblock_size(" << ( 1LU << m_sb_size_lg2 ) << ")" << std::endl ;
for ( int32_t i = 0 ; i < m_sb_count
; ++i , sb_state_ptr += m_sb_state_size ) {
@ -215,7 +240,7 @@ public:
const uint32_t block_count_lg2 = (*sb_state_ptr) >> state_shift ;
const uint32_t block_size_lg2 = m_sb_size_lg2 - block_count_lg2 ;
const uint32_t block_count = 1 << block_count_lg2 ;
const uint32_t block_count = 1u << block_count_lg2 ;
const uint32_t block_used = (*sb_state_ptr) & state_used_mask ;
s << "Superblock[ " << i << " / " << m_sb_count << " ] {"
@ -284,43 +309,71 @@ public:
{
const uint32_t int_align_lg2 = 3 ; /* align as int[8] */
const uint32_t int_align_mask = ( 1u << int_align_lg2 ) - 1 ;
const uint32_t default_min_block_size = 1u << 6 ; /* 64 bytes */
const uint32_t default_max_block_size = 1u << 12 ;/* 4k bytes */
const uint32_t default_min_superblock_size = 1u << 20 ;/* 1M bytes */
// Constraints and defaults:
// min_block_alloc_size <= max_block_alloc_size
// max_block_alloc_size <= min_superblock_size
// min_superblock_size <= min_total_alloc_size
//--------------------------------------------------
// Default block and superblock sizes:
const uint32_t MIN_BLOCK_SIZE = 1u << 6 /* 64 bytes */ ;
const uint32_t MAX_BLOCK_SIZE = 1u << 12 /* 4k bytes */ ;
if ( 0 == min_block_alloc_size ) {
// Default all sizes:
if ( 0 == min_block_alloc_size ) min_block_alloc_size = MIN_BLOCK_SIZE ;
min_superblock_size =
std::min( size_t(default_min_superblock_size)
, min_total_alloc_size );
min_block_alloc_size =
std::min( size_t(default_min_block_size)
, min_superblock_size );
max_block_alloc_size =
std::min( size_t(default_max_block_size)
, min_superblock_size );
}
else if ( 0 == min_superblock_size ) {
// Choose superblock size as minimum of:
// max_block_per_superblock * min_block_size
// max_superblock_size
// min_total_alloc_size
const size_t max_superblock =
min_block_alloc_size * max_block_per_superblock ;
min_superblock_size =
std::min( max_superblock ,
std::min( size_t(max_superblock_size)
, min_total_alloc_size ) );
}
if ( 0 == max_block_alloc_size ) {
max_block_alloc_size = MAX_BLOCK_SIZE ;
// Upper bound of total allocation size
max_block_alloc_size = std::min( size_t(max_block_alloc_size)
, min_total_alloc_size );
// Lower bound of minimum block size
max_block_alloc_size = std::max( max_block_alloc_size
, min_block_alloc_size );
max_block_alloc_size = min_superblock_size ;
}
if ( 0 == min_superblock_size ) {
min_superblock_size = max_block_alloc_size ;
//--------------------------------------------------
// Upper bound of total allocation size
min_superblock_size = std::min( size_t(min_superblock_size)
, min_total_alloc_size );
/* Enforce size constraints:
* min_block_alloc_size <= max_block_alloc_size
* max_block_alloc_size <= min_superblock_size
* min_superblock_size <= max_superblock_size
* min_superblock_size <= min_total_alloc_size
* min_superblock_size <= min_block_alloc_size *
* max_block_per_superblock
*/
// Lower bound of maximum block size
min_superblock_size = std::max( min_superblock_size
, max_block_alloc_size );
}
Kokkos::Impl::memory_pool_bounds_verification
( min_block_alloc_size
, max_block_alloc_size
, min_superblock_size
, max_superblock_size
, max_block_per_superblock
, min_total_alloc_size
);
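The call above delegates the size-constraint check to `memory_pool_bounds_verification`, whose contract is the comment block's five inequalities. A sketch of that check (the real function reports a Kokkos error and aborts; here a throw stands in, and the parameter names follow the declaration above):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Enforce the memory pool's size ordering:
//   min_block <= max_block <= min_superblock <= max_superblock
//   min_superblock <= min_total_alloc
//   min_superblock <= min_block * max_block_per_superblock
void verify_pool_bounds(std::size_t min_block, std::size_t max_block,
                        std::size_t min_sb, std::size_t max_sb,
                        std::size_t max_block_per_sb,
                        std::size_t min_total) {
  const bool ok = min_block <= max_block &&
                  max_block <= min_sb &&
                  min_sb    <= max_sb &&
                  min_sb    <= min_total &&
                  min_sb    <= min_block * max_block_per_sb;
  if (!ok) throw std::invalid_argument("memory pool size constraints violated");
}
```

The last inequality is the subtle one: a superblock's block bitset can track at most `max_block_per_superblock` blocks, so the superblock cannot be larger than that many minimum-size blocks.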
//--------------------------------------------------
// Block and superblock size is power of two:
// Maximum value is 'max_superblock_size'
m_min_block_size_lg2 =
Kokkos::Impl::integral_power_of_two_that_contains(min_block_alloc_size);
@ -331,45 +384,26 @@ public:
m_sb_size_lg2 =
Kokkos::Impl::integral_power_of_two_that_contains(min_superblock_size);
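`integral_power_of_two_that_contains(n)` above returns the smallest exponent `k` with `n <= 2^k`, which is how the pool rounds requested block and superblock sizes up to powers of two. A plain version of that behavior (a sketch; the Kokkos implementation uses bit tricks rather than a loop):

```cpp
#include <cassert>
#include <cstdint>

// Smallest k such that n <= 2^k. For n = 0 or 1 the answer is 0.
uint32_t pow2_that_contains(uint64_t n) {
  uint32_t k = 0;
  while ((uint64_t(1) << k) < n) ++k;
  return k;
}
```

So a request for a 65-byte block lands in the 128-byte (`lg2 = 7`) size class, while 64 bytes stays at `lg2 = 6`.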
// Constraints:
// m_min_block_size_lg2 <= m_max_block_size_lg2 <= m_sb_size_lg2
// m_sb_size_lg2 <= m_min_block_size + max_bit_count_lg2
{
// number of superblocks is multiple of superblock size that
// can hold min_total_alloc_size.
if ( m_min_block_size_lg2 + max_bit_count_lg2 < m_sb_size_lg2 ) {
m_min_block_size_lg2 = m_sb_size_lg2 - max_bit_count_lg2 ;
}
if ( m_min_block_size_lg2 + max_bit_count_lg2 < m_max_block_size_lg2 ) {
m_min_block_size_lg2 = m_max_block_size_lg2 - max_bit_count_lg2 ;
}
if ( m_max_block_size_lg2 < m_min_block_size_lg2 ) {
m_max_block_size_lg2 = m_min_block_size_lg2 ;
}
if ( m_sb_size_lg2 < m_max_block_size_lg2 ) {
m_sb_size_lg2 = m_max_block_size_lg2 ;
const uint64_t sb_size_mask = ( 1LU << m_sb_size_lg2 ) - 1 ;
m_sb_count = ( min_total_alloc_size + sb_size_mask ) >> m_sb_size_lg2 ;
}
// At least 32 minimum size blocks in a superblock
{
// Any superblock can be assigned to the smallest size block
// Size the block bitset to maximum number of blocks
if ( m_sb_size_lg2 < m_min_block_size_lg2 + 5 ) {
m_sb_size_lg2 = m_min_block_size_lg2 + 5 ;
const uint32_t max_block_count_lg2 =
m_sb_size_lg2 - m_min_block_size_lg2 ;
m_sb_state_size =
( CB::buffer_bound_lg2( max_block_count_lg2 ) + int_align_mask ) & ~int_align_mask ;
}
// number of superblocks is multiple of superblock size that
// can hold min_total_alloc_size.
const uint32_t sb_size_mask = ( 1u << m_sb_size_lg2 ) - 1 ;
m_sb_count = ( min_total_alloc_size + sb_size_mask ) >> m_sb_size_lg2 ;
// Any superblock can be assigned to the smallest size block
// Size the block bitset to maximum number of blocks
const uint32_t max_block_count_lg2 =
m_sb_size_lg2 - m_min_block_size_lg2 ;
m_sb_state_size =
( CB::buffer_bound_lg2( max_block_count_lg2 ) + int_align_mask ) & ~int_align_mask ;
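Two bit idioms recur in the superblock sizing above: rounding a size up to a power-of-two multiple with `(x + mask) & ~mask` (the `int_align_mask` line), and a ceiling divide by `2^k` written `(x + mask) >> k` (the `m_sb_count` line). Both rely on `mask = 2^k - 1`; a small standalone version:

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of 2^k.
inline uint64_t round_up(uint64_t x, uint32_t k) {
  const uint64_t mask = (uint64_t(1) << k) - 1;
  return (x + mask) & ~mask;
}

// Ceiling of x / 2^k, i.e. how many 2^k-sized chunks cover x bytes.
inline uint64_t ceil_div_pow2(uint64_t x, uint32_t k) {
  const uint64_t mask = (uint64_t(1) << k) - 1;
  return (x + mask) >> k;
}
```

Adding the mask before truncating pushes any partial remainder over the boundary, so exact multiples are unchanged and everything else rounds up.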
// Array of all superblock states
const size_t all_sb_state_size =
@ -454,7 +488,7 @@ private:
* Restrict lower bound to minimum block size.
*/
KOKKOS_FORCEINLINE_FUNCTION
unsigned get_block_size_lg2( unsigned n ) const noexcept
uint32_t get_block_size_lg2( uint32_t n ) const noexcept
{
const unsigned i = Kokkos::Impl::integral_power_of_two_that_contains( n );
@ -463,11 +497,12 @@ private:
public:
/* Return 0 for invalid block size */
KOKKOS_INLINE_FUNCTION
uint32_t allocate_block_size( uint32_t alloc_size ) const noexcept
uint32_t allocate_block_size( uint64_t alloc_size ) const noexcept
{
return alloc_size <= (1UL << m_max_block_size_lg2)
? ( 1u << get_block_size_lg2( alloc_size ) )
? ( 1UL << get_block_size_lg2( uint32_t(alloc_size) ) )
: 0 ;
}
@ -485,246 +520,253 @@ public:
void * allocate( size_t alloc_size
, int32_t attempt_limit = 1 ) const noexcept
{
if ( size_t(1LU << m_max_block_size_lg2) < alloc_size ) {
Kokkos::abort("Kokkos MemoryPool allocation request exceeded specified maximum allocation size");
}
if ( 0 == alloc_size ) return (void*) 0 ;
void * p = 0 ;
const uint32_t block_size_lg2 = get_block_size_lg2( alloc_size );
if ( block_size_lg2 <= m_max_block_size_lg2 ) {
// Allocation will fit within a superblock
// that has block sizes ( 1 << block_size_lg2 )
// Allocation will fit within a superblock
// that has block sizes ( 1 << block_size_lg2 )
const uint32_t block_count_lg2 = m_sb_size_lg2 - block_size_lg2 ;
const uint32_t block_state = block_count_lg2 << state_shift ;
const uint32_t block_count = 1u << block_count_lg2 ;
const uint32_t block_count_lg2 = m_sb_size_lg2 - block_size_lg2 ;
const uint32_t block_state = block_count_lg2 << state_shift ;
const uint32_t block_count = 1u << block_count_lg2 ;
// Superblock hints for this block size:
// hint_sb_id_ptr[0] is the dynamically changing hint
// hint_sb_id_ptr[1] is the static start point
// Superblock hints for this block size:
// hint_sb_id_ptr[0] is the dynamically changing hint
// hint_sb_id_ptr[1] is the static start point
volatile uint32_t * const hint_sb_id_ptr
= m_sb_state_array /* memory pool state array */
+ m_hint_offset /* offset to hint portion of array */
+ HINT_PER_BLOCK_SIZE /* number of hints per block size */
* ( block_size_lg2 - m_min_block_size_lg2 ); /* block size id */
volatile uint32_t * const hint_sb_id_ptr
= m_sb_state_array /* memory pool state array */
+ m_hint_offset /* offset to hint portion of array */
+ HINT_PER_BLOCK_SIZE /* number of hints per block size */
* ( block_size_lg2 - m_min_block_size_lg2 ); /* block size id */
const int32_t sb_id_begin = int32_t( hint_sb_id_ptr[1] );
const int32_t sb_id_begin = int32_t( hint_sb_id_ptr[1] );
// Fast query clock register 'tic' to pseudo-randomize
// the guess for which block within a superblock should
// be claimed. If not available then a search occurs.
// Fast query clock register 'tic' to pseudo-randomize
// the guess for which block within a superblock should
// be claimed. If not available then a search occurs.
const uint32_t block_id_hint =
(uint32_t)( Kokkos::Impl::clock_tic()
const uint32_t block_id_hint =
(uint32_t)( Kokkos::Impl::clock_tic()
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA )
// Spread out potentially concurrent access
// by threads within a warp or thread block.
+ ( threadIdx.x + blockDim.x * threadIdx.y )
// Spread out potentially concurrent access
// by threads within a warp or thread block.
+ ( threadIdx.x + blockDim.x * threadIdx.y )
#endif
);
);
// expected state of superblock for allocation
uint32_t sb_state = block_state ;
// expected state of superblock for allocation
uint32_t sb_state = block_state ;
int32_t sb_id = -1 ;
int32_t sb_id = -1 ;
volatile uint32_t * sb_state_array = 0 ;
volatile uint32_t * sb_state_array = 0 ;
while ( attempt_limit ) {
while ( attempt_limit ) {
int32_t hint_sb_id = -1 ;
int32_t hint_sb_id = -1 ;
if ( sb_id < 0 ) {
if ( sb_id < 0 ) {
// No superblock specified, try the hint for this block size
// No superblock specified, try the hint for this block size
sb_id = hint_sb_id = int32_t( *hint_sb_id_ptr );
sb_id = hint_sb_id = int32_t( *hint_sb_id_ptr );
sb_state_array = m_sb_state_array + ( sb_id * m_sb_state_size );
}
// Require:
// 0 <= sb_id
// sb_state_array == m_sb_state_array + m_sb_state_size * sb_id
if ( sb_state == ( state_header_mask & *sb_state_array ) ) {
// This superblock state is as expected, for the moment.
// Attempt to claim a bit. The attempt updates the state
// so have already made sure the state header is as expected.
const uint32_t count_lg2 = sb_state >> state_shift ;
const uint32_t mask = ( 1u << count_lg2 ) - 1 ;
const Kokkos::pair<int,int> result =
CB::acquire_bounded_lg2( sb_state_array
, count_lg2
, block_id_hint & mask
, sb_state
);
// If result.first < 0 then failed to acquire
// due to either full or buffer was wrong state.
// Could be wrong state if a deallocation raced the
// superblock to empty before the acquire could succeed.
if ( 0 <= result.first ) { // acquired a bit
const uint32_t size_lg2 = m_sb_size_lg2 - count_lg2 ;
// Set the allocated block pointer
p = ((char*)( m_sb_state_array + m_data_offset ))
+ ( uint64_t(sb_id) << m_sb_size_lg2 ) // superblock memory
+ ( uint64_t(result.first) << size_lg2 ); // block memory
#if 0
printf( " MemoryPool(0x%lx) pointer(0x%lx) allocate(%lu) sb_id(%d) sb_state(0x%x) block_size(%d) block_capacity(%d) block_id(%d) block_claimed(%d)\n"
, (uintptr_t)m_sb_state_array
, (uintptr_t)p
, alloc_size
, sb_id
, sb_state
, (1u << size_lg2)
, (1u << count_lg2)
, result.first
, result.second );
#endif
break ; // Success
}
}
//------------------------------------------------------------------
// Arrive here if failed to acquire a block.
// Must find a new superblock.
// Start searching at designated index for this block size.
// Look for superblock that, in preferential order,
// 1) part-full superblock of this block size
// 2) empty superblock to claim for this block size
// 3) part-full superblock of the next larger block size
sb_state = block_state ; // Expect to find the desired state
sb_id = -1 ;
bool update_hint = false ;
int32_t sb_id_empty = -1 ;
int32_t sb_id_large = -1 ;
uint32_t sb_state_large = 0 ;
sb_state_array = m_sb_state_array + sb_id_begin * m_sb_state_size ;
for ( int32_t i = 0 , id = sb_id_begin ; i < m_sb_count ; ++i ) {
// Query state of the candidate superblock.
// Note that the state may change at any moment
// as concurrent allocations and deallocations occur.
const uint32_t full_state = *sb_state_array ;
const uint32_t used = full_state & state_used_mask ;
const uint32_t state = full_state & state_header_mask ;
if ( state == block_state ) {
// Superblock is assigned to this block size
if ( used < block_count ) {
// There is room to allocate one block
sb_id = id ;
// Is there room to allocate more than one block?
update_hint = used + 1 < block_count ;
break ;
}
}
else if ( 0 == used ) {
// Superblock is empty
if ( -1 == sb_id_empty ) {
// Superblock is not assigned to this block size
// and is the first empty superblock encountered.
// Save this id to use if a partfull superblock is not found.
sb_id_empty = id ;
}
}
else if ( ( -1 == sb_id_empty /* have not found an empty */ ) &&
( -1 == sb_id_large /* have not found a larger */ ) &&
( state < block_state /* a larger block */ ) &&
// is not full:
( used < ( 1u << ( state >> state_shift ) ) ) ) {
// First superblock encountered that is
// larger than this block size and
// has room for an allocation.
// Save this id to use if a partfull or empty superblock is not found
sb_id_large = id ;
sb_state_large = state ;
}
// Iterate around the superblock array:
if ( ++id < m_sb_count ) {
sb_state_array += m_sb_state_size ;
}
else {
id = 0 ;
sb_state_array = m_sb_state_array ;
}
}
// printf(" search m_sb_count(%d) sb_id(%d) sb_id_empty(%d) sb_id_large(%d)\n" , m_sb_count , sb_id , sb_id_empty , sb_id_large);
if ( sb_id < 0 ) {
// Did not find a partfull superblock for this block size.
if ( 0 <= sb_id_empty ) {
// Found first empty superblock following designated superblock
// Attempt to claim it for this block size.
// If the claim fails assume that another thread claimed it
// for this block size and try to use it anyway,
// but do not update hint.
sb_id = sb_id_empty ;
sb_state_array = m_sb_state_array + ( sb_id * m_sb_state_size );
// If successfully changed assignment of empty superblock 'sb_id'
// to this block_size then update the hint.
const uint32_t state_empty = state_header_mask & *sb_state_array ;
// If this thread claims the empty block then update the hint
update_hint =
state_empty ==
Kokkos::atomic_compare_exchange
(sb_state_array,state_empty,block_state);
}
else if ( 0 <= sb_id_large ) {
// Found a larger superblock with space available
sb_id = sb_id_large ;
sb_state = sb_state_large ;
sb_state_array = m_sb_state_array + ( sb_id * m_sb_state_size );
}
// Require:
// 0 <= sb_id
// sb_state_array == m_sb_state_array + m_sb_state_size * sb_id
if ( sb_state == ( state_header_mask & *sb_state_array ) ) {
// This superblock state is as expected, for the moment.
// Attempt to claim a bit. The attempt updates the state
// so we have already made sure the state header is as expected.
const uint32_t count_lg2 = sb_state >> state_shift ;
const uint32_t mask = ( 1u << count_lg2 ) - 1 ;
const Kokkos::pair<int,int> result =
CB::acquire_bounded_lg2( sb_state_array
, count_lg2
, block_id_hint & mask
, sb_state
);
// If result.first < 0 then the acquire failed because the
// superblock was either full or in the wrong state.
// Could be wrong state if a deallocation raced the
// superblock to empty before the acquire could succeed.
if ( 0 <= result.first ) { // acquired a bit
const uint32_t size_lg2 = m_sb_size_lg2 - count_lg2 ;
// Set the allocated block pointer
p = ((char*)( m_sb_state_array + m_data_offset ))
+ ( uint32_t(sb_id) << m_sb_size_lg2 ) // superblock memory
+ ( result.first << size_lg2 ); // block memory
break ; // Success
}
// printf(" acquire count_lg2(%d) sb_state(0x%x) sb_id(%d) result(%d,%d)\n" , count_lg2 , sb_state , sb_id , result.first , result.second );
else {
// Did not find a potentially usable superblock
--attempt_limit ;
}
//------------------------------------------------------------------
// Arrive here if failed to acquire a block.
// Must find a new superblock.
}
// Start searching at designated index for this block size.
// Look for superblock that, in preferential order,
// 1) part-full superblock of this block size
// 2) empty superblock to claim for this block size
// 3) part-full superblock of the next larger block size
sb_state = block_state ; // Expect to find the desired state
sb_id = -1 ;
bool update_hint = false ;
int32_t sb_id_empty = -1 ;
int32_t sb_id_large = -1 ;
uint32_t sb_state_large = 0 ;
sb_state_array = m_sb_state_array + sb_id_begin * m_sb_state_size ;
for ( int32_t i = 0 , id = sb_id_begin ; i < m_sb_count ; ++i ) {
// Query state of the candidate superblock.
// Note that the state may change at any moment
// as concurrent allocations and deallocations occur.
const uint32_t full_state = *sb_state_array ;
const uint32_t used = full_state & state_used_mask ;
const uint32_t state = full_state & state_header_mask ;
if ( state == block_state ) {
// Superblock is assigned to this block size
if ( used < block_count ) {
// There is room to allocate one block
sb_id = id ;
// Is there room to allocate more than one block?
update_hint = used + 1 < block_count ;
break ;
}
}
else if ( 0 == used ) {
// Superblock is empty
if ( -1 == sb_id_empty ) {
// Superblock is not assigned to this block size
// and is the first empty superblock encountered.
// Save this id to use if a partfull superblock is not found.
sb_id_empty = id ;
}
}
else if ( ( -1 == sb_id_empty /* have not found an empty */ ) &&
( -1 == sb_id_large /* have not found a larger */ ) &&
( state < block_state /* a larger block */ ) &&
// is not full:
( used < ( 1u << ( state >> state_shift ) ) ) ) {
// First superblock encountered that is
// larger than this block size and
// has room for an allocation.
// Save this id to use if a partfull or empty superblock is not found
sb_id_large = id ;
sb_state_large = state ;
}
// Iterate around the superblock array:
if ( ++id < m_sb_count ) {
sb_state_array += m_sb_state_size ;
}
else {
id = 0 ;
sb_state_array = m_sb_state_array ;
}
}
// printf(" search m_sb_count(%d) sb_id(%d) sb_id_empty(%d) sb_id_large(%d)\n" , m_sb_count , sb_id , sb_id_empty , sb_id_large);
if ( sb_id < 0 ) {
// Did not find a partfull superblock for this block size.
if ( 0 <= sb_id_empty ) {
// Found first empty superblock following designated superblock
// Attempt to claim it for this block size.
// If the claim fails assume that another thread claimed it
// for this block size and try to use it anyway,
// but do not update hint.
sb_id = sb_id_empty ;
sb_state_array = m_sb_state_array + ( sb_id * m_sb_state_size );
// If successfully changed assignment of empty superblock 'sb_id'
// to this block_size then update the hint.
const uint32_t state_empty = state_header_mask & *sb_state_array ;
// If this thread claims the empty block then update the hint
update_hint =
state_empty ==
Kokkos::atomic_compare_exchange
(sb_state_array,state_empty,block_state);
}
else if ( 0 <= sb_id_large ) {
// Found a larger superblock with space available
sb_id = sb_id_large ;
sb_state = sb_state_large ;
sb_state_array = m_sb_state_array + ( sb_id * m_sb_state_size );
}
else {
// Did not find a potentially usable superblock
--attempt_limit ;
}
}
if ( update_hint ) {
Kokkos::atomic_compare_exchange
( hint_sb_id_ptr , uint32_t(hint_sb_id) , uint32_t(sb_id) );
}
} // end allocation attempt loop
//--------------------------------------------------------------------
}
else {
Kokkos::abort("Kokkos MemoryPool allocation request exceeded specified maximum allocation size");
}
if ( update_hint ) {
Kokkos::atomic_compare_exchange
( hint_sb_id_ptr , uint32_t(hint_sb_id) , uint32_t(sb_id) );
}
} // end allocation attempt loop
//--------------------------------------------------------------------
return p ;
}
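The allocate path above searches the superblock array in a fixed preference order: a part-full superblock already assigned to this block size, else the first empty superblock, else a part-full superblock of the next larger block size. The following is a minimal single-threaded model of that preference (hypothetical `SbState` layout and names, not the actual Kokkos state encoding; a larger block size corresponds to a smaller `count_lg2`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a superblock's state: log2 of its block capacity,
// and the number of blocks currently claimed.
struct SbState { uint32_t count_lg2; uint32_t used; };

// Return the superblock index to allocate from, following the
// preference order described in the comments above, or -1.
int find_superblock(const std::vector<SbState>& sb, uint32_t want_count_lg2)
{
  int id_empty = -1;  // first empty superblock seen
  int id_large = -1;  // first part-full superblock of a larger block size
  for (int id = 0; id < (int)sb.size(); ++id) {
    const SbState& s = sb[id];
    if (s.count_lg2 == want_count_lg2) {           // assigned to this size
      if (s.used < (1u << s.count_lg2)) return id; // 1) part-full, same size
    }
    else if (s.used == 0) {                        // 2) empty, claimable
      if (id_empty < 0) id_empty = id;
    }
    else if (id_empty < 0 && id_large < 0 &&
             s.count_lg2 < want_count_lg2 &&       // larger blocks, fewer of them
             s.used < (1u << s.count_lg2)) {       // 3) part-full, larger size
      id_large = id;
    }
  }
  return id_empty >= 0 ? id_empty : id_large;
}
```

The real code performs the same scan circularly from a per-size hint and then races an atomic claim; the model only captures the selection priority.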
@@ -765,7 +807,7 @@ public:
const uint32_t block_size_lg2 =
m_sb_size_lg2 - ( block_state >> state_shift );
ok_block_aligned = 0 == ( d & ( ( 1 << block_size_lg2 ) - 1 ) );
ok_block_aligned = 0 == ( d & ( ( 1UL << block_size_lg2 ) - 1 ) );
if ( ok_block_aligned ) {
@@ -773,31 +815,70 @@ public:
// mask into superblock and then shift down for block index
const uint32_t bit =
( d & ( ptrdiff_t( 1 << m_sb_size_lg2 ) - 1 ) ) >> block_size_lg2 ;
( d & ( ptrdiff_t( 1LU << m_sb_size_lg2 ) - 1 ) ) >> block_size_lg2 ;
const int result =
CB::release( sb_state_array , bit , block_state );
ok_dealloc_once = 0 <= result ;
// printf(" deallocate from sb_id(%d) result(%d) bit(%d) state(0x%x)\n"
// , sb_id
// , result
// , uint32_t(d >> block_size_lg2)
// , *sb_state_array );
#if 0
printf( " MemoryPool(0x%lx) pointer(0x%lx) deallocate sb_id(%d) block_size(%d) block_capacity(%d) block_id(%d) block_claimed(%d)\n"
, (uintptr_t)m_sb_state_array
, (uintptr_t)p
, sb_id
, (1u << block_size_lg2)
, (1u << (m_sb_size_lg2 - block_size_lg2))
, bit
, result );
#endif
}
}
if ( ! ok_contains || ! ok_block_aligned || ! ok_dealloc_once ) {
#if 0
printf("Kokkos MemoryPool deallocate(0x%lx) contains(%d) block_aligned(%d) dealloc_once(%d)\n",(uintptr_t)p,ok_contains,ok_block_aligned,ok_dealloc_once);
printf( " MemoryPool(0x%lx) pointer(0x%lx) deallocate ok_contains(%d) ok_block_aligned(%d) ok_dealloc_once(%d)\n"
, (uintptr_t)m_sb_state_array
, (uintptr_t)p
, int(ok_contains)
, int(ok_block_aligned)
, int(ok_dealloc_once) );
#endif
Kokkos::abort("Kokkos MemoryPool::deallocate given erroneous pointer");
}
}
// end deallocate
//--------------------------------------------------------------------------
KOKKOS_INLINE_FUNCTION
int number_of_superblocks() const noexcept { return m_sb_count ; }
KOKKOS_INLINE_FUNCTION
void superblock_state( int sb_id
, int & block_size
, int & block_count_capacity
, int & block_count_used ) const noexcept
{
block_size = 0 ;
block_count_capacity = 0 ;
block_count_used = 0 ;
if ( Kokkos::Impl::MemorySpaceAccess
< Kokkos::Impl::ActiveExecutionMemorySpace
, base_memory_space >::accessible ) {
// Can access the state array
const uint32_t state =
((uint32_t volatile *)m_sb_state_array)[sb_id*m_sb_state_size];
const uint32_t block_count_lg2 = state >> state_shift ;
const uint32_t block_used = state & state_used_mask ;
block_size = 1LU << ( m_sb_size_lg2 - block_count_lg2 );
block_count_capacity = 1LU << block_count_lg2 ;
block_count_used = block_used ;
}
}
};
} // namespace Kokkos
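The new `superblock_state` query unpacks one 32-bit state word into block size, capacity, and use count. A self-contained sketch of that decoding (the shift value and field layout are stand-ins, not the pool's actual constants):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout mirroring the decode in superblock_state:
// high bits = log2 of the superblock's block capacity,
// low bits  = number of blocks currently claimed.
constexpr uint32_t state_shift     = 26;
constexpr uint32_t state_used_mask = (1u << state_shift) - 1;

struct Decoded { uint32_t block_size, capacity, used; };

Decoded decode(uint32_t state, uint32_t sb_size_lg2)
{
  const uint32_t count_lg2 = state >> state_shift;
  return { 1u << (sb_size_lg2 - count_lg2),  // bytes per block
           1u << count_lg2,                  // blocks per superblock
           state & state_used_mask };        // blocks currently claimed
}
```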

View File

@@ -97,26 +97,22 @@ typedef Kokkos::MemoryTraits< Kokkos::Unmanaged | Kokkos::RandomAccess > MemoryR
namespace Kokkos {
namespace Impl {
static_assert(
( 0 < int(KOKKOS_MEMORY_ALIGNMENT) ) &&
( 0 == ( int(KOKKOS_MEMORY_ALIGNMENT) & (int(KOKKOS_MEMORY_ALIGNMENT)-1))) ,
"KOKKOS_MEMORY_ALIGNMENT must be a power of two" );
/** \brief Memory alignment settings
*
* Sets the global value for memory alignment. Must be a power of two!
* Enables compatibility of views from different devices with static stride.
* Use a compiler flag to override the default.
*/
enum { MEMORY_ALIGNMENT =
#if defined( KOKKOS_MEMORY_ALIGNMENT )
( 1 << Kokkos::Impl::integral_power_of_two( KOKKOS_MEMORY_ALIGNMENT ) )
#else
( 1 << Kokkos::Impl::integral_power_of_two( 128 ) )
#endif
#if defined( KOKKOS_MEMORY_ALIGNMENT_THRESHOLD )
enum : unsigned
{ MEMORY_ALIGNMENT = KOKKOS_MEMORY_ALIGNMENT
, MEMORY_ALIGNMENT_THRESHOLD = KOKKOS_MEMORY_ALIGNMENT_THRESHOLD
#else
, MEMORY_ALIGNMENT_THRESHOLD = 4
#endif
};
} //namespace Impl
} // namespace Kokkos
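The new `static_assert` guards the alignment with the standard bit trick for a power-of-two test: a value `x` is a positive power of two exactly when `x > 0` and `(x & (x - 1)) == 0`. A minimal illustration of the same predicate:

```cpp
#include <cassert>

// True exactly when x is a positive power of two -- the same predicate
// the KOKKOS_MEMORY_ALIGNMENT static_assert relies on. Subtracting 1
// flips the lowest set bit and everything below it, so the AND is zero
// only when there was a single set bit.
constexpr bool is_pow2(unsigned x) { return x > 0 && (x & (x - 1)) == 0; }
```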

View File

@@ -204,6 +204,7 @@ struct reduction_identity<double> {
KOKKOS_FORCEINLINE_FUNCTION constexpr static double min() {return DBL_MAX;}
};
#if !defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA )
template<>
struct reduction_identity<long double> {
KOKKOS_FORCEINLINE_FUNCTION constexpr static long double sum() {return static_cast<long double>(0.0);}
@@ -211,6 +212,7 @@ struct reduction_identity<long double> {
KOKKOS_FORCEINLINE_FUNCTION constexpr static long double max() {return -LDBL_MAX;}
KOKKOS_FORCEINLINE_FUNCTION constexpr static long double min() {return LDBL_MAX;}
};
#endif
}
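`reduction_identity` supplies the neutral starting value for each reduction kind; the `long double` specialization is now compiled out for CUDA because device code does not support `long double`. A small stand-alone analogue of the pattern for `double` (toy names, not the Kokkos traits):

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>

// Same idea as reduction_identity<double>: the identity is the value
// that leaves the reduction unchanged (0 for sum, -DBL_MAX for max,
// DBL_MAX for min).
struct identity {
  static constexpr double sum() { return 0.0; }
  static constexpr double max() { return -DBL_MAX; }
  static constexpr double min() { return  DBL_MAX; }
};

// A reduction seeded with the identity gives a well-defined answer
// even for an empty range.
double reduce_max(const double* v, int n)
{
  double r = identity::max();
  for (int i = 0; i < n; ++i) r = std::max(r, v[i]);
  return r;
}
```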

View File

@@ -78,7 +78,7 @@ struct pair
/// This calls the default constructors of T1 and T2. It won't
/// compile if those default constructors are not defined and
/// public.
KOKKOS_FORCEINLINE_FUNCTION constexpr
KOKKOS_FUNCTION_DEFAULTED constexpr
pair() = default ;
/// \brief Constructor that takes both elements of the pair.
@@ -458,7 +458,7 @@ struct pair<T1,void>
first_type first;
enum { second = 0 };
KOKKOS_FORCEINLINE_FUNCTION constexpr
KOKKOS_FUNCTION_DEFAULTED constexpr
pair() = default ;
KOKKOS_FORCEINLINE_FUNCTION constexpr

View File

@@ -241,7 +241,7 @@ void parallel_for( const std::string & str
std::cout << "KOKKOS_DEBUG Start parallel_for kernel: " << str << std::endl;
#endif
parallel_for(policy,functor,str);
::Kokkos::parallel_for(policy,functor,str);
#if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
Kokkos::fence();
@@ -487,7 +487,7 @@ void parallel_scan( const std::string& str
std::cout << "KOKKOS_DEBUG Start parallel_scan kernel: " << str << std::endl;
#endif
parallel_scan(policy,functor,str);
::Kokkos::parallel_scan(policy,functor,str);
#if KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES
Kokkos::fence();
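The change from `parallel_for(...)` to `::Kokkos::parallel_for(...)` (and likewise for `parallel_scan`) pins the forwarded call to the library's own overload instead of letting unqualified lookup, including argument-dependent lookup, pick up an unintended function. A minimal stand-alone reproduction of the hazard (toy names, not the Kokkos code):

```cpp
#include <cassert>

namespace lib {
  int lib_calls = 0;
  template <class T> void process(const T&) { ++lib_calls; }

  // Forwarding wrapper. Written as an unqualified call, process(v) would
  // let ADL find user::process for user-defined argument types; fully
  // qualifying ::lib::process makes the intended target explicit.
  template <class T> void process_labeled(const T& v, const char* /*label*/) {
    ::lib::process(v);
  }
}

namespace user {
  struct Vec {};
  int user_calls = 0;
  // A non-template exact match: ADL would prefer this over lib::process
  // if the wrapper's call were unqualified.
  void process(const Vec&) { ++user_calls; }
}
```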

View File

@@ -0,0 +1,111 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact H. Carter Edwards (hcedwar@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOSP_PROFILE_SECTION_HPP
#define KOKKOSP_PROFILE_SECTION_HPP
#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_Profiling_Interface.hpp>
#include <string>
namespace Kokkos {
namespace Profiling {
class ProfilingSection {
public:
ProfilingSection(const std::string& sectionName) :
secName(sectionName) {
#if defined( KOKKOS_ENABLE_PROFILING )
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::createProfileSection(secName, &secID);
}
#else
secID = 0;
#endif
}
void start() {
#if defined( KOKKOS_ENABLE_PROFILING )
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::startSection(secID);
}
#endif
}
void stop() {
#if defined( KOKKOS_ENABLE_PROFILING )
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::stopSection(secID);
}
#endif
}
~ProfilingSection() {
#if defined( KOKKOS_ENABLE_PROFILING )
if(Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::destroyProfileSection(secID);
}
#endif
}
std::string getName() {
return secName;
}
uint32_t getSectionID() {
return secID;
}
protected:
const std::string secName;
uint32_t secID;
};
}
}
#endif
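`ProfilingSection` is constructed with a name, can be started and stopped repeatedly, and unregisters itself on destruction; every hook is a no-op when no profiling library is loaded. Since the real class needs a Kokkos build, here is a toy stand-in with the same `start`/`stop`/`getName` shape to show the intended call pattern (the event counter is purely illustrative):

```cpp
#include <cassert>
#include <string>

// Toy stand-in for Kokkos::Profiling::ProfilingSection: same usage
// shape, no profiling backend. Counts start/stop events so the call
// pattern can be checked.
class Section {
public:
  explicit Section(const std::string& name) : name_(name), events_(0) {}
  void start() { ++events_; }  // real class: startSection(secID)
  void stop()  { ++events_; }  // real class: stopSection(secID)
  std::string getName() const { return name_; }
  int events() const { return events_; }
private:
  const std::string name_;
  int events_;
};
```

Typical use mirrors the real API: construct once, then bracket each timed region with `start()`/`stop()` before the section object goes out of scope.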

View File

@@ -204,8 +204,8 @@ struct VerifyExecutionCanAccessMemorySpace
>
{
enum { value = false };
inline static void verify( void ) { Experimental::ROCmSpace::access_error(); }
inline static void verify( const void * p ) { Experimental::ROCmSpace::access_error(p); }
inline static void verify( void ) { Kokkos::Experimental::ROCmSpace::access_error(); }
inline static void verify( const void * p ) { Kokkos::Experimental::ROCmSpace::access_error(p); }
};
} // namespace Experimental
} // namespace Kokkos

View File

@@ -145,7 +145,7 @@ public:
unsigned use_cores_per_numa = 0 ,
bool allow_asynchronous_threadpool = false);
static int is_initialized();
static bool is_initialized();
/** \brief Return the maximum amount of concurrency. */
static int concurrency() {return 1;};
@@ -424,11 +424,13 @@ private:
typedef typename Policy::work_tag WorkTag ;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef FunctorAnalysis< FunctorPatternInterface::REDUCE , Policy , FunctorType > Analysis ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
@@ -488,7 +490,7 @@ public:
this-> template exec< WorkTag >( update );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::
final( ReducerConditional::select(m_functor , m_reducer) , ptr );
}
@@ -619,16 +621,16 @@ namespace Impl {
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType ,
Kokkos::Experimental::MDRangePolicy< Traits ... > ,
Kokkos::MDRangePolicy< Traits ... > ,
Kokkos::Serial
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
typedef typename Kokkos::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
const FunctorType m_functor ;
const MDRangePolicy m_mdr_policy ;
@@ -661,32 +663,33 @@ public:
template< class FunctorType , class ReducerType , class ... Traits >
class ParallelReduce< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, ReducerType
, Kokkos::Serial
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename MDRangePolicy::work_tag WorkTag ;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef typename ReducerTypeFwd::value_type ValueType;
typedef FunctorAnalysis< FunctorPatternInterface::REDUCE , Policy , FunctorType > Analysis ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
using iterate_type = typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy
using iterate_type = typename Kokkos::Impl::HostIterateTile< MDRangePolicy
, FunctorType
, WorkTag
, ValueType
@@ -735,7 +738,7 @@ public:
this-> exec( update );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::
final( ReducerConditional::select(m_functor , m_reducer) , ptr );
}
@@ -878,8 +881,9 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
@@ -940,7 +944,7 @@ public:
this-> template exec< WorkTag >( data , update );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::
final( ReducerConditional::select(m_functor , m_reducer) , ptr );
}

View File

@@ -408,7 +408,7 @@ view_alloc( Args const & ... args )
}
template< class ... Args >
inline
KOKKOS_INLINE_FUNCTION
Impl::ViewCtorProp< typename Impl::ViewCtorProp< void , Args >::type ... >
view_wrap( Args const & ... args )
{
@@ -1216,6 +1216,13 @@ public:
m_track.assign_allocated_record_to_uninitialized( record );
}
KOKKOS_INLINE_FUNCTION
void assign_data( pointer_type arg_data )
{
m_track.clear();
m_map.assign_data( arg_data );
}
// Wrap memory according to properties and array layout
template< class ... P >
explicit KOKKOS_INLINE_FUNCTION
@@ -2235,6 +2242,29 @@ create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
return typename Impl::MirrorViewType<Space,T,P ...>::view_type(src.label(),src.layout());
}
// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}
// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout());
deep_copy(mirror, src);
return mirror;
}
} /* namespace Kokkos */
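The new `create_mirror_view_and_copy` dispatches on `is_same_memspace` via `enable_if`: in the same memory space it returns the source view unchanged (no allocation, no copy), while for a different space it allocates a mirror and deep-copies into it. A small stand-alone model of that overload dispatch (toy `View` and space types, not the Kokkos classes; the `copies` counter stands in for `deep_copy`):

```cpp
#include <cassert>
#include <type_traits>

// Toy "view" tagged with a memory space.
template <class Space> struct View { int copies = 0; };

struct HostSpace {};
struct DeviceSpace {};

// Same space: hand the source straight back.
template <class Space>
View<Space> mirror_and_copy(const Space&, const View<Space>& src) {
  return src;
}

// Different space: "allocate" in the destination space and copy.
// SFINAE removes this overload when the spaces match.
template <class DstSpace, class SrcSpace,
          class = std::enable_if_t<!std::is_same<DstSpace, SrcSpace>::value>>
View<DstSpace> mirror_and_copy(const DstSpace&, const View<SrcSpace>& src) {
  View<DstSpace> dst;
  dst.copies = src.copies + 1;  // model the deep_copy
  return dst;
}
```

The payoff of the same-space specialization is that generic code can call the function unconditionally without paying for a redundant allocation and copy on the host.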
//----------------------------------------------------------------------------
@@ -2432,6 +2462,7 @@ struct CommonViewAllocProp< void, ValueType >
using scalar_array_type = ValueType;
template < class ... Views >
KOKKOS_INLINE_FUNCTION
CommonViewAllocProp( const Views & ... ) {}
};
@@ -2499,6 +2530,7 @@ using DeducedCommonPropsType = typename Impl::DeduceCommonViewAllocProp<Views...
// User function
template < class ... Views >
KOKKOS_INLINE_FUNCTION
DeducedCommonPropsType<Views...>
common_view_alloc_prop( Views const & ... views )
{

View File

@@ -46,205 +46,198 @@
namespace Kokkos {
namespace Impl {
namespace Experimental {
template< class functor_type , class execution_space, class ... policy_args >
class WorkGraphExec;
}}} // namespace Kokkos::Impl::Experimental
}} // namespace Kokkos::Impl
namespace Kokkos {
namespace Experimental {
template< class ... Properties >
class WorkGraphPolicy
{
public:
using self_type = WorkGraphPolicy<Properties ... >;
using traits = Kokkos::Impl::PolicyTraits<Properties ... >;
using index_type = typename traits::index_type;
using self_type = WorkGraphPolicy<Properties ... >;
using traits = Kokkos::Impl::PolicyTraits<Properties ... >;
using index_type = typename traits::index_type;
using member_type = index_type;
using work_tag = typename traits::work_tag;
using execution_space = typename traits::execution_space;
using work_tag = typename traits::work_tag;
using memory_space = typename execution_space::memory_space;
using graph_type = Kokkos::Experimental::Crs<index_type, execution_space, void, index_type>;
using member_type = index_type;
using memory_space = typename execution_space::memory_space;
using graph_type = Kokkos::Crs<index_type,execution_space,void,index_type>;
enum : std::int32_t {
END_TOKEN = -1 ,
BEGIN_TOKEN = -2 ,
COMPLETED_TOKEN = -3 };
private:
graph_type m_graph;
using ints_type = Kokkos::View<std::int32_t*, memory_space>;
using range_type = Kokkos::pair<std::int32_t, std::int32_t>;
using ranges_type = Kokkos::View<range_type*, memory_space>;
const std::int32_t m_total_work;
ints_type m_counts;
ints_type m_queue;
ranges_type m_ranges;
// Let N = m_graph.numRows(), the total work
// m_queue[ 0 .. N-1] = the ready queue
// m_queue[ N .. 2*N-1] = the waiting queue counts
// m_queue[2*N .. 2*N+1] = the ready queue hints
graph_type const m_graph;
ints_type m_queue ;
KOKKOS_INLINE_FUNCTION
void push_work( const std::int32_t w ) const noexcept
{
const std::int32_t N = m_graph.numRows();
std::int32_t volatile * const ready_queue = & m_queue[0] ;
std::int32_t volatile * const end_hint = & m_queue[2*N+1] ;
// Push work to end of queue
const std::int32_t j = atomic_fetch_add( end_hint , 1 );
if ( ( N <= j ) ||
( END_TOKEN != atomic_exchange(ready_queue+j,w) ) ) {
// ERROR: past the end of queue or did not replace END_TOKEN
Kokkos::abort("WorkGraphPolicy push_work error");
}
memory_fence();
}
public:
struct TagZeroRanges {};
/**\brief Attempt to pop the work item at the head of the queue.
*
* Find entry 'i' such that
* ( m_queue[i] != BEGIN_TOKEN ) AND
* ( i == 0 OR m_queue[i-1] == BEGIN_TOKEN )
* if found then
* increment begin hint
* return atomic_exchange( m_queue[i] , BEGIN_TOKEN )
* else if i < total work
* return END_TOKEN
* else
* return COMPLETED_TOKEN
*
*/
KOKKOS_INLINE_FUNCTION
void operator()(TagZeroRanges, std::int32_t i) const {
m_ranges[i] = range_type(0, 0);
}
void zero_ranges() {
using policy_type = RangePolicy<std::int32_t, execution_space, TagZeroRanges>;
using closure_type = Kokkos::Impl::ParallelFor<self_type, policy_type>;
const closure_type closure(*this, policy_type(0, 1));
closure.execute();
execution_space::fence();
}
std::int32_t pop_work() const noexcept
{
const std::int32_t N = m_graph.numRows();
struct TagFillQueue {};
KOKKOS_INLINE_FUNCTION
void operator()(TagFillQueue, std::int32_t i) const {
if (*((volatile std::int32_t*)(&m_counts(i))) == 0) push_work(i);
}
void fill_queue() {
using policy_type = RangePolicy<std::int32_t, execution_space, TagFillQueue>;
using closure_type = Kokkos::Impl::ParallelFor<self_type, policy_type>;
const closure_type closure(*this, policy_type(0, m_total_work));
closure.execute();
execution_space::fence();
}
std::int32_t volatile * const ready_queue = & m_queue[0] ;
std::int32_t volatile * const begin_hint = & m_queue[2*N] ;
private:
// begin hint is guaranteed to be less than or equal to
// actual begin location in the queue.
inline
void setup() {
if (m_graph.numRows() > std::numeric_limits<std::int32_t>::max()) {
Kokkos::abort("WorkGraphPolicy work must be indexable using int32_t");
}
get_crs_transpose_counts(m_counts, m_graph);
m_queue = ints_type(ViewAllocateWithoutInitializing("queue"), m_total_work);
deep_copy(m_queue, std::int32_t(-1));
m_ranges = ranges_type("ranges", 1);
fill_queue();
}
for ( std::int32_t i = *begin_hint ; i < N ; ++i ) {
KOKKOS_INLINE_FUNCTION
std::int32_t pop_work() const {
range_type w(-1,-1);
while (true) {
const range_type w_new( w.first + 1 , w.second );
w = atomic_compare_exchange( &m_ranges(0) , w , w_new );
if ( w.first < w.second ) { // there was work in the queue
if ( w_new.first == w.first + 1 && w_new.second == w.second ) {
// we got a work item
std::int32_t i;
// the push_work function may have incremented the end counter
// but not yet written the work index into the queue.
// wait until the entry is valid.
while ( -1 == ( i = *((volatile std::int32_t*)(&m_queue( w.first ))) ) );
return i;
} // we got a work item
} else { // there was no work in the queue
#ifdef KOKKOS_DEBUG
if ( w_new.first == w.first + 1 && w_new.second == w.second ) {
Kokkos::abort("bug in pop_work");
const std::int32_t w = ready_queue[i] ;
if ( w == END_TOKEN ) { return END_TOKEN ; }
if ( ( w != BEGIN_TOKEN ) &&
( w == atomic_compare_exchange(ready_queue+i,w,BEGIN_TOKEN) ) ) {
// Attempt to claim ready work index succeeded,
// update the hint and return work index
atomic_increment( begin_hint );
return w ;
}
#endif
if (w.first == m_total_work) { // all work is done
return -1;
} else { // need to wait for more work to be pushed
// take a guess that one work item will be pushed
// the key thing is we can't leave (w) alone, because
// otherwise the next compare_exchange may succeed in
// popping work from an empty queue
w.second++;
}
} // there was no work in the queue
} // while (true)
}
// arrive here when ready_queue[i] == BEGIN_TOKEN
}
return COMPLETED_TOKEN ;
}
KOKKOS_INLINE_FUNCTION
void push_work(std::int32_t i) const {
range_type w(-1,-1);
while (true) {
const range_type w_new( w.first , w.second + 1 );
// try to increment the end counter
w = atomic_compare_exchange( &m_ranges(0) , w , w_new );
// stop trying if the increment was successful
if ( w.first == w_new.first && w.second + 1 == w_new.second ) break;
void completed_work( std::int32_t w ) const noexcept
{
Kokkos::memory_fence();
// Make sure the completed work function's memory accesses are flushed.
const std::int32_t N = m_graph.numRows();
std::int32_t volatile * const count_queue = & m_queue[N] ;
const std::int32_t B = m_graph.row_map(w);
const std::int32_t E = m_graph.row_map(w+1);
for ( std::int32_t i = B ; i < E ; ++i ) {
const std::int32_t j = m_graph.entries(i);
if ( 1 == atomic_fetch_add(count_queue+j,-1) ) {
push_work(j);
}
}
}
// write the work index into the claimed spot in the queue
*((volatile std::int32_t*)(&m_queue( w.second ))) = i;
// push this write out into the memory system
memory_fence();
}
template< class functor_type , class execution_space, class ... policy_args >
friend class Kokkos::Impl::Experimental::WorkGraphExec;
struct TagInit {};
struct TagCount {};
struct TagReady {};
public:
/**\brief Initialize queue
*
* m_queue[0..N-1] = END_TOKEN, the ready queue
* m_queue[N..2*N-1] = 0, the waiting count queue
* m_queue[2*N..2*N+1] = 0, begin/end hints for ready queue
*/
KOKKOS_INLINE_FUNCTION
void operator()( const TagInit , int i ) const noexcept
{ m_queue[i] = i < m_graph.numRows() ? END_TOKEN : 0 ; }
WorkGraphPolicy(graph_type arg_graph)
KOKKOS_INLINE_FUNCTION
void operator()( const TagCount , int i ) const noexcept
{
std::int32_t volatile * const count_queue =
& m_queue[ m_graph.numRows() ] ;
atomic_increment( count_queue + m_graph.entries[i] );
}
KOKKOS_INLINE_FUNCTION
void operator()( const TagReady , int w ) const noexcept
{
std::int32_t const * const count_queue =
& m_queue[ m_graph.numRows() ] ;
if ( 0 == count_queue[w] ) push_work(w);
}
WorkGraphPolicy( const graph_type & arg_graph )
: m_graph(arg_graph)
, m_total_work( arg_graph.numRows() )
, m_queue( view_alloc( "queue" , WithoutInitializing )
, arg_graph.numRows() * 2 + 2 )
{
setup();
}
{ // Initialize
using policy_type = RangePolicy<std::int32_t, execution_space, TagInit>;
using closure_type = Kokkos::Impl::ParallelFor<self_type, policy_type>;
const closure_type closure(*this, policy_type(0, m_queue.size()));
closure.execute();
execution_space::fence();
}
};
{ // execute-after counts
using policy_type = RangePolicy<std::int32_t, execution_space, TagCount>;
using closure_type = Kokkos::Impl::ParallelFor<self_type, policy_type>;
const closure_type closure(*this,policy_type(0,m_graph.entries.size()));
closure.execute();
execution_space::fence();
}
}} // namespace Kokkos::Experimental
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
namespace Kokkos {
namespace Impl {
namespace Experimental {
template< class functor_type , class execution_space, class ... policy_args >
class WorkGraphExec
{
public:
using self_type = WorkGraphExec< functor_type, execution_space, policy_args ... >;
using policy_type = Kokkos::Experimental::WorkGraphPolicy< policy_args ... >;
using member_type = typename policy_type::member_type;
using memory_space = typename execution_space::memory_space;
protected:
const functor_type m_functor;
const policy_type m_policy;
protected:
KOKKOS_INLINE_FUNCTION
std::int32_t before_work() const {
return m_policy.pop_work();
}
KOKKOS_INLINE_FUNCTION
void after_work(std::int32_t i) const {
/* fence any writes that were done by the work item itself
(usually writing its result to global memory) */
memory_fence();
const std::int32_t begin = m_policy.m_graph.row_map( i );
const std::int32_t end = m_policy.m_graph.row_map( i + 1 );
for (std::int32_t j = begin; j < end; ++j) {
const std::int32_t next = m_policy.m_graph.entries( j );
const std::int32_t old_count = atomic_fetch_add( &(m_policy.m_counts(next)), -1 );
if ( old_count == 1 ) m_policy.push_work( next );
{ // Scheduling ready tasks
using policy_type = RangePolicy<std::int32_t, execution_space, TagReady>;
using closure_type = Kokkos::Impl::ParallelFor<self_type, policy_type>;
const closure_type closure(*this,policy_type(0,m_graph.numRows()));
closure.execute();
execution_space::fence();
}
}
inline
WorkGraphExec( const functor_type & arg_functor
, const policy_type & arg_policy )
: m_functor( arg_functor )
, m_policy( arg_policy )
{
}
};
}}} // namespace Kokkos::Impl::Experimental
} // namespace Kokkos
#ifdef KOKKOS_ENABLE_SERIAL
#include "impl/Kokkos_Serial_WorkGraphPolicy.hpp"


@ -5,51 +5,44 @@ endif
PREFIX ?= /usr/local/lib/kokkos
default: messages build-lib
echo "End Build"
default: build-lib
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = $(KOKKOS_PATH)/bin/nvcc_wrapper
CXX ?= $(KOKKOS_PATH)/bin/nvcc_wrapper
else
CXX = g++
CXX ?= g++
endif
CXXFLAGS = -O3
CXXFLAGS ?= -O3
LINK ?= $(CXX)
LDFLAGS ?=
include $(KOKKOS_PATH)/Makefile.kokkos
PWD = $(shell pwd)
KOKKOS_HEADERS_INCLUDE = $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
KOKKOS_HEADERS_INCLUDE_IMPL = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
KOKKOS_HEADERS_INCLUDE_IMPL += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp)
KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/algorithms/src/*.hpp)
include $(KOKKOS_PATH)/core/src/Makefile.generate_header_lists
include $(KOKKOS_PATH)/core/src/Makefile.generate_build_files
CONDITIONAL_COPIES =
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
KOKKOS_HEADERS_CUDA += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
CONDITIONAL_COPIES += copy-cuda
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
KOKKOS_HEADERS_THREADS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp)
CONDITIONAL_COPIES += copy-threads
endif
ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
KOKKOS_HEADERS_QTHREADS += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.hpp)
CONDITIONAL_COPIES += copy-qthreads
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
KOKKOS_HEADERS_OPENMP += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp)
CONDITIONAL_COPIES += copy-openmp
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
CONDITIONAL_COPIES += copy-rocm
endif
ifeq ($(KOKKOS_OS),CYGWIN)
COPY_FLAG = -u
endif
@ -66,104 +59,7 @@ else
KOKKOS_DEBUG_CMAKE = ON
endif
messages:
echo "Start Build"
build-makefile-kokkos:
rm -f Makefile.kokkos
echo "#Global Settings used to generate this library" >> Makefile.kokkos
echo "KOKKOS_PATH = $(PREFIX)" >> Makefile.kokkos
echo "KOKKOS_DEVICES = $(KOKKOS_DEVICES)" >> Makefile.kokkos
echo "KOKKOS_ARCH = $(KOKKOS_ARCH)" >> Makefile.kokkos
echo "KOKKOS_DEBUG = $(KOKKOS_DEBUG)" >> Makefile.kokkos
echo "KOKKOS_USE_TPLS = $(KOKKOS_USE_TPLS)" >> Makefile.kokkos
echo "KOKKOS_CXX_STANDARD = $(KOKKOS_CXX_STANDARD)" >> Makefile.kokkos
echo "KOKKOS_OPTIONS = $(KOKKOS_OPTIONS)" >> Makefile.kokkos
echo "KOKKOS_CUDA_OPTIONS = $(KOKKOS_CUDA_OPTIONS)" >> Makefile.kokkos
echo "CXX ?= $(CXX)" >> Makefile.kokkos
echo "NVCC_WRAPPER ?= $(PREFIX)/bin/nvcc_wrapper" >> Makefile.kokkos
echo "" >> Makefile.kokkos
echo "#Source and Header files of Kokkos relative to KOKKOS_PATH" >> Makefile.kokkos
echo "KOKKOS_HEADERS = $(KOKKOS_HEADERS)" >> Makefile.kokkos
echo "KOKKOS_SRC = $(KOKKOS_SRC)" >> Makefile.kokkos
echo "" >> Makefile.kokkos
echo "#Variables used in application Makefiles" >> Makefile.kokkos
echo "KOKKOS_OS = $(KOKKOS_OS)" >> Makefile.kokkos
echo "KOKKOS_CPP_DEPENDS = $(KOKKOS_CPP_DEPENDS)" >> Makefile.kokkos
echo "KOKKOS_CXXFLAGS = $(KOKKOS_CXXFLAGS)" >> Makefile.kokkos
echo "KOKKOS_CPPFLAGS = $(KOKKOS_CPPFLAGS)" >> Makefile.kokkos
echo "KOKKOS_LINK_DEPENDS = $(KOKKOS_LINK_DEPENDS)" >> Makefile.kokkos
echo "KOKKOS_LIBS = $(KOKKOS_LIBS)" >> Makefile.kokkos
echo "KOKKOS_LDFLAGS = $(KOKKOS_LDFLAGS)" >> Makefile.kokkos
echo "" >> Makefile.kokkos
echo "#Internal settings which need to be propagated for Kokkos examples" >> Makefile.kokkos
echo "KOKKOS_INTERNAL_USE_CUDA = ${KOKKOS_INTERNAL_USE_CUDA}" >> Makefile.kokkos
echo "KOKKOS_INTERNAL_USE_QTHREADS = ${KOKKOS_INTERNAL_USE_QTHREADS}" >> Makefile.kokkos
echo "KOKKOS_INTERNAL_USE_OPENMP = ${KOKKOS_INTERNAL_USE_OPENMP}" >> Makefile.kokkos
echo "KOKKOS_INTERNAL_USE_PTHREADS = ${KOKKOS_INTERNAL_USE_PTHREADS}" >> Makefile.kokkos
echo "" >> Makefile.kokkos
echo "#Fake kokkos-clean target" >> Makefile.kokkos
echo "kokkos-clean:" >> Makefile.kokkos
echo "" >> Makefile.kokkos
sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= KokkosCore_config.h|= $(PREFIX)/include/KokkosCore_config.h|g' Makefile.kokkos \
> Makefile.kokkos.tmp
mv -f Makefile.kokkos.tmp Makefile.kokkos
build-cmake-kokkos:
rm -f kokkos.cmake
echo "#Global Settings used to generate this library" >> kokkos.cmake
echo "set(KOKKOS_PATH $(PREFIX) CACHE PATH \"Kokkos installation path\")" >> kokkos.cmake
echo "set(KOKKOS_DEVICES $(KOKKOS_DEVICES) CACHE STRING \"Kokkos devices list\")" >> kokkos.cmake
echo "set(KOKKOS_ARCH $(KOKKOS_ARCH) CACHE STRING \"Kokkos architecture flags\")" >> kokkos.cmake
echo "set(KOKKOS_DEBUG $(KOKKOS_DEBUG_CMAKE) CACHE BOOL \"Kokkos debug enabled?\")" >> kokkos.cmake
echo "set(KOKKOS_USE_TPLS $(KOKKOS_USE_TPLS) CACHE STRING \"Kokkos third-party libraries list\")" >> kokkos.cmake
echo "set(KOKKOS_CXX_STANDARD $(KOKKOS_CXX_STANDARD) CACHE STRING \"Kokkos C++ standard\")" >> kokkos.cmake
echo "set(KOKKOS_OPTIONS $(KOKKOS_OPTIONS) CACHE STRING \"Kokkos options\")" >> kokkos.cmake
echo "set(KOKKOS_CUDA_OPTIONS $(KOKKOS_CUDA_OPTIONS) CACHE STRING \"Kokkos Cuda options\")" >> kokkos.cmake
echo "if(NOT $ENV{CXX})" >> kokkos.cmake
echo ' message(WARNING "You are currently using compiler $${CMAKE_CXX_COMPILER} while Kokkos was built with $(CXX) ; make sure this is the behavior you intended.")' >> kokkos.cmake
echo "endif()" >> kokkos.cmake
echo "if(NOT DEFINED ENV{NVCC_WRAPPER})" >> kokkos.cmake
echo " set(NVCC_WRAPPER \"$(NVCC_WRAPPER)\" CACHE FILEPATH \"Path to command nvcc_wrapper\")" >> kokkos.cmake
echo "else()" >> kokkos.cmake
echo ' set(NVCC_WRAPPER $$ENV{NVCC_WRAPPER} CACHE FILEPATH "Path to command nvcc_wrapper")' >> kokkos.cmake
echo "endif()" >> kokkos.cmake
echo "" >> kokkos.cmake
echo "#Source and Header files of Kokkos relative to KOKKOS_PATH" >> kokkos.cmake
echo "set(KOKKOS_HEADERS \"$(KOKKOS_HEADERS)\" CACHE STRING \"Kokkos headers list\")" >> kokkos.cmake
echo "set(KOKKOS_SRC \"$(KOKKOS_SRC)\" CACHE STRING \"Kokkos source list\")" >> kokkos.cmake
echo "" >> kokkos.cmake
echo "#Variables used in application Makefiles" >> kokkos.cmake
echo "set(KOKKOS_CPP_DEPENDS \"$(KOKKOS_CPP_DEPENDS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_CXXFLAGS \"$(KOKKOS_CXXFLAGS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_CPPFLAGS \"$(KOKKOS_CPPFLAGS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_LINK_DEPENDS \"$(KOKKOS_LINK_DEPENDS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_LIBS \"$(KOKKOS_LIBS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_LDFLAGS \"$(KOKKOS_LDFLAGS)\" CACHE STRING \"\")" >> kokkos.cmake
echo "" >> kokkos.cmake
echo "#Internal settings which need to be propagated for Kokkos examples" >> kokkos.cmake
echo "set(KOKKOS_INTERNAL_USE_CUDA \"${KOKKOS_INTERNAL_USE_CUDA}\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_INTERNAL_USE_OPENMP \"${KOKKOS_INTERNAL_USE_OPENMP}\" CACHE STRING \"\")" >> kokkos.cmake
echo "set(KOKKOS_INTERNAL_USE_PTHREADS \"${KOKKOS_INTERNAL_USE_PTHREADS}\" CACHE STRING \"\")" >> kokkos.cmake
echo "mark_as_advanced(KOKKOS_HEADERS KOKKOS_SRC KOKKOS_INTERNAL_USE_CUDA KOKKOS_INTERNAL_USE_OPENMP KOKKOS_INTERNAL_USE_PTHREADS)" >> kokkos.cmake
echo "" >> kokkos.cmake
sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= KokkosCore_config.h|= $(PREFIX)/include/KokkosCore_config.h|g' kokkos.cmake \
> kokkos.cmake.tmp
mv -f kokkos.cmake.tmp kokkos.cmake
build-lib: build-makefile-kokkos build-cmake-kokkos $(KOKKOS_LINK_DEPENDS)
build-lib: $(KOKKOS_LINK_DEPENDS)
mkdir:
mkdir -p $(PREFIX)
@ -188,14 +84,18 @@ copy-openmp: mkdir
mkdir -p $(PREFIX)/include/OpenMP
cp $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
install: mkdir $(CONDITIONAL_COPIES) build-lib
copy-rocm: mkdir
mkdir -p $(PREFIX)/include/ROCm
cp $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm
install: mkdir $(CONDITIONAL_COPIES) build-lib generate_build_settings
cp $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
cp $(COPY_FLAG) Makefile.kokkos $(PREFIX)
cp $(COPY_FLAG) kokkos.cmake $(PREFIX)
cp $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
cp $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
cp $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
cp $(COPY_FLAG) KokkosCore_config.h $(PREFIX)/include
cp $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include
clean: kokkos-clean
rm -f Makefile.kokkos
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE)


@ -0,0 +1,100 @@
# This file is responsible for generating the files used by the build
# systems (make and cmake) in scenarios where the Kokkos library gets
# installed before building the application.
# These files are generated by this makefile:
KOKKOS_MAKEFILE=Makefile.kokkos
KOKKOS_CMAKEFILE=kokkos_generated_settings.cmake
ifeq ($(KOKKOS_DEBUG),"no")
KOKKOS_DEBUG_CMAKE = OFF
else
KOKKOS_DEBUG_CMAKE = ON
endif
# Functions for generating the makefile and cmake file.
# When calling these routines, do not put a space after the comma,
# e.g., $(call kokkos_append_var,KOKKOS_PATH,$(PREFIX))
kokkos_append_makefile = echo $1 >> $(KOKKOS_MAKEFILE)
kokkos_append_cmakefile = echo $1 >> $(KOKKOS_CMAKEFILE)
kokkos_setvar_cmakefile = echo set\($1 $2\) >> $(KOKKOS_CMAKEFILE)
kokkos_setlist_cmakefile = echo set\($1 \"$2\"\) >> $(KOKKOS_CMAKEFILE)
kokkos_appendvar_makefile = echo $1 = $($(1)) >> $(KOKKOS_MAKEFILE)
kokkos_appendvar2_makefile = echo $1 ?= $($(1)) >> $(KOKKOS_MAKEFILE)
kokkos_appendvar_cmakefile = echo set\($1 $($(1)) CACHE $2 FORCE\) >> $(KOKKOS_CMAKEFILE)
kokkos_appendval_makefile = echo $1 = $2 >> $(KOKKOS_MAKEFILE)
kokkos_appendval_cmakefile = echo set\($1 $2 CACHE $3 FORCE\) >> $(KOKKOS_CMAKEFILE)
kokkos_append_string = $(call kokkos_append_makefile,$1); $(call kokkos_append_cmakefile,$1)
kokkos_append_var = $(call kokkos_appendvar_makefile,$1); $(call kokkos_appendvar_cmakefile,$1,$2)
kokkos_append_var2 = $(call kokkos_appendvar2_makefile,$1); $(call kokkos_appendvar_cmakefile,$1,$2)
kokkos_append_varval = $(call kokkos_appendval_makefile,$1,$2); $(call kokkos_appendval_cmakefile,$1,$2,$3)
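The `$(call ...)` append helpers above can be exercised with a toy makefile. This is a hypothetical demo (names and paths are made up; `.RECIPEPREFIX` requires GNU make 3.82+ and is used here only to avoid literal tabs):

```shell
cat > /tmp/demo.mk <<'EOF'
.RECIPEPREFIX = >
OUT = /tmp/generated.mk
# Same pattern as kokkos_appendvar_makefile: $($(1)) expands the
# variable whose name is passed as $1.
append_var = echo $1 = $($(1)) >> $(OUT)
KOKKOS_DEVICES = OpenMP
all:
> @rm -f $(OUT)
> @$(call append_var,KOKKOS_DEVICES)
> @cat $(OUT)
EOF
make -f /tmp/demo.mk
# prints: KOKKOS_DEVICES = OpenMP
```

The indirection `$($(1))` is what lets one helper serialize any variable by name, which is how the real targets below emit each `KOKKOS_*` setting.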
generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@rm -f $(KOKKOS_MAKEFILE)
@rm -f $(KOKKOS_CMAKEFILE)
@$(call kokkos_append_string, "#Global Settings used to generate this library")
@$(call kokkos_append_varval,KOKKOS_PATH,$(KOKKOS_INSTALL_PATH),'FILEPATH "Kokkos installation path"')
@$(call kokkos_append_var,KOKKOS_DEVICES,'STRING "Kokkos devices list"')
@$(call kokkos_append_var,KOKKOS_ARCH,'STRING "Kokkos architecture flags"')
@$(call kokkos_appendvar_makefile,KOKKOS_DEBUG)
@$(call kokkos_appendvar_cmakefile,KOKKOS_DEBUG_CMAKE,'BOOL "Kokkos debug enabled?"')
@$(call kokkos_append_var,KOKKOS_USE_TPLS,'STRING "Kokkos third-party libraries list"')
@$(call kokkos_append_var,KOKKOS_CXX_STANDARD,'STRING "Kokkos C++ standard"')
@$(call kokkos_append_var,KOKKOS_OPTIONS,'STRING "Kokkos options"')
@$(call kokkos_append_var,KOKKOS_CUDA_OPTIONS,'STRING "Kokkos Cuda options"')
@$(call kokkos_append_var2,CXX,'STRING "Kokkos C++ Compiler"')
@$(call kokkos_append_cmakefile,"if(NOT DEFINED ENV{NVCC_WRAPPER})")
@$(call kokkos_append_var2,NVCC_WRAPPER,'FILEPATH "Path to command nvcc_wrapper"')
@$(call kokkos_append_cmakefile,"else()")
@$(call kokkos_append_cmakefile,' set(NVCC_WRAPPER $$ENV{NVCC_WRAPPER} CACHE FILEPATH "Path to command nvcc_wrapper")')
@$(call kokkos_append_cmakefile,"endif()")
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Source and Header files of Kokkos relative to KOKKOS_PATH")
@$(call kokkos_append_var,KOKKOS_HEADERS,'STRING "Kokkos headers list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_IMPL,'STRING "Kokkos headers impl list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_CUDA,'STRING "Kokkos headers Cuda list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_OPENMP,'STRING "Kokkos headers OpenMP list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_ROCM,'STRING "Kokkos headers ROCm list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_THREADS,'STRING "Kokkos headers Threads list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_QTHREADS,'STRING "Kokkos headers QThreads list"')
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Variables used in application Makefiles")
@$(call kokkos_append_var,KOKKOS_OS,'STRING ""') # This was not in original cmake gen
@$(call kokkos_append_var,KOKKOS_CPP_DEPENDS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_LINK_DEPENDS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_CXXFLAGS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_CPPFLAGS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_LDFLAGS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_LIBS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_EXTRA_LIBS,'STRING ""')
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Internal settings which need to be propagated for Kokkos examples")
@$(call kokkos_append_var,KOKKOS_INTERNAL_USE_CUDA,'STRING ""')
@$(call kokkos_append_var,KOKKOS_INTERNAL_USE_OPENMP,'STRING ""')
@$(call kokkos_append_var,KOKKOS_INTERNAL_USE_PTHREADS,'STRING ""')
@$(call kokkos_append_var,KOKKOS_INTERNAL_USE_ROCM,'STRING ""')
@$(call kokkos_append_var,KOKKOS_INTERNAL_USE_QTHREADS,'STRING ""') # Not in original cmake gen
@$(call kokkos_append_cmakefile,"mark_as_advanced(KOKKOS_HEADERS KOKKOS_SRC KOKKOS_INTERNAL_USE_CUDA KOKKOS_INTERNAL_USE_OPENMP KOKKOS_INTERNAL_USE_PTHREADS)")
@$(call kokkos_append_makefile,"")
@$(call kokkos_append_makefile,"#Fake kokkos-clean target")
@$(call kokkos_append_makefile,"kokkos-clean:")
@$(call kokkos_append_makefile,"")
@sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $(KOKKOS_MAKEFILE) \
> $(KOKKOS_MAKEFILE).tmp
@mv -f $(KOKKOS_MAKEFILE).tmp $(KOKKOS_MAKEFILE)
@$(call kokkos_setvar_cmakefile,KOKKOS_CXX_FLAGS,$(KOKKOS_CXXFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_CPP_FLAGS,$(KOKKOS_CPPFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LD_FLAGS,$(KOKKOS_LDFLAGS))
@$(call kokkos_setlist_cmakefile,KOKKOS_LIBS_LIST,$(KOKKOS_LIBS))
@$(call kokkos_setlist_cmakefile,KOKKOS_EXTRA_LIBS_LIST,$(KOKKOS_EXTRA_LIBS))


@ -0,0 +1,28 @@
# Build a List of Header Files
KOKKOS_HEADERS_INCLUDE = $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
KOKKOS_HEADERS_INCLUDE_IMPL = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
KOKKOS_HEADERS_INCLUDE_IMPL += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp)
KOKKOS_HEADERS_INCLUDE += $(wildcard $(KOKKOS_PATH)/algorithms/src/*.hpp)
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
KOKKOS_HEADERS_CUDA += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
KOKKOS_HEADERS_THREADS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp)
endif
ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
KOKKOS_HEADERS_QTHREADS += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.hpp)
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
KOKKOS_HEADERS_OPENMP += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp)
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
KOKKOS_HEADERS_ROCM += $(wildcard $(KOKKOS_PATH)/core/src/ROCm/*.hpp)
endif


@ -294,7 +294,7 @@ void OpenMP::initialize( int thread_count )
}
{
if (nullptr == std::getenv("OMP_PROC_BIND") ) {
if ( Kokkos::show_warnings() && nullptr == std::getenv("OMP_PROC_BIND") ) {
printf("Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set\n");
printf(" In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads\n");
printf(" For best performance with OpenMP 3.1 set OMP_PROC_BIND=true\n");
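The settings recommended by this warning can be exported before launching the application (the comments mirror the printed advice; no particular executable is assumed):

```shell
# Pin OpenMP threads as the warning above suggests (OpenMP >= 4.0):
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# For OpenMP 3.1 runtimes, use instead:
#   export OMP_PROC_BIND=true
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"
```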
@ -327,7 +327,7 @@ void OpenMP::initialize( int thread_count )
omp_set_num_threads(Impl::g_openmp_hardware_max_threads);
}
else {
if( thread_count > process_num_threads ) {
if( Kokkos::show_warnings() && thread_count > process_num_threads ) {
printf( "Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores.\n");
printf( " process threads available : %3d, requested thread : %3d\n", process_num_threads, thread_count );
}
@ -364,12 +364,12 @@ void OpenMP::initialize( int thread_count )
// Check for over-subscription
//if( Impl::mpi_ranks_per_node() * long(thread_count) > Impl::processors_per_node() ) {
// std::cout << "Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores." << std::endl;
// std::cout << " Detected: " << Impl::processors_per_node() << " cores per node." << std::endl;
// std::cout << " Detected: " << Impl::mpi_ranks_per_node() << " MPI_ranks per node." << std::endl;
// std::cout << " Requested: " << thread_count << " threads per process." << std::endl;
//}
if( Kokkos::show_warnings() && (Impl::mpi_ranks_per_node() * long(thread_count) > Impl::processors_per_node()) ) {
std::cout << "Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores." << std::endl;
std::cout << " Detected: " << Impl::processors_per_node() << " cores per node." << std::endl;
std::cout << " Detected: " << Impl::mpi_ranks_per_node() << " MPI_ranks per node." << std::endl;
std::cout << " Requested: " << thread_count << " threads per process." << std::endl;
}
// Init the array for used for arbitrarily sized atomics
Impl::init_lock_array_host_space();


@ -170,20 +170,20 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, Kokkos::OpenMP
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename MDRangePolicy::work_tag WorkTag ;
typedef typename Policy::WorkRange WorkRange ;
typedef typename Policy::member_type Member ;
typedef typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
typedef typename Kokkos::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
OpenMPExec * m_instance ;
const FunctorType m_functor ;
@ -292,11 +292,12 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
// Static Assert WorkTag void if ReducerType not InvalidType
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
@ -393,7 +394,7 @@ public:
, m_instance->get_thread_data(i)->pool_reduce_local() );
}
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
if ( m_result_ptr ) {
const int n = Analysis::value_count( ReducerConditional::select(m_functor , m_reducer) );
@ -445,14 +446,14 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ReducerType, class ... Traits >
class ParallelReduce< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ...>
, Kokkos::MDRangePolicy< Traits ...>
, ReducerType
, Kokkos::OpenMP
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename MDRangePolicy::work_tag WorkTag ;
@ -463,16 +464,17 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef typename ReducerTypeFwd::value_type ValueType;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
using iterate_type = typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy
using iterate_type = typename Kokkos::Impl::HostIterateTile< MDRangePolicy
, FunctorType
, WorkTag
, ValueType
@ -558,7 +560,7 @@ public:
, m_instance->get_thread_data(i)->pool_reduce_local() );
}
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
if ( m_result_ptr ) {
const int n = Analysis::value_count( ReducerConditional::select(m_functor , m_reducer) );
@ -920,9 +922,10 @@ private:
, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , WorkTagFwd > ValueJoin ;
typedef typename Analysis::pointer_type pointer_type ;
typedef typename Analysis::reference_type reference_type ;
@ -1067,7 +1070,7 @@ public:
, m_instance->get_thread_data(i)->pool_reduce_local() );
}
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , ptr );
if ( m_result_ptr ) {
const int n = Analysis::value_count( ReducerConditional::select(m_functor , m_reducer) );


@ -49,33 +49,26 @@ namespace Impl {
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType ,
Kokkos::Experimental::WorkGraphPolicy< Traits ... > ,
Kokkos::WorkGraphPolicy< Traits ... > ,
Kokkos::OpenMP
>
: public Kokkos::Impl::Experimental::
WorkGraphExec< FunctorType,
Kokkos::OpenMP,
Traits ...
>
{
private:
typedef Kokkos::Experimental::WorkGraphPolicy< Traits ... > Policy ;
typedef Kokkos::Impl::Experimental::
WorkGraphExec<FunctorType, Kokkos::OpenMP, Traits ... > Base ;
typedef Kokkos::WorkGraphPolicy< Traits ... > Policy ;
Policy m_policy ;
FunctorType m_functor ;
template< class TagType >
typename std::enable_if< std::is_same< TagType , void >::value >::type
exec_one(const typename Policy::member_type& i) const {
Base::m_functor( i );
}
exec_one( const std::int32_t w ) const noexcept
{ m_functor( w ); }
template< class TagType >
typename std::enable_if< ! std::is_same< TagType , void >::value >::type
exec_one(const typename Policy::member_type& i) const {
const TagType t{} ;
Base::m_functor( t , i );
}
exec_one( const std::int32_t w ) const noexcept
{ const TagType t{} ; m_functor( t , w ); }
public:
@ -86,9 +79,15 @@ public:
#pragma omp parallel num_threads(pool_size)
{
for (std::int32_t i; (-1 != (i = Base::before_work())); ) {
exec_one< typename Policy::work_tag >( i );
Base::after_work(i);
// Spin until COMPLETED_TOKEN.
// END_TOKEN indicates no work is currently available.
for ( std::int32_t w = Policy::END_TOKEN ;
Policy::COMPLETED_TOKEN != ( w = m_policy.pop_work() ) ; ) {
if ( Policy::END_TOKEN != w ) {
exec_one< typename Policy::work_tag >( w );
m_policy.completed_work(w);
}
}
}
}
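The spin loop above distinguishes two sentinel results from `pop_work()`: `END_TOKEN` (no work momentarily available, keep spinning) and `COMPLETED_TOKEN` (all work done, exit). The control flow can be sketched with a fake single-threaded queue; the token values and the `FakeQueue` type here are hypothetical stand-ins, not the Kokkos definitions:

```cpp
#include <cassert>
#include <queue>

constexpr int END_TOKEN = -1;        // queue momentarily empty: retry
constexpr int COMPLETED_TOKEN = -2;  // all work completed: exit loop

struct FakeQueue {
  std::queue<int> q;
  int remaining = 0;                 // work items not yet completed
  int pop_work() {
    if (remaining == 0) return COMPLETED_TOKEN;
    if (q.empty()) return END_TOKEN;
    const int w = q.front(); q.pop();
    return w;
  }
  void completed_work(int) { --remaining; }
};

// Mirrors the parallel region's loop: spin until COMPLETED_TOKEN,
// executing and completing each real work index encountered.
int drain(FakeQueue& p) {
  int executed = 0;
  for (int w = END_TOKEN; COMPLETED_TOKEN != (w = p.pop_work()); ) {
    if (END_TOKEN != w) { ++executed; p.completed_work(w); }
  }
  return executed;
}
```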
@ -96,12 +95,13 @@ public:
inline
ParallelFor( const FunctorType & arg_functor
, const Policy & arg_policy )
: Base( arg_functor, arg_policy )
{
}
: m_policy( arg_policy )
, m_functor( arg_functor )
{}
};
} // namespace Impl
} // namespace Kokkos
#endif /* #define KOKKOS_OPENMP_WORKGRAPHPOLICY_HPP */


@ -248,12 +248,13 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
// Static Assert WorkTag void if ReducerType not InvalidType
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , WorkTagFwd > ValueJoin ;
enum {HasJoin = ReduceFunctorHasJoin<FunctorType>::value };
enum {UseReducer = is_reducer_type<ReducerType>::value };
@ -620,10 +621,11 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , WorkTag > ValueJoin ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , WorkTagFwd > ValueJoin ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;


@ -150,11 +150,12 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType, ReducerType>::value, FunctorType, ReducerType > ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType, ReducerType>::value, WorkTag, void >::type WorkTagFwd;
// Static Assert WorkTag void if ReducerType not InvalidType
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
@ -213,7 +214,7 @@ public:
const pointer_type data = (pointer_type) QthreadsExec::exec_all_reduce_result();
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer) , data );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , data );
if ( m_result_ptr ) {
const unsigned n = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
@ -331,9 +332,10 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType, ReducerType>::value, WorkTag, void >::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
@ -394,7 +396,7 @@ public:
const pointer_type data = (pointer_type) QthreadsExec::exec_all_reduce_result();
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTag >::final( ReducerConditional::select(m_functor , m_reducer), data );
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer), data );
if ( m_result_ptr ) {
const unsigned n = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );


@ -125,7 +125,7 @@ namespace Kokkos {
oldval.t = *dest ;
assume.i = oldval.i ;
newval.t = val ;
atomic_compare_exchange( reinterpret_cast<int*>(dest) , assume.i, newval.i );
atomic_compare_exchange( (int*)(dest) , assume.i, newval.i );
return oldval.t ;
}

View File

@@ -608,6 +608,7 @@ ROCmInternal::scratch_space( const Kokkos::Experimental::ROCm::size_type size )
void ROCmInternal::finalize()
{
Kokkos::Impl::rocm_device_synchronize();
was_finalized = 1;
if ( 0 != m_scratchSpace || 0 != m_scratchFlags ) {

View File

@@ -277,7 +277,7 @@ public:
this->team_barrier();
value = local_value;
}
// Reduce accross a team of threads.
// Reduce across a team of threads.
//
// Each thread has vector_length elements.
// This reduction is for TeamThreadRange operations, where the range
@@ -354,6 +354,80 @@ public:
return buffer[0];
}
// Reduce across a team of threads, with a reducer data type
//
// Each thread has vector_length elements.
// This reduction is for TeamThreadRange operations, where the range
// is spread across threads. Effectively, there are vector_length
// independent reduction operations.
// This is different from a reduction across the elements of a thread,
// which reduces every vector element.
template< class ReducerType >
KOKKOS_INLINE_FUNCTION
typename std::enable_if< is_reducer< ReducerType >::value >::type
team_reduce( const ReducerType & reducer) const
{
typedef typename ReducerType::value_type value_type ;
tile_static value_type buffer[512];
const auto local = lindex();
const auto team = team_rank();
auto vector_rank = local%m_vector_length;
auto thread_base = team*m_vector_length;
const std::size_t size = next_pow_2(m_team_size+1)/2;
#if defined(ROCM15)
buffer[local] = reducer.reference();
#else
// ROCm 1.5 handles address spaces better; earlier versions must go through lds_for
lds_for(buffer[local], [&](value_type& x)
{
x = reducer.reference();
});
#endif
m_idx.barrier.wait();
for(std::size_t s = 1; s < size; s *= 2)
{
const std::size_t index = 2 * s * team;
if (index < size)
{
#if defined(ROCM15)
reducer.join(buffer[vector_rank+index*m_vector_length],
buffer[vector_rank+(index+s)*m_vector_length]);
#else
lds_for(buffer[vector_rank+index*m_vector_length], [&](value_type& x)
{
lds_for(buffer[vector_rank+(index+s)*m_vector_length],
[&](value_type& y)
{
reducer.join(x, y);
});
});
#endif
}
m_idx.barrier.wait();
}
if (local == 0)
{
for(int i=size*m_vector_length; i<m_team_size*m_vector_length; i+=m_vector_length)
#if defined(ROCM15)
reducer.join(buffer[vector_rank], buffer[vector_rank+i]);
#else
lds_for(buffer[vector_rank], [&](value_type& x)
{
lds_for(buffer[vector_rank+i],
[&](value_type& y)
{
reducer.join(x, y);
});
});
#endif
}
m_idx.barrier.wait();
}
/** \brief Intra-team vector reduce
* with intra-team non-deterministic ordering accumulation.
@@ -406,6 +480,33 @@ public:
return buffer[thread_base];
}
template< typename ReducerType >
KOKKOS_INLINE_FUNCTION static
typename std::enable_if< is_reducer< ReducerType >::value >::type
vector_reduce( ReducerType const & reducer )
{
#ifdef __HCC_ACCELERATOR__
if(blockDim_x == 1) return;
// Intra vector lane shuffle reduction:
typename ReducerType::value_type tmp ( reducer.reference() );
for ( int i = blockDim_x ; ( i >>= 1 ) ; ) {
shfl_down( reducer.reference() , i , blockDim_x );
if ( (int)threadIdx_x < i ) { reducer.join( tmp , reducer.reference() ); }
}
// Broadcast from root lane to all other lanes.
// Cannot use "butterfly" algorithm to avoid the broadcast
// because floating point summation is not associative
// and thus different threads could have different results.
shfl( reducer.reference() , 0 , blockDim_x );
#endif
}
/** \brief Intra-team exclusive prefix sum with team_rank() ordering
* with intra-team non-deterministic ordering accumulation.
*
@@ -1075,6 +1176,22 @@ void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROC
// Impl::JoinAdd<ValueType>());
}
/** \brief Inter-thread thread range parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all threads of the calling thread team and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,reducer.reference());
}
loop_boundaries.thread.team_reduce(reducer);
}
/** \brief Intra-thread thread range parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
@@ -1161,6 +1278,41 @@ void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::R
result = loop_boundaries.thread.thread_reduce(result,join);
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a summation of
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,reducer.reference());
}
loop_boundaries.thread.vector_reduce(reducer);
}
/** \brief Intra-thread vector parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
*
* The range i=0..N-1 is mapped to all vector lanes of the calling thread and a reduction of
* val is performed using JoinType(ValueType& val, const ValueType& update) and put into init_result.
* The input value of init_result is used as initializer for temporary variables of ValueType. Therefore
* the input value should be the neutral element with respect to the join operation (e.g. '0 for +-' or
* '1 for *'). This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType, class JoinType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, const JoinType& join, ReducerType const & reducer) {
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,reducer.reference());
loop_boundaries.thread.team_barrier();
}
reducer.reference() = loop_boundaries.thread.thread_reduce(reducer.reference(),join);
}
/** \brief Intra-thread vector parallel exclusive prefix sum. Executes lambda(iType i, ValueType & val, bool final)
* for each i=0..N-1.
*

View File

@@ -102,11 +102,12 @@ void reduce_enqueue(
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, F, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType, ReducerType>::value, Tag, void >::type TagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , Tag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , Tag > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , Tag > ValueJoin ;
typedef Kokkos::Impl::FunctorFinal< ReducerTypeFwd , Tag > ValueFinal ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , TagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , TagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd , TagFwd > ValueJoin ;
typedef Kokkos::Impl::FunctorFinal< ReducerTypeFwd , TagFwd > ValueFinal ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;

View File

@@ -266,7 +266,7 @@ void ThreadsExec::execute_sleep( ThreadsExec & exec , const void * )
const int rank_rev = exec.m_pool_size - ( exec.m_pool_rank + 1 );
for ( int i = 0 ; i < n ; ++i ) {
Impl::spinwait_while_equal( exec.m_pool_base[ rank_rev + (1<<i) ]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( exec.m_pool_base[ rank_rev + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
exec.m_pool_state = ThreadsExec::Inactive ;
@@ -310,7 +310,7 @@ void ThreadsExec::fence()
{
if ( s_thread_pool_size[0] ) {
// Wait for the root thread to complete:
Impl::spinwait_while_equal( s_threads_exec[0]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( s_threads_exec[0]->m_pool_state , ThreadsExec::Active );
}
s_current_function = 0 ;
@@ -716,12 +716,12 @@ void ThreadsExec::initialize( unsigned thread_count ,
}
// Check for over-subscription
//if( Impl::mpi_ranks_per_node() * long(thread_count) > Impl::processors_per_node() ) {
// std::cout << "Kokkos::Threads::initialize WARNING: You are likely oversubscribing your CPU cores." << std::endl;
// std::cout << " Detected: " << Impl::processors_per_node() << " cores per node." << std::endl;
// std::cout << " Detected: " << Impl::mpi_ranks_per_node() << " MPI_ranks per node." << std::endl;
// std::cout << " Requested: " << thread_count << " threads per process." << std::endl;
//}
if( Kokkos::show_warnings() && (Impl::mpi_ranks_per_node() * long(thread_count) > Impl::processors_per_node()) ) {
std::cout << "Kokkos::Threads::initialize WARNING: You are likely oversubscribing your CPU cores." << std::endl;
std::cout << " Detected: " << Impl::processors_per_node() << " cores per node." << std::endl;
std::cout << " Detected: " << Impl::mpi_ranks_per_node() << " MPI_ranks per node." << std::endl;
std::cout << " Requested: " << thread_count << " threads per process." << std::endl;
}
// Init the array for used for arbitrarily sized atomics
Impl::init_lock_array_host_space();

View File

@@ -107,7 +107,7 @@ private:
// Which thread am I stealing from currently
int m_current_steal_target;
// This thread's owned work_range
Kokkos::pair<long,long> m_work_range KOKKOS_ALIGN(16);
Kokkos::pair<long,long> m_work_range __attribute__((aligned(16))) ;
// Team Offset if one thread determines work_range for others
long m_team_work_index;
@@ -191,13 +191,13 @@ public:
// Fan-in reduction with highest ranking thread as the root
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
// Wait: Active -> Rendezvous
Impl::spinwait_while_equal( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
if ( rev_rank ) {
m_pool_state = ThreadsExec::Rendezvous ;
// Wait: Rendezvous -> Active
Impl::spinwait_while_equal( m_pool_state , ThreadsExec::Rendezvous );
Impl::spinwait_while_equal<int>( m_pool_state , ThreadsExec::Rendezvous );
}
else {
// Root thread does the reduction and broadcast
@@ -233,13 +233,13 @@ public:
// Fan-in reduction with highest ranking thread as the root
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
// Wait: Active -> Rendezvous
Impl::spinwait_while_equal( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
if ( rev_rank ) {
m_pool_state = ThreadsExec::Rendezvous ;
// Wait: Rendezvous -> Active
Impl::spinwait_while_equal( m_pool_state , ThreadsExec::Rendezvous );
Impl::spinwait_while_equal<int>( m_pool_state , ThreadsExec::Rendezvous );
}
else {
// Root thread does the reduction and broadcast
@@ -268,7 +268,7 @@ public:
ThreadsExec & fan = *m_pool_base[ rev_rank + ( 1 << i ) ] ;
Impl::spinwait_while_equal( fan.m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( fan.m_pool_state , ThreadsExec::Active );
Join::join( f , reduce_memory() , fan.reduce_memory() );
}
@@ -295,7 +295,7 @@ public:
const int rev_rank = m_pool_size - ( m_pool_rank + 1 );
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
Impl::spinwait_while_equal( m_pool_base[rev_rank+(1<<i)]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( m_pool_base[rev_rank+(1<<i)]->m_pool_state , ThreadsExec::Active );
}
}
@@ -327,7 +327,7 @@ public:
ThreadsExec & fan = *m_pool_base[ rev_rank + (1<<i) ];
// Wait: Active -> ReductionAvailable (or ScanAvailable)
Impl::spinwait_while_equal( fan.m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( fan.m_pool_state , ThreadsExec::Active );
Join::join( f , work_value , fan.reduce_memory() );
}
@@ -345,8 +345,8 @@ public:
// Wait: Active -> ReductionAvailable
// Wait: ReductionAvailable -> ScanAvailable
Impl::spinwait_while_equal( th.m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal( th.m_pool_state , ThreadsExec::ReductionAvailable );
Impl::spinwait_while_equal<int>( th.m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( th.m_pool_state , ThreadsExec::ReductionAvailable );
Join::join( f , work_value + count , ((scalar_type *)th.reduce_memory()) + count );
}
@@ -357,7 +357,7 @@ public:
// Wait for all threads to complete inclusive scan
// Wait: ScanAvailable -> Rendezvous
Impl::spinwait_while_equal( m_pool_state , ThreadsExec::ScanAvailable );
Impl::spinwait_while_equal<int>( m_pool_state , ThreadsExec::ScanAvailable );
}
//--------------------------------
@@ -365,7 +365,7 @@ public:
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
ThreadsExec & fan = *m_pool_base[ rev_rank + (1<<i) ];
// Wait: ReductionAvailable -> ScanAvailable
Impl::spinwait_while_equal( fan.m_pool_state , ThreadsExec::ReductionAvailable );
Impl::spinwait_while_equal<int>( fan.m_pool_state , ThreadsExec::ReductionAvailable );
// Set: ScanAvailable -> Rendezvous
fan.m_pool_state = ThreadsExec::Rendezvous ;
}
@@ -392,13 +392,13 @@ public:
// Wait for all threads to copy previous thread's inclusive scan value
// Wait for all threads: Rendezvous -> ScanCompleted
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
Impl::spinwait_while_equal( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Rendezvous );
Impl::spinwait_while_equal<int>( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Rendezvous );
}
if ( rev_rank ) {
// Set: ScanAvailable -> ScanCompleted
m_pool_state = ThreadsExec::ScanCompleted ;
// Wait: ScanCompleted -> Active
Impl::spinwait_while_equal( m_pool_state , ThreadsExec::ScanCompleted );
Impl::spinwait_while_equal<int>( m_pool_state , ThreadsExec::ScanCompleted );
}
// Set: ScanCompleted -> Active
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
@@ -425,7 +425,7 @@ public:
// Fan-in reduction with highest ranking thread as the root
for ( int i = 0 ; i < m_pool_fan_size ; ++i ) {
// Wait: Active -> Rendezvous
Impl::spinwait_while_equal( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( m_pool_base[ rev_rank + (1<<i) ]->m_pool_state , ThreadsExec::Active );
}
for ( unsigned i = 0 ; i < count ; ++i ) { work_value[i+count] = work_value[i]; }
@@ -433,7 +433,7 @@ public:
if ( rev_rank ) {
m_pool_state = ThreadsExec::Rendezvous ;
// Wait: Rendezvous -> Active
Impl::spinwait_while_equal( m_pool_state , ThreadsExec::Rendezvous );
Impl::spinwait_while_equal<int>( m_pool_state , ThreadsExec::Rendezvous );
}
else {
// Root thread does the thread-scan before releasing threads

View File

@@ -107,13 +107,13 @@ public:
// Wait for fan-in threads
for ( n = 1 ; ( ! ( m_team_rank_rev & n ) ) && ( ( j = m_team_rank_rev + n ) < m_team_size ) ; n <<= 1 ) {
Impl::spinwait_while_equal( m_team_base[j]->state() , ThreadsExec::Active );
Impl::spinwait_while_equal<int>( m_team_base[j]->state() , ThreadsExec::Active );
}
// If not root then wait for release
if ( m_team_rank_rev ) {
m_exec->state() = ThreadsExec::Rendezvous ;
Impl::spinwait_while_equal( m_exec->state() , ThreadsExec::Rendezvous );
Impl::spinwait_while_equal<int>( m_exec->state() , ThreadsExec::Rendezvous );
}
return ! m_team_rank_rev ;

View File

@@ -180,12 +180,12 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ... Traits >
class ParallelFor< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, Kokkos::Threads
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename MDRangePolicy::work_tag WorkTag ;
@@ -193,7 +193,7 @@ private:
typedef typename Policy::WorkRange WorkRange ;
typedef typename Policy::member_type Member ;
typedef typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
typedef typename Kokkos::Impl::HostIterateTile< MDRangePolicy, FunctorType, typename MDRangePolicy::work_tag, void > iterate_type;
const FunctorType m_functor ;
const MDRangePolicy m_mdr_policy ;
@@ -396,9 +396,10 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
@@ -458,7 +459,7 @@ private:
( self.m_functor , range.begin() , range.end()
, ValueInit::init( ReducerConditional::select(self.m_functor , self.m_reducer) , exec.reduce_memory() ) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTag >( ReducerConditional::select(self.m_functor , self.m_reducer) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTagFwd >( ReducerConditional::select(self.m_functor , self.m_reducer) );
}
template<class Schedule>
@@ -484,7 +485,7 @@ private:
work_index = exec.get_work_index();
}
exec.template fan_in_reduce< ReducerTypeFwd , WorkTag >( ReducerConditional::select(self.m_functor , self.m_reducer) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTagFwd >( ReducerConditional::select(self.m_functor , self.m_reducer) );
}
public:
@@ -548,14 +549,14 @@ public:
// MDRangePolicy impl
template< class FunctorType , class ReducerType, class ... Traits >
class ParallelReduce< FunctorType
, Kokkos::Experimental::MDRangePolicy< Traits ... >
, Kokkos::MDRangePolicy< Traits ... >
, ReducerType
, Kokkos::Threads
>
{
private:
typedef Kokkos::Experimental::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef Kokkos::MDRangePolicy< Traits ... > MDRangePolicy ;
typedef typename MDRangePolicy::impl_range_policy Policy ;
typedef typename MDRangePolicy::work_tag WorkTag ;
@@ -564,16 +565,17 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef typename ReducerTypeFwd::value_type ValueType;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
using iterate_type = typename Kokkos::Experimental::Impl::HostIterateTile< MDRangePolicy
using iterate_type = typename Kokkos::Impl::HostIterateTile< MDRangePolicy
, FunctorType
, WorkTag
, ValueType
@@ -618,7 +620,7 @@ private:
( self.m_mdr_policy, self.m_functor , range.begin() , range.end()
, ValueInit::init( ReducerConditional::select(self.m_functor , self.m_reducer) , exec.reduce_memory() ) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTag >( ReducerConditional::select(self.m_functor , self.m_reducer) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTagFwd >( ReducerConditional::select(self.m_functor , self.m_reducer) );
}
template<class Schedule>
@@ -644,7 +646,7 @@ private:
work_index = exec.get_work_index();
}
exec.template fan_in_reduce< ReducerTypeFwd , WorkTag >( ReducerConditional::select(self.m_functor , self.m_reducer) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTagFwd >( ReducerConditional::select(self.m_functor , self.m_reducer) );
}
public:
@@ -725,9 +727,10 @@ private:
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTag > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTag > ValueInit ;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd , WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd , WorkTagFwd > ValueInit ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::reference_type reference_type ;
@@ -767,7 +770,7 @@ private:
( self.m_functor , Member( & exec , self.m_policy , self.m_shared )
, ValueInit::init( ReducerConditional::select(self.m_functor , self.m_reducer) , exec.reduce_memory() ) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTag >( ReducerConditional::select(self.m_functor , self.m_reducer) );
exec.template fan_in_reduce< ReducerTypeFwd , WorkTagFwd >( ReducerConditional::select(self.m_functor , self.m_reducer) );
}
public:

Some files were not shown because too many files have changed in this diff.