Update Kokkos library in LAMMPS to v3.2.1

Stan Moore 2020-11-30 10:36:13 -07:00
parent 91f21fcd9e
commit a6037a957f
23 changed files with 647 additions and 284 deletions

View File

@ -52,13 +52,17 @@ There are numerous device backends, options, and architecture-specific optimizat
````
which activates the OpenMP backend. All of the options controlling device backends, options, architectures, and third-party libraries (TPLs) are given below.
## Platform-specific Problems
## Known Issues<a name="KnownIssues"></a>
### Cray
* The Cray compiler wrappers do static linking by default. This seems to break the Kokkos build. You will likely need to set the environment variable `CRAYPE_LINK_TYPE=dynamic` in order to link correctly. Kokkos warns during configure if this is missing.
* The Cray compiler identifies itself to CMake as Clang, but it sometimes has its own flags that differ from Clang's. We try to handle all the exceptions, but errors may occur when a Clang-specific flag is passed that the Cray compiler does not recognize.
### Fortran
* In a mixed C++/Fortran code, CMake will use the C++ linker by default. If you override this behavior and use Fortran as the link language, the link may break because Kokkos adds linker flags expecting the linker to be C++. Prior to CMake 3.18, Kokkos has no way of detecting in downstream projects that the linker was changed to Fortran. From CMake 3.18, Kokkos can use generator expressions to avoid adding flags when the linker is not C++. Note: Kokkos will not add any linker flags in this Fortran case. The user will be entirely on their own to add the appropriate linker flags.
## Spack
An alternative to manually building with CMake is to use the Spack package manager.
To do so, download the `kokkos-spack` git repo and add it to the package list:

View File

@ -1,5 +1,27 @@
# Change Log
## [3.2.01](https://github.com/kokkos/kokkos/tree/3.2.01) (2020-11-17)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.2.00...3.2.01)
**Fixed bugs:**
- Disallow KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE in shared library builds [\#3332](https://github.com/kokkos/kokkos/pull/3332)
- Do not install libprinter-tool when testing is enabled [\#3313](https://github.com/kokkos/kokkos/pull/3313)
- Fix restrict/alignment following refactor [\#3373](https://github.com/kokkos/kokkos/pull/3373)
- Intel fix: workaround compiler issue with using statement [\#3383](https://github.com/kokkos/kokkos/pull/3383)
- Fix zero-length reductions [\#3364](https://github.com/kokkos/kokkos/pull/3364)
- Pthread zero-length reduction fix [\#3452](https://github.com/kokkos/kokkos/pull/3452)
- HPX zero-length reduction fix [\#3470](https://github.com/kokkos/kokkos/pull/3470)
- cuda/9.2 zero-length reduction fix [\#3580](https://github.com/kokkos/kokkos/pull/3580)
- Fix multi-stream scratch [\#3269](https://github.com/kokkos/kokkos/pull/3269)
- Guard KOKKOS_ALL_COMPILE_OPTIONS if Cuda is not enabled [\#3387](https://github.com/kokkos/kokkos/pull/3387)
- Do not include link flags for Fortran linkage [\#3384](https://github.com/kokkos/kokkos/pull/3384)
- Fix NVIDIA GPU arch macro with autodetection [\#3473](https://github.com/kokkos/kokkos/pull/3473)
- Fix libdl/test issues with Trilinos [\#3543](https://github.com/kokkos/kokkos/pull/3543)
- Register Pthread as Tribits option to be enabled with Trilinos [\#3558](https://github.com/kokkos/kokkos/pull/3558)
**Implemented enhancements:**
- Separate Cuda timing-based tests into their own executable [\#3407](https://github.com/kokkos/kokkos/pull/3407)
## [3.2.00](https://github.com/kokkos/kokkos/tree/3.2.00) (2020-08-19)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.1.01...3.2.00)

View File

@ -112,7 +112,7 @@ ENDIF()
set(Kokkos_VERSION_MAJOR 3)
set(Kokkos_VERSION_MINOR 2)
set(Kokkos_VERSION_PATCH 0)
set(Kokkos_VERSION_PATCH 1)
set(Kokkos_VERSION "${Kokkos_VERSION_MAJOR}.${Kokkos_VERSION_MINOR}.${Kokkos_VERSION_PATCH}")
math(EXPR KOKKOS_VERSION "${Kokkos_VERSION_MAJOR} * 10000 + ${Kokkos_VERSION_MINOR} * 100 + ${Kokkos_VERSION_PATCH}")
@ -207,7 +207,9 @@ IF (KOKKOS_HAS_TRILINOS)
# we have to match the annoying behavior
STRING(REPLACE ";" " " KOKKOSCORE_COMPILE_OPTIONS "${KOKKOS_COMPILE_OPTIONS}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_COMPILE_OPTIONS})
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_CUDA_OPTIONS})
IF (KOKKOS_ENABLE_CUDA)
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_CUDA_OPTIONS})
ENDIF()
FOREACH(XCOMP_FLAG ${KOKKOS_XCOMPILER_OPTIONS})
SET(KOKKOSCORE_XCOMPILER_OPTIONS "${KOKKOSCORE_XCOMPILER_OPTIONS} -Xcompiler ${XCOMP_FLAG}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcompiler ${XCOMP_FLAG})

View File

@ -18,7 +18,7 @@ profiling and debugging tools (https://github.com/kokkos/kokkos-tools).
A programming guide can be found on the Wiki; the API reference is under development.
For questions find us on Slack: https://kokkosteam.slack.com or open a GitHub issue.
For questions find us on Slack: https://kokkosteam.slack.com or open a github issue.
For non-public questions send an email to
crtrott(at)sandia.gov
@ -44,7 +44,7 @@ To learn more about Kokkos consider watching one of our presentations:
We are open and try to encourage contributions from external developers.
To do so please first open an issue describing the contribution and then issue
a pull request against the develop branch. For larger features it may be good
to get guidance from the core development team first through the GitHub issue.
to get guidance from the core development team first through the github issue.
Note that Kokkos Core is licensed under standard 3-clause BSD terms of use,
which means contributing to Kokkos allows anyone else to use your contributions
@ -108,6 +108,10 @@ For specifics see the LICENSE file contained in the repository or distribution.
* ARM
* Pthreads backend
### Build system:
* CMake >= 3.10: required
* CMake >= 3.13: recommended
* CMake >= 3.18: Fortran linkage. This does not affect most mixed Fortran/Kokkos builds. See [build issues](BUILD.md#KnownIssues).
Primary tested compilers are passing in release mode
with warnings as errors. They also are tested with a comprehensive set of

View File

@ -103,11 +103,18 @@ FUNCTION(KOKKOS_ADD_TEST)
COMM serial mpi
NUM_MPI_PROCS 1
${TEST_UNPARSED_ARGUMENTS}
ADDED_TESTS_NAMES_OUT ALL_TESTS_ADDED
)
# We will get the prepended package name here
SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME})
SET(EXE ${PACKAGE_NAME}_${EXE_ROOT})
if(TEST_TOOL)
add_dependencies(${EXE} ${TEST_TOOL}) # make sure building the exe also builds the tool
set_property(TEST ${TEST_NAME} APPEND_STRING PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
foreach(TEST_ADDED ${ALL_TESTS_ADDED})
set_property(TEST ${TEST_ADDED} APPEND PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
endforeach()
endif()
else()
CMAKE_PARSE_ARGUMENTS(TEST
@ -263,9 +270,7 @@ ENDFUNCTION()
FUNCTION(KOKKOS_ADD_TEST_LIBRARY NAME)
IF (KOKKOS_HAS_TRILINOS)
TRIBITS_ADD_LIBRARY(${NAME} ${ARGN} TESTONLY
ADDED_LIB_TARGET_NAME_OUT ${NAME}
)
TRIBITS_ADD_LIBRARY(${NAME} ${ARGN} TESTONLY)
ELSE()
SET(oneValueArgs)
SET(multiValueArgs HEADERS SOURCES)
@ -273,16 +278,26 @@ ELSE()
CMAKE_PARSE_ARGUMENTS(PARSE
"STATIC;SHARED"
""
"HEADERS;SOURCES"
"HEADERS;SOURCES;DEPLIBS"
${ARGN})
SET(LIB_TYPE)
IF (PARSE_STATIC)
SET(LIB_TYPE STATIC)
ELSEIF (PARSE_SHARED)
SET(LIB_TYPE SHARED)
ENDIF()
IF(PARSE_HEADERS)
LIST(REMOVE_DUPLICATES PARSE_HEADERS)
ENDIF()
IF(PARSE_SOURCES)
LIST(REMOVE_DUPLICATES PARSE_SOURCES)
ENDIF()
ADD_LIBRARY(${NAME} ${PARSE_SOURCES})
ADD_LIBRARY(${NAME} ${LIB_TYPE} ${PARSE_SOURCES})
IF (PARSE_DEPLIBS)
TARGET_LINK_LIBRARIES(${NAME} PRIVATE ${PARSE_DEPLIBS})
ENDIF()
ENDIF()
ENDFUNCTION()

View File

@ -65,32 +65,6 @@ KOKKOS_ARCH_OPTION(VEGA900 GPU "AMD GPU MI25 GFX900")
KOKKOS_ARCH_OPTION(VEGA906 GPU "AMD GPU MI50/MI60 GFX906")
KOKKOS_ARCH_OPTION(INTEL_GEN GPU "Intel GPUs Gen9+")
IF (KOKKOS_ENABLE_CUDA)
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_KEPLER30 OR KOKKOS_ARCH_KEPLER32 OR KOKKOS_ARCH_KEPLER35 OR KOKKOS_ARCH_KEPLER37)
SET(KOKKOS_ARCH_KEPLER ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_MAXWELL50 OR KOKKOS_ARCH_MAXWELL52 OR KOKKOS_ARCH_MAXWELL53)
SET(KOKKOS_ARCH_MAXWELL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_PASCAL60 OR KOKKOS_ARCH_PASCAL61)
SET(KOKKOS_ARCH_PASCAL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_VOLTA70 OR KOKKOS_ARCH_VOLTA72)
SET(KOKKOS_ARCH_VOLTA ON)
ENDIF()
IF (KOKKOS_ARCH_AMPERE80)
SET(KOKKOS_ARCH_AMPERE ON)
ENDIF()
ENDIF()
IF(KOKKOS_ENABLE_COMPILER_WARNINGS)
@ -475,6 +449,32 @@ IF(KOKKOS_ENABLE_CUDA AND NOT CUDA_ARCH_ALREADY_SPECIFIED)
ENDIF()
ENDIF()
IF (KOKKOS_ENABLE_CUDA)
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_KEPLER30 OR KOKKOS_ARCH_KEPLER32 OR KOKKOS_ARCH_KEPLER35 OR KOKKOS_ARCH_KEPLER37)
SET(KOKKOS_ARCH_KEPLER ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_MAXWELL50 OR KOKKOS_ARCH_MAXWELL52 OR KOKKOS_ARCH_MAXWELL53)
SET(KOKKOS_ARCH_MAXWELL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_PASCAL60 OR KOKKOS_ARCH_PASCAL61)
SET(KOKKOS_ARCH_PASCAL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_VOLTA70 OR KOKKOS_ARCH_VOLTA72)
SET(KOKKOS_ARCH_VOLTA ON)
ENDIF()
IF (KOKKOS_ARCH_AMPERE80)
SET(KOKKOS_ARCH_AMPERE ON)
ENDIF()
ENDIF()
#CMake verbose is kind of pointless
#Let's just always print things
MESSAGE(STATUS "Execution Spaces:")

View File

@ -109,3 +109,7 @@ ENDIF()
IF (KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE AND KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
MESSAGE(FATAL_ERROR "Relocatable device code is currently not supported with Clang - must use nvcc_wrapper or turn off RDC")
ENDIF()
IF (KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE AND BUILD_SHARED_LIBS)
MESSAGE(FATAL_ERROR "Relocatable device code requires static libraries.")
ENDIF()

View File

@ -209,6 +209,11 @@ ENDMACRO()
#
# If specified, this TPL will build an INTERFACE library rather than an
# IMPORTED target
IF (KOKKOS_HAS_TRILINOS)
MACRO(kokkos_import_tpl NAME)
#do nothing
ENDMACRO()
ELSE()
MACRO(kokkos_import_tpl NAME)
CMAKE_PARSE_ARGUMENTS(TPL
"NO_EXPORT;INTERFACE"
@ -241,6 +246,7 @@ MACRO(kokkos_import_tpl NAME)
LIST(APPEND KOKKOS_ENABLED_TPLS ${NAME})
ENDIF()
ENDMACRO(kokkos_import_tpl)
ENDIF()
MACRO(kokkos_import_cmake_tpl MODULE_NAME)
kokkos_import_tpl(${MODULE_NAME} ${ARGN} NO_EXPORT)

View File

@ -2,22 +2,49 @@ KOKKOS_CFG_DEPENDS(TPLS OPTIONS)
KOKKOS_CFG_DEPENDS(TPLS DEVICES)
FUNCTION(KOKKOS_TPL_OPTION PKG DEFAULT)
CMAKE_PARSE_ARGUMENTS(PARSED
""
"TRIBITS"
""
${ARGN})
IF (PARSED_TRIBITS)
#this is also a TPL option you can activate with Tribits
IF (NOT "${TPL_ENABLE_${PARSED_TRIBITS}}" STREQUAL "")
#Tribits brought its own default that should take precedence
SET(DEFAULT ${TPL_ENABLE_${PARSED_TRIBITS}})
ENDIF()
ENDIF()
KOKKOS_ENABLE_OPTION(${PKG} ${DEFAULT} "Whether to enable the ${PKG} library")
KOKKOS_OPTION(${PKG}_DIR "" PATH "Location of ${PKG} library")
SET(KOKKOS_ENABLE_${PKG} ${KOKKOS_ENABLE_${PKG}} PARENT_SCOPE)
SET(KOKKOS_${PKG}_DIR ${KOKKOS_${PKG}_DIR} PARENT_SCOPE)
IF (KOKKOS_HAS_TRILINOS
AND KOKKOS_ENABLE_${PKG}
AND NOT PARSED_TRIBITS)
#this TPL was enabled, but it is not valid to use inside of TriBITS
MESSAGE(FATAL_ERROR "Enabled TPL ${PKG} inside TriBITS build, "
"but this can only be enabled in a standalone build")
ENDIF()
ENDFUNCTION()
KOKKOS_TPL_OPTION(HWLOC Off)
KOKKOS_TPL_OPTION(LIBNUMA Off)
KOKKOS_TPL_OPTION(MEMKIND Off)
KOKKOS_TPL_OPTION(CUDA Off)
KOKKOS_TPL_OPTION(LIBRT Off)
IF (WIN32)
KOKKOS_TPL_OPTION(LIBDL Off)
ELSE()
KOKKOS_TPL_OPTION(LIBDL On)
IF(KOKKOS_ENABLE_MEMKIND)
SET(KOKKOS_ENABLE_HBWSPACE ON)
ENDIF()
KOKKOS_TPL_OPTION(CUDA ${Kokkos_ENABLE_CUDA} TRIBITS CUDA)
KOKKOS_TPL_OPTION(LIBRT Off)
IF (WIN32)
SET(LIBDL_DEFAULT Off)
ELSE()
SET(LIBDL_DEFAULT On)
ENDIF()
KOKKOS_TPL_OPTION(LIBDL ${LIBDL_DEFAULT} TRIBITS DLlib)
IF(Trilinos_ENABLE_Kokkos AND TPL_ENABLE_HPX)
SET(HPX_DEFAULT ON)
@ -31,7 +58,7 @@ SET(PTHREAD_DEFAULT ON)
ELSE()
SET(PTHREAD_DEFAULT OFF)
ENDIF()
KOKKOS_TPL_OPTION(PTHREAD ${PTHREAD_DEFAULT})
KOKKOS_TPL_OPTION(PTHREAD ${PTHREAD_DEFAULT} TRIBITS Pthread)
#Make sure we use our local FindKokkosCuda.cmake

View File

@ -202,8 +202,8 @@ MACRO(KOKKOS_SETUP_BUILD_ENVIRONMENT)
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_arch.cmake)
IF (NOT KOKKOS_HAS_TRILINOS)
SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Kokkos_SOURCE_DIR}/cmake/Modules/")
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_tpls.cmake)
ENDIF()
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_tpls.cmake)
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_corner_cases.cmake)
ENDIF()
ENDMACRO()
@ -234,12 +234,21 @@ FUNCTION(KOKKOS_SET_LIBRARY_PROPERTIES LIBRARY_NAME)
""
${ARGN})
IF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.13")
#great, this works the "right" way
IF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.18")
#I can use link options
#check for CXX linkage using the simple 3.18 way
TARGET_LINK_OPTIONS(
${LIBRARY_NAME} PUBLIC
$<$<LINK_LANGUAGE:CXX>:${KOKKOS_LINK_OPTIONS}>
)
ELSEIF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.13")
#I can use link options
#just assume CXX linkage
TARGET_LINK_OPTIONS(
${LIBRARY_NAME} PUBLIC ${KOKKOS_LINK_OPTIONS}
)
ELSE()
#assume CXX linkage, we have no good way to check otherwise
IF (PARSE_PLAIN_STYLE)
TARGET_LINK_LIBRARIES(
${LIBRARY_NAME} ${KOKKOS_LINK_OPTIONS}
@ -361,10 +370,15 @@ FUNCTION(KOKKOS_ADD_LIBRARY LIBRARY_NAME)
CMAKE_PARSE_ARGUMENTS(PARSE
"ADD_BUILD_OPTIONS"
""
""
"HEADERS"
${ARGN}
)
IF (KOKKOS_HAS_TRILINOS)
# We do not pass headers to trilinos. They would get installed
# to the default include folder, but we want headers installed
# preserving the directory structure, e.g. impl
# If headers got installed in both locations, it breaks some
# downstream packages
TRIBITS_ADD_LIBRARY(${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS})
#Stolen from Tribits - it can add prefixes
SET(TRIBITS_LIBRARY_NAME_PREFIX "${${PROJECT_NAME}_LIBRARY_NAME_PREFIX}")
@ -379,8 +393,10 @@ FUNCTION(KOKKOS_ADD_LIBRARY LIBRARY_NAME)
#Do not set any transitive properties and keep everything working as before
#KOKKOS_SET_LIBRARY_PROPERTIES(${TRIBITS_LIBRARY_NAME} PLAIN_STYLE)
ELSE()
# Forward the headers, we want to know about all headers
# to make sure they appear correctly in IDEs
KOKKOS_INTERNAL_ADD_LIBRARY(
${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS})
${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS} HEADERS ${PARSE_HEADERS})
IF (PARSE_ADD_BUILD_OPTIONS)
KOKKOS_SET_LIBRARY_PROPERTIES(${LIBRARY_NAME})
ENDIF()

View File

@ -299,19 +299,21 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
// *m_chunks[m_chunk_max] stores the current number of chunks being used
uintptr_t* const pc = reinterpret_cast<uintptr_t*>(m_chunks + m_chunk_max);
std::string _label =
m_track.template get_label<typename traits::memory_space>();
if (*pc < NC) {
while (*pc < NC) {
m_chunks[*pc] = reinterpret_cast<value_pointer_type>(
typename traits::memory_space().allocate(sizeof(local_value_type)
<< m_chunk_shift));
typename traits::memory_space().allocate(
_label.c_str(), sizeof(local_value_type) << m_chunk_shift));
++*pc;
}
} else {
while (NC + 1 <= *pc) {
--*pc;
typename traits::memory_space().deallocate(
m_chunks[*pc], sizeof(local_value_type) << m_chunk_shift);
_label.c_str(), m_chunks[*pc],
sizeof(local_value_type) << m_chunk_shift);
m_chunks[*pc] = nullptr;
}
}
@ -356,7 +358,9 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
//----------------------------------------------------------------------
struct Destroy {
typename traits::value_type** m_chunks;
using local_value_type = typename traits::value_type;
std::string m_label;
local_value_type** m_chunks;
unsigned m_chunk_max;
bool m_destroy;
unsigned m_chunk_size;
@ -365,7 +369,9 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
// Two entries beyond the max chunks are allocation counters.
inline void operator()(unsigned i) const {
if (m_destroy && i < m_chunk_max && nullptr != m_chunks[i]) {
typename traits::memory_space().deallocate(m_chunks[i], m_chunk_size);
typename traits::memory_space().deallocate(
m_label.c_str(), m_chunks[i],
sizeof(local_value_type) * m_chunk_size);
}
m_chunks[i] = nullptr;
}
@ -397,9 +403,10 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
Destroy& operator=(Destroy&&) = default;
Destroy& operator=(const Destroy&) = default;
Destroy(typename traits::value_type** arg_chunk,
Destroy(std::string label, typename traits::value_type** arg_chunk,
const unsigned arg_chunk_max, const unsigned arg_chunk_size)
: m_chunks(arg_chunk),
: m_label(label),
m_chunks(arg_chunk),
m_chunk_max(arg_chunk_max),
m_destroy(false),
m_chunk_size(arg_chunk_size) {}
@ -443,7 +450,7 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
m_chunks = reinterpret_cast<pointer_type*>(record->data());
record->m_destroy = Destroy(m_chunks, m_chunk_max, m_chunk_size);
record->m_destroy = Destroy(arg_label, m_chunks, m_chunk_max, m_chunk_size);
// Initialize to zero
record->m_destroy.construct_shared_allocation();
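
The change above threads the DynamicView's allocation label into every chunk allocate/deallocate and into the `Destroy` functor, so each chunk (de)allocation can be attributed to the named allocation (which profiling hooks key on). A minimal sketch of that pattern with hypothetical names (`LabeledChunk` is illustrative, not Kokkos API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Illustrative only: the label is stored alongside the pointer so the
// matching deallocation can be reported under the same name, mirroring how
// this change passes the view label into memory_space allocate/deallocate.
struct LabeledChunk {
  std::string label;
  void* ptr;
  std::size_t bytes;

  static LabeledChunk allocate(std::string label, std::size_t bytes) {
    std::printf("allocate   '%s' (%zu bytes)\n", label.c_str(), bytes);
    return {std::move(label), std::malloc(bytes), bytes};
  }
  void deallocate() {
    std::printf("deallocate '%s' (%zu bytes)\n", label.c_str(), bytes);
    std::free(ptr);
    ptr = nullptr;
  }
};

int main() {
  auto chunk = LabeledChunk::allocate("DynamicViewChunks", 64);
  chunk.deallocate();  // freed and reported under the stored label
  return 0;
}
```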

View File

@ -931,29 +931,6 @@ void SharedAllocationRecord<Kokkos::CudaHostPinnedSpace, void>::print_records(
// </editor-fold> end SharedAllocationRecord::print_records() }}}1
//==============================================================================
void *cuda_resize_scratch_space(std::int64_t bytes, bool force_shrink) {
static void *ptr = nullptr;
static std::int64_t current_size = 0;
if (current_size == 0) {
current_size = bytes;
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
if (bytes > current_size) {
current_size = bytes;
Kokkos::kokkos_free<Kokkos::CudaSpace>(ptr);
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
if ((bytes < current_size) && (force_shrink)) {
current_size = bytes;
Kokkos::kokkos_free<Kokkos::CudaSpace>(ptr);
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
return ptr;
}
void cuda_prefetch_pointer(const Cuda &space, const void *ptr, size_t bytes,
bool to_device) {
if ((ptr == nullptr) || (bytes == 0)) return;

View File

@ -277,23 +277,25 @@ CudaInternal::~CudaInternal() {
std::cerr.flush();
}
m_cudaDev = -1;
m_cudaArch = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_maxConcurrency = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_scratchUnifiedSupported = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_cudaDev = -1;
m_cudaArch = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_maxConcurrency = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_scratchUnifiedSupported = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
int CudaInternal::verify_is_initialized(const char *const label) const {
@ -552,7 +554,9 @@ void CudaInternal::initialize(int cuda_device_id, cudaStream_t stream) {
CUDA_SAFE_CALL(cudaEventCreate(&constantMemReusable));
}
m_stream = stream;
m_stream = stream;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
//----------------------------------------------------------------------------
@ -653,12 +657,31 @@ Cuda::size_type *CudaInternal::scratch_functor(
return m_scratchFunctor;
}
void *CudaInternal::resize_team_scratch_space(std::int64_t bytes,
bool force_shrink) {
if (m_team_scratch_current_size == 0) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>(
"CudaSpace::ScratchMemory", m_team_scratch_current_size);
}
if ((bytes > m_team_scratch_current_size) ||
((bytes < m_team_scratch_current_size) && (force_shrink))) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_realloc<Kokkos::CudaSpace>(
m_team_scratch_ptr, m_team_scratch_current_size);
}
return m_team_scratch_ptr;
}
//----------------------------------------------------------------------------
void CudaInternal::finalize() {
was_finalized = true;
if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
Impl::finalize_host_cuda_lock_arrays();
// Only finalize this if we're the singleton
if (this == &singleton()) {
Impl::finalize_host_cuda_lock_arrays();
}
using RecordCuda = Kokkos::Impl::SharedAllocationRecord<CudaSpace>;
using RecordHost =
@ -671,20 +694,25 @@ void CudaInternal::finalize() {
if (m_scratchFunctorSize > 0)
RecordCuda::decrement(RecordCuda::get_record(m_scratchFunctor));
m_cudaDev = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
if (m_team_scratch_current_size > 0)
Kokkos::kokkos_free<Kokkos::CudaSpace>(m_team_scratch_ptr);
m_cudaDev = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
// only destroy these if we're finalizing the singleton
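
The team-scratch change above is the multi-stream fix (#3269): the removed `cuda_resize_scratch_space` cached its buffer in a function-local static shared by every Cuda instance (and thus every stream), so concurrent instances clobbered each other's scratch; `resize_team_scratch_space` keeps the cache per instance instead. A minimal sketch of the grow-or-shrink-on-demand pattern, with `std::malloc`/`std::realloc` standing in for `kokkos_malloc`/`kokkos_realloc`:

```cpp
#include <cstdint>
#include <cstdlib>

// Per-instance scratch cache (illustrative). Because the buffer lives in the
// instance rather than in a function-local static, two Cuda instances on
// different streams no longer share one scratch allocation.
struct InstanceScratch {
  std::int64_t size = 0;
  void* ptr = nullptr;

  void* resize(std::int64_t bytes, bool force_shrink = false) {
    if (size == 0) {
      size = bytes;
      ptr = std::malloc(size);  // stands in for kokkos_malloc<CudaSpace>
    } else if (bytes > size || (bytes < size && force_shrink)) {
      size = bytes;
      ptr = std::realloc(ptr, size);  // stands in for kokkos_realloc<CudaSpace>
    }
    return ptr;
  }
};
```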

View File

@ -104,10 +104,12 @@ class CudaInternal {
cudaDeviceProp m_deviceProp;
// Scratch Spaces for Reductions
mutable size_type m_scratchSpaceCount;
mutable size_type m_scratchFlagsCount;
mutable size_type m_scratchUnifiedCount;
mutable size_type m_scratchFunctorSize;
size_type m_scratchUnifiedSupported;
size_type m_streamCount;
mutable size_type* m_scratchSpace;
@ -117,6 +119,10 @@ class CudaInternal {
uint32_t* m_scratchConcurrentBitset;
cudaStream_t m_stream;
// Team Scratch Level 1 Space
mutable int64_t m_team_scratch_current_size;
mutable void* m_team_scratch_ptr;
bool was_initialized = false;
bool was_finalized = false;
@ -172,12 +178,19 @@ class CudaInternal {
m_scratchUnified(nullptr),
m_scratchFunctor(nullptr),
m_scratchConcurrentBitset(nullptr),
m_stream(nullptr) {}
m_stream(nullptr),
m_team_scratch_current_size(0),
m_team_scratch_ptr(nullptr) {}
// Resizing of reduction related scratch spaces
size_type* scratch_space(const size_type size) const;
size_type* scratch_flags(const size_type size) const;
size_type* scratch_unified(const size_type size) const;
size_type* scratch_functor(const size_type size) const;
// Resizing of team level 1 scratch
void* resize_team_scratch_space(std::int64_t bytes,
bool force_shrink = false);
};
} // Namespace Impl

View File

@ -75,12 +75,6 @@ __device__ __constant__ unsigned long kokkos_impl_cuda_constant_memory_buffer
#endif
namespace Kokkos {
namespace Impl {
void* cuda_resize_scratch_space(std::int64_t bytes, bool force_shrink = false);
}
} // namespace Kokkos
template <typename T>
inline __device__ T* kokkos_impl_cuda_shared_memory() {
extern __shared__ Kokkos::CudaSpace::size_type sh[];
@ -296,15 +290,19 @@ struct CudaParallelLaunch<
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -370,13 +368,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_constant_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_constant_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -430,15 +432,19 @@ struct CudaParallelLaunch<
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -488,13 +494,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_local_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_local_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -550,15 +560,19 @@ struct CudaParallelLaunch<
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -610,13 +624,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_global_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_global_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
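
Each of the `get_cuda_func_attributes` rewrites above replaces a racy `static bool attr_set` guard with C++11 "magic statics": initialization of a function-local static is guaranteed to run exactly once, even under concurrent callers, and the result is visible to all of them. A minimal CUDA-free sketch of the idiom:

```cpp
#include <iostream>
#include <thread>

// Stand-in for the expensive cudaFuncGetAttributes query.
static int query_value() {
  std::cout << "queried exactly once\n";
  return 42;
}

int get_cached() {
  // C++11 guarantees thread-safe, one-time initialization of a local static,
  // so no separate "attr_set" flag (and no race on it) is needed.
  static const int value = query_value();
  return value;
}

int main() {
  std::thread a([] { (void)get_cached(); });
  std::thread b([] { (void)get_cached(); });
  a.join();
  b.join();
  return get_cached() == 42 ? 0 : 1;
}
```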

View File

@ -775,10 +775,12 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
const int shmem_size_total = m_shmem_begin + m_shmem_size;
if (m_policy.space().impl_internal_space_instance()->m_maxShmemPerBlock <
@ -854,6 +856,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type* m_unified_space;
@ -919,11 +922,38 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
}
// Doing code duplication here to fix issue #3428
// Suspect optimizer bug??
// Reduce with final value at blockDim.y - 1 location.
if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Shortcut for length zero reduction
if (m_policy.begin() == m_policy.end()) {
// This is the final block with the final result at the final threads'
// location
size_type* const shared = kokkos_impl_cuda_shared_memory<size_type>() +
(blockDim.y - 1) * word_count.value;
size_type* const global =
m_result_ptr_device_accessible
? reinterpret_cast<size_type*>(m_result_ptr)
: (m_unified_space ? m_unified_space : m_scratch_space);
if (threadIdx.y == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), shared);
}
if (CudaTraits::WarpSize < word_count.value) {
__syncthreads();
}
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i];
}
} else if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd,
WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// This is the final block with the final result at the final threads'
// location
@ -1021,9 +1051,14 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
inline void execute() {
const index_type nwork = m_policy.end() - m_policy.begin();
if (nwork) {
const index_type nwork = m_policy.end() - m_policy.begin();
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_size = local_block_size(m_functor);
KOKKOS_ASSERT(block_size > 0);
m_scratch_space = cuda_internal_scratch_space(
@ -1049,12 +1084,14 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
WorkTag>(m_functor,
block.y);
if ((nwork == 0)
#ifdef KOKKOS_IMPL_DEBUG_CUDA_SERIAL_EXECUTION
if (Kokkos::Impl::CudaInternal::cuda_use_serial_execution()) {
|| Kokkos::Impl::CudaInternal::cuda_use_serial_execution()
#endif
) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
#endif
CudaParallelLaunch<ParallelReduce, LaunchBounds>(
*this, grid, block, shmem,
@ -1062,7 +1099,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
false); // copy to device and execute
if (!m_result_ptr_device_accessible) {
Cuda().fence();
m_policy.space().fence();
if (m_result_ptr) {
if (m_unified_space) {
@ -1098,6 +1135,9 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::CudaSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr) {}
@ -1112,6 +1152,10 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
MemorySpaceAccess<Kokkos::CudaSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr) {}
@ -1467,6 +1511,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type* m_unified_space;
@ -1561,10 +1606,35 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
// Reduce with final value at blockDim.y - 1 location.
if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Doing code duplication here to fix issue #3428
// Suspect optimizer bug??
if (m_league_size == 0) {
// This is the final block with the final result at the final threads'
// location
size_type* const shared = kokkos_impl_cuda_shared_memory<size_type>() +
(blockDim.y - 1) * word_count.value;
size_type* const global =
m_result_ptr_device_accessible
? reinterpret_cast<size_type*>(m_result_ptr)
: (m_unified_space ? m_unified_space : m_scratch_space);
if (threadIdx.y == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), shared);
}
if (CudaTraits::WarpSize < word_count.value) {
__syncthreads();
}
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i];
}
} else if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// This is the final block with the final result at the final threads'
// location
@ -1617,7 +1687,13 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
value_type init;
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), &init);
if (Impl::cuda_inter_block_reduction<FunctorType, ValueJoin, WorkTag>(
if (int_league_size == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), (void*)&value);
*result = value;
} else if (
Impl::cuda_inter_block_reduction<FunctorType, ValueJoin, WorkTag>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)
@ -1637,8 +1713,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
inline void execute() {
const int nwork = m_league_size * m_team_size;
if (nwork) {
const int nwork = m_league_size * m_team_size;
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_count =
UseShflReduction ? std::min(m_league_size, size_type(1024 * 32))
: std::min(int(m_league_size), m_team_size);
@ -1657,12 +1737,14 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
dim3 grid(block_count, 1, 1);
const int shmem_size_total = m_team_begin + m_shmem_begin + m_shmem_size;
if ((nwork == 0)
#ifdef KOKKOS_IMPL_DEBUG_CUDA_SERIAL_EXECUTION
if (Kokkos::Impl::CudaInternal::cuda_use_serial_execution()) {
|| Kokkos::Impl::CudaInternal::cuda_use_serial_execution()
#endif
) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
#endif
CudaParallelLaunch<ParallelReduce, LaunchBounds>(
*this, grid, block, shmem_size_total,
@ -1670,7 +1752,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
true); // copy to device and execute
if (!m_result_ptr_device_accessible) {
Cuda().fence();
m_policy.space().fence();
if (m_result_ptr) {
if (m_unified_space) {
@ -1706,6 +1788,9 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::CudaSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr),
@ -1728,13 +1813,6 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// Return Init value if the number of worksets is zero
if (m_league_size * m_team_size == 0) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
arg_result.data());
return;
}
m_team_begin =
UseShflReduction
? 0
@ -1750,10 +1828,13 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<std::int64_t>(m_scratch_size[1]) *
(static_cast<std::int64_t>(Cuda::concurrency() /
(m_team_size * m_vector_size))));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<std::int64_t>(m_scratch_size[1]) *
(static_cast<std::int64_t>(
Cuda::concurrency() /
(m_team_size * m_vector_size))));
// The global parallel_reduce does not support vector_length other than 1 at
// the moment
@ -1805,6 +1886,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
MemorySpaceAccess<Kokkos::CudaSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr),
@ -1827,13 +1912,6 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// Return Init value if the number of worksets is zero
if (arg_policy.league_size() == 0) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
return;
}
m_team_begin =
UseShflReduction
? 0
@ -1849,10 +1927,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
// The global parallel_reduce does not support vector_length other than 1 at
// the moment
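
The zero-length branches above, together with the matching HIP and host-backend changes below, all enforce the same contract: a `parallel_reduce` over an empty range must still run the functor's `init` (and `final`, if present) so the caller receives the reduction identity rather than stale memory. A user-level sketch of that contract, assuming a configured Kokkos build:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    double sum = -1.0;  // poison value; must end up as the identity, 0.0
    // Empty range: the body never runs, but init/final still must, so `sum`
    // becomes 0.0 instead of keeping stale data (the class of bug fixed here).
    Kokkos::parallel_reduce(
        "zero_length_sum", Kokkos::RangePolicy<>(0, 0),
        KOKKOS_LAMBDA(const int i, double& lsum) { lsum += i; }, sum);
    std::printf("sum = %g\n", sum);  // prints: sum = 0
  }
  Kokkos::finalize();
  return 0;
}
```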

View File

@ -166,6 +166,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space = nullptr;
size_type* m_scratch_flags = nullptr;
@ -230,11 +231,16 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
// Reduce with final value at blockDim.y - 1 location.
if (hip_single_inter_block_reduce_scan<false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
::Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Shortcut for length zero reduction
bool do_final_reduction = m_policy.begin() == m_policy.end();
if (!do_final_reduction)
do_final_reduction = hip_single_inter_block_reduce_scan<
false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
::Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags);
if (do_final_reduction) {
// This is the final block with the final result at the final threads'
// location
@ -288,11 +294,19 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
value_type init;
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), &init);
if (Impl::hip_inter_block_shuffle_reduction<ReducerTypeFwd, ValueJoin,
WorkTagFwd>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, max_active_thread)) {
if (m_policy.begin() == m_policy.end()) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer),
reinterpret_cast<void*>(&value));
pointer_type const final_result =
m_result_ptr_device_accessible ? m_result_ptr : result;
*final_result = value;
} else if (Impl::hip_inter_block_shuffle_reduction<ReducerTypeFwd,
ValueJoin, WorkTagFwd>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags,
max_active_thread)) {
unsigned int const id = threadIdx.y * blockDim.x + threadIdx.x;
if (id == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
@ -328,8 +342,12 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
inline void execute() {
const index_type nwork = m_policy.end() - m_policy.begin();
if (nwork) {
const index_type nwork = m_policy.end() - m_policy.begin();
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_size = local_block_size(m_functor);
KOKKOS_ASSERT(block_size > 0);
@ -343,10 +361,16 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
sizeof(size_type));
// REQUIRED ( 1 , N , 1 )
const dim3 block(1, block_size, 1);
dim3 block(1, block_size, 1);
// Required grid.x <= block.y
const dim3 grid(std::min(block.y, (nwork + block.y - 1) / block.y), 1, 1);
dim3 grid(std::min(block.y, static_cast<uint32_t>((nwork + block.y - 1) /
block.y)),
1, 1);
if (nwork == 0) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
const int shmem =
UseShflReduction
? 0
@ -389,6 +413,9 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr(arg_result.data()),
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible) {}
ParallelReduce(const FunctorType& arg_functor, const Policy& arg_policy,
@ -399,6 +426,10 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr(reducer.view().data()),
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible) {}
};

View File

@ -646,6 +646,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type m_team_begin;
@ -742,11 +743,15 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
// Reduce with final value at blockDim.y - 1 location.
if (hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
reducer_conditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
bool do_final_reduce = (m_league_size == 0);
if (!do_final_reduce)
do_final_reduce =
hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
reducer_conditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags);
if (do_final_reduce) {
// This is the final block with the final result at the final threads'
// location
@ -802,11 +807,17 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
value_type init;
value_init::init(reducer_conditional::select(m_functor, m_reducer), &init);
if (Impl::hip_inter_block_shuffle_reduction<FunctorType, value_join,
work_tag>(
value, init,
value_join(reducer_conditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)) {
if (int_league_size == 0) {
Kokkos::Impl::FunctorFinal<reducer_type_fwd, work_tag_fwd>::final(
reducer_conditional::select(m_functor, m_reducer),
reinterpret_cast<void*>(&value));
*result = value;
} else if (Impl::hip_inter_block_shuffle_reduction<FunctorType, value_join,
work_tag>(
value, init,
value_join(
reducer_conditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)) {
unsigned int const id = threadIdx.y * blockDim.x + threadIdx.x;
if (id == 0) {
Kokkos::Impl::FunctorFinal<reducer_type_fwd, work_tag_fwd>::final(
@ -818,8 +829,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
inline void execute() {
const int nwork = m_league_size * m_team_size;
if (nwork) {
const int nwork = m_league_size * m_team_size;
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_count =
UseShflReduction
? std::min(
@ -837,6 +852,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
dim3 block(m_vector_size, m_team_size, 1);
dim3 grid(block_count, 1, 1);
if (nwork == 0) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
const int shmem_size_total = m_team_begin + m_shmem_begin + m_shmem_size;
Kokkos::Experimental::Impl::HIPParallelLaunch<ParallelReduce,
@ -875,6 +894,9 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(0),
m_scratch_flags(0),
m_team_begin(0),
@ -896,12 +918,16 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// We can't early exit here because the result place might not be accessible
// or the functor/reducer init not callable on the host. But I am not sure
// all the other code below is kosher with zero work length ...
//
// Return Init value if the number of worksets is zero
if (m_league_size * m_team_size == 0) {
value_init::init(reducer_conditional::select(m_functor, m_reducer),
arg_result.data());
return;
}
// if (m_league_size * m_team_size == 0) {
// value_init::init(reducer_conditional::select(m_functor, m_reducer),
// arg_result.data());
// return;
//}
m_team_begin =
UseShflReduction
@ -974,6 +1000,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(0),
m_scratch_flags(0),
m_team_begin(0),
@ -995,12 +1025,16 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// We can't early exit here because the result place might not be accessible
// or the functor/reducer init not callable on the host. But I am not sure
// all the other code below is kosher with zero work length ...
//
// Return Init value if the number of worksets is zero
if (arg_policy.league_size() == 0) {
value_init::init(reducer_conditional::select(m_functor, m_reducer),
m_result_ptr);
return;
}
// if (arg_policy.league_size() == 0) {
// value_init::init(reducer_conditional::select(m_functor, m_reducer),
// m_result_ptr);
// return;
//}
m_team_begin =
UseShflReduction

View File

@ -1131,6 +1131,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using value_type = typename Analysis::value_type;
@ -1254,6 +1255,15 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
void execute() const {
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
return;
}
dispatch_execute_task(this, m_policy.space(), m_force_synchronous);
}
@ -1459,9 +1469,10 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
using WorkTagFwd =
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename Analysis::pointer_type;
using value_type = typename Analysis::value_type;
using reference_type = typename Analysis::reference_type;
@ -2060,9 +2071,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
using WorkTagFwd =
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename Analysis::pointer_type;
using reference_type = typename Analysis::reference_type;
using value_type = typename Analysis::value_type;
@ -2132,7 +2144,18 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
public:
void execute() const { dispatch_execute_task(this, m_policy.space()); }
void execute() const {
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
return;
}
dispatch_execute_task(this, m_policy.space());
}
inline void execute_task() const {
// See [note 1] for an explanation.
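
On the host backends the same shortcut is taken directly in `execute()`: with zero work items nothing is dispatched, and the result is produced by calling the functor/reducer `init` followed by `final` on the host. A minimal sketch of that init-then-final pattern with a hypothetical reducer:

```cpp
#include <cstdio>

// Hypothetical reducer exposing Kokkos-style init/final hooks.
struct SumReducer {
  void init(double& v) const { v = 0.0; }  // reduction identity
  void final(double&) const {}             // optional post-processing hook
};

// Host-side shortcut: with zero work items there is nothing to dispatch, so
// initialize the result to the identity and run final on it, as execute()
// above does via ValueInit::init and ValueFinal::final.
template <class Reducer>
void reduce_empty(const Reducer& r, double* result) {
  if (result) {
    r.init(*result);
    r.final(*result);
  }
}

int main() {
  double out = -1.0;
  reduce_empty(SumReducer{}, &out);
  std::printf("%g\n", out);  // 0
  return 0;
}
```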

View File

@ -297,6 +297,15 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
inline void execute() const {
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), m_result_ptr);
}
return;
}
enum {
is_dynamic = std::is_same<typename Policy::schedule_type::type,
Kokkos::Dynamic>::value
@ -1014,6 +1023,15 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
inline void execute() const {
enum { is_dynamic = std::is_same<SchedTag, Kokkos::Dynamic>::value };
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), m_result_ptr);
}
return;
}
OpenMPExec::verify_is_master("Kokkos::OpenMP parallel_reduce");
const size_t pool_reduce_size =

View File

@ -365,7 +365,8 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
using ValueTraits =
Kokkos::Impl::FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename ValueTraits::pointer_type;
using reference_type = typename ValueTraits::reference_type;
@ -461,23 +462,32 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
inline void execute() const {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
0);
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
} else {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
0);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::fence();
ThreadsExec::fence();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
}
}
}
}
@ -696,7 +706,8 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
using ValueTraits =
Kokkos::Impl::FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename ValueTraits::pointer_type;
using reference_type = typename ValueTraits::reference_type;
@ -743,23 +754,32 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
public:
inline void execute() const {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
Policy::member_type::team_reduce_size() + m_shared);
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
} else {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
Policy::member_type::team_reduce_size() + m_shared);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::fence();
ThreadsExec::fence();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
}
}
}
}

View File

@ -2696,8 +2696,8 @@ struct ViewDataHandle<
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using return_type = typename Traits::value_type&;
using handle_type = typename Traits::value_type* KOKKOS_RESTRICT;
using return_type = typename Traits::value_type& KOKKOS_RESTRICT;
using track_type = Kokkos::Impl::SharedAllocationTracker;
KOKKOS_INLINE_FUNCTION
@ -2725,8 +2725,12 @@ struct ViewDataHandle<
Kokkos::CudaUVMSpace>::value))
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using value_type = typename Traits::value_type;
// typedef work-around for intel compilers error #3186: expected typedef
// declaration
// NOLINTNEXTLINE(modernize-use-using)
typedef value_type* KOKKOS_IMPL_ALIGN_PTR(KOKKOS_MEMORY_ALIGNMENT)
handle_type;
using return_type = typename Traits::value_type&;
using track_type = Kokkos::Impl::SharedAllocationTracker;
@ -2766,9 +2770,13 @@ struct ViewDataHandle<
Kokkos::CudaUVMSpace>::value))
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using return_type = typename Traits::value_type&;
using value_type = typename Traits::value_type;
// typedef work-around for intel compilers error #3186: expected typedef
// declaration
// NOLINTNEXTLINE(modernize-use-using)
typedef value_type* KOKKOS_IMPL_ALIGN_PTR(KOKKOS_MEMORY_ALIGNMENT)
handle_type;
using return_type = typename Traits::value_type& KOKKOS_RESTRICT;
using track_type = Kokkos::Impl::SharedAllocationTracker;
KOKKOS_INLINE_FUNCTION
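
The handle-type changes above restore what the refactor dropped (#3373): a `KOKKOS_RESTRICT` qualifier telling the compiler the handle does not alias, and an aligned-pointer `typedef` (a `typedef` rather than `using`, to sidestep Intel error #3186). A compiler-specific sketch of what the restrict qualifier buys, assuming GCC/Clang's `__restrict__` as the underlying extension:

```cpp
#include <cstddef>

// With __restrict__ (an assumption here; KOKKOS_RESTRICT maps to the active
// compiler's equivalent) the compiler may assume a, b, and c never alias,
// allowing it to vectorize the loop without runtime overlap checks.
void vector_add(std::size_t n, const double* __restrict__ a,
                const double* __restrict__ b, double* __restrict__ c) {
  for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
```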

View File

@ -295,6 +295,12 @@ if(Kokkos_ENABLE_CUDA)
cuda/TestCudaUVM_ViewMapping_b.cpp
cuda/TestCudaUVM_ViewMapping_subview.cpp
cuda/TestCuda_Spaces.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST(
UnitTest_CudaTimingBased
SOURCES
UnitTestMainInit.cpp
cuda/TestCuda_DebugSerialExecution.cpp
cuda/TestCuda_DebugPinUVMSpace.cpp
)
@ -414,9 +420,9 @@ KOKKOS_ADD_ADVANCED_TEST( UnitTest_PushFinalizeHook_terminate
endif()
if(KOKKOS_ENABLE_LIBDL)
KOKKOS_ADD_LIBRARY(
printer-tool SHARED
SOURCES tools/printing-tool.cpp
KOKKOS_ADD_TEST_LIBRARY(
kokkosprinter-tool SHARED
SOURCES tools/printing-tool.cpp
)
KOKKOS_ADD_TEST_EXECUTABLE(
@ -431,7 +437,7 @@ KOKKOS_ADD_ADVANCED_TEST( UnitTest_PushFinalizeHook_terminate
KOKKOS_ADD_TEST( NAME ProfilingTestLibraryLoad
EXE ProfilingAllCalls
TOOL printer-tool
TOOL kokkosprinter-tool
PASS_REGULAR_EXPRESSION "kokkosp_init_library::kokkosp_allocate_data:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_begin_parallel_for:Kokkos::View::initialization [[]source]:0:0::kokkosp_end_parallel_for:0::kokkosp_allocate_data:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:40::kokkosp_begin_parallel_for:Kokkos::View::initialization [[]destination]:0:0::kokkosp_end_parallel_for:0::kokkosp_begin_deep_copy:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_end_deep_copy::kokkosp_begin_parallel_for:parallel_for:${SIZE_REGEX}:0::kokkosp_end_parallel_for:0::kokkosp_begin_parallel_reduce:parallel_reduce:0:1${SKIP_SCRATCH_INITIALIZATION_REGEX}::kokkosp_end_parallel_reduce:1::kokkosp_begin_parallel_scan:parallel_scan:${SIZE_REGEX}:2::kokkosp_end_parallel_scan:2::kokkosp_push_profile_region:push_region::kokkosp_pop_profile_region::kokkosp_create_profile_section:created_section:3::kokkosp_start_profile_section:3::kokkosp_stop_profile_section:3::kokkosp_destroy_profile_section:3::kokkosp_profile_event:profiling_event::kokkosp_deallocate_data:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:40::kokkosp_deallocate_data:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_finalize_library::"
)
endif() #KOKKOS_ENABLE_LIBDL