Update Kokkos library in LAMMPS to v3.2.1

Stan Moore 2020-11-30 10:36:13 -07:00
parent 91f21fcd9e
commit a6037a957f
23 changed files with 647 additions and 284 deletions

View File

@ -52,13 +52,17 @@ There are numerous device backends, options, and architecture-specific optimizat
````
which activates the OpenMP backend. All of the options controlling device backends, options, architectures, and third-party libraries (TPLs) are given below.
## Platform-specific Problems
## Known Issues<a name="KnownIssues"></a>
### Cray
* The Cray compiler wrappers do static linking by default. This seems to break the Kokkos build. You will likely need to set the environment variable `CRAYPE_LINK_TYPE=dynamic` in order to link correctly. Kokkos warns during configure if this is missing.
* The Cray compiler identifies itself to CMake as Clang, but it sometimes has its own flags that differ from Clang's. We try to handle all the exceptions, but errors may occur when a Clang-specific flag is passed that the Cray compiler does not recognize.
### Fortran
* In a mixed C++/Fortran code, CMake will use the C++ linker by default. If you override this behavior and use Fortran as the link language, the link may break because Kokkos adds linker flags expecting the linker to be C++. Prior to CMake 3.18, Kokkos has no way of detecting in downstream projects that the linker was changed to Fortran. From CMake 3.18, Kokkos can use generator expressions to avoid adding flags when the linker is not C++. Note: Kokkos will not add any linker flags in this Fortran case. The user will be entirely on their own to add the appropriate linker flags.
## Spack
An alternative to manually building with CMake is to use the Spack package manager.
To do so, download the `kokkos-spack` git repo and add it to the package list:

View File

@ -1,5 +1,27 @@
# Change Log
## [3.2.01](https://github.com/kokkos/kokkos/tree/3.2.01) (2020-11-17)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.2.00...3.2.01)
**Fixed bugs:**
- Disallow KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE in shared library builds [\#3332](https://github.com/kokkos/kokkos/pull/3332)
- Do not install libprinter-tool when testing is enabled [\#3313](https://github.com/kokkos/kokkos/pull/3313)
- Fix restrict/alignment following refactor [\#3373](https://github.com/kokkos/kokkos/pull/3373)
- Intel fix: workaround compiler issue with using statement [\#3383](https://github.com/kokkos/kokkos/pull/3383)
- Fix zero-length reductions [\#3364](https://github.com/kokkos/kokkos/pull/3364)
- Pthread zero-length reduction fix [\#3452](https://github.com/kokkos/kokkos/pull/3452)
- HPX zero-length reduction fix [\#3470](https://github.com/kokkos/kokkos/pull/3470)
- cuda/9.2 zero-length reduction fix [\#3580](https://github.com/kokkos/kokkos/pull/3580)
- Fix multi-stream scratch [\#3269](https://github.com/kokkos/kokkos/pull/3269)
- Guard KOKKOS_ALL_COMPILE_OPTIONS if Cuda is not enabled [\#3387](https://github.com/kokkos/kokkos/pull/3387)
- Do not include link flags for Fortran linkage [\#3384](https://github.com/kokkos/kokkos/pull/3384)
- Fix NVIDIA GPU arch macro with autodetection [\#3473](https://github.com/kokkos/kokkos/pull/3473)
- Fix libdl/test issues with Trilinos [\#3543](https://github.com/kokkos/kokkos/pull/3543)
- Register Pthread as Tribits option to be enabled with Trilinos [\#3558](https://github.com/kokkos/kokkos/pull/3558)
**Implemented enhancements:**
- Separate Cuda timing-based tests into their own executable [\#3407](https://github.com/kokkos/kokkos/pull/3407)
## [3.2.00](https://github.com/kokkos/kokkos/tree/3.2.00) (2020-08-19)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.1.01...3.2.00)

View File

@ -112,7 +112,7 @@ ENDIF()
set(Kokkos_VERSION_MAJOR 3)
set(Kokkos_VERSION_MINOR 2)
set(Kokkos_VERSION_PATCH 0)
set(Kokkos_VERSION_PATCH 1)
set(Kokkos_VERSION "${Kokkos_VERSION_MAJOR}.${Kokkos_VERSION_MINOR}.${Kokkos_VERSION_PATCH}")
math(EXPR KOKKOS_VERSION "${Kokkos_VERSION_MAJOR} * 10000 + ${Kokkos_VERSION_MINOR} * 100 + ${Kokkos_VERSION_PATCH}")
@ -207,7 +207,9 @@ IF (KOKKOS_HAS_TRILINOS)
# we have to match the annoying behavior
STRING(REPLACE ";" " " KOKKOSCORE_COMPILE_OPTIONS "${KOKKOS_COMPILE_OPTIONS}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_COMPILE_OPTIONS})
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_CUDA_OPTIONS})
IF (KOKKOS_ENABLE_CUDA)
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS ${KOKKOS_CUDA_OPTIONS})
ENDIF()
FOREACH(XCOMP_FLAG ${KOKKOS_XCOMPILER_OPTIONS})
SET(KOKKOSCORE_XCOMPILER_OPTIONS "${KOKKOSCORE_XCOMPILER_OPTIONS} -Xcompiler ${XCOMP_FLAG}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcompiler ${XCOMP_FLAG})

View File

@ -18,7 +18,7 @@ profiling and debugging tools (https://github.com/kokkos/kokkos-tools).
A programming guide can be found on the Wiki; the API reference is under development.
For questions find us on Slack: https://kokkosteam.slack.com or open a GitHub issue.
For questions find us on Slack: https://kokkosteam.slack.com or open a github issue.
For non-public questions send an email to
crtrott(at)sandia.gov
@ -44,7 +44,7 @@ To learn more about Kokkos consider watching one of our presentations:
We are open and try to encourage contributions from external developers.
To do so please first open an issue describing the contribution and then issue
a pull request against the develop branch. For larger features it may be good
to get guidance from the core development team first through the GitHub issue.
to get guidance from the core development team first through the github issue.
Note that Kokkos Core is licensed under standard 3-clause BSD terms of use,
which means contributing to Kokkos allows anyone else to use your contributions
@ -108,6 +108,10 @@ For specifics see the LICENSE file contained in the repository or distribution.
* ARM
* Pthreads backend
### Build system:
* CMake >= 3.10: required
* CMake >= 3.13: recommended
* CMake >= 3.18: Fortran linkage. This does not affect most mixed Fortran/Kokkos builds. See [build issues](BUILD.md#KnownIssues).
Primary tested compilers are passing in release mode
with warnings as errors. They also are tested with a comprehensive set of

View File

@ -103,11 +103,18 @@ FUNCTION(KOKKOS_ADD_TEST)
COMM serial mpi
NUM_MPI_PROCS 1
${TEST_UNPARSED_ARGUMENTS}
ADDED_TESTS_NAMES_OUT ALL_TESTS_ADDED
)
# We will get the prepended package name here
SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME})
SET(EXE ${PACKAGE_NAME}_${EXE_ROOT})
if(TEST_TOOL)
add_dependencies(${EXE} ${TEST_TOOL}) # make sure building the exe also builds the tool
set_property(TEST ${TEST_NAME} APPEND_STRING PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
foreach(TEST_ADDED ${ALL_TESTS_ADDED})
set_property(TEST ${TEST_ADDED} APPEND PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
endforeach()
endif()
else()
CMAKE_PARSE_ARGUMENTS(TEST
@ -263,9 +270,7 @@ ENDFUNCTION()
FUNCTION(KOKKOS_ADD_TEST_LIBRARY NAME)
IF (KOKKOS_HAS_TRILINOS)
TRIBITS_ADD_LIBRARY(${NAME} ${ARGN} TESTONLY
ADDED_LIB_TARGET_NAME_OUT ${NAME}
)
TRIBITS_ADD_LIBRARY(${NAME} ${ARGN} TESTONLY)
ELSE()
SET(oneValueArgs)
SET(multiValueArgs HEADERS SOURCES)
@ -273,16 +278,26 @@ ELSE()
CMAKE_PARSE_ARGUMENTS(PARSE
"STATIC;SHARED"
""
"HEADERS;SOURCES"
"HEADERS;SOURCES;DEPLIBS"
${ARGN})
SET(LIB_TYPE)
IF (PARSE_STATIC)
SET(LIB_TYPE STATIC)
ELSEIF (PARSE_SHARED)
SET(LIB_TYPE SHARED)
ENDIF()
IF(PARSE_HEADERS)
LIST(REMOVE_DUPLICATES PARSE_HEADERS)
ENDIF()
IF(PARSE_SOURCES)
LIST(REMOVE_DUPLICATES PARSE_SOURCES)
ENDIF()
ADD_LIBRARY(${NAME} ${PARSE_SOURCES})
ADD_LIBRARY(${NAME} ${LIB_TYPE} ${PARSE_SOURCES})
IF (PARSE_DEPLIBS)
TARGET_LINK_LIBRARIES(${NAME} PRIVATE ${PARSE_DEPLIBS})
ENDIF()
ENDIF()
ENDFUNCTION()

View File

@ -65,32 +65,6 @@ KOKKOS_ARCH_OPTION(VEGA900 GPU "AMD GPU MI25 GFX900")
KOKKOS_ARCH_OPTION(VEGA906 GPU "AMD GPU MI50/MI60 GFX906")
KOKKOS_ARCH_OPTION(INTEL_GEN GPU "Intel GPUs Gen9+")
IF (KOKKOS_ENABLE_CUDA)
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_KEPLER30 OR KOKKOS_ARCH_KEPLER32 OR KOKKOS_ARCH_KEPLER35 OR KOKKOS_ARCH_KEPLER37)
SET(KOKKOS_ARCH_KEPLER ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_MAXWELL50 OR KOKKOS_ARCH_MAXWELL52 OR KOKKOS_ARCH_MAXWELL53)
SET(KOKKOS_ARCH_MAXWELL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_PASCAL60 OR KOKKOS_ARCH_PASCAL61)
SET(KOKKOS_ARCH_PASCAL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_VOLTA70 OR KOKKOS_ARCH_VOLTA72)
SET(KOKKOS_ARCH_VOLTA ON)
ENDIF()
IF (KOKKOS_ARCH_AMPERE80)
SET(KOKKOS_ARCH_AMPERE ON)
ENDIF()
ENDIF()
IF(KOKKOS_ENABLE_COMPILER_WARNINGS)
@ -475,6 +449,32 @@ IF(KOKKOS_ENABLE_CUDA AND NOT CUDA_ARCH_ALREADY_SPECIFIED)
ENDIF()
ENDIF()
IF (KOKKOS_ENABLE_CUDA)
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_KEPLER30 OR KOKKOS_ARCH_KEPLER32 OR KOKKOS_ARCH_KEPLER35 OR KOKKOS_ARCH_KEPLER37)
SET(KOKKOS_ARCH_KEPLER ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_MAXWELL50 OR KOKKOS_ARCH_MAXWELL52 OR KOKKOS_ARCH_MAXWELL53)
SET(KOKKOS_ARCH_MAXWELL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_PASCAL60 OR KOKKOS_ARCH_PASCAL61)
SET(KOKKOS_ARCH_PASCAL ON)
ENDIF()
#Regardless of version, make sure we define the general architecture name
IF (KOKKOS_ARCH_VOLTA70 OR KOKKOS_ARCH_VOLTA72)
SET(KOKKOS_ARCH_VOLTA ON)
ENDIF()
IF (KOKKOS_ARCH_AMPERE80)
SET(KOKKOS_ARCH_AMPERE ON)
ENDIF()
ENDIF()
#CMake verbose is kind of pointless
#Let's just always print things
MESSAGE(STATUS "Execution Spaces:")

View File

@ -109,3 +109,7 @@ ENDIF()
IF (KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE AND KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
MESSAGE(FATAL_ERROR "Relocatable device code is currently not supported with Clang - must use nvcc_wrapper or turn off RDC")
ENDIF()
IF (KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE AND BUILD_SHARED_LIBS)
MESSAGE(FATAL_ERROR "Relocatable device code requires static libraries.")
ENDIF()

View File

@ -209,6 +209,11 @@ ENDMACRO()
#
# If specified, this TPL will build an INTERFACE library rather than an
# IMPORTED target
IF (KOKKOS_HAS_TRILINOS)
MACRO(kokkos_import_tpl NAME)
#do nothing
ENDMACRO()
ELSE()
MACRO(kokkos_import_tpl NAME)
CMAKE_PARSE_ARGUMENTS(TPL
"NO_EXPORT;INTERFACE"
@ -241,6 +246,7 @@ MACRO(kokkos_import_tpl NAME)
LIST(APPEND KOKKOS_ENABLED_TPLS ${NAME})
ENDIF()
ENDMACRO(kokkos_import_tpl)
ENDIF()
MACRO(kokkos_import_cmake_tpl MODULE_NAME)
kokkos_import_tpl(${MODULE_NAME} ${ARGN} NO_EXPORT)

View File

@ -2,22 +2,49 @@ KOKKOS_CFG_DEPENDS(TPLS OPTIONS)
KOKKOS_CFG_DEPENDS(TPLS DEVICES)
FUNCTION(KOKKOS_TPL_OPTION PKG DEFAULT)
CMAKE_PARSE_ARGUMENTS(PARSED
""
"TRIBITS"
""
${ARGN})
IF (PARSED_TRIBITS)
#this is also a TPL option you can activate with Tribits
IF (NOT "${TPL_ENABLE_${PARSED_TRIBITS}}" STREQUAL "")
#Tribits brought its own default that should take precedence
SET(DEFAULT ${TPL_ENABLE_${PARSED_TRIBITS}})
ENDIF()
ENDIF()
KOKKOS_ENABLE_OPTION(${PKG} ${DEFAULT} "Whether to enable the ${PKG} library")
KOKKOS_OPTION(${PKG}_DIR "" PATH "Location of ${PKG} library")
SET(KOKKOS_ENABLE_${PKG} ${KOKKOS_ENABLE_${PKG}} PARENT_SCOPE)
SET(KOKKOS_${PKG}_DIR ${KOKKOS_${PKG}_DIR} PARENT_SCOPE)
IF (KOKKOS_HAS_TRILINOS
AND KOKKOS_ENABLE_${PKG}
AND NOT PARSED_TRIBITS)
#this TPL was enabled, but it is not valid to use inside of TriBITS
MESSAGE(FATAL_ERROR "Enabled TPL ${PKG} inside TriBITS build, "
"but this can only be enabled in a standalone build")
ENDIF()
ENDFUNCTION()
KOKKOS_TPL_OPTION(HWLOC Off)
KOKKOS_TPL_OPTION(LIBNUMA Off)
KOKKOS_TPL_OPTION(MEMKIND Off)
KOKKOS_TPL_OPTION(CUDA Off)
KOKKOS_TPL_OPTION(LIBRT Off)
IF (WIN32)
KOKKOS_TPL_OPTION(LIBDL Off)
ELSE()
KOKKOS_TPL_OPTION(LIBDL On)
IF(KOKKOS_ENABLE_MEMKIND)
SET(KOKKOS_ENABLE_HBWSPACE ON)
ENDIF()
KOKKOS_TPL_OPTION(CUDA ${Kokkos_ENABLE_CUDA} TRIBITS CUDA)
KOKKOS_TPL_OPTION(LIBRT Off)
IF (WIN32)
SET(LIBDL_DEFAULT Off)
ELSE()
SET(LIBDL_DEFAULT On)
ENDIF()
KOKKOS_TPL_OPTION(LIBDL ${LIBDL_DEFAULT} TRIBITS DLlib)
IF(Trilinos_ENABLE_Kokkos AND TPL_ENABLE_HPX)
SET(HPX_DEFAULT ON)
@ -31,7 +58,7 @@ SET(PTHREAD_DEFAULT ON)
ELSE()
SET(PTHREAD_DEFAULT OFF)
ENDIF()
KOKKOS_TPL_OPTION(PTHREAD ${PTHREAD_DEFAULT})
KOKKOS_TPL_OPTION(PTHREAD ${PTHREAD_DEFAULT} TRIBITS Pthread)
#Make sure we use our local FindKokkosCuda.cmake

View File

@ -202,8 +202,8 @@ MACRO(KOKKOS_SETUP_BUILD_ENVIRONMENT)
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_arch.cmake)
IF (NOT KOKKOS_HAS_TRILINOS)
SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Kokkos_SOURCE_DIR}/cmake/Modules/")
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_tpls.cmake)
ENDIF()
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_tpls.cmake)
INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_corner_cases.cmake)
ENDIF()
ENDMACRO()
@ -234,12 +234,21 @@ FUNCTION(KOKKOS_SET_LIBRARY_PROPERTIES LIBRARY_NAME)
""
${ARGN})
IF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.13")
#great, this works the "right" way
IF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.18")
#I can use link options
#check for CXX linkage using the simple 3.18 way
TARGET_LINK_OPTIONS(
${LIBRARY_NAME} PUBLIC
$<$<LINK_LANGUAGE:CXX>:${KOKKOS_LINK_OPTIONS}>
)
ELSEIF(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.13")
#I can use link options
#just assume CXX linkage
TARGET_LINK_OPTIONS(
${LIBRARY_NAME} PUBLIC ${KOKKOS_LINK_OPTIONS}
)
ELSE()
#assume CXX linkage, we have no good way to check otherwise
IF (PARSE_PLAIN_STYLE)
TARGET_LINK_LIBRARIES(
${LIBRARY_NAME} ${KOKKOS_LINK_OPTIONS}
@ -361,10 +370,15 @@ FUNCTION(KOKKOS_ADD_LIBRARY LIBRARY_NAME)
CMAKE_PARSE_ARGUMENTS(PARSE
"ADD_BUILD_OPTIONS"
""
""
"HEADERS"
${ARGN}
)
IF (KOKKOS_HAS_TRILINOS)
# We do not pass headers to trilinos. They would get installed
# to the default include folder, but we want headers installed
# preserving the directory structure, e.g. impl
# If headers got installed in both locations, it breaks some
# downstream packages
TRIBITS_ADD_LIBRARY(${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS})
#Stolen from Tribits - it can add prefixes
SET(TRIBITS_LIBRARY_NAME_PREFIX "${${PROJECT_NAME}_LIBRARY_NAME_PREFIX}")
@ -379,8 +393,10 @@ FUNCTION(KOKKOS_ADD_LIBRARY LIBRARY_NAME)
#Do not set any transitive properties and keep everything working as before
#KOKKOS_SET_LIBRARY_PROPERTIES(${TRIBITS_LIBRARY_NAME} PLAIN_STYLE)
ELSE()
# Forward the headers, we want to know about all headers
# to make sure they appear correctly in IDEs
KOKKOS_INTERNAL_ADD_LIBRARY(
${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS})
${LIBRARY_NAME} ${PARSE_UNPARSED_ARGUMENTS} HEADERS ${PARSE_HEADERS})
IF (PARSE_ADD_BUILD_OPTIONS)
KOKKOS_SET_LIBRARY_PROPERTIES(${LIBRARY_NAME})
ENDIF()

View File

@ -299,19 +299,21 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
// *m_chunks[m_chunk_max] stores the current number of chunks being used
uintptr_t* const pc = reinterpret_cast<uintptr_t*>(m_chunks + m_chunk_max);
std::string _label =
m_track.template get_label<typename traits::memory_space>();
if (*pc < NC) {
while (*pc < NC) {
m_chunks[*pc] = reinterpret_cast<value_pointer_type>(
typename traits::memory_space().allocate(sizeof(local_value_type)
<< m_chunk_shift));
typename traits::memory_space().allocate(
_label.c_str(), sizeof(local_value_type) << m_chunk_shift));
++*pc;
}
} else {
while (NC + 1 <= *pc) {
--*pc;
typename traits::memory_space().deallocate(
m_chunks[*pc], sizeof(local_value_type) << m_chunk_shift);
_label.c_str(), m_chunks[*pc],
sizeof(local_value_type) << m_chunk_shift);
m_chunks[*pc] = nullptr;
}
}
@ -356,7 +358,9 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
//----------------------------------------------------------------------
struct Destroy {
typename traits::value_type** m_chunks;
using local_value_type = typename traits::value_type;
std::string m_label;
local_value_type** m_chunks;
unsigned m_chunk_max;
bool m_destroy;
unsigned m_chunk_size;
@ -365,7 +369,9 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
// Two entries beyond the max chunks are allocation counters.
inline void operator()(unsigned i) const {
if (m_destroy && i < m_chunk_max && nullptr != m_chunks[i]) {
typename traits::memory_space().deallocate(m_chunks[i], m_chunk_size);
typename traits::memory_space().deallocate(
m_label.c_str(), m_chunks[i],
sizeof(local_value_type) * m_chunk_size);
}
m_chunks[i] = nullptr;
}
@ -397,9 +403,10 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
Destroy& operator=(Destroy&&) = default;
Destroy& operator=(const Destroy&) = default;
Destroy(typename traits::value_type** arg_chunk,
Destroy(std::string label, typename traits::value_type** arg_chunk,
const unsigned arg_chunk_max, const unsigned arg_chunk_size)
: m_chunks(arg_chunk),
: m_label(label),
m_chunks(arg_chunk),
m_chunk_max(arg_chunk_max),
m_destroy(false),
m_chunk_size(arg_chunk_size) {}
@ -443,7 +450,7 @@ class DynamicView : public Kokkos::ViewTraits<DataType, P...> {
m_chunks = reinterpret_cast<pointer_type*>(record->data());
record->m_destroy = Destroy(m_chunks, m_chunk_max, m_chunk_size);
record->m_destroy = Destroy(arg_label, m_chunks, m_chunk_max, m_chunk_size);
// Initialize to zero
record->m_destroy.construct_shared_allocation();
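
The change above threads the DynamicView's allocation label into every chunk allocate/deallocate and into the `Destroy` functor, so each chunk (de)allocation can be attributed to the named allocation (which profiling hooks key on). A minimal sketch of that pattern with hypothetical names (`LabeledChunk` is illustrative, not Kokkos API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Illustrative only: the label is stored alongside the pointer so the
// matching deallocation can be reported under the same name, mirroring how
// this change passes the view label into memory_space allocate/deallocate.
struct LabeledChunk {
  std::string label;
  void* ptr;
  std::size_t bytes;

  static LabeledChunk allocate(std::string label, std::size_t bytes) {
    std::printf("allocate   '%s' (%zu bytes)\n", label.c_str(), bytes);
    return {std::move(label), std::malloc(bytes), bytes};
  }
  void deallocate() {
    std::printf("deallocate '%s' (%zu bytes)\n", label.c_str(), bytes);
    std::free(ptr);
    ptr = nullptr;
  }
};

int main() {
  auto chunk = LabeledChunk::allocate("DynamicViewChunks", 64);
  chunk.deallocate();  // freed and reported under the stored label
  return 0;
}
```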

View File

@ -931,29 +931,6 @@ void SharedAllocationRecord<Kokkos::CudaHostPinnedSpace, void>::print_records(
// </editor-fold> end SharedAllocationRecord::print_records() }}}1
//==============================================================================
void *cuda_resize_scratch_space(std::int64_t bytes, bool force_shrink) {
static void *ptr = nullptr;
static std::int64_t current_size = 0;
if (current_size == 0) {
current_size = bytes;
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
if (bytes > current_size) {
current_size = bytes;
Kokkos::kokkos_free<Kokkos::CudaSpace>(ptr);
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
if ((bytes < current_size) && (force_shrink)) {
current_size = bytes;
Kokkos::kokkos_free<Kokkos::CudaSpace>(ptr);
ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>("CudaSpace::ScratchMemory",
current_size);
}
return ptr;
}
void cuda_prefetch_pointer(const Cuda &space, const void *ptr, size_t bytes,
bool to_device) {
if ((ptr == nullptr) || (bytes == 0)) return;

View File

@ -277,23 +277,25 @@ CudaInternal::~CudaInternal() {
std::cerr.flush();
}
m_cudaDev = -1;
m_cudaArch = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_maxConcurrency = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_scratchUnifiedSupported = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_cudaDev = -1;
m_cudaArch = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_maxConcurrency = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_scratchUnifiedSupported = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
int CudaInternal::verify_is_initialized(const char *const label) const {
@ -552,7 +554,9 @@ void CudaInternal::initialize(int cuda_device_id, cudaStream_t stream) {
CUDA_SAFE_CALL(cudaEventCreate(&constantMemReusable));
}
m_stream = stream;
m_stream = stream;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
//----------------------------------------------------------------------------
@ -653,12 +657,31 @@ Cuda::size_type *CudaInternal::scratch_functor(
return m_scratchFunctor;
}
void *CudaInternal::resize_team_scratch_space(std::int64_t bytes,
bool force_shrink) {
if (m_team_scratch_current_size == 0) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_malloc<Kokkos::CudaSpace>(
"CudaSpace::ScratchMemory", m_team_scratch_current_size);
}
if ((bytes > m_team_scratch_current_size) ||
((bytes < m_team_scratch_current_size) && (force_shrink))) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_realloc<Kokkos::CudaSpace>(
m_team_scratch_ptr, m_team_scratch_current_size);
}
return m_team_scratch_ptr;
}
//----------------------------------------------------------------------------
void CudaInternal::finalize() {
was_finalized = true;
if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
Impl::finalize_host_cuda_lock_arrays();
// Only finalize this if we're the singleton
if (this == &singleton()) {
Impl::finalize_host_cuda_lock_arrays();
}
using RecordCuda = Kokkos::Impl::SharedAllocationRecord<CudaSpace>;
using RecordHost =
@ -671,20 +694,25 @@ void CudaInternal::finalize() {
if (m_scratchFunctorSize > 0)
RecordCuda::decrement(RecordCuda::get_record(m_scratchFunctor));
m_cudaDev = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
if (m_team_scratch_current_size > 0)
Kokkos::kokkos_free<Kokkos::CudaSpace>(m_team_scratch_ptr);
m_cudaDev = -1;
m_multiProcCount = 0;
m_maxWarpCount = 0;
m_maxBlock = 0;
m_maxSharedWords = 0;
m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0;
m_scratchUnifiedCount = 0;
m_streamCount = 0;
m_scratchSpace = nullptr;
m_scratchFlags = nullptr;
m_scratchUnified = nullptr;
m_scratchConcurrentBitset = nullptr;
m_stream = nullptr;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
}
// only destroy these if we're finalizing the singleton
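
The team-scratch change above is the multi-stream fix (#3269): the removed `cuda_resize_scratch_space` cached its buffer in a function-local static shared by every Cuda instance (and thus every stream), so concurrent instances clobbered each other's scratch; `resize_team_scratch_space` keeps the cache per instance instead. A minimal sketch of the grow-or-shrink-on-demand pattern, with `std::malloc`/`std::realloc` standing in for `kokkos_malloc`/`kokkos_realloc`:

```cpp
#include <cstdint>
#include <cstdlib>

// Per-instance scratch cache (illustrative). Because the buffer lives in the
// instance rather than in a function-local static, two Cuda instances on
// different streams no longer share one scratch allocation.
struct InstanceScratch {
  std::int64_t size = 0;
  void* ptr = nullptr;

  void* resize(std::int64_t bytes, bool force_shrink = false) {
    if (size == 0) {
      size = bytes;
      ptr = std::malloc(size);  // stands in for kokkos_malloc<CudaSpace>
    } else if (bytes > size || (bytes < size && force_shrink)) {
      size = bytes;
      ptr = std::realloc(ptr, size);  // stands in for kokkos_realloc<CudaSpace>
    }
    return ptr;
  }
};
```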

View File

@ -104,10 +104,12 @@ class CudaInternal {
cudaDeviceProp m_deviceProp;
// Scratch Spaces for Reductions
mutable size_type m_scratchSpaceCount;
mutable size_type m_scratchFlagsCount;
mutable size_type m_scratchUnifiedCount;
mutable size_type m_scratchFunctorSize;
size_type m_scratchUnifiedSupported;
size_type m_streamCount;
mutable size_type* m_scratchSpace;
@ -117,6 +119,10 @@ class CudaInternal {
uint32_t* m_scratchConcurrentBitset;
cudaStream_t m_stream;
// Team Scratch Level 1 Space
mutable int64_t m_team_scratch_current_size;
mutable void* m_team_scratch_ptr;
bool was_initialized = false;
bool was_finalized = false;
@ -172,12 +178,19 @@ class CudaInternal {
m_scratchUnified(nullptr),
m_scratchFunctor(nullptr),
m_scratchConcurrentBitset(nullptr),
m_stream(nullptr) {}
m_stream(nullptr),
m_team_scratch_current_size(0),
m_team_scratch_ptr(nullptr) {}
// Resizing of reduction related scratch spaces
size_type* scratch_space(const size_type size) const;
size_type* scratch_flags(const size_type size) const;
size_type* scratch_unified(const size_type size) const;
size_type* scratch_functor(const size_type size) const;
// Resizing of team level 1 scratch
void* resize_team_scratch_space(std::int64_t bytes,
bool force_shrink = false);
};
} // Namespace Impl

View File

@ -75,12 +75,6 @@ __device__ __constant__ unsigned long kokkos_impl_cuda_constant_memory_buffer
#endif
namespace Kokkos {
namespace Impl {
void* cuda_resize_scratch_space(std::int64_t bytes, bool force_shrink = false);
}
} // namespace Kokkos
template <typename T>
inline __device__ T* kokkos_impl_cuda_shared_memory() {
extern __shared__ Kokkos::CudaSpace::size_type sh[];
@ -296,15 +290,19 @@ struct CudaParallelLaunch<
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -370,13 +368,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_constant_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_constant_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -430,15 +432,19 @@ struct CudaParallelLaunch<
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -488,13 +494,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_local_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_local_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -550,15 +560,19 @@ struct CudaParallelLaunch<
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr,
&attr_tmp,
cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
attr_set = true;
}
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
@ -610,13 +624,17 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
}
static cudaFuncAttributes get_cuda_func_attributes() {
static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static. Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr, cuda_parallel_launch_global_memory<DriverType>));
attr_set = true;
}
&attr_tmp, cuda_parallel_launch_global_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
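
Each of the `get_cuda_func_attributes` rewrites above replaces a racy `static bool attr_set` guard with C++11 "magic statics": initialization of a function-local static is guaranteed to run exactly once, even under concurrent callers, and the result is visible to all of them. A minimal CUDA-free sketch of the idiom:

```cpp
#include <iostream>
#include <thread>

// Stand-in for the expensive cudaFuncGetAttributes query.
static int query_value() {
  std::cout << "queried exactly once\n";
  return 42;
}

int get_cached() {
  // C++11 guarantees thread-safe, one-time initialization of a local static,
  // so no separate "attr_set" flag (and no race on it) is needed.
  static const int value = query_value();
  return value;
}

int main() {
  std::thread a([] { (void)get_cached(); });
  std::thread b([] { (void)get_cached(); });
  a.join();
  b.join();
  return get_cached() == 42 ? 0 : 1;
}
```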

View File

@ -775,10 +775,12 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
const int shmem_size_total = m_shmem_begin + m_shmem_size;
if (m_policy.space().impl_internal_space_instance()->m_maxShmemPerBlock <
@ -854,6 +856,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type* m_unified_space;
@ -919,11 +922,38 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
}
// Doing code duplication here to fix issue #3428
// Suspect optimizer bug??
// Reduce with final value at blockDim.y - 1 location.
if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Shortcut for length zero reduction
if (m_policy.begin() == m_policy.end()) {
// This is the final block with the final result at the final threads'
// location
size_type* const shared = kokkos_impl_cuda_shared_memory<size_type>() +
(blockDim.y - 1) * word_count.value;
size_type* const global =
m_result_ptr_device_accessible
? reinterpret_cast<size_type*>(m_result_ptr)
: (m_unified_space ? m_unified_space : m_scratch_space);
if (threadIdx.y == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), shared);
}
if (CudaTraits::WarpSize < word_count.value) {
__syncthreads();
}
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i];
}
} else if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd,
WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// This is the final block with the final result at the final threads'
// location
@ -1021,9 +1051,14 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
inline void execute() {
const index_type nwork = m_policy.end() - m_policy.begin();
if (nwork) {
const index_type nwork = m_policy.end() - m_policy.begin();
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_size = local_block_size(m_functor);
KOKKOS_ASSERT(block_size > 0);
m_scratch_space = cuda_internal_scratch_space(
@ -1049,12 +1084,14 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
WorkTag>(m_functor,
block.y);
if ((nwork == 0)
#ifdef KOKKOS_IMPL_DEBUG_CUDA_SERIAL_EXECUTION
if (Kokkos::Impl::CudaInternal::cuda_use_serial_execution()) {
|| Kokkos::Impl::CudaInternal::cuda_use_serial_execution()
#endif
) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
#endif
CudaParallelLaunch<ParallelReduce, LaunchBounds>(
*this, grid, block, shmem,
@ -1062,7 +1099,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
false); // copy to device and execute
if (!m_result_ptr_device_accessible) {
Cuda().fence();
m_policy.space().fence();
if (m_result_ptr) {
if (m_unified_space) {
@ -1098,6 +1135,9 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::CudaSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr) {}
@ -1112,6 +1152,10 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
MemorySpaceAccess<Kokkos::CudaSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr) {}
@ -1467,6 +1511,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type* m_unified_space;
@ -1561,10 +1606,35 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
// Reduce with final value at blockDim.y - 1 location.
if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Doing code duplication here to fix issue #3428
// Suspect optimizer bug??
if (m_league_size == 0) {
// This is the final block with the final result at the final threads'
// location
size_type* const shared = kokkos_impl_cuda_shared_memory<size_type>() +
(blockDim.y - 1) * word_count.value;
size_type* const global =
m_result_ptr_device_accessible
? reinterpret_cast<size_type*>(m_result_ptr)
: (m_unified_space ? m_unified_space : m_scratch_space);
if (threadIdx.y == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), shared);
}
if (CudaTraits::WarpSize < word_count.value) {
__syncthreads();
}
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i];
}
} else if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// This is the final block with the final result at the final threads'
// location
@ -1617,7 +1687,13 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
value_type init;
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), &init);
if (Impl::cuda_inter_block_reduction<FunctorType, ValueJoin, WorkTag>(
if (int_league_size == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), (void*)&value);
*result = value;
} else if (
Impl::cuda_inter_block_reduction<FunctorType, ValueJoin, WorkTag>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)
@ -1637,8 +1713,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
inline void execute() {
const int nwork = m_league_size * m_team_size;
if (nwork) {
const int nwork = m_league_size * m_team_size;
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_count =
UseShflReduction ? std::min(m_league_size, size_type(1024 * 32))
: std::min(int(m_league_size), m_team_size);
@ -1657,12 +1737,14 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
dim3 grid(block_count, 1, 1);
const int shmem_size_total = m_team_begin + m_shmem_begin + m_shmem_size;
if ((nwork == 0)
#ifdef KOKKOS_IMPL_DEBUG_CUDA_SERIAL_EXECUTION
if (Kokkos::Impl::CudaInternal::cuda_use_serial_execution()) {
|| Kokkos::Impl::CudaInternal::cuda_use_serial_execution()
#endif
) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
#endif
CudaParallelLaunch<ParallelReduce, LaunchBounds>(
*this, grid, block, shmem_size_total,
@ -1670,7 +1752,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
true); // copy to device and execute
if (!m_result_ptr_device_accessible) {
Cuda().fence();
m_policy.space().fence();
if (m_result_ptr) {
if (m_unified_space) {
@ -1706,6 +1788,9 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::CudaSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr),
@ -1728,13 +1813,6 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// Return Init value if the number of worksets is zero
if (m_league_size * m_team_size == 0) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
arg_result.data());
return;
}
m_team_begin =
UseShflReduction
? 0
@ -1750,10 +1828,13 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<std::int64_t>(m_scratch_size[1]) *
(static_cast<std::int64_t>(Cuda::concurrency() /
(m_team_size * m_vector_size))));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<std::int64_t>(m_scratch_size[1]) *
(static_cast<std::int64_t>(
Cuda::concurrency() /
(m_team_size * m_vector_size))));
// The global parallel_reduce does not support vector_length other than 1 at
// the moment
@ -1805,6 +1886,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
MemorySpaceAccess<Kokkos::CudaSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(nullptr),
m_scratch_flags(nullptr),
m_unified_space(nullptr),
@ -1827,13 +1912,6 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// Return Init value if the number of worksets is zero
if (arg_policy.league_size() == 0) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
return;
}
m_team_begin =
UseShflReduction
? 0
@ -1849,10 +1927,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr[1] =
m_team_size <= 0
? nullptr
: cuda_resize_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
: m_policy.space()
.impl_internal_space_instance()
->resize_team_scratch_space(
static_cast<ptrdiff_t>(m_scratch_size[1]) *
static_cast<ptrdiff_t>(Cuda::concurrency() /
(m_team_size * m_vector_size)));
// The global parallel_reduce does not support vector_length other than 1 at
// the moment
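
The zero-length branches above, together with the matching HIP and host-backend changes below, all enforce the same contract: a `parallel_reduce` over an empty range must still run the functor's `init` (and `final`, if present) so the caller receives the reduction identity rather than stale memory. A user-level sketch of that contract, assuming a configured Kokkos build:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    double sum = -1.0;  // poison value; must end up as the identity, 0.0
    // Empty range: the body never runs, but init/final still must, so `sum`
    // becomes 0.0 instead of keeping stale data (the class of bug fixed here).
    Kokkos::parallel_reduce(
        "zero_length_sum", Kokkos::RangePolicy<>(0, 0),
        KOKKOS_LAMBDA(const int i, double& lsum) { lsum += i; }, sum);
    std::printf("sum = %g\n", sum);  // prints: sum = 0
  }
  Kokkos::finalize();
  return 0;
}
```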

View File

@ -166,6 +166,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space = nullptr;
size_type* m_scratch_flags = nullptr;
@ -230,11 +231,16 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
// Reduce with final value at blockDim.y - 1 location.
if (hip_single_inter_block_reduce_scan<false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
::Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
// Shortcut for length zero reduction
bool do_final_reduction = m_policy.begin() == m_policy.end();
if (!do_final_reduction)
do_final_reduction = hip_single_inter_block_reduce_scan<
false, ReducerTypeFwd, WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
::Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags);
if (do_final_reduction) {
// This is the final block with the final result at the final threads'
// location
@ -288,11 +294,19 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
value_type init;
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), &init);
if (Impl::hip_inter_block_shuffle_reduction<ReducerTypeFwd, ValueJoin,
WorkTagFwd>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, max_active_thread)) {
if (m_policy.begin() == m_policy.end()) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer),
reinterpret_cast<void*>(&value));
pointer_type const final_result =
m_result_ptr_device_accessible ? m_result_ptr : result;
*final_result = value;
} else if (Impl::hip_inter_block_shuffle_reduction<ReducerTypeFwd,
ValueJoin, WorkTagFwd>(
value, init,
ValueJoin(ReducerConditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags,
max_active_thread)) {
unsigned int const id = threadIdx.y * blockDim.x + threadIdx.x;
if (id == 0) {
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
@ -328,8 +342,12 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
}
inline void execute() {
const index_type nwork = m_policy.end() - m_policy.begin();
if (nwork) {
const index_type nwork = m_policy.end() - m_policy.begin();
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_size = local_block_size(m_functor);
KOKKOS_ASSERT(block_size > 0);
@ -343,10 +361,16 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
sizeof(size_type));
// REQUIRED ( 1 , N , 1 )
const dim3 block(1, block_size, 1);
dim3 block(1, block_size, 1);
// Required grid.x <= block.y
const dim3 grid(std::min(block.y, (nwork + block.y - 1) / block.y), 1, 1);
dim3 grid(std::min(block.y, static_cast<uint32_t>((nwork + block.y - 1) /
block.y)),
1, 1);
if (nwork == 0) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
const int shmem =
UseShflReduction
? 0
@ -389,6 +413,9 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr(arg_result.data()),
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible) {}
ParallelReduce(const FunctorType& arg_functor, const Policy& arg_policy,
@ -399,6 +426,10 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
m_result_ptr(reducer.view().data()),
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible) {}
};

View File

@ -646,6 +646,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
const ReducerType m_reducer;
const pointer_type m_result_ptr;
const bool m_result_ptr_device_accessible;
const bool m_result_ptr_host_accessible;
size_type* m_scratch_space;
size_type* m_scratch_flags;
size_type m_team_begin;
@ -742,11 +743,15 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
// Reduce with final value at blockDim.y - 1 location.
if (hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
reducer_conditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) {
bool do_final_reduce = (m_league_size == 0);
if (!do_final_reduce)
do_final_reduce =
hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
reducer_conditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x,
Kokkos::Experimental::kokkos_impl_hip_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags);
if (do_final_reduce) {
// This is the final block with the final result at the final threads'
// location
@ -802,11 +807,17 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
value_type init;
value_init::init(reducer_conditional::select(m_functor, m_reducer), &init);
if (Impl::hip_inter_block_shuffle_reduction<FunctorType, value_join,
work_tag>(
value, init,
value_join(reducer_conditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)) {
if (int_league_size == 0) {
Kokkos::Impl::FunctorFinal<reducer_type_fwd, work_tag_fwd>::final(
reducer_conditional::select(m_functor, m_reducer),
reinterpret_cast<void*>(&value));
*result = value;
} else if (Impl::hip_inter_block_shuffle_reduction<FunctorType, value_join,
work_tag>(
value, init,
value_join(
reducer_conditional::select(m_functor, m_reducer)),
m_scratch_space, result, m_scratch_flags, blockDim.y)) {
unsigned int const id = threadIdx.y * blockDim.x + threadIdx.x;
if (id == 0) {
Kokkos::Impl::FunctorFinal<reducer_type_fwd, work_tag_fwd>::final(
@ -818,8 +829,12 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
inline void execute() {
const int nwork = m_league_size * m_team_size;
if (nwork) {
const int nwork = m_league_size * m_team_size;
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible ||
!std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) {
const int block_count =
UseShflReduction
? std::min(
@ -837,6 +852,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
dim3 block(m_vector_size, m_team_size, 1);
dim3 grid(block_count, 1, 1);
if (nwork == 0) {
block = dim3(1, 1, 1);
grid = dim3(1, 1, 1);
}
const int shmem_size_total = m_team_begin + m_shmem_begin + m_shmem_size;
Kokkos::Experimental::Impl::HIPParallelLaunch<ParallelReduce,
@ -875,6 +894,9 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_result_ptr_device_accessible(
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ViewType::memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ViewType::memory_space>::accessible),
m_scratch_space(0),
m_scratch_flags(0),
m_team_begin(0),
@ -896,12 +918,16 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// We can't early exit here because the result place might not be accessible
// or the functor/reducer init not callable on the host. But I am not sure
// all the other code below is kosher with zero work length ...
//
// Return Init value if the number of worksets is zero
if (m_league_size * m_team_size == 0) {
value_init::init(reducer_conditional::select(m_functor, m_reducer),
arg_result.data());
return;
}
// if (m_league_size * m_team_size == 0) {
// value_init::init(reducer_conditional::select(m_functor, m_reducer),
// arg_result.data());
// return;
//}
m_team_begin =
UseShflReduction
@ -974,6 +1000,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
MemorySpaceAccess<Kokkos::Experimental::HIPSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_result_ptr_host_accessible(
MemorySpaceAccess<Kokkos::HostSpace,
typename ReducerType::result_view_type::
memory_space>::accessible),
m_scratch_space(0),
m_scratch_flags(0),
m_team_begin(0),
@ -995,12 +1025,16 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy.thread_scratch_size(0)) /
m_vector_size;
// We can't early exit here because the result place might not be accessible
// or the functor/reducer init not callable on the host. But I am not sure
// all the other code below is kosher with zero work length ...
//
// Return Init value if the number of worksets is zero
if (arg_policy.league_size() == 0) {
value_init::init(reducer_conditional::select(m_functor, m_reducer),
m_result_ptr);
return;
}
// if (arg_policy.league_size() == 0) {
// value_init::init(reducer_conditional::select(m_functor, m_reducer),
// m_result_ptr);
// return;
//}
m_team_begin =
UseShflReduction

View File

@ -1131,6 +1131,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using value_type = typename Analysis::value_type;
@ -1254,6 +1255,15 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
void execute() const {
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
return;
}
dispatch_execute_task(this, m_policy.space(), m_force_synchronous);
}
@ -1459,9 +1469,10 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
using WorkTagFwd =
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename Analysis::pointer_type;
using value_type = typename Analysis::value_type;
using reference_type = typename Analysis::reference_type;
@ -2060,9 +2071,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
using WorkTagFwd =
typename Kokkos::Impl::if_c<std::is_same<InvalidType, ReducerType>::value,
WorkTag, void>::type;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using ValueJoin = Kokkos::Impl::FunctorValueJoin<ReducerTypeFwd, WorkTagFwd>;
using ValueOps = Kokkos::Impl::FunctorValueOps<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename Analysis::pointer_type;
using reference_type = typename Analysis::reference_type;
using value_type = typename Analysis::value_type;
@ -2132,7 +2144,18 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
}
public:
void execute() const { dispatch_execute_task(this, m_policy.space()); }
void execute() const {
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
return;
}
dispatch_execute_task(this, m_policy.space());
}
inline void execute_task() const {
// See [note 1] for an explanation.
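
On the host backends the same shortcut is taken directly in `execute()`: with zero work items nothing is dispatched, and the result is produced by calling the functor/reducer `init` followed by `final` on the host. A minimal sketch of that init-then-final pattern with a hypothetical reducer:

```cpp
#include <cstdio>

// Hypothetical reducer exposing Kokkos-style init/final hooks.
struct SumReducer {
  void init(double& v) const { v = 0.0; }  // reduction identity
  void final(double&) const {}             // optional post-processing hook
};

// Host-side shortcut: with zero work items there is nothing to dispatch, so
// initialize the result to the identity and run final on it, as execute()
// above does via ValueInit::init and ValueFinal::final.
template <class Reducer>
void reduce_empty(const Reducer& r, double* result) {
  if (result) {
    r.init(*result);
    r.final(*result);
  }
}

int main() {
  double out = -1.0;
  reduce_empty(SumReducer{}, &out);
  std::printf("%g\n", out);  // 0
  return 0;
}
```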

View File

@ -297,6 +297,15 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
inline void execute() const {
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), m_result_ptr);
}
return;
}
enum {
is_dynamic = std::is_same<typename Policy::schedule_type::type,
Kokkos::Dynamic>::value
@ -1014,6 +1023,15 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
inline void execute() const {
enum { is_dynamic = std::is_same<SchedTag, Kokkos::Dynamic>::value };
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>::final(
ReducerConditional::select(m_functor, m_reducer), m_result_ptr);
}
return;
}
OpenMPExec::verify_is_master("Kokkos::OpenMP parallel_reduce");
const size_t pool_reduce_size =

View File

@ -365,7 +365,8 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
using ValueTraits =
Kokkos::Impl::FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename ValueTraits::pointer_type;
using reference_type = typename ValueTraits::reference_type;
@ -461,23 +462,32 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
public:
inline void execute() const {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
0);
if (m_policy.end() <= m_policy.begin()) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
} else {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
0);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::fence();
ThreadsExec::fence();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
}
}
}
}
@ -696,7 +706,8 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
using ValueTraits =
Kokkos::Impl::FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueInit = Kokkos::Impl::FunctorValueInit<ReducerTypeFwd, WorkTagFwd>;
using ValueFinal = Kokkos::Impl::FunctorFinal<ReducerTypeFwd, WorkTagFwd>;
using pointer_type = typename ValueTraits::pointer_type;
using reference_type = typename ValueTraits::reference_type;
@ -743,23 +754,32 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
public:
inline void execute() const {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
Policy::member_type::team_reduce_size() + m_shared);
if (m_policy.league_size() * m_policy.team_size() == 0) {
if (m_result_ptr) {
ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
ValueFinal::final(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr);
}
} else {
ThreadsExec::resize_scratch(
ValueTraits::value_size(
ReducerConditional::select(m_functor, m_reducer)),
Policy::member_type::team_reduce_size() + m_shared);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::start(&ParallelReduce::exec, this);
ThreadsExec::fence();
ThreadsExec::fence();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
if (m_result_ptr) {
const pointer_type data =
(pointer_type)ThreadsExec::root_reduce_scratch();
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
const unsigned n = ValueTraits::value_count(
ReducerConditional::select(m_functor, m_reducer));
for (unsigned i = 0; i < n; ++i) {
m_result_ptr[i] = data[i];
}
}
}
}

View File

@ -2696,8 +2696,8 @@ struct ViewDataHandle<
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using return_type = typename Traits::value_type&;
using handle_type = typename Traits::value_type* KOKKOS_RESTRICT;
using return_type = typename Traits::value_type& KOKKOS_RESTRICT;
using track_type = Kokkos::Impl::SharedAllocationTracker;
KOKKOS_INLINE_FUNCTION
@ -2725,8 +2725,12 @@ struct ViewDataHandle<
Kokkos::CudaUVMSpace>::value))
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using value_type = typename Traits::value_type;
// typedef work-around for intel compilers error #3186: expected typedef
// declaration
// NOLINTNEXTLINE(modernize-use-using)
typedef value_type* KOKKOS_IMPL_ALIGN_PTR(KOKKOS_MEMORY_ALIGNMENT)
handle_type;
using return_type = typename Traits::value_type&;
using track_type = Kokkos::Impl::SharedAllocationTracker;
@ -2766,9 +2770,13 @@ struct ViewDataHandle<
Kokkos::CudaUVMSpace>::value))
#endif
&& (!Traits::memory_traits::is_atomic))>::type> {
using value_type = typename Traits::value_type;
using handle_type = typename Traits::value_type*;
using return_type = typename Traits::value_type&;
using value_type = typename Traits::value_type;
// typedef work-around for intel compilers error #3186: expected typedef
// declaration
// NOLINTNEXTLINE(modernize-use-using)
typedef value_type* KOKKOS_IMPL_ALIGN_PTR(KOKKOS_MEMORY_ALIGNMENT)
handle_type;
using return_type = typename Traits::value_type& KOKKOS_RESTRICT;
using track_type = Kokkos::Impl::SharedAllocationTracker;
KOKKOS_INLINE_FUNCTION
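
The handle-type changes above restore what the refactor dropped (#3373): a `KOKKOS_RESTRICT` qualifier telling the compiler the handle does not alias, and an aligned-pointer `typedef` (a `typedef` rather than `using`, to sidestep Intel error #3186). A compiler-specific sketch of what the restrict qualifier buys, assuming GCC/Clang's `__restrict__` as the underlying extension:

```cpp
#include <cstddef>

// With __restrict__ (an assumption here; KOKKOS_RESTRICT maps to the active
// compiler's equivalent) the compiler may assume a, b, and c never alias,
// allowing it to vectorize the loop without runtime overlap checks.
void vector_add(std::size_t n, const double* __restrict__ a,
                const double* __restrict__ b, double* __restrict__ c) {
  for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
```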

View File

@ -295,6 +295,12 @@ if(Kokkos_ENABLE_CUDA)
cuda/TestCudaUVM_ViewMapping_b.cpp
cuda/TestCudaUVM_ViewMapping_subview.cpp
cuda/TestCuda_Spaces.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST(
UnitTest_CudaTimingBased
SOURCES
UnitTestMainInit.cpp
cuda/TestCuda_DebugSerialExecution.cpp
cuda/TestCuda_DebugPinUVMSpace.cpp
)
@ -414,9 +420,9 @@ KOKKOS_ADD_ADVANCED_TEST( UnitTest_PushFinalizeHook_terminate
endif()
if(KOKKOS_ENABLE_LIBDL)
KOKKOS_ADD_LIBRARY(
printer-tool SHARED
SOURCES tools/printing-tool.cpp
KOKKOS_ADD_TEST_LIBRARY(
kokkosprinter-tool SHARED
SOURCES tools/printing-tool.cpp
)
KOKKOS_ADD_TEST_EXECUTABLE(
@ -431,7 +437,7 @@ KOKKOS_ADD_ADVANCED_TEST( UnitTest_PushFinalizeHook_terminate
KOKKOS_ADD_TEST( NAME ProfilingTestLibraryLoad
EXE ProfilingAllCalls
TOOL printer-tool
TOOL kokkosprinter-tool
PASS_REGULAR_EXPRESSION "kokkosp_init_library::kokkosp_allocate_data:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_begin_parallel_for:Kokkos::View::initialization [[]source]:0:0::kokkosp_end_parallel_for:0::kokkosp_allocate_data:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:40::kokkosp_begin_parallel_for:Kokkos::View::initialization [[]destination]:0:0::kokkosp_end_parallel_for:0::kokkosp_begin_deep_copy:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_end_deep_copy::kokkosp_begin_parallel_for:parallel_for:${SIZE_REGEX}:0::kokkosp_end_parallel_for:0::kokkosp_begin_parallel_reduce:parallel_reduce:0:1${SKIP_SCRATCH_INITIALIZATION_REGEX}::kokkosp_end_parallel_reduce:1::kokkosp_begin_parallel_scan:parallel_scan:${SIZE_REGEX}:2::kokkosp_end_parallel_scan:2::kokkosp_push_profile_region:push_region::kokkosp_pop_profile_region::kokkosp_create_profile_section:created_section:3::kokkosp_start_profile_section:3::kokkosp_stop_profile_section:3::kokkosp_destroy_profile_section:3::kokkosp_profile_event:profiling_event::kokkosp_deallocate_data:${MEMSPACE_REGEX}:destination:${ADDRESS_REGEX}:40::kokkosp_deallocate_data:${MEMSPACE_REGEX}:source:${ADDRESS_REGEX}:40::kokkosp_finalize_library::"
)
endif() #KOKKOS_ENABLE_LIBDL