In much of the libomptarget interface we have an ident_t object now, if
it is not null we can use it to improve the profile output. For now, we
simply use the ident_t "source information string" as generated by the
FE.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95282
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
[libomptarget][cuda] Gracefully handle missing cuda library
If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95412
`omp_is_initial_device` in device code was implemented as a builtin
function in D38968 for a better performance. Therefore there is no chance that
this function will be called to `deviceRTLs`. As we're moving to build `deviceRTLs`
with OpenMP compiler, this function can lead to a compilation error. This patch
just simply removes it.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95397
This patch makes prep for dropping CUDA when compiling `deviceRTLs`.
CUDA intrinsics are replaced by NVVM intrinsics which refers to code in
`__clang_cuda_intrinsics.h`. We don't want to directly include it because in the
near future we're going to switch to OpenMP and by then the header cannot be
used anymore.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95327
D95161 removed the option `--libomptarget-nvptx-path`, which is used in
the tests for `libomptarget-nvptx`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95293
[libomptarget][nvptx] Replace cuda atomic primitives with clang intrinsics
Tested by diff of IR generated for target_impl.cu before and after. NFC. Part
of removing deviceRTL build time dependency on cuda SDK.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95294
[libomptarget][cuda] Call v2 functions explicitly
rtl.cpp calls functions like cuMemFree that are replaced by a macro
in cuda.h with cuMemFree_v2. This patch changes the source to use
the v2 names consistently.
See also D95104, D95155 for the idea. Alternatives are to use a mixture,
e.g. call the macro names and explictly dlopen the _v2 names, or to keep
the current status where the symbols are replaced by macros in both files
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95274
[libomptarget] Build cuda plugin without cuda installed locally
Compiles a new file, `plugins/cuda/dynamic_cuda/cuda.cpp`, to an object file that exposes the same symbols that the plugin presently uses from libcuda. The object file contains dlopen of libcuda and cached dlsym calls. Also provides a cuda.h containing the subset that is used.
This lets the cmake file choose between the system cuda and a dlopen shim, with no changes to rtl.cpp.
The corresponding change to amdgpu is postponed until after a refactor of the plugin to reduce the size of the hsa.h stub required
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95155
The buckets are initialized in __kmp_dephash_create but when they are extended
the memory is allocated but not NULL'd, potentially leaving some buckets
uninitialized after all entries have been copied into the new allocation.
This commit makes sure the buckets are properly initialized with NULL before
copying the entries.
Differential Revision: https://reviews.llvm.org/D95167
[libomptarget][devicertl] Drop templated atomic functions
The five __kmpc_atomic templates are instantiated a total of seven times.
This change replaces the template with explictly typed functions, which
have the same prototype for amdgcn and nvptx, and implements them with
the same code presently in use.
Rolls in the accepted but not yet landed D95085.
The unsigned long long type can be replaced with uint64_t when replacing
the cuda function. Until then, clang warns on casting a pointer to one to
a pointer to the other.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95093
Summary:
Prior to D91261 the information checked the OMP_MAP_TARGET_PARAM flag, change this as it has been removed. The INFO macro was changed to accept a flag as input to make conditionally printing information easier.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95133
Profiling has been recently implemented in libomptarget (D93055). This patch enables time profiling support for libomptarget in libomp, to support profiling of multi-threaded execution of offloaded regions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94855
Pretty similar to D95058, this patch added forward declaration for
CUDA atomic functions. We already have definitions with right mangled names in
internal CUDA headers so the forward declaration here can work properly.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D95085
Summary:
The custom mapper API did not previously support the mapping names added previously. This means they were not present if a user requested debugging information while using the mapper functions. This adds basic support for passing the mapped names to the runtime library.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D94806
Once we switch to build deviceRTLs with OpenMP, primitives and CUDA
intrinsics cannot be used directly anymore because `__device__` is not recognized
by OpenMP compiler. To avoid involving all CUDA internal headers we had in `clang`,
we forward declared these functions. Eventually they will be transformed into
right LLVM instrinsics.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95058
[libomptarget][devicertl][nfc] Simplify target_atomic abstraction
Atomic functions were implemented as a shim around cuda's atomics, with
amdgcn implementing those symbols as a shim around gcc style intrinsics.
This patch folds target_atomic.h into target_impl.h and folds amdgcn.
Further work is likely to be useful here, either changing to openmp's atomic
interface or instantiating the templates on the few used types in order to
move them into a cuda/c++ implementation file. This change is mostly to
group the remaining uses of the cuda api under nvptx' target_impl abstraction.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95062
[libomptarget][devicertl][nfc] Remove some cuda intrinsics, simplify
Replace __popc, __ffs with clang intrinsics. Move kmpc_impl_min to only file
that uses it and replace template with explictly typed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95060
Replaced CUDA builtin vars with LLVM intrinsics such that we don't need
definitions of those intrinsics.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95013
[libomptarget][devicertl] Wrap source in declare target pragmas
Factored out of D93135 / D94745. C++ and cuda ignore unknown pragmas
so this is a NFC for the current implementation language. Removes noise
from patches for building deviceRTL as openmp.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95048
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
[libomptarget][nvptx] Reduce calls to cuda header
Remove use of clock_t in favour of a builtin. Drop a preprocessor branch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94731
[libomptarget][nvptx][nfc] Move target_impl functions out of header
This removes most of the differences between the two target_impl.h.
Also change name mangling from C to C++ for __kmpc_impl_*_lock.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D94728
`omptarget-nvptx` is still a dependence for `check-libomptarget-nvtpx`
although it has been removed by D94573.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94725
The comment said CUDA 9 header files use the `nv_weak` attribute which
`clang` is not yet prepared to handle. It's three years ago and now things have
changed. Based on my test, removing the definition doesn't have any problem on
my machine with CUDA 11.1 installed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94700
For NVPTX target, OpenMP provides a static library `libomptarget-nvptx`
built by NVCC, and another bitcode `libomptarget-nvptx-sm_{$sm}.bc` generated by
Clang. When compiling an OpenMP program, the `.bc` file will be fed to `clang`
in the second run on the program that compiles the target part. Then the generated
PTX file will be fed to `ptxas` to generate the object file, and finally the driver
invokes `nvlink` to generate the binary, where the static library will be appened
to `nvlink`.
One question is, why do we need two libraries? The only difference is, the static
library contains `omp_data.cu` and the bitcode library doesn't. It's unclear why
they were implemented in this way, but per D94565, there is no issue if we also
include the file into the bitcode library. Therefore, we can safely drop the
static library.
This patch is about the change in OpenMP. The driver will be updated as well if
this patch is accepted.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D94573
Restore control of kernel launch tracing to be >= 1 as it was before
export LIBOMPTARGET_KERNEL_TRACE=1
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94695
Hierarchical barrier is an experimental barrier algorithm that uses aspects
of machine hierarchy to define the barrier tree structure. This patch fixes
offset calculation in hierarchical barrier. The offset is used to store info
on a flag about sleeping threads waiting on a location stored in the flag.
This commit also fixes a potential deadlock in hierarchical barrier when
using infinite blocktime by adjusting the offset value of leaf kids so that
it matches the value of leaf state. It also adds testing of default barriers
with infinite blocktime, and also tests hierarchical barrier algorithm with
both default and infinite blocktime.
Patch by Terry Wilmarth and Nawrin Sultana.
Differential Revision: https://reviews.llvm.org/D94241
Add extra information to the runtime page describing the error messages and add information to the release notes for clang 12.0
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94562
This change enables volatile use of persistent memory for omp_large_cap_mem*
on supported systems. It depends on libmemkind's support for persistent memory,
and requirements/details can be found at the following url.
https://pmem.io/2020/01/20/memkind-dax-kmem.html
Differential Revision: https://reviews.llvm.org/D94353
Constant static data member can be defined in the class without another
define after the class in C++17. Although it is C++17, Clang can still handle it
even w/o the flag for C++17. Unluckily, GCC cannot handle that.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D94541
[libomptarget][amdgpu][nfc] Fix build on centos
rtl.cpp replaced 224 with a #define from elf.h, but that
doesn't work on a centos 7 build machine with an old elf.h
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D94528
Some LLVM headers are generated by CMake. Before the installation,
LLVM's headers are distributed everywhere, some of which are in
`${LLVM_SRC_ROOT}/llvm/include/llvm`, and some are in
`${LLVM_BINARY_ROOT}/include/llvm`. After intallation, they're all in
`${LLVM_INSTALLATION_ROOT}/include/llvm`.
OpenMP now depends on LLVM headers. Some headers depend on headers generated
by CMake. When building OpenMP along with LLVM, a.k.a via `LLVM_ENABLE_RUNTIMES`,
we need to tell OpenMP where it can find those headers, especially those still
have not been copied/installed.
Reviewed By: jdoerfert, jhuber6
Differential Revision: https://reviews.llvm.org/D94534
The lifetime of `libomptarget` and its opened plugins are not aligned
and it's hard for `libomptarget` to determine when the plugins are destroyed.
As a result, some issues (see D94256 for details) occur on some platforms.
Actually, if we take target memory as target resources, same as other resources,
such as CUDA streams, in each plugin, then the memory manager should also be in
the plugin. Also considering some platforms may want to opt out the feature, it
makes sense to move the memory manager to plugin, make it a common interface, and
let plguin developers determine whether they need it. This is what this patch does.
CUDA plugin is taken as example to show how to integrate it. In this way, we can
also get a bonus that different thresholds can be set for different platforms.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D94379
For now `elf_common.c` is taken as a common part included into
different plugin implementations directly via
`#include "../../common/elf_common.c"`, which is not a best practice. Since it
is simple enough such that we don't need to create a real library for it, we just
take it as a interface library so that other targets can link it directly. Another
advantage of this method is, we don't need to add the folder into header search
path which can potentially pollute the search path.
VE and AMD platforms have not been tested because I don't have target machines.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94443