Passes same tests as the current deviceRTL. Includes cmake change from D111987.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112227
We do not generate _serialized_parallel calls in device mode, so there is
no need for an external API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112145
Exiting a data environment will reset all values, so it is wrong to adjust
them afterwards.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112144
We will later use the fact that a barrier is aligned to reason about
thread divergence. For now we introduce the assumption and some more
documentation.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112153
The OpenMP thread ID is not the hardware thread ID if we have nesting.
We need to ask the runtime properly to ensure correct results.
Note that the loop interface is going to change soon so we do not adjust
it now but simply ignore the extra argument.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111950
The team size could/should be an ICV but since we know it is either 1 or
a value we can leave it in the team state for now. However, we still
need to determine if the current level is nested before we use it.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D111949
The first thread state in the new GPU runtime doesn't have a previous
one and we should not dereference the nullptr placeholder.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111946
Essentially moves the foreach over sm integers into a macro and instantiates it for nvptx.
NFC in that the macro is not presently instantiated for amdgpu as the corresponding code doesn't compile yet.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D111987
Implemented by patching python config instead of modifying all
the tests so that -generic and XFAIL work as usual. Expectation is for
this to be reverted once the old runtime is deleted.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D112225
Step towards building the DeviceRTL for amdgpu.
Mostly replaces cuda-specific toolchain finding logic with the
generic logic currently found in the amdgpu deviceRTL cmake. Also
deletes dead code and changes the default to build on systems
without cuda installed, as the library doesn't use cuda and the
amdgpu-only systems generally won't have cuda installed.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D111983
The plugin currently uses a macro to check if this is a debug build
before assigning the debug kind variable to the device environment
struct. This is being deprecated because the new device runtime does not
maintain separate debug builds and should always be available.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112083
D110279 introduced a bug to the device runtime. In `__kmpc_parallel_51`, we detect
whether we are already in parallel region by `__kmpc_parallel_level() > __kmpc_is_spmd_exec_mode()`.
It is based on the assumption that:
- In SPMD mode, parallel level is initialized to 1.
- In generic mode, parallel level is initialized to 0.
- `__kmpc_is_spmd_exec_mode` returns `1` for SPMD mode, 0 otherwise.
Because the return value type of `__kmpc_is_spmd_exec_mode` is `int8_t`, there
was an implicit cast from `bool` to `int8_t`. We can make sure it is either 0 or
1 since C++14. In D110279, the return value is the result of an `and` operation,
which is 2 in SPMD mode. This breaks the assumption in `__kmpc_parallel_51`.
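The failure mode can be sketched on the host as follows. The names
`ModeFlags`, `isSPMDRaw`, `isSPMDNormalized` and `inParallelRegion` are
hypothetical stand-ins, not the runtime's definitions; only the bit value of
the SPMD flag and the `level > is_spmd` comparison come from the text above.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the bitset encoding introduced in D110029.
enum ModeFlags : int8_t { MODE_GENERIC = 0x1, MODE_SPMD = 0x2 };

// Buggy shape: the masked value leaks out, so SPMD mode yields 2.
int8_t isSPMDRaw(int8_t Mode) { return Mode & MODE_SPMD; }

// Fixed shape: convert to bool first, so the result is always 0 or 1.
int8_t isSPMDNormalized(int8_t Mode) {
  return static_cast<bool>(Mode & MODE_SPMD);
}

// The nesting check described above: we are already inside a parallel
// region if the parallel level exceeds the initial level for the mode.
bool inParallelRegion(int8_t ParallelLevel, int8_t IsSPMD) {
  return ParallelLevel > IsSPMD;
}
```

With a parallel level of 2 in SPMD mode, the normalized query reports a
nested region while the raw query does not, which is the broken assumption.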
Reviewed By: carlo.bertolli, dpalermo
Differential Revision: https://reviews.llvm.org/D111905
This patch adds support for the
`__kmpc_get_hardware_num_threads_in_block` function that returns the
number of threads. This was missing in the new runtime and was used by
the AMDGPU plugin, which prevented it from using the new runtime. This
patch also unifies the interface for getting the thread numbers in the
frontend.
Originally authored by jdoerfert.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D111475
Until we hit the first barrier we should not call `mapping::isSPMDMode`
with all threads. Instead, we now have (and use during initialization) a
`mapping::isMainThreadInGenericMode` overload that takes the known
SPMD-mode state and one that queries it.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111381
This patch adds an external interface to access the dynamic shared
memory buffer in the device runtime. The function introduced is
``llvm_omp_get_dynamic_shared``. This includes a host-side
definition that only returns a null pointer so that it can be used when
host-fallback is enabled without crashing. Support for dynamic shared
memory was also ported to the old device runtime.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110957
For NVPTX, `printf` can be used with just a function declaration. For AMDGCN, a
function definition is added, but it simply returns.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109728
We need to synchronize the threads *before* we destroy the RAII objects
that hold the old values and not after to avoid threads executing the
parallel region but seeing an inconsistent state.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111369
Follow on to D110006, related to D110957
Where implementations have diverged this resolves to match the new DeviceRTL
- replaces definitions of this struct in deviceRTL and plugins with include
- changes the dynamic_shared_size field from D110006 to 32 bits
- handles stdint being unavailable in DeviceRTL
- adds a zero initializer for the field to amdgpu
- moves the extern declaration for deviceRTL to target_interface
(omptarget.h is more natural, but doesn't work due to include order
with debug.h)
- Renames the fields everywhere to match the LLVM format used in DeviceRTL
- Makes debug_level uint32_t everywhere (previously sometimes int32_t)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111069
The hand-rolled linking logic in elf_common does not account for
the possibility of using LLVM dylib rather than a dozen static
libraries. Since it does not seem to be easily convertible
to add_llvm_library, just hand-roll support for LLVM_LINK_LLVM_DYLIB.
This is necessary to support stand-alone builds against installed LLVM.
Differential Revision: https://reviews.llvm.org/D111038
Fixes 51982. Adds a missing CreatePointerCast and allocates a global in
the correct address space.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
Use enum for execution mode.
This is partly a port from ROCm and partly a port from D110029. Attempted to
make the same choices as ROCm as far as comments etc go to reduce the merge
conflicts.
There is some cleanup warranted here - in particular I like the cuda patch
factoring out the comparisons into named variables - but I'd like to leave
that for a follow up patch, keeping this one minimal.
Reviewed By: carlo.bertolli
Differential Revision: https://reviews.llvm.org/D110845
Fixes: SWDEV-275232 (With contributions from Ammar Elwazir, Laurent Morichetti, and Tony Tye)
The current code is racy. After the packet is submitted, the GPU will increment the read index. If this wraps around before the memory is read from it'll refer to a signal from an unrelated packet. Change avoids reading from the packet post-submission.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110679
Fixes 51982. Minor refactor to remove `return x = y` construct.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
This patch defines the newly added `__kmpc_distribute_static_init`
functions in the device runtime library. These functions are currently
exact copies of the current worksharing method but can be tuned later.
Depends on D110429
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110430
Use the in-project clang, llvm-link and opt if available and unless
CMake cache variables specify to use a different compiler. This applies
D101265 to the new DeviceRTL's CMakeLists.txt which was copied before
D101265 was applied.
Fixes the openmp-offloading-cuda-runtime builder which was failing
since D110006.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110251
Store queues in unique_ptr so they are destroyed when the global DeviceInfo is. Currently they leak which raises an assert in debug builds of hsa.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109511
This patch fixes a data race observed when using the new device runtime
library. The internal control variable for the parallel level is read in
the `__kmpc_parallel_51` function while it could potentially be written
by other threads. This causes data corruption and will cause
nondeterministic behaviour in the runtime. This patch fixes this by adding
an explicit synchronization before the region starts.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110366
This is a follow-up of D110029, which uses a bitset to indicate the execution mode. This patch makes the corresponding changes in the function call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110279
This patch adds support for an RAII struct that will print function
traces when placed inside of a function declaration. Each successive
call will increase the indentation to make it easier to visually
inspect.
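A host-side sketch of the idea, not the runtime's actual struct: the names
`FunctionTrace`, `TraceDepth` and `TraceLog` are hypothetical, and the trace
is collected in a string here instead of being printed on the device.

```cpp
#include <string>

// Hypothetical global state: current nesting depth and the collected trace.
static int TraceDepth = 0;
static std::string TraceLog;

// RAII tracer: logs on construction and destruction, indenting by depth.
struct FunctionTrace {
  const char *Name;
  explicit FunctionTrace(const char *Name) : Name(Name) {
    TraceLog += std::string(2 * TraceDepth, ' ') + "enter " + Name + "\n";
    ++TraceDepth;
  }
  ~FunctionTrace() {
    --TraceDepth;
    TraceLog += std::string(2 * TraceDepth, ' ') + "exit " + Name + "\n";
  }
};

// Placing one tracer at the top of each function body is enough.
void inner() { FunctionTrace T("inner"); }
void outer() {
  FunctionTrace T("outer");
  inner();
}
```

Calling `outer()` records the `inner` lines indented by two spaces between
the `outer` enter and exit lines.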
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110202
The execution mode of a kernel is stored in a global variable, whose value means:
- 0 - SPMD mode
- 1 - generic mode
- 2 - SPMD mode execution with generic mode semantics
We are going to add support for SIMD execution mode. It will come with another
execution mode, such as SIMD-generic mode. As a result, this value-based indicator
is not flexible enough.
This patch changes to a bitset-based solution to encode the execution mode. Each
position is:
[0] - generic mode
[1] - SPMD mode
[2] - SIMD mode (will be added later)
In this way, `0x1` is generic mode, `0x2` is SPMD mode, and `0x3` is SPMD mode
execution with generic mode semantics. In the future after we add the support for
SIMD mode, `0b1xx` will be in SIMD mode.
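The encoding can be sketched as a flag enum. The identifiers below
(`ExecModeBits`, `hasMode`) are hypothetical and chosen for illustration; only
the bit positions are taken from the description above.

```cpp
#include <cstdint>

// Hypothetical names; bit positions follow the list above.
enum ExecModeBits : uint8_t {
  EXEC_GENERIC = 1u << 0,                       // bit 0
  EXEC_SPMD = 1u << 1,                          // bit 1
  EXEC_GENERIC_SPMD = EXEC_GENERIC | EXEC_SPMD, // SPMD with generic semantics
  EXEC_SIMD = 1u << 2,                          // bit 2, to be added later
};

// Tests whether a given bit is set in an execution mode value.
constexpr bool hasMode(uint8_t Mode, ExecModeBits Bit) {
  return (Mode & Bit) != 0;
}
```

A new mode only needs a new bit, and existing queries such as
`hasMode(Mode, EXEC_SPMD)` keep working for combined values.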
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110029
Summary:
The thread ID function was reintroduced in D110195, but could
potentially be removed by the optimizer. Make the function noinline to
preserve the call sites and add it to the externalization RAII so its
definition is not removed by the attributor.
The new device runtime library currently lacks the
`kmpc_get_hardware_thread_id_in_block` function which is currently used
when doing the SPMDzation optimization. This call would be introduced
through the optimization and then cause a linking error because it was
not present. This patch adds support for this runtime call.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110195
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
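A rough sketch of the aggregated form, with a sequential loop standing in for
the real fork: `Captures`, `forkCall` and `outlinedBody` are hypothetical names
and do not reflect the actual runtime entry point or its signature.

```cpp
// Hypothetical aggregate holding everything the parallel region captures.
struct Captures {
  int *Data;
  int N;
  int Scale;
};

// The outlined function takes one pointer instead of a variadic list.
using OutlinedFn = void (*)(int ThreadId, void *Args);

// Sequential stand-in for a fork call: forwards the single aggregate
// pointer to the outlined function once per simulated thread.
void forkCall(int NumThreads, OutlinedFn Fn, void *Args) {
  for (int T = 0; T < NumThreads; ++T)
    Fn(T, Args);
}

// Outlined body: unpacks the aggregate with its original field types.
void outlinedBody(int ThreadId, void *Args) {
  Captures *C = static_cast<Captures *>(Args);
  if (ThreadId < C->N)
    C->Data[ThreadId] = ThreadId * C->Scale;
}
```

Because the fields keep their declared types, nothing has to be cast to a
pointer individually and the number of captures is not bounded by the call
interface.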
Reviewed By: jdoerfert, jhuber6
Differential Revision: https://reviews.llvm.org/D102107
This patch adds support for using dynamic shared memory in the new
device runtime. The new function `__kmpc_get_dynamic_shared` will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.
In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110006
This patch adds fields for the device number and number of devices into
the device environment struct and debugging values.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110004
This patch implements the `__assert_fail` function in the new device
runtime. This allows users and developers to use the standard assert
function inside of the device.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D109886
The definitions of OFFLOAD_SUCCESS and OFFLOAD_FAIL used in plugin APIs and libomptarget public APIs are not consistent.
Create __tgt_target_return_t for libomptarget public APIs.
Differential Revision: https://reviews.llvm.org/D109304
The hsa library must be initialized before any calls into it and
destructed after the last call into it. There have been a number of bugs in
this area related to member variables which would like to use raii to manage
resources acquired from hsa.
This patch moves the init/shutdown of hsa into a class, such that when used as
the first member variable (could be a base), the lifetimes of other member
variables are reliably scoped within it. This will allow other classes to use
raii reliably when used as member variables within the global.
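The ordering guarantee this relies on is that C++ constructs members in
declaration order and destroys them in reverse. A small sketch with
hypothetical types (`LibraryLifetime`, `Queue`, `DeviceInfo`) that record
events instead of calling into hsa:

```cpp
#include <string>
#include <vector>

// Records construction and destruction order for inspection.
static std::vector<std::string> Events;

// Hypothetical stand-in for the class wrapping library init/shutdown.
struct LibraryLifetime {
  LibraryLifetime() { Events.push_back("init"); }
  ~LibraryLifetime() { Events.push_back("shutdown"); }
};

// Hypothetical resource that must be released before shutdown.
struct Queue {
  Queue() { Events.push_back("queue create"); }
  ~Queue() { Events.push_back("queue destroy"); }
};

struct DeviceInfo {
  LibraryLifetime Lifetime; // first member: constructed first, destroyed last
  Queue Q;                  // lives strictly inside Lifetime's scope
};

// Creates and destroys one DeviceInfo so the order is recorded.
void runLifetimeDemo() { DeviceInfo Info; }
```

After `runLifetimeDemo()` the recorded order is init, queue create, queue
destroy, shutdown.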
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109512
Given D109057, change test runner to use the libomptarget-x-bc-path
argument instead of the LIBRARY_PATH environment variable to find the device
library.
Also drop the use of LIBRARY_PATH environment variable as it is far
too easy to pull in the device library from an unrelated toolchain by accident
with the current setup. No loss in flexibility to developers as the clang
commandline used here is still available.
Reviewed By: jdoerfert, tianshilei1992
Differential Revision: https://reviews.llvm.org/D109061
Using std::vector<DeviceTy> requires implementing a copy constructor and copy assignment operator for DeviceTy.
Indeed, DeviceTy should never be copied. After changing to std::vector<std::unique_ptr<DeviceTy>>,
all the unsafe copy constructor and copy assignment operator implementations can be removed.
Compilers mark them deleted due to the mutex or underlying objects, and this is the desired behavior.
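A minimal sketch of the pattern, using a hypothetical `Device` type in place
of DeviceTy: the mutex member makes the type non-copyable, and the vector only
ever moves the owning pointers.

```cpp
#include <memory>
#include <mutex>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for DeviceTy.
struct Device {
  int ID;
  std::mutex Lock; // makes the copy operations implicitly deleted
  explicit Device(int ID) : ID(ID) {}
};

static_assert(!std::is_copy_constructible<Device>::value,
              "Device must not be copyable");

// The vector owns the devices through unique_ptr, so growing it moves
// pointers around and never copies or relocates a Device.
std::vector<std::unique_ptr<Device>> makeDevices(int N) {
  std::vector<std::unique_ptr<Device>> Devices;
  for (int I = 0; I < N; ++I)
    Devices.push_back(std::make_unique<Device>(I));
  return Devices;
}
```

A side effect is that the address of each device stays stable when the vector
reallocates.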
Differential Revision: https://reviews.llvm.org/D109276
Use the same debug print as the rest of libomptarget plugins with
the same environment control. Also drop the max queue size debugging hook as
I don't believe it is still in use, can bring it back near the rest of the env
handling in rtl.cpp if someone objects.
That makes most of rt.h and all of utils.cpp unused. Clean that up and simplify
control flow in a couple of places.
Behaviour change is that debug prints that used to use the old environment
variable now use the new one and print in slightly different format, and the
removal of the max queue size variable.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D108784
Use unique_ptr to achieve the effect of mutable.
Remove mutable keyword of DynRefCount and HoldRefCount
Remove std::shared_ptr from UpdateMtx
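The underlying C++ property is that constness does not propagate through a
`unique_ptr`, so state behind the pointer can change even when the owning
object is only reachable as const. The sketch below uses a hypothetical
`MapEntry` type; that the real entries are accessed through const references
is an assumption here, not something the message states.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical entry type: the key is immutable, the counters are not.
struct MapEntry {
  struct Counts {
    uint64_t DynRefCount = 0;
    uint64_t HoldRefCount = 0;
  };

  const uintptr_t HostBegin;

  explicit MapEntry(uintptr_t HostBegin)
      : HostBegin(HostBegin), State(std::make_unique<Counts>()) {}

  // Const member functions may still modify the pointee of State.
  void incRef() const { ++State->DynRefCount; }
  uint64_t refCount() const { return State->DynRefCount; }

private:
  std::unique_ptr<Counts> State; // no mutable keyword needed
};
```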
Reviewed By: tianshilei1992, grokos
Differential Revision: https://reviews.llvm.org/D109007
As started in D107925, this patch replaces the remaining occurrences
of `UNIFIED_SHARED_MEMORY && TgtPtrBegin == HstPtrBegin` in
`omptarget.cpp` with `IsHostPtr`. The former condition is broken in
the rare case that the device and host happen to use the same address
for their mapped allocations. I don't know how to write a test that's
likely to reveal this case.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D107928
As discussed in D105990, without this patch, `targetDataBegin`
determines whether to transfer data (as opposed to assuming it's in
shared memory) using the condition `!UseUSM || HasCloseModifier`.
However, this condition is broken if use of discrete memory was forced
by `omp_target_associate_ptr`. This patch extends
`unified_shared_memory/associate_ptr.c` to reveal this case, and it
fixes it using `!IsHostPtr` in `DeviceTy::getTargetPointer` to replace
this condition.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D107927
This patch is based on comments in D105990. It is NFC according to
the following observations:
1. `CopyMember` is computed as `!IsHostPtr && IsLast`.
2. `DelEntry` is true only if `IsLast` is true.
We apply those observations in order:
```
if ((DelEntry || Always || CopyMember) && !IsHostPtr)
if ((DelEntry || Always || IsLast) && !IsHostPtr)
if ((Always || IsLast) && !IsHostPtr)
```
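The equivalence can be checked exhaustively. The sketch below enumerates
every combination of inputs allowed by the two observations; the function
names are hypothetical and the code is only a truth-table check, not part of
`omptarget.cpp`.

```cpp
// First form of the condition.
bool conditionOriginal(bool DelEntry, bool Always, bool CopyMember,
                       bool IsHostPtr) {
  return (DelEntry || Always || CopyMember) && !IsHostPtr;
}

// Last form of the condition.
bool conditionSimplified(bool Always, bool IsLast, bool IsHostPtr) {
  return (Always || IsLast) && !IsHostPtr;
}

// Returns true if both forms agree on every input permitted by the
// observations: CopyMember == (!IsHostPtr && IsLast), DelEntry => IsLast.
bool formsAgree() {
  for (int Bits = 0; Bits < 16; ++Bits) {
    bool Always = (Bits & 1) != 0;
    bool IsLast = (Bits & 2) != 0;
    bool IsHostPtr = (Bits & 4) != 0;
    bool WantDelete = (Bits & 8) != 0;
    bool DelEntry = WantDelete && IsLast;   // observation 2
    bool CopyMember = !IsHostPtr && IsLast; // observation 1
    if (conditionOriginal(DelEntry, Always, CopyMember, IsHostPtr) !=
        conditionSimplified(Always, IsLast, IsHostPtr))
      return false;
  }
  return true;
}
```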
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D107926
As discussed in D105990, without this patch, `targetDataEnd`
determines whether to transfer data or delete a device mapping (as
opposed to assuming it's in shared memory) using two different
conditions, each of which is broken for some cases:
1. `!(UNIFIED_SHARED_MEMORY && TgtPtrBegin == HstPtrBegin)`: The
broken case is rare: the device and host might happen to use the
same address for their mapped allocations. I don't know how to
write a test that's likely to reveal this case, but this patch does
fix it, as discussed below.
2. `!UNIFIED_SHARED_MEMORY || HasCloseModifier`: There are at least
two broken cases:
1. The `close` modifier might have been specified on an `omp
target enter data` but not the corresponding `omp target exit
data`, which thus might falsely assume a mapping is in shared
memory. The test `unified_shared_memory/close_enter_exit.c`
already has a missing deletion as a result, and this patch adds
a check for that. This patch also adds the new test
`close_member.c` to reveal a missing transfer and deletion.
2. Use of discrete memory might have been forced by
`omp_target_associate_ptr`, as in the test
`unified_shared_memory/api.c`. In the current `targetDataEnd`
implementation, this condition turns out not to be used for this
case: because the reference count is infinite, a transfer is
possible only with an `always` modifier, and this condition is
never used in that case. To ensure it's never used for that
case in the future, this patch adds the test
`unified_shared_memory/associate_ptr.c`.
Fortunately, `DeviceTy::getTgtPtrBegin` already has a solution: it
reports whether the allocation was found in shared memory via the
variable `IsHostPtr`.
After this patch, `HasCloseModifier` is no longer used in
`targetDataEnd`, and I wonder if the `close` modifier is ever useful
on an `omp target data end`.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D107925
Given D109057, change test runner to use the libomptarget-x-bc-path
argument instead of the LIBRARY_PATH environment variable to find the device
library.
Also drop the use of LIBRARY_PATH environment variable as it is far
too easy to pull in the device library from an unrelated toolchain by accident
with the current setup. No loss in flexibility to developers as the clang
commandline used here is still available.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109061
Using rpath instead of LD_LIBRARY_PATH to find libomp.so and
libomptarget.so lets one rerun the already built test executables without
setting environment variables and removes the risk of the test runner picking
up different libraries to the developer debugging the failure.
rpath usually means runpath, which is not transitive, so set runpath on
libomptarget itself so that it can find the plugins located next to it,
spelled $ORIGIN. This provides sufficient functionality to drop D102043
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D109071
This patch implements OpenMP runtime support for an original OpenMP
extension we have developed to support OpenACC: the `ompx_hold` map
type modifier. The previous patch in this series, D106509, implements
Clang support and documents the new functionality in detail.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D106510
In some build configurations, the target we depend on is not available for declaring the build dependency.
We only need to declare the build dependency, if the build target is available in the same build.
Fixes the issue raised in https://reviews.llvm.org/D107156#2969862
This patch should go into release/13 together with D108404
Differential Revision: https://reviews.llvm.org/D108868
`CU_EVENT_DEFAULT` is defined in CUDA header. It should be added to
`openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.h` for CUDA free build.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108878
This patch adds support for event-related interfaces, which will be used
later to fix a data race. See D104418 for more details.
Reviewed By: jdoerfert, ye-luo
Differential Revision: https://reviews.llvm.org/D108528
Lets wavefront size be 32 for amdgpu openmp, as well as 64.
Fixes up as little as possible to pass that through the libraries. This change
is end to end, as opposed to updating clang/devicertl/plugin separately. It can
be broken up for review/commit if preferred. Posting as-is so that others with
a gfx10 can try it out. It works roughly as well as gfx9 for me, but there are
probably bugs remaining as well as the todo: for letting grid values vary more.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D108708
Lets the amdgpu plugin write to omptarget_device_environment
to enable debugging. Intend to use in the near future to record the
wavesize that a given deviceRTL was compiled with for running on hardware
that supports 32 or 64.
Patch sets all the attributes that are useful. Notably .data means the variable
is set by writing to host memory before copying to the GPU instead of launching
a kernel to update the image. Can simplify the plugin slightly to drop the
code for patching after load if this is used consistently.
NFC on nvptx, cuda plugin seems to work fine without any annotations.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108698
Move most debug printing in rtl.cpp behind DP() macro
Adjust the print output for gpu arch mismatch when the architectures match
Convert an assert into graceful failure
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108562
The use of `$<TARGET_FILE:clang>` was adapted too broadly from D101265.
Fixes llvm.org/PR51579
Also see discussion in D108534.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D108640
With uses of g_atl_machine gone, a significant portion of dead
code has been removed.
This patch depends on D104691 and D104695.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104696
Compiled nvptx devicertl as freestanding, breaking the
dependency on host glibc and gcc-multilibs. Thus build it by default.
Comes at the cost of #defining out printf. Tried mapping it onto
__builtin_printf but that gets transformed back to printf instead
of hitting the cuda/openmp lowering transform.
Printf could be preserved by one of:
- dropping all the standard headers and `-ffreestanding`
- providing a header only printf implementation
- changing the compiler handling of printf
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D108349
Add include path to the cmakefiles and set the target_impl enums
from the llvm constants instead of copying the values.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108391
D107156 and D107320 are not sufficient when OpenMP is built as llvm runtime
(LLVM_ENABLE_RUNTIMES=openmp) because dependencies only work within the same
cmake instance.
We could limit the dependency to cases where libomptarget/plugins are really
built. But compared to the whole llvm project, building openmp runtime is
negligible and postponing the build of OpenMP runtime after the dependencies
are ready seems reasonable.
The direct dependency introduced in D107156 and D107320 is necessary for the
case where OpenMP is built as llvm project (LLVM_ENABLE_PROJECTS=openmp).
Differential Revision: https://reviews.llvm.org/D108404