This patch adds the `cold` attribute to the keepAlive functions in the
RTL. This dummy function exists to keep certain RTL calls alive without
them being optimized out, but it is never called and can be declared
cold. This also helps avoid some erroneous remarks being emitted for this
function because it has weak linkage and cannot be made internal.
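
As a rough illustration of the idea (not the RTL's actual code), a keep-alive function can reference the calls it wants to preserve while being marked `weak` and `cold`. The entry point names below are hypothetical stand-ins, and the attribute spelling assumes a GCC/Clang-compatible compiler.

```cpp
// Hypothetical stand-ins for runtime calls that must not be optimized out.
extern "C" int runtimeEntryA(int X) { return X + 1; }
extern "C" int runtimeEntryB(int X) { return X * 2; }

// Sketch of a keep-alive function: it is never called at run time and only
// exists so that the entry points above stay referenced. `weak` matches the
// linkage described in the message; `cold` tells the optimizer that the
// function is unlikely to execute.
extern "C" __attribute__((weak, cold)) void keepAliveSketch() {
  volatile int Sink = runtimeEntryA(0) + runtimeEntryB(0);
  (void)Sink;
}
```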
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D117513
The problem with the old scheme is that we would need to keep track of
the "next region" and reset the num_threads value after it. The new RT
doesn't do it and an assertion is triggered. The old RT doesn't do it
either; I haven't tested it, but I assume a num_threads clause might
impact multiple parallel regions "accidentally". Further, in SPMD mode
num_threads was simply ignored, for some reason beyond me.
In any case, parallel_51 is designed to take the clause value directly,
so let's do that instead.
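
The difference between the two schemes can be sketched as follows. All names are hypothetical, and the convention that -1 means "no clause" is an assumption made for this sketch, not something stated in this message.

```cpp
// Old scheme (sketch): the clause value is pushed into sticky state that
// nothing resets after the next parallel region.
static int PendingNumThreads = 0; // 0 means "no clause seen" in this sketch

void pushNumThreadsSketch(int N) { PendingNumThreads = N; }

int parallelOldSketch(int DefaultThreads) {
  // The pending value is consumed but never cleared, so it also affects
  // every later region.
  return PendingNumThreads > 0 ? PendingNumThreads : DefaultThreads;
}

// New scheme (sketch): the clause value arrives as an argument of the call
// itself. -1 stands for "clause absent" (assumption for this sketch).
int parallelNewSketch(int DefaultThreads, int NumThreadsClause) {
  return NumThreadsClause > 0 ? NumThreadsClause : DefaultThreads;
}
```

With the sticky state, a second region without a clause still sees the earlier value; with the direct argument, each region only sees its own clause.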
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D113623
The existing CGOpenMPRuntimeAMDGCN and CGOpenMPRuntimeNVPTX classes are
just code bloat. By removing them, the codebase gets a bit cleaner.
Reviewed By: jdoerfert, JonChesterfield, tianshilei1992
Differential Revision: https://reviews.llvm.org/D113421
Before we had aligned barriers the `__kmpc_barrier_simple_spmd` was
OK to be used in the custom state machine. Now that SPMD barriers are
assumed to be aligned we need to use a "generic" barrier in places
that are not aligned.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112893
We do not generate _serialized_parallel calls in device mode, no
need for an external API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112145
D110279 introduced a bug to the device runtime. In `__kmpc_parallel_51`, we detect
whether we are already in parallel region by `__kmpc_parallel_level() > __kmpc_is_spmd_exec_mode()`.
It is based on the assumption that:
- In SPMD mode, parallel level is initialized to 1.
- In generic mode, parallel level is initialized to 0.
- `__kmpc_is_spmd_exec_mode` returns `1` for SPMD mode, 0 otherwise.
Because the return value type of `__kmpc_is_spmd_exec_mode` is `int8_t`, there
was an implicit cast from `bool` to `int8_t`. We can make sure it is either 0 or
1 since C++14. In D110279, the return value is the result of an `and` operation,
which is 2 in SPMD mode. This breaks the assumption in `__kmpc_parallel_51`.
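
A minimal sketch of the failure, assuming a bitset in which the SPMD bit has the value 2 (consistent with the "2 in SPMD mode" above). The function and constant names are hypothetical.

```cpp
#include <cstdint>

// Assumed bit layout for this sketch only.
constexpr int8_t ModeGenericBit = 0x1;
constexpr int8_t ModeSPMDBit = 0x2;

// Buggy variant: returns the raw result of the `and`, i.e. 0 or 2.
int8_t isSpmdRaw(int8_t Mode) { return Mode & ModeSPMDBit; }

// Fixed variant: the comparison yields a bool, which converts to 0 or 1.
int8_t isSpmdNormalized(int8_t Mode) { return (Mode & ModeSPMDBit) != 0; }

// The check described above: we are already in a parallel region if the
// parallel level exceeds the value returned for the execution mode.
bool alreadyInParallel(uint8_t Level, int8_t IsSpmd) { return Level > IsSpmd; }
```

In SPMD mode the level starts at 1, so a nested region sees level 2. With the raw value the check is `2 > 2` and the nesting goes undetected; with the normalized value it is `2 > 1`, as intended.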
Reviewed By: carlo.bertolli, dpalermo
Differential Revision: https://reviews.llvm.org/D111905
This patch adds an external interface to access the dynamic shared
memory buffer in the device runtime. The function introduced is
``llvm_omp_get_dynamic_shared``. This includes a host-side
definition that only returns a null pointer so that it can be used when
host-fallback is enabled without crashing. Support for dynamic shared
memory was also ported to the old device runtime.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110957
Follow on to D110006, related to D110957
Where implementations have diverged this resolves to match the new DeviceRTL
- replaces definitions of this struct in deviceRTL and plugins with include
- changes the dynamic_shared_size field from D110006 to 32 bits
- handles stdint being unavailable in DeviceRTL
- adds a zero initializer for the field to amdgpu
- moves the extern declaration for deviceRTL to target_interface
(omptarget.h is more natural, but doesn't work due to include order
with debug.h)
- Renames the fields everywhere to match the LLVM format used in DeviceRTL
- Makes debug_level uint32_t everywhere (previously sometimes int32_t)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111069
This patch defines the newly added `__kmpc_distribute_static_init`
functions in the device runtime library. These functions are currently
exact copies of the current worksharing method but can be tuned later.
Depends on D110429
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110430
This is a follow-up of D110029, which uses a bitset to indicate the execution mode. This patch makes the corresponding changes in the function call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110279
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
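
The aggregation can be sketched like this. The types and function names are hypothetical and only illustrate the shape of the interface, not the real fork_call signature.

```cpp
// Hypothetical aggregate holding the captured variables of one region.
struct CapturesSketch {
  int *Counter;
  double *Scale;
};

// The outlined function takes a single pointer to the aggregate instead of
// one parameter per captured variable.
using OutlinedFnSketch = void (*)(void *);

// Non-variadic stand-in for a fork call: it forwards exactly one pointer,
// so no wrapping/unwrapping and no argument count limit is involved.
void forkCallSketch(OutlinedFnSketch Fn, void *Aggregate) { Fn(Aggregate); }

// Outlined region body: unpacks the aggregate with its real field types.
void outlinedBodySketch(void *Raw) {
  auto *C = static_cast<CapturesSketch *>(Raw);
  *C->Counter += 1;
  *C->Scale *= 2.0;
}
```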
Reviewed By: jdoerfert, jhuber6
Differential Revision: https://reviews.llvm.org/D102107
This patch adds fields for the device number and number of devices into
the device environment struct and debugging values.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110004
Lets the amdgpu plugin write to omptarget_device_environment
to enable debugging. Intend to use in the near future to record the
wavesize that a given deviceRTL was compiled with for running on hardware
that supports 32 or 64.
Patch sets all the attributes that are useful. Notably .data means the variable
is set by writing to host memory before copying to the GPU instead of launching
a kernel to update the image. Can simplify the plugin slightly to drop the
code for patching after load if this is used consistently.
NFC on nvptx, cuda plugin seems to work fine without any annotations.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108698
Use uint64_t for lanemask on all GPU architectures at the interface
with clang. Updates tests. The deviceRTL is always linked as IR so the zext
and trunc introduced for wave32 architectures will fold after inlining.
Simplification partly motivated by amdgpu gfx10 which will be wave32 and
is awkward to express in the current arch-dependent typedef interface.
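
A small sketch of what the zext/trunc pair looks like on a wave32 target. The function names are hypothetical; the point is only that the interface type is always 64 bits wide.

```cpp
#include <cstdint>

// Interface side: the lanemask is always 64 bits wide. On a wave32 target
// the 32-bit hardware mask is zero-extended.
uint64_t toInterfaceMask(uint32_t HardwareMask) {
  return static_cast<uint64_t>(HardwareMask); // zext
}

// Implementation side on wave32: the upper half is unused and truncated.
uint32_t toHardwareMask32(uint64_t InterfaceMask) {
  return static_cast<uint32_t>(InterfaceMask); // trunc
}
```

Once both sides are inlined into the same IR, the extension followed by the truncation is an identity that the optimizer can remove.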
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108317
Similar to D105787, this patch tries to fold `__kmpc_parallel_level` if possible.
Note that `__kmpc_parallel_level` doesn't take activeness into consideration:
in the current `deviceRTLs`, its return value is a plain level such as 0, 1, or 2,
rather than a value such as 0, 129, or 130 that also encodes activeness.
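
To make the two encodings concrete, here is a sketch that assumes the activeness flag is the top bit (128) of the 8-bit level, which is inferred from the values 129 and 130 above. The names are hypothetical.

```cpp
#include <cstdint>

// Assumption for this sketch: bit 7 marks an active parallel region.
constexpr uint8_t ActiveBitSketch = 128;

// Encoding that also records activeness, e.g. depth 1 + active -> 129.
uint8_t encodeLevel(uint8_t Depth, bool Active) {
  return static_cast<uint8_t>(Depth | (Active ? ActiveBitSketch : 0));
}

// Plain nesting depth with the activeness flag stripped, e.g. 129 -> 1.
uint8_t plainLevel(uint8_t Encoded) {
  return static_cast<uint8_t>(Encoded & (ActiveBitSketch - 1));
}
```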
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106154
Unrolling this loop provides better performance in practice because it is
executed on the device and is likely to be very small.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D106692
With D106496 we can make the globalization fallback stack much simpler
and this version doesn't seem to experience the spurious failures and
deadlocks we have seen before.
Differential Revision: https://reviews.llvm.org/D106576
These functions should follow the camel case convention. These are really easy to change
and are needed for D106033.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D106390
This patch introduces `__kmpc_is_generic_main_thread_id` which splits the old
comparison into its own runtime function. The purpose of this is so we can fold
this part independently, so when both this and `is_spmd_mode` are folded the
final function will be folded as well.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106437
This patch changes `__kmpc_free_shared` to take an additional argument
corresponding to the associated allocation's size. This makes it easier to
implement the allocator in the runtime.
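
One way to see why the size helps: with it, a simple stack allocator can pop without storing any per-allocation header. The following is a host-side sketch with hypothetical names, not the runtime's allocator.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a fixed-capacity stack allocator whose free takes the size.
struct SharedStackSketch {
  static constexpr std::size_t Capacity = 256;
  alignas(8) unsigned char Data[Capacity];
  std::size_t Top = 0;

  static std::size_t alignUp(std::size_t Bytes) {
    return (Bytes + 7) & ~std::size_t(7);
  }

  // Returns nullptr when the buffer is exhausted.
  void *push(std::size_t Bytes) {
    std::size_t Aligned = alignUp(Bytes);
    if (Top + Aligned > Capacity)
      return nullptr;
    void *Ptr = Data + Top;
    Top += Aligned;
    return Ptr;
  }

  // Because the caller passes the size, popping is a single subtraction.
  void pop(void *Ptr, std::size_t Bytes) {
    std::size_t Aligned = alignUp(Bytes);
    assert(static_cast<unsigned char *>(Ptr) + Aligned == Data + Top &&
           "allocations must be released in LIFO order");
    Top -= Aligned;
  }
};
```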
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106496
The patch exposes the libomptarget runtime function that gets the hardware thread id through the kmpc API. This is to be used in SPMDization for checking the thread id to execute regions by a single thread in a block.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106323
In `deviceRTLs`, the parallel level is stored in a shared variable of type `uint8_t`.
`__kmpc_parallel_level` currently returns a 16-bit integer. This patch first
changes the return type of the function to `uint8_t`, same as the shared variable,
and then corrects the function type, which was updated in D105955.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106384
Currently the NVPTX work function is marked volatile. This prevents some
optimizations from using this value.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106310
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102107
There are two places in the current deviceRTLs that compute the parallel level explicitly,
which is basically the functionality of `__kmpc_parallel_level`. Starting with
D105787, we plan to introduce a series of function call foldings based on information
that can be deduced at compile time. Computation of the parallel level is
the next target. This patch prepares for that optimization.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105955
Simplifies control flow to allow store/load forwarding
This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards spmd kernels with sufficiently aggressive inlining folding
the loads from parallelLevel and thus discarding the nested parallel handling
when it is unused.
Transform:
```
int threadId = GetThreadIdInBlock();
if (threadId == 0) {
  parallelLevel[0] = expr;
} else if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// because
unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1); }
// so whenever threadId == 0, GetLaneId() is also 0.
```
That replaces a store in two distinct basic blocks with a single store.
A more aggressive follow up is possible if the threads in the warp/wave
race to write the same value to the same address. This is not done as
part of this change.
```
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
parallelLevel[GetWarpId()] = expr;
// because
unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
// so GetWarpId will index the same element for every thread in the warp
// and, because expr is lane-invariant in this case, every lane stores the
// same value to this unique address
```
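
The first transform can be checked on the host by enumerating thread ids and comparing which `parallelLevel` slot each thread would write. This is only a sketch: the helper names are hypothetical and `WARPSIZE` is assumed to be 32.

```cpp
// Assumed warp size for this sketch.
constexpr unsigned WARPSIZE = 32;

unsigned laneIdOf(unsigned Tid) { return Tid & (WARPSIZE - 1); }
unsigned warpIdOf(unsigned Tid) { return Tid / WARPSIZE; }

// Slot written by thread Tid in the original control flow, or -1 if the
// thread does not store at all.
int storeIndexBefore(unsigned Tid) {
  if (Tid == 0)
    return 0;
  if (laneIdOf(Tid) == 0)
    return static_cast<int>(warpIdOf(Tid));
  return -1;
}

// Slot written by thread Tid after folding the two blocks into one.
int storeIndexAfter(unsigned Tid) {
  if (laneIdOf(Tid) == 0)
    return static_cast<int>(warpIdOf(Tid));
  return -1;
}

// True if both versions store to the same slot for every thread id.
bool transformsAgree(unsigned NumThreads) {
  for (unsigned Tid = 0; Tid < NumThreads; ++Tid)
    if (storeIndexBefore(Tid) != storeIndexAfter(Tid))
      return false;
  return true;
}
```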
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D105699
In order to fold calls based on high-level knowledge and control flow
tracking it helps to expose the information as a runtime call. The
logic: `!SPMD && getTID() == getMasterTID()` was used in various places
and is now encapsulated in `__kmpc_is_generic_main_thread`. As part of
this rewrite we replaced eager computation of arguments with on-demand
computation, especially helpful if the calls can be folded and arguments
don't need to be computed consequently.
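
The encapsulated condition can be sketched as below. The helper names are hypothetical, and the main thread id formula (first lane of the last warp) is an assumption made for illustration.

```cpp
// Assumed warp size for this sketch.
constexpr unsigned WARPSIZE = 32;

// Assumption: the main thread is the first lane of the last warp.
unsigned masterTidSketch(unsigned NumThreads) {
  return (NumThreads - 1) & ~(WARPSIZE - 1);
}

// Mirrors `!SPMD && getTID() == getMasterTID()`. Thanks to short-circuit
// evaluation, the main thread id is only computed when it is needed, which
// is the on-demand behavior the message describes.
bool isGenericMainThreadSketch(bool IsSpmd, unsigned Tid, unsigned NumThreads) {
  return !IsSpmd && Tid == masterTidSketch(NumThreads);
}
```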
Differential Revision: https://reviews.llvm.org/D105768
In order to avoid malloc/free, up to NUM_SHARED_VARIABLES_IN_SHARED_MEM
(=64) variables are communicated in dedicated shared memory instead. The
simplification does avoid the need for an "init" and requires "deinit"
only if we ever communicate more than NUM_SHARED_VARIABLES_IN_SHARED_MEM
variables.
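
A host-side sketch of the scheme, with hypothetical type and method names: a fixed array serves up to 64 pointers, and the heap is only touched (and later released in the "deinit" step) when more are requested.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t NUM_SHARED_VARIABLES_IN_SHARED_MEM = 64;

// Sketch only; instances must not be copied because Args may point into
// the object itself.
struct SharedArgsSketch {
  void *Inline[NUM_SHARED_VARIABLES_IN_SHARED_MEM];
  void **Args = Inline;

  // Makes room for N pointers; returns false if the fallback allocation
  // fails. No separate "init" is required for the inline case.
  bool beginSharing(std::size_t N) {
    endSharing();
    if (N <= NUM_SHARED_VARIABLES_IN_SHARED_MEM)
      return true;
    void **Heap = static_cast<void **>(std::malloc(N * sizeof(void *)));
    if (!Heap)
      return false;
    Args = Heap;
    return true;
  }

  // "deinit": only has work to do if the heap fallback was used.
  void endSharing() {
    if (Args != Inline) {
      std::free(Args);
      Args = Inline;
    }
  }

  bool usesHeap() const { return Args != Inline; }
};
```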
Differential Revision: https://reviews.llvm.org/D105767
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.
Differential Revision: https://reviews.llvm.org/D105586
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
Broke check-clang, see https://reviews.llvm.org/D102307#2869065
Ran `git revert -n ebbe149a6f08535ede848a531a601ae6591cfbc5..269416d41908bb670f67af689155d5ab8eea689a`
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.
Differential Revision: https://reviews.llvm.org/D105586
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
In our ongoing work, we are using `AbstractAttributor` to deduce the execution model
of device functions, and potentially remove unnecessary function calls to
`__kmpc_is_spmd_exec_mode`. The current device runtime mixes uses of
`isSPMDMode` and `__kmpc_is_spmd_exec_mode`, although `__kmpc_is_spmd_exec_mode`
simply calls `isSPMDMode`. All functions starting with `__kmpc` are C
functions, which have no name mangling, so they are more optimization
friendly. In this patch, we simply replace all calls to `isSPMDMode` with
`__kmpc_is_spmd_exec_mode` to pave the way for the optimization.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105211
Currently the runtime implementation of `__kmpc_alloc_shared` is extremely slow because it allocates memory for each thread individually. This patch adds a small buffer for the threads to share data and will greatly improve performance for builds where not all globalization could be optimized out. If the shared buffer is full, memory is allocated per-warp rather than per-thread.
Depends on D97680
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104666
This patch introduces the new globalization runtime to be used by D97680. These
runtime calls will replace the __kmpc_data_sharing_push_stack and
__kmpc_data_sharing_pop_stack functions.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102532
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
[libomptarget] Fix devicertl build
The target specific functions in target_interface are extern C, but the
implementations for nvptx mostly used C++ mangling. That worked out as
a quirk of DEVICE macro expanding to nothing, except for shuffle.h which
only forward declared the functions with C++ linkage.
Also implements GetWarpSize, as used by shuffle, and includes target_interface
in nvptx target_impl.cu to help catch future divergence between interface and
implementation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98651
[libomptarget] Drop assert.h, use freestanding for amdgcn devicertl
Promotes the runtime assert to a link time error for the unimplemented
fallback functions. Enables amdgcn to build with only clang provided
headers, which makes it less likely to break other builds when enabled.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98649
[libomptarget][amdgcn] Drop use of inttypes.h, moving closer to freestanding
The glibc headers are a periodic source of problems compiling the devicertl.
This patch resolves the following error encountered while building llvm on a slightly
different linux system.
```
In file included from .../lib/clang/13.0.0/include/inttypes.h:21:
In file included from /usr/include/inttypes.h:25:
/usr/include/features.h:461:12: fatal error: 'sys/cdefs.h' file not found
# include <sys/cdefs.h>
^~~~~~~~~~~~~
```
As a second patch, removing assert.h from shuffle will let amdgcn build as
-ffreestanding, at which point only the headers that clang itself provides are
used and interactions with the host glibc are eliminated. Doing the same for
nvptx is complicated by printf handling but also seems worthwhile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98565