285 Commits

Author SHA1 Message Date
Joseph Huber 74d622dea4 [OpenMP] Add new worksharing definitions into device RTL
This patch defines the newly added `__kmpc_distribute_static_init`
functions in the device runtime library. These functions are currently
exact copies of the current worksharing method but can be tuned later.

Depends on D110429

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D110430
2021-09-27 11:36:41 -04:00
Shilei Tian 423d34f74a [OpenMP][Offloading] Change `bool IsSPMD` to `int8_t Mode` in `__kmpc_target_init` and `__kmpc_target_deinit`
This is a follow-up of D110029, which uses a bitset to indicate the execution mode. This patch makes the corresponding changes in the function calls.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D110279
2021-09-22 17:16:41 -04:00
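The idea behind commit 423d34f74a, replacing a boolean with a small bitset, can be sketched as follows. The enumerator names and bit assignments here are hypothetical and chosen for illustration; they are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical execution-mode bits. Names and values are assumptions made
// for this sketch, not the runtime's actual definitions.
enum ExecModeBits : int8_t {
  kModeGeneric = 1 << 0,
  kModeSPMD = 1 << 1,
  kModeGenericSPMD = kModeGeneric | kModeSPMD,
};

// A bool can only distinguish two states. With a bitset, each property of
// the mode can be queried independently.
inline bool isSPMD(int8_t Mode) {
  assert(Mode != 0 && "at least one mode bit must be set");
  return (Mode & kModeSPMD) != 0;
}

inline bool isGeneric(int8_t Mode) {
  assert(Mode != 0 && "at least one mode bit must be set");
  return (Mode & kModeGeneric) != 0;
}
```

A combined mode such as `kModeGenericSPMD` answers true to both queries, which a single `bool IsSPMD` parameter could not express.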
Giorgis Georgakoudis ac90dfc43a Revert "[OpenMP] Codegen aggregate for outlined function captures"
This reverts commit 1d66649adf.

Revert to fix AMD GPU issue.
2021-09-21 13:20:39 -07:00
Giorgis Georgakoudis 1d66649adf [OpenMP] Codegen aggregate for outlined function captures
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must be cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Reviewed By: jdoerfert, jhuber6

Differential Revision: https://reviews.llvm.org/D102107
2021-09-21 10:50:04 -07:00
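The aggregation described in commit 1d66649adf can be illustrated with a host-only sketch. `Captures`, `outlinedRegion`, `forkCallSketch`, and `sumWithAggregate` are hypothetical names, and the "threads" run sequentially; this is not the runtime's actual fork_call interface.

```cpp
#include <cassert>

// Hypothetical aggregate: one field per captured variable of the region.
struct Captures {
  int *Sum;
  const int *Data;
  int N;
};

// Hypothetical outlined function. It takes one pointer to the aggregate
// instead of one (pointer-cast) parameter per captured variable.
inline void outlinedRegion(int ThreadId, int NumThreads, void *Args) {
  auto *C = static_cast<Captures *>(Args);
  for (int I = ThreadId; I < C->N; I += NumThreads)
    *C->Sum += C->Data[I];
}

using OutlinedFn = void (*)(int, int, void *);

// Stand-in for the fork call: non-variadic, forwards a single pointer, so
// there is no argument wrapping and no fixed limit on captures.
inline void forkCallSketch(OutlinedFn Fn, int NumThreads, void *Args) {
  assert(Fn && "outlined function must not be null");
  for (int T = 0; T < NumThreads; ++T) // sequential stand-in for threads
    Fn(T, NumThreads, Args);
}

inline int sumWithAggregate(const int *Data, int N, int NumThreads) {
  int Sum = 0;
  Captures C{&Sum, Data, N};
  forkCallSketch(outlinedRegion, NumThreads, &C);
  return Sum;
}
```

Adding another captured variable only adds a field to the struct; the signature of the fork call stays the same.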
Joseph Huber ec02c34b6d [OpenMP] Add additional fields to device environment
This patch adds fields for the device number and number of devices into
the device environment struct and debugging values.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D110004
2021-09-17 21:25:32 -04:00
Shilei Tian 81a1a91c62 [NFC] clang-format -i /openmp/libomptarget/deviceRTLs/interface.h 2021-09-17 12:55:02 -04:00
Ron Lieberman fdac5adee6 [openmp] NFC add bitcode comment 2021-09-02 18:21:39 -05:00
Jon Chesterfield 201e466eba [libomptarget][amdgpu] Add gfx90a to build list 2021-09-02 18:11:02 +01:00
Joachim Protze 5ea1c37118 [libomptarget][amdcgn] Only add opt/llvm-link dependency if TARGET is available
In some build configurations, the target we depend on is not available for declaring the build dependency.
We only need to declare the build dependency if the build target is available in the same build.

Fixes the issue raised in https://reviews.llvm.org/D107156#2969862
This patch should go into release/13 together with D108404

Differential Revision: https://reviews.llvm.org/D108868
2021-08-30 17:32:11 +02:00
Jon Chesterfield 78f92c3810 [openmp][amdgpu] Initial gfx10 offloading implementation
Lets wavefront size be 32 for amdgpu openmp, as well as 64.

Fixes up as little as possible to pass that through the libraries. This change
is end to end, as opposed to updating clang/devicertl/plugin separately. It can
be broken up for review/commit if preferred. Posting as-is so that others with
a gfx10 can try it out. It works roughly as well as gfx9 for me, but there are
probably bugs remaining as well as the todo: for letting grid values vary more.

Reviewed By: ronlieb

Differential Revision: https://reviews.llvm.org/D108708
2021-08-27 12:34:03 +01:00
Jon Chesterfield a5f4074d85 [libomptarget][amdgpu] Macro for accessing GPU variables from plugin
Lets the amdgpu plugin write to omptarget_device_environment
to enable debugging. Intend to use in the near future to record the
wavesize that a given deviceRTL was compiled with for running on hardware
that supports 32 or 64.

Patch sets all the attributes that are useful. Notably .data means the variable
is set by writing to host memory before copying to the GPU instead of launching
a kernel to update the image. Can simplify the plugin slightly to drop the
code for patching after load if this is used consistently.

NFC on nvptx, cuda plugin seems to work fine without any annotations.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108698
2021-08-26 17:28:18 +01:00
Michael Kruse 1275ee3041 [OpenMP][amdgcn] Don't use in-tree clang if not available.
The use of `$<TARGET_FILE:clang>` was adapted too broadly from D101265.

Fixes llvm.org/PR51579

Also see discussion in D108534.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D108640
2021-08-24 12:50:49 -05:00
Jon Chesterfield d26000e4cc [openmp][devicertl] Freestanding nvptx via stub printf
Compile the nvptx devicertl as freestanding, breaking the
dependency on host glibc and gcc-multilibs, and therefore build it by default.

Comes at the cost of #defining out printf. Tried mapping it onto
__builtin_printf but that gets transformed back to printf instead
of hitting the cuda/openmp lowering transform.

Printf could be preserved by one of:
- dropping all the standard headers and ffreestanding
- providing a header only printf implementation
- changing the compiler handling of printf

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D108349
2021-08-23 23:07:47 +01:00
Jon Chesterfield 842f875c8b [openmp] Use llvm GridValues from devicertl
Add include path to the cmakefiles and set the target_impl enums
from the llvm constants instead of copying the values.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108391
2021-08-23 20:25:24 +01:00
Joachim Protze 4bb36df144 [libomptarget][amdcgn] Add build dependency for llvm-link and opt
D107156 and D107320 are not sufficient when OpenMP is built as llvm runtime
(LLVM_ENABLE_RUNTIMES=openmp) because dependencies only work within the same
cmake instance.

We could limit the dependency to cases where libomptarget/plugins are really
built. But compared to the whole llvm project, building openmp runtime is
negligible and postponing the build of OpenMP runtime after the dependencies
are ready seems reasonable.

The direct dependency introduced in D107156 and D107320 is necessary for the
case where OpenMP is built as llvm project (LLVM_ENABLE_PROJECTS=openmp).

Differential Revision: https://reviews.llvm.org/D108404
2021-08-20 01:57:58 +02:00
Jon Chesterfield 6c75ce1b8b [libomptarget][nfc] Move lanemask_t type into target_impl.h 2021-08-19 18:50:03 +01:00
Jon Chesterfield f420939b82 [libomptarget] Apply D106710 to amdgcn devicertl 2021-08-19 01:34:33 +01:00
Jon Chesterfield c480792b6a [libomptarget][nfc][devicertl] Delete unused enums 2021-08-19 00:14:34 +01:00
Jon Chesterfield 21d91a8ef3 [libomptarget][devicertl] Replace lanemask with uint64 at interface
Use uint64_t for lanemask on all GPU architectures at the interface
with clang. Updates tests. The deviceRTL is always linked as IR so the zext
and trunc introduced for wave32 architectures will fold after inlining.

Simplification partly motivated by amdgpu gfx10 which will be wave32 and
is awkward to express in the current arch-dependent typedef interface.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108317
2021-08-18 20:47:33 +01:00
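The zext and trunc mentioned in commit 21d91a8ef3 can be pictured with a small host-side sketch for a wave32 target. `NativeLaneMask`, `nativeActiveMask`, `activeMaskInterface`, and `countLanes` are hypothetical names, and the mask value is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wave32 target: the native lane mask is 32 bits wide.
using NativeLaneMask = uint32_t;

// Stand-in for a target intrinsic returning the active-lane mask.
inline NativeLaneMask nativeActiveMask() { return 0x0000FFFFu; }

// Interface side: always uint64_t, whatever the architecture. On wave32
// this is a zero extension of the native mask.
inline uint64_t activeMaskInterface() {
  return static_cast<uint64_t>(nativeActiveMask());
}

// Interface function consuming a mask: truncates back to the native width
// before doing target-specific work (here, counting set lanes).
inline int countLanes(uint64_t Mask) {
  assert((Mask >> 32) == 0 && "wave32 masks never set the upper bits");
  NativeLaneMask M = static_cast<NativeLaneMask>(Mask);
  int Count = 0;
  for (; M != 0; M &= M - 1) // clear the lowest set bit each iteration
    ++Count;
  return Count;
}
```

When both sides are inlined into the same function, the extension followed by the truncation is a no-op that the optimizer can remove, which is the folding the message refers to.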
Joachim Protze 4ffa1478fd [libomptarget][amdcgn] Add build dependency for opt
This patch should fix the build failure we observe when building LLVM from scratch.

Differential Revision: https://reviews.llvm.org/D107156
2021-07-30 15:45:13 +02:00
Jose M Monsalve Diaz 5ab6aedda9 [OpenMP] Folding threadLimit and numThreads when single value in kernels
The device runtime contains several calls to `__kmpc_get_hardware_num_threads_in_block`
and `__kmpc_get_hardware_num_blocks`. If the thread_limit and the num_teams are constant,
these calls can be folded to the constant value.

In this patch we use the already introduced `AAFoldRuntimeCall` and the `NumTeams` and
`NumThreads` kernel attributes (to be introduced in a different patch) to fold these functions.
The code checks all the kernels, and if their attributes match, the functions are folded.

In the future we will explore specializing for multiple values of NumThreads and NumTeams.

Depends on D106390

Reviewed By: jdoerfert, JonChesterfield

Differential Revision: https://reviews.llvm.org/D106033
2021-07-27 21:47:12 -04:00
Shilei Tian e97e0a4fad [AbstractAttributor] Fold __kmpc_parallel_level if possible
Similar to D105787, this patch tries to fold `__kmpc_parallel_level` if possible.
Note that `__kmpc_parallel_level` doesn't take activeness into consideration:
in the current `deviceRTLs`, it returns values such as 0, 1, 2 rather than
values such as 0, 129, 130 that also encode activeness.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106154
2021-07-26 22:46:19 -04:00
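The values quoted in commit e97e0a4fad (0, 1, 2 versus 0, 129, 130) are consistent with an encoding in which bit 7 marks an active level and the low bits hold the nesting depth. The sketch below assumes exactly that encoding; `kActiveFlag`, `encodeLevel`, and `depthOnly` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding, inferred from the example values in the message:
// bit 7 (128) flags an active parallel level, the low 7 bits hold the depth.
constexpr uint8_t kActiveFlag = 128;

// Value as it might be stored in the per-warp parallel-level variable.
inline uint8_t encodeLevel(uint8_t Depth, bool Active) {
  assert(Depth < kActiveFlag && "depth must fit in the low 7 bits");
  return static_cast<uint8_t>(Active ? (Depth | kActiveFlag) : Depth);
}

// What a query that ignores activeness would return: the depth alone.
inline uint8_t depthOnly(uint8_t Stored) {
  return static_cast<uint8_t>(Stored & (kActiveFlag - 1));
}
```

Under this assumption, a stored 129 or 130 reads back as depth 1 or 2, matching the plain 0, 1, 2 values the message describes.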
Shilei Tian f1b8fa55d0 [OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs
We build `deviceRTLs` with `-O1` by default, which also triggers OpenMPOpt. When
the info cache is created, some attributes are removed. As a result, although we
mark a few functions `noinline`, they are still inlined when the bitcode library
is generated. This can cause an issue in middle end optimization.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106710
2021-07-25 10:38:27 -04:00
Joseph Huber e1dedecaa6 [Libomptarget] Add unroll flag to shared variables loop
Unrolling this loop provides better performance in practice because it is
executed on the device and is likely to be very small.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D106692
2021-07-23 16:45:27 -04:00
Johannes Doerfert d12ee28e2e [OpenMP] Simplify the ThreadStackTy for globalization fallback
With D106496 we can make the globalization fallback stack much simpler
and this version doesn't seem to experience the spurious failures and
deadlocks we have seen before.

Differential Revision: https://reviews.llvm.org/D106576
2021-07-22 23:57:46 -05:00
Jose M Monsalve Diaz 68d6278a6e [OpenMP] Renaming RT functions `GetNumberOfBlocksInKernel` and `GetNumberOfThreadsInBlock`
These functions should follow the camel case convention. These are really easy to change
and are needed for D106033.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106390
2021-07-22 18:17:49 -04:00
Joseph Huber 4a66860424 [OpenMP] Add an option to disable function internalization
Function internalization can sometimes occur in situations where we want to
keep the call sites intact. This patch adds an option to disable function
internalization and prevents the device runtime from being internalized while
creating the bitcode library.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106438
2021-07-21 21:18:18 -04:00
Joseph Huber 1684012a47 [Libomptarget] Introduce new main thread ID runtime function
This patch introduces `__kmpc_is_generic_main_thread_id` which splits the old
comparison into its own runtime function. The purpose of this is so we can fold
this part independently, so when both this and `is_spmd_mode` are folded the
final function will be folded as well.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106437
2021-07-21 21:18:14 -04:00
Joseph Huber 754eb1c210 [OpenMP] Change `__kmpc_free_shared` to include the paired allocation size
This patch changes `__kmpc_free_shared` to take an additional argument
corresponding to the associated allocation's size. This makes it easier to
implement the allocator in the runtime.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106496
2021-07-21 20:56:21 -04:00
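One way to see why passing the size to the free call simplifies an allocator, as commit 754eb1c210 states, is a stack (bump) allocator that needs no per-allocation header. This is a host-only sketch; `SharedStack`, its buffer size, and its alignment are hypothetical and not taken from the runtime.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size stack allocator used only for illustration.
class SharedStack {
  alignas(8) unsigned char Buffer[256];
  std::size_t Top = 0;

  static std::size_t alignUp(std::size_t Bytes) {
    return (Bytes + 7) & ~static_cast<std::size_t>(7);
  }

public:
  // Bump the top of the stack; returns nullptr when the buffer is full.
  void *alloc(std::size_t Bytes) {
    std::size_t Aligned = alignUp(Bytes);
    if (Top + Aligned > sizeof(Buffer))
      return nullptr;
    void *P = Buffer + Top;
    Top += Aligned;
    return P;
  }

  // Because the caller supplies the size, popping needs no stored metadata.
  void release(void *P, std::size_t Bytes) {
    std::size_t Aligned = alignUp(Bytes);
    assert(Top >= Aligned && Buffer + (Top - Aligned) == P &&
           "allocations must be released in LIFO order");
    Top -= Aligned;
  }

  std::size_t used() const { return Top; }
};
```

Without the size argument, the allocator would have to record each allocation's length somewhere, for example in a header in front of the returned pointer.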
Giorgis Georgakoudis 5a682d9b91 [OpenMP] Expose libomptarget function to get HW thread id
The patch exposes the libomptarget runtime function that gets the hardware thread id through the kmpc API. This is to be used in SPMDization for checking the thread id to execute regions by a single thread in a block.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106323
2021-07-21 10:26:04 -07:00
Shilei Tian 55c65884a4 [OpenMP][deviceRTLs] Update return type of function __kmpc_parallel_level
In `deviceRTLs`, the parallel level is stored in a shared variable of type `uint8_t`.
`__kmpc_parallel_level` currently returns a 16-bit integer. This patch first
changes the return type of the function to `uint8_t`, same as the shared variable,
and then corrects function type which was updated in D105955.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106384
2021-07-20 15:45:43 -04:00
Joseph Huber 762badb0ab [Libomptarget] Remove volatile from NVPTX work function
Currently the NVPTX work function is marked volatile. This prevents some
optimizations from using this value.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106310
2021-07-19 20:03:25 -04:00
Giorgis Georgakoudis fb0cf01795 Revert "[OpenMP] Codegen aggregate for outlined function captures"
This reverts commit e9c7291cb2.

Fix failing tests
2021-07-19 07:54:26 -07:00
Shilei Tian 4357cfc792 [OpenMP][Offloading] Add -g when compiling deviceRTLs in debug mode
Currently, when we compile the project in debug mode, `-g` is not added to the
compilation flags. The bc files generated in different modes are of different sizes.
When using GPU debuggers like `cuda-gdb`, a debug version of the bc library is
expected to provide more information.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106229
2021-07-18 09:34:54 -04:00
Giorgis Georgakoudis e9c7291cb2 [OpenMP] Codegen aggregate for outlined function captures
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must be cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102107
2021-07-16 23:27:44 -07:00
Shilei Tian 97c8f60bba [NFC][OpenMP][Offloading] Replaced explicit parallel level computation with function `__kmpc_parallel_level`
There are two places in the current deviceRTLs that compute the parallel level explicitly,
which is basically the functionality of `__kmpc_parallel_level`. Starting from
D105787, we plan to introduce a series of function call foldings based on information
that can be deduced at compile time. Computation of the parallel level is
the next target. This patch prepares for that optimization.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D105955
2021-07-15 22:21:06 -04:00
Jon Chesterfield b6b53ffef4 [libomptarget][devicertl] Remove branches around setting parallelLevel
Simplifies control flow to allow store/load forwarding

This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards spmd kernels with sufficiently aggressive inlining folding
the loads from parallelLevel and thus discarding the nested parallel handling
when it is unused.

Transform:
```
int threadId = GetThreadIdInBlock();
if (threadId == 0) {
  parallelLevel[0] = expr;
} else if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// because
unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1);}
// so whenever threadId == 0, GetLaneId() is also 0.
```

That replaces a store in two distinct basic blocks with a single store.

A more aggressive follow up is possible if the threads in the warp/wave
race to write the same value to the same address. This is not done as
part of this change.

```
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
parallelLevel[GetWarpId()] = expr;
// because
unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
// so GetWarpId will index the same element for every thread in the warp
// and, because expr is lane-invariant in this case, every lane stores the
// same value to this unique address
```

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D105699
2021-07-13 12:06:57 +01:00
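The first transform in commit b6b53ffef4 can be checked exhaustively on the host by simulating every thread of one block and comparing the resulting arrays. The warp width, block size, and helper names below are assumptions made for this sketch.

```cpp
#include <array>
#include <cassert>

constexpr unsigned WARPSIZE = 32;         // assumed warp width
constexpr unsigned kThreadsInBlock = 128; // assumed block size
constexpr unsigned kWarps = kThreadsInBlock / WARPSIZE;

inline unsigned laneIdOf(unsigned ThreadId) { return ThreadId & (WARPSIZE - 1); }
inline unsigned warpIdOf(unsigned ThreadId) {
  assert(ThreadId < kThreadsInBlock);
  return ThreadId / WARPSIZE;
}

using Levels = std::array<int, kWarps>;

// Original control flow: two branches, each containing a store.
inline Levels storeWithTwoBranches(int Expr) {
  Levels ParallelLevel{};
  for (unsigned T = 0; T < kThreadsInBlock; ++T) {
    if (T == 0)
      ParallelLevel[0] = Expr;
    else if (laneIdOf(T) == 0)
      ParallelLevel[warpIdOf(T)] = Expr;
  }
  return ParallelLevel;
}

// Simplified control flow: thread 0 is also lane 0 of warp 0, so the
// first branch is subsumed by the second.
inline Levels storeWithSingleBranch(int Expr) {
  Levels ParallelLevel{};
  for (unsigned T = 0; T < kThreadsInBlock; ++T) {
    if (laneIdOf(T) == 0)
      ParallelLevel[warpIdOf(T)] = Expr;
  }
  return ParallelLevel;
}
```

Both functions leave the same contents in the array for any value of `Expr`, which is the equivalence the message argues for.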
Johannes Doerfert a7b7b5dfe5 [OpenMP] Create and use `__kmpc_is_generic_main_thread`
In order to fold calls based on high-level knowledge and control flow
tracking it helps to expose the information as a runtime call. The
logic: `!SPMD && getTID() == getMasterTID()` was used in various places
and is now encapsulated in `__kmpc_is_generic_main_thread`. As part of
this rewrite we replaced eager computation of arguments with on-demand
computation, especially helpful if the calls can be folded and arguments
don't need to be computed consequently.

Differential Revision: https://reviews.llvm.org/D105768
2021-07-11 19:18:03 -05:00
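The predicate named in commit a7b7b5dfe5, `!SPMD && getTID() == getMasterTID()`, can be sketched on the host as below. The choice of main thread (first lane of the last warp) and the warp size are assumptions of this sketch, and `masterTid` and `isGenericMainThread` are hypothetical names.

```cpp
#include <cassert>

constexpr int kWarpSize = 32; // assumed warp width

// Assumption for this sketch: the main thread of a generic-mode kernel is
// the first lane of the last warp in the block.
inline int masterTid(int NumThreads) {
  assert(NumThreads > 0);
  return (NumThreads - 1) & ~(kWarpSize - 1);
}

// The short-circuiting && means the master thread id is only computed when
// the kernel is not in SPMD mode, i.e. the argument is evaluated on demand.
inline bool isGenericMainThread(bool IsSPMD, int Tid, int NumThreads) {
  return !IsSPMD && Tid == masterTid(NumThreads);
}
```

If a compiler pass can prove `IsSPMD` is true for a kernel, the whole predicate folds to false and the thread-id computation disappears with it.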
Johannes Doerfert 1ab1f04a2b [OpenMP] Simplify variable sharing and increase shared memory size
In order to avoid malloc/free, up to NUM_SHARED_VARIABLES_IN_SHARED_MEM
(=64) variables are communicated in dedicated shared memory instead. The
simplification does avoid the need for an "init" and requires "deinit"
only if we ever communicate more than NUM_SHARED_VARIABLES_IN_SHARED_MEM
variables.

Differential Revision: https://reviews.llvm.org/D105767
2021-07-11 19:18:03 -05:00
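The scheme in commit 1ab1f04a2b, a fixed set of slots with a heap fallback only for larger requests, might look roughly like this on the host. `SharingState`, `beginSharing`, and `endSharing` are hypothetical names; only the slot count of 64 comes from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr int kSlotsInSharedMem = 64; // mirrors the "=64" in the message

// Hypothetical state: a static slot array plus an optional heap fallback.
struct SharingState {
  void *Static[kSlotsInSharedMem];
  void **Active = Static;
  bool UsedHeap = false;
};

// Returns storage for NumVars pointers. The common case needs no setup
// call and no malloc; only oversized requests fall back to the heap.
inline void **beginSharing(SharingState &S, int NumVars) {
  assert(NumVars >= 0);
  if (NumVars <= kSlotsInSharedMem) {
    S.Active = S.Static;
    S.UsedHeap = false;
  } else {
    S.Active = static_cast<void **>(
        std::malloc(sizeof(void *) * static_cast<std::size_t>(NumVars)));
    S.UsedHeap = S.Active != nullptr;
  }
  return S.Active;
}

// Cleanup has work to do only if the heap fallback was taken.
inline void endSharing(SharingState &S) {
  if (S.UsedHeap) {
    std::free(S.Active);
    S.Active = S.Static;
    S.UsedHeap = false;
  }
}
```

This mirrors the message's point that "deinit" only matters when more than the fixed number of variables is communicated.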
Johannes Doerfert 0a223827de [OpenMP] Remove checkXXXX device runtime functions
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.

Differential Revision: https://reviews.llvm.org/D105586
2021-07-10 18:20:40 -05:00
Johannes Doerfert e2cfbfcc0c [OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.

The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Reviewed By: ABataev, JonChesterfield

Differential Revision: https://reviews.llvm.org/D101976
2021-07-10 17:53:56 -05:00
Nico Weber d3e7491333 Revert Attributor patch series
Broke check-clang, see https://reviews.llvm.org/D102307#2869065
Ran `git revert -n ebbe149a6f08535ede848a531a601ae6591cfbc5..269416d41908bb670f67af689155d5ab8eea689a`
2021-07-10 16:15:55 -04:00
Johannes Doerfert e603ca0306 [OpenMP] Remove checkXXXX device runtime functions
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.

Differential Revision: https://reviews.llvm.org/D105586
2021-07-10 12:32:51 -05:00
Johannes Doerfert 1d5711c3ee [OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.

The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Reviewed By: ABataev, JonChesterfield

Differential Revision: https://reviews.llvm.org/D101976
2021-07-10 12:32:50 -05:00
Shilei Tian 24a36ce58b [OpenMP][Offloading] Replace all calls to `isSPMDMode` with `__kmpc_is_spmd_exec_mode`
In our ongoing work, we are using `AbstractAttributor` to deduce the execution mode
of device functions and potentially remove unnecessary calls to
`__kmpc_is_spmd_exec_mode`. The current device runtime mixes uses of
`isSPMDMode` and `__kmpc_is_spmd_exec_mode`, although `__kmpc_is_spmd_exec_mode`
simply calls `isSPMDMode`. All functions starting with `__kmpc` are C
functions, which have no name mangling, so they are more optimization
friendly. This patch replaces all calls to `isSPMDMode` with
`__kmpc_is_spmd_exec_mode` to pave the way for the optimization.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D105211
2021-06-30 15:39:57 -04:00
Jon Chesterfield f66b8fdc0a [libomptarget][amdgpu] Build openmp for two more targets
The 4800U APU is a gfx902 and the MI100 accelerator is a gfx908.
Both numbers are listed in ROCT topology.c

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D104922
2021-06-25 19:02:03 +01:00
Joseph Huber 244e98ff48 [Libomptarget] Improve device runtime implementation for globalized variables.
Currently the runtime implementation of `__kmpc_alloc_shared` is extremely slow because it allocates memory for each thread individually. This patch adds a small buffer for the threads to share data and will greatly improve performance for builds where not all globalization could be optimized out. If the shared buffer is full, memory is allocated per-warp rather than per-thread.

Depends on D97680

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D104666
2021-06-22 11:52:49 -04:00
Joseph Huber 952a0f2385 [Libomptarget] Introduce new globalization runtime calls
Summary:
This patch introduces the new globalization runtime to be used by D97680. These
runtime calls will replace the __kmpc_data_sharing_push_stack and
__kmpc_data_sharing_pop_stack functions.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102532
2021-06-22 10:05:42 -04:00
Jon Chesterfield d54712ab4d [libomptarget][amdgpu] Mark alloc, free weak to facilitate local experimentation
There are a lot of different ways we might implement the devicertl local alloc
and free functions. Via host, local buffers (stack or arena), specialising per
kernel etc. It is not yet clear what the right design is. This change makes the
alloc and free functions weak, so one can override them from local tests while
comparing options.

Not strictly necessary, as a comparable patch can be applied locally each time,
but would be convenient for out of tree dev. Plan would be to drop the weak
attribute at the same time as introducing a working allocator to trunk.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102499
2021-05-21 16:09:22 +01:00
Jon Chesterfield 10de217209 [libomptarget][amdgpu] Fix truncation error for partial wavefront
The partial barrier implementation involves one wavefront resetting and N-1
waiting. This change future proofs against launching with a number of threads
that is not a multiple of the wavefront size.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102407
2021-05-13 17:31:57 +01:00
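The message of commit 10de217209 does not show the arithmetic involved, so the following is only an assumed illustration of how truncation can arise when the thread count is not a multiple of the wavefront size. `wavesTruncating` and `wavesRoundedUp` are hypothetical helpers, not functions from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Plain integer division drops a trailing partial wavefront.
inline uint32_t wavesTruncating(uint32_t Threads, uint32_t WaveSize) {
  assert(WaveSize != 0);
  return Threads / WaveSize;
}

// Rounding up counts a partial wavefront as one more participant.
inline uint32_t wavesRoundedUp(uint32_t Threads, uint32_t WaveSize) {
  assert(WaveSize != 0);
  return (Threads + WaveSize - 1) / WaveSize;
}
```

With 96 threads and a wavefront size of 64, the truncating form counts one wavefront while two actually take part; a barrier that waits for N wavefronts would then use the wrong N.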