[libomptarget][amdgpu] Call into deviceRTL instead of ockl
Amdgpu codegen presently emits a call into ockl. The same functionality
is already present in the deviceRTL, so this adds an amdgpu specific entry
point there to avoid the dependency. This lets simple openmp code
(specifically, code that doesn't use libm) run without rocm device libraries
installed.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D93356
[libomptarget][nfc] Replace static const with enum
Semantically identical. Replaces 0xff... with ~0 to avoid counting the f's.
Has the advantage that the compiler doesn't need to prove the 4/8 byte
value dead before discarding it, and sidesteps the question of what static
means in a single source language.
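A minimal sketch of the substitution, for illustration only. The names
kAllLanesObject and kAllLanes are hypothetical and do not come from the patch;
the point is that the enumerator has the same value but no storage.

```cpp
#include <stdint.h>

// Hypothetical names; the constants touched by the patch are not shown here.

// Before: a namespace-scope object. If it is never odr-used the compiler
// can drop it, but it first has to establish that.
static const uint32_t kAllLanesObject = 0xffffffff;

// After: an enumerator is a compile-time value with no storage at all.
// ~0u sets every bit without spelling out eight f's.
enum : uint32_t { kAllLanes = ~0u };

static_assert(kAllLanes == 0xffffffffu, "~0u has all 32 bits set");
static_assert(kAllLanes == kAllLanesObject, "both spellings agree");
```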
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D93328
[AMDGPU] Add __builtin_amdgcn_grid_size
Similar to D76772, loads the data from the dispatch pointer. Marked invariant.
Patch also updates the openmp devicertl to use this builtin.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D90251
[libomptarget][amdgcn] Implement missing symbols in deviceRTL
Malloc and wtime are stubs. Malloc needs a hostrpc implementation, which is
a work in progress. Wtime needs some experimentation to find the multiplier
that yields a time in seconds, as documentation is scarce.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D89725
[libomptarget][amdgcn] Implement partial barrier
named_sync is used to coordinate non-spmd kernels. This uses bar.sync on nvptx.
There is no corresponding ISA support on amdgcn, so this is implemented using
shared memory, one word initialized to zero.
Each wave increments the variable by one. Whichever wave is last is responsible
for resetting the variable to zero, at which point it and the others continue.
The race condition on a wave reaching the barrier before another wave has
noticed that it has been released is handled with a generation counter, packed
into the same word.
Uses a shared variable that is not needed on nvptx. Introduces a new hook,
kmpc_impl_target_init, to allow different targets to do extra initialization.
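The scheme described above can be sketched on the host with std::atomic
standing in for the shared memory word and threads standing in for waves.
This is an illustration under assumptions, not the code in the patch: the
16/16 bit split between count and generation, and the names partial_barrier
and run_rounds, are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <stdint.h>
#include <thread>
#include <vector>

// Assumed layout: low 16 bits hold the arrival count, high 16 bits hold
// the generation. The real packing may differ.
constexpr uint32_t kCountMask = 0xffffu;
constexpr uint32_t kGenerationOne = 0x10000u;

void partial_barrier(std::atomic<uint32_t> &word, uint32_t num_waves) {
  assert(num_waves > 0 && num_waves <= kCountMask);
  // Each wave increments the word by one on arrival.
  uint32_t before = word.fetch_add(1u, std::memory_order_acq_rel);
  uint32_t generation = before & ~kCountMask;
  if ((before & kCountMask) + 1u == num_waves) {
    // Last arrival: reset the count to zero and advance the generation
    // with a single store, which releases the waiting waves.
    word.store(generation + kGenerationOne, std::memory_order_release);
    return;
  }
  // Wait for the generation to change rather than for the count to hit
  // zero, so a fast wave re-entering the barrier cannot confuse a slow one.
  while ((word.load(std::memory_order_acquire) & ~kCountMask) == generation)
    std::this_thread::yield();
}

// Drives num_waves threads through the barrier repeatedly and reports
// whether every thread saw all arrivals of the round after each barrier.
bool run_rounds(uint32_t num_waves, uint32_t rounds) {
  std::atomic<uint32_t> word{0};
  std::atomic<uint32_t> arrivals{0};
  std::atomic<bool> ok{true};
  std::vector<std::thread> waves;
  for (uint32_t w = 0; w < num_waves; ++w) {
    waves.emplace_back([&] {
      for (uint32_t r = 1; r <= rounds; ++r) {
        arrivals.fetch_add(1u);
        partial_barrier(word, num_waves);
        if (arrivals.load() < r * num_waves)
          ok.store(false);
      }
    });
  }
  for (auto &t : waves)
    t.join();
  return ok.load() && (word.load() & kCountMask) == 0u;
}
```

Waiting on the generation bits is what handles the race described above: a
wave that has not yet noticed its release still sees a changed generation,
even if another wave has already incremented the count for the next round.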
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D88602
[nfc][libomptarget] Drop parameter to named_sync
named_sync has one call site (in sync.cu) where it always passed L1_BARRIER.
Folding this into the call site and dropping the macro is a simplification.
amdgpu doesn't have ptx's bar.sync instruction. A correct implementation of
__kmpc_impl_named_sync in terms of shared memory is much easier if it can
assume that the barrier argument is this constant. Said implementation is left
for a second patch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D88474
This function uses __builtin_amdgcn_atomic_inc32():
uint32_t atomicInc(uint32_t *address, uint32_t max);
These functions use __builtin_amdgcn_fence():
__kmpc_impl_threadfence()
__kmpc_impl_threadfence_block()
__kmpc_impl_threadfence_system()
They replace the current mechanism of calling IR functions directly.
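For reference, the wrapping behaviour that CUDA and HIP document for
atomicInc can be emulated portably with a compare-and-swap loop. The sketch
below is not the builtin and not the code in this patch; the name
atomic_inc_emulated is hypothetical and sequentially consistent ordering is
an assumption.

```cpp
#include <stdint.h>

// Emulates atomicInc: stores ((old >= max) ? 0 : old + 1) and returns old.
// Uses the __atomic builtins available in clang and gcc.
uint32_t atomic_inc_emulated(uint32_t *address, uint32_t max) {
  uint32_t old = __atomic_load_n(address, __ATOMIC_RELAXED);
  uint32_t next;
  do {
    next = (old >= max) ? 0u : old + 1u;
    // On failure the builtin refreshes 'old' with the current value,
    // so the loop recomputes 'next' from up to date data.
  } while (!__atomic_compare_exchange_n(address, &old, next, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
  return old;
}
```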
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83132
Summary:
[libomptarget][nfc] Move non-freestanding headers out of common
Lowers the bar for building deviceRTL.
Drops math.h entirely as it wasn't used and libm is a big dependency.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77071
Summary:
[libomptarget] Implement locks for amdgcn
The nvptx implementation deadlocks on amdgcn. atomic_cas with multiple
active lanes can deadlock - if one lane succeeds, all the others are locked
out. The set_lock implementation therefore runs on a single lane.
Also uses a sleep intrinsic instead of the system clock for a probably
minor performance improvement. The unset/test implementations may be revised
later, based on code size / performance or similar concerns.
This implements the lock at a per-wavefront scope. That's not strictly as
specified, since openmp describes locks in terms of threads. I think the
nvptx implementation provides true per-thread locking on volta and the same
per-warp locking on other architectures.
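A host side sketch of the lock shape described above, with std::atomic in
place of the device atomics and a yield in place of the sleep intrinsic. The
names wave_lock, set_lock, unset_lock and test_lock are hypothetical, and
lanes cannot be modelled on the host, so the single lane restriction appears
only as a comment.

```cpp
#include <atomic>
#include <cassert>
#include <stdint.h>
#include <thread>

constexpr uint32_t kUnset = 0u;
constexpr uint32_t kSet = 1u;

struct wave_lock {
  std::atomic<uint32_t> state{kUnset};
};

// Tries once to take the lock. Returns true if it was acquired.
bool test_lock(wave_lock &lock) {
  uint32_t expected = kUnset;
  return lock.state.compare_exchange_strong(expected, kSet,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed);
}

void set_lock(wave_lock &lock) {
  // On the device only one lane of the wavefront would run this loop.
  // If every active lane attempted the CAS, one would succeed and the
  // rest would spin forever in lockstep with it.
  while (!test_lock(lock))
    std::this_thread::yield(); // stands in for the sleep intrinsic
}

void unset_lock(wave_lock &lock) {
  assert(lock.state.load(std::memory_order_relaxed) == kSet);
  lock.state.store(kUnset, std::memory_order_release);
}
```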
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75546
Summary:
[libomptarget][nfc] Move GetWarp/LaneId functions into per arch code
No code change for nvptx. Amdgcn currently has two implementations of GetLaneId,
this patch keeps the one a colleague considered to be superior for our ISA.
GetWarpId is currently the same function for amdgcn and nvptx, but I think it's
cleaner to keep it grouped with all the others than to keep it in support.cu.
Reviewers: jdoerfert, grokos, ABataev
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75587
Summary:
[libomptarget] Implement hip atomic functions in terms of intrinsics
All but atomicInc can be implemented using type generic clang intrinsics.
There is not yet a corresponding intrinsic for atomicInc in clang, only one in
LLVM. This patch leaves atomicInc as an unresolved symbol.
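As an illustration of the approach, the type generic __atomic builtins
accepted by clang (and gcc) are enough to express add, exchange and
compare-and-swap in the return-the-old-value style of the hip functions. The
sketch_ prefixed names and the choice of sequentially consistent ordering are
assumptions, not taken from the patch.

```cpp
#include <stdint.h>

// Hypothetical wrappers. Each returns the value held before the update.
template <typename T> T sketch_atomic_add(T *address, T value) {
  return __atomic_fetch_add(address, value, __ATOMIC_SEQ_CST);
}

template <typename T> T sketch_atomic_exch(T *address, T value) {
  return __atomic_exchange_n(address, value, __ATOMIC_SEQ_CST);
}

template <typename T> T sketch_atomic_cas(T *address, T compare, T value) {
  // On success 'compare' is untouched and equals the old value. On failure
  // the builtin writes the current value into 'compare'. Either way it
  // holds the value observed at the time of the operation.
  __atomic_compare_exchange_n(address, &compare, value, false,
                              __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
  return compare;
}
```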
Reviewers: jdoerfert, ABataev, hfinkel, grokos, arsenm
Reviewed By: arsenm
Subscribers: sri, saiislam, wdng, jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D73076
Summary:
[libomptarget] Implement smid for amdgcn
Implementation is in a new file as it uses an intrinsic with
complicated encoding that warranted substantial comments.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72956
Summary:
[nfc][libomptarget] Refactor amdgcn target_impl
- Removes references to internal libraries from the header
- Standardises on C++ mangling for all the target_impl functions
- Update comment block
- clang-format
- Move some functions into a new target_impl.hip source file
This lays the groundwork for implementing the remaining unresolved
symbols in the target_impl.hip source.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72712
Summary:
[libomptarget][nfc] Introduce atomic wrapper function
Wraps atomic functions in a template prefixed __kmpc_atomic that
dispatches to cuda or hip atomic functions. Intended to be easily extended
to dispatch to OpenCL or C++ atomics for a third target.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: Anastasia, jvesely, mgrang, dexonsmith, llvm-commits, mgorny, jfb, openmp-commits
Tags: #openmp, #llvm
Differential Revision: https://reviews.llvm.org/D71404
Summary:
[libomptarget][nfc] Extract function from data_sharing, move to common
Finding the first active thread in the warp is different on nvptx and amdgcn,
mostly due to warp size and the desire for efficiency.
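One way to picture the difference: given a mask of active lanes, the first
active lane is the lowest set bit, and only the mask width changes between a
32 lane warp and a 64 lane wavefront. The helper below is a hypothetical
sketch, not the extracted function, and takes the mask as a parameter rather
than reading it from the hardware.

```cpp
#include <stdint.h>

// True if lane_id is the lowest numbered lane set in active_mask.
// MaskT would be uint32_t for a 32 wide warp, uint64_t for a 64 wide wave.
template <typename MaskT>
constexpr bool is_first_active_lane(MaskT active_mask, unsigned lane_id) {
  // mask & -mask isolates the lowest set bit; written as ~mask + 1 to
  // stay in unsigned arithmetic. An empty mask matches no lane.
  return (active_mask & (~active_mask + 1)) == (MaskT(1) << lane_id);
}

static_assert(is_first_active_lane<uint32_t>(0xcu, 2), "lanes 2,3 active");
static_assert(!is_first_active_lane<uint32_t>(0xcu, 3), "lane 3 is second");
```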
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71643
Summary:
[libomptarget][nfc] Move three files under common, build them for amdgcn
Change to reduction.cu to remove two dead includes, otherwise no code change.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71601
Summary:
[libomptarget][nfc] Move omp locks under target_impl
These are likely to be target specific, even down to the lock_t which is
correspondingly moved out of interface.h. The alternative is to include
interface.h in target_impl which substantially increases the scope of
those symbols.
The current nvptx implementation deadlocks on amdgcn. The preferred
implementation for that arch is still under discussion - this change
leaves declarations in target_impl.
The functions could be inline for nvptx. I'd prefer to keep the internals
hidden in the target_impl translation unit, but will add the (possibly renamed)
macros to target_impl.h if preferred.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71574
Summary:
[libomptarget][nfc] Wrap cuda min() in target_impl
nvptx forwards to cuda min, amdgcn implements directly.
Sufficient to build parallel.cu for amdgcn, added to CMakeLists.
All call sites are homogeneous except one that passes a uint32_t and an
int32_t. This could be smoothed over by taking two type parameters
and some care over the return type, but overall I think the inline
<uint32_t> calling attention to what was an implicit sign conversion
is cleaner.
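A sketch of the trade-off, using a hypothetical impl_min rather than the
wrapper added by the patch. With a single type parameter the mixed call does
not deduce, so the call site has to name the type and the sign conversion
becomes visible.

```cpp
#include <stdint.h>

// Hypothetical single type parameter min.
template <typename T> constexpr T impl_min(T x, T y) {
  return x < y ? x : y;
}

// Homogeneous call sites deduce T as before.
static_assert(impl_min(uint32_t(3), uint32_t(5)) == 3u, "plain case");

// impl_min(uint32_t(3), int32_t(-1)) would fail to deduce T. Writing
// <uint32_t> makes the conversion explicit: -1 becomes 0xffffffff, so the
// result is 3 rather than -1.
static_assert(impl_min<uint32_t>(uint32_t(3), int32_t(-1)) == 3u,
              "int32_t argument is converted to uint32_t");
```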
Reviewers: ABataev, jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71580
Summary:
This reverts commit dd8a7fcdd7.
Alexey reports undefined symbols for the new inline functions defined in target_impl.h.
This does not reproduce for me for nvptx, or amdgcn, under release or debug builds.
I believe the patch is fine, based on:
- the semantics of an inline function in C++ (the cuda INLINE functions end
up as linkonce_odr in IR), which are only legal to drop if they have no uses
- the code generated from a debug build of clang 9 does not show these undef symbols
- the tests pass
- the code is trivial
To progress from here I either need:
- A tie break - someone to play the role of CI in determining whether the patch works
- Alexey to provide sufficient information about his build for me to reproduce the failure
- Alexey to debug why the symbols are disappearing for him and report back
Reviewers: ABataev, jdoerfert, grokos
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71502
Summary:
[libomptarget] Build most of common/src for amdgcn
Excludes parallel.cu, which uses an integer min() from cuda.
Excludes support.cu, which calls malloc, which is not yet available for amdgcn.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: gregrodgers, ronlieb, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71446
Summary:
[libomptarget][nfc] Add declarations of atomic functions for amdgcn
This enables building more source for amdgcn. The functions are usually available
in a hip runtime header, but are duplicated here to decouple the implementation.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71412
Summary:
[libomptarget][nfc] Move cuda threadfence functions behind kmpc_impl
Part of building code under common/ without requiring a cuda compiler
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: ABataev
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71102
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
Repeat of D70414, with an include path fixed. Diff for sanity checking.
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author, so I would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: ABataev, jdoerfert
Reviewed By: ABataev
Subscribers: jvesely, mgorny, jfb, openmp-commits, jdoerfert
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70971
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author, so I would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Reviewed By: jdoerfert
Subscribers: jfb, Hahnfeld, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70414
Summary:
[libomptarget] Implement target_impl for amdgcn
Smallest atomic addition for a new target. Implements enough of the amdgcn
specific code that some of the source files under nvptx/src could be compiled,
without modification, to run on amdgcn.
This foreshadows a work in progress patch to move said source out of nvptx/src.
Patch based on fork at https://github.com/ROCm-Developer-Tools/llvm-project
Reviewers: ABataev, jdoerfert, grokos, ronlieb
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69718