[libomptarget][amdgpu] Fix truncation error for partial wavefront
The partial barrier implementation involves one wavefront resetting and N-1
waiting. This change future-proofs against launching with a number of threads
that is not a multiple of the wavefront size.
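A minimal sketch of the rounding involved (names and the constant are illustrative, not the runtime's own): with truncating division, e.g. 96 threads on a 64-wide wavefront would be counted as a single wavefront, so the count is rounded up instead.
```cpp
#include <cstdint>

constexpr uint32_t WavefrontSize = 64; // illustrative; wave32 targets also exist

// Count the wavefronts in a block, rounding up so a partial wavefront still counts.
uint32_t wavefrontsInBlock(uint32_t NumThreads) {
  return (NumThreads + WavefrontSize - 1) / WavefrontSize;
}
```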
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102407
[libomptarget][amdgpu] Convert an assert to print and offload_fail
The launched kernel is supposed to be present in the binary, but a not-yet-diagnosed
bug means it is missing for some of the qmcpack test cases. Changing from an assert
to a print and offload_fail should help diagnose that and similar bugs.
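An illustrative sketch of the pattern, using placeholder names rather than the plugin's actual identifiers:
```cpp
#include <cstdio>

constexpr int OffloadFail = ~0;    // stand-in for the plugin's failure code
constexpr int OffloadSuccess = 0;  // stand-in for the success code

// Report a missing kernel entry and fail the offload instead of asserting,
// so execution can continue and the failure can be diagnosed.
int checkKernelEntry(const void *KernelEntry) {
  if (!KernelEntry) {
    std::fprintf(stderr, "Kernel entry not found in device image\n");
    return OffloadFail;
  }
  return OffloadSuccess;
}
```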
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102378
Add a `REQUIRES: unified_shared_memory` option to tests that use `#pragma omp requires unified_shared_memory`.
For CUDA, the feature tag is derived from LIBOMPTARGET_DEP_CUDA_ARCH which itself is derived using [[ https://cmake.org/cmake/help/latest/module/FindCUDA.html#commands | cuda_select_nvcc_arch_flags ]]. The latter determines which compute capability the GPU in the system supports. To ensure that this is the CUDA arch being used, we could also set the `-Xopenmp-target -march=` flag.
In the absence of an NVIDIA GPU, LIBOMPTARGET_DEP_CUDA_ARCH will be 35. That is, in that case we are assuming unified_shared_memory is not available. CUDA plugin testing could be disabled entirely in this case, but this currently depends on `LIBOMPTARGET_CAN_LINK_LIBCUDA OR LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA`, not on whether the hardware is actually available.
For all other targets, nothing changes and we are assuming unified shared memory is available. This might need refinement if that turns out not to be the case.
This tries to fix the [[ http://meinersbur.de:8011/#/builders/143 | OpenMP Offloading Buildbot ]] that, although brand-new, only has a Pascal-generation (sm_61) GPU installed. Hence, tests that require unified shared memory are currently failing. I wish I had known in advance.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D101498
[libomptarget][amdgpu][nfc] Expand errorcheck macros
These macros expand to continue, which is confusing, or exit,
which is incompatible with continuing execution on offload failure.
Expanding the macros in place makes the code look untidy but the
control flow obvious and amenable to improving. In particular, exit
becomes easier to eliminate.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102230
[libomptarget][nfc] Add hook to easily disable building amdgcn bclib
This is useful when building LLVM with a toolchain that can't emit code
for amdgcn, e.g. because it overrides the include search path with headers
from another architecture, or the clang compiler is missing builtins.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102229
[libomptarget] Add support for target allocators to dynamic cuda RTL
Follow on to D102000 which introduced new calls into libcuda. This patch adds
the corresponding entry points to dynamic_cuda, fixing the build for systems
that do not have the cuda toolkit installed.
Function types and enum from https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html
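A rough sketch of how such an entry point can be resolved at runtime via dlopen/dlsym instead of linking libcuda at build time. cuMemAllocHost is a real driver-API symbol; the wrapper and helper names here are hypothetical and the types are simplified.
```cpp
#include <cstddef>
#include <dlfcn.h>

using CUresult = int; // simplified stand-in for the driver API's result enum

// Function pointer filled in lazily from the CUDA driver library.
static CUresult (*dyn_cuMemAllocHost)(void **, size_t) = nullptr;

bool resolveCuMemAllocHost() {
  void *Lib = dlopen("libcuda.so", RTLD_NOW | RTLD_GLOBAL);
  if (!Lib)
    return false;
  dyn_cuMemAllocHost = reinterpret_cast<CUresult (*)(void **, size_t)>(
      dlsym(Lib, "cuMemAllocHost"));
  return dyn_cuMemAllocHost != nullptr;
}
```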
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102169
This patch prevents runtime tests running on systems without amdgpu.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D102054
I want to start using LLVM component libraries in libomptarget
to stop duplicating implementations already available in LLVM
(e.g. LLVMObject, LLVMSupport, etc.). Without relying on LLVM in all
libomptarget builds, one has to provide a fallback implementation for
each LLVM feature used.
This is an attempt to stop supporting out-of-llvm-tree builds of libomptarget.
I understand that I may need to revert this,
if this affects downstream projects in a bad way.
Differential Revision: https://reviews.llvm.org/D101509
Summary:
The allocator interface added in D97883 allows the RTL to allocate shared and
host-pinned memory from the cuda plugin. This patch adds support for these to
the runtime.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D102000
[libomptarget][nfc] Refactor amdgpu partial barrier to simplify adding a second one
D101976 would require a second barrier instance. This NFC change to the amdgpu
plugin makes it simpler to add one (an extra global, one more line in init).
It also renames the current barrier to L0.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102016
[libomptarget][amdgpu][nfc] Remove dead code from amdgpu plugin
Drops an enum that was identical to an HSA one and localises some functions
that were only called from one TU. Covers everything internalize + adce can
identify as dead, except for msgpack::dump, which is useful when debugging.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102014
This enables the runtime tests on amdgpu targets.
Ten tests are currently marked as XFAIL on amdgcn, mostly due to
missing printf.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D99656
If available, use the clang that is already built in the same project as the
CUDA compiler, unless another executable is explicitly specified. This also
ensures that the generated deviceRTL IR is consistent with the version
of Clang.
This patch is required to reliably test OpenMP offloading in a buildbot
without either a two-stage build (e.g. with LLVM_ENABLE_RUNTIMES) or a
separately installed clang on the worker that will eventually become
outdated.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101265
The OpenMP runtime can be compiled using a CUDA installed at a non-default
location with the -DCUDA_TOOLKIT_ROOT_DIR setting. However, check-openmp
will fail afterwards because Clang needs to know where to find the CUDA
headers.
Fix by passing -cuda-path to Clang using the value of
CUDA_TOOLKIT_ROOT_DIR which has been determined by CMake. Also set
LD_LIBRARY_PATH such that it can find the cuda runtime when executing.
This will ensure that the regression tests do not depend on the current
environment, but use the environment they were configured for.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101266
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.
This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.
In cases where no RUN line was defined for a specific offloading target,
the corresponding target is declared as XFAIL. If it turns out that a test
actually supports the target, the XFAIL line can be removed.
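A hedged sketch of what such a fused, target-independent test looks like; the substitution name is the one commonly used in libomptarget's lit setup and is an assumption here, not text from this patch.
```c
// RUN: %libomptarget-compile-run-and-check-generic
#include <stdio.h>

int main(void) {
  #pragma omp target
  { /* empty target region, just exercises the offload path */ }
  printf("PASS\n"); // CHECK: PASS
  return 0;
}
```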
Differential Revision: https://reviews.llvm.org/D101326
This patch creates a separate test directory for each offloading target to be
tested. This allows testing multiple architectures in one configuration, while
still seeing all failing tests separately. The lit test names include the
target triple, so that it will be easier to spot the failing target.
This patch also allows marking tests as expected failures based on the
target triple, as the currently used triple is added to the lit "features":
```
// XFAIL: nvptx64-nvidia-cuda
```
Differential Revision: https://reviews.llvm.org/D101315
[libomptarget] Enable AMDGPU devicertl
The amdgpu devicertl is written in freestanding openmp and compiles to a
bitcode library (per listed gfx arch) with no unresolved symbols. It requires
a recent clang, preferably the one from the same monorepo checkout.
This is D98658, with printf explicitly stubbed out, after patching clang to no
longer require an llvm with the amdgpu target enabled.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101213
Summary:
This patch improves the implementation of D100774 by replacing the global
variable introduced with a function that returns a reference to an internal
one. This removes the need to define the variable in every plugin that uses it.
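A minimal sketch of the pattern described above, with illustrative names rather than the actual libomptarget identifiers: the accessor owns the storage, so plugins only call the function instead of each defining the global.
```cpp
#include <atomic>
#include <cstdint>

// A function-local static replaces the per-plugin global definition.
std::atomic<uint32_t> &getInfoLevel() {
  static std::atomic<uint32_t> InfoLevel{0};
  return InfoLevel;
}
```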
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D101102
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this will require an extern function, but it will eventually be
added to an auxiliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
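A hedged usage sketch: `__tgt_set_info_flag` is the entry point named above, while the declaration and the chosen value are illustrative; the 0x8 bit refers to the mapping-table messages added separately in D100600.
```cpp
#include <cstdint>

// Declared extern until the auxiliary OpenMP support library exists.
extern "C" void __tgt_set_info_flag(uint32_t NewInfoLevel);

int main() {
  // For example, 0x8 enables the mapping-table change messages (see D100600).
  __tgt_set_info_flag(0x8);
  // ... target regions executed after this point use the new info level ...
  return 0;
}
```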
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and makes corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will later be commonized between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
The implicitly generated mappings for allocation/deallocation in the mappers
runtime should be mapped as implicit, and there is no need to clear the
member_of flag to avoid a reference counter increment. Also, the reference
counter should not be incremented for the very first element that comes from
the mapper function.
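For context, a minimal user-defined mapper of the kind whose runtime handling is adjusted here (a hedged illustration, not code from the patch):
```c
typedef struct {
  int *Data;
  int Len;
} Vec;

// The mapper function generated for this directive performs the implicit
// allocation/deallocation mappings discussed above.
#pragma omp declare mapper(Vec v) map(v, v.Data[0 : v.Len])

void scale(Vec V) {
  #pragma omp target map(tofrom : V)
  for (int I = 0; I < V.Len; ++I)
    V.Data[I] *= 2;
}
```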
Differential Revision: https://reviews.llvm.org/D100673
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
omp_is_initial_device() is marked as a built-in function in the current
compiler, and user code guarded by this call may be optimized away,
resulting in undesired behavior in some cases. This patch provides a
possible fix for such cases by defining the routine as a variant
function and removing it from the builtin list.
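A sketch of the kind of guarded user code the description refers to (illustrative; with the routine treated as a compile-time builtin, one of these branches could be folded away incorrectly):
```c
#include <omp.h>
#include <stdio.h>

void report(void) {
  #pragma omp target
  {
    if (omp_is_initial_device())
      printf("fell back to the host\n");
    else
      printf("running on the device\n");
  }
}
```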
Differential Revision: https://reviews.llvm.org/D99447
Summary:
Remove some of the error messages printed when the CUDA plugin fails. The current error messages can be confusing because they are the first error messages printed after the async stream finds an error. This means that the printed values aren't related to what caused the issue, but are simply the last asynchronous operation that succeeded on the device. Remove these as they can be misleading.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99510
Summary:
If the call to `synchronize` fails, it will currently block the stream indefinitely if execution is continued from this point. Additionally, if the program exits it will trigger an assertion on the non-null value of the async queue and prevent the runtime from printing debugging information.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99443
Add register usage information to the runtime metadata so that it can be used during kernel launch (that change will be in a different commit). Add this information to the kernel trace.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D98829
[libomptarget] Build amdgcn devicertl by default
The cmake for this looks for an llvm install and does the right thing when
building as part of enable_runtimes. It will probably do the right thing
in other settings - at least, it won't try to build this with gcc.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98658
[libomptarget] Build amdgpu plugin by default
This will build the amdgpu plugin if cmake is able to find the hsa
runtime library, which will be the case if rocm is installed or if
the hsa library has been installed somewhere cmake looks.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98654
[libomptarget] Fix devicertl build
The target-specific functions in target_interface are extern "C", but the
implementations for nvptx mostly had C++ mangling. That worked out as
a quirk of the DEVICE macro expanding to nothing, except for shuffle.h, which
only forward declared the functions with C++ linkage.
Also implements GetWarpSize, as used by shuffle, and includes target_interface
in nvptx target_impl.cu to help catch future divergence between interface and
implementation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98651
[libomptarget] Drop assert.h, use freestanding for amdgcn devicertl
Promotes the runtime assert to a link-time error for the unimplemented
fallback functions. Enables amdgcn to build with only clang provided
headers, which makes it less likely to break other builds when enabled.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98649
[libomptarget][amdgcn] Drop use of inttypes.h, moving closer to freestanding
The glibc headers are a periodic source of problems compiling the devicertl.
This patch resolves the following error, encountered while building llvm on a
slightly different Linux system.
```
In file included from .../lib/clang/13.0.0/include/inttypes.h:21:
In file included from /usr/include/inttypes.h:25:
/usr/include/features.h:461:12: fatal error: 'sys/cdefs.h' file not found
# include <sys/cdefs.h>
^~~~~~~~~~~~~
```
As a second patch, removing assert.h from shuffle will let amdgcn build as
-ffreestanding, at which point only the headers that clang itself provides are
used and interactions with the host glibc are eliminated. Doing the same for
nvptx is complicated by printf handling but also seems worthwhile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98565
This patch adds the infrastructure for allocator support for target memory.
Three allocators are introduced for device, host and shared memory.
The corresponding API functions have the llvm_ prefix temporarily, until they become part of the OpenMP standard.
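A hedged usage sketch; the exact llvm_-prefixed names below are assumptions based on the description above (device, host, and shared allocators), with omp_target_free as the standard counterpart:
```c
#include <omp.h>
#include <stddef.h>

// Assumed temporary API names; see the note about the llvm_ prefix above.
extern void *llvm_omp_target_alloc_device(size_t Size, int DeviceNum);
extern void *llvm_omp_target_alloc_host(size_t Size, int DeviceNum);
extern void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);

void allocate_examples(void) {
  int Dev = omp_get_default_device();
  void *DevBuf = llvm_omp_target_alloc_device(1024, Dev); // device-only memory
  void *PinBuf = llvm_omp_target_alloc_host(1024, Dev);   // host-pinned memory
  void *ShrBuf = llvm_omp_target_alloc_shared(1024, Dev); // shared/migratable memory
  omp_target_free(DevBuf, Dev);
  omp_target_free(PinBuf, Dev);
  omp_target_free(ShrBuf, Dev);
}
```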
Differential Revision: https://reviews.llvm.org/D97883
The shuffle idiom is implemented differently in our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom into its
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95752
Summary:
The changes introduced in D87946 changed the API for libomptarget
functions. `__kmpc_push_target_tripcount` was a function in Clang 11.x
but was not given a backward-compatible interface. This change will
require people using Clang 13.x or 12.x to recompile their offloading
programs.
Reviewed By: jdoerfert, cchen
Differential Revision: https://reviews.llvm.org/D98358
In D97003, CUDA 9.2 became the minimum requirement for OpenMP offloading on the
NVPTX target. We no longer need macros in the source code to select the right
functions based on the CUDA version, we don't need to compile multiple bitcode
libraries for different CUDA versions for each SM, and we don't need to worry
about future compatibility with newer CUDA versions.
`-target-feature +ptx61` is used in this patch, which corresponds to the highest
PTX version that CUDA 9.2 can support.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97198
This patch just encapsulates some repeated code. To do so, it
relocates some functions from interface.cpp to omptarget.cpp. It also
adjusts them to the LLVM coding style.
This patch is almost NFC except some `DP` messages are a bit
different. For example, messages like "Entering target region" are
now emitted even if offload is disabled, but a subsequent "Offload is
disabled" is then emitted.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D97908
Without this patch, an `omp target exit data` before the runtime is
initialized produces a runtime error. This patch fixes that by
changing `__tgt_target_data_end_mapper` to call `CheckDeviceAndCtors`
like many other runtime routines.
Discussed at
<https://lists.llvm.org/pipermail/openmp-dev/2021-March/003920.html>.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D97907
Without this patch, when the offload device is set to
`omp_get_initial_device()`, the runtime fails with an error diagnostic
when entering target regions or target data regions.
However, OpenMP 5.1, sec. 2.14.5 "target Construct", "Restrictions",
p. 203, L3-5 states:
> The device clause expression must evaluate to a non-negative integer
> value that is less than or equal to the value of
> omp_get_num_devices().
Sec. 3.7.7 "omp_get_initial_device", p. 412, L2-3 states:
> The value of the device number is the value returned by the
> omp_get_num_devices routine.
Similarly, OpenMP 5.0, sec. 2.12.5 "target Construct", "Restrictions",
p. 174 L30-32 states:
> The device clause expression must evaluate to a non-negative integer
> value less than the value of omp_get_num_devices() or to the value
> of omp_get_initial_device().
This patch fixes this behavior by changing the runtime to behave as if
offloading is disabled whenever it finds the offload device (either
from a `device` clause or the default device) is set to the host
device. In the case of mandatory offloading when
`omp_get_num_devices() == 0`, it incorporates the behavior proposed
for OpenMP 5.2 in OpenMP spec github issue 2669.
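A sketch of the behavior change, assuming a simple host-fallback use case: selecting the initial device is now treated like disabled offloading rather than producing a runtime error.
```c
#include <omp.h>

void add_one(int *A, int N) {
  int Host = omp_get_initial_device();
  // Previously this produced a runtime error; now it runs as if offloading
  // were disabled, i.e. the region executes on the host.
  #pragma omp target device(Host) map(tofrom : A[0:N])
  for (int I = 0; I < N; ++I)
    A[I] += 1;
}
```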
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D97616