The OpenMP runtime can be compiled against a CUDA installation in a
non-default location using the -DCUDA_TOOLKIT_ROOT_DIR setting. However, check-openmp
will fail afterwards because Clang needs to know where to find the CUDA
headers.
Fix this by passing -cuda-path to Clang using the value of
CUDA_TOOLKIT_ROOT_DIR determined by CMake. Also set
LD_LIBRARY_PATH so that the CUDA runtime can be found when executing tests.
This ensures that the regression tests do not depend on the current
environment, but use the environment they were configured for.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101266
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.
This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.
In cases where no RUN line was defined for a specific offloading target,
the corresponding target is declared as XFAIL. If it turns out that a test
actually supports the target, the XFAIL line can be removed.
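For illustration, a fused test header might look like this (a sketch; treat the exact generic substitution spelling as an assumption):
```
// RUN: %libomptarget-compile-run-and-check-generic
//
// Targets for which the test is not yet known to work:
// XFAIL: nvptx64-nvidia-cuda
```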
Differential Revision: https://reviews.llvm.org/D101326
This patch creates a separate test directory for each offloading target to be
tested. This makes it possible to test multiple architectures in one
configuration while still seeing all failing tests separately. The lit test
names include the target triple, making it easier to spot the failing target.
This patch also makes it possible to mark expectedly failing tests based on
the target triple, as the currently used triple is added to the lit "features":
```
// XFAIL: nvptx64-nvidia-cuda
```
Differential Revision: https://reviews.llvm.org/D101315
[libomptarget] Enable AMDGPU devicertl
The amdgpu devicertl is written in freestanding OpenMP and compiles to a
bitcode library (one per listed gfx arch) with no unresolved symbols. It requires
a recent clang, preferably the one from the same monorepo checkout.
This is D98658, with printf explicitly stubbed out, after patching clang to no
longer require an llvm build with the amdgpu target enabled.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101213
Summary:
This patch improves the implementation of D100774 by replacing the global
variable introduced with a function that returns a reference to an internal
one. This removes the need to define the variable in every plugin that uses it.
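A minimal sketch of the pattern (identifiers are illustrative, not the runtime's actual names): because the static lives inside the accessor, plugins only call the function instead of providing their own definition of the variable.
```
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> &getInfoLevelInternal() {
  // Constructed on first use; one definition shared by all callers.
  static std::atomic<uint32_t> InfoLevel{0};
  return InfoLevel;
}
```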
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D101102
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this currently requires an extern declaration, but it will
eventually be added to an auxiliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
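A hedged usage sketch (the exact signature is an assumption based on the description above; the extern declaration stands in for the support library that does not exist yet):
```
#include <cstdint>

extern "C" void __tgt_set_info_flag(uint32_t NewInfoLevel);

int main() {
  __tgt_set_info_flag(0x1); // raise the information level at runtime
  // ... target regions executed after this point honor the new level
  return 0;
}
```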
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading, with corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will eventually be shared between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
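For reference, the unified entry point has roughly the following shape (treat the exact parameter list as an assumption; the deviceRTL headers are authoritative):
```
#include <cstddef>
#include <cstdint>

struct ident_t; // source-location descriptor used throughout the runtime

extern "C" void __kmpc_parallel_51(ident_t *ident, int32_t global_tid,
                                   int32_t if_expr, int32_t num_threads,
                                   int proc_bind, void *fn, void *wrapper_fn,
                                   void **args, size_t nargs);
```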
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
The implicitly generated mappings for allocation/deallocation in the mappers
runtime should be mapped as implicit, and there is no need to clear the
member_of flag to avoid a reference counter increment. Also, the reference
counter should not be incremented for the very first element that comes from
the mapper function.
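An illustrative user-defined mapper that exercises this path (example code, not from the patch):
```
struct Vec {
  int len;
  double *data;
};
#pragma omp declare mapper(Vec v) map(v, v.data[0 : v.len])

void scale(Vec &v) {
  // Mapping v invokes the mapper, which emits the implicitly generated
  // allocation/deallocation mappings discussed above.
  #pragma omp target map(tofrom : v)
  for (int i = 0; i < v.len; ++i)
    v.data[i] *= 2.0;
}
```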
Differential Revision: https://reviews.llvm.org/D100673
Implement the remaining GOMP_* functions to support task reductions
in taskgroup, parallel, loop, and taskloop constructs. The unused mem
argument to many of the work-sharing constructs relates to the
scan directive / inscan modifier. If mem is set, each function
will call KMP_FATAL() and tell the user scan/inscan is unsupported. The
GOMP reduction implementation is kept separate from our implementation
because of how GOMP presents reduction data and computes the reductions.
GOMP expects the privatized copies to be present even after a #pragma
omp parallel reduction(task:...) region has ended so the data is stored
inside GOMP's uintptr_t* data pseudo-structure. This style is tightly
coupled with GCC compiler codegen. There also aren't any init(),
combiner(), or fini() functions in GOMP's codegen, so the two
implementations were too disparate to try to wrap GOMP's around our own.
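Standard OpenMP 5.0 code that should exercise these entry points when compiled with GCC (illustrative example, not from the patch):
```
#include <cstdio>

int main() {
  long sum = 0;
#pragma omp parallel
#pragma omp single
#pragma omp taskloop reduction(+ : sum)
  for (long i = 1; i <= 1000; ++i)
    sum += i;
  std::printf("sum = %ld\n", sum); // 500500
  return 0;
}
```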
Differential Revision: https://reviews.llvm.org/D98806
The current atfork() handler for child processes does not reset
the affinity masks array, which prevents users from setting their own
affinity in child processes.
Differential Revision: https://reviews.llvm.org/D99218
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
omp_is_initial_device() is marked as a built-in function in the current
compiler, and user code guarded by this call may be optimized away,
resulting in undesired behavior in some cases. This patch provides a
possible fix for such cases by defining the routine as a variant
function and removing it from the builtin list.
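The kind of guarded user code affected (illustrative): if the call folds as a builtin at compile time, the host branch can be eliminated even when the region falls back to host execution.
```
#include <cstdio>
#include <omp.h>

void report() {
#pragma omp target
  {
    if (omp_is_initial_device())
      printf("running on the host (fallback)\n");
    else
      printf("running on the device\n");
  }
}
```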
Differential Revision: https://reviews.llvm.org/D99447
The second argument to the strnlen_s(str, size) function should be
sizeof(str) when str is a true array of characters with known size
(instead of just a char*). Use type traits to determine whether the first
parameter is a character array and use the correct size based on that
trait.
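A minimal sketch of the trait-based dispatch (the helper name is hypothetical, not the runtime's actual code): a true char array uses its compile-time size, while a plain char* falls back to the caller-supplied bound.
```
#include <cstddef>
#include <cstring>

template <size_t N>
size_t boundedLength(const char (&str)[N], size_t /*bound*/) {
  return strnlen(str, N); // N == sizeof(str) for a real array
}

inline size_t boundedLength(const char *str, size_t bound) {
  return strnlen(str, bound); // pointer: no better information available
}
```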
Differential Revision: https://reviews.llvm.org/D98209
Summary:
Remove some of the error messages printed when the CUDA plugin fails. The current error messages can be confusing because they are the first error messages printed after the async stream finds an error. This means that the printed values aren't related to what caused the issue, but simply reflect the last asynchronous operation that succeeded on the device. Remove these as they can be misleading.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99510
Summary:
If the call to `synchronize` fails, it will currently block the stream indefinitely if execution is continued from this point. Additionally, if the program exits it will trigger an assertion on the non-null value of the async queue and prevent the runtime from printing debugging information.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99443
-- Added or moved checks to appropriate places.
-- Removed an ineffective null check where the pointer is already
dereferenced by the surrounding code.
-- Initialized variables that could otherwise be used uninitialized.
-- Added call to dlclose/FreeLibrary in OMPT tool activation.
-- Added a new build compiler definition.
Differential Revision: https://reviews.llvm.org/D98584
It is reported that after enabling hidden helper thread, the program
can hit the assertion `new_gtid < __kmp_threads_capacity` sometimes. The root
cause is explained as follows. Let's say the default `__kmp_threads_capacity` is
`N`. If hidden helper thread is enabled, `__kmp_threads_capacity` will be offset
to `N+8` by default. If the number of threads we need exceeds `N+8`, e.g. via
`num_threads` clause, we need to expand `__kmp_threads`. In
`__kmp_expand_threads`, the expansion starts from `__kmp_threads_capacity` and
repeatedly doubles it until the new capacity meets the requirement. Let's
assume the new requirement is `Y`. If `Y` happens to meet the constraint
`(N+8)*2^X=Y`, where `X` is the number of doubling iterations, the new
capacity is still not enough because 8 of the slots are reserved for hidden
helper threads.
Here is an example.
```
#include <vector>
int main(int argc, char *argv[]) {
  constexpr const size_t N = 1344;
  std::vector<int> data(N);
#pragma omp parallel for
  for (unsigned i = 0; i < N; ++i) {
    data[i] = i;
  }
#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }
  return 0;
}
```
My CPU is 20C40T, so `__kmp_threads_capacity` is 160. After the offset,
`__kmp_threads_capacity` becomes 168. Since `1344 = (160+8)*2^3`, the
assertion hits.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D98838
Add register usage information to the runtime metadata so that it can be used during kernel launch (that change will be in a different commit). Add this information to the kernel trace.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D98829
[libomptarget] Build amdgcn devicertl by default
The cmake for this looks for an llvm install and does the right thing when
building as part of enable_runtimes. It will probably do the right thing
in other settings - at least, it won't try to build this with gcc.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98658
[libomptarget] Build amdgpu plugin by default
This will build the amdgpu plugin if cmake is able to find the hsa
runtime library, which will be the case if rocm is installed or if
the hsa library has been installed somewhere cmake looks.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98654
[libomptarget] Fix devicertl build
The target-specific functions in target_interface are extern "C", but the
nvptx implementations mostly had C++ mangling. That happened to work as a
quirk of the DEVICE macro expanding to nothing, except for shuffle.h, which
only forward declared the functions with C++ linkage.
Also implements GetWarpSize, as used by shuffle, and includes target_interface
in nvptx target_impl.cu to help catch future divergence between interface and
implementation.
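A sketch of the mismatch class this fixes (illustrative, condensed into one snippet; the two parts live in separate translation units):
```
// ---- target_interface.h: the common code expects C linkage ----
extern "C" int GetWarpSize();

// ---- nvptx target_impl.cu, compiled WITHOUT including the header ----
// The definition below gets C++ mangling (_Z11GetWarpSizev), so calls
// through the extern "C" declaration fail to resolve at link time.
int GetWarpSize() { return 32; }
```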
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98651
[libomptarget] Drop assert.h, use freestanding for amdgcn devicertl
Promotes the runtime assert to a link-time error for the unimplemented
fallback functions. Enables amdgcn to build with only clang-provided
headers, which makes it less likely to break other builds when enabled.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98649
[libomptarget][amdgcn] Drop use of inttypes.h, moving closer to freestanding
The glibc headers are a periodic source of problems compiling the devicertl.
This patch resolves the following error, encountered while building llvm on a
slightly different Linux system.
```
In file included from .../lib/clang/13.0.0/include/inttypes.h:21:
In file included from /usr/include/inttypes.h:25:
/usr/include/features.h:461:12: fatal error: 'sys/cdefs.h' file not found
# include <sys/cdefs.h>
^~~~~~~~~~~~~
```
As a second patch, removing assert.h from shuffle will let amdgcn build as
-ffreestanding, at which point only the headers that clang itself provides are
used and interactions with the host glibc are eliminated. Doing the same for
nvptx is complicated by printf handling but also seems worthwhile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98565
This patch adds the infrastructure for allocator support for target memory.
Three allocators are introduced for device, host and shared memory.
The corresponding API functions have the llvm_ prefix temporarily, until they become part of the OpenMP standard.
Differential Revision: https://reviews.llvm.org/D97883
The shuffle idiom is implemented differently across our supported targets.
To reduce the "target_impl" file, we now move the shuffle idiom into its
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95752
Summary:
The changes introduced in D87946 changed the API for libomptarget
functions. `__kmpc_push_target_tripcount` was a function in Clang 11.x
but was not given a backward-compatible interface. This change will
require people using Clang 13.x or 12.x to recompile their offloading
programs.
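A hedged sketch of the resulting compatibility shim (parameter lists assumed from the Clang 11 interface and the D87946 `_mapper` API):
```
#include <cstdint>

struct ident_t; // source-location descriptor

extern "C" void __kmpc_push_target_tripcount_mapper(ident_t *Loc,
                                                    int64_t DeviceId,
                                                    uint64_t LoopTripcount);

// Legacy entry point kept so binaries built with Clang 11.x keep working.
extern "C" void __kmpc_push_target_tripcount(int64_t DeviceId,
                                             uint64_t LoopTripcount) {
  __kmpc_push_target_tripcount_mapper(nullptr, DeviceId, LoopTripcount);
}
```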
Reviewed By: jdoerfert, cchen
Differential Revision: https://reviews.llvm.org/D98358
For clang this change is NFC cleanup, because clang
never calls atomic functions from the runtime library.
Basically, pause is good in spin loops that wait for something.
Atomic CAS loops do not wait for anything;
each CAS failure means some other thread has progressed.
Performance experiments show that the pause only causes unnecessary slowdown
on CPUs with a slow pause instruction and makes no difference on CPUs with a
fast one; removing the pause also yields a smaller binary, which is good.
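A sketch of why the pause is pointless in a CAS loop: a failed compare-exchange refreshes the expected value and means another thread already made progress, so there is nothing to wait for before retrying.
```
#include <atomic>

void atomicAdd(std::atomic<float> &X, float V) {
  float Old = X.load(std::memory_order_relaxed);
  // On failure, Old is reloaded with the current value; retry at once.
  while (!X.compare_exchange_weak(Old, Old + V))
    ;
}
```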
Differential Revision: https://reviews.llvm.org/D97079
In D97003, CUDA 9.2 became the minimum requirement for OpenMP offloading on
the NVPTX target. We no longer need macros in the source code to select the
right functions based on the CUDA version, we no longer need to compile
multiple bitcode libraries for different CUDA versions for each SM, and we no
longer need to worry about future compatibility with newer CUDA versions.
`-target-feature +ptx61` is used in this patch, which corresponds to the highest
PTX version that CUDA 9.2 can support.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97198
Restrict the chunk_size * chunk_num multiplication to only occur for valid
chunk_num values, and reimplement the limit calculation to avoid overflow.
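A sketch of an overflow-free comparison (names are illustrative, not the runtime's): instead of forming chunk_size * chunk_num directly, compare against a division, which cannot overflow.
```
#include <cstdint>

// True if ChunkNum chunks of ChunkSize cover at least TripCount
// iterations, without ever computing ChunkSize * ChunkNum.
bool chunksCover(uint64_t ChunkSize, uint64_t ChunkNum, uint64_t TripCount) {
  if (ChunkSize == 0 || ChunkNum == 0)
    return false; // invalid chunking; skip the multiplication entirely
  uint64_t Needed = TripCount / ChunkSize + (TripCount % ChunkSize != 0);
  return ChunkNum >= Needed;
}
```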
Differential Revision: https://reviews.llvm.org/D96747
Added __kmpc_masked and __kmpc_end_masked entry points for the "masked"
construct. The "master" construct is deprecated. Changed the
proc-bind keyword from "master" to "primary". Use of both the master
construct and master as a proc-bind keyword is still allowed, but
deprecated.
Remove references to "master" in comments and strings, replacing them
with "primary" or "primary thread". Function names and variables were
not touched, nor were references to deprecated master construct. These
can be updated over time. No new code should refer to master.
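What the deprecation means for user code (assuming compiler support for the OpenMP 5.1 spellings):
```
void work() {
#pragma omp parallel proc_bind(primary) // formerly proc_bind(master)
  {
#pragma omp masked // lowers to __kmpc_masked / __kmpc_end_masked
    { /* executed by the primary thread only */ }
  }
}
```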
This patch just encapsulates some repeated code. To do so, it
relocates some functions from interface.cpp to omptarget.cpp. It also
adjusts them to the LLVM coding style.
This patch is almost NFC, except that some `DP` messages are slightly
different. For example, messages like "Entering target region" are
now emitted even if offload is disabled, but a subsequent "Offload is
disabled" is then emitted.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D97908
Without this patch, an `omp target exit data` before the runtime is
initialized produces a runtime error. This patch fixes that by
changing `__tgt_target_data_end_mapper` to call `CheckDeviceAndCtors`
like many other runtime routines.
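The failing pattern this fixes, reduced to an illustrative example: an exit data directive as the program's very first runtime interaction.
```
int main() {
  int X = 0;
  // Previously a runtime error when no target construct ran before it:
#pragma omp target exit data map(release : X)
  return 0;
}
```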
Discussed at
<https://lists.llvm.org/pipermail/openmp-dev/2021-March/003920.html>.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D97907
Without this patch, when the offload device is set to
`omp_get_initial_device()`, the runtime fails with an error diagnostic
when entering target regions or target data regions.
However, OpenMP 5.1, sec. 2.14.5 "target Construct", "Restrictions",
p. 203, L3-5 states:
> The device clause expression must evaluate to a non-negative integer
> value that is less than or equal to the value of
> omp_get_num_devices().
Sec. 3.7.7 "omp_get_initial_device", p. 412, L2-3 states:
> The value of the device number is the value returned by the
> omp_get_num_devices routine.
Similarly, OpenMP 5.0, sec. 2.12.5 "target Construct", "Restrictions",
p. 174 L30-32 states:
> The device clause expression must evaluate to a non-negative integer
> value less than the value of omp_get_num_devices() or to the value
> of omp_get_initial_device().
This patch fixes this behavior by changing the runtime to behave as if
offloading is disabled whenever it finds the offload device (either
from a `device` clause or the default device) is set to the host
device. In the case of mandatory offloading when
`omp_get_num_devices() == 0`, it incorporates the behavior proposed
for OpenMP 5.2 in OpenMP spec github issue 2669.
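With this patch, code like the following runs on the host instead of failing (illustrative example):
```
#include <omp.h>

int main() {
  int X = 0;
#pragma omp target device(omp_get_initial_device()) map(tofrom : X)
  X = 42;
  return X == 42 ? 0 : 1;
}
```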
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D97616
This is a preview of allocator support for target memory that depends on the
offload runtime API which allocates memory as described below.
llvm_omp_target_alloc_host(size_t size, int device_num);
-- Returns non-migratable memory owned by host.
-- Memory is accessible by host and device(s).
llvm_omp_target_alloc_shared(size_t size, int device_num);
-- Returns migratable memory owned by host and device.
-- Memory is accessible by host and device.
llvm_omp_target_alloc_device(size_t size, int device_num);
-- Returns memory owned by device.
-- Memory is only accessible by device.
New memory space and predefined allocator names are
-- llvm_omp_target_host_mem_space
-- llvm_omp_target_shared_mem_space
-- llvm_omp_target_device_mem_space
-- llvm_omp_target_host_mem_alloc
-- llvm_omp_target_shared_mem_alloc
-- llvm_omp_target_device_mem_alloc
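A hedged usage sketch of the API described above (assuming the allocation entry points return void * and that, like omp_target_alloc, the memory is released with omp_target_free):
```
#include <cstddef>
#include <omp.h>

extern "C" void *llvm_omp_target_alloc_device(size_t Size, int DeviceNum);

void demo(size_t N) {
  int Dev = 0; // illustrative device number
  double *D =
      (double *)llvm_omp_target_alloc_device(N * sizeof(double), Dev);
  // ... use D in target regions via is_device_ptr(D) ...
  omp_target_free(D, Dev);
}
```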
Differential Revision: https://reviews.llvm.org/D96669