This patch allows Openmp runtime atomic functions operating on x87 high-precision
to be present only in Openmp runtime for x86 architectures
The functions affected are:
__kmpc_atomic_10
__kmpc_atomic_20
__kmpc_atomic_cmplx10_add
__kmpc_atomic_cmplx10_div
__kmpc_atomic_cmplx10_mul
__kmpc_atomic_cmplx10_sub
__kmpc_atomic_float10_add
__kmpc_atomic_float10_div
__kmpc_atomic_float10_mul
__kmpc_atomic_float10_sub
__kmpc_atomic_float10_add_fp
__kmpc_atomic_float10_div_fp
__kmpc_atomic_float10_mul_fp
__kmpc_atomic_float10_sub_fp
__kmpc_atomic_float10_max
__kmpc_atomic_float10_min
Differential Revision: https://reviews.llvm.org/D117473
This patch changes the visibility for all construct in the new device
RTL to be hidden by default. This is done after the changes introduced
in D117806 changed the visibility from being hidden by default for all
device compilations. This asserts that the visibility for the device
runtime library will be hidden except for the internal environment
variable. This is done to aid optimization and linking of the device
library.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117807
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Differential Revision: https://reviews.llvm.org/D113124
Atomic handling of map clauses was introduced to comply with the OpenMP
standard (see D104418). However, many apps won't need this feature which
can be costly in certain situations. To allow for applications to
opt-out we now introduce the `LIBOMPTARGET_MAP_FORCE_ATOMIC` environment
flag that voids the atomicity guarantee of the standard for map clauses
again, shifting the burden to the user.
This patch also de-duplicates the code that introduces the events used
to enforce atomicity as a cleanup.
Differential Revision: https://reviews.llvm.org/D117627
The OpenMP offloading libraries are built with fixed triples and linked
in during compile time. This would cause un-helpful errors if the user
passed in the wrong expansion of the triple used for the bitcode
library. because we only support these triples for OpenMP offloading we
can normalize them to the full verion used in the bitcode library.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D117634
After the changes in D117362 made variables declared inside of a target
declare directive visible outside the plugin, some variables inside the
runtime were given visiblity that conflicted with their address space
type. This caused problems when shared or local memory was made
externally visible. This patch fixes this issue by making these
varialbes static within the module, therefore limiting their visibility
to being internal.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D117526
After the changes in D117362 made variables declared inside of a target
declare directive visible outside the plugin, some variables inside the
runtime were given visiblity that conflicted with their address space
type. This caused problems when shared or local memory was made
externally visible. This patch fixes this issue by making these
varialbes static within the module, therefore limiting their visibility
to being internal.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D117526
Add use of TPAUSE (from WAITPKG) to the runtime for Intel hardware,
with an envirable to turn it on in a particular C-state. Always uses
TPAUSE if it is selected and enabled by Intel hardware and presence of
WAITPKG, and if not, falls back to old way of checking
__kmp_use_yield, etc.
Differential Revision: https://reviews.llvm.org/D115758
This patch adds the `cold` attribute to the keepAlive functions in the
RTL. This dummy function exists to keep certain RTL calls alive without
them being optimized out, but it is never called and can be declared
cold. This also helps some erroneous remarks being given on this
function because it has weak linkage and cannot be made internal.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D117513
This is the original patch in my GNUInstallDirs series, now last to merge as the final piece!
It arose as a new draft of D28234. I initially did the unorthodox thing of pushing to that when I wasn't the original author, but since I ended up
- Using `GNUInstallDirs`, rather than mimicking it, as the original author was hesitant to do but others requested.
- Converting all the packages, not just LLVM, effecting many more projects than LLVM itself.
I figured it was time to make a new revision.
I have used this patch series (and many back-ports) as the basis of https://github.com/NixOS/nixpkgs/pull/111487 for my distro (NixOS), which was merged last spring (2021). It looked like people were generally on board in D28234, but I make note of this here in case extra motivation is useful.
---
As pointed out in the original issue, a central tension is that LLVM already has some partial support for these sorts of things. Variables like `COMPILER_RT_INSTALL_PATH` have already been dealt with. Variables like `LLVM_LIBDIR_SUFFIX` however, will require further work, so that we may use `CMAKE_INSTALL_LIBDIR`.
These remaining items will be addressed in further patches. What is here is now rote and so we should get it out of the way before dealing more intricately with the remainder.
Reviewed By: #libunwind, #libc, #libc_abi, compnerd
Differential Revision: https://reviews.llvm.org/D99484
This is the original patch in my GNUInstallDirs series, now last to merge as the final piece!
It arose as a new draft of D28234. I initially did the unorthodox thing of pushing to that when I wasn't the original author, but since I ended up
- Using `GNUInstallDirs`, rather than mimicking it, as the original author was hesitant to do but others requested.
- Converting all the packages, not just LLVM, effecting many more projects than LLVM itself.
I figured it was time to make a new revision.
I have used this patch series (and many back-ports) as the basis of https://github.com/NixOS/nixpkgs/pull/111487 for my distro (NixOS), which was merged last spring (2021). It looked like people were generally on board in D28234, but I make note of this here in case extra motivation is useful.
---
As pointed out in the original issue, a central tension is that LLVM already has some partial support for these sorts of things. Variables like `COMPILER_RT_INSTALL_PATH` have already been dealt with. Variables like `LLVM_LIBDIR_SUFFIX` however, will require further work, so that we may use `CMAKE_INSTALL_LIBDIR`.
These remaining items will be addressed in further patches. What is here is now rote and so we should get it out of the way before dealing more intricately with the remainder.
Reviewed By: #libunwind, #libc, #libc_abi, compnerd
Differential Revision: https://reviews.llvm.org/D99484
This patch adds the `weak` identifier to the openmp device environment
variable. The changes introduced in https://reviews.llvm.org/D117211
result in multiply defined symbols. Because the symbol is potentially
included multiple times for each offloading file we will get symbol
colisions, and because it needs to have external visiblity it should be
weak.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117231
In function `DeviceTy::getTargetPointer`, `Entry` could be `nullptr` because of
zero length array section. We need to check if it is a valid iterator before
using it.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D116716
The async data movement can cause data race if the target supports it.
Details can be found in [1]. This patch tries to fix this problem by attaching
an event to the entry of data mapping table. Here are the details.
For each issued data movement, a new event is generated and returned to `libomptarget`
by calling `createEvent`. The event will be attached to the corresponding mapping table
entry.
For each data mapping lookup, if there is no need for a data movement, the
attached event has to be inserted into the queue to gaurantee that all following
operations in the queue can only be executed if the event is fulfilled.
This design is to avoid synchronization on host side.
Note that we are using CUDA terminolofy here. Similar mechanism is assumped to
be supported by another targets. Even if the target doesn't support it, it can
be easily implemented in the following fall back way:
- `Event` can be any kind of flag that has at least two status, 0 and 1.
- `waitEvent` can directly busy loop if `Event` is still 0.
My local test shows that `bug49334.cpp` can pass.
Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940
Reviewed By: grokos, JonChesterfield, ye-luo
Differential Revision: https://reviews.llvm.org/D104418
Segmentation fault in ompt_tsan_dependences function due to an unchecked NULL pointer dereference is as follows:
```
ThreadSanitizer:DEADLYSIGNAL
==140865==ERROR: ThreadSanitizer: SEGV on unknown address 0x000000000050 (pc 0x7f217c2d3652 bp 0x7ffe8cfc7e00 sp 0x7ffe8cfc7d90 T140865)
==140865==The signal is caused by a READ memory access.
==140865==Hint: address points to the zero page.
/usr/bin/addr2line: DWARF error: could not find variable specification at offset 1012a
/usr/bin/addr2line: DWARF error: could not find variable specification at offset 133b5
/usr/bin/addr2line: DWARF error: could not find variable specification at offset 1371a
/usr/bin/addr2line: DWARF error: could not find variable specification at offset 13a58
#0 ompt_tsan_dependences(ompt_data_t*, ompt_dependence_t const*, int) /ptmp/bhararit/llvm-project/openmp/tools/archer/ompt-tsan.cpp:1004 (libarcher.so+0x15652)
#1 __kmpc_doacross_post /ptmp/bhararit/llvm-project/openmp/runtime/src/kmp_csupport.cpp:4280 (libomp.so+0x74d98)
#2 .omp_outlined. for_ordered_01.c:? (for_ordered_01.exe+0x5186cb)
#3 __kmp_invoke_microtask /ptmp/bhararit/llvm-project/openmp/runtime/src/z_Linux_asm.S:1166 (libomp.so+0x14e592)
#4 __kmp_invoke_task_func /ptmp/bhararit/llvm-project/openmp/runtime/src/kmp_runtime.cpp:7556 (libomp.so+0x909ad)
#5 __kmp_fork_call /ptmp/bhararit/llvm-project/openmp/runtime/src/kmp_runtime.cpp:2284 (libomp.so+0x8461a)
#6 __kmpc_fork_call /ptmp/bhararit/llvm-project/openmp/runtime/src/kmp_csupport.cpp:308 (libomp.so+0x6db55)
#7 main ??:? (for_ordered_01.exe+0x51828f)
#8 __libc_start_main ??:? (libc.so.6+0x24349)
#9 _start /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120 (for_ordered_01.exe+0x4214e9)
ThreadSanitizer can not provide additional info.
SUMMARY: ThreadSanitizer: SEGV /ptmp/bhararit/llvm-project/openmp/tools/archer/ompt-tsan.cpp:1004 in ompt_tsan_dependences(ompt_data_t*, ompt_dependence_t const*, int)
==140865==ABORTING
```
To reproduce the error, use the following openmp code snippet:
```
/* initialise testMatrixInt Matrix, cols, r and c */
#pragma omp parallel private(r,c) shared(testMatrixInt)
{
#pragma omp for ordered(2)
for (r=1; r < rows; r++) {
for (c=1; c < cols; c++) {
#pragma omp ordered depend(sink:r-1, c+1) depend(sink:r-1,c-1)
testMatrixInt[r][c] = (testMatrixInt[r-1][c] + testMatrixInt[r-1][c-1]) % cols ;
#pragma omp ordered depend (source)
}
}
}
```
Compilation:
```
clang -g -stdlib=libc++ -fsanitize=thread -fopenmp -larcher test_case.c
```
It seems like the changes introduced by the commit https://reviews.llvm.org/D114005 causes this particular SEGV while using Archer.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D115328
In most cases, hidden helper task behave similar as detached tasks. That means,
for example, if we have to wait for detached tasks, we have to do the same thing
for hidden helper tasks as well. This patch adds the missing condition for hidden
helper task accordingly along with detached task.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D107316
This patch makes some minor adjustments to `ResourcePool`:
- Don't initialize the resources if `Size` is 0 which can avoid assertion.
- Add a new interface function `clear` to release all hold resources.
- If initial size is 0, resize to 1 when the first request is encountered.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D116340
This patch changes the default aligntment from 8 to 16, and encodes this
information in the `__kmpc_alloc_shared` runtime call to communicate it
to the HeapToStack pass. The previous alignment of 8 was not sufficient
for the maximum size of primitive types on 64-bit systems, and needs to
be increaesd. This reduces the amount of space availible in the data
sharing stack, so this implementation will need to be improved later to
include the alignment requirements in the allocation call, and use it
properly in the data sharing stack in the runtime.
Depends on D115888
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115971
Currently CUDA streams are managed by `StreamManagerTy`. It works very well. Now
we have the need that some resources, such as CUDA stream and event, will be
hold by `libomptarget`. It is always good to buffer those resources. What's more
important, given the way that `libomptarget` and plugins are connected, we cannot
make sure whether plugins are still alive when `libomptarget` is destroyed. That
leads to an issue that those resouces hold by `libomptarget` might not be
released correctly. As a result, we need an unified management of all the resources
that can be shared between `libomptarget` and plugins.
`ResourcePoolTy` is designed to manage the type of resource for one device.
It has to work with an allocator which is supposed to provide `create` and
`destroy`. In this way, when the plugin is destroyed, we can make sure that
all resources allocated from native runtime library will be released correctly,
no matter whether `libomptarget` starts its destroy.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D111954
This patch allows the user to request all resources of a particular
layer (or core-attribute). The syntax of KMP_HW_SUBSET is modified
so the number of units requested is optional or can be replaced with an
'*' character.
e.g., KMP_HW_SUBSET=c:intel_atom@3 will use all the cores after offset 3
e.g., KMP_HW_SUBSET=*c:intel_core will use all the big cores
e.g., KMP_HW_SUBSET=*s,*c,1t will use all the sockets, all cores per
each socket and 1 thread per core.
Differential Revision: https://reviews.llvm.org/D115826
I missed the async info parameter in the first version of this API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115887
This patch extends the AMDGPU plugin for OpenMP target offloading from using a single HSA queue to multiple queues (four in this patch) per device. This enables concurrent threads to concurrently submit kernel launches to the same GPU.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115771
Just defensive CMake-ing. I pulled this from D115544 and D99484 which
are blocked on some lldb CI failures I don't yet understand. Hoping to land
something smaller in the meantime.
Reviewed By: #libc, ldionne
Differential Revision: https://reviews.llvm.org/D115566
The `not` program is used to test executions prefixed with `%libomptarget-run-fail-`. Currently `not` is not used for libomp tests, but might be used in the future and its dependency does not add any additional burden over the already established `FileCheck` dependency.
Required to add libomptarget testing to the Phabricator pre-merge check (see https://github.com/google/llvm-premerge-checks/issues/368)
Reviewed By: jdenny, JonChesterfield
Differential Revision: https://reviews.llvm.org/D115454
This reverts commit 492de35df4.
I tried to apply John's changes in 8d897ec915 that were expected to
fix his patch but that didn't work unfortunately.
Reverting this again to fix the macOS bots and leave him more time to
investigate the issue.
This reverts commit 797b50d4be.
See the original D99484. @mib who noticed the original problem could not longer
reproduce it, after I tried and also failed. We are threfore hoping it went
away on its own!
Reviewed By: mib
Differential Revision: https://reviews.llvm.org/D115544
Allow filtering of resources based on core attributes. There are two new
attributes added:
1) Core Type (intel_atom, intel_core)
2) Core Efficiency (integer) where the higher the efficiency, the more
performant the core
On hybrid architectures , e.g., Alder Lake, users can specify
KMP_HW_SUBSET=4c:intel_atom,4c:intel_core to select the first four Atom
and first four Big cores. The can also use the efficiency syntax. e.g.,
KMP_HW_SUBSET=2c:eff0,2c:eff1
Differential Revision: https://reviews.llvm.org/D114901
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D113124
and synchronous kernel launch implementations into a single
synchronous version. This patch prepares the plugin for asynchronous
implementation by:
Privatizing actual kernel launch code (valid in both cases) into
an anonymous namespace base function (submitted at D115267)
- Separating the control flow path of asynchronous and synchronous
kernel launch functions** (this diff)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115273
D113602 broke the custom state machine when a reduction is present, as
revealed by the reproducer this patch adds to the test suite. In that
case, openmp-opts changes the return value to undef in
`__kmpc_get_warp_size` (which the custom state machine calls as of
D113602). Later optimizations then optimize away the custom state
machine code as if all threads are outside the thread block, so the
target region does not execute. D114802 fixed that but didn't add a
reproducer.
This patch also adds a `__OMP_RTL_ATTRS` entry for
`__kmpc_get_warp_size` to OMPKinds.def, which D113602 missed. This
change does not seem to have any impact on the reduction problem.
Reviewed By: JonChesterfield, jdoerfert
Differential Revision: https://reviews.llvm.org/D113824
The problem with the old scheme is that we would need to keep track of
the "next region" and reset the num_threads value after it. The new RT
doesn't do it and an assertion is triggered. The old RT doesn't do it
either, I haven't tested it but I assume a num_threads clause might
impact multiple parallel regions "accidentally". Further, in SPMD mode
num_threads was simply ignored, for some reason beyond me.
In any case, parallel_51 is designed to take the clause value directly,
so let's do that instead.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D113623
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279