Currently, globals on the device will have an infinite reference count
and an unknown name when using `LIBOMPTARGET_INFO` to print the mapping
table. We already store the name of the global in the offloading entry
so we should be able to use it, although there will be no source
location. To do this we need to create a valid `ident_t` string from a
name only.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D124381
Consider checking whether a pointer has been mapped can be achieved via omp_get_mapped_ptr.
omp_target_is_present is more needed to check whether the storage being pointed is mapped.
This restore the old behavior of omp_target_is_present before D123093
Fixes https://github.com/llvm/llvm-project/issues/54899
Reviewed By: jdenny
Differential Revision: https://reviews.llvm.org/D123891
When hwloc is used and is installed outside of the default paths, the omp CMake target
needs to provide the needed include path thru the CMake target by adding it with
target_include_directories to it, so libompd gets it as well when it defines it's cmake
target using target_link_libraries.
As suggested in D122667
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D123888
The remote offloading server and plugin rely on OpenMP, so this needs to be added as a linker flag. Without this, applications segfault.
Differential Revision: https://reviews.llvm.org/D124200
This fixes a compile-time error recently introduced within the remote
offloading plugin. This patch also removes some extra linker flags that are unnecessary, and adds an explicit abseil linker flag without which we occasionally get problems.
Differential Revision: https://reviews.llvm.org/D119984
Summary:
One test had an old "unsupported" string that used the old `newDriver`
string which was removed. This test should be updated to use the
`oldDriver` one instead.
Previously an opt-in flag `-fopenmp-new-driver` was used to enable the
new offloading driver. After passing tests for a few months it should be
sufficiently mature to flip the switch and make it the default. The new
offloading driver is now enabled if there is OpenMP and OpenMP
offloading present and the new `-fno-openmp-new-driver` is not present.
The new offloading driver has three main benefits over the old method:
- Static library support
- Device-side LTO
- Unified clang driver stages
Depends on D122683
Differential Revision: https://reviews.llvm.org/D122831
Summary:
The changes made in D123177 added new targets to the
`LIBOMPTARGET_TESTED_PLUGINS` variable which was linked against when
building the `llvm-omp-target-info` tool. This caused linker errors on
the export scripts. This patch removes that dependency, it still builds
and runs as expected so I will assume it's correct.
As described in 5.1 spec
2.21.7.2 Pointer Initialization for Device Data Environments
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D123093
This patch adds the `llvm_omp_target_dynamic_shared_alloc` function to
the `omp.h` header file so users can access it by default. Also changed
the name to keep it consistent with the other target allocators. Added
some documentation so users know how to use it. Didn't add the interface
for Fortran since there's no way to test it right now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123246
The target allocators have been supported for NVPTX offloading for
awhile. The tests should use the allocators instead of calling the
functions manually. Also the comments indicating these being a preview
should be removed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123242
In a clean build directory, `check-openmp` or `check-libomptarget` will fail because of missing device RTL .bc files. Ensure that the new targets new custom targets `omptarget.devicertl.nvptx` and `omptarget.devicertl.amdgpu` (corresponding to the plugin rtl targets `omptarget.rtl.cuda`, respectively `omptarget.rlt.amdgpu` ) are dependencies of the regression tests.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D123177
This change adds support for ompt_callback_dispatch with the new
dispatch chunk type introduced in 5.2. Definitions of the new
ompt_work_loop types were also added in the header file.
Differential Revision: https://reviews.llvm.org/D122107
Latest OpenMP spec says parallel_data is NULL for initial/implicit-task-end.
We nevertheless need to cleanup the ParallelData here, as there is no other
callback for the end of the implicit parallel region. We can use the reference
stored in the TaskData.
Reviewed By: dreachem
Differential Revision: https://reviews.llvm.org/D114005
If we decided to delete a mapping entry we did not act on it right away
but first issued and waited for memory copies. In the meantime some
other thread might reuse the entry. While there was some logic to avoid
colliding on the actual "deletion" part, there were two races happening:
1) The data transfer back of the thread deleting the entry and
the data transfer back of the thread taking over the entry raced.
2) The update to the shadow map happened regardless if the entry was
actually reused by another thread which left the shadow map in a
inconsistent state.
To fix both issues we will now update the shadow map and delete the
entry only if we are sure the thread is responsible for deletion, hence
no other thread took over the entry and reused it. We also wait for a
potential former data transfer from the device to finish before we issue
another one that would race with it.
Fixes https://github.com/llvm/llvm-project/issues/54216
Differential Revision: https://reviews.llvm.org/D121058
Inline assembly is scary but we need to support it for the OpenMP GPU
device runtime. The new assumption expresses the fact that it may not
have call semantics, that is, it will not call another function but
simply perform an operation or side-effect. This is important for
reachability in the presence of inline assembly.
Differential Revision: https://reviews.llvm.org/D109986
As we mentioned in the code comments for function `ResourcePoolTy::release`,
at some point there could be two identical resources on the two sides of `Next`
mark. It is usually not an issue, unless the following case:
1. Some resources are not returned.
2. We need to iterate the pool and free the element.
That will cause double free, which is the case for event pool. Since we don't release
events hold by the data map, it can happen that the `Next` mark is not reset, and
we have two identical items in the pool. When the pool is destroyed, we will call
`cuEventDestroy` twice on the same event. In the best case, we can only observe
CUDA errors. In the worst case, it can cause internal failures in CUDART and further
crash.
This patch fixes the issue by tracking all resources that have been given using
an `unordered_set`. We don't remove it when a resource is returned. When the pool
is destroyed, we merge the pool (a `vector`) and the set. In this way, we can make
sure that the set contains all resources allocated from the device. We just need
to iterate the set and free the resource accordingly.
For now, only event pool is set to use it. Stream pool is not because we can make
sure all streams are returned when the plugin is destroyed.
Someone might be wondering, why don't we release all events hold in the data map.
That is because, plugins are determined to be destroyed *before* `libomptarget`.
If we can somehow make the plugin outlast `libomptarget`, life will be much
easier.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D122014
This patch adds the necessary AMDGPU calling convention to the ctor /
dtor kernels. These are fundamentally device kenels called by the host
on image load. Without this calling convention information the AMDGPU
plugin is unable to identify them.
Depends on D122504
Fixes#54091
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D122515
This patch solves two problems with the `HostDataToTargetMap` (HDTT
map) which caused races and crashes before:
1) Any access to the HDTT map needs to be exclusive access. This was not
the case for the "dump table" traversals that could collide with
updates by other threads. The new `Accessor` and `ProtectedObject`
wrappers will ensure we have a hard time introducing similar races in
the future. Note that we could allow multiple concurrent
read-accesses but that feature can be added to the `Accessor` API
later.
2) The elements of the HDTT map were `HostDataToTargetTy` objects which
meant that they could be copied/moved/deleted as the map was changed.
However, we sometimes kept pointers to these elements around after we
gave up the map lock which caused potential races again. The new
indirection through `HostDataToTargetMapKeyTy` will allows us to
modify the map while keeping the (interesting part of the) entries
valid. To offset potential cost we duplicate the ordering key of the
entry which avoids an additional indirect lookup.
We should replace more objects with "protected objects" as we go.
Differential Revision: https://reviews.llvm.org/D121057
The unroll pragma did not properly work as the loop bound was not known
when we optimize the runtime and we then added a "unroll disable"
metadata which prevented unrolling later when the bounds were known.
For now we manually unroll to make sure up to 16 elements are handled
nicely. This helps optimizations to look through the argument passing.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109164
Currently we set ccontext everywhere accordingly, but that causes many
unnecessary function calls. For example, in the resource pool, if we need to
resize the pool, we need to get from allocator. Each call to allocate sets the
current context once, which is unnecessary. In this patch, we set the context
only in the entry interface functions, if needed. Actually in the best way this
should be implemented via RAII, but since `cuCtxSetCurrent` could return error,
and we don't use exception, we can't stop the execution if RAII fails.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121322
This patch fixes the issue introduced in 14de0820e8 and D120089, that
if dynamic libraries are used, the `CUmodule` array could be overwritten.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121308
In SupportAndFAQ.rst, add blank lines before and after a bullet list and
sublist. This avoids an "Unepxected indentation" warning.
In Runtimes.rst, adjust the suggestion for setting LIBOMPTARGET_INFO.
The right shifts are not necessary as the bit mask values are already
correct.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119595
The modules vector was for some reason special which could lead to it
not being of the same size (=num devices). Easiest solution is to treat
it like we do all the other vectors.
An event pool, similar to the stream pool, needs to be kept per device.
For one, events are associated with cuda contexts which means we cannot
destroy the former after the latter. Also, CUDA documentation states
streams and events need to be associated with the same context, which
we did not ensure at all.
Differential Revision: https://reviews.llvm.org/D120142
There are two problems this patch tries to address:
1) We currently free resources in a random order wrt. plugin and
libomptarget destruction. This patch should ensure the CUDA plugin
is less fragile if something during the deinitialization goes wrong.
2) We need to support (hard) pause runtime calls eventually. This patch
allows us to free all associated resources, though we cannot
reinitialize the device yet.
Follow up patch will associate one event pool per device/context.
Differential Revision: https://reviews.llvm.org/D120089
Register constraint switched to "=q" which means very specifically (from
https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints)
> Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 64-bit
mode, any integer register.
Older gcc versions (8.x and below) were trying to use esi or edi for the
8 bit flag variable, but it wound up displaying this error in the end:
kmp_lock.cpp: In function ‘void __kmp_spin_backoff(kmp_backoff_t*)’:
kmp_lock.cpp:2684:1: error: unsupported size for integer register
Hence the correct restriction is "=q" instead of "=r".
Fixes: https://github.com/llvm/llvm-project/issues/53309
Differential Revision: https://reviews.llvm.org/D120519
Before this patch task priorities were ignored, that was a valid implementation
as the task priority is a hint according to OpenMP specification.
Implemented shared list of sorted (high -> low) task deques one per task
priority value. Tasks execution changed to first check if priority tasks ready
for execution exist, and these tasks executed before others;
otherwise usual tasks execution mechanics work.
Differential Revision: https://reviews.llvm.org/D119676