This patch adds a new target to the OpenMP CPU offloading tests that
exercises the new driver for CPU offloading. If this all works,
we can transition to the new driver as the default.
Depends on D119613
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119736
Currently whenever we compile the device runtime we get the following
'Mapping.cpp:32:32: warning: inline function '_OMP::impl::getGridValue'
is not defined [-Wundefined-inline]' warning. This can be silenced by
removing the constexpr attribute for this function. Doing this doesn't
change the generated bitcode at all but prevents the screen from getting
filled with warnings whenever we build the runtime.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119747
This patch fixes an infinite `for` loop in `applyToShadowMapEntries`,
caused by `Itr` not being incremented in `CB`. Fixes #53727.
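One way to rule out this class of bug is to let the loop itself own the increment instead of relying on the callback. The sketch below is hypothetical: a `std::map` stands in for the shadow map, and `applyToEntries` is an illustrative name, not the runtime's actual function or signature.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical stand-in for the shadow map: host address -> payload.
using ShadowMapTy = std::map<std::uintptr_t, int>;

// Visits every entry whose key lies in [Begin, End). The loop advances the
// iterator itself, so a callback that forgets to do so cannot make the loop
// spin forever. The callback must not erase the entry it is given.
// Returns the number of visited entries.
int applyToEntries(ShadowMapTy &Map, std::uintptr_t Begin, std::uintptr_t End,
                   const std::function<void(ShadowMapTy::iterator)> &CB) {
  int Visited = 0;
  for (auto Itr = Map.lower_bound(Begin); Itr != Map.end() && Itr->first < End;
       ++Itr) {
    CB(Itr);
    ++Visited;
  }
  return Visited;
}
```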
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119471
`bug49334.cpp` directly uses `!=` to compare two floating-point values,
which is almost always wrong.
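A common alternative to exact equality is a tolerance-based comparison; for example, `0.1 + 0.2` and `0.3` differ in the last bits of an IEEE double. The helper below is a generic sketch, not the code used in the test, and its default tolerances are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true if A and B agree within a relative tolerance (scaled by the
// larger magnitude) or within an absolute tolerance for values near zero.
// The default tolerances are illustrative only.
bool almostEqual(double A, double B, double RelTol = 1e-9,
                 double AbsTol = 1e-12) {
  double Diff = std::fabs(A - B);
  return Diff <= AbsTol ||
         Diff <= RelTol * std::max(std::fabs(A), std::fabs(B));
}
```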
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119485
Currently we have a hard team limit, which is set to 65536. No matter whether the device can support more teams or the user requests more teams, if the requested number is larger than that hard limit, the kernel is always launched with the hard limit. This is far below the actual hardware limit. For example, my workstation has an RTX 2080, and its hardware grid-size limit is 2147483647, which is exactly the largest number an `int32_t` can represent. The spec does not mention any such limit. This patch simply removes it.
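With the hard limit gone, the only clamp left is the one the device reports. A minimal sketch, assuming a hypothetical helper `numTeamsToLaunch` that receives the device limit as a parameter (the real plugin queries it from the driver):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// The grid-size limit quoted above equals the largest int32_t.
constexpr std::int32_t Int32Max = std::numeric_limits<std::int32_t>::max();
static_assert(Int32Max == 2147483647, "INT32_MAX is 2^31 - 1");

// Hypothetical helper: the user's request is clamped only by the device's
// own limit, with no additional hard-coded cap such as 65536.
std::int32_t numTeamsToLaunch(std::int32_t Requested,
                              std::int32_t DeviceMaxTeams) {
  assert(Requested > 0 && DeviceMaxTeams > 0);
  return std::min(Requested, DeviceMaxTeams);
}
```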
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119313
This patch refines the logic that determines the grid size, because the previous
method could bypass the check of whether `CudaBlocksPerGrid` exceeds the actual
hardware limit.
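A structure that cannot skip the check is to compute a candidate block count on every path and apply the hardware clamp exactly once at the end. The function below is a hypothetical sketch of that shape; its name and parameters are assumptions, not the plugin's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: pick a block count from either an explicit request or
// the loop trip count, then clamp once, after all branches, so no path can
// return more blocks than the hardware allows.
std::uint64_t blocksPerGrid(std::uint64_t TripCount,
                            std::uint64_t ThreadsPerBlock,
                            std::uint64_t RequestedBlocks,
                            std::uint64_t HardwareMaxBlocks) {
  assert(ThreadsPerBlock > 0 && HardwareMaxBlocks > 0);
  std::uint64_t Blocks;
  if (RequestedBlocks > 0)
    Blocks = RequestedBlocks;
  else // ceil(TripCount / ThreadsPerBlock) without overflow
    Blocks = TripCount / ThreadsPerBlock + (TripCount % ThreadsPerBlock != 0);
  Blocks = std::max<std::uint64_t>(Blocks, 1);
  return std::min(Blocks, HardwareMaxBlocks); // the single, unavoidable check
}
```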
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119311
The 'bug49779.cpp' test has been failing recently. This is because the
runtime is sufficiently complex when using nested parallelism without
optimizations that the CUDA tools cannot statically determine the stack
size. Because of this, the kernel can exceed the thread stack size and
crash. Work around this using the 'LIBOMPTARGET_STACK_SIZE' environment
variable and add an FAQ entry for this situation.
Fixes #53670
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119357
This patch manually adds the runtime include files to the list of
dependencies when we build the bitcode runtime library. Previously if
only the header was changed we would not recompile the source files.
The solution used here isn't optimal because every source file now has a
dependency on each header file regardless of whether it was actually used by
that file.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D119254
This patch enables running the new driver tests for AMDGPU. Previously
this was disabled because some tests failed. This was only because the
new driver tests hadn't been listed as unsupported or expected to fail.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D119240
This patch changes the ValueRAII pointer to default to 'nullptr'.
Previously it was initialized as a reference to an existing
variable. The use of this variable caused overhead because the compiler could
not look through the uses and determine that it was unused if 'Active'
was not set. Because of this, accesses to the variable were left in
the runtime once compiled.
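A simplified sketch of the idea, assuming a ValueRAII-style helper reduced to a plain reference (the real helper works on the runtime's state and carries more parameters): when `Active` is false the pointer stays `nullptr`, so both the constructor and destructor reduce to no-ops that an optimizer can remove.

```cpp
#include <cassert>

// Simplified, hypothetical ValueRAII-style helper. If Active is false, Ptr
// stays nullptr and no memory is ever touched.
template <typename Ty> struct ValueRAII {
  ValueRAII(Ty &V, Ty NewValue, Ty OldValue, bool Active)
      : Ptr(Active ? &V : nullptr), Val(OldValue) {
    if (Ptr)
      *Ptr = NewValue;
  }
  ~ValueRAII() {
    if (Ptr)
      *Ptr = Val; // restore the old value on scope exit
  }
  ValueRAII(const ValueRAII &) = delete;
  ValueRAII &operator=(const ValueRAII &) = delete;

private:
  Ty *Ptr = nullptr;
  Ty Val;
};
```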
Fixes #53641
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119187
This patch completely removes the old OpenMP device runtime. Previously,
the new runtime had the prefix `libomptarget-new-` and the old runtime
was simply called `libomptarget-`. This patch makes the formerly new
runtime the only runtime available. The entire old project has been deleted,
and all references to the `libomptarget-new` runtime have been replaced
with `libomptarget-`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D118934
Due to num_threads (and probably other reasons), we cannot assume that
explicit barriers are always executed by all threads in an aligned
fashion. We could optimize them if that property can be proven, but
that is a separate matter.
This patch adds a new target to the tests to run using the new driver as
the method for generating offloading code.
Depends on D116541
Differential Revision: https://reviews.llvm.org/D118637
This patch changes the error message to mention the
documentation page for the debugging options provided by libomptarget
and the bitcode runtimes instead. It also adds some extra information to the
documentation to help users identify debugging resources more quickly.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118626
This patch reduces the shared memory size used for globalization from 2048
bytes to 512 bytes to reduce the pressure on shared memory. It also adds a
debug message to indicate when the shared memory was insufficient.
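To illustrate the trade-off, here is a host-side model of a fixed 512-byte pool with a heap fallback and a debug message when the pool runs out. It is a hypothetical sketch: the names, the bump-pointer strategy, and the message text are assumptions, not the device runtime's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <functional>

// Hypothetical model: a 512-byte buffer stands in for GPU shared memory.
constexpr std::size_t SharedMemorySize = 512;
alignas(std::max_align_t) unsigned char SharedMemory[SharedMemorySize];
std::size_t SharedMemoryUsed = 0;

bool isSharedMemory(const void *Ptr) {
  auto *P = static_cast<const unsigned char *>(Ptr);
  std::less<const unsigned char *> Less; // total order, even for unrelated pointers
  return !Less(P, SharedMemory) && Less(P, SharedMemory + SharedMemorySize);
}

void *allocGlobalized(std::size_t Bytes) {
  // Round up so every returned pointer stays suitably aligned.
  constexpr std::size_t Align = alignof(std::max_align_t);
  std::size_t Rounded = (Bytes + Align - 1) / Align * Align;
  if (Rounded <= SharedMemorySize - SharedMemoryUsed) {
    void *Ptr = SharedMemory + SharedMemoryUsed;
    SharedMemoryUsed += Rounded;
    return Ptr;
  }
  std::fprintf(stderr, "Shared memory insufficient for %zu bytes, using heap\n",
               Bytes);
  return std::malloc(Bytes);
}

void freeGlobalized(void *Ptr) {
  // Only heap fallbacks are released here; resetting the pool is not modeled.
  if (!isSharedMemory(Ptr))
    std::free(Ptr);
}
```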
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118625
OpenMP executables need to find libomp and libomptarget at runtime.
This currently requires LD_LIBRARY_PATH or the user to specify an rpath. Change
that to set the expected location of the OpenMP libraries in the install tree.
Whether rpath means rpath or runpath is system dependent. The attached test
shows that the -Wl,--disable-new-dtags control interacts correctly with this feature.
The implicit rpath field is appended to any user-specified ones, which is ideal.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D118493
Fully respect LIBOMPTARGET_BUILD_NVPTX_BCLIB. There is no CUDA toolchain dependency. This complements D118268.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118522
If we have a broken assumption, we want to print a message to the user.
If the assumption is broken by many threads in many teams, this can
become a problem. To avoid it, we use a hash that tracks whether a broken
assumption has (likely) been printed and avoid printing it again. This
is not foolproof and has some caveats that might cause problems in
the future (see comment), but it should improve the situation
considerably for now.
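A minimal sketch of hash-based deduplication, assuming a small table of atomic flags indexed by an FNV-1a hash of the assumption's source location. The table size, hash, and names are illustrative choices, not the runtime's. It also shows why the check is only "likely": two locations that hash to the same slot suppress each other.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string_view>

// Hypothetical table of "already printed" flags; static storage is
// zero-initialized, so every slot starts out false.
constexpr std::size_t NumSlots = 64;
static std::atomic<bool> AlreadyPrinted[NumSlots];

// 64-bit FNV-1a hash.
std::uint64_t fnv1a(std::string_view S) {
  std::uint64_t H = 14695981039346656037ull;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ull;
  }
  return H;
}

// Returns true if this call printed the message. Only the first thread to
// flip a slot prints; colliding locations share a slot (a known caveat).
bool reportBrokenAssumption(std::string_view Location) {
  std::size_t Slot = fnv1a(Location) % NumSlots;
  if (AlreadyPrinted[Slot].exchange(true, std::memory_order_relaxed))
    return false;
  std::printf("Assumption violated at %.*s\n",
              static_cast<int>(Location.size()), Location.data());
  return true;
}
```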
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112156
IdentTy objects are useful for debugging and profiling so we want to
keep them around in more places, especially those that have a large
impact on performance, e.g., everything related to state.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112494
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674
The old runtime is not tested by CI. Disable the build prior to the llvm-14 branch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118268
This patch changes the visibility of all constructs in the new device
RTL to be hidden by default. This is needed because the changes introduced
in D117806 made visibility no longer hidden by default for all
device compilations. This ensures that the visibility for the device
runtime library will be hidden except for the internal environment
variable. This is done to aid optimization and linking of the device
library.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117807
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simpler case is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires tracking in the
mapping table whether an entry might contain an attached pointer part.
Given that there is a single location where shadow map entries are created,
the latter is actually fairly easy as well.
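The two shortcuts can be sketched as a cheap predicate checked before any traversal. The struct, field, and function names below are hypothetical, and "small" is assumed here to mean smaller than one pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mapping-table entry with a flag recording whether a shadow
// map entry was ever created for it.
struct MappingEntry {
  std::uintptr_t HstPtrBegin = 0;
  std::int64_t Size = 0;
  bool MayContainAttachedPointers = false;
};

// The single place shadow map entries are created also sets the flag.
void recordShadowEntry(MappingEntry &Entry) {
  Entry.MayContainAttachedPointers = true;
}

bool needsShadowMapTraversal(const MappingEntry &Entry,
                             std::int64_t UpdateSize) {
  if (UpdateSize < static_cast<std::int64_t>(sizeof(void *)))
    return false; // too small to hold an attached pointer
  return Entry.MayContainAttachedPointers;
}
```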
Differential Revision: https://reviews.llvm.org/D113124
Atomic handling of map clauses was introduced to comply with the OpenMP
standard (see D104418). However, many apps won't need this feature, which
can be costly in certain situations. To allow applications to
opt out, we now introduce the `LIBOMPTARGET_MAP_FORCE_ATOMIC` environment
flag, which voids the atomicity guarantee of the standard for map clauses
again, shifting the burden to the user.
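A sketch of how such a flag might be read, assuming (not confirmed by this message) that an unset variable keeps the standard-conforming atomic behavior and that a value such as `false`, `0`, or `off` opts out. The parsing rules and function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical parser: unset keeps atomic map clauses (the standard's
// guarantee); "false", "0" or "off" in any case opts out.
bool parseForceAtomic(const char *Value) {
  if (!Value)
    return true;
  std::string S(Value);
  for (char &C : S)
    C = static_cast<char>(std::tolower(static_cast<unsigned char>(C)));
  return !(S == "false" || S == "0" || S == "off");
}

bool mapForceAtomic() {
  return parseForceAtomic(std::getenv("LIBOMPTARGET_MAP_FORCE_ATOMIC"));
}
```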
This patch also de-duplicates the code that introduces the events used
to enforce atomicity as a cleanup.
Differential Revision: https://reviews.llvm.org/D117627
The OpenMP offloading libraries are built with fixed triples and linked
in at compile time. This would cause unhelpful errors if the user
passed in the wrong expansion of the triple used for the bitcode
library. Because we only support these triples for OpenMP offloading, we
can normalize them to the full version used in the bitcode library.
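A minimal sketch of the normalization, assuming the bitcode libraries use the full triples `nvptx64-nvidia-cuda` and `amdgcn-amd-amdhsa`. The function is hypothetical and is not Clang's triple-handling API.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: expand a short offloading triple to the full form the
// bitcode library was built with; anything else is returned unchanged.
std::string normalizeOffloadTriple(const std::string &Triple) {
  if (Triple == "nvptx64" || Triple.rfind("nvptx64-", 0) == 0)
    return "nvptx64-nvidia-cuda";
  if (Triple == "amdgcn" || Triple.rfind("amdgcn-", 0) == 0)
    return "amdgcn-amd-amdhsa";
  return Triple;
}
```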
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D117634
After the changes in D117362 made variables declared inside of a target
declare directive visible outside the plugin, some variables inside the
runtime were given visibility that conflicted with their address space
type. This caused problems when shared or local memory was made
externally visible. This patch fixes this issue by making these
variables static within the module, thereby limiting their visibility
to being internal.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D117526
This patch adds the `cold` attribute to the keepAlive functions in the
RTL. This dummy function exists to keep certain RTL calls alive without
them being optimized out, but it is never called and can be declared
cold. This also helps suppress some erroneous remarks on this
function, which has weak linkage and cannot be made internal.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D117513
This patch adds the `weak` attribute to the OpenMP device environment
variable. The changes introduced in https://reviews.llvm.org/D117211
result in multiply defined symbols. Because the symbol is potentially
included once for each offloading file, we will get symbol
collisions, and because it needs to have external visibility, it should be
weak.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117231
In function `DeviceTy::getTargetPointer`, `Entry` could be `nullptr` because of
a zero-length array section. We need to check that it is a valid iterator before
using it.
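The pattern is simply "look up, then check before dereferencing". In the hypothetical sketch below, a `std::map` stands in for the mapping table and the lookup returns `nullptr` for a zero-length section; the names do not reflect `DeviceTy`'s real members.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical mapping-table entry and table.
struct Entry {
  std::uintptr_t TgtPtrBegin;
};
using MapTableTy = std::map<std::uintptr_t, Entry>;

// Yields no entry for a zero-length section or an unmapped address.
const Entry *lookupEntry(const MapTableTy &Table, std::uintptr_t HstPtr,
                         std::int64_t Size) {
  if (Size == 0)
    return nullptr;
  auto It = Table.find(HstPtr);
  return It == Table.end() ? nullptr : &It->second;
}

// Checks the lookup result before using it; 0 means "no target pointer".
std::uintptr_t targetPointerOrNull(const MapTableTy &Table,
                                   std::uintptr_t HstPtr, std::int64_t Size) {
  const Entry *E = lookupEntry(Table, HstPtr, Size);
  return E ? E->TgtPtrBegin : 0;
}
```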
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D116716
Async data movement can cause a data race if the target supports it.
Details can be found in [1]. This patch tries to fix this problem by attaching
an event to each entry of the data mapping table. Here are the details.
For each issued data movement, a new event is generated and returned to `libomptarget`
by calling `createEvent`. The event will be attached to the corresponding mapping table
entry.
For each data mapping lookup, if there is no need for a data movement, the
attached event has to be inserted into the queue to guarantee that all following
operations in the queue can only be executed once the event is fulfilled.
This design avoids synchronization on the host side.
Note that we are using CUDA terminology here. A similar mechanism is assumed to
be supported by other targets. Even if a target doesn't support it, it can
be easily implemented in the following fallback way:
- `Event` can be any kind of flag that has at least two states, 0 and 1.
- `waitEvent` can directly busy loop if `Event` is still 0.
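The fallback can be modeled on the host with an atomic flag. In this hypothetical sketch, a thread plays the role of the asynchronous copy, sets the event when it finishes, and a lookup that needs no new movement still waits on the entry's attached event before reading. Unlike the queue-based design, this busy-wait does synchronize the host; the struct and function names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Event with two states: 0 = pending, 1 = fulfilled.
struct Event {
  std::atomic<int> State{0};
};

// Called by the data-movement side once the copy has completed.
void fulfillEvent(Event &E) { E.State.store(1, std::memory_order_release); }

// Fallback waitEvent: busy loop while the event is still 0.
void waitEvent(const Event &E) {
  while (E.State.load(std::memory_order_acquire) == 0)
    std::this_thread::yield();
}

// Hypothetical mapping-table entry with the event of its last data movement.
struct MapEntry {
  int *DevicePtr = nullptr;
  Event *LastMovement = nullptr;
};

// A lookup that issues no new movement must still honor the attached event.
int readAfterMapping(const MapEntry &Entry) {
  if (Entry.LastMovement)
    waitEvent(*Entry.LastMovement);
  return *Entry.DevicePtr;
}
```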
My local test shows that `bug49334.cpp` can pass.
Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940
Reviewed By: grokos, JonChesterfield, ye-luo
Differential Revision: https://reviews.llvm.org/D104418
In most cases, hidden helper tasks behave similarly to detached tasks. That means,
for example, if we have to wait for detached tasks, we have to do the same thing
for hidden helper tasks as well. This patch adds the missing condition for hidden
helper tasks alongside the one for detached tasks.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D107316