The unroll pragma did not work properly because the loop bound was not
known when we optimized the runtime. We then added "unroll disable"
metadata, which prevented unrolling later when the bounds were known.
For now we manually unroll to make sure up to 16 elements are handled
nicely. This helps optimizations look through the argument passing.
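A minimal sketch of the manual unrolling idea, assuming pointer-sized arguments are copied into a destination buffer. `copyArgsUnrolled` and its parameters are hypothetical names for illustration, not the runtime's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: copy `NumArgs` pointer-sized arguments from Src to Dst.
// Counts up to 16 go through a fall-through switch, so every copy is a
// straight-line store that needs no loop unrolling to be visible.
inline void copyArgsUnrolled(void **Dst, void *const *Src, uint32_t NumArgs) {
  assert((NumArgs == 0 || (Dst && Src)) && "null argument buffer");
  switch (NumArgs) {
  case 16: Dst[15] = Src[15]; [[fallthrough]];
  case 15: Dst[14] = Src[14]; [[fallthrough]];
  case 14: Dst[13] = Src[13]; [[fallthrough]];
  case 13: Dst[12] = Src[12]; [[fallthrough]];
  case 12: Dst[11] = Src[11]; [[fallthrough]];
  case 11: Dst[10] = Src[10]; [[fallthrough]];
  case 10: Dst[9] = Src[9]; [[fallthrough]];
  case 9: Dst[8] = Src[8]; [[fallthrough]];
  case 8: Dst[7] = Src[7]; [[fallthrough]];
  case 7: Dst[6] = Src[6]; [[fallthrough]];
  case 6: Dst[5] = Src[5]; [[fallthrough]];
  case 5: Dst[4] = Src[4]; [[fallthrough]];
  case 4: Dst[3] = Src[3]; [[fallthrough]];
  case 3: Dst[2] = Src[2]; [[fallthrough]];
  case 2: Dst[1] = Src[1]; [[fallthrough]];
  case 1: Dst[0] = Src[0]; [[fallthrough]];
  case 0:
    break;
  default:
    // More than 16 arguments: fall back to a plain loop.
    for (uint32_t I = 0; I < NumArgs; ++I)
      Dst[I] = Src[I];
    break;
  }
}
```

The loop fallback in the `default` case is an assumption of this sketch; it keeps larger argument counts correct without making them fast.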
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109164
Currently we set the context everywhere, which causes many
unnecessary function calls. For example, in the resource pool, if we need to
resize the pool, we need to get new resources from the allocator. Each call to
allocate sets the current context once, which is unnecessary. In this patch, we
set the context only in the entry interface functions, if needed. Ideally this
would be implemented via RAII, but since `cuCtxSetCurrent` can return an error
and we don't use exceptions, we can't stop the execution if the RAII setup fails.
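A minimal sketch of the pattern, with a fake stand-in for the driver call so it is self-contained. `fakeSetCurrent`, `DeviceSketch`, and `resizePool` are hypothetical names; a real plugin would call `cuCtxSetCurrent` and check its result:

```cpp
#include <cassert>

// Stand-in for the driver API; counts how often the context is set.
using FakeContext = int;
static int NumSetCurrentCalls = 0;
inline bool fakeSetCurrent(FakeContext) {
  ++NumSetCurrentCalls;
  return true; // A real call could fail, hence the bool return below.
}

struct DeviceSketch {
  FakeContext Context = 0;
  int NumAllocations = 0;

  // Internal helper: relies on the entry point having set the context.
  void allocateOne() { ++NumAllocations; }

  // Entry interface function: sets the context once and reports failure via
  // the return value (no exceptions), then runs several internal allocations.
  bool resizePool(int NewSize) {
    assert(NewSize >= 0 && "pool size must be non-negative");
    if (!fakeSetCurrent(Context))
      return false;
    for (int I = 0; I < NewSize; ++I)
      allocateOne();
    return true;
  }
};
```

Because the entry point returns a status, the caller can stop on failure, which a constructor-based RAII guard cannot signal without exceptions.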
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121322
This patch fixes the issue introduced in 14de0820e8 and D120089, that
if dynamic libraries are used, the `CUmodule` array could be overwritten.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121308
The modules vector was, for some reason, handled specially, which could lead
to it not having the same size (= number of devices) as the other vectors.
The easiest solution is to treat it like all the other vectors.
An event pool, similar to the stream pool, needs to be kept per device.
For one, events are associated with CUDA contexts, which means we cannot
destroy the former after the latter. Also, the CUDA documentation states
streams and events need to be associated with the same context, which
we did not ensure at all.
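A minimal sketch of a per-device event pool. `FakeEvent`, `EventPoolSketch`, and `makeEventPools` are hypothetical names; a real plugin would store `CUevent` handles and create them in the device's context:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event handle that remembers which device it belongs to.
struct FakeEvent {
  int DeviceId;
};

// One pool of reusable events for a single device/context.
class EventPoolSketch {
  int DeviceId;
  std::vector<FakeEvent> Free;

public:
  explicit EventPoolSketch(int DeviceId) : DeviceId(DeviceId) {}

  FakeEvent acquire() {
    if (Free.empty())
      return FakeEvent{DeviceId}; // Would create the event in this context.
    FakeEvent E = Free.back();
    Free.pop_back();
    return E;
  }

  void release(FakeEvent E) {
    assert(E.DeviceId == DeviceId && "event returned to the wrong pool");
    Free.push_back(E);
  }

  std::size_t numFree() const { return Free.size(); }
};

// One pool per device, indexed by device id.
inline std::vector<EventPoolSketch> makeEventPools(int NumDevices) {
  std::vector<EventPoolSketch> Pools;
  for (int I = 0; I < NumDevices; ++I)
    Pools.emplace_back(I);
  return Pools;
}
```

Keeping the pool next to its device makes it possible to drain the events before the owning context is destroyed.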
Differential Revision: https://reviews.llvm.org/D120142
There are two problems this patch tries to address:
1) We currently free resources in a random order wrt. plugin and
libomptarget destruction. This patch should ensure the CUDA plugin
is less fragile if something during the deinitialization goes wrong.
2) We need to support (hard) pause runtime calls eventually. This patch
allows us to free all associated resources, though we cannot
reinitialize the device yet.
A follow-up patch will associate one event pool per device/context.
Differential Revision: https://reviews.llvm.org/D120089
`LIBOMPTARGET_LLVM_INCLUDE_DIRS` is currently checked and included
multiple times, redundantly. This patch is simply a cleanup.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121055
Libomptarget uses some shared variables to track certain internal state
in the runtime. This causes problems when we have code that contains no
OpenMP kernels. These variables are normally initialized upon kernel
entry, but if there are no kernels we will see no initialization.
Currently we load the runtime into each source file when not running in
LTO mode, so these variables will be erroneously considered undefined or
dead and removed, causing miscompiles. This patch temporarily works
around the most obvious case, but others still exhibit this problem. We
will need to fix this more soundly later.
Fixes #54208.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121007
When using asynchronous plugin calls, shadow pointer restore could happen before the D2H copy for the entire struct has completed, effectively leaving a device pointer in a host struct.
This patch fixes the problem by delaying restores until after synchronization (target regions) and by synchronizing early (target update).
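A minimal sketch of deferring the restore until after synchronization. `AsyncInfoSketch`, `HostStruct`, and `copyBackStruct` are hypothetical names, and the queued lambdas only stand in for real asynchronous D2H copies:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical async handle: queued work runs only at synchronization.
struct AsyncInfoSketch {
  std::vector<std::function<void()>> PendingCopies;   // stand-in for D2H copies
  std::vector<std::function<void()>> PostSyncActions; // shadow pointer restores

  void synchronize() {
    for (auto &Copy : PendingCopies)
      Copy();
    PendingCopies.clear();
    // Restores run strictly after every copy has completed.
    for (auto &Action : PostSyncActions)
      Action();
    PostSyncActions.clear();
  }
};

struct HostStruct {
  int *Ptr;
};

inline void copyBackStruct(AsyncInfoSketch &Async, HostStruct &Host,
                           int *DevicePtr, int *HostPtr) {
  assert(HostPtr && "original host pointer must be known");
  // The struct copy overwrites the host pointer with the device pointer.
  Async.PendingCopies.push_back([&Host, DevicePtr] { Host.Ptr = DevicePtr; });
  // Restoring right now would be clobbered by the copy, so defer it.
  Async.PostSyncActions.push_back([&Host, HostPtr] { Host.Ptr = HostPtr; });
}
```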
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119968
The runtime uses thread state values to indicate when we use an ICV or
are in nested parallelism. This is done for OpenMP correctness, but it is
not needed in the majority of cases. The new flag added is
`-fopenmp-assume-no-thread-state`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120106
`bug49334.cpp` has an issue that causes the flaky results reported in #53730.
The root cause is that `BlockedC` is never initialized, but in
`BlockMatMul_TargetNowait` it is directly read and written (via `+=`).
Fixes #53730.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119988
The `IsSPMD` global can only be read by threads other than the main
thread *after* initialization is complete. To allow usage of
`mapping::getBlockSize` before initialization is done, we can pass the
`IsSPMD` state explicitly. This is similar to other APIs that take
`IsSPMD` explicitly to avoid such a race, e.g.,
`mapping::isInitialThreadInLevel0(IsSPMD)`
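A minimal sketch of passing the mode explicitly, assuming generic mode reserves one warp for the main thread. `getBlockSizeSketch`, `WarpSizeSketch`, and the `HardwareThreads` parameter are hypothetical; the real runtime queries these values from the device:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical warp size used only by this sketch.
constexpr uint32_t WarpSizeSketch = 32;

// Takes the execution mode as an argument instead of reading a global that
// other threads may not yet see as initialized.
inline uint32_t getBlockSizeSketch(uint32_t HardwareThreads, bool IsSPMD) {
  assert(HardwareThreads >= WarpSizeSketch && "block smaller than one warp");
  // Assumption: in generic mode one warp is set aside for the main thread.
  return IsSPMD ? HardwareThreads : HardwareThreads - WarpSizeSketch;
}
```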
Fixes https://github.com/llvm/llvm-project/issues/53857
This patch adds a new target to the OpenMP CPU offloading tests. This
tests the usage of the new driver for CPU offloading. If this all works,
we can transition to the new driver as the default.
Depends on D119613
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119736
Currently, whenever we compile the device runtime we get the following warning:
'Mapping.cpp:32:32: warning: inline function '_OMP::impl::getGridValue'
is not defined [-Wundefined-inline]'. This can be silenced by
removing the constexpr specifier from this function. Doing this doesn't
change the generated bitcode at all but prevents the screen from getting
filled with warnings whenever we build the runtime.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119747
This patch fixes the issue that the for loop in `applyToShadowMapEntries`
never terminates because `Itr` is not incremented in `CB`. Fixes #53727.
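A minimal sketch of this kind of loop, where the loop header has no increment and the body must advance the iterator on every path. `applyToEntries`, `EntryAction`, and `ShadowMapSketch` are hypothetical names, not the signature of `applyToShadowMapEntries`:

```cpp
#include <cassert>
#include <functional>
#include <map>

using ShadowMapSketch = std::map<int, int>;

// The callback tells the loop whether the current entry should be erased.
enum class EntryAction { Keep, Erase };

// Visits all entries with keys in [Begin, End) and returns how many were seen.
inline int applyToEntries(ShadowMapSketch &Map, int Begin, int End,
                          const std::function<EntryAction(int, int &)> &CB) {
  assert(Begin <= End && "invalid key range");
  int Visited = 0;
  for (auto Itr = Map.lower_bound(Begin);
       Itr != Map.end() && Itr->first < End;) {
    ++Visited;
    if (CB(Itr->first, Itr->second) == EntryAction::Erase)
      Itr = Map.erase(Itr); // erase returns the next valid iterator
    else
      ++Itr; // without this increment the loop never terminates
  }
  return Visited;
}
```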
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119471
`bug49334.cpp` directly uses `!=` to compare two floating point values,
which is almost always wrong.
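A minimal sketch of a tolerance-based comparison that could replace a direct `!=`. `almostEqual` and its default tolerance are assumptions for illustration, not necessarily what the test uses:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Compares two values with a tolerance scaled by their magnitude, so small
// rounding differences from reordered arithmetic do not count as mismatches.
inline bool almostEqual(double A, double B, double RelTol = 1e-6) {
  assert(RelTol >= 0.0 && "tolerance must be non-negative");
  double Scale = std::max({1.0, std::fabs(A), std::fabs(B)});
  return std::fabs(A - B) <= RelTol * Scale;
}
```

A mismatch check then reads `if (!almostEqual(Expected, Actual))` instead of `if (Expected != Actual)`.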
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119485
Currently we have a hard team limit of 65536. Regardless of whether the device can support more teams or the user requests more, any team count larger than that hard limit is clamped to it when the kernel is launched. This is far below the actual hardware limit. For example, my workstation has a GTX2080, whose hardware grid size limit is 2147483647, exactly the largest value an `int32_t` can represent. The spec mentions no such limitation. This patch simply removes it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119313
This patch refines the logic that determines the grid size, as the previous
method could skip the check of whether `CudaBlocksPerGrid` is greater than the
actual hardware limit.
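A minimal sketch of grid size selection with a single clamping point. `chooseBlocksPerGrid` and all of its parameters are hypothetical, and the selection order is an assumption rather than the plugin's exact policy:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Picks a block count from the available hints, then always clamps it to the
// device limit on the way out.
inline int64_t chooseBlocksPerGrid(int64_t RequestedTeams,
                                   int64_t LoopTripCount,
                                   int64_t ThreadsPerBlock,
                                   int64_t DefaultTeams,
                                   int64_t HardwareLimit) {
  assert(ThreadsPerBlock > 0 && HardwareLimit > 0 && DefaultTeams > 0);
  int64_t Blocks;
  if (RequestedTeams > 0)
    Blocks = RequestedTeams; // user-specified team count
  else if (LoopTripCount > 0)
    // Round up so that every iteration gets a thread.
    Blocks = (LoopTripCount + ThreadsPerBlock - 1) / ThreadsPerBlock;
  else
    Blocks = DefaultTeams;
  // Single exit point: no path can skip the hardware check.
  return std::min(Blocks, HardwareLimit);
}
```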
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119311
The 'bug49779.cpp' test has been failing recently. This is because the
runtime is sufficiently complex when using nested parallelism without
optimizations that the CUDA tools cannot statically determine the stack
size. Because of this the kernel can exceed the thread stack size and
crash. Work around this using the 'LIBOMPTARGET_STACK_SIZE' environment
variable and add an FAQ entry for this situation.
Fixes #53670
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119357
This patch manually adds the runtime include files to the list of
dependencies when we build the bitcode runtime library. Previously if
only the header was changed we would not recompile the source files.
The solution used here isn't optimal because every source file now has a
dependency on each header file regardless of whether it was actually used by
that file.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D119254
This patch enables running the new driver tests for AMDGPU. Previously
this was disabled because some tests failed. This was only because the
new driver tests hadn't been listed as unsupported or expected to fail.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D119240
This patch replaces the ValueRAII pointer with a default 'nullptr'
value. Previously this was initialized as a reference to an existing
variable. The use of this variable caused overhead as the compiler could
not look through the uses and determine that it was unused if 'Active'
was not set. Because of this accesses to the variable would be left in
the runtime once compiled.
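A simplified sketch of the idea, assuming the guarded value is a plain variable. `ValueRAIISketch` is a hypothetical reduction of the runtime's `ValueRAII`, which looks the value up through its state machinery instead:

```cpp
#include <cassert>

// Sets a value for the lifetime of the object and restores it afterwards.
// When inactive, the pointer stays null and the variable is never touched.
template <typename Ty> struct ValueRAIISketch {
  ValueRAIISketch(Ty &V, Ty NewValue, Ty OldValue, bool Active)
      : Ptr(Active ? &V : nullptr), Val(OldValue), Active(Active) {
    if (!Active)
      return;
    assert(*Ptr == OldValue && "initialized with the wrong old value");
    *Ptr = NewValue;
  }

  ~ValueRAIISketch() {
    if (Active)
      *Ptr = Val; // restore the previous value
  }

private:
  Ty *Ptr;
  Ty Val;
  bool Active;
};
```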
Fixes #53641
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119187
This patch completely removes the old OpenMP device runtime. Previously,
the new runtime had the prefix `libomptarget-new-` and the old runtime
was simply called `libomptarget-`. This patch makes the formerly new
runtime the only runtime available. The entire old runtime project has been
deleted, and all references to the `libomptarget-new` runtime have been
replaced with `libomptarget-`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D118934
Due to num_threads (probably also other reasons) we cannot assume
explicit barriers are always executed by all threads in an aligned
fashion. We can optimize them if that property can be proven but
that is different.
This patch adds a new target to the tests to run using the new driver as
the method for generating offloading code.
Depends on D116541
Differential Revision: https://reviews.llvm.org/D118637
This patch changes the error message to instead mention the
documentation page for the debugging options provided by libomptarget
and the bitcode runtimes. It also adds some extra information to the
documentation to help users identify debugging resources more quickly.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118626
Reduces the shared memory size used for globalization to 512 bytes from
2048 to reduce the pressure on shared memory. This patch also adds a
debug message to indicate when the shared memory was insufficient.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118625
OpenMP executables need to find libomp and libomptarget at runtime.
This currently requires LD_LIBRARY_PATH or the user to specify rpath. Change
that to set the expected location of the OpenMP libraries in the install tree.
Whether rpath means rpath or runpath is system dependent. The attached test
shows that the -Wl,--disable-new-dtags control interacts correctly with this feature.
The implicit rpath field is appended to any user-specified ones, which is ideal.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D118493
Fully respect LIBOMPTARGET_BUILD_NVPTX_BCLIB. There is no CUDA toolchain dependency. Complement D118268.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118522
If we have a broken assumption we want to print a message to the user.
If the assumption is broken by many threads in many teams this can
become a problem. To avoid it we use a hash that tracks if a broken
assumption has (likely) been printed and avoid printing it again. This
is not foolproof and has some caveats that might cause problems in
the future (see comment) but it should improve the situation
considerably for now.
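A minimal host-side sketch of the deduplication idea, assuming a fixed table of flags indexed by a hash of the source location. `shouldReport`, `hashLocation`, and the table size are hypothetical and do not mirror the runtime's implementation:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical table of "already reported" flags.
constexpr uint32_t NumSlots = 64;
static std::atomic<bool> Reported[NumSlots];

// FNV-1a-style hash of the file name and line, reduced to a table slot.
inline uint32_t hashLocation(const char *File, uint32_t Line) {
  uint32_t H = 2166136261u;
  for (const char *C = File; *C; ++C)
    H = (H ^ static_cast<unsigned char>(*C)) * 16777619u;
  H = (H ^ Line) * 16777619u;
  return H % NumSlots;
}

// Returns true only for the first caller that hits a given slot. Distinct
// locations that collide share a slot, so a message can be suppressed by
// mistake: the table records "likely printed", not an exact history.
inline bool shouldReport(const char *File, uint32_t Line) {
  assert(File && "file name required");
  return !Reported[hashLocation(File, Line)].exchange(true);
}
```

The atomic exchange means that when many threads break the same assumption at once, only one of them sees `true` and prints.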
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112156
IdentTy objects are useful for debugging and profiling so we want to
keep them around in more places, especially those that have a large
impact on performance, e.g., everything related to state.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112494
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674