Summary:
This test was failing because of an implicit declaration of `printf`
which isn't legal with newer C, causing it to fail. This patch just adds
the necessary header.
When we build the libomptarget device runtime library targeting bitcode,
we need special care to make sure that certain functions are not
optimized out. This is because we manually internalize and optimize
these definitions, ignoring their standard linkage semantics. When we
build with the static library, we can maintain these semantics and we do
not need these to be kept-alive. Furthermore, if they are kept-alive it
prevents them from being removed during LTO. This prevents us from
completely internalizing `IsSPMDMode` and removing several other
functions. This patch removes these for the static library target by
using a macro definition to enable them.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D126701
This patchs adds the arguments necessary to allocate the size of the
dynamic shared memory via the `LIBOMPTARGET_SHARED_MEMORY_SIZE`
environment variable. This patch only allocates the memory, AMDGPU has a
limitation that shared memory can only be accessed from the kernel
directly. So this will currently only work with optimizations to inline
the accessor function.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D125252
Without this patch, arguments to the
`llvm::OpenMPIRBuilder::AtomicOpValue` initializer are reversed.
Reviewed By: ABataev, tianshilei1992
Differential Revision: https://reviews.llvm.org/D126619
OpenMP 5.2, sec. 10.2 "teams Construct", p. 232, L9-12 restricts what
regions can be strictly nested within a `teams` construct. This patch
relaxes Clang's enforcement of this restriction in the case of nested
`atomic` constructs unless `-fno-openmp-extensions` is specified.
Cases like the following then seem to work fine with no additional
implementation changes:
```
#pragma omp target teams map(tofrom:x)
#pragma omp atomic update
x++;
```
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D126323
Summary:
We usually used the `OMP_LIKELY` and `OMP_UNLIKELY` macros to add branch
prediction intrinsics to help the optimizer ignore unlikely loops. This
wasn't applied to this one loop so add that in.
Summary:
This patch adds the `leaf` attribute to the `vprintf` declaration in the
OpenMP runtime. This attribute allows us to determine that the `vprintf`
function will not call any functions within the translation unit,
allowing us to deduce `norecurse` attributes on the caller.
The OpenMP device offloading library is a bitcode library and thus only
expect to build and linked with the same version of clang that was used
to create it. This somewhat copmlicates the building process as we
require the Clang that was just built to be used to create the library.
This is either done with a two-step build, where OpenMP is built with
the Clang that was just installed, or through the
`-DLLLVM_ENABLE_RUNTIMES=openmp` option. This has always been the case,
but recent changes have caused this to make it difficult to build the
rest of OpenMP. This patchs adds a check to not build the OpenMP device
runtime if the current compiler is not Clang with the same version as
the LLVM installation. This should allow users to build OpenMP as a
project using any compiler without it erroring out due to the bitcode
library, but if users require it they will need to use the above methods
to compile it.
Reviewed By: jdoerfert, tianshilei1992, ye-luo
Differential Revision: https://reviews.llvm.org/D125698
Summary:
This patch allows users to compile the static library without CUDA
installed on the system. This requires the new flag `--cuda-feature` to
indicate that we need `+ptx61` in order to compile the runtime.
This patch adds the necessary CMake configuration to build a static
library version of the device runtime, `libomptarget.devicertl.a`.
Various improvements in how we handle static libraries and generating
offloading code should allow us to treat the device library as a regular
project without needing to invoke the clang front-end directly. Here we
generate a job for each offloading architecture supported. Each
offloading architecture will be embedded into the static library and
used as-needed by the host.
This library will primarily be used to replace the bitcode library when
performing LTO. Currently, we need to manually pass in the bitcode
library which requires foreknowledge of the offloading architecture.
This approach lets us handle that in the linker wrapper instead.
Furthermore this should improve our interface to the device runtime. We
can now build it fully under a release build and have all the expected
entry points, as well as supporting debug builds.
Depends on D125265 D125256 D125260 D125314 D125563
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D125315
We used to globally include the libomptarget include directory for all
projects. This caused some conflicts with the other files named
"Debug.h". This patch changes the cmake to include these files via the
target include instead.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D125563
This patche attemps to address the current warnings in the OpenMP
offloading device runtime. Previously we did not see these because we
compiled the runtime without the standard warning flags enabled.
However, these warnings are used when we now build the static library
version of this runtime. This became extremely noisy when coupled with
the fact the we compile each file roughly 32 times when all the
architectures are considered. So it would be ideal to not have all these
warnings show up when building.
Most of these errors were simply implicit switch-case fallthroughs,
which can be addressed using C++17's fallthrough attribute. Additionally
there was a volatile variable that was being casted away. This is most
likely safe to remove because we cast it away before its even used and
didn't seem to affect anything in testing.
Depends on D125260
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D125339
Currently the OpenMP offloading device runtime is only expected to be
compiled for the specific architecture it's targeting. This is
problematic if we want to make compiling the device runtime more general
via the standar `clang` driver rather than invoking the clang front-end
directly. This patch addresses this by primarily changing the declare
type to `nohost` so the host will not contain any of this code.
Additionally we forward declare the functions that are defined via
variants, otherwise these would cause problems on the host.
Reviewed By: jdoerfert, tianshilei1992
Differential Revision: https://reviews.llvm.org/D125260
Currently, globals on the device will have an infinite reference count
and an unknown name when using `LIBOMPTARGET_INFO` to print the mapping
table. We already store the name of the global in the offloading entry
so we should be able to use it, although there will be no source
location. To do this we need to create a valid `ident_t` string from a
name only.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D124381
Consider checking whether a pointer has been mapped can be achieved via omp_get_mapped_ptr.
omp_target_is_present is more needed to check whether the storage being pointed is mapped.
This restore the old behavior of omp_target_is_present before D123093
Fixes https://github.com/llvm/llvm-project/issues/54899
Reviewed By: jdenny
Differential Revision: https://reviews.llvm.org/D123891
The remote offloading server and plugin rely on OpenMP, so this needs to be added as a linker flag. Without this, applications segfault.
Differential Revision: https://reviews.llvm.org/D124200
This fixes a compile-time error recently introduced within the remote
offloading plugin. This patch also removes some extra linker flags that are unnecessary, and adds an explicit abseil linker flag without which we occasionally get problems.
Differential Revision: https://reviews.llvm.org/D119984
Summary:
One test had an old "unsupported" string that used the old `newDriver`
string which was removed. This test should be updated to use the
`oldDriver` one instead.
Previously an opt-in flag `-fopenmp-new-driver` was used to enable the
new offloading driver. After passing tests for a few months it should be
sufficiently mature to flip the switch and make it the default. The new
offloading driver is now enabled if there is OpenMP and OpenMP
offloading present and the new `-fno-openmp-new-driver` is not present.
The new offloading driver has three main benefits over the old method:
- Static library support
- Device-side LTO
- Unified clang driver stages
Depends on D122683
Differential Revision: https://reviews.llvm.org/D122831
Summary:
The changes made in D123177 added new targets to the
`LIBOMPTARGET_TESTED_PLUGINS` variable which was linked against when
building the `llvm-omp-target-info` tool. This caused linker errors on
the export scripts. This patch removes that dependency, it still builds
and runs as expected so I will assume it's correct.
As described in 5.1 spec
2.21.7.2 Pointer Initialization for Device Data Environments
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D123093
This patch adds the `llvm_omp_target_dynamic_shared_alloc` function to
the `omp.h` header file so users can access it by default. Also changed
the name to keep it consistent with the other target allocators. Added
some documentation so users know how to use it. Didn't add the interface
for Fortran since there's no way to test it right now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123246
The target allocators have been supported for NVPTX offloading for
awhile. The tests should use the allocators instead of calling the
functions manually. Also the comments indicating these being a preview
should be removed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123242
In a clean build directory, `check-openmp` or `check-libomptarget` will fail because of missing device RTL .bc files. Ensure that the new targets new custom targets `omptarget.devicertl.nvptx` and `omptarget.devicertl.amdgpu` (corresponding to the plugin rtl targets `omptarget.rtl.cuda`, respectively `omptarget.rlt.amdgpu` ) are dependencies of the regression tests.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D123177
If we decided to delete a mapping entry we did not act on it right away
but first issued and waited for memory copies. In the meantime some
other thread might reuse the entry. While there was some logic to avoid
colliding on the actual "deletion" part, there were two races happening:
1) The data transfer back of the thread deleting the entry and
the data transfer back of the thread taking over the entry raced.
2) The update to the shadow map happened regardless if the entry was
actually reused by another thread which left the shadow map in a
inconsistent state.
To fix both issues we will now update the shadow map and delete the
entry only if we are sure the thread is responsible for deletion, hence
no other thread took over the entry and reused it. We also wait for a
potential former data transfer from the device to finish before we issue
another one that would race with it.
Fixes https://github.com/llvm/llvm-project/issues/54216
Differential Revision: https://reviews.llvm.org/D121058
Inline assembly is scary but we need to support it for the OpenMP GPU
device runtime. The new assumption expresses the fact that it may not
have call semantics, that is, it will not call another function but
simply perform an operation or side-effect. This is important for
reachability in the presence of inline assembly.
Differential Revision: https://reviews.llvm.org/D109986
As we mentioned in the code comments for function `ResourcePoolTy::release`,
at some point there could be two identical resources on the two sides of `Next`
mark. It is usually not an issue, unless the following case:
1. Some resources are not returned.
2. We need to iterate the pool and free the element.
That will cause double free, which is the case for event pool. Since we don't release
events hold by the data map, it can happen that the `Next` mark is not reset, and
we have two identical items in the pool. When the pool is destroyed, we will call
`cuEventDestroy` twice on the same event. In the best case, we can only observe
CUDA errors. In the worst case, it can cause internal failures in CUDART and further
crash.
This patch fixes the issue by tracking all resources that have been given using
an `unordered_set`. We don't remove it when a resource is returned. When the pool
is destroyed, we merge the pool (a `vector`) and the set. In this way, we can make
sure that the set contains all resources allocated from the device. We just need
to iterate the set and free the resource accordingly.
For now, only event pool is set to use it. Stream pool is not because we can make
sure all streams are returned when the plugin is destroyed.
Someone might be wondering, why don't we release all events hold in the data map.
That is because, plugins are determined to be destroyed *before* `libomptarget`.
If we can somehow make the plugin outlast `libomptarget`, life will be much
easier.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D122014
This patch adds the necessary AMDGPU calling convention to the ctor /
dtor kernels. These are fundamentally device kenels called by the host
on image load. Without this calling convention information the AMDGPU
plugin is unable to identify them.
Depends on D122504
Fixes#54091
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D122515
This patch solves two problems with the `HostDataToTargetMap` (HDTT
map) which caused races and crashes before:
1) Any access to the HDTT map needs to be exclusive access. This was not
the case for the "dump table" traversals that could collide with
updates by other threads. The new `Accessor` and `ProtectedObject`
wrappers will ensure we have a hard time introducing similar races in
the future. Note that we could allow multiple concurrent
read-accesses but that feature can be added to the `Accessor` API
later.
2) The elements of the HDTT map were `HostDataToTargetTy` objects which
meant that they could be copied/moved/deleted as the map was changed.
However, we sometimes kept pointers to these elements around after we
gave up the map lock which caused potential races again. The new
indirection through `HostDataToTargetMapKeyTy` will allows us to
modify the map while keeping the (interesting part of the) entries
valid. To offset potential cost we duplicate the ordering key of the
entry which avoids an additional indirect lookup.
We should replace more objects with "protected objects" as we go.
Differential Revision: https://reviews.llvm.org/D121057
The unroll pragma did not properly work as the loop bound was not known
when we optimize the runtime and we then added a "unroll disable"
metadata which prevented unrolling later when the bounds were known.
For now we manually unroll to make sure up to 16 elements are handled
nicely. This helps optimizations to look through the argument passing.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109164
Currently we set ccontext everywhere accordingly, but that causes many
unnecessary function calls. For example, in the resource pool, if we need to
resize the pool, we need to get from allocator. Each call to allocate sets the
current context once, which is unnecessary. In this patch, we set the context
only in the entry interface functions, if needed. Actually in the best way this
should be implemented via RAII, but since `cuCtxSetCurrent` could return error,
and we don't use exception, we can't stop the execution if RAII fails.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121322
This patch fixes the issue introduced in 14de0820e8 and D120089, that
if dynamic libraries are used, the `CUmodule` array could be overwritten.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121308
The modules vector was for some reason special which could lead to it
not being of the same size (=num devices). Easiest solution is to treat
it like we do all the other vectors.
An event pool, similar to the stream pool, needs to be kept per device.
For one, events are associated with cuda contexts which means we cannot
destroy the former after the latter. Also, CUDA documentation states
streams and events need to be associated with the same context, which
we did not ensure at all.
Differential Revision: https://reviews.llvm.org/D120142
There are two problems this patch tries to address:
1) We currently free resources in a random order wrt. plugin and
libomptarget destruction. This patch should ensure the CUDA plugin
is less fragile if something during the deinitialization goes wrong.
2) We need to support (hard) pause runtime calls eventually. This patch
allows us to free all associated resources, though we cannot
reinitialize the device yet.
Follow up patch will associate one event pool per device/context.
Differential Revision: https://reviews.llvm.org/D120089
`LIBOMPTARGET_LLVM_INCLUDE_DIRS` is currently checked and included for
multiple times redundantly. This patch is simply a clean up.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121055
Libomptarget uses some shared variables to track certain internal stated
in the runtime. This causes problems when we have code that contains no
OpenMP kernels. These variables are normally initialized upon kernel
entry, but if there are no kernels we will see no initialization.
Currently we load the runtime into each source file when not running in
LTO mode, so these variables will be erroneously considered undefined or
dead and removed, causing miscompiles. This patch temporarily works
around the most obvious case, but others still exhibit this problem. We
will need to fix this more soundly later.
Fixes#54208.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D121007
When using asynchronous plugin calls, shadow pointer restore could happen before the D2H copy for the entire struct has completed, effectively leaving a device pointer in a host struct.
This patch fixes the problem by delaying restore's to after a synchronization happens (target regions) and by calling early synchronization (target update).
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119968
The runtime uses thread state values to indicate when we use an ICV or
are in nested parallelism. This is done for OpenMP correctness, but it
not needed in the majority of cases. The new flag added is
`-fopenmp-assume-no-thread-state`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D120106
`bug49334.cpp` has one issue that causes flaky result reported in #53730.
The root cause is `BlockedC` is never initialized but in `BlockMatMul_TargetNowait`
it is directly read and written (via `+=`). Fixes#53730.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119988
The `IsSPMD` global can only be read by threads other than the main
thread *after* initialization is complete. To allow usage of
`mapping::getBlockSize` before initialization is done, we can pass the
`IsSPMD` state explicitly. This is similar to other APIs that take
`IsSPMD` explicitly to avoid such a race, e.g.,
`mapping::isInitialThreadInLevel0(IsSPMD)`
Fixes https://github.com/llvm/llvm-project/issues/53857
This patch adds a new target to the OpenMP CPU offloading tests. This
tests the usage of the new driver for CPU offloading. If this all works
then we can move to transition to the new driver as the default.
Depends on D119613
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119736
Currently whenever we compile the device runtime we get the following
'Mapping.cpp:32:32: warning: inline function '_OMP::impl::getGridValue'
is not defined [-Wundefined-inline]' warning. This can be silenced by
removing the constexpr attribute for this function. Doing this doesn't
change the generated bitcode at all but prevents the screen from getting
filled with warnings whenver we build the runtime.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119747
This patch fixes the issue that the for loop in `applyToShadowMapEntries`
is infinite because `Itr` is not incremented in `CB`. Fixes#53727.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119471
`bug49334.cpp` directly uses `!=` to compare two floating point values,
which is almost wrong.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D119485
Currently we have a hard team limit, which is set to 65536. It says no matter whether the device can support more teams, or users set more teams, as long as it is larger than that hard limit, the final number to launch the kernel will always be that hard limit. It is way less than the actual hardware limit. For example, my workstation has GTX2080, and the hardware limit of grid size is 2147483647, which is exactly the largest number a `int32_t` can represent. There is no limitation mentioned in the spec. This patch simply removes it.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119313
This patch refines the logic to determine grid size as previous method
can escape the check of whether `CudaBlocksPerGrid` could be greater than the actual
hardware limit.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119311
The 'bug49779.cpp' test has been failing recently. This is because the
runtime is sufficiently complex when using nested parallelism without
optimizations that the CUDA tools cannot statically determine the stack
size. Because of this the kernel can exceed the thread stack size and
crash. Work around this using the 'LIBOMPTARGET_STACK_SIZE' environment
variable and add an FAQ entry for this situation.
Fixes#53670
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119357
This patch manually adds the runtime include files to the list of
dependencies when we build the bitcode runtime library. Previously if
only the header was changed we would not recompile the source files.
The solution used here isn't optimal because every source file not has a
dependency on each header file regardless of if it was actually used by
that file.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D119254
This patch enables running the new driver tests for AMDGPU. Previously
this was disabled because some tests failed. This was only because the
new driver tests hadn't been listed as unsupported or expected to fail.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D119240
This patch replaces the ValueRAII pointer with a default 'nullptr'
value. Previously this was initialized as a reference to an existing
variable. The use of this variable caused overhead as the compiler could
not look through the uses and determine that it was unused if 'Active'
was not set. Because of this accesses to the variable would be left in
the runtime once compiled.
Fixes#53641
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119187
This patch completely removes the old OpenMP device runtime. Previously,
the old runtime had the prefix `libomptarget-new-` and the old runtime
was simply called `libomptarget-`. This patch makes the formerly new
runtime the only runtime available. The entire project has been deleted,
and all references to the `libomptarget-new` runtime has been replaced
with `libomptarget-`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D118934
Due to num_threads (probably also other reasons) we cannot assume
explicit barriers are always executed by all threads in an aligned
fashion. We can optimize them if that property can be proven but
that is different.
This patch adds a new target to the tests to run using the new driver as
the method for generating offloading code.
Depends on D116541
Differential Revision: https://reviews.llvm.org/D118637
This patch changes the error message to instead mention the
documentation page for the debugging options provided by libomptarget
and the bitcode runtimes. Add some extra information to the documentation to
help users more quickly identify debugging resources.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118626
Reduces the shared memory size used for globalization to 512 bytes from
2048 to reduce the pressure on shared memory. This patch ado adds a
debug mesage to indicate when the shared memory was insufficient.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118625
Openmp executables need to find libomp and libomptarget at runtime.
This currently requires LD_LIBRARY_PATH or the user to specify rpath. Change
that to set the expected location of the openmp libraries in the install tree.
Whether rpath means rpath or runpath is system dependent. The attached test
shows that the Wl,--disable-new-dtags control interacts correctly with this feature.
The implicit rpath field is appended to any user specified ones which is ideal.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D118493
Openmp executables need to find libomp and libomptarget at runtime.
This currently requires LD_LIBRARY_PATH or the user to specify rpath. Change
that to set the expected location of the openmp libraries in the install tree.
Whether rpath means rpath or runpath is system dependent. The attached test
shows that the Wl,--disable-new-dtags control interacts correctly with this feature.
The implicit rpath field is appended to any user specified ones which is ideal.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D118493
Fully respect LIBOMPTARGET_BUILD_NVPTX_BCLIB. There is no CUDA toolchain dependency. Complement D118268.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118522
If we have a broken assumption we want to print a message to the user.
If the assumption is broken by many threads in many teams this can
become a problem. To avoid it we use a hash that tracks if a broken
assumption has (likely) been printed and avoid printing it again. This
is not fool proof and has some caveats that might cause problems in
the future (see comment) but it should improve the situation
considerably for now.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112156
IdentTy objects are useful for debugging and profiling so we want to
keep them around in more places, especially those that have a large
impact on performance, e.g., everything related to state.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112494
This implements the runtime portion of the interop directive.
It expects the frontend and IRBuilder portions to be in place
for proper execution. It currently works only for GPUs
and has several TODOs that should be addressed going forward.
Reviewed By: RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D106674
The old runtime is not tested by CI. Disable the build prior to the llvm-14 branch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118268
This patch changes the visibility for all construct in the new device
RTL to be hidden by default. This is done after the changes introduced
in D117806 changed the visibility from being hidden by default for all
device compilations. This asserts that the visibility for the device
runtime library will be hidden except for the internal environment
variable. This is done to aid optimization and linking of the device
library.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117807
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Differential Revision: https://reviews.llvm.org/D113124
Atomic handling of map clauses was introduced to comply with the OpenMP
standard (see D104418). However, many apps won't need this feature which
can be costly in certain situations. To allow for applications to
opt-out we now introduce the `LIBOMPTARGET_MAP_FORCE_ATOMIC` environment
flag that voids the atomicity guarantee of the standard for map clauses
again, shifting the burden to the user.
This patch also de-duplicates the code that introduces the events used
to enforce atomicity as a cleanup.
Differential Revision: https://reviews.llvm.org/D117627
The OpenMP offloading libraries are built with fixed triples and linked
in during compile time. This would cause un-helpful errors if the user
passed in the wrong expansion of the triple used for the bitcode
library. because we only support these triples for OpenMP offloading we
can normalize them to the full verion used in the bitcode library.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D117634
After the changes in D117362 made variables declared inside of a target
declare directive visible outside the plugin, some variables inside the
runtime were given visiblity that conflicted with their address space
type. This caused problems when shared or local memory was made
externally visible. This patch fixes this issue by making these
varialbes static within the module, therefore limiting their visibility
to being internal.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D117526
After the changes in D117362 made variables declared inside of a target
declare directive visible outside the plugin, some variables inside the
runtime were given visiblity that conflicted with their address space
type. This caused problems when shared or local memory was made
externally visible. This patch fixes this issue by making these
varialbes static within the module, therefore limiting their visibility
to being internal.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D117526
This patch adds the `cold` attribute to the keepAlive functions in the
RTL. This dummy function exists to keep certain RTL calls alive without
them being optimized out, but it is never called and can be declared
cold. This also helps some erroneous remarks being given on this
function because it has weak linkage and cannot be made internal.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D117513
This patch adds the `weak` identifier to the openmp device environment
variable. The changes introduced in https://reviews.llvm.org/D117211
result in multiply defined symbols. Because the symbol is potentially
included multiple times for each offloading file we will get symbol
colisions, and because it needs to have external visiblity it should be
weak.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D117231
In function `DeviceTy::getTargetPointer`, `Entry` could be `nullptr` because of
zero length array section. We need to check if it is a valid iterator before
using it.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D116716
The async data movement can cause data race if the target supports it.
Details can be found in [1]. This patch tries to fix this problem by attaching
an event to the entry of data mapping table. Here are the details.
For each issued data movement, a new event is generated and returned to `libomptarget`
by calling `createEvent`. The event will be attached to the corresponding mapping table
entry.
For each data mapping lookup, if there is no need for a data movement, the
attached event has to be inserted into the queue to gaurantee that all following
operations in the queue can only be executed if the event is fulfilled.
This design is to avoid synchronization on host side.
Note that we are using CUDA terminolofy here. Similar mechanism is assumped to
be supported by another targets. Even if the target doesn't support it, it can
be easily implemented in the following fall back way:
- `Event` can be any kind of flag that has at least two status, 0 and 1.
- `waitEvent` can directly busy loop if `Event` is still 0.
My local test shows that `bug49334.cpp` can pass.
Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940
Reviewed By: grokos, JonChesterfield, ye-luo
Differential Revision: https://reviews.llvm.org/D104418
In most cases, hidden helper task behave similar as detached tasks. That means,
for example, if we have to wait for detached tasks, we have to do the same thing
for hidden helper tasks as well. This patch adds the missing condition for hidden
helper task accordingly along with detached task.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D107316
This patch makes some minor adjustments to `ResourcePool`:
- Don't initialize the resources if `Size` is 0 which can avoid assertion.
- Add a new interface function `clear` to release all hold resources.
- If initial size is 0, resize to 1 when the first request is encountered.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D116340
This patch changes the default aligntment from 8 to 16, and encodes this
information in the `__kmpc_alloc_shared` runtime call to communicate it
to the HeapToStack pass. The previous alignment of 8 was not sufficient
for the maximum size of primitive types on 64-bit systems, and needs to
be increaesd. This reduces the amount of space availible in the data
sharing stack, so this implementation will need to be improved later to
include the alignment requirements in the allocation call, and use it
properly in the data sharing stack in the runtime.
Depends on D115888
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115971
Currently CUDA streams are managed by `StreamManagerTy`. It works very well. Now
we have the need that some resources, such as CUDA stream and event, will be
hold by `libomptarget`. It is always good to buffer those resources. What's more
important, given the way that `libomptarget` and plugins are connected, we cannot
make sure whether plugins are still alive when `libomptarget` is destroyed. That
leads to an issue that those resouces hold by `libomptarget` might not be
released correctly. As a result, we need an unified management of all the resources
that can be shared between `libomptarget` and plugins.
`ResourcePoolTy` is designed to manage the type of resource for one device.
It has to work with an allocator which is supposed to provide `create` and
`destroy`. In this way, when the plugin is destroyed, we can make sure that
all resources allocated from native runtime library will be released correctly,
no matter whether `libomptarget` starts its destroy.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D111954
I missed the async info parameter in the first version of this API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115887
This patch extends the AMDGPU plugin for OpenMP target offloading from using a single HSA queue to multiple queues (four in this patch) per device. This enables concurrent threads to concurrently submit kernel launches to the same GPU.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115771
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D113124
and synchronous kernel launch implementations into a single
synchronous version. This patch prepares the plugin for asynchronous
implementation by:
Privatizing actual kernel launch code (valid in both cases) into
an anonymous namespace base function (submitted at D115267)
- Separating the control flow path of asynchronous and synchronous
kernel launch functions** (this diff)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115273
D113602 broke the custom state machine when a reduction is present, as
revealed by the reproducer this patch adds to the test suite. In that
case, openmp-opts changes the return value to undef in
`__kmpc_get_warp_size` (which the custom state machine calls as of
D113602). Later optimizations then optimize away the custom state
machine code as if all threads are outside the thread block, so the
target region does not execute. D114802 fixed that but didn't add a
reproducer.
This patch also adds a `__OMP_RTL_ATTRS` entry for
`__kmpc_get_warp_size` to OMPKinds.def, which D113602 missed. This
change does not seem to have any impact on the reduction problem.
Reviewed By: JonChesterfield, jdoerfert
Differential Revision: https://reviews.llvm.org/D113824
The problem with the old scheme is that we would need to keep track of
the "next region" and reset the num_threads value after it. The new RT
doesn't do it and an assertion is triggered. The old RT doesn't do it
either, I haven't tested it but I assume a num_threads clause might
impact multiple parallel regions "accidentally". Further, in SPMD mode
num_threads was simply ignored, for some reason beyond me.
In any case, parallel_51 is designed to take the clause value directly,
so let's do that instead.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D113623
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
At present, amdgpu plugin merges both asynchronous and synchronous kernel launch implementations into a single synchronous version.
This patch prepares the plugin for asynchronous implementation by:
- Privatizing actual kernel launch code (valid in both cases) into an anonymous namespace base function
Actual separation of kernel launch code (async vs sync) is a following patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115267
amdgpu plugin depends on libhsa-runtime64 library. Add runpath in case it is not on the LD_LIBRARY_PATH.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115198
Minor fix to the lit.cfg. Currently, nvptx runs the tests twice on the new runtime.
Soon, amdgpu will run them on the new runtime as well as the old.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115150
These tests tend to hang or crash on hardware that doesn't
support USM. Disabling them helps diagnose other issues. To safely
enable we require a means of testing whether USM is expected to work.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115144
This was trying to figure out the build path for amdgpu-arch, and
making assumptions about where it is which were not working on my
system. Whether a standalone build or not, we should have a proper
imported target to get the location from.
OpenMP (compiler) does not currently request any implicit kernel
arguments. OpenMP (runtime) allocates and initialises a reasonable guess at
the implicit kernel arguments anyway.
This change makes the plugin check the number of explicit arguments, instead
of all arguments, and puts the pointer to hostcall buffer in both the current
location and at the offset expected when implicit arguments are added to the
metadata by D113538.
This is intended to keep things running while fixing the oversight in the
compiler (in D113538). Once that patch lands, and a following one marks
openmp kernels that use printf such that the backend emits an args element
with the right type (instead of hidden_node), the over-allocation can be
removed and the hardcoded 8*e+3 offset replaced with one read from the
.offset of the corresponding metadata element.
Reviewed By: estewart08
Differential Revision: https://reviews.llvm.org/D114274
A function with no definition was left in the old runtime, causing
linker errors when trying to compile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D114264
Removes a +x/-x pair on the only store/load of a variable
and deletes some nearby dead code. Also reduces the size of the implicit
struct to reflect the code currently emitted by clang.
Differential Revision: https://reviews.llvm.org/D114270
The RAII class used for debugging RTL entry used a shared variable to
keep track of the current depth. This used a global initializer, which
isn't supported on AMDGPU. This patch removes the initializer and
instead sets it to zero when the state is initialized in the runtime.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D113963
Extension of D112504. Lower amdgpu printf to `__llvm_omp_vprintf`
which takes the same const char*, void* arguments as cuda vprintf and also
passes the size of the void* alloca which will be needed by a non-stub
implementation of `__llvm_omp_vprintf` for amdgpu.
This removes the amdgpu link error on any printf in a target region in favour
of silently compiling code that doesn't print anything to stdout.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D112680