Summary:
This patch consolidates the error handling and messaging routines to a single
file omptargetmessage. The goal is to simplify the error handling interface
prior to adding more error handling support
Reviewers: jdoerfert grokos ABataev AndreyChurbanov ronlieb JonChesterfield ye-luo tianshilei1992
Subscribers: danielkiss guansong jvesely kerbowa nhaehnle openmp-commits sstefan1 yaxunl
PrivateArgumentManager shall immediately allocate firstprivates if they
are bases for the next parameters and the next paramaters rely on the
fact that the base musst be allocated already.
Differential Revision: https://reviews.llvm.org/D86781
The test command in `private_mapping.c` was set to expect failure by mistake. It is fixed in this patch.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D86758
Summary:
This patch changes the libomptarget runtime to always emit debug messages that
occur before offloading failure. The goal is to provide users with information
about why their application failed in the target region rather than a single
failure message. This is only done in regions that precede offloading failure
so this should not impact runtime performance. if the debug environment
variable is set then the message is forwarded to the debug output as usual.
A new environment variable was added for future use but does nothing in this
current patch. LIBOMPTARGET_INFO will be used to report runtime information to
the user if requrested, such as grid size, SPMD usage, or data mapping. It will
take an integer indicating the level of information verbosity and a value of 0
will disable it.
Reviewers: jdoerfort
Subscribers: guansong sstefan1 yaxunl ye-luo
Tags: #OpenMP
Differential Revision: https://reviews.llvm.org/D86483
In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive.
Let's take the test case as example.
```
int main() {
int data1[3] = {1}, data2[3] = {2}, data3[3] = {3};
int sum[16] = {0};
#pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3)
for (int i = 0; i < 16; ++i) {
for (int j = 0; j < 3; ++j) {
sum[i] += data1[j];
sum[i] += data2[j];
sum[i] += data3[j];
}
}
}
```
Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:
1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer `nullptr` to the `vector` for kernel arguments, and we will update them later.
2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address.
3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments `vector`.
The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable?
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D86307
Target memory manager is introduced in this patch which aims to manage target
memory such that they will not be freed immediately when they are not used
because the overhead of memory allocation and free is very large. For CUDA
device, cuMemFree even blocks the context switch on device which affects
concurrent kernel execution.
The memory manager can be taken as a memory pool. It divides the pool into
multiple buckets according to the size such that memory allocation/free
distributed to different buckets will not affect each other.
In this version, we use the exact-equality policy to find a free buffer. This
is an open question: will best-fit work better here? IMO, best-fit is not good
for target memory management because computation on GPU usually requires GBs of
data. Best-fit might lead to a serious waste. For example, there is a free
buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit,
the free buffer will be returned, leading to a 760MB waste.
The allocation will happen when there is no free memory left, and the memory
free on device will take place in the following two cases:
1. The program ends. Obviously. However, there is a little problem that plugin
library is destroyed before the memory manager is destroyed, leading to a fact
that the call to target plugin will not succeed.
2. Device is out of memory when we request a new memory. The manager will walk
through all free buffers from the bucket with largest base size, pick up one
buffer, free it, and try to allocate immediately. If it succeeds, it will
return right away rather than freeing all buffers in free list.
Update:
A threshold (8KB by default) is set such that users could control what size of memory
will be managed by the manager. It can also be configured by an environment variable
`LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`.
Reviewed By: jdoerfert, ye-luo, JonChesterfield
Differential Revision: https://reviews.llvm.org/D81054
This patch contains the following changes:
1. Renamed the function `DeviceTy::data_exchange` to `DeviceTy::dataExchange`;
2. Changed the second argument `DeviceTy DstDev` to `DeviceTy &DstDev`;
3. Renamed the last argument.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D86238
Instead of calling `cuFuncGetAttribute` with
`CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation,
we can do it for the first one and cache the result as part of the
`KernelInfo` struct. The only functional change is that we now expect
`cuFuncGetAttribute` to succeed and otherwise propagate the error.
Ignoring any error seems like a slippery slope...
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D86038
[libomptarget] Implement host plugin for amdgpu
Replacement for D71384. Primary difference is inlining the dependency on atmi
followed by extensive simplification and bugfixes. This is the latest version
from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with
minor patches and a rename from hsa to amdgpu, on the basis that this can't be
used by other implementations of hsa without additional work.
This will not build unless the ROCM_DIR variable is passed so won't break other
builds. That variable is used to locate two amdgpu specific libraries that ship
as part of rocm:
libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface
libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime
These libraries build from source. The build scripts in those repos are for
shared libraries, but can be adapted to statically link both into this plugin.
There are caveats.
- This works well enough to run various tests and benchmarks, and will be used
to support the current clang bring up
- It is adequately thread safe for the above but there will be races remaining
- It is not stylistically correct for llvm, though has had clang-format run
- It has suboptimal memory management and locking strategies
- The debug printing / error handling is inconsistent
I would like to contribute this pretty much as-is and then improve it in-tree.
This would be advantagous because the aomp12 branch that was in use for fixing
this codebase has just been joined with the amd internal rocm dev process.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D85742
For example:
```
#pragma omp target data map(tofrom:arr[0:100])
{
#pragma omp target exit data map(delete:arr[0:100])
#pragma omp target enter data map(alloc:arr[98:2])
}
```
Without this patch, the transfer at the end of the target data region
is broken and fails depending on the target device. According to my
read of the spec, the transfer shouldn't even be attempted because
`arr[0:100]` isn't (fully) present there. To fix that, this patch
makes `DeviceTy::getTgtPtrBegin` return null for this case.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D85342
For example, without this patch, the following fails as expected with
or without the `present` modifier, but the `present` modifier doesn't
produce its usual diagnostic:
```
#pragma omp target data map(alloc: arr[0:2])
{
#pragma omp target map(present, tofrom: arr[0:100]) // not fully present
;
}
```
Reviewed By: grokos, vzakhari
Differential Revision: https://reviews.llvm.org/D85320
The standard way of printing `int64_t` data is via the PRId64 macro, `ld`
is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int`
on all platforms. E.g. on Windows we get mismatch warnings.
Differential Revision: https://reviews.llvm.org/D85353
targetDataMapper function fills arrays with the mapping data in the
direct order. When this function is called by targetDataBegin or
tgt_target_update functions, it works as expected. But targetDataEnd
function processes mapped data in reverse order. In this case, the base
pointer might be deleted before the associated data is deleted. Need to
reverse data, mapped by mapper, too, since it always adds data that must
be deleted at the end of the buffer.
Fixes the test declare_mapper_target_update.cpp.
Also, reduces the memry fragmentation by preallocation the memory
buffers.
Differential Revision: https://reviews.llvm.org/D85216
OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states:
> If the corresponding list item is not present in the device data
> environment and there is no present modifier in the clause, then no
> assignment occurs to or from the original list item.
L10-11 states:
> If a present modifier appears in the clause and the corresponding
> list item is not present in the device data environment then an
> error occurs and the program termintates.
(OpenMP 5.0 also has the first passage but without mention of the
present modifier of course.)
In both passages, I assume "is not present" includes the case of
partially but not entirely present. However, without this patch, the
target update directive misbehaves in this case both with and without
the present modifier. For example:
```
#pragma omp target enter data map(to:arr[0:3])
#pragma omp target update to(arr[0:5]) // might fail on data transfer
#pragma omp target update to(present:arr[0:5]) // might fail on data transfer
```
The problem is that `DeviceTy::getTgtPtrBegin` does not return a null
pointer in that case, so `target_data_update` sees the data as fully
present, and the data transfer then might fail depending on the target
device. However, without the present modifier, there should never be
a failure. Moreover, with the present modifier, there should always
be a failure, and the diagnostic should mention the present modifier.
This patch fixes `DeviceTy::getTgtPtrBegin` to return null when
`target_data_update` is the caller. I'm wondering if it should do the
same for more callers.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D85246
Without this patch, the following example fails but shouldn't
according to OpenMP TR8:
```
#pragma omp target enter data map(alloc:i)
#pragma omp target data map(present, alloc: i)
{
#pragma omp target exit data map(delete:i)
} // fails presence check here
```
OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states:
> If the map clause appears on a target, target data, target enter
> data or target exit data construct with a present map-type-modifier
> then on entry to the region if the corresponding list item does not
> appear in the device data environment an error occurs and the
> program terminates.
There is no corresponding statement about the exit from a region.
Thus, the `present` modifier should:
1. Check for presence upon entry into any region, including a `target
exit data` region. This behavior is already implemented correctly.
2. Should not check for presence upon exit from any region, including
a `target` or `target data` region. Without this patch, this
behavior is not implemented correctly, breaking the above example.
In the case of `target data`, this patch fixes the latter behavior by
removing the `present` modifier from the map types Clang generates for
the runtime call at the end of the region.
In the case of `target`, we have not found a valid OpenMP program for
which such a fix would matter. It appears that, if a program can
guarantee that data is present at the beginning of a `target` region
so that there's no error there, that data is also guaranteed to be
present at the end. This patch adds a comment to the runtime to
document this case.
Reviewed By: grokos, RaviNarayanaswamy, ABataev
Differential Revision: https://reviews.llvm.org/D84422
This patch fixed the issue that target memory might be deallocated when
they're still being used or before they're used.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84996
Refactored the function `targetDataEnd` to make preparation of fixing
the issue of ahead-of-time target memory deallocation. This patch only
renamed `targetDataEnd` related variables and functions to conform
with LLVM code standard.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84991
Refactored the function `target` to make preparation for fixing the
issue of ahead-of-time device memory deallocation.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84816
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84767
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
It applies only for global pointers.
Differential Revision: https://reviews.llvm.org/D84767
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
On runtime failures, D83963 causes the runtime to abort instead of
merely exiting with a non-zero value, but many tests in the
libomptarget test suite still expect the former behavior. This patch
updates the test suite and was discussed in post-commit comments on
D83963 and D84557.
Summary:
1. Add DeviceTy::data_alloc, DeviceTy::data_delete, DeviceTy::data_alloc, DeviceTy::synchronize pass-through functions. Avoid directly accessing Device.RTL
2. Fix the type of the first argument of synchronize_ty in rth.h, device id is int32_t which is consistent with other functions.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84487
See PR46515 for the rational but generally, we want to *really* abort
not gracefully shut down.
Reviewed By: grokos, ABataev
Differential Revision: https://reviews.llvm.org/D83963
Summary:
In the function `target`, memory deallocation and `target_data_end` is called
immediately returning from launching kernel. This might cause a race condition
that the corresponding memory is still being used by the kernel and a potential
issue that when the kernel starts to execute, its required data have already
been deallocated, especially when multiple kernels running concurrently. Since
nevertheless, we will block the thread issuing the target offloading at the end
of the target, we just move the synchronization ahead a little bit to make sure
the correctness.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84381
This implements OpenMP runtime support for the OpenMP TR8 `present`
map type modifier. The previous patch in this series implements Clang
front end support. See that patch summary for behaviors that are not
yet supported.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D83062
This implements OpenMP runtime support for the OpenMP TR8 `present`
map type modifier. The previous patch in this series implements Clang
front end support. See that patch summary for behaviors that are not
yet supported.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D83062
Libomptarget patch adding runtime support for "declare mapper".
Patch co-developed by Lingda Li and George Rokos.
Differential revision: https://reviews.llvm.org/D68100
There are various runtime calls in the device runtime with unused, or
always fixed, arguments. This is bad for all sorts of reasons. Clean up
two before as we match them in OpenMPOpt now.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83268
Summary:
Retaining per device primary context is preferred to creating a context owned by the plugin.
From CUDA documentation
1. Note that the use of multiple CUcontext s per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX
Two issues are addressed by this patch:
1. Not using the primary context caused interoperability issue with libraries like cublas, cusolver. CUBLAS_STATUS_EXECUTION_FAILED and cudaErrorInvalidResourceHandle
2. On OLCF summit, "Error returned from cuCtxCreate" and "CUDA error is: invalid device ordinal"
Regarding the flags of the primary context. If it is inactive, we set CU_CTX_SCHED_BLOCKING_SYNC. If it is already active, we respect the current flags.
Reviewers: grokos, ABataev, jdoerfert, protze.joachim, AndreyChurbanov, Hahnfeld
Reviewed By: jdoerfert
Subscribers: openmp-commits, yaxunl, guansong, sstefan1, tianshilei1992
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82718
This function uses __builtin_amdgcn_atomic_inc32():
uint32_t atomicInc(uint32_t *address, uint32_t max);
These functions use __builtin_amdgcn_fence():
__kmpc_impl_threadfence()
__kmpc_impl_threadfence_block()
__kmpc_impl_threadfence_system()
They will take place of current mechanism of directly calling IR functions.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83132