targetDataMapper function fills arrays with the mapping data in the
direct order. When this function is called by targetDataBegin or
tgt_target_update functions, it works as expected. But targetDataEnd
function processes mapped data in reverse order. In this case, the base
pointer might be deleted before the associated data is deleted. Need to
reverse data, mapped by mapper, too, since it always adds data that must
be deleted at the end of the buffer.
Fixes the test declare_mapper_target_update.cpp.
Also, reduces the memry fragmentation by preallocation the memory
buffers.
Differential Revision: https://reviews.llvm.org/D85216
OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states:
> If the corresponding list item is not present in the device data
> environment and there is no present modifier in the clause, then no
> assignment occurs to or from the original list item.
L10-11 states:
> If a present modifier appears in the clause and the corresponding
> list item is not present in the device data environment then an
> error occurs and the program termintates.
(OpenMP 5.0 also has the first passage but without mention of the
present modifier of course.)
In both passages, I assume "is not present" includes the case of
partially but not entirely present. However, without this patch, the
target update directive misbehaves in this case both with and without
the present modifier. For example:
```
#pragma omp target enter data map(to:arr[0:3])
#pragma omp target update to(arr[0:5]) // might fail on data transfer
#pragma omp target update to(present:arr[0:5]) // might fail on data transfer
```
The problem is that `DeviceTy::getTgtPtrBegin` does not return a null
pointer in that case, so `target_data_update` sees the data as fully
present, and the data transfer then might fail depending on the target
device. However, without the present modifier, there should never be
a failure. Moreover, with the present modifier, there should always
be a failure, and the diagnostic should mention the present modifier.
This patch fixes `DeviceTy::getTgtPtrBegin` to return null when
`target_data_update` is the caller. I'm wondering if it should do the
same for more callers.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D85246
Without this patch, the following example fails but shouldn't
according to OpenMP TR8:
```
#pragma omp target enter data map(alloc:i)
#pragma omp target data map(present, alloc: i)
{
#pragma omp target exit data map(delete:i)
} // fails presence check here
```
OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states:
> If the map clause appears on a target, target data, target enter
> data or target exit data construct with a present map-type-modifier
> then on entry to the region if the corresponding list item does not
> appear in the device data environment an error occurs and the
> program terminates.
There is no corresponding statement about the exit from a region.
Thus, the `present` modifier should:
1. Check for presence upon entry into any region, including a `target
exit data` region. This behavior is already implemented correctly.
2. Should not check for presence upon exit from any region, including
a `target` or `target data` region. Without this patch, this
behavior is not implemented correctly, breaking the above example.
In the case of `target data`, this patch fixes the latter behavior by
removing the `present` modifier from the map types Clang generates for
the runtime call at the end of the region.
In the case of `target`, we have not found a valid OpenMP program for
which such a fix would matter. It appears that, if a program can
guarantee that data is present at the beginning of a `target` region
so that there's no error there, that data is also guaranteed to be
present at the end. This patch adds a comment to the runtime to
document this case.
Reviewed By: grokos, RaviNarayanaswamy, ABataev
Differential Revision: https://reviews.llvm.org/D84422
This patch fixed the issue that target memory might be deallocated when
they're still being used or before they're used.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84996
This is to address the issue reported at:
https://bugs.llvm.org/show_bug.cgi?id=46863
Since weak is meaningless for a shared library interface function, this patch
disables the attribute, when the OpenMP library is built as shared library.
ompt_start_tool is not an interface function, but a internally called function
possibly implemented by an OMPT tool.
This function needs to be weak if possible to allow overwriting ompt_start_tool
with a function implementation built into the application.
Differential Revision: https://reviews.llvm.org/D84871
Refactored the function `targetDataEnd` to make preparation of fixing
the issue of ahead-of-time target memory deallocation. This patch only
renamed `targetDataEnd` related variables and functions to conform
with LLVM code standard.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84991
Refactored the function `target` to make preparation for fixing the
issue of ahead-of-time device memory deallocation.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84816
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D84767
Need to map the base pointer for all directives, not only target
data-based ones.
The base pointer is mapped for array sections, array subscript, array
shaping and other array-like constructs with the base pointer. Also,
codegen for use_device_ptr clause was modified to correctly handle
mapping combination of array like constructs + use_device_ptr clause.
The data for use_device_ptr clause is emitted as the last records in the
data mapping array.
It applies only for global pointers.
Differential Revision: https://reviews.llvm.org/D84767
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
On runtime failures, D83963 causes the runtime to abort instead of
merely exiting with a non-zero value, but many tests in the
libomptarget test suite still expect the former behavior. This patch
updates the test suite and was discussed in post-commit comments on
D83963 and D84557.
Summary:
1. Add DeviceTy::data_alloc, DeviceTy::data_delete, DeviceTy::data_alloc, DeviceTy::synchronize pass-through functions. Avoid directly accessing Device.RTL
2. Fix the type of the first argument of synchronize_ty in rth.h, device id is int32_t which is consistent with other functions.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84487
See PR46515 for the rational but generally, we want to *really* abort
not gracefully shut down.
Reviewed By: grokos, ABataev
Differential Revision: https://reviews.llvm.org/D83963
Additionally fix the copy if enabled on multi-config targets.
Summary:
This changes the copy command for libomp.so to use the output of the target as
the source of the copy, rather than trying to find it based on
${LIBOMP_LIBRARY_DIR}, which appears to be incorrect in multi-config generator
builds.
Reviewers: jdoerfert
Subscribers: mgorny, yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84148
Summary:
In the function `target`, memory deallocation and `target_data_end` is called
immediately returning from launching kernel. This might cause a race condition
that the corresponding memory is still being used by the kernel and a potential
issue that when the kernel starts to execute, its required data have already
been deallocated, especially when multiple kernels running concurrently. Since
nevertheless, we will block the thread issuing the target offloading at the end
of the target, we just move the synchronization ahead a little bit to make sure
the correctness.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D84381
This implements OpenMP runtime support for the OpenMP TR8 `present`
map type modifier. The previous patch in this series implements Clang
front end support. See that patch summary for behaviors that are not
yet supported.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D83062
This implements OpenMP runtime support for the OpenMP TR8 `present`
map type modifier. The previous patch in this series implements Clang
front end support. See that patch summary for behaviors that are not
yet supported.
Reviewed By: grokos, jdoerfert
Differential Revision: https://reviews.llvm.org/D83062
Following tests were disabled for clang-11 after upgrading to
version 5.0 in D82963:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
They are also failing for clang-12. Thus this temporary disabling
until they are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D84241
Add check of frm to prevent array out-of-bound access;
add check of new_nproc to prevent access of unallocated hot_teams array;
add check of location info pointer to prevent NULL dereference;
add check of d_tn pointer to prevent NULL dereference in release build.
These checks make static analyzers happier.
This is second part of the patch from https://reviews.llvm.org/D84062.
Add check of negative gtid before indexing __kmp_threads.
This makes static analyzers happier.
This is the first part of the patch split in two parts.
Differential Revision: https://reviews.llvm.org/D84062
hwloc documentation guarantees the only object that is always present
in the topology is PU. We can check the presence of other objects
in the topology, just in case.
Differential Revision: https://reviews.llvm.org/D84065
Add barrier/region notification for parallel inside teams construct
when number of teams is 1, as VTune only shows outer level regions for
simplicity.
Differential Revision: https://reviews.llvm.org/D84024
Libomptarget patch adding runtime support for "declare mapper".
Patch co-developed by Lingda Li and George Rokos.
Differential revision: https://reviews.llvm.org/D68100
There are various runtime calls in the device runtime with unused, or
always fixed, arguments. This is bad for all sorts of reasons. Clean up
two before as we match them in OpenMPOpt now.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83268
Summary:
Retaining per device primary context is preferred to creating a context owned by the plugin.
From CUDA documentation
1. Note that the use of multiple CUcontext s per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX
Two issues are addressed by this patch:
1. Not using the primary context caused interoperability issue with libraries like cublas, cusolver. CUBLAS_STATUS_EXECUTION_FAILED and cudaErrorInvalidResourceHandle
2. On OLCF summit, "Error returned from cuCtxCreate" and "CUDA error is: invalid device ordinal"
Regarding the flags of the primary context. If it is inactive, we set CU_CTX_SCHED_BLOCKING_SYNC. If it is already active, we respect the current flags.
Reviewers: grokos, ABataev, jdoerfert, protze.joachim, AndreyChurbanov, Hahnfeld
Reviewed By: jdoerfert
Subscribers: openmp-commits, yaxunl, guansong, sstefan1, tianshilei1992
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82718
This function uses __builtin_amdgcn_atomic_inc32():
uint32_t atomicInc(uint32_t *address, uint32_t max);
These functions use __builtin_amdgcn_fence():
__kmpc_impl_threadfence()
__kmpc_impl_threadfence_block()
__kmpc_impl_threadfence_system()
They will take place of current mechanism of directly calling IR functions.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83132
This patch adds missing GOMP_5.0 loop entry points which incorporate
new non-monotonic default into entry point name. Since monotonic
schedules are a subset of nonmonotonic, it is acceptable to use
monotonic as the implementation. This patch simply has the nonmonotonic
(and possibly non-monontonic) versions of the loop entry points as
wrappers around the monotonic ones.
Differential Revision: https://reviews.llvm.org/D73922
Following tests are failing after upgrading to version 5.0 but are passing
for version 4.5:
1. openmp/runtime/test/env/kmp_set_dispatch_buf.c
2. openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c
To be enabled as soon as these tests are fixed.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D82963
If the compilation fails, the test is marked as unsupported.
-> This will never change for a specific version of gcc
If the linking fails, the test is marked as expected to fail.
-> This might change as LLVM/OpenMP implements the missing GOMP interface function
Reviewed by: Hahnfeld
Differential Revision: https://reviews.llvm.org/D83077
When configuring in-tree, the correct names are LLVM_VERSION_MAJOR
and LLVM_VERSION_MINOR. This has been wrong since the code was added
in commits fc473dee98 and 821649229e.
Summary: Warnings are printed by clang when building LIBOMPTARGET_ENABLE_DEBUG=ON due incorrect format string.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82789
Summary:
`config.test_extra_flags` is passed in from `lit.site.cfg.in` files, but they're not used in the LIT configs. This variable can be useful for distros which don't have the standard c/c++ headers in the default search paths. Since the tests run clang on c/c++ source code, we rely on `test_extra_flags` to pass in the necessary header files.
This is a similar setup that's also done in litomptarget https://github.com/llvm/llvm-project/blob/master/openmp/libomptarget/test/lit.cfg#L42 and openmp/runtime.
Reviewers: jdoerfert, jdenny, protze.joachim
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82516
Summary:
lookupMapping took significant time due to linear complexity searching.
This is bad for offloading from multiple host threads because lookupMapping is protected by mutex.
Use std::set for logarithmic complexity searching.
Before my change.
libomptarget inclusive time 16.7 sec, exclusive time 8.6 sec.
After the change
libomptarget inclusive time 7.3 sec, exclusive time 0.4 sec.
Most of the overhead of libomptarget (exclusive time) is gone.
Reviewers: jdoerfert, grokos
Reviewed By: grokos
Subscribers: tianshilei1992, yaxunl, guansong, sstefan1
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82264
DeviceID is added for some cases that we only have the __tgt_async_info but do
not know its corresponding device id. However, to communicate with target
plugins, we need that information.
Event is added for another way to synchronize.
Summary:
The OpenMP loops are normalized and transformed into the loops from 0 to
max number of iterations. In some cases, original scheme may lead to
overflow during calculation of number of iterations. If it is unknown,
if we can end up with overflow or not (the bounds are not constant and
we cannot define if there is an overflow), cast original type to the
unsigned.
Reviewers: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits, cfe-commits, caomhin
Tags: #clang, #openmp
Differential Revision: https://reviews.llvm.org/D81881
Adds the callbacks for ordered with source/sink dependencies.
The test for task dependencies changed, because callbach.h now actually prints
the passed dependencies and the test also checks for the address.
Reviewed by: hbae
Differential Revision: https://reviews.llvm.org/D81807
This patch allows to specify a prefix (default:empty) to be included into print-out
written by callback.h.
Also adding a cmake target to find the header file from other tests.
Reviewed by: jdoerfert
Differential Revision: https://reviews.llvm.org/D76008
Summary:
In current implementation, D2D memcpy is first to copy data back to host and then
copy from host to device. This is very efficient if the device supports D2D
memcpy, like CUDA.
In this patch, D2D memcpy will first try to use native supported driver API. If
it fails, fall back to original way. It is worth noting that D2D memcpy in this
scenerio contains two ideas:
- Same devices: this is the D2D memcpy in the CUDA context.
- Different devices: this is the PeerToPeer memcpy in the CUDA context.
My implementation merges this two parts. It chooses the best API according to
the source device and destination device.
Reviewers: jdoerfert, AndreyChurbanov, grokos
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D80649
The OpenMP spec has the task-fulfill event for a call to omp_fulfill_event.
If the task did not yet finish execution, ompt_task_early_fulfill is used,
otherwise ompt_task_late_fulfill.
If a task does not complete, when the execution finishes (i.e., the task goes
in detached mode), ompt_task_detach instead of ompt_task_complete must be
used, when the next task is scheduled.
A test for both cases is included, which only work with clang-11+
Reviewed By: hbae
Differential revision: https://reviews.llvm.org/D80843
__kmp_realloc_task_deque implicitly assumes, that the task queue is full
(ntasks == size), therefore tail = size in line 319.
An assertion is added to document this assumption.
The first check for a full queue is before the locking and might not hold
when the lock is taken. So, we need to check again for this condition when
we have the lock.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D80480
Spurious assertion failures are symptoms of a race condition for the handling
of detached tasks:
Assertion failure at kmp_tasking.cpp(3744): taskdata->td_flags.complete == 1.
Assertion failure at kmp_tasking.cpp(710): taskdata->td_flags.executing == 0.
in the case of detach=true, all accesses to taskdata in __kmp_task_finish need
to happen before (~line 873):
taskdata->td_flags.proxy = TASK_PROXY;
This assignment signals to __kmp_fulfill_event, that the task will need to be
freed there. So, conceptionally the ownership of taskdata is moved.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D79702
This patch adds a libomptarget plugin for the NEC SX-Aurora TSUBASA Vector
Engine (VE target). The code is largely based on the existing generic-elf
plugin and uses the NEC VEO and VEOSINFO libraries for offloading.
Differential Revision: https://reviews.llvm.org/D76843
D78566 introduced a `\bnot\b` lit substitution in OpenMP test suites.
However, that would corrupt a command like
`FileCheck -implicit-check-not` or any file name like `%t.not`. We
could use lookbehind/lookahead assertions to avoid such cases, but
this patch switches to `%not` (suggested during the D78566 review) as
a safer option.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D79529
Summary: There is a typo in DeviceRTLTy::getNumOfDevices that the type of its return value is bool. It will lead to a problem of wrong device number returned from omp_get_num_devices.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D79255
The two locals IsNew and Pointer_IsNew were uninitialized at declaration, and then passed by
reference to Device.getOrAllocTgtPtr which in turn did not assign on all
paths within the function. This resulted in occasional runtime failures in one application.
Device::getOrAllocTgtPtr will now initialize IsNew to false on entry to function.
Differential Revision: https://reviews.llvm.org/D78744
Without this patch, target_data_begin continues after an illegal
mapping or an out-of-memory error on the device. With this patch, it
terminates the runtime with an error instead.
The new test exercises only illegal mappings. I didn't think of a
good way to exercise out-of-memory errors from the test suite.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D78170
Without this patch, the openmp project's test suites do not appear to
have support for negative tests. However, D78170 needs to add a test
that an expected runtime failure occurs.
This patch makes `not` visible in all of the openmp project's test
suites. In all but `libomptarget/test`, it should be possible for a
test author to insert `not` before a use of the lit substitution for
running a test program. In `libomptarget/test`, that substitution is
target-specific, and its value is `echo` when the target is not
available. In that case, inserting `not` before a lit substitution
would expect an `echo` fail, so this patch instead defines a separate
lit substitution for expected runtime fails.
Reviewed By: jdoerfert, Hahnfeld
Differential Revision: https://reviews.llvm.org/D78566
On systems with weak memory consistency, this patch fixes an intermittent crash
in the reduction function called by __kmp_hyper_barrier_gather, which suffers
from a race on a child thread's data.
Reviewed-By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D77603
Summary: Current implementation mixed everything up so that there is almost no encapsulation. In this patch, all CUDA related operations are put into a new class DeviceRTLTy and only necessary functions are exposed. In addition, all C++ code now conforms with LLVM code standard, keeping those API functions following C style.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: jfb, yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77951
...onization
Summary: In previous patch, in order to optimize performance, we only synchronize once
for each target region. The syncrhonization is via stream synchronization.
However, in the extreme situation, the performce might be bad. Consider the
following case: There is a task that requires transferring huge amount of data
(call many times of data transferring function). It is scheduled to the first
stream. And then we have 255 very light tasks scheduled to the remaining 255
streams (by default we have 256 streams). They can be finished before we do
synchronization at the end of the first task. Next, we get another very huge
task. It will be scheduled again to the first stream. Now the first task
finishes its kernel launch and call stream synchronization. Right now, the
stream already contains two kernels, and the synchronization will wait until the
two kernels finish instead of just the first one for the first task.
In this patch, we introduce stream pool. After each synchronization, the stream
will be returned back to the pool to make sure that for each synchronization,
only expected operations are waited.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: gregrodgers, yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77412
Summary: According to comments on bi-weekly meeting, this patch put back old APIs and added new `_async` series
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77822
Summary:
This patch introduces two things for offloading:
1. Asynchronous data transferring: those functions are suffix with `_async`. They have one more argument compared with their synchronous counterparts: `__tgt_async_info*`, which is a new struct that only has one field, `void *Identifier`. This struct is for information exchange between different asynchronous operations. It can be used for stream selection, like in this case, or operation synchronization, which is also used. We may expect more usages in the future.
2. Optimization of stream selection for data mapping. Previous implementation was using asynchronous device memory transfer but synchronizing after each memory transfer. Actually, if we say kernel A needs four memory copy to device and two memory copy back to host, then we can schedule these seven operations (four H2D, two D2H, and one kernel launch) into a same stream and just need synchronization after memory copy from device to host. In this way, we can save a huge overhead compared with synchronization after each operation.
Reviewers: jdoerfert, ye-luo
Reviewed By: jdoerfert
Subscribers: yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77005
Summary:
[libomptarget][nfc] Move non-freestanding headers out of common
Lowers the bar for building deviceRTL.
Drops math.h entirely as it wasn't used and libm is a big dependency.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77071
Data race occurs when acquiring lock for critical section
triggering assertion failure. Added barrier to ensure
all memory is commited before checking assertion.
Reviewed By: Hahnfeld
Differential Revision: https://reviews.llvm.org/D76780