Summary: Warnings are printed by clang when building LIBOMPTARGET_ENABLE_DEBUG=ON due incorrect format string.
Reviewers: tianshilei1992, jdoerfert
Reviewed By: tianshilei1992
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82789
Summary:
lookupMapping took significant time due to linear complexity searching.
This is bad for offloading from multiple host threads because lookupMapping is protected by mutex.
Use std::set for logarithmic complexity searching.
Before my change.
libomptarget inclusive time 16.7 sec, exclusive time 8.6 sec.
After the change
libomptarget inclusive time 7.3 sec, exclusive time 0.4 sec.
Most of the overhead of libomptarget (exclusive time) is gone.
Reviewers: jdoerfert, grokos
Reviewed By: grokos
Subscribers: tianshilei1992, yaxunl, guansong, sstefan1
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D82264
DeviceID is added for some cases that we only have the __tgt_async_info but do
not know its corresponding device id. However, to communicate with target
plugins, we need that information.
Event is added for another way to synchronize.
Summary:
In current implementation, D2D memcpy is first to copy data back to host and then
copy from host to device. This is very efficient if the device supports D2D
memcpy, like CUDA.
In this patch, D2D memcpy will first try to use native supported driver API. If
it fails, fall back to original way. It is worth noting that D2D memcpy in this
scenerio contains two ideas:
- Same devices: this is the D2D memcpy in the CUDA context.
- Different devices: this is the PeerToPeer memcpy in the CUDA context.
My implementation merges this two parts. It chooses the best API according to
the source device and destination device.
Reviewers: jdoerfert, AndreyChurbanov, grokos
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, sstefan1, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D80649
This patch adds a libomptarget plugin for the NEC SX-Aurora TSUBASA Vector
Engine (VE target). The code is largely based on the existing generic-elf
plugin and uses the NEC VEO and VEOSINFO libraries for offloading.
Differential Revision: https://reviews.llvm.org/D76843
D78566 introduced a `\bnot\b` lit substitution in OpenMP test suites.
However, that would corrupt a command like
`FileCheck -implicit-check-not` or any file name like `%t.not`. We
could use lookbehind/lookahead assertions to avoid such cases, but
this patch switches to `%not` (suggested during the D78566 review) as
a safer option.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D79529
Summary: There is a typo in DeviceRTLTy::getNumOfDevices that the type of its return value is bool. It will lead to a problem of wrong device number returned from omp_get_num_devices.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D79255
The two locals IsNew and Pointer_IsNew were uninitialized at declaration, and then passed by
reference to Device.getOrAllocTgtPtr which in turn did not assign on all
paths within the function. This resulted in occasional runtime failures in one application.
Device::getOrAllocTgtPtr will now initialize IsNew to false on entry to function.
Differential Revision: https://reviews.llvm.org/D78744
Without this patch, target_data_begin continues after an illegal
mapping or an out-of-memory error on the device. With this patch, it
terminates the runtime with an error instead.
The new test exercises only illegal mappings. I didn't think of a
good way to exercise out-of-memory errors from the test suite.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D78170
Without this patch, the openmp project's test suites do not appear to
have support for negative tests. However, D78170 needs to add a test
that an expected runtime failure occurs.
This patch makes `not` visible in all of the openmp project's test
suites. In all but `libomptarget/test`, it should be possible for a
test author to insert `not` before a use of the lit substitution for
running a test program. In `libomptarget/test`, that substitution is
target-specific, and its value is `echo` when the target is not
available. In that case, inserting `not` before a lit substitution
would expect an `echo` fail, so this patch instead defines a separate
lit substitution for expected runtime fails.
Reviewed By: jdoerfert, Hahnfeld
Differential Revision: https://reviews.llvm.org/D78566
Summary: Current implementation mixed everything up so that there is almost no encapsulation. In this patch, all CUDA related operations are put into a new class DeviceRTLTy and only necessary functions are exposed. In addition, all C++ code now conforms with LLVM code standard, keeping those API functions following C style.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: jfb, yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77951
...onization
Summary: In previous patch, in order to optimize performance, we only synchronize once
for each target region. The syncrhonization is via stream synchronization.
However, in the extreme situation, the performce might be bad. Consider the
following case: There is a task that requires transferring huge amount of data
(call many times of data transferring function). It is scheduled to the first
stream. And then we have 255 very light tasks scheduled to the remaining 255
streams (by default we have 256 streams). They can be finished before we do
synchronization at the end of the first task. Next, we get another very huge
task. It will be scheduled again to the first stream. Now the first task
finishes its kernel launch and call stream synchronization. Right now, the
stream already contains two kernels, and the synchronization will wait until the
two kernels finish instead of just the first one for the first task.
In this patch, we introduce stream pool. After each synchronization, the stream
will be returned back to the pool to make sure that for each synchronization,
only expected operations are waited.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: gregrodgers, yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77412
Summary: According to comments on bi-weekly meeting, this patch put back old APIs and added new `_async` series
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: yaxunl, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77822
Summary:
This patch introduces two things for offloading:
1. Asynchronous data transferring: those functions are suffix with `_async`. They have one more argument compared with their synchronous counterparts: `__tgt_async_info*`, which is a new struct that only has one field, `void *Identifier`. This struct is for information exchange between different asynchronous operations. It can be used for stream selection, like in this case, or operation synchronization, which is also used. We may expect more usages in the future.
2. Optimization of stream selection for data mapping. Previous implementation was using asynchronous device memory transfer but synchronizing after each memory transfer. Actually, if we say kernel A needs four memory copy to device and two memory copy back to host, then we can schedule these seven operations (four H2D, two D2H, and one kernel launch) into a same stream and just need synchronization after memory copy from device to host. In this way, we can save a huge overhead compared with synchronization after each operation.
Reviewers: jdoerfert, ye-luo
Reviewed By: jdoerfert
Subscribers: yaxunl, lildmh, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77005
Summary:
[libomptarget][nfc] Move non-freestanding headers out of common
Lowers the bar for building deviceRTL.
Drops math.h entirely as it wasn't used and libm is a big dependency.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D77071
Summary:
[libomptarget][nfc] Explicitly static function scope shared variables
`__shared__` in CUDA implies static in function scope. See e.g. D.2.1.1
in CUDA_C_Programming_Guide.pdf,
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/
This is surprising for non-cuda developers, see e.g. D73239 where I thought
local variables would be thread local.
Tested by IR diff of libomptarget.bc (no change), running in tree tests,
and binary diff of the nvcc static archives (no significant change).
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D76713
Summary: Explicitly initialize data members of RTLsTy class upon construction.
Reviewers: grokos
Subscribers: guansong, openmp-commits, caomhin, kkwli0
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75946
Summary:
[libomptarget] Implement locks for amdgcn
The nvptx implementation deadlocks on amdgcn. atomic_cas with multiple
active lanes can deadlock - if one lane succeeds, all the others are locked
out. The set_lock implementation therefore runs on a single lane.
Also uses a sleep intrinsic instead of the system clock for a probably
minor performance improvement. The unset/test implementations may be revised
later, based on code size / performance or similar concerns.
This implements the lock at a per-wavefront scope. That's not strictly as
specified, since openmp describes locks in terms of threads. I think the
nvptx implementation provides true per-thread locking on volta and the same
per-warp locking on other architectures.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75546
Summary:
[libomptarget][nfc] Move GetWarp/LaneId functions into per arch code
No code change for nvptx. Amdgcn currently has two implementations of GetLaneId,
this patch keeps the one a colleague considered to be superior for our ISA.
GetWarpId is currently the same function for amdgcn and nvptx, but I think it's
cleaner to keep it grouped with all the others than to keep it in support.cu.
Reviewers: jdoerfert, grokos, ABataev
Reviewed By: jdoerfert
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75587
Summary:
[libomptarget] Implement hip atomic functions in terms of intrinsics
All but atomicInc can be implemented using type generic clang intrinsics.
There is not yet a corresponding intrinsic for atomicInc in clang, only one in
LLVM. This patch leaves atomicInc as an unresolved symbol.
Reviewers: jdoerfert, ABataev, hfinkel, grokos, arsenm
Reviewed By: arsenm
Subscribers: sri, saiislam, wdng, jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D73076
Summary: fixed the warning from gcc since prios 0-100 are reserved for the internal use.
Reviewers: grokos
Subscribers: kkwli0, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D75458
Summary:
Instead of using global variables with unpredicted time of
deinitialization, use dynamically allocated variables with functions
explicitly marked as global constructor/destructor and priority. This
allows to prevent the crash because of the incorrect order of dynamic
libraries deinitialization.
Reviewers: grokos, hfinkel
Subscribers: caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74837
Summary: The 5.0 spec states, "The omp_get_max_threads routine returns an upper bound on the number of threads that could be used to form a new team if a parallel construct without a num_threads clause were encountered after execution returns from this routine." The attached test shows Max Threads: 96, Num Threads: 128 without the proposed change. The number of threads should not exceed the (max) nthreads ICV, hence we should return the higher SPMD thread number even when omp_get_max_threads() is called in a generic kernel. This change does fail the api test, max_threads.c, because now it would return 64 instead of 32.
Reviewers: jdoerfert, ABataev, grokos, JonChesterfield
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74092
Summary:
[libomptarget][nfc] Change enum values to match those in cuda/rtl
support.h and cuda/rtl.cpp (and downsteam hsa/rtl.cpp) have enums for execution
mode. These are actually independent - the numbers that used within support, or
within the plugin, are never passed across the boundary.
Nevertheless, trying to work out why the values are different between the two
has generated a reasonable amount of confusion. This patch changes support to
match the values in plugin, on the basis that the plugin also has some comments
which I'd have to update if I changed that one instead. Credit to Ron for
working through this in our own fork. See rocm-developer-tools/aomp/issues/7
for that earlier diagnostic write up.
Also happy with generic = 0, spmd = 1 - provided it's the same in both places.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D74503
EXCLUDE_FROM_ALL means something else for add_lit_testsuite as it does
for something like add_executable. Distinguish between the two by
renaming the variable and making it an argument to add_lit_testsuite.
Differential revision: https://reviews.llvm.org/D74168
Summary:
[nfc][libomptarget] Remove SHARED annotation from local variables
A few local variables in reduction.cu were marked SHARED. This patch leaves
all per-kernel global state localised in omp_data.cu.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D73239
Summary:
This patch is to fix issue in the following simple case:
#include <omp.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int num = omp_get_num_devices();
printf("%d\n", num);
return 0;
}
Currently it returns 0 even devices exist. Since this file doesn't contain any
target region, the host entry is empty so further actions like initialization
will not be proceeded, leading to wrong device number returned by runtime
function call.
Reviewers: jdoerfert, ABataev, protze.joachim
Reviewed By: ABataev
Subscribers: protze.joachim
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72576
Summary:
[libomptarget] Implement smid for amdgcn
Implementation is in a new file as it uses an intrinsic with
complicated encoding that warranted substantial comments.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72956
The reference counter for global objects marked with declare target is INF. This patch prevents the runtime from incrementing /decrementing INF refcounts. Without it, the map(delete: global_object) directive actually deallocates the global on the device. With this patch, such a directive becomes a no-op.
Differential Revision: https://reviews.llvm.org/D72525
Summary:
[nfc][libomptarget] Refactor nxptx/target_impl.cu
Use __kmpc_impl_atomic_add instead of atomicAdd to match the rest of the file.
Alternatively, target_impl.cu could use the cuda functions directly. Using a mixture in this
file was an oversight, happy to resolve in either direction.
Removed some comments that look outdated.
Call __kmpc_impl_unset_lock directly to avoid a redundant diagnostic and remove an implict
dependency on interface.h.
Reviewers: ABataev, grokos, jdoerfert
Reviewed By: jdoerfert
Subscribers: jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72719
Summary:
[nfc][libomptarget] Refactor amdgcn target_impl
Removes references to internal libraries from the header
Standardises on C++ mangling for all the target_impl functions
Update comment block
clang-format
Move some functions into a new target_impl.hip source file
This lays the groundwork for implementing the remaining unresolved
symbols in the target_impl.hip source.
Reviewers: jdoerfert, grokos, ABataev, ronlieb
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72712
Summary:
If the dynamically loaded module has been compiled with -fopenmp-targets
and has no target regions, it has empty target descriptor. It leads to a
crash at the runtime if another module has at least one target region
and at least one entry in its descriptor. The runtime library is unable
to load the empty binary descriptor and terminates the execution.
Caused by a clang-offload-wrapper.
Reviewers: grokos, jdoerfert
Subscribers: caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D72472
Summary:
[libomptarget][nfc] Introduce atomic wrapper function
Wraps atomic functions in a template prefixed __kmpc_atomic that
dispatches to cuda or hip atomic functions. Intended to be easily extended
to dispatch to OpenCL or C++ atomics for a third target.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: Anastasia, jvesely, mgrang, dexonsmith, llvm-commits, mgorny, jfb, openmp-commits
Tags: #openmp, #llvm
Differential Revision: https://reviews.llvm.org/D71404
Summary:
[libomptarget][nfc] Extract function from data_sharing, move to common
Finding the first active thread in the warp is different on nvptx and amdgcn,
mostly due to warp size and the desire for efficiency.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71643
Summary:
[libomptarget][nfc] Move three files under common, build them for amdgcn
Change to reduction.cu to remove two dead includes, otherwise no code change.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71601
Summary:
[libomptarget][nfc] Move omp locks under target_impl
These are likely to be target specific, even down to the lock_t which is
correspondingly moved out of interface.h. The alternative is to include
interface.h in target_impl which substantiatially increases the scope of
those symbols.
The current nvptx implementation deadlocks on amdgcn. The preferred
implementation for that arch is still under discussion - this change
leaves declarations in target_impl.
The functions could be inline for nvptx. I'd prefer to keep the internals
hidden in the target_impl translation unit, but will add the (possibly renamed)
macros to target_impl.h if preferred.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71574
Summary:
[libomptarget][nfc] Wrap cuda min() in target_impl
nvptx forwards to cuda min, amdgcn implements directly.
Sufficient to build parallel.cu for amdgcn, added to CMakeLists.
All call sites are homogenous except one that passes a uint32_t and an
int32_t. This could be smoothed over by taking two type parameters
and some care over the return type, but overall I think the inline
<uint32_t> calling attention to what was an implicit sign conversion
is cleaner.
Reviewers: ABataev, jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71580
Summary:
This reverts commit dd8a7fcdd7.
Alexey reports undefined symbols for the new inline functions defined in target_impl.h
This does not reproduce for me for nvptx, or amdgcn, under release or debug builds.
I believe the patch is fine, based on:
- the semantics of an inline function in C++ (the cuda INLINE functions end
up as linkonce_odr in IR), which are only legal to drop if they have no uses
- the code generated from a debug build of clang 9 does not show these undef symbols
- the tests pass
- the code is trivial
To progress from here I either need:
- A tie break - someone to play the role of CI in determining whether the patch works
- Alexey to provide sufficient information about his build for me to reproduce the failure
- Alexey to debug why the symbols are disappearing for him and report back
Reviewers: ABataev, jdoerfert, grokos
Subscribers: jvesely, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71502
Summary:
[libomptarget] Build most of common/src for amdgcn
Excluding parallel.cu, which uses an integer min() from cuda,
Excluding support.cu, which calls malloc that is not yet available for amdgcn
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: gregrodgers, ronlieb, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71446
Summary:
[libomptarget][nfc] Add declarations of atomic functions for amdgcn
This enables building more source for amdgcn. The functions are usually available
in a hip runtime header, but are duplicated here to decouple the implementation
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: jvesely, mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71412
Summary:
[libomptarget][nfc] Move cuda threadfence functions behind kmpc_impl
Part of building code under common/ without requiring a cuda compiler
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: ABataev
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71102
Summary:
[libomptarget][nfc] Move omptarget-nvptx under common
Almost all files depend on require omptarget-nvptx, which no longer
contains any obviously architecture dependent code. Moving it under
common unblocks task/loop for amdgcn, and allows moving other code.
At some point there should probably be a widespread symbol renaming to
replace the nvptx string. I'd prefer to get things working first.
Building this (and task.cu, loop.cu) without a cuda library requires
some more refactoring, e.g. wrap threadfence(), use DEVICE macro more
consistently. Patches for that are orthogonal and will be posted shortly.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: mgorny, fedor.sergeev, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D71073
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
Repeat of D70414, with an include path fixed. Diff for sanity checking.
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author so would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: ABataev, jdoerfert
Reviewed By: ABataev
Subscribers: jvesely, mgorny, jfb, openmp-commits, jdoerfert
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70971
Summary:
[libomptarget] Build a minimal deviceRTL for amdgcn
The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author so would
like his sign off on that here).
This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Reviewed By: jdoerfert
Subscribers: jfb, Hahnfeld, jvesely, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70414
Summary:
"make check-all" or "make check-libomptarget" would attempt to run offloading
tests before the offload plugins are built. This patch corrects that by adding
dependencies to the libomptarget CMake rules.
Reviewers: jdoerfert
Subscribers: mgorny, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70803
Summary:
[libomptarget][nfc] Move some source into common from nvptx
Moves some source that compiles cleanly under amdgcn into a common subdirectory
Includes some non-trivial files and some headers. Keeps the cuda file extension.
The build systems for different architectures seem unlikely to have much in
common. The idea is therefore to set include paths such that files under
common/src compile as if they were under arch/src as the mechanism for sharing.
In particular, files under common/src need to be able to include target_impl.h.
The corresponding -Icommon is left out in favour of explicit includes on the
basis that the it makes it clearer which files under common are used by a given
architecture.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: jfb, mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70328
Summary:
[libomptarget][nfc] Use cuda variable wrappers from support.h
Reimplementation of D69693, after the revert of D69885
Use the wrappers in support.h for cuda builtin variables at all call sites.
Localises use of cuda and removes WARPSIZE==32 assumption in debug.h.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70186
Summary:
[libomptarget] Move supporti.h to support.cu
Reimplementation of D69652, without the unity build and refactors.
Will need a clean build of libomptarget as the cmakelists changed.
Reviewers: ABataev, jdoerfert
Reviewed By: jdoerfert
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D70131
Summary:
[libomptarget] Revert all improvements to support
The change to unity build for nvcc has broken the build for some developers.
This patch reverts to a known-working state.
There has been some confusion over exactly how the build broke. I think we
have reached a common understanding that the disappearing symbols are from
the bitcode library built by clang. The static archive built by nvcc may show the
same problem. Some of the confusion arose from building the deviceRTL twice
and using one or the other library based on various environmental factors.
I'm pretty sure the problem is clang expanding `__forceinline__` into both `__inline__`
and `attribute(("always_inline"))`. The `__inline__` attribute resolves to linkonce_odr
which is not safe for exporting symbols from translation units.
"always_inline" is the desired semantic for small functions defined in one translation
unit that are intended to be inlined at link time. "inline" is not.
This therefore reintroduces the dependency hazard of supporti.h and some code
duplication, and blocks progress separating deviceRTL into reusable components.
See also D69857, D69859 for attempts at a fix instead of a revert.
Reviewers: ABataev, jdoerfert, grokos, ikitayama, tianshilei1992
Reviewed By: ABataev
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69885
Summary:
[libomptarget] Implement target_impl for amdgcn
Smallest atomic addition for a new target. Implements enough of the amdgcn
specific code that some of the source files under nvptx/src could be compiled,
without modification, to run on amdgcn.
This foreshadows a work in progress patch to move said source out of nvptx/src.
Patch based on fork at https://github.com/ROCm-Developer-Tools/llvm-project
Reviewers: ABataev, jdoerfert, grokos, ronlieb
Subscribers: jvesely, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69718
Summary:
[nfc][omptarget] Use builtin var abstraction. Second pass at D69476
Use the wrappers in support.h for cuda builtin variables at all call sites.
Localises use of cuda and removes WARPSIZE==32 assumption in debug.h.
Reviewers: ABataev, jdoerfert, grokos
Reviewed By: jdoerfert
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69693
Summary:
[nfc][libomptarget] Reorganise support header
All functions defined in support implementation are now declared in support.h
Reordered functions in support implementation to match the sequence in support.h
Added include guards to support.h
Added #include interface to support.h to provide kmp_Ident declaration
Move supporti.h to support.cu and s/INLINE/EXTERN/g
Add remaining includes to support.cu
A minor side effect is to change the name mangling of the support functions to
extern "C". If this matters another macro along the lines of INLINE/EXTERN
can be added - perhaps DEVICE as that's the obvious implementation.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: jdoerfert
Subscribers: mgorny, jfb, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69652
Summary:
[libomptarget] Change nvcc compilation to use a unity build
This allows nvcc to inline functions between what would otherwise be distinct
translation units, which in turn removes any runtime cost from implementing
functions in source files (as opposed to inline in headers).
This will then allow the circular dependencies in deviceRTL to be readily
broken and individual components more easily shared between architectures.
Reviewers: ABataev, jdoerfert, grokos, RaviNarayanaswamy, hfinkel, ronlieb, gregrodgers
Reviewed By: jdoerfert
Subscribers: mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69489
Summary:
[libomptarget] Always call malloc, free via SafeMalloc, SafeFree wrapper
NFC for release, adds some verbosity to debug printing. Motivation is to provide
one place where local modifications can be made to the behaviour of all heap
allocation or deallocation while debugging.
Reviewers: jdoerfert, ABataev, grokos
Reviewed By: ABataev
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69492
Summary:
[nfc][libomptarget] Decrease coupling between files
debug.h used the symbol omptarget_device_environment so implicitly required
an include of omptarget-nvptx.h to compile. Similarly interface.h uses size_t.
Moving this declaration to a new header means cancel, critical can now build
without omptarget-nvptx.h. After this change, debug.h, cancel.cu, critical.cu
could move under a common source directory.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69473
Summary:
[nfc][libomptarget] Inline option into target_impl
Subset of D69423. The macros that were in option.h are all target dependent.
Inlining the header simplifies the dependency graph when looking to move code
into a common subdir.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69472
Summary:
[NFC][libomptarget] move remaining device specific code out of omptarget-nvptx.h
Strictly there is one remaining difference wrt amdgcn - parallelLevel is
volatile qualified on amdgcn and not on nvptx. Determining whether this is
correct - and how to represent the different semantics of 'volatile' under
various conditions - is beyond the scope of this code motion patch.
Reviewers: ABataev, jdoerfert, grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D69424
Summary:
[libomptarget][nfc] Make interface.h target independent
Move interface.h under a top level include directory.
Remove #includes to avoid the interface depending on the implementation.
Reviewers: ABataev, jdoerfert, grokos, ronlieb, RaviNarayanaswamy
Reviewed By: jdoerfert
Subscribers: mgorny, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D68615
llvm-svn: 374919
Summary:
[libomptarget][nfc] Update remaining uint32 to use lanemask_t
Update a few functions in the API to use lanemask_t instead of i32. NFC for
nvptx. Also update the ActiveThreads type in DataSharingStateTy.
This removes a lot of #ifdef from the downsteam amdgcn implementation.
Reviewers: ABataev, jdoerfert, grokos, ronlieb, RaviNarayanaswamy
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D68513
llvm-svn: 373806
Linker automatically provides __start_<section name> and __stop_<section name> symbols to satisfy unresolved references if <section name> is representable as a C identifier (see https://sourceware.org/binutils/docs/ld/Input-Section-Example.html for details). These symbols indicate the start address and end address of the output section respectively. Therefore, renaming OpenMP offload entries section name from ".omp.offloading_entries" to "omp_offloading_entries" to use this feature.
This is the first part of the patch for eliminating OpenMP linker script (please see https://reviews.llvm.org/D64943).
Differential Revision: https://reviews.llvm.org/D68070
llvm-svn: 373118
Summary:
In non-SPMD mode we may end up with the divergent threads when trying to
increment/decrement parallel level counter. It may lead to incorrect
calculations of the parallel level and wrong results when threads are
divergent. We need to reconverge the threads before trying to modify the
parallel level counter.
Reviewers: grokos, jdoerfert
Subscribers: guansong, openmp-commits, caomhin, kkwli0
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66802
llvm-svn: 370803
Summary:
Use target_impl functions to replace more inline asm
Follow on from D65836. Removes remaining asm shuffles and lanemask accessors
Also changes the types of target_impl bitwise functions to unsigned.
Reviewers: jdoerfert, ABataev, grokos, Hahnfeld, gregrodgers, ronlieb, hfinkel
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66809
llvm-svn: 370216
Summary:
[libomptarget] Refactor syncthreads macro to inline function
See also abandoned D66846, split into this diff and others.
Rev 2 of D66855
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66861
llvm-svn: 370210
Summary:
[libomptarget] Refactor syncwarp macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66857
llvm-svn: 370149
Summary:
[libomptarget] Refactor shfl_down_sync macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66853
llvm-svn: 370146
Summary:
[libomptarget] Refactor shfl_sync macro to inline function
See also abandoned D66846, split into this diff and others.
Reviewers: jdoerfert, ABataev, grokos, ronlieb, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66852
llvm-svn: 370144
Summary:
Added function void __kmpc_syncwarp(int32_t) to expose it to the
compiler. It is required to fix the problem with the critical regions in
Cuda9.0+. We cannot use barrier in the critical region, but still need
to reconverge the threads in the warp after. This function allows to do
this.
Reviewers: grokos, jdoerfert
Subscribers: guansong, openmp-commits, kkwli0, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D66672
llvm-svn: 369933
Summary:
In Cuda 9.0 it is not guaranteed that threads in the warps are
convergent. We need to use __syncwarp() function to reconverge
the threads and to guarantee the memory ordering among threads in the
warps.
This is the first patch to fix the problem with the test
libomptarget/deviceRTLs/nvptx/src/sync.cu on Cuda9+.
This patch just replaces calls to __shfl_sync() function with the call
of __syncwarp() function where we need to reconverge the threads when we
try to modify the value of the parallel level counter.
Reviewers: grokos
Subscribers: guansong, jfb, jdoerfert, caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65013
llvm-svn: 369796
Summary:
[libomptarget] Factor architecture dependent code out of loop.cu
Related to the patch series starting D64217. Added subscribers to said series as reviewers. This effort is smaller in scope.
This patch factors out just enough architecture dependent code from loop.cu to allow the same source to be used with amdgcn, given a different target_impl.h. Testing is that the same bitcode (modulo variable names) is generated for libomptarget before and after the refactor, for nvptx and the out of tree amdgcn.
Reviewers: jdoerfert, ABataev, bollu, jfb, tra, grokos, Hahnfeld, guansong, xtian, gregrodgers, ronlieb, hfinkel, gtbercea, guraypp, arpith-jacob
Reviewed By: jdoerfert, ABataev
Subscribers: dexonsmith, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65836
llvm-svn: 368751
Summary:
This patch adds support for the close map modifier.
The close map modifier will overwrite the unified shared memory requirement and create a device copy of the data.
Reviewers: ABataev, Hahnfeld, caomhin, grokos, jdoerfert, AlexEichenberger
Reviewed By: Hahnfeld, AlexEichenberger
Subscribers: guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65340
llvm-svn: 368488
We have one global RTLs.RequiresFlags, I don't see a need to make a
copy per device that the runtime manages. This was problematic anyway
because the copy happened during the first __tgt_register_lib(). This
made it impossible to call __tgt_register_requires() from normal user
funtions for testing.
Hence, this change also fixes unified_shared_memory/shared_update.c for
older versions of Clang that don't call __tgt_register_requires() before
__tgt_register_lib().
Differential Revision: https://reviews.llvm.org/D66019
llvm-svn: 368465
Summary:
This patch adds support for using unified memory in the case of regular maps that happen when a target region is offloaded to the device.
For cases where only a single version of the data is required then the host address can be used. When variables need to be privatized in any way or globalized, then the copy to the device is still required for correctness.
Reviewers: ABataev, jdoerfert, Hahnfeld, AlexEichenberger, caomhin, grokos
Reviewed By: Hahnfeld
Subscribers: mgorny, guansong, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65001
llvm-svn: 368192
Summary:
[libomptarget] Use forceinline. Necessary for nvcc to inline small functions within the bitcode library
Suggested in D65836
Reviewers: ABataev, jdoerfert, grokos, gregrodgers
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D65876
llvm-svn: 368177
Ensures that CUDA fail reasons (such as "No CUDA-capable device detected")
are printed together with libomptarget's debug message
(e.g. "Error when setting CUDA context"). Previously, the former was
printed only in CMAKE_BUILD_TYPE=Debug builds while the latter was
enabled by LIBOMPTARGET_ENABLE_DEBUG.
With this change, also only call cuGetErrorString when the error will be
printed.
Suggested-by: Ye Luo <xw111luoye@gmail.com>
Differential Revision: https://reviews.llvm.org/D65687
llvm-svn: 367910
This patch implements the libomptarget runtime interface for OpenMP 5.0
declare mapper functions. The declare mapper functions generated by
Clang will call them to complete the mapping of members.
kmpc_mapper_num_components gets the current number of components for a
user-defined mapper; kmpc_push_mapper_component pushes back one
component for a user-defined mapper.
The design slides can be found at
https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
Patch by Lingda Li <lildmh@gmail.com>
Differential Revision: https://reviews.llvm.org/D60972
llvm-svn: 367772
Summary:
According to the OpenMP standard, barrier operation must perform
implicit flush operation. Currently, if there is only one thread in the
team, barrier does not flush the memory. Patch fixes this problem.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D62398
llvm-svn: 367024
If the first target region in a program calls the push_tripcount
function, libomptarget didn't handle the offload policy correctly.
This could lead to unexpected error messages as seen in
http://lists.llvm.org/pipermail/openmp-dev/2019-June/002561.html
To solve this, add a check calling IsOffloadDisabled() as all other
entry points already do. If this method returns false, libomptarget
is effectively disabled.
Differential Revision: https://reviews.llvm.org/D64626
llvm-svn: 366810
Remove loopTripCnt from threaded device stack after consuming it.
Added a libomptarget DP message to aid in future debugging and to
validate the added testcase, which only runs in Debug build.
Differential Revision: https://reviews.llvm.org/D64808
llvm-svn: 366349
Summary:
We used CUDART_VERSION macro to check for the installed cuda version
but this macro is defined in cuda_runtime_api.h, which is not used by
project. Better to use CUDA_VERSION macro, which is defined in cuda.h.
Also, added the check if this macro is defined. If macro is undefined,
there is something wrong with the cuda configuration and we should not
continue the compilation.
This also fixes problems with runtime building in cuda 10+.
Reviewers: grokos
Subscribers: guansong, jdoerfert, caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64648
llvm-svn: 366224
Summary:
We used to call __kmpc_omp_taskwait function with global threadid set to
0. It may crash the application at the runtime if the thread executing
target region is not a master thread.
Reviewers: grokos, kkwli0
Subscribers: guansong, jdoerfert, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64571
llvm-svn: 366220
These entry points are never called by Clang trunk nor clang-ykt. If
XL doesn't use them either, they can finally go away.
Differential Revision: https://reviews.llvm.org/D52700
llvm-svn: 365817
Summary:
__kmpc_push_tripcount function is not thread safe and may lead to data
race when the target regions are executed in parallel threads. The patch
makes loopTripCnt counter thread aware and stores the tripcount value
per thread in the map. Access to map is guarded by mutex to prevent
data race in the map itself.
Test is for NVPTX target because it does not work correctly on the
host. Seems to me, there is a problem in libomp with target regions in
the parallel threads.
Reviewers: grokos
Subscribers: guansong, jfb, jdoerfert, openmp-commits, kkwli0, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D64080
llvm-svn: 365332
Summary:
According to the OpenMP standard, flush makes a thread’s temporary view of memory consistent with memory and enforces an order on the memory operations of the variables explicitly specified or implied.
According to the Cuda toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#memory-fence-functions), __threadfence() functions provides required functionality.
__threadfence_system() also provides required functionality, but it also
includes some extra functionality, like synchronization of page-locked
host memory, synchronization for the host, etc. It is not required per
the standard and we can use more relaxed version of memory fence
operation.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D62397
llvm-svn: 364572
Summary:
This patch adds support for handling variables under the:
```
#pragma omp declare target to()
```
clause when the
```
#pragma omp requires unified_shared_memory
```
is used.
The address of the host variable is copied into the device pointer just like for the declare target link case.
Reviewers: ABataev, caomhin, grokos, AlexEichenberger
Reviewed By: grokos
Subscribers: jcownie, guansong, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D63106
llvm-svn: 363825
Summary:
The problems with __syncthreads() were fixed in clang >= 9.0 and the
original __syncthreads() can be used instead of the ptx instruction.
Reviewers: grokos
Subscribers: guansong, jdoerfert, openmp-commits, kkwli0, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D63515
llvm-svn: 363807
Summary: This patch enables the usage of a host variable on the device for declare target link variables when unified memory is available.
Reviewers: ABataev, caomhin, grokos
Reviewed By: grokos
Subscribers: Hahnfeld, guansong, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D60884
llvm-svn: 362505
Summary:
Parallel level counter should be volatile to prevent some dangerous
optimiations by the ptxas. Otherwise, ptxas optimizations lead to
undefined behaviour in some cases.
Also, use __threadfence() for #pragma omp flush and if the barrier
should not be used (we have only one thread in the team), still perform
flush operation since the standard requires implicit flush when
executing barriers.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, jfb, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D62199
llvm-svn: 361421
Summary:
Target link variables are currently implemented by creating a copy of the variables on the device side and unified memory never gets exploited.
When the prgram uses the:
```
#pragma omp requires unified_shared_memory
```
directive in conjunction with a declare target link, the linked variable is no longer allocated on the device and the host version is used instead.
This behavior is overridden by performing an explicit mapping.
A Clang side patch is required.
Reviewers: ABataev, AlexEichenberger, grokos, Hahnfeld
Reviewed By: AlexEichenberger, grokos, Hahnfeld
Subscribers: Hahnfeld, jfb, guansong, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D60223
llvm-svn: 361294
Summary:
Patch improves performance of the full runtime mode by moving
threads limit counter to the shared memory. It also allows to save
global memory.
Reviewers: grokos, kkwli0, gtbercea
Subscribers: guansong, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61801
llvm-svn: 360584
Summary:
Patch improves performance of the full runtime mode by moving
number-of-threads counter to the shared memory. It also allows to save
global memory.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61785
llvm-svn: 360457
Summary:
Patch improves performance of the full runtime mode by moving
thread-limit counter to the shared memory. It also allows to save
global memory.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jdoerfert, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61526
llvm-svn: 359922
Summary:
Used parallelLevel[] counter to simplify and improve implementation of
the existing standard OpenMP functions. Functions are tested already in
several tests, the patch is NFC.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jdoerfert, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61459
llvm-svn: 359892
Summary:
Previously for the different purposes we need to get the active/common
parallel level and with full runtime we iterated over all the records to
calculate this level. Instead, we can used the warp-based parallel level
counters used in no-runtime mode.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, jdoerfert, caomhin, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61395
llvm-svn: 359822
Summary:
Function omp_get_thread_limit() in SPMD mode can return the maximum
available number of threads as a result.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D61378
llvm-svn: 359790
Summary:
The parallelLevel counter must be on per-thread basis to fully support
L2+ parallelism, otherwise we may end up with undefined behavior.
Introduce the parallelLevel on per-warp basis using shared memory. It
allows to avoid the problems with the synchronization and allows fully
support L2+ parallelism in SPMD mode with no runtime.
Reviewers: gtbercea, grokos
Subscribers: guansong, jdoerfert, caomhin, kkwli0, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D60918
llvm-svn: 359341
Fix the test to run it really in SPMD mode without runtime. Previously
it was run in SPMD + full runtime mode and does not allow to cehck the
functionality correctly.
llvm-svn: 358902
Summary:
If the kernel is executed in SPMD mode and the L2+ parallel for region
with the dynamic scheduling is executed, dynamic scheduling functions
are called. They expect full runtime support, but SPMD kernels may be
executed without the full runtime. It leads to the runtime crash of the
compiled program. Patch fixes this problem + fixes handling of the
parallelism level in SPMD mode, which is required as part of this patch.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, jdoerfert, openmp-commits, caomhin
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D60578
llvm-svn: 358442
At the moment, support for runtime debug output using the
OMPTARGET_DEBUG=1 environment variable is only available with
CMAKE_BUILD_TYPE=Debug builds. The patch allows setting it independently
using the LIBOMPTARGET_ENABLE_DEBUG option, which is enabled by default
depending on CMAKE_BUILD_TYPE. That is, unless this option is set
explicitly, nothing changes. This is the same mechanism used by LLVM for
LLVM_ENABLE_ASSERTIONS.
This patch also removes adding -g -O0 in debug builds, it should be
handled by cmake's CMAKE_{C|CXX}_FLAGS_DEBUG configuration option.
Idea by Hal Finkel
Differential Revision: https://reviews.llvm.org/D55952
llvm-svn: 356998
Summary:
This patch adds a more sophisticated team reduction scheme to the OpenMP libomptarget-nvptx runtime.
The scheme uses a fixed size global memory buffer whose length can be adjusted via compiler flag:
```
-fopenmp-cuda-teams-reduction-recs-num=1024
```
The global buffer is a structure of arrays (with default size of 1024 each and controlled by the above flag), one array for each reduction variable.
Values in the buffer are processed by the last team to finish executing the body of the target region.
In addition to adding support for the new flag, the compiler also emits special functions used for the reduction of the intermediate reduction values. These changes will be added in a separate compiler patch following this one.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, jfb, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D58409
llvm-svn: 354471
to reflect the new license. These used slightly different spellings that
defeated my regular expressions.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351648
Summary: Replace existing infrastructure for tracking parallel level using global memory with a per-team shared memory variable. This minimizes the impact of the overhead of tracking the parallel level for non-nested cases.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D55773
llvm-svn: 350747
Summary:
Previous implementation may cause the runtime crash when the number of
teams is > 1024. Patch fixes this problem + reduces number of the atomic
operations by 32 times.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56332
llvm-svn: 350524
Summary:
Reduced number of the used register + improved performance propagating
the information about current execution/data sharing mode directly from
the compiler, where it is possible.
In some cases, it requires new/reworked interfaces of the runtime
external functions. Old functions are marked as deprecated.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56278
llvm-svn: 350405