At the moment, support for runtime debug output using the
OMPTARGET_DEBUG=1 environment variable is only available with
CMAKE_BUILD_TYPE=Debug builds. The patch allows setting it independently
using the LIBOMPTARGET_ENABLE_DEBUG option, which is enabled by default
depending on CMAKE_BUILD_TYPE. That is, unless this option is set
explicitly, nothing changes. This is the same mechanism used by LLVM for
LLVM_ENABLE_ASSERTIONS.
This patch also removes adding -g -O0 in debug builds, it should be
handled by cmake's CMAKE_{C|CXX}_FLAGS_DEBUG configuration option.
Idea by Hal Finkel
Differential Revision: https://reviews.llvm.org/D55952
llvm-svn: 356998
Summary:
This patch adds a more sophisticated team reduction scheme to the OpenMP libomptarget-nvptx runtime.
The scheme uses a fixed size global memory buffer whose length can be adjusted via compiler flag:
```
-fopenmp-cuda-teams-reduction-recs-num=1024
```
The global buffer is a structure of arrays (with default size of 1024 each and controlled by the above flag), one array for each reduction variable.
Values in the buffer are processed by the last team to finish executing the body of the target region.
In addition to adding support for the new flag, the compiler also emits special functions used for the reduction of the intermediate reduction values. These changes will be added in a separate compiler patch following this one.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, jfb, jdoerfert, openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D58409
llvm-svn: 354471
to reflect the new license. These used slightly different spellings that
defeated my regular expressions.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351648
Summary: Replace existing infrastructure for tracking parallel level using global memory with a per-team shared memory variable. This minimizes the impact of the overhead of tracking the parallel level for non-nested cases.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D55773
llvm-svn: 350747
Summary:
Previous implementation may cause the runtime crash when the number of
teams is > 1024. Patch fixes this problem + reduces number of the atomic
operations by 32 times.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56332
llvm-svn: 350524
Summary:
Reduced number of the used register + improved performance propagating
the information about current execution/data sharing mode directly from
the compiler, where it is possible.
In some cases, it requires new/reworked interfaces of the runtime
external functions. Old functions are marked as deprecated.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56278
llvm-svn: 350405
The OpenMP runtime's cmake scripts do not correctly locate the
libdevice that the Debian/Ubuntu package nvidia-cuda-toolkit currently
includes, at least on my Ubuntu 18.04.1 installation. This patch
fixes that for me.
This problem was discussed at length in D55269. D40453 added a
similar adjustment in clang, but reviewers of D55269 concluded that,
for the OpenMP runtime, the right place to address this problem is in
cmake's CUDA support. However, it was also suggested we could add a
workaround to OpenMP's cmake scripts now. This patch contains such a
workaround, which I've tried to design so that it will have no harmful
effect if cmake improves in the future.
nvidia-cuda-toolkit also needs improvements because its intended
monolithic CUDA tree shim, /usr/lib/cuda, has many empty directories,
such as bin. I reported that at:
<https://bugs.launchpad.net/ubuntu/+source/nvidia-cuda-toolkit/+bug/1808999>
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D55588
llvm-svn: 350377
Add omp_get_device_num() function for 5.0 which returns the number of the
device the current thread is running on. Currently, we are leaving it to the
compiler to handle this properly if it is called inside target.
Also, did some cleanup and updating of duplicate device API functions (in both
libomp and libomptarget) to make them into weak functions that check for the
symbol from libomptarget, and will call the version in libomptarget if it is
present. If any additional device API functions are implemented also in
libomptarget in the future, we should add the dlsym calls to the host functions.
Also, if the omp_target_* functions are to be implemented for the host (this has
been requested), they should attempt to call the libomptarget versions as well.
Patch by Terry Wilmarth
Differential Revision: https://reviews.llvm.org/D55578
llvm-svn: 350352
Summary:
One of the LLVM optimizations, split critical edges, also clones tail
instructions. This is a dangerous operation for __syncthreads()
functions and this transformation leads to undefined behavior or
incorrect results. Patch fixes this problem by replacing __syncthreads()
function with the assembler instruction, which cost is too high and
wich cannot be copied.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56274
llvm-svn: 350333
Summary:
Avoid using of the atomic loop to wait for the completion of the
data-sharing interface initialization, use __shfl_sync instead for the
communication within the warp to signal other threads in the warp about
completion of the initialization.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D56100
llvm-svn: 350129
Summary:
At high optimization level asserts lead to some unexpected results
because of auto-inserted unreachable instructions. This outlining
prevents some of such dangerous optimizations and leads to better
stability.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D56101
llvm-svn: 350128
Summary:
Use the original shuffle implementation for __kmpc_shuffle_int64 since
default implementation uses the same implementation.
Reviewers: gtbercea
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55514
llvm-svn: 348772
Summary:
Shuffle on 64bit data is allowed only for CUDA >= 9.0. Also, fixed the
constant for the mask, need one extra L in the end.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55440
llvm-svn: 348758
Summary:
Introduced special noinline function log that allows to save some
registers for optimized builds but with enabled logging. Also, it
increases the stability of the optimized builds with inlined runtime.
Reviewers: gtbercea, kkwli0
Reviewed By: gtbercea
Subscribers: caomhin, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D55436
llvm-svn: 348606
Summary:
According to the standard, after memory flushing the changes in the
memory must be visible to all the threads in all teams. Patch fixes
this.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55370
llvm-svn: 348491
Summary:
Reworked runtime to make it compatible with the requirements of the
original runtime library. Also, simplified some code to reduce number of
function calls.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55130
llvm-svn: 348003
Summary: To enable the compiler to optimize parts of the function that are not needed when runtime can be omitted, a new version of the SPMD deinit kernel function is needed. This function takes the runtime required flag as an argument.
Reviewers: ABataev, kkwli0, caomhin
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D54969
llvm-svn: 347714
Summary:
Added functions __kmpc_nvptx_teams_reduce_nowait_simple and
__kmpc_nvptx_teams_end_reduce_nowait_simple to implement basic support
for reductions across the teams.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D54967
llvm-svn: 347710
Summary: Refactor the checking for SPMD mode and whether the runtime is initialized or not. This uses constant flags which enables the runtime to optimize out unused sections of code that depend on these flags.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D54960
llvm-svn: 347698
Summary:
The base pointer for the lambda mapping must point to the lambda capture
placement and pointer must point to the captured variable itself. Patch
fixes this problem.
Reviewers: gtbercea
Subscribers: guansong, openmp-commits, kkwli0, caomhin
Differential Revision: https://reviews.llvm.org/D54260
llvm-svn: 346407
Summary:
The previously used combination `PTR_AND_OBJ | PRIVATE` could be used
for mapping of some data in Fortran. Changed it to `PTR_AND_OBJ |
LITERAL`.
Reviewers: gtbercea
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D54035
llvm-svn: 345981
Summary:
Current globalization scheme works correctly only for SPMD+lightweight
runtime mode and does not work for full runtime. Patch improves support
for the globalization scheme + reduces global memory consumption in
lightweight runtime mode.
Patch adds runtime functions to work with the statically allocated
global memory. It allows to improve performance and memory consumption.
This global memory must be allocated by the compiler.
Reviewers: grokos, kkwli0, gtbercea, caomhin
Subscribers: guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D53943
llvm-svn: 345976
Summary: In the case of coalesced global records, we need to push the exact data size passed in. This patch fixes this by outlining the common functionality of the previous push function and by adding a separate entry point for coalesced pushes. The pop function remains unchanged.
Reviewers: ABataev, grokos, caomhin
Reviewed By: ABataev, grokos
Subscribers: jholewinski, cfe-commits, Hahnfeld, guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D53141
llvm-svn: 345867
Summary:
Added support for correct mapping of variables captured by reference in
lambdas. That kind of mapping may appear only in target-executable
regions and must follow the original lambda or another lambda capture
for the same lambda.
The expected data: base address - the address of the lambda, begin
pointer - pointer to the address of the lambda capture, size - size of
the captured variable.
When OMP_TGT_MAPTYPE_PTR_AND_OBJ mapping type is seen in
target-executable region, the target address of the last processed item
is taken as the address of the original lambda `tgt_lambda_ptr`. Then,
the pointer to capture on the device is calculated like `tgt_lambda_ptr
+ (host_begin_pointer - host_begin_base)` and the target-based address
of the original variable (which host address is
`*(void**)begin_pointer`) is written to that pointer.
Reviewers: kkwli0, gtbercea, grokos
Subscribers: openmp-commits
Differential Revision: https://reviews.llvm.org/D51107
llvm-svn: 345608
If the user requested LIBOMPTARGET_NVPTX_DEBUG, include asserts in
the bitcode library. Everything else will have very unpleasent
effects because asserts will appear when falling back to the static
library libomptarget-nvptx.a.
Differential Revision: https://reviews.llvm.org/D52701
llvm-svn: 343477
Pass in the correct value of isRuntimeUninitialized() which solves
parallel reductions as reported on the mailing list.
For reference: r333285 did the same for loop scheduling.
Differential Revision: https://reviews.llvm.org/D52725
llvm-svn: 343476
NVPTX requires addresses of pointer locations to be 8-byte aligned
or there will be an exception during runtime.
This could happen without this patch as shown in the added test:
getId() requires 4 byte of stack and putValueInParallel() uses 16
bytes to store the addresses of the captured variables.
Differential Revision: https://reviews.llvm.org/D52655
llvm-svn: 343402
According to OpenMP 4.5, p250:12-14:
If the requested nest level is outside the range of 0 and the
nest level of the current thread, as returned by the omp_get_level
routine, the routine returns -1.
The SPMD code path will need a similar fix.
Differential Revision: https://reviews.llvm.org/D51787
llvm-svn: 343401
Clang trunk will serialize nested parallel regions. Check that this
is correctly reflected in various API methods.
Differential Revision: https://reviews.llvm.org/D51786
llvm-svn: 343382
There is no support and according to the OpenMP 4.5, p238:7-9:
For implementations that do not support dynamic adjustment
of the number of threads this routine has no effect: the
value of dyn-var remains false.
Add a test that cancellation and nested parallelism aren't
supported either.
Differential Revision: https://reviews.llvm.org/D51785
llvm-svn: 343381
If there is no num_threads() clause we must consider the
nthreads-var ICV. Its value is set by omp_set_num_threads()
and can be queried using omp_get_max_num_threads().
The rewritten code now closely resembles the algorithm given
in the OpenMP standard.
Differential Revision: https://reviews.llvm.org/D51783
llvm-svn: 343380
infinite loop on removing non-mapped pointer-with-object.
Added test to check that libomptarget does not cause infinite loop when
trying to unmap the pointer-with-object data that was not previously
mapped.
llvm-svn: 343344
This patch also introduces testing for libomptarget-nvptx
which has been missing until now. I propose to add tests for
all bugs that are fixed in the future.
The target check-libomptarget-nvptx is not run by default because
- we can't determine if there is a GPU plugged into the system.
- it will require the latest Clang compiler. Keeping compatibility
with older releases would prevent testing newer code generation
developed in trunk.
Differential Revision: https://reviews.llvm.org/D51687
llvm-svn: 343324
Summary: NFC - just fixing a bug: the empty slot test was before the re-setting of the Stack pointer.
Reviewers: ABataev, caomhin, Hahnfeld
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D52122
llvm-svn: 343006
Summary:
There is currently no supported situation where the warp master is not the first thread in the warp.
This also avoids the device execution from hanging on Volta GPUs when ballot_sync is called by a number of threads that is less that the size of a warp.
Reviewers: ABataev, caomhin, grokos
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D50188
llvm-svn: 342972
Summary:
We need the support for per-team shared variables to support codegen for
lastprivates/reductions. Patch adds this support by using shared memory
if the total size of the reductions/lastprivates is <= 128 bytes,
then pre-allocated buffer in global memory if size is <= 4K bytes,or
uses malloc/free, otherwise.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D51875
llvm-svn: 342737
Summary:
Missed operation of the incrementing iterator when required just to
continue execution.
Reviewers: kkwli0, gtbercea, grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D51937
llvm-svn: 341964
This is a follow-up to r341371: The new test for PR38704 doesn't
work with Clang 6.0. It uses an UNSUPPORTED: clang-6, but that
hasn't worked because the compiler features weren't known to lit.
llvm-svn: 341448
cuDeviceGetProperties has apparently been deprecated since CUDA 5.0.
Nvidia started using annotations only in CUDA 9.2, so nobody noticed
nor cared before.
The new function returns the same values, tested with a P100.
Differential Revision: https://reviews.llvm.org/D51624
llvm-svn: 341372
* cg and HasCancel in WorkDescr were never read and can be removed.
* This eliminates the last use of priv in ThreadPrivateContext.
* CounterGroup is unused afterwards.
* Remove duplicate external declares in omptarget-nvptx.cu that are
already in the header omptarget-nvptx.h.
Differential Revision: https://reviews.llvm.org/D51622
llvm-svn: 341370