to reflect the new license. These used slightly different spellings that
defeated my regular expressions.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351648
Summary: Replace existing infrastructure for tracking parallel level using global memory with a per-team shared memory variable. This minimizes the impact of the overhead of tracking the parallel level for non-nested cases.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D55773
llvm-svn: 350747
Summary:
Previous implementation may cause the runtime crash when the number of
teams is > 1024. Patch fixes this problem + reduces number of the atomic
operations by 32 times.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56332
llvm-svn: 350524
Summary:
Reduced number of the used register + improved performance propagating
the information about current execution/data sharing mode directly from
the compiler, where it is possible.
In some cases, it requires new/reworked interfaces of the runtime
external functions. Old functions are marked as deprecated.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, jfb, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56278
llvm-svn: 350405
Summary:
One of the LLVM optimizations, split critical edges, also clones tail
instructions. This is a dangerous operation for __syncthreads()
functions and this transformation leads to undefined behavior or
incorrect results. Patch fixes this problem by replacing __syncthreads()
function with the assembler instruction, which cost is too high and
wich cannot be copied.
Reviewers: grokos, gtbercea, kkwli0
Subscribers: guansong, openmp-commits, caomhin
Differential Revision: https://reviews.llvm.org/D56274
llvm-svn: 350333
Summary:
Avoid using of the atomic loop to wait for the completion of the
data-sharing interface initialization, use __shfl_sync instead for the
communication within the warp to signal other threads in the warp about
completion of the initialization.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D56100
llvm-svn: 350129
Summary:
At high optimization level asserts lead to some unexpected results
because of auto-inserted unreachable instructions. This outlining
prevents some of such dangerous optimizations and leads to better
stability.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D56101
llvm-svn: 350128
Summary:
Use the original shuffle implementation for __kmpc_shuffle_int64 since
default implementation uses the same implementation.
Reviewers: gtbercea
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55514
llvm-svn: 348772
Summary:
Shuffle on 64bit data is allowed only for CUDA >= 9.0. Also, fixed the
constant for the mask, need one extra L in the end.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55440
llvm-svn: 348758
Summary:
Introduced special noinline function log that allows to save some
registers for optimized builds but with enabled logging. Also, it
increases the stability of the optimized builds with inlined runtime.
Reviewers: gtbercea, kkwli0
Reviewed By: gtbercea
Subscribers: caomhin, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D55436
llvm-svn: 348606
Summary:
According to the standard, after memory flushing the changes in the
memory must be visible to all the threads in all teams. Patch fixes
this.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55370
llvm-svn: 348491
Summary:
Reworked runtime to make it compatible with the requirements of the
original runtime library. Also, simplified some code to reduce number of
function calls.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D55130
llvm-svn: 348003
Summary: To enable the compiler to optimize parts of the function that are not needed when runtime can be omitted, a new version of the SPMD deinit kernel function is needed. This function takes the runtime required flag as an argument.
Reviewers: ABataev, kkwli0, caomhin
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D54969
llvm-svn: 347714
Summary:
Added functions __kmpc_nvptx_teams_reduce_nowait_simple and
__kmpc_nvptx_teams_end_reduce_nowait_simple to implement basic support
for reductions across the teams.
Reviewers: gtbercea, kkwli0
Subscribers: guansong, jfb, caomhin, openmp-commits
Differential Revision: https://reviews.llvm.org/D54967
llvm-svn: 347710
Summary: Refactor the checking for SPMD mode and whether the runtime is initialized or not. This uses constant flags which enables the runtime to optimize out unused sections of code that depend on these flags.
Reviewers: ABataev, caomhin
Reviewed By: ABataev
Subscribers: guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D54960
llvm-svn: 347698
Summary:
Current globalization scheme works correctly only for SPMD+lightweight
runtime mode and does not work for full runtime. Patch improves support
for the globalization scheme + reduces global memory consumption in
lightweight runtime mode.
Patch adds runtime functions to work with the statically allocated
global memory. It allows to improve performance and memory consumption.
This global memory must be allocated by the compiler.
Reviewers: grokos, kkwli0, gtbercea, caomhin
Subscribers: guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D53943
llvm-svn: 345976
Summary: In the case of coalesced global records, we need to push the exact data size passed in. This patch fixes this by outlining the common functionality of the previous push function and by adding a separate entry point for coalesced pushes. The pop function remains unchanged.
Reviewers: ABataev, grokos, caomhin
Reviewed By: ABataev, grokos
Subscribers: jholewinski, cfe-commits, Hahnfeld, guansong, jfb, openmp-commits
Differential Revision: https://reviews.llvm.org/D53141
llvm-svn: 345867
If the user requested LIBOMPTARGET_NVPTX_DEBUG, include asserts in
the bitcode library. Everything else will have very unpleasent
effects because asserts will appear when falling back to the static
library libomptarget-nvptx.a.
Differential Revision: https://reviews.llvm.org/D52701
llvm-svn: 343477
Pass in the correct value of isRuntimeUninitialized() which solves
parallel reductions as reported on the mailing list.
For reference: r333285 did the same for loop scheduling.
Differential Revision: https://reviews.llvm.org/D52725
llvm-svn: 343476
NVPTX requires addresses of pointer locations to be 8-byte aligned
or there will be an exception during runtime.
This could happen without this patch as shown in the added test:
getId() requires 4 byte of stack and putValueInParallel() uses 16
bytes to store the addresses of the captured variables.
Differential Revision: https://reviews.llvm.org/D52655
llvm-svn: 343402
According to OpenMP 4.5, p250:12-14:
If the requested nest level is outside the range of 0 and the
nest level of the current thread, as returned by the omp_get_level
routine, the routine returns -1.
The SPMD code path will need a similar fix.
Differential Revision: https://reviews.llvm.org/D51787
llvm-svn: 343401
Clang trunk will serialize nested parallel regions. Check that this
is correctly reflected in various API methods.
Differential Revision: https://reviews.llvm.org/D51786
llvm-svn: 343382
There is no support and according to the OpenMP 4.5, p238:7-9:
For implementations that do not support dynamic adjustment
of the number of threads this routine has no effect: the
value of dyn-var remains false.
Add a test that cancellation and nested parallelism aren't
supported either.
Differential Revision: https://reviews.llvm.org/D51785
llvm-svn: 343381
If there is no num_threads() clause we must consider the
nthreads-var ICV. Its value is set by omp_set_num_threads()
and can be queried using omp_get_max_num_threads().
The rewritten code now closely resembles the algorithm given
in the OpenMP standard.
Differential Revision: https://reviews.llvm.org/D51783
llvm-svn: 343380
This patch also introduces testing for libomptarget-nvptx
which has been missing until now. I propose to add tests for
all bugs that are fixed in the future.
The target check-libomptarget-nvptx is not run by default because
- we can't determine if there is a GPU plugged into the system.
- it will require the latest Clang compiler. Keeping compatibility
with older releases would prevent testing newer code generation
developed in trunk.
Differential Revision: https://reviews.llvm.org/D51687
llvm-svn: 343324
Summary: NFC - just fixing a bug: the empty slot test was before the re-setting of the Stack pointer.
Reviewers: ABataev, caomhin, Hahnfeld
Reviewed By: ABataev
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D52122
llvm-svn: 343006
Summary:
There is currently no supported situation where the warp master is not the first thread in the warp.
This also avoids the device execution from hanging on Volta GPUs when ballot_sync is called by a number of threads that is less that the size of a warp.
Reviewers: ABataev, caomhin, grokos
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D50188
llvm-svn: 342972
Summary:
We need the support for per-team shared variables to support codegen for
lastprivates/reductions. Patch adds this support by using shared memory
if the total size of the reductions/lastprivates is <= 128 bytes,
then pre-allocated buffer in global memory if size is <= 4K bytes,or
uses malloc/free, otherwise.
Reviewers: gtbercea, kkwli0, grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D51875
llvm-svn: 342737
* cg and HasCancel in WorkDescr were never read and can be removed.
* This eliminates the last use of priv in ThreadPrivateContext.
* CounterGroup is unused afterwards.
* Remove duplicate external declares in omptarget-nvptx.cu that are
already in the header omptarget-nvptx.h.
Differential Revision: https://reviews.llvm.org/D51622
llvm-svn: 341370
If the runtime is uninitialized the master thread must Enqueue the
state object, and ALL threads must return immediately.
Found post-commit of https://reviews.llvm.org/D51222.
llvm-svn: 341328
Summary:
Implemented simple and lightweight runtime support for SPMD mode-based
constructs. It adds support for L2 sequential parallelism wihtout full
runtime support. Also, patch fixes some use cases for
uninitialized|lightweight runtime.
Reviewers: grokos, kkwli0, Hahnfeld, gtbercea
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D51222
llvm-svn: 340944
Summary:
1. Fixed internal problem in `__kmpc_barrier` function: SPMD mode
synchronization function should be called only in L1 parallel level.
2. Removed some extra code for synchronization inside of the code, used
`__kmpc_barrier` instead.
3. Some code cleanup.
Reviewers: gtbercea, grokos
Subscribers: openmp-commits
Differential Revision: https://reviews.llvm.org/D49564
llvm-svn: 337691
Summary: This patch fixes the data sharing infrastructure to work for the SPMD and non-SPMD cases.
Reviewers: ABataev, grokos, carlo.bertolli, caomhin
Reviewed By: ABataev, grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D49204
llvm-svn: 337013
Summary:
Patch fixes the next problems.
1. Removes unused functions from omptarget_nvptx_ThreadPrivateContext
class + simplified data members.
2. Fixed calculation of loop boundaries for dynamic loops with static
scheduling.
3. Introduced saving/restoring of the dynamic loop boundaries to support
several nested parallel dynamic loops.
Reviewers: grokos
Subscribers: guansong, kkwli0, openmp-commits
Differential Revision: https://reviews.llvm.org/D49241
llvm-svn: 336915
Summary:
Patch fixes several problems in the implementation of NVPTX RTL.
1. Detection of the last iteration for loops with static scheduling, no chunks.
2. Fixes reductions for the serialized parallel constructs.
3. Fixes handling of the barriers.
Reviewers: grokos
Reviewed By: grokos
Subscribers: Hahnfeld, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D48480
llvm-svn: 335469
The generic entry points for static loop scheduling previously
hardcoded that the runtime was initialized. This can be wrong if
the compiler analyzes that the runtime is not needed and calls
the init functions accordingly.
This didn't affect clang-ykt because they have entry points for
different combinations of SPMD x Runtime not needed. I didn't do
measurements yet but with inlining we might get away with always
calling the generic interface and letting compiler and runtime
figure out the rest.
In any case, a correct runtime is always better than having
functions that may only be called if previous calls passed in
a specific set of arguments!
Differential Revision: https://reviews.llvm.org/D47131
llvm-svn: 333285
Introduce OPENMP_INSTALL_LIBDIR and use in all install() commands.
This also fixes installation of libomptarget-nvptx that previously
didn't honor {OPENMP,LLVM}_LIBDIR_SUFFIX.
Differential Revision: https://reviews.llvm.org/D47130
llvm-svn: 333284
The existing implementation of the dynamic scheduling
breaks the contract introduced by the original openmp
runtime and, thus, is incorrect. Patch fixes it and
introduces correct dynamic scheduling model.
Thanks to Alexey Bataev for submitting this patch.
Differential Revision: https://reviews.llvm.org/D47333
llvm-svn: 333225
Move all logic related to selecting the bitcode compiler and linker
into a new file and dynamically test required compiler flags. This
also adds -fcuda-rdc for Clang trunk as previously attempted in D44992
which fixes the build.
As a result this change also enables building the library by default
if all prerequisites are met.
Differential Revision: https://reviews.llvm.org/D46901
llvm-svn: 332494
Summary: Add function to the NVPTX libomptarget library that will return true if the current target region is being executed in SPMD mode.
Reviewers: ABataev, grokos, carlo.bertolli, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D46840
llvm-svn: 332360
Summary:
Enable the device side debug messages at compile time, use env var to control at runtime.
To achieve this, an environment data block is passed to the device lib when it is loaded.
By default, the message is off, to enable it, a user need to set LIBOMPDEVICE_DEBUG=1.
Reviewers: grokos
Reviewed By: grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D46210
llvm-svn: 331550
Summary: Minor printf format correction. NVCC ignore those. Clang will give warning on these if debug is enabled.
Reviewers: grokos
Reviewed By: grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D45528
llvm-svn: 330944
Summary: The LIBOMPTARGET_NVPTX_DEBUG flag is inconsistent between using nvcc to generate .a file and clang to generate .bc file. Sync the two setting so we can get debug messages from the bc file path as well.
Reviewers: grokos
Subscribers: Hahnfeld, openmp-commits, mgorny
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D45530
llvm-svn: 330477
Summary:
This one line change is to remove this warning message
"warning: integer conversion resulted in a change of sign"
Reviewers: grokos
Reviewed By: grokos
Subscribers: openmp-commits
Tags: #openmp
Differential Revision: https://reviews.llvm.org/D45415
llvm-svn: 329713
Summary: The global stack initialization function may be called multiple times. The initialization of the shared memory slots should only happen when the function is called for the first time for a given warp master thread.
Reviewers: grokos, carlo.bertolli, ABataev, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44754
llvm-svn: 328148
Summary: The check for the master warp must take into consideration the actual number of warps: the master warp is equal to the last active warp not necessarily WARPSIZE - 1.
Reviewers: grokos, carlo.bertolli, ABataev, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44537
llvm-svn: 328146
Summary:
This patch allows worker to have a global memory stack managed by the runtime. This patch is needed for completeness and consistency with the globalization policy: if a worker-side variable escapes the current context it then needs to be globalized.
Until now, only the master thread was allowed to have such a stack. These global values can now potentially be shared amongst workers if the semantics of the OpenMP program require it.
Reviewers: ABataev, grokos, carlo.bertolli, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44487
llvm-svn: 328144
Summary:
Allow the runtime to use the existing shared memory statically allocated slots.
When a variable is globalized, the underlying memory can be either shared or global memory (both have block-wide visibility). In this case, we allow that the storage to use a limited amount of shared memory that has been statically allocated already. Only if shared memory doesn't prove to be enough do we then invoke malloc() to create a new global memory slot.
Reviewers: ABataev, carlo.bertolli, grokos, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44486
llvm-svn: 327639
Summary: To save on calls to malloc, this patch enables the re-use of pre-allocated global memory slots.
Reviewers: ABataev, grokos, carlo.bertolli, caomhin
Reviewed By: grokos
Subscribers: guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44470
llvm-svn: 327637
Summary:
This patch adds support for the sharing of variables from the master thread of a team to the worker threads of the team.
The runtime uses a stack structure implemented as a doubly-linked list of slots with each slot having the exact same size as the size requested. This implementation leverages existing data structures. The runtime functions are added as separate functions to avoid interfering with the current interface.
Limitations to be addressed in future patches:
- This current patch only employs global memory. In a future patch we will enable to usage for shared memory as an optimization.
- Allow the allocation of several requested sizes in the same slot.
Reviewers: ABataev, grokos, caomhin, carlo.bertolli
Reviewed By: grokos
Subscribers: Hahnfeld, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44260
llvm-svn: 327440
Summary: To make the two parts of the union have the same size, the size of vect needs to be increased by 16 bits.
Reviewers: grokos, carlo.bertolli, caomhin, ABataev
Reviewed By: grokos, ABataev
Subscribers: fedor.sergeev, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D44254
llvm-svn: 327040
Summary:
This patch reverts the changes to libomptarget that were coupled with the changes to Clang code gen for data sharing using shared memory. A similar patch exists for Clang: D43625
Shared memory is meant to be used as an optimization on top of a more general scheme. So far we didn't have a global memory implementation ready so shared memory was a solution which applied to the current level of OpenMP complexity supported by trunk on GPU devices (due to the missing NVPTX backend patch this functionality has never been exercised). Now that we have a global memory solution this patch is "in the way" and needs to be removed (for now). This patch (or an equivalent version of it) will be put out for review once the global memory scheme is in place.
Reviewers: ABataev, grokos, carlo.bertolli, caomhin
Reviewed By: grokos
Subscribers: Hahnfeld, guansong, openmp-commits
Differential Revision: https://reviews.llvm.org/D43626
llvm-svn: 326950
Summary:
Different NVIDIA GPUs support different compute capabilities. To enable the inlining of runtime functions and the best performance on different generations of NVIDIA GPUs, a bc library for each compute capability needs to be compiled. The same compiler build will then be usable in conjunction with multiple generations of NVIDIA GPUs.
To differentiate between versions of the same bc lib, the output file name will contain the compute capability ID.
Depends on D14254
Reviewers: Hahnfeld, hfinkel, carlo.bertolli, caomhin, ABataev, grokos
Reviewed By: Hahnfeld, grokos
Subscribers: guansong, mgorny, openmp-commits
Differential Revision: https://reviews.llvm.org/D41724
llvm-svn: 324904
Summary: This patch extends the libomptarget functionality in patch D14254 with support for the data sharing scheme for supporting implicitly shared variables. The runtime therefore maintains a list of references to shared variables.
Reviewers: carlo.bertolli, ABataev, Hahnfeld, grokos, caomhin, hfinkel
Reviewed By: Hahnfeld, grokos
Subscribers: guansong, llvm-commits, openmp-commits
Differential Revision: https://reviews.llvm.org/D41485
llvm-svn: 324495
https://reviews.llvm.org/D42757
The method ThreadsInTeam is used to determine the number of threads to be used in a parallel region under SPMD mode (see line 127 of supporti.h in libomptarget/deviceRTLs/nvptx/src/). This patch fixes the corresponding debug print upon initialization of the kernel in SPMD mode.
llvm-svn: 323978
This patch implements the device runtime library whose interface is used in the code generation for OpenMP offloading devices.
Currently there is a single device RTL written in CUDA meant to CUDA enabled GPUs.
The interface is a variation of the kmpc interface that includes some extra calls to do thread and storage management that only make sense for a GPU target.
Differential revision: https://reviews.llvm.org/D14254
llvm-svn: 323649