Summary:
Previously, critical regions were emitted with a barrier, which effectively
turned them into a worksharing construct, which they are not. It also led to
incorrect behavior on CUDA 9+. This patch fixes the problem.
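For illustration, a minimal hypothetical example (not taken from the patch's tests): per the OpenMP spec, leaving the critical region below implies only mutual exclusion, not synchronization with the other threads, so no barrier may be emitted for it.
```
void acc(const int *a, int n, int *sum) {
  #pragma omp target teams distribute parallel for map(to : a[0 : n]) map(tofrom : sum[0 : 1])
  for (int i = 0; i < n; ++i) {
    #pragma omp critical
    *sum += a[i]; /* mutual exclusion only; no implied barrier on exit */
  }
}
```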
Reviewers: ABataev, jdoerfert
Subscribers: jholewinski, guansong, cfe-commits, grokos
Tags: #clang
Differential Revision: https://reviews.llvm.org/D66673
llvm-svn: 369946
performance.
Internally generated functions must be marked as always_inline in most
cases. This patch marks some additional reduction functions and outlined
parallel functions as always_inline for better performance, but only when
optimization is requested.
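A rough sketch of what this means for the generated code (the function name and body below are made up; the real functions are compiler-generated):
```
/* Hypothetical C rendering of an outlined parallel-region body; with
   optimization enabled it now carries always_inline so it can be folded
   into its caller on the NVPTX target. */
static inline __attribute__((always_inline)) void
omp_outlined(int *global_tid, int *bound_tid, int *counter) {
  (void)global_tid;
  (void)bound_tid;
  ++*counter;
}
```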
llvm-svn: 361269
nvvm_barrier0.
Use runtime functions instead of direct calls to the nvvm intrinsics. This
prevents some dangerous LLVM optimizations that break the code for the
NVPTX target.
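A sketch of the substitution (the runtime declaration mirrors libomptarget-nvptx and is shown here as an assumption):
```
struct ident_t; /* opaque source-location descriptor */
extern void __kmpc_barrier_simple_spmd(struct ident_t *loc, int gtid);

static void team_sync(void) {
  /* previously emitted as a direct call to the llvm.nvvm.barrier0
     intrinsic; now an opaque runtime call that LLVM cannot sink or
     speculate across divergent control flow */
  __kmpc_barrier_simple_spmd(0, 0);
}
```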
llvm-svn: 350328
buffer.
nvlink appears to have a bug in its support of weakly linked symbols: it
does not allow defining several shared memory buffers with different sizes,
even with weak linkage. Instead, we always use a 128-byte buffer to keep
nvlink from emitting an error.
llvm-svn: 349540
Critical regions are constructs that, generally speaking, are not supported
by the NVPTX target. Instead, we use a special technique to handle them.
Currently they are supported only within a loop, and all the threads in the
loop must execute the same critical region.
Inside these special regions, the code still must be emitted as critical to
avoid possible data races between the teams, and synchronization must use
the __kmpc_barrier functions.
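A hedged sketch of the per-thread sequence (declarations assumed, mirroring the usual libomp/libomptarget entry points; the real code is compiler-generated):
```
struct ident_t;
typedef int kmp_critical_name[8];
extern void __kmpc_critical(struct ident_t *, int gtid, kmp_critical_name *);
extern void __kmpc_end_critical(struct ident_t *, int gtid, kmp_critical_name *);
extern void __kmpc_barrier(struct ident_t *, int gtid);

static kmp_critical_name crit_lock;

static void critical_body(int gtid, int *counter) {
  __kmpc_critical(0, gtid, &crit_lock);
  ++*counter; /* user code of the critical region */
  __kmpc_end_critical(0, gtid, &crit_lock);
  __kmpc_barrier(0, gtid); /* team synchronization via __kmpc_barrier */
}
```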
llvm-svn: 348272
The __kmpc_barrier runtime functions must be marked as convergent to prevent
some dangerous optimizations. Also, for the NVPTX target all barriers must
be emitted as simple barriers.
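For reference, the effect on the declaration side (a sketch; the spelling is Clang's generic `convergent` attribute, and the declaration is assumed for illustration):
```
/* Declared convergent so LLVM will not sink, hoist, or duplicate calls
   across divergent control flow. */
struct ident_t;
extern void __kmpc_barrier(struct ident_t *loc, int gtid)
    __attribute__((convergent));
```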
llvm-svn: 348271
reductions.
Fixed previously committed code for reduction support in teams/parallel
constructs, taking into account the new design of the NVPTX support in the
compiler. Teams reductions are not fully functional yet; this will be fixed
in the following patches.
llvm-svn: 347081
target|teams|distribute variables.
If the total size of the variables declared in target|teams|distribute
regions is less than the maximum available shared memory, the buffer is
allocated in shared memory.
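A hypothetical example of such a variable: `partial` is declared in the distribute context and escapes into the inner parallel region, so it cannot stay thread-private; when the total size of such variables fits, their backing storage now goes to shared memory.
```
void row_sums(const int *a, int *b, int n, int m) {
  #pragma omp target teams distribute map(to : a[0 : n * m]) map(from : b[0 : n])
  for (int i = 0; i < n; ++i) {
    int partial = 0; /* escapes into the inner parallel region */
    #pragma omp parallel for reduction(+ : partial)
    for (int j = 0; j < m; ++j)
      partial += a[i * m + j];
    b[i] = partial;
  }
}
```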
llvm-svn: 346507
target/teams/distribute regions.
Target/teams/distribute regions exist for the entire time the kernel is
executed. Thus, if a variable is declared in their context and then escapes
it, we can allocate global memory statically instead of allocating it
dynamically.
The patch captures all the globalized variables in target/teams/distribute
contexts and merges them into records, one per target region. Those records
are then joined into a union, one per compilation unit (to save global
memory). Those unions are organized into two-dimensional arrays, where the
first dimension is the number of blocks per SM and the second is the number
of SMs. Runtime functions manage this global memory space between the
executing teams.
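A hedged sketch of the resulting layout (all names and sizes are illustrative, not the compiler's actual mangling or the device's real occupancy):
```
#define BLOCKS_PER_SM 8 /* assumed max blocks per SM */
#define NUM_SMS 56      /* assumed number of SMs */

/* One record of escaped variables per target region... */
struct target_region_1_rec { int partial; double scratch[32]; };
struct target_region_2_rec { float acc; };

/* ...overlaid in a union per compilation unit to save global memory... */
union per_cu_globals {
  struct target_region_1_rec r1;
  struct target_region_2_rec r2;
};

/* ...replicated per (block-per-SM, SM) slot; the runtime hands each
   executing team its own slot. */
union per_cu_globals omptarget_static_globals[BLOCKS_PER_SM][NUM_SMS];
```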
llvm-svn: 345978
Added memory coalescing support for globalized variables for better
performance. From now on, all globalized variables are represented as arrays
of 32 elements, and each thread accesses these elements using `tid & 31` as
the index.
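A sketch of the coalesced representation (names are illustrative):
```
/* A globalized scalar becomes an array of 32 slots, one per warp lane;
   thread `tid` uses slot `tid & 31`, so consecutive lanes touch
   consecutive addresses and the accesses coalesce. */
struct globalized_rec {
  int counter[32]; /* was: int counter; */
};

static int *counter_addr(struct globalized_rec *rec, int tid) {
  return &rec->counter[tid & 31];
}
```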
llvm-svn: 344049
parallel region.
If the current construct requires sharing a local variable with the inner
parallel region, this variable must be globalized to avoid a runtime crash.
llvm-svn: 335285
Added initial support for L2 parallelism in SPMD mode. Note, though, that
orphaned parallel directives are not currently supported in SPMD mode.
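A hypothetical example of L2 parallelism in SPMD mode: the inner parallel region is lexically nested inside the combined construct (an orphaned inner parallel directive placed in a separately compiled function would not be handled yet).
```
void l2_parallelism(int *a, int n) {
  #pragma omp target teams distribute parallel for map(tofrom : a[0 : n])
  for (int i = 0; i < n; ++i) {
    #pragma omp parallel num_threads(2)
    {
      #pragma omp atomic
      a[i] += 1;
    }
  }
}
```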
llvm-svn: 332016
Some symbols are not allowed to be used as names on some targets. The patch
tries to unify the emission of the names of LLVM globals so they can be
used on different targets.
llvm-svn: 331358
Added NUW flags to all the add|mul|sub operations and replaced sdiv with
udiv, as we operate on unsigned values only (addresses converted to integers).
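A worked C illustration of the arithmetic in question (not the emitted IR): addresses converted to integers are non-negative, so the index math can use no-unsigned-wrap add/mul/sub and udiv rather than sdiv.
```
#include <stdint.h>

/* Compute which slot a pointer falls into within a buffer; all operands
   are unsigned, so the division lowers to udiv. */
static uintptr_t slot_index(uintptr_t base, uintptr_t ptr, uintptr_t slot_size) {
  return (ptr - base) / slot_size;
}
```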
llvm-svn: 329411
Summary: Remove this scheme for now since it will be covered by another more generic scheme using global memory. This code will be worked into an optimization for the generic data sharing scheme. Removing this completely and then adding it via future patches will make all future data sharing patches cleaner.
Reviewers: ABataev, carlo.bertolli, caomhin
Reviewed By: ABataev
Subscribers: jholewinski, guansong, cfe-commits
Differential Revision: https://reviews.llvm.org/D43625
llvm-svn: 326948
In the future the compiler will analyze whether the OpenMP
runtime needs to be (fully) initialized and avoid that overhead
if possible. The functions already take an argument to transfer
that information to the runtime, so pass in the default value 1.
(This is needed for binary compatibility with libomptarget-nvptx
currently being upstreamed.)
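For context, a sketch of the call being discussed (the declaration mirrors the libomptarget-nvptx interface being upstreamed and is listed here as an assumption):
```
/* The second argument tells the runtime whether a full OpenMP runtime is
   required; the compiler conservatively passes 1 for now. */
extern void __kmpc_kernel_init(int thread_limit, short requires_omp_runtime);

static void kernel_prologue(int thread_limit) {
  __kmpc_kernel_init(thread_limit, /*RequiresOMPRuntime=*/1);
}
```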
Differential Revision: https://reviews.llvm.org/D40354
llvm-svn: 318836
Summary:
This patch is part of the development effort to add support in the current OpenMP GPU offloading implementation for implicitly sharing variables between a target region executed by the team master thread and the worker threads within that team.
This patch is the first of three required for successfully performing the implicit sharing of master thread variables with the worker threads within a team. The remaining two patches are:
- Patch D38978 to the LLVM NVPTX backend, which ensures the lowering of shared variables to a device memory region that allows the sharing of references;
- A patch (coming soon) to the libomptarget runtime library, which ensures that a list of references to shared variables is properly maintained.
A simple code snippet which illustrates an implicit data sharing situation is as follows:
```
#pragma omp target
{
  // master thread only
  int v;
  #pragma omp parallel
  {
    // worker threads
    // use v
  }
}
```
Variable v is implicitly shared from the team master thread which executes the code in between the target and parallel directives. The worker threads must operate on the latest version of v, including any updates performed by the master.
The code generated in this patch relies on the LLVM NVPTX patch (mentioned above), which prevents v from being lowered to the thread-local memory of the master thread, which would make the reference to this variable un-shareable with the workers. This ensures that the code generated by this patch is correct.
Since the parallel region is outlined, the passing of arguments to the outlined region must preserve the original order of arguments. The runtime therefore maintains a list of references to shared variables, ensuring they are passed in the correct order. The passing of arguments to the outlined parallel function is performed in a separate function which the data sharing infrastructure constructs in this patch. The function is inlined when optimizations are enabled.
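A rough sketch of that wrapper (all names and the runtime call are illustrative assumptions; the real wrapper is compiler-generated):
```
/* Fetch the master's list of shared-variable addresses from the runtime
   and forward them, in their original order, to the outlined function. */
extern void __kmpc_get_shared_variables(void ***shared_args);
extern void omp_outlined(int *gtid, int *btid, int *v);

static void omp_outlined_wrapper(short unused, int gtid) {
  void **args;
  int btid = 0;
  (void)unused;
  __kmpc_get_shared_variables(&args);
  omp_outlined(&gtid, &btid, (int *)args[0]);
}
```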
Reviewers: hfinkel, carlo.bertolli, arpith-jacob, Hahnfeld, ABataev, caomhin
Reviewed By: ABataev
Subscribers: cfe-commits, jholewinski
Differential Revision: https://reviews.llvm.org/D38976
llvm-svn: 318773
Summary:
This patch introduces support for the execution of parallel constructs in a target
region on the NVPTX device. Parallel regions must be in the lexical scope of the
target directive.
The master thread in the master warp signals parallel work for worker threads in worker
warps on encountering a parallel region.
Note: the patch does not yet support capturing arguments in a parallel region,
so the test cases are simple.
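A minimal example of what this enables (no captured arguments, matching the note above):
```
void hello_parallel(void) {
  #pragma omp target
  {
    /* master thread */
    #pragma omp parallel
    {
      /* worker threads, signalled by the master on entry to the
         parallel region */
    }
  }
}
```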
Reviewers: ABataev
Differential Revision: https://reviews.llvm.org/D28145
llvm-svn: 291565