This reverts commit 2240b41ee4.
A value of 0 for KernDescVal WG_Size implies it is unknown, so it should be
set to the default. The above change was made without this assumption.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105250
A step towards making this function adequately self contained that it
can be tested easily. No functional change intended here, left variable
names unchanged.
Reviewed By: ronlieb
Differential Revision: https://reviews.llvm.org/D105229
Removes stdarg header, drops uses of iostream, fix some format string errors.
Also changes a C style struct to C++ style to avoid a warning from clang/
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D104923
In our ongoing work, we are using `AbstractAttributor` to deduct execution model
of device functions, and potententially remove unnecessary function calls to
`__kmpc_is_spmd_exec_mode`. In current device runtime, we have mixed use of
`isSPMDMode` and `__kmpc_is_spmd_exec_mode`, but in fact in `__kmpc_is_spmd_exec_mode`
it simply calls `isSPMDMode`. Since all functions starting with `__kmpc` is C
function, which doesn't have things like name mangling. It is more optimization
friendly. In this patch, we simply replaced all calls to `isSPMDMode` with
`__kmpc_is_spmd_exec_mode` to pave the way for the optimization.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105211
This patch is related to https://reviews.llvm.org/D98832. Based on discussions there, I decided to separate out the teams default as this patch. This change is to increase the number of teams per computation unit so as to provide more wavefronts for hiding latency. This change improves performance for some programs, including 20-50% for some Stream benchmarks.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D99003
When max flat workgroup size is not specified, it is set to the default
workgroup size. This prevents kernel launch with a workgroup size larger
than the default. The fix is to ignore a size of 0 and treat it as
unspecified.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D105073
This reverts commit eab1fd389b.
This commit fixed a problem with 25073a4ecf (D103121) which is the one
we actually need to revert to unblock non-X86 builds of OpenMP. Can be
reapplied, or merged into, D103121 as it goes in again.
This patch adds documentation for using the CMake find module for OpenMP
target offloading provided by LLVM. It also removes the requirement for
AMD's architecture to be set as this isn't necessary for upstream LLVM.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105051
Add some information about the optimizations currently provided by
OpenMPOpt. Every optimization performed should eventually be listed
here.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D105050
The logic is almost similar to that of system.cpp with one change that
instead of adding all the memory pools to a device struct it only
keeps a single pool. The existing approach also always allocated memory on
the first HSA pool found for a GPU.
This depends on D104691. The goal of this series of patches is to remove
_atl_machine global. The next patch will drop g_atl_machine entirely.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104695
[libomptarget][amdgpu] Build openmp for two more targets
The 4800U APU is a gfx902 and the MI100 accelerator is a gfx908.
Both numbers are listed in ROCT topology.c
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D104922
Normalized bounds of chunk of iterations to steal from are inclusive,
so upper bound should not be decremented in expression to check.
Problem was in attempt to steal iterations 0:0, that caused assertion after
wrong decrement. Reported in comment to https://reviews.llvm.org/D103648.
Differential Revision: https://reviews.llvm.org/D104880
For example, without this patch:
```
$ cat test.c
int main() {
int x;
#pragma omp target enter data map(alloc: x)
#pragma omp target enter data map(alloc: x)
#pragma omp target enter data map(alloc: x)
#pragma omp target exit data map(delete: x)
;
return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists\|last'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=1, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=2 (incremented), Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=3 (incremented), Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=2 (decremented)
Libomptarget --> There are 4 bytes allocated at target address 0x00000000013bb040 - is not last
```
`RefCount` is reported as decremented to 2, but it ought to be reset
because of the `delete` map type, and `is not last` is incorrect.
This patch migrates the reset of reference counts from
`DeviceTy::deallocTgtPtr` to `DeviceTy::getTgtPtrBegin`, which then
correctly reports the reset. Based on the `IsLast` result from
`DeviceTy::getTgtPtrBegin`, `targetDataEnd` then correctly reports `is
last` for any deletion. `DeviceTy::deallocTgtPtr` is responsible only
for the final reference count decrement and mapping removal.
An obscure side effect of this patch is that a `delete` map type when
the reference count is infinite yields `DelEntry=IsLast=false` in
`targetDataEnd` and so no longer results in a
`DeviceTy::deallocTgtPtr` call. Without this patch, that call is a
no-op anyway besides some unnecessary locking and mapping table
lookups.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D104560
For example, without this patch:
```
$ cat test.c
int main() {
int x;
#pragma omp target enter data map(alloc: x)
#pragma omp target exit data map(release: x)
;
return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, updated RefCount=1
```
There are two problems in this example:
* `RefCount` is not reported when a mapping is created, but it might
be 1 or infinite. In this case, because it's created by `omp target
enter data`, it's 1. Seeing that would make later `RefCount`
messages easier to understand.
* `RefCount` is still 1 at the `omp target exit data`, but it's
reported as `updated`. The reason it's still 1 is that, upon
deletions, the reference count is generally not updated in
`DeviceTy::getTgtPtrBegin`, where the report is produced. Instead,
it's zeroed later in `DeviceTy::deallocTgtPtr`, where it's actually
removed from the mapping table.
This patch makes the following changes:
* Report the reference count when creating a mapping.
* Where an existing mapping is reported, always report a reference
count action:
* `update suppressed` when `UpdateRefCount=false`
* `incremented`
* `decremented`
* `deferred final decrement`, which replaces the misleading
`updated` in the above example
* Add comments to `DeviceTy::getTgtPtrBegin` to explain why it does
not zero the reference count. (Please advise if these comments miss
the point.)
* For unified shared memory, don't report confusing messages like
`RefCount=` or `RefCount= updated` given that reference counts are
irrelevant in this case. Instead, just report `for unified shared
memory`.
* Use `INFO` not `DP` consistently for `Mapping exists` messages.
* Fix device table dumps to print `INF` instead of `-1` for an
infinite reference count.
Reviewed By: jhuber6, grokos
Differential Revision: https://reviews.llvm.org/D104559
This introduces a CMake find module for detecting target offloading support in
a compiler. The goal is to make it easier to incorporate target offloading into
a cmake project.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D104710
The OpenMP 5.1 standard defines the environment variable
`OMP_TEAMS_THREAD_LIMIT` to limit the number of threads that will be run in a
single block. This patch adds support for this into the AMDGPU and CUDA
plugins.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D103923
Currently the runtime implementation of `__kmpc_alloc_shared` is extremely slow because it allocated memory for each thread individually. This patch adds a small buffer for the threads to share data and will greatly improve performance for builds where all globalization could not be optimized out. If the shared buffer is full, then memory will not only be allocated per-warp rather than per-thread.
Depends on D97680
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104666
Summary:
This patch introduces the new globalization runtime to be used by D97680. These
runtime calls will replace the __kmpc_data_sharing_push_stack and
__kmpc_data_sharing_pop_stack functions.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102532
Restructured dynamic loop dispatcher code.
Fixed use of dispatch buffers for nonmonotonic dynamic (static_steal) schedule:
- eliminated possibility of stealing iterations of the wrong loop when victim
thread changed its buffer to work on another loop;
- fixed race when victim thread changed its buffer to work in nested parallel;
- eliminated "static" property of the schedule, that is now a single thread can
execute whole loop.
Differential Revision: https://reviews.llvm.org/D103648
There does not seem to be any use of these functions. They just
put the value to a local which is never used again.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D104512
`bug49334.cpp` cannot detect data race in `libomptarget` efficiently. It
is reported that with `N = 256` and `BS = 16`, the data race can be reproduced
more steadily. The next coming pathces will fix it so this patch is expected to
fail now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D104552
Add an FAQ entry and add a few lines to an existing one. Document
the use of `GCC_INSTALL_PREFIX` for pointing clang to correct
GCC installation for two-stage build.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D104474
Two-level distributed barrier is a new experimental barrier designed
for Intel hardware that has better performance in some cases than the
default hyper barrier.
This barrier is designed to handle fine granularity parallelism where
barriers are used frequently with little compute and memory access
between barriers. There is no need to use it for codes with few
barriers and large granularity compute, or memory intensive
applications, as little difference will be seen between this barrier
and the default hyper barrier. This barrier is designed to work
optimally with a fixed number of threads, and has a significant setup
time, so should NOT be used in situations where the number of threads
in a team is varied frequently.
The two-level distributed barrier is off by default -- hyper barrier
is used by default. To use this barrier, you must set all barrier
patterns to use this type, because it will not work with other barrier
patterns. Thus, to turn it on, the following settings are required:
KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
KMP_PLAIN_BARRIER_PATTERN=dist,dist
KMP_REDUCTION_BARRIER_PATTERN=dist,dist
Branching factors (set with KMP_FORKJOIN_BARRIER, KMP_PLAIN_BARRIER,
and KMP_REDUCTION_BARRIER) are ignored by the two-level distributed
barrier.
Differential Revision: https://reviews.llvm.org/D103121
This change-set removes libelf usage from elf_common part of the plugins.
libelf is still used in x86_64 generic plugin code and in some plugins
(e.g. amdgpu) - these will have to be cleaned up in separate checkins.
Differential Revision: https://reviews.llvm.org/D103545
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
Several variables were left unused as a result of different patches removing
their use.
Two variables have some use:
`poll_count` is used by the KMP_BLOCKING macro only under certain conditions.
Adding (void) to tell the compiler to ignore the unused variable.
`padding` is a dummy stack allocation with no intent to be used. Also adding
(void) to make the compiler ignore the unused variable.
Differential Revision: https://reviews.llvm.org/D104303
* Add GOMP versioned pause functions
* Add GOMP versioned affinity format functions
To do the affinity format functions, only attach versioned symbols
to the APPEND Fortran entries (e.g., omp_set_affinity_format_) since
GOMP only exports two symbols (one for Fortran, one for C). Our
affinity format functions have three symbols.
e.g., with omp_set_affinity_format:
1) omp_set_affinity_format (Fortran interface)
2) omp_set_affinity_format_ (Fortran interface)
3) ompc_set_affinity_format (C interface)
Have the GOMP version of the C symbol alias the ompc_* 3) version
instead of the Fortran unappended version 1).
Differential Revision: https://reviews.llvm.org/D103647
Remove strange checks for syscall() arguments where mask is NULL.
Valgrind reports these as error usages for the syscall.
Instead, just check if CACHE_LINE bytes is long enough. If not, then
search for the size. Also, by limiting the first size detection
attempt to CACHE_LINE bytes, instead of 1MB, we don't use more than one
cache line for the mask size. Before this patch, sometimes the returned
mask size was 640 bytes (10 cache lines) because the initial call to
getaffinity() was limited only by the internal kernel mask size
which can be very large.
Differential Revision: https://reviews.llvm.org/D103637
Lazily set affinity for root threads. Previously, the root thread
executing middle initialization would attempt to assign affinity
to other existing root threads. This was not working properly as the
set_system_affinity() function wasn't setting the affinity for the
target thread. Instead, the middle init thread was resetting the
its own affinity using the target thread's affinity mask.
Differential Revision: https://reviews.llvm.org/D103625
This patch includes some changes which deletes the code accessing
g_atl_machine global. Some accesses related to memory_pools are
still remaining.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103813
The current handling of dependencies in Archer has two flaws:
- annotation of dependency synchronization is not limited to sibling tasks
- annotation of in/out dependencies is based on the assumption, that dependency
variables will rarely be byte-sized variables.
This patch introduces a map in the generating task to manage the dependency
variables for the child tasks. The map is only accesses from the generating
task, so no locking is necessary. This also limits the dependency-based
synchronization to sibling tasks.
This patch also introduces proper handling for new dependency types such as
mutexinoutset and inoutset.
Differential Revision: https://reviews.llvm.org/D103608
The main motivation for reusing objects is that it helps to avoid creating and
leaking synchronization clocks in TSan. The reused object will reuse the
synchronization clock in TSan.
Before, new and delete operators were overloaded to get and return memory for
the object from/to the object pool.
This patch replaces the operator overloading with explicit static New/Delete
functions.
Objects for parallel regions and implicit tasks will always be recruited and
returned to the thread-local object pool. Only for explicit task, there is a
chance that an other thread completes the task and will free the object. This
patch optimizes the thread-local New/Delete calls by avoiding locks and only
lock if the pool is empty. Remote threads return the object into a separate
queue.
The chunk size for allocations is now decided based on page size. The objects
will also be aligned to cache lines avoiding false sharing.
This is the first patch in a series to provide better tasking support.
Differential Revision: https://reviews.llvm.org/D103606
Archer uses weak symbol overloads of TSan functions to enable loading the tool
even if the application is not built with TSan. For MACOS the tool collects
the function pointer at runtime.
When adding the function entry/exit markers, we missed to add the functions
in the MACOS codepath.
This patch also replaces the repeated function lookup by a single initial
function lookup and fixes the disabling logic in RunningOnValgrind.
Differential Revision: https://reviews.llvm.org/D103607
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
This is the first of seven patches that implements OMPD, a debugging interface to support debugging of OpenMP programs.
It contains support code required in "openmp/runtime" for OMPD implementation.
Reviewed By: @hbae
Differential Revision: https://reviews.llvm.org/D100181