Commit Graph

1958 Commits

Author SHA1 Message Date
Shilei Tian f1b8fa55d0 [OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs
We build `deviceRTLs` with `-O1` by default, which also triggers OpenMPOpt. When
the info cache is created, some attributes are removed. As a result, although we
mark a few functions `noinline`, they are still inlined when the bitcode library
is generated. This can cause an issue in middle end optimization.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106710
2021-07-25 10:38:27 -04:00
Ye Luo 4079037a3e [OpenMP] always compile with c++14 instead of gnu++14
Fixes PR 51174. c++14 should be a more portable option than gnu++14.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D106632
2021-07-23 17:29:08 -04:00
Shilei Tian c2c43132f6 [OpenMP] Fix bug 50022
Bug 50022 [0] reports target nowait fails in certain case, which is added in this
patch. The root cause of the failure is, when the second task is created, its
parent's `td_incomplete_child_tasks` will not be incremented because there is no
parallel region here thus its team is serialized. Therefore, when the initial
thread is waiting for its unfinished children tasks, it thought there is only
one, the first task, because it is hidden helper task, so it is tracked. The
second task will only be pushed to the queue when the first task is finished.
However, when the first task finishes, it first decrements the counter of its
parent, and then release dependences. Once the counter is decremented, the thread
will move on because its counter is reset, but actually, the second task has not
been executed at all. As a result, since in this case, the main function finishes,
then `libomp` starts to destroy. When the second task is pushed somewhere, all
some of the structures might already have already been destroyed, then anything
could happen.

This patch simply moves `__kmp_release_deps` ahead of decrement of the counter.
In this way, we can make sure that the initial thread is aware of the existence
of another task(s) so it will not move on. In addition, in order to tackle
dependence chain starting with hidden helper thread, when hidden helper task is
encountered, we force the task to release dependences.

Reference:
[0] https://bugs.llvm.org/show_bug.cgi?id=50022

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D106519
2021-07-23 16:54:11 -04:00
Joseph Huber e1dedecaa6 [Libomptarget] Add unroll flag to shared variables loop
Unrolling this loop provides better performance in practice because it is
executed on the device and is likely to be very small.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D106692
2021-07-23 16:45:27 -04:00
Shilei Tian 18ce3d3f2c [OpenMP][Offloading] Fix data race in data mapping by using two locks
This patch tries to partially fix one of the two data race issues reported in
[1] by introducing a per-entry mutex. Additional discussion can also be found in
D104418, which will also be refined to fix another data race problem.

Here is how it works. Like before, `DataMapMtx` is still being used for mapping
table lookup and update. In any case, we will get a table entry. If we need to
make a data transfer (update the data on the device), we need to lock the entry
right before releasing `DataMapMtx`, and the issue of data transfer should be
after releasing `DataMapMtx`, and the entry is unlocked afterwards. This can
guarantee that: 1) issue of data movement is not in critical region, which will
not affect performance too much, and also will not affect other threads that don't
touch the same entry; 2) if another thread accesses the same entry, the state of
data movement is consistent (which requires that a thread must first get the
update lock before getting data movement information).

For a target that doesn't support async data transfer, issue of data movement is
data transfer. This two-lock design can potentially improve concurrency compared
with the design that guards data movement with `DataMapMtx` as well. For a target
that supports async data movement, we could simply attach the event between the
issue of data movement and unlock the entry. For a thread that wants to get the
event, it must first get the lock. This can also get rid of the busy wait until
the event pointer is valid.

Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D104555
2021-07-23 16:10:51 -04:00
Abhinav Gaba f7c92995c0 [OpenMP] Fix CUDA plugin build after 3817ba13ae.
The build was broken on machines that don't have Cuda SDK installed.

See https://reviews.llvm.org/D106627 for the original discussion.
2021-07-23 16:50:00 +08:00
Johannes Doerfert d12ee28e2e [OpenMP] Simplify the ThreadStackTy for globalization fallback
With D106496 we can make the globalization fallback stack much simpler
and this version doesn't seem to experience the spurious failures and
deadlocks we have seen before.

Differential Revision: https://reviews.llvm.org/D106576
2021-07-22 23:57:46 -05:00
Joseph Huber 76c0c0ca86 [OpenMP][NFC] Fix formatting in CUDA plugin 2021-07-22 21:50:40 -04:00
Joseph Huber 3817ba13ae [OpenMP] Add environment variables to change stack / heap size in the CUDA plugin
This patch adds support for two environment variables to configure the device.
``LIBOMPTARGET_STACK_SIZE`` sets the amount of memory in bytes that each thread
has for its stack. ``LIBOMPTARGET_HEAP_SIZE`` sets the amount of heap memory
that can be allocated using malloc / free on the device.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106627
2021-07-22 21:40:02 -04:00
Shilei Tian ea452353c0 [OpenMP] Refined the logic to give a regular task from a hidden helper task
In current implementation, if a regular task depends on a hidden helper task,
and when the hidden helper task is releasing its dependences, it directly calls
`__kmp_omp_task`. This could cause a problem that if `__kmp_push_task` returns
`TASK_NOT_PUSHED`, the task will be executed immediately. However, the hidden
helper threads are assumed to only execute hidden helper tasks. This could cause
problems because when calling `__kmp_omp_task`, the encountering gtid, which is
not the real one of the thread, is passed.

This patch uses `__kmp_give_task`, but because it is a static function, a new
wrapper `__kmpc_give_task` is added.

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D106572
2021-07-22 19:21:29 -04:00
Jose M Monsalve Diaz 68d6278a6e [OpenMP] Renaming RT functions `GetNumberOfBlocksInKernel` and `GetNumberOfThreadsInBlock`
These functions should follow the camel case convention. These are really easy to change
and are needed for D106033.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106390
2021-07-22 18:17:49 -04:00
Jon Chesterfield 9e05c084e5 [libomptarget][amdgpu][nfc] Normalise license headers
Reviewed By: gregrodgers, jdoerfert

Differential Revision: https://reviews.llvm.org/D106581
2021-07-22 20:23:41 +01:00
Jon Chesterfield 14e34a83b0 [libomptarget][amdgpu][nfc] Replace use of gelf.h with libelf.h
AMDGPU can assume Elf64 so doesn't need to abstract over Elf32

Drop a few other unused headers at the same time. Now only llvm elf
and libelf are used by the plugin.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106579
2021-07-22 20:04:13 +01:00
Jon Chesterfield 1a96570621 [libomptarget][amdgpu] Implement dlopen of libhsa
AMDGPU plugin equivalent of D95155, build without HSA installed locally

Compiles a new file, plugins/amdgpu/dynamic_hsa/hsa.cpp, to an object file that
exposes the same symbols that the plugin presently uses from hsa. The object
file contains dlopen of hsa and cached dlsym calls. Also provides header files
corresponding to the subset that is used.

This is behind a feature flag, LIBOMPTARGET_FORCE_DLOPEN_LIBHSA, default off.
That allows developers to build against the dlopen/dlsym implementation, e.g.
while testing this mode.

Enabling by default will cause this plugin to build on a wider variety of
machines than it does at present so may break some CI builds. That risk can
be minimised by reviewing the header dependencies of the library and ensuring
it doesn't use any libraries that are not already used by libomptarget.

Separating the implementation from enabling by default in case the latter needs
to be rolled back after wider CI results.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106559
2021-07-22 16:54:10 +01:00
Jon Chesterfield 6e9cd3e9f1 [libomptarget][nfc] Improve static assert message in dlwrap
Revision of D102858. Raise dlwrap arity argument to template argument
so the correct value is given in the error message. E.g. '2 == 1' instead of
'2 == trait<>::nargs'.

Arity higher than it should be:
Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: error:
      static_assert failed due to requirement '2 == trait<cudaError_enum (*)(unsigned int)>::nargs'
      "Arity Error"
DLWRAP_INTERNAL(cuInit, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~
...
$/include/dlwrap.h:166:3: note: expanded from macro
      'DLWRAP_COMMON'
  static_assert(ARITY == trait<decltype(&SYMBOL)>::nargs, "Arity Error");      \
```

After diff
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
```
$/include/dlwrap.h:131:3: error: static_assert failed due to
      requirement '2UL == 1UL' "Arity Error"
  static_assert(Requested == Required, "Arity Error");
  ^             ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
      instantiation of function template specialization 'dlwrap::verboseAssert<2UL, 1UL>' requested
      here
DLWRAP_INTERNAL(cuInit, 2);
```

Arity lower than it should be:
Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:131:10: error: no
      matching function for call to 'dlwrap_cuInit'
  return dlwrap_cuInit(X);
         ^~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: candidate
      function not viable: requires 0 arguments, but 1 was provided
DLWRAP_INTERNAL(cuInit, 0);
```

After diff
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
```
$/include/dlwrap.h:131:3: error: static_assert failed due to
      requirement '0UL == 1UL' "Arity Error"
  static_assert(Requested == Required, "Arity Error");
  ^             ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
      instantiation of function template specialization 'dlwrap::verboseAssert<0UL, 1UL>' requested
      here
DLWRAP_INTERNAL(cuInit, 0);
```

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106543
2021-07-22 15:24:20 +01:00
Joseph Huber a158d3663f [OpenMP] Fix warnings for uninitialized block counts
Summary:
Fixes some warning given for uninitialized block counts if the exection mode is
not recognized. This shouldn't happen in practice because the execution mode is
checked when it's read from the device.
2021-07-22 09:24:07 -04:00
Jon Chesterfield dc1f6f8b92 [libomptarget][amdgpu][nfc] Drop dead signal pool setup
This class is instantiated once in rtl.cpp before hsa_init is
called. The hsa_signal_create call therefore fails leaving the pool empty.

This signal pool is a legacy from ATMI where it was constructed after hsa_init.
Moving the state into the rtl.cpp global class disabled the initial populating
of the pool without noticeably changing performance. Just rechecked with a fix
that allocates the signals after hsa_init and that also doesn't noticeably
change performance.

This patch therefore drops the initialisation. Only change from main is to
drop a DEBUG_PRINT statement that would say the pool initial size is zero.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106515
2021-07-22 10:29:32 +01:00
Joseph Huber 4a66860424 [OpenMP] Add an option to disable function internalization
Function internalization can sometimes occur in situations where we want to
keep the call sites intact. This patch adds an option to disable function
internalization and prevents the device runtime from being internalized while
creating the bitcode library.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106438
2021-07-21 21:18:18 -04:00
Joseph Huber 1684012a47 [Libomptarget] Introduce new main thread ID runtime function
This patch introduces `__kmpc_is_generic_main_thread_id` which splits the old
comparison into its own runtime function. The purpose of this is so we can fold
this part independently, so when both this and `is_spmd_mode` are folded the
final function will be folded as well.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106437
2021-07-21 21:18:14 -04:00
Joseph Huber 7d57639264 [OpenMP] Add new execution mode for SPMD execution with Generic semantics
Qualified kernels can be transformed from generic-mode to SPMD mode using an
optimization in OpenMPOpt. This patch introduces a new execution mode to
indicate kernels that have been transformed from generic-mode to SPMD-mode.
These kernels have SPMD-mode execution, but need generic-mode semantics for
scheduling the blocks and threads. Without this far too few blocks will be
scheduled for a generic region as SPMD mode expects the trip count to be
divided by the number of threads.

Reviewed By: ggeorgakoudis

Differential Revision: https://reviews.llvm.org/D106460
2021-07-21 20:57:28 -04:00
Joseph Huber 754eb1c210 [OpenMP] Change `__kmpc_free_shared` to include the paired allocation size
This patch changes `__kmpc_free_shared` to take an additional argument
corresponding to the associated allocation's size. This makes it easier to
implement the allocator in the runtime.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106496
2021-07-21 20:56:21 -04:00
Giorgis Georgakoudis 5a682d9b91 [OpenMP] Expose libomptarget function to get HW thread id
The patch exposes the libomptarget runtime function that gets the hardware thread id through the kmpc API. This is to be used in SPMDization for checking the thread id to execute regions by a single thread in a block.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106323
2021-07-21 10:26:04 -07:00
Jon Chesterfield a733bbbd17 [libomptarget][amdgpu][nfc] Refactor #includes
Create a hsa_api.h header that includes the ROCr headers in use
Drop some unused headers and _cplusplus macros

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106455
2021-07-21 17:28:07 +01:00
Shilei Tian 55c65884a4 [OpenMP][deviceRTLs] Update return type of function __kmpc_parallel_level
In `deviceRTLs`, the parallel level is stored in a shared variable of type `uint8_t`.
`__kmpc_parallel_level` currently returns a 16-bit interger. This patch first
changes the return type of the function to `uint8_t`, same as the shared variable,
and then corrects function type which was updated in D105955.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106384
2021-07-20 15:45:43 -04:00
Shilei Tian 02dff78983 [NFC][OpenMP] Fix an issue that no CHECK in test cases
This fixes the complaint from FileCheck.

Reviewed By: abhinavgaba, jdoerfert

Differential Revision: https://reviews.llvm.org/D106387
2021-07-20 15:39:18 -04:00
Joseph Huber b917a1d713 [OpenMP] Change AMDGCN to AMDGPU in the Cmake Module
Summary:
Change the name for targeting AMD offloading.
2021-07-20 12:52:53 -04:00
Joseph Huber 6242f9b966 [OpenMP][Documentation] Fix hyperlink location
Fixes the documentation hyperlinks not showing the header.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106374
2021-07-20 12:42:32 -04:00
Tony Tye 038602139d [NFC] Correct documentation error in OpenMP release ReleaseNotes
Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D106330
2021-07-20 02:04:43 +00:00
Shilei Tian 996baa58a4 [OpenMP] Fixed a segmentation fault when using taskloop and target nowait
The synchronization of task loop misses hidden helper tasks, causing segmentation
fault reported in https://bugs.llvm.org/show_bug.cgi?id=50002.

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D106220
2021-07-19 21:09:05 -04:00
Joseph Huber 762badb0ab [Libomptarget] Remove volatile from NVPTX work function
Currently the NPVTX work function is marked volatile. This prevents some
optimizations from using this value.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106310
2021-07-19 20:03:25 -04:00
Giorgis Georgakoudis fb0cf01795 Revert "[OpenMP] Codegen aggregate for outlined function captures"
This reverts commit e9c7291cb2.

Fix failing tests
2021-07-19 07:54:26 -07:00
Shilei Tian 4504e1134c [OpenMP][CMake] Fix an issue when there is space in the argument LIBOMPTARGET_LIT_ARGS
D106236 added a new CMake argument for `libomptarget` test, but when user's
input contains white spaces, CMake will add escape char to the final lit command,
which leads to an error. This patch converts the user's input `LIBOMPTARGET_LIT_ARGS`
into a local array, and then passes the array to the function.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106247
2021-07-18 21:54:14 -04:00
Shilei Tian 954711ed8f [OpenMP][Offloading] Add a CMake argument LIBOMPTARGET_LIT_ARGS to control behavior of libomptarget lit test
By default, `lit` uses all threads to invoke tests, which  can easily cause out
of memory on GPUs because most of OpenMP offloading test usually take about 1GB
GPU memory, but a typical GPU only has 4-8GB memory. This patch introduce a
CMake argument `LIBOMPTARGET_LIT_ARGS` to allow users to control the behavior of
`libomptarget` tests, similar to `LLVM_LIT_ARGS`.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106236
2021-07-18 13:16:10 -04:00
Shilei Tian 4357cfc792 [OpenMP][Offloading] Add -g when compiling deviceRTLs in debug mode
Currently when we compile the project in debug mode, `-g` will not be added to
compilation flag. The bc files generated in different mode are of different size.
When using GPU debuggers like `cuda-gdb`, it is expected to provide more info
with a debug version of bc lib.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D106229
2021-07-18 09:34:54 -04:00
Giorgis Georgakoudis e9c7291cb2 [OpenMP] Codegen aggregate for outlined function captures
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3)  forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102107
2021-07-16 23:27:44 -07:00
Joseph Huber 1616407921 [OpenMP] Add remark documentation to the OpenMP webpage
This patch begins adding documentation for each remark emitted by
`openmp-opt`. This builds on the IDs introduced in D105939 so that users
can more easily identify each remark in the webpage.

Depends on D105939.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106018
2021-07-16 14:09:43 -04:00
Shilei Tian 97c8f60bba [NFC][OpenMP][Offloading] Replaced explicit parallel level computation with function `__kmpc_parallel_level`
There are two places in current deviceRTLs where it computes parallel level explicitly,
which is basically the functionality of `__kmpc_parallel_level`. Starting from
D105787, we plan to introduce a series of function call folding based on information
that can be deducted during compilation time. Computation of parallel level is
the next target. This patch makes steps for the optimization.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D105955
2021-07-15 22:21:06 -04:00
George Rokos 0c7a4870c5 [libomptarget] Keep the Shadow Pointer Map up-to-date
D105812 introduced a regression where if a PTR_AND_OBJ entry was mapped on the device, then the OBJ was deallocated and then reallocated at a different address, the Shadow Pointer Map would still contain an entry for the PTR but pointing to the old address. This caused test `env/base_ptr_ref_count.c` to fail.

Differential Revision: https://reviews.llvm.org/D105947
2021-07-14 15:19:58 -07:00
Peyton, Jonathan L 424f14f0d2 [OpenMP] Fix one sign-compare warning from GCC 2021-07-13 12:36:12 -05:00
Peyton, Jonathan L 405eefe464 [OpenMP][NFC] Change comment style to eliminate warnings from GCC
Standalone build for OpenMP runtime using GCC is giving -Wcomment
warnings where a backslash newline is encountered in the // style
comment. This switches the // style for /* style to silence the
warnings.
2021-07-13 12:27:08 -05:00
Hansang Bae db635a28e6 [OpenMP] Minor improvement in task allocation
This patch includes a few changes to improve task allocation
performance slightly. These changes are enough to restore performance
drop observed after introducing hidden helper.

Differential Revision: https://reviews.llvm.org/D105715
2021-07-13 09:07:14 -05:00
Roman Lebedev 4709d9d5be
[libomp] ompd_init(): fix heap-buffer-overflow when constructing libompd.so path
There is no guarantee that the space allocated in `libname`
is enough to accomodate the whole `dl_info.dli_fname`,
because it could e.g. have an suffix  - `.5`,
and that highlights another problem - what it should do about suffxies,
and should it do anything to resolve the symlinks before changing the filename?

```
$ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib"  ./src/utilities/rstest/rstest -c /tmp/f49137920.NEF
dl_info.dli_fname "/usr/local/lib/libomp.so.5"
strlen(dl_info.dli_fname) 26
lib_path_length 14
lib_path_length + 12 26
=================================================================
==30949==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60300000002a at pc 0x000000548648 bp 0x7ffdfa0aa780 sp 0x7ffdfa0a9f40
WRITE of size 27 at 0x60300000002a thread T0
    #0 0x548647 in strcpy (/home/lebedevri/rawspeed/build-Clang-SANITIZE/src/utilities/rstest/rstest+0x548647)
    #1 0x7fb9e3e3d234 in ompd_init() /repositories/llvm-project/openmp/runtime/src/ompd-specific.cpp:102:5
    #2 0x7fb9e3dcb446 in __kmp_do_serial_initialize() /repositories/llvm-project/openmp/runtime/src/kmp_runtime.cpp:6742:3
    #3 0x7fb9e3dcb40b in __kmp_get_global_thread_id_reg /repositories/llvm-project/openmp/runtime/src/kmp_runtime.cpp:251:7
    #4 0x59e035 in main /home/lebedevri/rawspeed/build-Clang-SANITIZE/../src/utilities/rstest/rstest.cpp:491
    #5 0x7fb9e3762d09 in __libc_start_main csu/../csu/libc-start.c:308:16
    #6 0x4df449 in _start (/home/lebedevri/rawspeed/build-Clang-SANITIZE/src/utilities/rstest/rstest+0x4df449)

0x60300000002a is located 0 bytes to the right of 26-byte region [0x603000000010,0x60300000002a)
allocated by thread T0 here:
    #0 0x55cc5d in malloc (/home/lebedevri/rawspeed/build-Clang-SANITIZE/src/utilities/rstest/rstest+0x55cc5d)
    #1 0x7fb9e3e3d224 in ompd_init() /repositories/llvm-project/openmp/runtime/src/ompd-specific.cpp:101:17
    #2 0x7fb9e3762d09 in __libc_start_main csu/../csu/libc-start.c:308:16

SUMMARY: AddressSanitizer: heap-buffer-overflow (/home/lebedevri/rawspeed/build-Clang-SANITIZE/src/utilities/rstest/rstest+0x548647) in strcpy
Shadow bytes around the buggy address:
  0x0c067fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c067fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c067fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c067fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c067fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c067fff8000: fa fa 00 00 00[02]fa fa fa fa fa fa fa fa fa fa
  0x0c067fff8010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==30949==ABORTING
Aborted
```
2021-07-13 15:36:46 +03:00
George Rokos bb0166dc72 [libomptarget] Update device pointer only if needed
Currently, libomptarget will always perform a host-to-device memory transfer in
order to update the device pointer of a PTR_AND_OBJ entry. This is not always
necessary because the device pointer may have been set to the correct pointee
address already, so we can eliminate the redundant memory transfer.
2021-07-13 04:18:55 -07:00
Jon Chesterfield b6b53ffef4 [libomptarget][devicertl] Remove branches around setting parallelLevel
Simplifies control flow to allow store/load forwarding

This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards spmd kernels with sufficiently aggressive inlining folding
the loads from parallelLevel and thus discarding the nested parallel handling
when it is unused.

Transform:
```
int threadId = GetThreadIdInBlock();
if (threadId == 0) {
  parallelLevel[0] = expr;
} else if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// because
unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1);}
// so whenever threadId == 0, GetLaneId() is also 0.
```

That replaces a store in two distinct basic blocks with as single store.

A more aggressive follow up is possible if the threads in the warp/wave
race to write the same value to the same address. This is not done as
part of this change.

```
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
parallelLevel[GetWarpId()] = expr;
// because
unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
// so GetWarpId will index the same element for every thread in the warp
// and, because expr is lane-invariant in this case, every lane stores the
// same value to this unique address
```

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D105699
2021-07-13 12:06:57 +01:00
Joachim Protze 681055ea69 [OpenMP] Remove TSAN annotations from libomp
The annotations in libomp were never built by default. The annotations are
also superseded by the annotations which the OMPT tool libarcher.so provides.
With respect to libarcher, libomp behaves as if libarcher would be the last
element of OMP_TOOL_LIBARARIES. I.e., if no other OMPT tool gets active,
libarcher will check if an OpenMP application is built with TSan.

Since libarcher gets loaded by default, enabling LIBOMP_TSAN_SUPPORT would
result in redundant annotations for TSan, which slightly differ in details
and coverage (e.g. task dependencies are not handled well by the annotations
in libomp).

This patch removes all TSan annotations from the OpenMP runtime code.

Differential Revision: https://reviews.llvm.org/D103767
2021-07-12 18:49:11 +02:00
Joachim Protze fedbff75f4 [OpenMP][OMPT] Fix compile-time assertion in ompt-multiplex.h
The compile-time assertion is supposed to prevent double-free caused by
unexpected combination of preprocessor defines passed by an OMPT tool.
The current defines are not used, so this patch replaces the check with
macros actually used in ompt-multiplex.h

Reported by: Semih Burak

Differential Revision: https://reviews.llvm.org/D104633
2021-07-12 12:12:09 +02:00
Johannes Doerfert a7b7b5dfe5 [OpenMP] Create and use `__kmpc_is_generic_main_thread`
In order to fold calls based on high-level knowledge and control flow
tracking it helps to expose the information as a runtime call. The
logic: `!SPMD && getTID() == getMasterTID()` was used in various places
and is now encapsulated in `__kmpc_is_generic_main_thread`. As part of
this rewrite we replaced eager computation of arguments with on-demand
computation, especially helpful if the calls can be folded and arguments
don't need to be computed consequently.

Differential Revision: https://reviews.llvm.org/D105768
2021-07-11 19:18:03 -05:00
Johannes Doerfert 1ab1f04a2b [OpenMP] Simplify variable sharing and increase shared memory size
In order to avoid malloc/free, up to NUM_SHARED_VARIABLES_IN_SHARED_MEM
(=64) variables are communicated in dedicated shared memory instead. The
simplification does avoid the need for an "init" and requires "deinit"
only if we ever communicate more than NUM_SHARED_VARIABLES_IN_SHARED_MEM
variables.

Differential Revision: https://reviews.llvm.org/D105767
2021-07-11 19:18:03 -05:00
Johannes Doerfert 0a223827de [OpenMP] Remove checkXXXX device runtime functions
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.

Differential Revision: https://reviews.llvm.org/D105586
2021-07-10 18:20:40 -05:00
Johannes Doerfert e2cfbfcc0c [OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.

The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Reviewed By: ABataev, JonChesterfield

Differential Revision: https://reviews.llvm.org/D101976
2021-07-10 17:53:56 -05:00
Nico Weber d3e7491333 Revert Attributor patch series
Broke check-clang, see https://reviews.llvm.org/D102307#2869065
Ran `git revert -n ebbe149a6f08535ede848a531a601ae6591cfbc5..269416d41908bb670f67af689155d5ab8eea689a`
2021-07-10 16:15:55 -04:00
Johannes Doerfert e603ca0306 [OpenMP] Remove checkXXXX device runtime functions
We had multiple functions to determine the execution mode (SPMD/Generic)
and runtime status (initialized/uninitialized) but that just increased
complexity without a real benefit. Especially with D102307 in mind it
is helpful to reduce the dependence on the `ident_t` flags.

Differential Revision: https://reviews.llvm.org/D105586
2021-07-10 12:32:51 -05:00
Johannes Doerfert 1d5711c3ee [OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used for reuse in Flang. A custom state machine will be generated in the
follow up patch.

The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Reviewed By: ABataev, JonChesterfield

Differential Revision: https://reviews.llvm.org/D101976
2021-07-10 12:32:50 -05:00
Joel E. Denny d99f65de2a [OpenMP] Avoid checking parent reference count in targetDataBegin
This patch is an attempt to do for `targetDataBegin` what D104924 does
for `targetDataEnd`:

* Eliminates a lock/unlock of the data mapping table.
* Clarifies the logic that determines whether a struct member's
  host-to-device transfer occurs.  The old logic, which checks the
  parent struct's reference count, is a leftover from back when we had
  a different map interface (as pointed out at
  <https://reviews.llvm.org/D104924#2846972>).

Additionally, it eliminates the `DeviceTy::getMapEntryRefCnt`, which
is no longer used after this patch.

While D104924 does not change the computation of `IsLast`, I found I
needed to change the computation of `IsNew` for this patch.  As far as
I can tell, the change is correct, and this patch does not cause any
additional `openmp` tests to fail.  However, I'm not sure I've thought
of all use cases.  Please advise.

Reviewed By: jdoerfert, jhuber6, protze.joachim, tianshilei1992, grokos, RaviNarayanaswamy

Differential Revision: https://reviews.llvm.org/D105121
2021-07-10 12:15:04 -04:00
Joel E. Denny 1d0456361a [OpenMP] Avoid checking parent reference count in targetDataEnd
The patch has the following benefits:

* Eliminates a lock/unlock of the data mapping table.
* Clarifies the logic that determines whether a struct member's
  device-to-host transfer occurs.  The old logic, which checks the
  parent struct's reference count, is a leftover from back when we had
  a different map interface (as pointed out at
  <https://reviews.llvm.org/D104924#2846972>).

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D104924
2021-07-10 12:15:04 -04:00
Alexey Bataev ab8989ab87 [OPENMP]Fix overlapped mapping for dereferenced pointer members.
If the base is used in a map clause and later we have a memberexpr with
this base, and the member is a pointer, and this pointer is dereferenced
anyhow (subscript, array section, dereference, etc.), such components
should be considered as overlapped, otherwise it may lead to incorrect
size computations, since we try to map a pointee as a part of the whole
struct, which is not true for the pointer members.

Differential Revision: https://reviews.llvm.org/D105562
2021-07-09 12:51:26 -07:00
Michał Górny 2b0d95fb58 [openmp] [test] Add missing <limits> include to capacity_nthreads
Differential Revision: https://reviews.llvm.org/D105474
2021-07-06 20:39:53 +02:00
Jon Chesterfield ddfb074a80 [libomptarget][nfc] Group environment variables, drop accesses to DeviceInfo global
[libomptarget][nfc] Group environment variables, drop accesses to DeviceInfo global

Folds some duplicates logic into a helper function, passes the new environment
struct into getLaunchVals which no longer reads the DeviceInfo global.

Implemented on top of D105237

Reviewed By: dhruvachak

Differential Revision: https://reviews.llvm.org/D105239
2021-07-06 17:06:38 +01:00
Atmn Patel 21e92612c0 [Libomptarget] Experimental Remote Plugin Fixes
D97883 introduced a compile-time error in the experimental remote offloading
libomptarget plugin, this patch fixes it and resolves a number of
inconsistencies in the plugin as well:

1. Non-functional Asynchronous API
2. Unnecessarily verbose debug printing
3. Misc. code clean ups

This is not intended to make any functional changes to the plugin.

Differential Revision: https://reviews.llvm.org/D105325
2021-07-02 12:38:34 -04:00
Hansang Bae f1b9ce2736 [OpenMP] Fix a few issues with hidden helper task
This patch includes the following changes to address a few issues when
using hidden helper task.

- Assertion is triggered when there are inadvertent calls to hidden
  helper functions on non-Linux OS
- Added deinit code in __kmp_internal_end_library function to fix random
  shutdown crashes
- Moved task data access into the lock-guarded region in __kmp_push_task

Differential Revision: https://reviews.llvm.org/D105308
2021-07-01 17:10:32 -05:00
Shilei Tian 369216ab31 [OpenMP][Offloading] Refined return value of `DeviceTy::getOrAllocTgtPtr`
`DeviceTy::getOrAllocTgtPtr` just returns a target pointer. In addition,
two bool values (`IsNew` and `IsHostPtr`) are passed by reference to make the
change in the function available in callee.

In this patch, a struct, which contains the target pointer, two flags, and an
iterator to the map table entry corresponding to the queried host pointer, will
be returned. In addition to make the logic clearer regarding the two bool values,
this paves the way for the next patch to fix the data race in `bug49334.cpp` by
attaching an event to the map table entry (and that's why we need the iterator).

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D104382
2021-07-01 12:32:03 -04:00
Jon Chesterfield db89414da4 [libomptarget][nfc] Move grid size computation
Change getLaunchVals to return the integers used for launch

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D105237
2021-07-01 12:53:04 +01:00
Dhruva Chakrabarti 98c36f0079 Revert "[libomptarget] [amdgpu] Fix default setting of max flat workgroup size"
This reverts commit 2240b41ee4.
A value of 0 for KernDescVal WG_Size implies it is unknown, so it should be
set to the default. The above change was made without this assumption.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D105250
2021-06-30 17:15:00 -07:00
Jon Chesterfield 4b0926b044 [libomptarget][nfc] Replace out arguments with struct return
A step towards making this function adequately self contained that it
can be tested easily. No functional change intended here, left variable
names unchanged.

Reviewed By: ronlieb

Differential Revision: https://reviews.llvm.org/D105229
2021-06-30 22:40:07 +01:00
Jon Chesterfield d86b0073cf [libomptarget][amdgpu][nfc] Fix build warnings, drop some headers
Removes stdarg header, drops uses of iostream, fix some format string errors.
Also changes a C style struct to C++ style to avoid a warning from clang/

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D104923
2021-06-30 22:23:36 +01:00
Shilei Tian 24a36ce58b [OpenMP][Offloading] Replace all calls to `isSPMDMode` with `__kmpc_is_spmd_exec_mode`
In our ongoing work, we are using `AbstractAttributor` to deduct execution model
of device functions, and potententially remove unnecessary function calls to
`__kmpc_is_spmd_exec_mode`. In current device runtime, we have mixed use of
`isSPMDMode` and `__kmpc_is_spmd_exec_mode`, but in fact in `__kmpc_is_spmd_exec_mode`
it simply calls `isSPMDMode`. Since all functions starting with `__kmpc` is C
function, which doesn't have things like name mangling. It is more optimization
friendly. In this patch, we simply replaced all calls to `isSPMDMode` with
`__kmpc_is_spmd_exec_mode` to pave the way for the optimization.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D105211
2021-06-30 15:39:57 -04:00
Dhruva Chakrabarti e0b713a035 [libomptarget] [amdgpu] Change default number of teams per computation unit
This patch is related to https://reviews.llvm.org/D98832. Based on discussions there, I decided to separate out the teams default as this patch. This change is to increase the number of teams per computation unit so as to provide more wavefronts for hiding latency. This change improves performance for some programs, including 20-50% for some Stream benchmarks.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D99003
2021-06-29 15:34:35 -07:00
Dhruva Chakrabarti 2240b41ee4 [libomptarget] [amdgpu] Fix default setting of max flat workgroup size
When max flat workgroup size is not specified, it is set to the default
workgroup size. This prevents kernel launch with a workgroup size larger
than the default. The fix is to ignore a size of 0 and treat it as
unspecified.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D105073
2021-06-29 13:47:24 -07:00
Johannes Doerfert 4eb90e893f Revert "[OpenMP] Add Two-level Distributed Barrier"
This reverts commit 25073a4ecf.

This breaks non-x86 OpenMP builds for a while now. Until a solution is
ready to be upstreamed we revert the feature and unblock those builds.
See:
  https://reviews.llvm.org/rG25073a4ecfc9b2e3cb76776185e63bfdb094cd98#1005821
and
  https://reviews.llvm.org/rG25073a4ecfc9b2e3cb76776185e63bfdb094cd98#1005821

The currently proposed fix (D104788) seems not to be ready yet:
  https://reviews.llvm.org/D104788#2841928
2021-06-29 09:38:27 -05:00
Johannes Doerfert bc8bb3df35 Revert "[omp] Fix build without ITT after D103121 changes"
This reverts commit eab1fd389b.

This commit fixed a problem with 25073a4ecf (D103121) which is the one
we actually need to revert to unblock non-X86 builds of OpenMP. Can be
reapplied, or merged into, D103121 as it goes in again.
2021-06-29 09:38:27 -05:00
Joseph Huber 2190c48fde [OpenMP][Documentation] Add FAQ entry for CMake module
This patch adds documentation for using the CMake find module for OpenMP
target offloading provided by LLVM. It also removes the requirement for
AMD's architecture to be set as this isn't necessary for upstream LLVM.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D105051
2021-06-28 17:05:07 -04:00
Joseph Huber c9f3240c9d [OpenMP][Documentation] Add OpenMPOpt optimization section
Add some information about the optimizations currently provided by
OpenMPOpt. Every optimization performed should eventually be listed
here.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D105050
2021-06-28 17:05:03 -04:00
Pushpinder Singh 20df2c7052 [AMDGPU][Libomptarget] Collect allocatable memory pools using HSA
The logic is almost similar to that of system.cpp with one change that
instead of adding all the memory pools to a device struct it only
keeps a single pool. The existing approach also always allocated memory on
the first HSA pool found for a GPU.

This depends on D104691. The goal of this series of patches is to remove
_atl_machine global. The next patch will drop g_atl_machine entirely.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D104695
2021-06-28 11:28:04 +00:00
Jon Chesterfield f66b8fdc0a [libomptarget][amdgpu] Build openmp for two more targets
[libomptarget][amdgpu] Build openmp for two more targets

The 4800U APU is a gfx902 and the MI100 accelerator is a gfx908.
Both numbers are listed in ROCT topology.c

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D104922
2021-06-25 19:02:03 +01:00
Jon Chesterfield 96f6873dff [OpenMP][NFC] Drop unused headers from amdgpu plugin 2021-06-25 12:08:56 +01:00
AndreyChurbanov b2787945f9 [OpenMP][NFC] libomp: fix wrong debug assertion.
Normalized bounds of chunk of iterations to steal from are inclusive,
so upper bound should not be decremented in expression to check.
Problem was in attempt to steal iterations 0:0, that caused assertion after
wrong decrement. Reported in comment to https://reviews.llvm.org/D103648.

Differential Revision: https://reviews.llvm.org/D104880
2021-06-25 02:02:14 +03:00
Aakanksha Patil 3453f3dd46 [AMDGPU] Add gfx1035 target
Differential Revision: https://reviews.llvm.org/D104804
2021-06-24 14:32:41 -04:00
Joel E. Denny 9fa5e3280d [OpenMP] Fix delete map type in ref count debug messages
For example, without this patch:

```
$ cat test.c
int main() {
  int x;
  #pragma omp target enter data map(alloc: x)
  #pragma omp target enter data map(alloc: x)
  #pragma omp target enter data map(alloc: x)
  #pragma omp target exit data map(delete: x)
  ;
  return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists\|last'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=1, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=2 (incremented), Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=3 (incremented), Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffddf1eaea8, TgtPtrBegin=0x00000000013bb040, Size=4, RefCount=2 (decremented)
Libomptarget --> There are 4 bytes allocated at target address 0x00000000013bb040 - is not last
```

`RefCount` is reported as decremented to 2, but it ought to be reset
because of the `delete` map type, and `is not last` is incorrect.

This patch migrates the reset of reference counts from
`DeviceTy::deallocTgtPtr` to `DeviceTy::getTgtPtrBegin`, which then
correctly reports the reset.  Based on the `IsLast` result from
`DeviceTy::getTgtPtrBegin`, `targetDataEnd` then correctly reports `is
last` for any deletion.  `DeviceTy::deallocTgtPtr` is responsible only
for the final reference count decrement and mapping removal.

An obscure side effect of this patch is that a `delete` map type when
the reference count is infinite yields `DelEntry=IsLast=false` in
`targetDataEnd` and so no longer results in a
`DeviceTy::deallocTgtPtr` call.  Without this patch, that call is a
no-op anyway besides some unnecessary locking and mapping table
lookups.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D104560
2021-06-23 09:57:19 -04:00
Joel E. Denny 48421ac441 [OpenMP] Improve ref count debug messages
For example, without this patch:

```
$ cat test.c
int main() {
  int x;
  #pragma omp target enter data map(alloc: x)
  #pragma omp target exit data map(release: x)
  ;
  return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, updated RefCount=1
```

There are two problems in this example:

* `RefCount` is not reported when a mapping is created, but it might
  be 1 or infinite.  In this case, because it's created by `omp target
  enter data`, it's 1.  Seeing that would make later `RefCount`
  messages easier to understand.
* `RefCount` is still 1 at the `omp target exit data`, but it's
  reported as `updated`.  The reason it's still 1 is that, upon
  deletions, the reference count is generally not updated in
  `DeviceTy::getTgtPtrBegin`, where the report is produced.  Instead,
  it's zeroed later in `DeviceTy::deallocTgtPtr`, where it's actually
  removed from the mapping table.

This patch makes the following changes:

* Report the reference count when creating a mapping.
* Where an existing mapping is reported, always report a reference
  count action:
    * `update suppressed` when `UpdateRefCount=false`
    * `incremented`
    * `decremented`
    * `deferred final decrement`, which replaces the misleading
      `updated` in the above example
* Add comments to `DeviceTy::getTgtPtrBegin` to explain why it does
  not zero the reference count.  (Please advise if these comments miss
  the point.)
* For unified shared memory, don't report confusing messages like
  `RefCount=` or `RefCount= updated` given that reference counts are
  irrelevant in this case.  Instead, just report `for unified shared
  memory`.
* Use `INFO` not `DP` consistently for `Mapping exists` messages.
* Fix device table dumps to print `INF` instead of `-1` for an
  infinite reference count.

Reviewed By: jhuber6, grokos

Differential Revision: https://reviews.llvm.org/D104559
2021-06-23 09:57:19 -04:00
Joseph Huber 72d4cd627c [OpenMP] Introduce an CMake find module for OpenMP Target support
This introduces a CMake find module for detecting target offloading support in
a compiler. The goal is to make it easier to incorporate target offloading into
a cmake project.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D104710
2021-06-22 23:01:38 -04:00
Joseph Huber 422adaa879 [OpenMP] Add thread limit environment variable support to plugins
The OpenMP 5.1 standard defines the environment variable
`OMP_TEAMS_THREAD_LIMIT` to limit the number of threads that will be run in a
single block. This patch adds support for this into the AMDGPU and CUDA
plugins.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D103923
2021-06-22 16:25:40 -04:00
Shilei Tian 0029059074 [NFC][OpenMP][Offloading] Unified the construction of mapping table entry
This patch unifies construction of mapping table entry to use `emplace`.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D104580
2021-06-22 12:38:47 -04:00
Joseph Huber 244e98ff48 [Libomptarget] Improve device runtime implementation for globalized variables.
Currently the runtime implementation of `__kmpc_alloc_shared` is extremely slow because it allocated memory for each thread individually. This patch adds a small buffer for the threads to share data and will greatly improve performance for builds where all globalization could not be optimized out. If the shared buffer is full, then memory will not only be allocated per-warp rather than per-thread.

Depends on D97680

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D104666
2021-06-22 11:52:49 -04:00
Joseph Huber 952a0f2385 [Libomptarget] Introduce new globalization runtime calls
Summary:
This patch introduces the new globalization runtime to be used by D97680. These
runtime calls will replace the __kmpc_data_sharing_push_stack and
__kmpc_data_sharing_pop_stack functions.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102532
2021-06-22 10:05:42 -04:00
AndreyChurbanov 5dd4d0d46f [OpenMP] libomp: fix dynamic loop dispatcher
Restructured dynamic loop dispatcher code.
Fixed use of dispatch buffers for nonmonotonic dynamic (static_steal) schedule:
- eliminated possibility of stealing iterations of the wrong loop when victim
  thread changed its buffer to work on another loop;
- fixed race when victim thread changed its buffer to work in nested parallel;
- eliminated "static" property of the schedule, that is now a single thread can
  execute whole loop.

Differential Revision: https://reviews.llvm.org/D103648
2021-06-22 16:29:01 +03:00
Pushpinder Singh 9d110f9159 [AMDGPU][Libomptarget] Move allow_access_to_all_gpu_agents to rtl.cpp
Moving this method helps eliminate a use of g_atl_machine.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D104691
2021-06-22 11:44:52 +00:00
Vladislav Vinogradov eab1fd389b [omp] Fix build without ITT after D103121 changes
Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D104638
2021-06-21 18:17:52 +03:00
Vyacheslav Zakharin aad9e48c5f [NFC][libomptarget] Remove redundant libelf dependency for elf_common.
Differential Revision: https://reviews.llvm.org/D104549
2021-06-21 07:19:55 -07:00
Pushpinder Singh 7a97cd9da7 [AMDGPU][Libomptarget] Remove redundant functions
There does not seem to be any use of these functions. They just
put the value to a local which is never used again.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D104512
2021-06-21 06:13:24 +00:00
Shilei Tian ec97866454 [OpenMP] Make bug49334.cpp more reproducible
`bug49334.cpp` cannot detect data race in `libomptarget` efficiently. It
is reported that with `N = 256` and `BS = 16`, the data race can be reproduced
more steadily. The next coming pathces will fix it so this patch is expected to
fail now.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D104552
2021-06-18 18:35:41 -04:00
Asher Mancinelli 5c189d30e6 [OpenMP] Update FAQ for enabling cuda offloading
Add an FAQ entry and add a few lines to an existing one. Document
the use of `GCC_INSTALL_PREFIX` for pointing clang to correct
GCC installation for two-stage build.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D104474
2021-06-18 11:55:45 -06:00
Vyacheslav Zakharin 836992ab9a [NFC][libomptarget] Build elf_common with PIC.
Differential Revision: https://reviews.llvm.org/D104545
2021-06-18 09:20:10 -07:00
Vyacheslav Zakharin c5b7c7c8f7 [NFC][libomptarget] Fixed -DLLVM_ENABLE_RUNTIMES="openmp" build.
Differential Revision: https://reviews.llvm.org/D104535
2021-06-18 09:20:10 -07:00
Terry Wilmarth 25073a4ecf [OpenMP] Add Two-level Distributed Barrier
Two-level distributed barrier is a new experimental barrier designed
for Intel hardware that has better performance in some cases than the
default hyper barrier.

This barrier is designed to handle fine granularity parallelism where
barriers are used frequently with little compute and memory access
between barriers.  There is no need to use it for codes with few
barriers and large granularity compute, or memory intensive
applications, as little difference will be seen between this barrier
and the default hyper barrier. This barrier is designed to work
optimally with a fixed number of threads, and has a significant setup
time, so should NOT be used in situations where the number of threads
in a team is varied frequently.

The two-level distributed barrier is off by default -- hyper barrier
is used by default. To use this barrier, you must set all barrier
patterns to use this type, because it will not work with other barrier
patterns.  Thus, to turn it on, the following settings are required:

KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
KMP_PLAIN_BARRIER_PATTERN=dist,dist
KMP_REDUCTION_BARRIER_PATTERN=dist,dist

Branching factors (set with KMP_FORKJOIN_BARRIER, KMP_PLAIN_BARRIER,
and KMP_REDUCTION_BARRIER) are ignored by the two-level distributed
barrier.

Differential Revision: https://reviews.llvm.org/D103121
2021-06-16 15:34:55 -05:00
Vyacheslav Zakharin b5c4fc0f23 [NFC][libomptarget] Reduce the dependency on libelf
This change-set removes libelf usage from elf_common part of the plugins.
libelf is still used in x86_64 generic plugin code and in some plugins
(e.g. amdgpu) - these will have to be cleaned up in separate checkins.

Differential Revision: https://reviews.llvm.org/D103545
2021-06-16 08:34:23 -07:00
AndreyChurbanov 610fea65e2 [OpenMP] libomp: fixed implementation of OMP 5.1 inoutset task dependence type
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.

Differential Revision: https://reviews.llvm.org/D97085
2021-06-16 14:47:29 +03:00
Joachim Protze d2a7871b5e [OpenMP][NFC] Add back suppression of warning
Commit cff215565e did not fix all unused variables in different builds,
so adding back the suppression for now.
2021-06-16 10:14:59 +02:00
Joachim Protze cff215565e [OpenMP] Remove unused variables from libomp code
Several variables were left unused as a result of different patches removing
their use.

Two variables have some use:
`poll_count` is used by the KMP_BLOCKING macro only under certain conditions.
Adding (void) to tell the compiler to ignore the unused variable.

`padding` is a dummy stack allocation with no intent to be used. Also adding
(void) to make the compiler ignore the unused variable.

Differential Revision: https://reviews.llvm.org/D104303
2021-06-16 09:33:46 +02:00
Peyton, Jonathan L 56da28240f [OpenMP] Add GOMP 5.0 version symbols to API
* Add GOMP versioned pause functions
* Add GOMP versioned affinity format functions

To do the affinity format functions, only attach versioned symbols
to the APPEND Fortran entries (e.g., omp_set_affinity_format_) since
GOMP only exports two symbols (one for Fortran, one for C). Our
affinity format functions have three symbols.
e.g., with omp_set_affinity_format:
1) omp_set_affinity_format (Fortran interface)
2) omp_set_affinity_format_ (Fortran interface)
3) ompc_set_affinity_format (C interface)

Have the GOMP version of the C symbol alias the ompc_* 3) version
instead of the Fortran unappended version 1).

Differential Revision: https://reviews.llvm.org/D103647
2021-06-15 16:25:00 -05:00
Peyton, Jonathan L 92baf414db [OpenMP] Fix affinity determine capable algorithm on Linux
Remove strange checks for syscall() arguments where mask is NULL.
Valgrind reports these as error usages for the syscall.
Instead, just check if CACHE_LINE bytes is long enough. If not, then
search for the size. Also, by limiting the first size detection
attempt to CACHE_LINE bytes, instead of 1MB, we don't use more than one
cache line for the mask size. Before this patch, sometimes the returned
mask size was 640 bytes (10 cache lines) because the initial call to
getaffinity() was limited only by the internal kernel mask size
which can be very large.

Differential Revision: https://reviews.llvm.org/D103637
2021-06-15 16:21:30 -05:00
Peyton, Jonathan L 0ddde4d865 [OpenMP] Lazily assign root affinity
Lazily set affinity for root threads. Previously, the root thread
executing middle initialization would attempt to assign affinity
to other existing root threads. This was not working properly as the
set_system_affinity() function wasn't setting the affinity for the
target thread. Instead, the middle init thread was resetting the
its own affinity using the target thread's affinity mask.

Differential Revision: https://reviews.llvm.org/D103625
2021-06-15 16:21:06 -05:00
Pushpinder Singh cadcaf3f46 [AMDGPU][Libomptarget] Drop dead code related to g_atl_machine
This patch includes some changes which deletes the code accessing
g_atl_machine global. Some accesses related to memory_pools are
still remaining.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103813
2021-06-15 05:21:35 +00:00
Ron Lieberman 91f147792e [libomptarget][amdgpu] Remove stray fprintf in rtl.cpp
remove unintended fprintf in rtl.cpp

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D104003
2021-06-10 01:57:30 +00:00
AndreyChurbanov 9ce2e5e700 Revert "[OpenMP] libomp: implement OpenMP 5.1 inoutset task dependence type"
This reverts commit a1f550e052.

Revert in order to fix backwards compatibility breakage
caused by type size change for task dependence flag.
2021-06-09 17:38:38 +03:00
Joachim Protze 639b397931 [OpenMP][Tools] Fix Archer handling of task dependencies
The current handling of dependencies in Archer has two flaws:

- annotation of dependency synchronization is not limited to sibling tasks
- annotation of in/out dependencies is based on the assumption, that dependency
  variables will rarely be byte-sized variables.

This patch introduces a map in the generating task to manage the dependency
variables for the child tasks. The map is only accesses from the generating
task, so no locking is necessary. This also limits the dependency-based
synchronization to sibling tasks.
This patch also introduces proper handling for new dependency types such as
mutexinoutset and inoutset.

Differential Revision: https://reviews.llvm.org/D103608
2021-06-09 13:36:20 +02:00
Joachim Protze 08d8f1a958 [OpenMP][Tools] Cleanup memory pool used in Archer
The main motivation for reusing objects is that it helps to avoid creating and
leaking synchronization clocks in TSan. The reused object will reuse the
synchronization clock in TSan.

Before, new and delete operators were overloaded to get and return memory for
the object from/to the object pool.
This patch replaces the operator overloading with explicit static New/Delete
functions.

Objects for parallel regions and implicit tasks will always be recruited and
returned to the thread-local object pool. Only for explicit task, there is a
chance that an other thread completes the task and will free the object. This
patch optimizes the thread-local New/Delete calls by avoiding locks and only
lock if the pool is empty. Remote threads return the object into a separate
queue.

The chunk size for allocations is now decided based on page size. The objects
will also be aligned to cache lines avoiding false sharing.

This is the first patch in a series to provide better tasking support.

Differential Revision: https://reviews.llvm.org/D103606
2021-06-09 13:36:19 +02:00
Joachim Protze 82e4e50531 [OpenMP][Tools] Fix Archer for MACOS
Archer uses weak symbol overloads of TSan functions to enable loading the tool
even if the application is not built with TSan. For MACOS the tool collects
the function pointer at runtime.
When adding the function entry/exit markers, we missed to add the functions
in the MACOS codepath.
This patch also replaces the repeated function lookup by a single initial
function lookup and fixes the disabling logic in RunningOnValgrind.

Differential Revision: https://reviews.llvm.org/D103607
2021-06-09 13:36:19 +02:00
Brendon Cahoon 294efbbd3e Reland "[AMDGPU] Add gfx1013 target"
This reverts commit 211e584fa2.

Fixed a use-after-free error that caused the sanitizers to fail.
2021-06-08 21:15:35 -04:00
Joseph Huber df965513a9 [OpenMP] Add an information flag for device data transfers
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.

Reviewed By: jdoerfert, grokos

Differential Revision: https://reviews.llvm.org/D103927
2021-06-08 20:23:27 -04:00
Brendon Cahoon 211e584fa2 Revert "[AMDGPU] Add gfx1013 target"
This reverts commit ea10a86984.

A sanitizer buildbot reports an error.
2021-06-08 16:29:41 -04:00
Brendon Cahoon ea10a86984 [AMDGPU] Add gfx1013 target
Differential Revision: https://reviews.llvm.org/D103663
2021-06-08 12:49:49 -04:00
Vignesh Balasubramanian f61602b0d3 [OpenMP][OMPD] Implementation of OMPD debugging library - libompd.
This is the first of seven patches that implements OMPD, a debugging interface to support debugging of OpenMP programs.
It contains support code required in "openmp/runtime" for OMPD implementation.

Reviewed By: @hbae
Differential Revision: https://reviews.llvm.org/D100181
2021-06-08 16:44:22 +05:30
Peyton, Jonathan L d70e1f1276 [OpenMP][runtime] add .clang-tidy file
Use same checks as compiler-rt which removes checks for readability-*
and llvm-header style.

Differential Revision: https://reviews.llvm.org/D103711
2021-06-07 13:56:39 -05:00
AndreyChurbanov a1f550e052 [OpenMP] libomp: implement OpenMP 5.1 inoutset task dependence type
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
Size of type of the dependence flag changed from 1 to 4 bytes in clang.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.

Differential Revision: https://reviews.llvm.org/D97085
2021-06-07 21:42:51 +03:00
Bryan Chan 54f059c900 [OpenMP] Check loc for NULL before dereferencing it
The ident_t * argument in __kmp_get_monotonicity was being used without
a customary NULL check, causing the function to crash in a Debug build.
Release builds were not affected thanks to dead store elimination.
2021-06-07 10:45:48 -04:00
Pushpinder Singh 4f8bc7caf4 [AMDGPU][Libomptarget] Remove atlc global
This global struct used to hold various flags for monitoring the
initialization of hsa.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103795
2021-06-07 11:09:01 +00:00
Pushpinder Singh f5f329a371 [AMDGPU][Libomptarget] Rework logic for locating kernarg pools
Previous logic was to always use the first kernarg pool found to allocate
kernel args. This patch changes this to use only the kernarg pool which
has non-zero size. This logic is also reworked to not use any globals.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103600
2021-06-07 06:41:37 +00:00
Terry Wilmarth 8ec9aa236e [OpenMP] Add experimental nesting mode feature
Nesting mode is a new experimental feature in the OpenMP
runtime. It allows a user to set up nesting for an application in a
way that corresponds to the hardware topology levels on the machine an
application is being run on.  For example, if a machine has 2 sockets,
each with 12 cores, then use of nesting mode could set up an outer
level of nesting that uses 2 threads per parallel region, and an inner
level of nesting that uses 12 threads per parallel region.

Nesting mode is controlled with the KMP_NESTING_MODE environment
variable as follows:

1) KMP_NESTING_MODE = 0: Nesting mode is off (default); max-active-levels-var
is set to 1 (the default -- nesting is off, nested parallel regions
are serialized).

2) KMP_NESTING_MODE = 1: Nesting mode is on, and a number of threads
will be assigned for each level discovered in the machine topology;
max-active-levels-var is set to the number of levels discovered.

3) KMP_NESTING_MODE = n, n>1: [Note: this option is experimental and may change
or be removed in the future.] Nesting mode is on, and a number of
threads will be assigned for each topology level discovered on the
machine, up to k<=n levels (since there may be fewer than n levels
discovered in the topology), and beyond the kth level, nested parallel
regions will be serialized; NOTE: max-active-levels-var is 1 (the default --
nesting is off, and nested parallel regions are serialized until the
user changes max-active-levels-var.

If the user sets OMP_NUM_THREADS or OMP_MAX_ACTIVE_LEVELS, they will
override KMP_NESTING_MODE settings for the associated environment
variables. The detected topology may be limited by an affinity mask
setting on the initial thread, or if the user sets KMP_HW_SUBSET. See
also: KMP_HOT_TEAMS_MAX_LEVEL for controlling use of hot teams for
nested parallel regions. Note that this feature only sets numbers of
threads used at nesting levels.  The user should make use of
OMP_PLACES and OMP_PROC_BIND or KMP_AFFINITY for affinitizing those
threads, if desired.

Differential Revision: https://reviews.llvm.org/D102188
2021-06-04 16:01:11 -05:00
Peyton, Jonathan L 56dd158c32 [OpenMP] fix spelling error in message-converter.pl 2021-06-04 11:20:32 -05:00
Peyton, Jonathan L f7655f3df3 [OpenMP] Fix improper printf format specifier 2021-06-02 11:04:48 -05:00
Hansang Bae 7ba4e96ede [OpenMP] Use new task type/flag for taskwait depend events.
Differential Revision: https://reviews.llvm.org/D103464
2021-06-02 10:16:38 -05:00
Pushpinder Singh b25546a4b4 [AMDGPU][Libomptarget][NFC] Remove bunch of dead structs
Dropped structs are atmi_machine_t, atmi_device_t and atmi_memory_t

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103509
2021-06-02 10:40:51 +00:00
Pushpinder Singh 2368170a8d [AMDGPU][Libomptarget][NFC] Remove atmi_place_t
atmi_place_t has been replaced with int DeviceId.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103508
2021-06-02 10:35:28 +00:00
Peyton, Jonathan L 2020c981fa [OpenMP] Add L2-Tile equivalence for KNL
When on KNL and L2 or Tile layer is detected, manually add
the corresponding layer which is equivalent.

Differential Revision: https://reviews.llvm.org/D102865
2021-06-01 14:17:13 -05:00
Hansang Bae cf5c94ef08 [OpenMP] Define named constants for interop's foreign runtime ID
Also added missing Fortran definitions for interop support.

Differential Revision: https://reviews.llvm.org/D102883
2021-06-01 13:06:59 -05:00
Pushpinder Singh fb113264a8 [AMDGPU][Libomptarget] Remove g_atmi_machine global
Turns out the only purpose of this class was verify if device ID
was in range or not which could be done easily by using g_atl_machine.

Still getting rid of g_atl_machine is pending which would be done in
a later patch.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103443
2021-06-01 12:34:24 +00:00
Pushpinder Singh 4fc3286951 [AMDGPU][Libomptarget][NFC] Split host and device malloc
This patch splits the code path for host and device malloc.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103389
2021-05-31 12:09:18 +00:00
Pushpinder Singh 8b79dfb302 [AMDGPU][Libomptarget][NFC] Remove atmi_mem_place_t
This struct was used to specify the device on which memory was
being allocated/free in atmi_malloc/free. It has now been replaced
with int DeviceId.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103239
2021-05-27 11:53:18 +00:00
Jon Chesterfield 2fdf8bbd19 [libomptarget][nfc][amdgpu] Factor out setting upper bounds
Refactor suggested in D103037 to help avoid similar copy-paste errors.
Change is mechanical. Some parts of this would be more robust with unsigned.

Reviewed By: dhruvachak

Differential Revision: https://reviews.llvm.org/D103090
2021-05-26 19:57:49 +01:00
Jon Chesterfield c5c1ec7945 [libomptarget][nfc][amdgpu] Refactor uses of KernelInfoTable
Suggested in D103059. Use a single lookup instead of two, more const, less mutation.

Reviewed By: dhruvachak

Differential Revision: https://reviews.llvm.org/D103093
2021-05-26 19:25:25 +01:00
Jon Chesterfield 07f59baad6 [libomptarget][nfc][amdgpu] Remove atmi_status_t type
ATMI_STATUS_UNKNOWN was unused, deleted references to it.
Replaced ATMI_STATUS_{SUCCESS,ERROR} with HSA_STATUS_{SUCCESS,ERROR}
Replaced atmi_status_t with hsa_status_t

Otherwise no change. In particular, conversions between atmi_status_t and
hsa_status_t will now be conversions between hsa_status_t and itself.

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D103115
2021-05-26 17:02:19 +01:00
Pushpinder Singh a2d6ef5876 [AMDGPU][Libomptarget] Inline atmi_init/atmi_finalize
After D102847, these functions can be inlined.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103075
2021-05-26 10:50:08 +00:00
Pushpinder Singh cc8661ac4a [AMDGPU][Libomptarget] Delete g_atmi_initialized
This patch drops g_atmi_initialized and inlines the Initialize &
Finalize methods from Runtime class.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D102847
2021-05-26 10:46:54 +00:00
Pushpinder Singh 7648b6978e [AMDGPU][Libomptarget] Move Kernel/Symbol info tables to RTLDeviceInfoTy
Two globals KernelInfoTable & SymbolInfoTable are moved
into RTLDeviceInfoTy class.
This builds on the top of D102691.
[2/2]

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D102692
2021-05-26 10:02:28 +00:00
Jon Chesterfield df005fa364 [libomptarget][nfc] Move hostcall required test to rtl
[libomptarget][nfc] Move hostcall required test to rtl

Remove a global, fix minor race. First of N patches to bring up hostcall.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103058
2021-05-25 22:43:17 +01:00
Pushpinder Singh b0d68c7141 [AMDGPU][Libomptarget] Mark lambda_by_value test as XFAIL
Reason: Missing printf definition

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103078
2021-05-25 12:16:54 +00:00
Jon Chesterfield 75492e20fb [libomptarget][nfc] Accept callable for hsa iterate_symbols
[libomptarget][nfc] Accept callable for hsa iterate_symbols
Candidate refactor to simplify D102692

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D103030
2021-05-25 09:29:11 +01:00
Dhruva Chakrabarti 96d70f4d28 [libomptarget] [amdgpu] Added LDS usage to the kernel trace
Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103059
2021-05-24 19:33:48 -07:00
Hansang Bae 95cefacfe1 [OpenMP] Fix crashing critical section with hint clause
Runtime was using the default lock type without using the hint.

Differential Revision: https://reviews.llvm.org/D102955
2021-05-24 17:25:01 -05:00
Dhruva Chakrabarti ca17b26d4d [libomptarget] [amdgpu] Fix copy-paste error setting NumThreads for a corner case.
Fix the case where NumTeams was set incorrectly instead of NumThreads

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D103037
2021-05-24 15:23:15 -07:00
Pushpinder Singh 486110eb41 [AMDGPU][Libomptarget] Remove global KernelNameMap
KernelNameMap contains entries like "key.kd" => key which clearly
could be replaced by simple logic of removing suffix from the key.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D102691
2021-05-24 08:46:08 +00:00
AndreyChurbanov aa6e7e8da8 [OpenMP] libomp: move warnings to after library initialization
Warnings on deprecated api cannot be suppressed if the library is not initialized.
With this change it is possible to set KMP_WARNINGS=false to suppress the warnings.

Differential Revision: https://reviews.llvm.org/D102676
2021-05-21 23:47:23 +03:00
George Rokos d0bc04d6b9 [libomptarget] Fix a bug whereby firstprivates are not copied over to the device
The check for the TO flag when processing firstprivates is missing. As a result,
sometimes the device copy of a firstprivate never gets initialized. Currectly we
try to force lambda structs to be allocated immediately by marking them as a
non-firstprivate, so that PrivateArgumentManagerTy::addArg allocates memory for
them immediately. However, calling addArg with IsFirstPrivate=false makes the
function skip initializing the device copy. Whether an argument is firstprivate
and whether we need to allocate memory immediately are not synonyms, so this
patch introduces one more control variable for immediate allocation and sets it
apart from initialization.

Differential Revision: https://reviews.llvm.org/D102890
2021-05-21 10:52:08 -07:00
Jon Chesterfield d54712ab4d [libomptarget][amdgpu] Mark alloc, free weak to facilitate local experimentation
[libomptarget][amdgpu] Mark alloc, free weak to facilitate local experimentation

There are a lot of different ways we might implement the devicertl local alloc
and free functions. Via host, local buffers (stack or arena), specialising per
kernel etc. It is not yet clear what the right design is. This change makes the
alloc and free functions weak, so one can override them from local tests while
comparing options.

Not strictly necessary, as a comparable patch can be applied locally each time,
but would be convenient for out of tree dev. Plan would be to drop the weak
attribute at the same time as introducing a working allocator to trunk.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102499
2021-05-21 16:09:22 +01:00
Jon Chesterfield 68b88ae670 [libomptarget] Improve dlwrap compile time error diagnostic
[libomptarget] Improve dlwrap compile time error diagnostic

The dlwrap interface takes an explict arity, e.g. DLWRAP(cuAlloc, 2);
This probably can't be eliminated as it controls the argument list of an
external symbol, not an inline header function. If the arity given is too
big, the error from clang referring to the line is in the middle of
implementation details.

/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1277:7: error: static_assert failed
      due to requirement '0UL < tuple_size<std::tuple<>>::value' "tuple index is in range"
      static_assert(__i < tuple_size<tuple<>>::value,
      ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:93:27 ...

/home/amd/llvm-project/openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.cpp:34:1: note: in
      instantiation of template class 'dlwrap::trait<cudaError_enum (*)(unsigned long *, unsigned
      long)>::arg<2>' requested here
DLWRAP(cuMemAlloc, 3);
^
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:51:31: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:166:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:133:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:186:37: ...

If the arity is too small, the diagnostic is better:

cuda/dynamic_cuda/cuda.cpp:34:1: error: too few
      arguments to function call, expected 2, have 1
DLWRAP(cuMemAlloc, 1);

This patch changes the diagnostic to:

cuda/dynamic_cuda/cuda.cpp:34:1: error:
      static_assert failed due to requirement '1 == trait<cudaError_enum (*)(unsigned long *, unsigned
      long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 1);

or

cuda/dynamic_cuda/cuda.cpp:34:1: error:
      static_assert failed due to requirement '3 == trait<cudaError_enum (*)(unsigned long *, unsigned
      long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 3);

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102858
2021-05-20 20:33:36 +01:00
Jon Chesterfield d18fb09c69 [libomptarget][amdgpu] Remove majority of fatal errors
[libomptarget][amdgpu] Remove majority of fatal errors

Replaces most calls to exit() with returning an error to the library entry
point. Minor changes to error handling for clear bugs, remove some dead code.

Each exit() call site replaced is either in a library entry point or a
function that already returns error codes on some paths. The existing handling
is not well tested but replacing exit() with a fallback path should be a strict
improvement.

Remaining two early exit points are an abort() from a callback and exit() from
within msgpack. Fixes for those are less obvious and left for a later patch.

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D102346
2021-05-20 16:26:43 +01:00
Jon Chesterfield ea68ad6e26 [libomptarget] Disable test bug49334 on amdgpu
[libomptarget] Disable test bug49334 on amdgpu

Hangs on amdgpu, do not know why. Disable to unblock build.

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D102017
2021-05-20 15:46:56 +01:00
Pushpinder Singh d7503c3bce [AMDGPU][Libomptarget] Rename & move g_executables to private
This patch moves g_executables to private member of Runtime class
and is renamed to HSAExecutables following LLVM naming convention.

This movement required making Runtime::Initialize and Runtime::Finalize
non-static. Verified the correctness of this change by running
libomptarget tests on gfx906.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D102600
2021-05-18 05:43:23 +00:00
Pushpinder Singh 3bc2b97b34 [AMDGPU][libomptarget] Remove unused global variables
This initial patch removes some unused variables from global namespace.
There will more incoming patches for moving global variables to classes
or static members.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D102598
2021-05-18 05:40:49 +00:00
Shilei Tian af6511d730 [OpenMP] Fixed Bug 49356
Bug 49356 (https://bugs.llvm.org/show_bug.cgi?id=49356) reports crash in
the test case `tasking/bug_taskwait_detach.cpp`, which is caused by the wrong
function declaration. `gtid` in `__kmpc_omp_task` should be `kmp_int32`.

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D102584
2021-05-17 12:14:54 -04:00
Aakanksha Patil 464e4dc50f [AMDGPU] Add gfx1034 target
Differential Revision: https://reviews.llvm.org/D102306
2021-05-13 14:25:18 -04:00
Jon Chesterfield 10de217209 [libomptarget][amdgpu] Fix truncation error for partial wavefront
[libomptarget][amdgpu] Fix truncation error for partial wavefront

The partial barrier implementation involves one wavefront resetting and N-1
waiting. This change future proofs against launching with a number of threads
that is not a multiple of the wavefront size.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102407
2021-05-13 17:31:57 +01:00
Jon Chesterfield b049870d3b [libomptarget][amdgpu] Convert an assert to print and offload_fail
[libomptarget][amdgpu] Convert an assert to print and offload_fail

The kernel launched is supposed to be present in the binary, but a not yet
diagnosed bug means it is missing for some of the qmcpack test cases. Changing
from assert to print and offload_fail should help diagnose that and similar bugs.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102378
2021-05-13 17:31:36 +01:00
Michael Kruse 34ed3e6337 [OpenMP] Test unified shared memory tests only on systems that support it.
Add a `REQUIRES: unified_shared_memory` option to tests that use `#pragma omp requires unified_shared_memory`.

For CUDA, the feature tag is derived from LIBOMPTARGET_DEP_CUDA_ARCH which itself is derived using [[ https://cmake.org/cmake/help/latest/module/FindCUDA.html#commands | cuda_select_nvcc_arch_flags ]]. The latter determines which compute capability the GPU in the system supports. To ensure that this is the CUDA arch being used, we could also set the `-Xopenmp-target -march=` flag.
In the absence of an NVIDIA GPU, LIBOMPTARGET_DEP_CUDA_ARCH will be 35. That is, in that case we are assuming unified_shared_memory is not available. CUDA plugin testing could be disabled entirely in this case, but this currently depends on `LIBOMPTARGET_CAN_LINK_LIBCUDA OR LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA`, not on whether the hardware is actually available.

For all other targets, nothing changes and we are assuming unified shared memory is available. This might need refinement if not the case.

This tries to fix the [[ http://meinersbur.de:8011/#/builders/143 | OpenMP Offloading Buildbot ]] that, although brand-new, only has a Pascal-generation (sm_61) GPU installed. Hence, tests that require unified shared memory are currently failing. I wish I had known in advance.

Reviewed By: protze.joachim, tianshilei1992

Differential Revision: https://reviews.llvm.org/D101498
2021-05-13 11:08:04 -05:00
Jon Chesterfield 9934571eab [libomptarget][amdgpu][nfc] Expand errorcheck macros
[libomptarget][amdgpu][nfc] Expand errorcheck macros

These macros expand to continue, which is confusing, or exit,
which is incompatible with continuing execution on offloading fail.

Expanding the macros in place makes the code look untidy but the
control flow obvious and amenable to improving. In particular, exit
becomes easier to eliminate.

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D102230
2021-05-12 17:30:41 +01:00
Christopher Pulido 4fb0aaf033 [OpenMP] Changes to enable MSVC ARM64 build of libomp
This is the first in a series of changes to the OpenMP runtime
that have been done internally by Microsoft. This patch makes
the necessary changes to enable libomp.dll to build with
the MSVC compiler targeting ARM64.

Differential Revision: https://reviews.llvm.org/D101173
2021-05-11 23:03:12 +03:00
Jon Chesterfield 72995a4bdf [libomptarget][nfc] Add hook to easily disable building amdgcn bclib
[libomptarget][nfc] Add hook to easily disable building amdgcn bclib

This is useful when building LLVM with a toolchain that can't emit code
for amdgcn, e.g. because it overrides the include search path with headers
from another architecture, or the clang compiler is missing builtins.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D102229
2021-05-11 17:23:09 +01:00
Peyton, Jonathan L c765d140fe [OpenMP] Fix hidden helper + affinity
When KMP_AFFINITY is set, each worker thread's gtid value is used as an
index into the place list to determine the thread's placement. With hidden
helpers enabled, this gtid value is shifted down leading to unexpected
shifted thread placement. This patch restores the previous behavior by
adjusting the mask index to take the number of hidden helper threads
into account.

Hidden helper threads are given the full initial mask and do not
participate in any of the other affinity mechanisms (place partitioning,
balanced affinity). Their affinity is only printed for debug builds.

Differential Revision: https://reviews.llvm.org/D101882
2021-05-11 08:54:22 -05:00
Jon Chesterfield dedca78d48 [libomptarget][nfc] Drop stringify in macro
[libomptarget][nfc] Drop stringify in macro
A step towards deleting the macros entirely.

Differential Revision: https://reviews.llvm.org/D102228
2021-05-11 12:19:55 +01:00
Jon Chesterfield 6da348569c [libomptarget] Add support for target allocators to dynamic cuda RTL
[libomptarget] Add support for target allocators to dynamic cuda RTL

Follow on to D102000 which introduced new calls into libcuda. This patch adds
the corresponding entry points to dynamic_cuda, fixing the build for systems
that do not have the cuda toolkit installed.

Function types and enum from https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html

Reviewed By: pdhaliwal

Differential Revision: https://reviews.llvm.org/D102169
2021-05-10 15:27:50 +01:00
Pushpinder Singh 9586937ef5 [AMDGPU][OpenMP] Disable tests when amdgpu-arch fails
This patch prevents runtime tests running on systems without amdgpu.

Reviewed By: protze.joachim, tianshilei1992

Differential Revision: https://reviews.llvm.org/D102054
2021-05-10 07:37:27 +00:00
Vyacheslav Zakharin f2f88f3e7a An attempt to abandon omptarget out-of-tree builds.
I want to start using LLVM component libraries in libomptarget
to stop duplicating implementations already available in LLVM
(e.g. LLVMObject, LLVMSupport, etc.). Without relying on LLVM
in all libomptarget builds one has to provide fallback implementation
for each used LLVM feature.

This is an attempt to stop supporting out-of-llvm-tree builds of libomptarget.

I understand that I may need to revert this,
if this affects downstream projects in a bad way.

Differential Revision: https://reviews.llvm.org/D101509
2021-05-07 12:43:50 -07:00
Joseph Huber a15f8589f4 [libomptarget] Add support for target memory allocators to cuda RTL
Summary:
The allocator interface added in D97883 allows the RTL to allocate shared and
host-pinned memory from the cuda plugin. This patch adds support for these to
the runtime.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D102000
2021-05-07 10:27:02 -04:00
Jon Chesterfield 44ee974e2f [libomptarget][nfc] Refactor amdgpu partial barrier to simplify adding a second one
[libomptarget][nfc] Refactor amdgpu partial barrier to simplify adding a second one

D101976 would require a second barrier instance. This NFC to amdgpu makes it
simpler to add one (an extra global, one more line in init). Also renames the
current barrier to L0.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102016
2021-05-06 23:52:19 +01:00
Jon Chesterfield 7e9351b9de [libomptarget][amdgpu][nfc] Remove dead code from amdgpu plugin
[libomptarget][amdgpu][nfc] Remove dead code from amdgpu plugin

Drops an enum that was identical to a HSA one, localises some functions where
they were only called from one TU. Covers everything internalize + adce can
identify as dead, except for msgpack::dump which is useful when debugging.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D102014
2021-05-06 23:16:32 +01:00
Jon Chesterfield 25fe17d3c1 [libomptarget] Initial documentation on amdgpu offload
[libomptarget] Initial documentation on amdgpu offload

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D101927
2021-05-05 19:58:52 +01:00
Peyton, Jonathan L 9982f33e2c [OpenMP] Refactor/Rework topology discovery code
This patch does the following:

1) Introduce kmp_topology_t as the runtime-friendly structure (the
corresponding global variable is __kmp_topology) to determine the
exact machine topology which can vary widely among current and future
architectures. The current design is not easy to expand beyond the assumed
three layer topology: sockets, cores, and threads so a rework capable of
using the existing KMP_AFFINITY mechanisms is required.

This new topology structure has:
* The depth and types of the topology
* Ratio count for each consecutive level (e.g., number of cores per
   socket, number of threads per core)
* Absolute count for each level (e.g., 2 sockets, 16 cores, 32 threads)
* Equivalent topology layer map (e.g., Numa domain is equivalent to
   socket, L1/L2 cache equivalent to core)
* Whether it is uniform or not

The hardware threads are represented with the kmp_hw_thread_t
structure. This structure contains the ids (e.g., socket 0, core 1,
thread 0) and other information grabbed from the previous Address
structure. The kmp_topology_t structure contains an array of these.

2) Generalize the KMP_HW_SUBSET envirable for the new
kmp_topology_t structure. The algorithm doesn't assume any order with
tiles,numa domains,sockets,cores,threads. Instead it just parses the
envirable, makes sure it is consistent with the detected topology
(including taking into account equivalent layers) and then trims away
the unneeded subset of hardware threads. To enable this, a new
kmp_hw_subset_t structure is introduced which contains a vector of
items (hardware type, number user wants, offset). Any keyword within
__kmp_hw_get_keyword() can be used as a name and can be shortened as
well. e.g.,
KMP_HW_SUBSET=1s,2numa,4tile,2c,3t can be used on the KNL SNC-4 machine.

3) Simplify topology detection functions so they only do the singular
task of detecting the machine's topology. Printing, and all
canonicalizing functionality is now done afterwards. So many lines of
duplicated code are eliminated.

4) Add new ll_caches and numa_domains to OMP_PLACES, and
consequently, KMP_AFFINITY's granularity setting. All the names within
__kmp_hw_get_keyword() are available for use in OMP_PLACES or
KMP_AFFINITY's granularity setting.

5) Simplify and future-proof code where explicit lists of allowed
affinity settings keywords inside if() conditions.

6) Add x86 CPUID leaf 4 cache detection to existing x2apic id method
so equivalent caches could be detected (in particular for the ll_caches
place).

Differential Revision: https://reviews.llvm.org/D100997
2021-05-03 18:00:24 -05:00
Pushpinder Singh ae845d6426 [AMDGPU][OpenMP] Enable Libomptarget runtime tests
This enables the runtime tests on amdgpu targets.
10 tests have been marked as XFAIL on amdgcn currently mostly due to
missing printf.

Reviewed By: protze.joachim

Differential Revision: https://reviews.llvm.org/D99656
2021-05-03 05:56:42 +00:00
Martin Storsjö 01d27fc408 [OpenMP] Fix warnings due to redundant semicolons. NFC. 2021-05-02 21:51:06 +03:00
Kevin Athey bc9120047b Correct tiny misspelling (readlef -> readelf).
Getting my feet wet here as a new committer.

Correct misspelling in check-depends.pl.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D101552
2021-04-30 17:20:35 -07:00
Michael Kruse 7308862ff5 [OpenMP][CMake] Use in-project clang as CUDA->IR compiler.
If available, use the clang that is already built in the same project as
CUDA compiler unless another executable is explicitly defined. This also
ensures the generated deviceRTL IR will be consistent with the version
of Clang.

This patch is required to reliably test OpenMP offloading in a buildbot
without either a two-stage build (e.g. with LLVM_ENABLE_RUNTIMES) or a
separately installed clang on the worker that will eventually become
outdated.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D101265
2021-04-30 12:45:52 -05:00
Michael Kruse 3244a8b536 [OpenMP][CMake] Pass --cuda-path to regression tests.
The OpenMP runtime can be compiled using a CUDA installed at non-default
location with the -DCUDA_TOOLKIT_ROOT_DIR setting. However, check-openmp
will fail afterwards because Clang needs to know where to find the CUDA
headers.

Fix by passing -cuda-path to Clang using the value of
CUDA_TOOLKIT_ROOT_DIR which has been determined by CMake. Also set
LD_LIBRARY_PATH such that it can find the cuda runtime when executing.
This will ensure that the regression test do not depend on the current
environment, but use the environment it was configured for.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D101266
2021-04-27 16:27:40 -05:00
Joachim Protze 24f836e8fd [OpenMP][libomptarget] Separate lit tests for different offloading targets (2/2)
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.

This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.

In cases, where no RUN line was defined for a specific offloading target,
the corresponding target is declared as XFAIL. If it turns out that a test
actually supports the target, the XFAIL line can be removed.

Differential Revision: https://reviews.llvm.org/D101326
2021-04-27 15:54:32 +02:00
Joachim Protze b845217b1d [OpenMP][libomptarget] Separate lit tests for different offloading targets (1/2)
This patch creates a separate test directory for each offloading target to be
tested. This allows to test multiple architectures in one configuration, while
still see all failing tests separately. The lit test names include the target
triple, so that it will be easier to spot the failing target.

This patch also allows to mark expected failing tests based on the
target-triple, as the currently used triple is added to the lit "features":
```
// XFAIL: nvptx64-nvidia-cuda
```

Differential Revision: https://reviews.llvm.org/D101315
2021-04-27 12:30:01 +02:00
Joseph Huber 077fe0f739 [OpenMP][Documentation] Add FAQ entry for dynamically linked libraries
Summary:
Add an FAW entry detailing the support for using dynamically linked libraries
with OpenMP Offloading
2021-04-26 14:21:17 -04:00
Jon Chesterfield 58f125493d [libomptarget] Enable AMDGPU devicertl
[libomptarget] Enable AMDGPU devicertl

The amdgpu devicertl is written in freestanding openmp and compiles to a
bitcode library (per listed gfx arch) with no unresolved symbols. It requires
a recent clang, preferably the one from the same monorepo checkout.

This is D98658, with printf explicitly stubbed out, after patching clang to no
longer require an llvm with the amdgpu target enabled.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D101213
2021-04-24 02:24:44 +01:00
Johannes Doerfert 17330a3cb1 [OpenMP] Avoid reading uninitialized parallel level values
In a last minute change request for a2dbfb6b72 we introduced a
read of the uninitialized parallel level value in SPMD-mode.
We go back to initializing the array early and checking for an
adjusted level.

Found by the miniqmc unit tests:
  https://cdash.qmcpack.org/CDash/viewTest.php?onlyfailed&buildid=203434

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D101123
2021-04-23 11:21:58 -05:00
Joseph Huber 59b6849012 [OpenMP] Replace global InfoLevel with a reference to an internal one.
Summary:
This patch improves the implementation of D100774 by replacing the global
variable introduced with a function that returns a reference to an internal
one. This removes the need to define the variable in every plugin that uses it.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D101102
2021-04-23 09:43:46 -04:00
Joseph Huber 2b6f20082e [OpenMP] Add function for setting LIBOMPTARGET_INFO at runtime
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this will require an extern function, but will eventually be
added into an auxilliary library for OpenMP support functions.

This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D100774
2021-04-22 12:48:11 -04:00
Alexey Bataev ca70512099 [OPENMP]Mark test as unsupported to avoid possible unexpected passes,
NFC.
2021-04-22 08:06:25 -07:00
Giorgis Georgakoudis a2dbfb6b72 [OpenMP] Simplify offloading parallel call codegen
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.

Reviewed By: jdoerfert, Meinersbur

Differential Revision: https://reviews.llvm.org/D95976
2021-04-21 18:46:07 -07:00
Alexey Bataev 079884225a [OPENMP]Fix PR49698: OpenMP declare mapper causes segmentation fault.
The implicitly generated mappings for allocation/deallocation in mappers
runtime should be mapped as implicit, also no need to clear member_of
flag to avoid ref counter increment. Also, the ref counter should not be
incremented for the very first element that comes from the mapper
function.

Differential Revision: https://reviews.llvm.org/D100673
2021-04-21 10:38:31 -07:00
Peyton, Jonathan L 4457565757 [OpenMP] Implement GOMP task reductions
Implement the remaining GOMP_* functions to support task reductions
in taskgroup, parallel, loop, and taskloop constructs.  The unused mem
argument to many of the work-sharing constructs has to do with the
scan() directive/ inscan() modifier.  If mem is set, each function
will call KMP_FATAL() and tell the user scan/inscan is unsupported.  The
GOMP reduction implementation is kept separate from our implementation
because of how GOMP presents reduction data and computes the reductions.
GOMP expects the privatized copies to be present even after a #pragma
omp parallel reduction(task:...) region has ended so the data is stored
inside GOMP's uintptr_t* data pseudo-structure.  This style is tightly
coupled with GCC compiler codegen.  There also isn't any init(),
combiner(), fini() functions in GOMP's codegen so the two
implementations were to disparate to try to wrap GOMP's around our own.

Differential Revision: https://reviews.llvm.org/D98806
2021-04-16 16:36:31 -05:00
Peyton, Jonathan L 5ebbb366c4 [OpenMP] Allow affinity to re-detect for child processes
Current atfork() handler for child processes does not reset
the affinity masks array which prevents users from setting their own
affinity in child processes.

Differential Revision: https://reviews.llvm.org/D99218
2021-04-16 16:34:02 -05:00
Hansang Bae 9b98497b44 [OpenMP] Add omp_target_is_accessible() to header files
-- Added omp_target_is_accessible to the header files
-- Added missing const qualifier to device memory routines

Differential Revision: https://reviews.llvm.org/D100420
2021-04-16 07:54:15 -05:00
Joseph Huber 83d4b2e2e0 [OpenMP] Add info for device table changes
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D100600
2021-04-15 18:39:48 -04:00
Hansang Bae 77dc7b4653 [OpenMP] Fix printing routine for OMP_TOOL_VERBOSE_INIT
Also fixed typo in the verbose message.

Differential Revision: https://reviews.llvm.org/D100414
2021-04-14 07:55:26 -05:00
Hansang Bae 3da61ddae7 [OpenMP] Define omp_is_initial_device() variants in omp.h
omp_is_initial_device() is marked as a built-in function in the current
compiler, and user code guarded by this call may be optimized away,
resulting in undesired behavior in some cases. This patch provides a
possible fix for such cases by defining the routine as a variant
function and removing it from builtin list.

Differential Revision: https://reviews.llvm.org/D99447
2021-04-06 16:58:01 -05:00
Peyton, Jonathan L 2aebb7cb3c [OpenMP] Fix incorrect KMP_STRLEN() macro
The second argument to the strnlen_s(str, size) function should be
sizeof(str) when str is a true array of characters with known size
(instead of just a char*). Use type traits to determine if first
parameter is a character array and use the correct size based on that
trait.

Differential Revision: https://reviews.llvm.org/D98209
2021-04-05 09:03:09 -05:00
Joseph Huber 0af4e74aef [OpenMP][NFC] Fix typo in libomptarget error message
Summary:
There was a typo suggesting the user to use `LIBOMPTARGET_DEBUG` instead of
`LIBOMPTARGET_INFO`
2021-04-01 12:45:28 -04:00
Joseph Huber 29338459fb [OpenMP] Trim error messages in CUDA plugin
Summary:
Remove some of the error messages printed when the CUDA plugin fails. The current error messages can be confusing because they are the first error messages printed after the async stream finds an error. This means that the printed values aren't related to what caused the issue, but are simply the last asyncronous operation that succeeded on the device. Remove these as they can be misleading.

Reviewers: jdoerfert

Differential Revision: https://reviews.llvm.org/D99510
2021-03-29 12:20:19 -04:00
Alexey Bataev 0411b23319 [OPENMP]Map data field with l-value reference types.
Added initial support dfor the mapping of the data members with l-value
reference types.

Differential Revision: https://reviews.llvm.org/D98812
2021-03-29 07:07:09 -07:00
Joseph Huber 16064e71e9 [OpenMP] Reset async stream properly upon failure
Summary:
If the call to `synchronize` fails, it will currently block the stream indefinitely if execution is continued from this point. Additionally, if the program exits it will trigger an assertion on the non-null value of the async queue and prevent the runtime from printing debugging information.

Reviewers: jdoerfert

Differential Revision: https://reviews.llvm.org/D99443
2021-03-26 19:05:06 -04:00
Hansang Bae 467f39249d [OpenMP] Misc. changes that add or remove pointer/bound checks
-- Added or moved checks to appropriate places.
-- Removed ineffective null check where the pointer is already being
   dereferenced around the code.
-- Initialized variables that can be used without definitions.
-- Added call to dlclose/FreeLibrary in OMPT tool activation.
-- Added a new build compiler definition.

Differential Revision: https://reviews.llvm.org/D98584
2021-03-23 18:55:08 -05:00
Shilei Tian 2df65f87c1 [OpenMP] Fixed a crash in hidden helper thread
It is reported that after enabling hidden helper thread, the program
can hit the assertion `new_gtid < __kmp_threads_capacity` sometimes. The root
cause is explained as follows. Let's say the default `__kmp_threads_capacity` is
`N`. If hidden helper thread is enabled, `__kmp_threads_capacity` will be offset
to `N+8` by default. If the number of threads we need exceeds `N+8`, e.g. via
`num_threads` clause, we need to expand `__kmp_threads`. In
`__kmp_expand_threads`, the expansion starts from `__kmp_threads_capacity`, and
repeatedly doubling it until the new capacity meets the requirement. Let's
assume the new requirement is `Y`.  If `Y` happens to meet the constraint
`(N+8)*2^X=Y` where `X` is the number of iterations, the new capacity is not
enough because we have 8 slots for hidden helper threads.

Here is an example.
```
#include <vector>

int main(int argc, char *argv[]) {
  constexpr const size_t N = 1344;
  std::vector<int> data(N);

#pragma omp parallel for
  for (unsigned i = 0; i < N; ++i) {
    data[i] = i;
  }

#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }

  return 0;
}
```
My CPU is 20C40T, then `__kmp_threads_capacity` is 160. After offset,
`__kmp_threads_capacity` becomes 168. `1344 = (160+8)*2^3`, then the assertions
hit.

Reviewed By: protze.joachim

Differential Revision: https://reviews.llvm.org/D98838
2021-03-18 18:25:36 -04:00
Jon Chesterfield 626a31de15 [libomptarget] Add register usage info to kernel metadata
Add register usage information to the runtime metadata so that it can be used during kernel launch (that change will be in a different commit). Add this information to the kernel trace.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D98829
2021-03-18 17:00:42 +00:00
Jon Chesterfield dbf8f2b089 Revert "[libomptarget] Build amdgcn devicertl by default"
This reverts commit e23f3502d9.
It broke the build of openmp for clang built without amdgcn
support. D98746, under review, would allow this to reland.
2021-03-17 11:34:44 +00:00
Hansang Bae a6f9cb6adc [OpenMP] Add runtime interface for OpenMP 5.1 error directive
The proposed new interface is for supporting `at(execution)` clause in the
error directive.

Differential Revision: https://reviews.llvm.org/D98448
2021-03-16 08:55:25 -05:00
Johannes Doerfert 0a954a528b [OpenMP][FIX] Repair accidental replacement of _shfl_sync with _shfl
This was broken accidentally in D95752.

Reviewed By: ye-luo

Differential Revision: https://reviews.llvm.org/D98677
2021-03-15 22:46:00 -05:00
Jon Chesterfield e23f3502d9 [libomptarget] Build amdgcn devicertl by default
[libomptarget] Build amdgcn devicertl by default

The cmake for this looks for an llvm install and does the right thing when
building as part of enable_runtimes. It will probably do the right thing
in other settings - at least, it won't try to build this with gcc.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D98658
2021-03-15 23:17:50 +00:00