Commit Graph

1870 Commits

Author SHA1 Message Date
Jon Chesterfield a5f4074d85 [libomptarget][amdgpu] Macro for accessing GPU variables from plugin
Lets the amdgpu plugin write to omptarget_device_environment
to enable debugging. Intend to use in the near future to record the
wavesize that a given deviceRTL was compiled with for running on hardware
that supports 32 or 64.

Patch sets all the attributes that are useful. Notably .data means the variable
is set by writing to host memory before copying to the GPU instead of launching
a kernel to update the image. Can simplify the plugin slightly to drop the
code for patching after load if this is used consistently.

NFC on nvptx, cuda plugin seems to work fine without any annotations.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108698
2021-08-26 17:28:18 +01:00
Jon Chesterfield ba0af885e7 [libomptarget][amdgpu][nfc] Make grid value access match devicertl 2021-08-25 15:11:19 +01:00
Jon Chesterfield 9b2c6c07b5 [libomptarget][amdgpu] Refactor debug printing
Move most debug printing in rtl.cpp behind DP() macro
Adjust the print output for gpu arch mismatch when the architectures match
Convert an assert into graceful failure

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108562
2021-08-25 14:57:51 +01:00
Jon Chesterfield ba8547775b [libomptarget][amdgpu] Fix debug build from D104696 2021-08-25 01:27:51 +01:00
Michael Kruse 1275ee3041 [OpenMP][amdgcn] Don't use in-tree clang if not available.
The use of `$<TARGET_FILE:clang>` was adapted too broadly from D101265.

Fixes llvm.org/PR51579

Also see discussion in D108534.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D108640
2021-08-24 12:50:49 -05:00
Pushpinder Singh 9b8b7c1180 [AMDGPU][Libomptarget] Delete g_atl_machine global
With uses of g_atl_machine gone, a significant portion of dead
code has been removed.

This patch depends on D104691 and D104695.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D104696
2021-08-24 07:59:40 +00:00
Jon Chesterfield d26000e4cc [openmp][devicertl] Freestanding nvptx via stub printf
Compiled nvptx devicertl as freestanding, breaking the
dependency on host glibc and gcc-multilibs. Thus build it by default.

Comes at the cost of #defining out printf. Tried mapping it onto
__builtin_printf but that gets transformed back to printf instead
of hitting the cuda/openmp lowering transform.

Printf could be preserved by one of:
- dropping all the standard headers and ffreestanding
- providing a header only printf implementation
- changing the compiler handling of printf

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D108349
2021-08-23 23:07:47 +01:00
Jon Chesterfield 842f875c8b [openmp] Use llvm GridValues from devicertl
Add include path to the cmakefiles and set the target_impl enums
from the llvm constants instead of copying the values.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108391
2021-08-23 20:25:24 +01:00
Peyton, Jonathan L d39d3a327b [OpenMP][test] fix omp_get_wtime.c test to be more accommodating
The omp_get_wtime.c test fails intermittently if the recorded times are
off by too much which can happen when many tests are run in parallel.

Instead of failing if one timing is a little off, take average of 100
timings minus the 10 worst.

Differential Revision: https://reviews.llvm.org/D108488
2021-08-23 08:13:42 -05:00
Vignesh Balasubramanian 589519b9ab [OpenMP][OMPD]Code movement required for OMPD
These changes don't come under OMPD guard as it is a movement of existing code to capture parallel behavior correctly.
"Runtime Entry Points for OMPD" like "ompd_bp_parallel_begin" and "ompd_bp_parallel_begin" should be placed at the correct execution point for the debugging tool to access proper handles/data.
Without the below changes, in certain cases, debugging tool will pick the wrong parallel and task handle.

Reviewed By: @hbae
Differential Revision: https://reviews.llvm.org/D100366
2021-08-20 14:36:22 +05:30
Shilei Tian 1d8d43ae61 [OpenMP] Use `__kmpc_give_task` in `__kmp_push_task` when encountering a hidden helper task
This patch replaces the current implementation, overwrites `gtid` and `thread`,
with `__kmpc_give_task`.

Reviewed By: AndreyChurbanov

Differential Revision: https://reviews.llvm.org/D106977
2021-08-19 20:49:29 -04:00
Joachim Protze 4bb36df144 [libomptarget][amdcgn] Add build dependency for llvm-link and opt
D107156 and D107320 are not sufficient when OpenMP is built as llvm runtime
(LLVM_ENABLE_RUNTIMES=openmp) because dependencies only work within the same
cmake instance.

We could limit the dependency to cases where libomptarget/plugins are really
built. But compared to the whole llvm project, building openmp runtime is
negligible and postponing the build of OpenMP runtime after the dependencies
are ready seems reasonable.

The direct dependency introduced in D107156 and D107320 is necessary for the
case where OpenMP is built as llvm project (LLVM_ENABLE_PROJECTS=openmp).

Differential Revision: https://reviews.llvm.org/D108404
2021-08-20 01:57:58 +02:00
Jennifer Yu c274b19866 Add implicit map for a list item appears in a reduction clause.
A new rule is added in 5.0:
If a list item appears in a reduction, lastprivate or linear clause
on a combined target construct then it is treated as if it also appears
in a map clause with a map-type of tofrom.

Currently map clauses for all capture variables are added implicitly.
But missing for list item of expression for array elements or array
sections.

The change is to add implicit map clause for array of elements used in
reduction clause. Skip adding map clause if the expression is not
mappable.
Noted: For linear and lastprivate, since only variable name is
accepted, the map has been added though capture variables.

To do so:
During the mappable checking, if error, ignore diagnose and skip
adding implicit map clause.

The changes:
1> Add code to generate implicit map in ActOnOpenMPExecutableDirective,
   for omp 5.0 and up.
2> Add extra default parameter NoDiagnose in ActOnOpenMPMapClause:
Use that to skip error as well as skip adding implicit map during the
mappable checking.

Note: there are only tow places need to be check for NoDiagnose. Rest
of them either the check is for < omp 5.0 or the error already generated for
reduction clause.

Differential Revision: https://reviews.llvm.org/D108132
2021-08-19 12:53:47 -07:00
Jon Chesterfield ad0f6e1d98 [openmp] Disable the tests that block CI for amdgpu and host offloading. 2021-08-19 20:43:30 +01:00
Jon Chesterfield 6c75ce1b8b [libomptarget][nfc] Move lanemask_t type into target_impl.h 2021-08-19 18:50:03 +01:00
Jon Chesterfield 77579b99e9 [openmp][nfc] Replace OMPGridValues array with struct
[nfc] Replaces enum indices into an array with a struct. Named the
fields to match the enum, leaves memory layout and initialization unchanged.

Motivation is to later safely remove dead fields and replace redundant ones
with (compile time) computation. It should also be possible to factor some
common fields into a base and introduce a gfx10 amdgpu instance with less
duplication than the arrays of integers require.

Reviewed By: ronlieb

Differential Revision: https://reviews.llvm.org/D108339
2021-08-19 13:25:42 +01:00
Jon Chesterfield f420939b82 [libomptarget] Apply D106710 to amdgcn devicertl 2021-08-19 01:34:33 +01:00
Jon Chesterfield c480792b6a [libomptarget][nfc][devicertl] Delete unused enums 2021-08-19 00:14:34 +01:00
Jon Chesterfield 21d91a8ef3 [libomptarget][devicertl] Replace lanemask with uint64 at interface
Use uint64_t for lanemask on all GPU architectures at the interface
with clang. Updates tests. The deviceRTL is always linked as IR so the zext
and trunc introduced for wave32 architectures will fold after inlining.

Simplification partly motivated by amdgpu gfx10 which will be wave32 and
is awkward to express in the current arch-dependant typedef interface.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108317
2021-08-18 20:47:33 +01:00
Joseph Huber edb8acdc6e [Libomptarget] Correctly default to Generic if exec_mode is not present
Currently, the runtime returns an error when the `exec_mode` global is
not present. The expected behvaiour is that the region will default to
Generic. This prevents global constructors from being called because
they do not contain execution mode globals.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D108255
2021-08-18 11:24:28 -04:00
Martin Storsjö f5616a981c [OpenMP] Fix the usage of sscanf on MinGW
KMP_SSCANF only evaluates to sscanf_s within
    #if KMP_OS_WINDOWS && KMP_MSVC_COMPAT
so we need to pass the sscanf_s specific parameters within a similar
condition.

Differential Revision: https://reviews.llvm.org/D108196
2021-08-17 21:36:09 +03:00
Peyton, Jonathan L b4a1f441d9 [OpenMP] Add a few small fixes
* Add comment to help ensure new construct data are added in two places
* Check for division by zero in the loop worksharing code
* Check for syntax errors in parrange parsing

Differential Revision: https://reviews.llvm.org/D105929
2021-08-16 10:02:49 -05:00
Peyton, Jonathan L 6eeb4c1f32 [OpenMP] Fix incorrect parameters to sscanf_s call
On Windows, the documentation states that when using sscanf_s,
each %c and %s specifier must also have additional size parameter.
This patch adds the size parameter in the one place where %c is
used.

Differential Revision: https://reviews.llvm.org/D105931
2021-08-16 09:59:21 -05:00
AndreyChurbanov 52cac541d4 [OpenMP] libomp: cleanup: minor fixes to silence static analyzer.
Added couple more checks to silence KlocWork static code analyzer.

Differential Revision: https://reviews.llvm.org/D107348
2021-08-16 13:39:23 +03:00
AndreyChurbanov f94da67f49 [OpenMP][NFC] libomp: reduced timeouts in the test from 50 to 2 sec. 2021-08-11 17:58:52 +03:00
George Rokos df06ec3057 [libomptarget][NFC] Fix compilation issue with GCC
Removed redundant assignment from condition which causes gcc to emit the following error:

error: operation on ‘MoveData’ may be undefined [-Werror=sequence-point]
2021-08-10 09:43:43 -07:00
Joel E. Denny 2ced1f338a [OpenMP][NFC] Simplify targetDataEnd conditions for CopyMember
targetDataEnd and targetDataBegin compute CopyMember/copy differently,
and I don't see why they should.  This patch eliminates one of those
differences by making a simplifying NFC change to targetDataEnd.

The change is NFC as follows.  The change only affects the case when
`!UNIFIED_SHARED_MEMORY || HasCloseModifier`.  In that case, the
following points are always true:

* The value of CopyMember is relevant later only if DelEntry = false.
* DelEntry = false only if one of the following is true:
    * IsLast = false.  In this case, it's always true that CopyMember
      = false = IsLast.
    * `MEMBER_OF && !PTR_AND_OBJ` is true.  In this case, CopyMember =
      IsLast.
* Thus, if CopyMember is relevant, CopyMember = IsLast.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D105990
2021-08-10 12:29:55 -04:00
Pirama Arumuga Nainar 49fabd9d76 [openmp] Do not use shared memory on Android
Android provides ashmem/ASharedMemory support on newer releases, which
we can use if requested by openmp users on Android.

Also refactor the preprocessor check for using shared memory to
kmp_config.h.cmake.

Differential Revision: https://reviews.llvm.org/D107181
2021-08-09 09:41:32 -07:00
Dimitry Andric 400cd6d2f0 [libomptarget][amdgpu] use --allow-shlib-undefined to link on FreeBSD
On FreeBSD, the `environ` symbol is undefined at link time for shared
libraries, but resolved by the dynamic linker at runtime. Therefore,
allow the symbol to be undefined when creating a shared library, by
using the `--allow-shlib-undefined` linker flag, instead of `-z defs`
(a.k.a `--no-undefined`).

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D107698
2021-08-08 13:52:44 +02:00
Ye Luo 262289c103 [OpenMP] mark target task untied
OpenMP specification Tasking Terminology
target task :A mergeable and untied task that ...

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D107686
2021-08-07 12:31:20 -04:00
Dimitry Andric 71ae2e0221 [libomptarget][amdgpu] don't declare Elf_Note on FreeBSD
On FreeBSD, the system `<libelf.h>` already declares `struct Elf_Note`
indirectly (via `<sys/elf_common.h>`). This results in compile errors
when building the libomptarget amdgpu plugin. Avoid redeclaring `struct
Elf_Note` on FreeBSD to fix the errors.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D107661
2021-08-06 21:45:26 +02:00
Shilei Tian 28939b6ae5 [NFC] Clean up and clang-format openmp/libomptarget/plugins/cuda/src/rtl.cpp 2021-08-05 22:32:28 -04:00
Shilei Tian 680c71b127 [OpenMP] Clean up for hidden helper task
This patch makes some clean up for code of hidden helper task.

Reviewed By: protze.joachim

Differential Revision: https://reviews.llvm.org/D107008
2021-08-04 12:36:44 -04:00
Shilei Tian 9f5d6ea52e [OpenMP] Fix performance regression reported in bug #51235
This patch fixes the "performance regression" reported in https://bugs.llvm.org/show_bug.cgi?id=51235. In fact it has nothing to do with performance. The root cause is, the stolen task is not allowed to execute by another thread because by default it is tied task. Since hidden helper task will always be executed by hidden helper threads, it should be untied.

Reviewed By: protze.joachim

Differential Revision: https://reviews.llvm.org/D107121
2021-08-04 12:34:49 -04:00
Lechen Yu 3bc8ce5dd7 [openmp] Add OMPT initialization in libomptarget
When loading libomptarget, the init function in libomptarget/src/rtl.cpp
will search for the libomptarget_start_tool function using libdl.
libomptarget_start_tool will pass those OMPT callbacks related to target
constructs to libomptarget

Differential Revision: https://reviews.llvm.org/D99803
2021-08-04 18:00:11 +02:00
AndreyChurbanov 8e29b4b323 [OpenMP] libomp: taskwait depend implementation fixed.
Fix for https://bugs.llvm.org/show_bug.cgi?id=49723.
Eliminated references from task dependency hash to node allocated on stack,
thus eliminated accesses to stale memory. So the node now never freed.
Uncommented assertion which triggered when stale memory accessed.
Removed unneeded ref count increment for stack allocated node.

Differential Revision: https://reviews.llvm.org/D106705
2021-08-03 15:45:20 +03:00
Jon Chesterfield 567c8c7bfd [libomptarget][nfc] Only set cuda-path for nvptx tests
Remove --cuda-path=CUDA_TOOLKIT_ROOT_DIR-NOTFOUND
from the invocation of non-nvptx test cases. Better signal
to noise ratio on other architectures.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D107074
2021-07-30 23:01:09 +01:00
Jose M Monsalve Diaz 5424ceeda0 [OpenMP] Fixing llvm-omp-device-info compilation with runtimes
When using `-DLLVM_ENABLED_RUNTIMES` instead of `-DLLVM_ENABLED_PROJECTS`
the `llvm-omp-device-info` tool is not compiled or installed.
In general, no llvm tool would be build on runtimes, because the
-DLLVM_BUILD_TOOLS flag is removed by the way runtimes compilation calls
cmake again.

This patch is simple. Just forward the value of this flag to the
runtime cmake command.

I'm also removing an unnecessary comment in the compilation of the tool

Differential Revision: https://reviews.llvm.org/D107177
2021-07-30 13:09:08 -05:00
Shilei Tian 36d53af4a9 [OpenMP][Offloading] Remove task wait in nowait interfaces
All `nowait` series of interfaces in `libomptarget` accept four more arguments (`int32_t depNum, void *depList, int32_t noAliasDepNum, void *noAliasDepList`) compared with their counterparts w/o `nowait`. These extra arguments were expected for dependence resolution, potentially lowered to device side. Current implementation calls `libomp` function `__kmpc_omp_taskwait`. However, the front end simply ignores them, that these four arguments are not emitted at all. As a consequence, the `depNum` and `noAliasDepNum` are garbage, which could lead to unnecessary task wait.

Reviewed By: grokos

Differential Revision: https://reviews.llvm.org/D107164
2021-07-30 11:39:46 -04:00
AndreyChurbanov 8b81524c6d [OpenMP][NFC] libomp: silence warnings on unused variables.
Put declarations/definitions of unused variables under corresponding macros
to silence clang build warnings.

Differential Revision: https://reviews.llvm.org/D106608
2021-07-30 17:04:42 +03:00
Joachim Protze 4ffa1478fd [libomptarget][amdcgn] Add build dependency for opt
This patch should fix the build we observe when building LLVM from scratch.

Differential Revision: https://reviews.llvm.org/D107156
2021-07-30 15:45:13 +02:00
Terry Wilmarth d8e4cb9121 [OpenMP] libomp: Add new experimental barrier: two-level distributed barrier
Two-level distributed barrier is a new experimental barrier designed
for Intel hardware that has better performance in some cases than the
default hyper barrier.

This barrier is designed to handle fine granularity parallelism where
barriers are used frequently with little compute and memory access
between barriers. There is no need to use it for codes with few
barriers and large granularity compute, or memory intensive
applications, as little difference will be seen between this barrier
and the default hyper barrier. This barrier is designed to work
optimally with a fixed number of threads, and has a significant setup
time, so should NOT be used in situations where the number of threads
in a team is varied frequently.

The two-level distributed barrier is off by default -- hyper barrier
is used by default. To use this barrier, you must set all barrier
patterns to use this type, because it will not work with other barrier
patterns. Thus, to turn it on, the following settings are required:

KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
KMP_PLAIN_BARRIER_PATTERN=dist,dist
KMP_REDUCTION_BARRIER_PATTERN=dist,dist

Branching factors (set with KMP_FORKJOIN_BARRIER, KMP_PLAIN_BARRIER,
and KMP_REDUCTION_BARRIER) are ignored by the two-level distributed
barrier.

Patch fixed for ITTNotify disabled builds and non-x86 builds

Co-authored-by: Jonathan Peyton <jonathan.l.peyton@intel.com>
Co-authored-by: Vladislav Vinogradov <vlad.vinogradov@intel.com>

Differential Revision: https://reviews.llvm.org/D103121
2021-07-29 14:09:26 -05:00
Joachim Protze 4acc2f29a2 [OpenMP][Tools][Tests][NFC] Address flaky archer tests
Adding more concurrent threads significantly increases the
chance that the data race can be observed during testing.
2021-07-29 17:56:44 +02:00
Jon Chesterfield a90da62adb [libomptarget][amdgpu] Update printed plugin name 2021-07-29 14:46:42 +01:00
Jose M Monsalve Diaz 88e66fa60a [OpenMP] Fixing missing variables when CUDA SDK not in system
This patch fixes the error reported in D106751. When there is no CUDA SDK
installed in the system, the build fails due to missing `CU_DEVICE_ATTRIBUTE`
variables.

Using @zsrkmyn sugested fix

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106933
2021-07-27 23:46:15 -05:00
Jose M Monsalve Diaz 313c523995 [OpenMP][Tool] Introducing the `llvm-omp-device-info` tool
This patch introduces the `llvm-omp-device-info` tool, which uses the
omptarget library and interface to query the device info from all the
available devices as seen by OpenMP. This is inspired by PGI's `pgaccelinfo`

Since omptarget usually requires a description structure with executable
kernels, I split the initialization of the RTLs and Devices to be able to
initialize all possible devices and query each of them.

This revision relies on the patch that introduces the print device info.

A limitation is that the order in which the devices are initialized, and the
corresponding device ID is not necesarily the one seen by OpenMP.

The changes are as follows:
1. Separate the RTL initialization that was performed in `RegisterLib` to its own `initRTLonce` function
2. Create an `initAllRTLs` method that initializes all available RTLs at runtime
3. Created the `llvm-deviceinfo.cpp` tool that uses `omptarget` to query each device and prints its information.

Example Output:
```
Device (0):
    print_device_info not implemented

Device (1):
    print_device_info not implemented

Device (2):
    print_device_info not implemented

Device (3):
    print_device_info not implemented

Device (4):
    CUDA Driver Version:                11000
    CUDA Device Number:                 0
    Device Name:                        Quadro P1000
    Global Memory Size:                 4236312576 bytes
    Number of Multiprocessors:          5
    Concurrent Copy and Execution:      Yes
    Total Constant Memory:              65536 bytes
    Max Shared Memory per Block:        49152 bytes
    Registers per Block:                65536
    Warp Size:                          32 Threads
    Maximum Threads per Block:          1024
    Maximum Block Dimensions:           1024, 1024, 64
    Maximum Grid Dimensions:            2147483647 x 65535 x 65535
    Maximum Memory Pitch:               2147483647 bytes
    Texture Alignment:                  512 bytes
    Clock Rate:                         1480500 kHz
    Execution Timeout:                  Yes
    Integrated Device:                  No
    Can Map Host Memory:                Yes
    Compute Mode:                       DEFAULT
    Concurrent Kernels:                 Yes
    ECC Enabled:                        No
    Memory Clock Rate:                  2505000 kHz
    Memory Bus Width:                   128 bits
    L2 Cache Size:                      1048576 bytes
    Max Threads Per SMP:                2048
    Async Engines:                      Yes (2)
    Unified Addressing:                 Yes
    Managed Memory:                     Yes
    Concurrent Managed Memory:          Yes
    Preemption Supported:               Yes
    Cooperative Launch:                 Yes
    Multi-Device Boars:                 No
    Compute Capabilities:               61
```

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D106752
2021-07-27 22:38:35 -04:00
Jose M Monsalve Diaz d2f85d0910 [OpenMP][Libomptarget] Adding `print_device_info` to RTL and `omptarget`
This patch introduces a function in the device's plugin to print the
device information. This patch relates to another patch that introduces
a CLI tool to obtain the device information from the omplibrary directly.
It is inspired by PGI's pgaccelinfo.

The modifications are as follows:
1. Introduce the optional `void __tgt_rtl_print_device_info(RTLdevID)` function into the RTL.
2. Introduce the `bool __tgt_print_device_info(devID)` function into `omptarget` interface. Returns false if the RTL is not implemented
3. Added `bool printDeviceInfo(RTLDevID)` to the `DeviceTy`
4. Implement the `__tgt_rtl_print_device_info` for CUDA. Added additional CUDA Runtime calls.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106751
2021-07-27 21:47:57 -04:00
Jose M Monsalve Diaz 5ab6aedda9 [OpenMP] Folding threadLimit and numThreads when single value in kernels
The device runtime contains several calls to `__kmpc_get_hardware_num_threads_in_block`
and `__kmpc_get_hardware_num_blocks`. If the thread_limit and the num_teams are constant,
these calls can be folded to the constant value.

In this patch we use the already introduced `AAFoldRuntimeCall` and the `NumTeams` and
`NumThreads` kernel attributes (to be introduced in a different patch) to fold these functions.
The code checks all the kernels, and if their attributes match, the functions are folded.

In the future we will explore specializing for multiple values of NumThreads and NumTeams.

Depends on D106390

Reviewed By: jdoerfert, JonChesterfield

Differential Revision: https://reviews.llvm.org/D106033
2021-07-27 21:47:12 -04:00
Johannes Doerfert ed7ec860f0 [OpenMP] Improve alignment handling in the new device runtime 2021-07-27 17:50:27 -05:00
Joseph Huber e3ee76245e [Libomptarget] Revert new variable sharing to use the old method
The new method of sharing variables introduces a `__kmpc_alloc_shared` call
that cannot be removed in the middle end because of its non-constant argument
and unconnected free. This patch reverts this to the old method that used a
static amount of shared memory for sharing variables.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106905
2021-07-27 18:14:01 -04:00