Several variables were left unused as a result of different patches removing
their use.
Two variables have some use:
`poll_count` is used by the KMP_BLOCKING macro only under certain conditions.
Adding (void) to tell the compiler to ignore the unused variable.
`padding` is a dummy stack allocation with no intent to be used. Also adding
(void) to make the compiler ignore the unused variable.
Differential Revision: https://reviews.llvm.org/D104303
* Add GOMP versioned pause functions
* Add GOMP versioned affinity format functions
To do the affinity format functions, only attach versioned symbols
to the APPEND Fortran entries (e.g., omp_set_affinity_format_) since
GOMP only exports two symbols (one for Fortran, one for C). Our
affinity format functions have three symbols.
e.g., with omp_set_affinity_format:
1) omp_set_affinity_format (Fortran interface)
2) omp_set_affinity_format_ (Fortran interface)
3) ompc_set_affinity_format (C interface)
Have the GOMP version of the C symbol alias the ompc_* 3) version
instead of the Fortran unappended version 1).
Differential Revision: https://reviews.llvm.org/D103647
Remove strange checks for syscall() arguments where mask is NULL.
Valgrind reports these as error usages for the syscall.
Instead, just check if CACHE_LINE bytes is long enough. If not, then
search for the size. Also, by limiting the first size detection
attempt to CACHE_LINE bytes, instead of 1MB, we don't use more than one
cache line for the mask size. Before this patch, sometimes the returned
mask size was 640 bytes (10 cache lines) because the initial call to
getaffinity() was limited only by the internal kernel mask size
which can be very large.
Differential Revision: https://reviews.llvm.org/D103637
Lazily set affinity for root threads. Previously, the root thread
executing middle initialization would attempt to assign affinity
to other existing root threads. This was not working properly as the
set_system_affinity() function wasn't setting the affinity for the
target thread. Instead, the middle init thread was resetting the
its own affinity using the target thread's affinity mask.
Differential Revision: https://reviews.llvm.org/D103625
This patch includes some changes which deletes the code accessing
g_atl_machine global. Some accesses related to memory_pools are
still remaining.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103813
The current handling of dependencies in Archer has two flaws:
- annotation of dependency synchronization is not limited to sibling tasks
- annotation of in/out dependencies is based on the assumption, that dependency
variables will rarely be byte-sized variables.
This patch introduces a map in the generating task to manage the dependency
variables for the child tasks. The map is only accesses from the generating
task, so no locking is necessary. This also limits the dependency-based
synchronization to sibling tasks.
This patch also introduces proper handling for new dependency types such as
mutexinoutset and inoutset.
Differential Revision: https://reviews.llvm.org/D103608
The main motivation for reusing objects is that it helps to avoid creating and
leaking synchronization clocks in TSan. The reused object will reuse the
synchronization clock in TSan.
Before, new and delete operators were overloaded to get and return memory for
the object from/to the object pool.
This patch replaces the operator overloading with explicit static New/Delete
functions.
Objects for parallel regions and implicit tasks will always be recruited and
returned to the thread-local object pool. Only for explicit task, there is a
chance that an other thread completes the task and will free the object. This
patch optimizes the thread-local New/Delete calls by avoiding locks and only
lock if the pool is empty. Remote threads return the object into a separate
queue.
The chunk size for allocations is now decided based on page size. The objects
will also be aligned to cache lines avoiding false sharing.
This is the first patch in a series to provide better tasking support.
Differential Revision: https://reviews.llvm.org/D103606
Archer uses weak symbol overloads of TSan functions to enable loading the tool
even if the application is not built with TSan. For MACOS the tool collects
the function pointer at runtime.
When adding the function entry/exit markers, we missed to add the functions
in the MACOS codepath.
This patch also replaces the repeated function lookup by a single initial
function lookup and fixes the disabling logic in RunningOnValgrind.
Differential Revision: https://reviews.llvm.org/D103607
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
This is the first of seven patches that implements OMPD, a debugging interface to support debugging of OpenMP programs.
It contains support code required in "openmp/runtime" for OMPD implementation.
Reviewed By: @hbae
Differential Revision: https://reviews.llvm.org/D100181
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
Size of type of the dependence flag changed from 1 to 4 bytes in clang.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
The ident_t * argument in __kmp_get_monotonicity was being used without
a customary NULL check, causing the function to crash in a Debug build.
Release builds were not affected thanks to dead store elimination.
This global struct used to hold various flags for monitoring the
initialization of hsa.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103795
Previous logic was to always use the first kernarg pool found to allocate
kernel args. This patch changes this to use only the kernarg pool which
has non-zero size. This logic is also reworked to not use any globals.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103600
Nesting mode is a new experimental feature in the OpenMP
runtime. It allows a user to set up nesting for an application in a
way that corresponds to the hardware topology levels on the machine an
application is being run on. For example, if a machine has 2 sockets,
each with 12 cores, then use of nesting mode could set up an outer
level of nesting that uses 2 threads per parallel region, and an inner
level of nesting that uses 12 threads per parallel region.
Nesting mode is controlled with the KMP_NESTING_MODE environment
variable as follows:
1) KMP_NESTING_MODE = 0: Nesting mode is off (default); max-active-levels-var
is set to 1 (the default -- nesting is off, nested parallel regions
are serialized).
2) KMP_NESTING_MODE = 1: Nesting mode is on, and a number of threads
will be assigned for each level discovered in the machine topology;
max-active-levels-var is set to the number of levels discovered.
3) KMP_NESTING_MODE = n, n>1: [Note: this option is experimental and may change
or be removed in the future.] Nesting mode is on, and a number of
threads will be assigned for each topology level discovered on the
machine, up to k<=n levels (since there may be fewer than n levels
discovered in the topology), and beyond the kth level, nested parallel
regions will be serialized; NOTE: max-active-levels-var is 1 (the default --
nesting is off, and nested parallel regions are serialized until the
user changes max-active-levels-var.
If the user sets OMP_NUM_THREADS or OMP_MAX_ACTIVE_LEVELS, they will
override KMP_NESTING_MODE settings for the associated environment
variables. The detected topology may be limited by an affinity mask
setting on the initial thread, or if the user sets KMP_HW_SUBSET. See
also: KMP_HOT_TEAMS_MAX_LEVEL for controlling use of hot teams for
nested parallel regions. Note that this feature only sets numbers of
threads used at nesting levels. The user should make use of
OMP_PLACES and OMP_PROC_BIND or KMP_AFFINITY for affinitizing those
threads, if desired.
Differential Revision: https://reviews.llvm.org/D102188
When on KNL and L2 or Tile layer is detected, manually add
the corresponding layer which is equivalent.
Differential Revision: https://reviews.llvm.org/D102865
Turns out the only purpose of this class was verify if device ID
was in range or not which could be done easily by using g_atl_machine.
Still getting rid of g_atl_machine is pending which would be done in
a later patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103443
This struct was used to specify the device on which memory was
being allocated/free in atmi_malloc/free. It has now been replaced
with int DeviceId.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103239
Refactor suggested in D103037 to help avoid similar copy-paste errors.
Change is mechanical. Some parts of this would be more robust with unsigned.
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D103090
Suggested in D103059. Use a single lookup instead of two, more const, less mutation.
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D103093
ATMI_STATUS_UNKNOWN was unused, deleted references to it.
Replaced ATMI_STATUS_{SUCCESS,ERROR} with HSA_STATUS_{SUCCESS,ERROR}
Replaced atmi_status_t with hsa_status_t
Otherwise no change. In particular, conversions between atmi_status_t and
hsa_status_t will now be conversions between hsa_status_t and itself.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D103115
This patch drops g_atmi_initialized and inlines the Initialize &
Finalize methods from Runtime class.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102847
Two globals KernelInfoTable & SymbolInfoTable are moved
into RTLDeviceInfoTy class.
This builds on the top of D102691.
[2/2]
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102692
[libomptarget][nfc] Move hostcall required test to rtl
Remove a global, fix minor race. First of N patches to bring up hostcall.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103058
Fix the case where NumTeams was set incorrectly instead of NumThreads
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103037
KernelNameMap contains entries like "key.kd" => key which clearly
could be replaced by simple logic of removing suffix from the key.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102691
Warnings on deprecated api cannot be suppressed if the library is not initialized.
With this change it is possible to set KMP_WARNINGS=false to suppress the warnings.
Differential Revision: https://reviews.llvm.org/D102676
The check for the TO flag when processing firstprivates is missing. As a result,
sometimes the device copy of a firstprivate never gets initialized. Currectly we
try to force lambda structs to be allocated immediately by marking them as a
non-firstprivate, so that PrivateArgumentManagerTy::addArg allocates memory for
them immediately. However, calling addArg with IsFirstPrivate=false makes the
function skip initializing the device copy. Whether an argument is firstprivate
and whether we need to allocate memory immediately are not synonyms, so this
patch introduces one more control variable for immediate allocation and sets it
apart from initialization.
Differential Revision: https://reviews.llvm.org/D102890
[libomptarget][amdgpu] Mark alloc, free weak to facilitate local experimentation
There are a lot of different ways we might implement the devicertl local alloc
and free functions. Via host, local buffers (stack or arena), specialising per
kernel etc. It is not yet clear what the right design is. This change makes the
alloc and free functions weak, so one can override them from local tests while
comparing options.
Not strictly necessary, as a comparable patch can be applied locally each time,
but would be convenient for out of tree dev. Plan would be to drop the weak
attribute at the same time as introducing a working allocator to trunk.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102499
[libomptarget] Improve dlwrap compile time error diagnostic
The dlwrap interface takes an explict arity, e.g. DLWRAP(cuAlloc, 2);
This probably can't be eliminated as it controls the argument list of an
external symbol, not an inline header function. If the arity given is too
big, the error from clang referring to the line is in the middle of
implementation details.
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1277:7: error: static_assert failed
due to requirement '0UL < tuple_size<std::tuple<>>::value' "tuple index is in range"
static_assert(__i < tuple_size<tuple<>>::value,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:93:27 ...
/home/amd/llvm-project/openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.cpp:34:1: note: in
instantiation of template class 'dlwrap::trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::arg<2>' requested here
DLWRAP(cuMemAlloc, 3);
^
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:51:31: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:166:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:133:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:186:37: ...
If the arity is too small, the diagnostic is better:
cuda/dynamic_cuda/cuda.cpp:34:1: error: too few
arguments to function call, expected 2, have 1
DLWRAP(cuMemAlloc, 1);
This patch changes the diagnostic to:
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '1 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 1);
or
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '3 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 3);
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102858
[libomptarget][amdgpu] Remove majority of fatal errors
Replaces most calls to exit() with returning an error to the library entry
point. Minor changes to error handling for clear bugs, remove some dead code.
Each exit() call site replaced is either in a library entry point or a
function that already returns error codes on some paths. The existing handling
is not well tested but replacing exit() with a fallback path should be a strict
improvement.
Remaining two early exit points are an abort() from a callback and exit() from
within msgpack. Fixes for those are less obvious and left for a later patch.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102346
[libomptarget] Disable test bug49334 on amdgpu
Hangs on amdgpu, do not know why. Disable to unblock build.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D102017
This patch moves g_executables to private member of Runtime class
and is renamed to HSAExecutables following LLVM naming convention.
This movement required making Runtime::Initialize and Runtime::Finalize
non-static. Verified the correctness of this change by running
libomptarget tests on gfx906.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102600
This initial patch removes some unused variables from global namespace.
There will more incoming patches for moving global variables to classes
or static members.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102598
Bug 49356 (https://bugs.llvm.org/show_bug.cgi?id=49356) reports crash in
the test case `tasking/bug_taskwait_detach.cpp`, which is caused by the wrong
function declaration. `gtid` in `__kmpc_omp_task` should be `kmp_int32`.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D102584
[libomptarget][amdgpu] Fix truncation error for partial wavefront
The partial barrier implementation involves one wavefront resetting and N-1
waiting. This change future proofs against launching with a number of threads
that is not a multiple of the wavefront size.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102407
[libomptarget][amdgpu] Convert an assert to print and offload_fail
The kernel launched is supposed to be present in the binary, but a not yet
diagnosed bug means it is missing for some of the qmcpack test cases. Changing
from assert to print and offload_fail should help diagnose that and similar bugs.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102378
Add a `REQUIRES: unified_shared_memory` option to tests that use `#pragma omp requires unified_shared_memory`.
For CUDA, the feature tag is derived from LIBOMPTARGET_DEP_CUDA_ARCH which itself is derived using [[ https://cmake.org/cmake/help/latest/module/FindCUDA.html#commands | cuda_select_nvcc_arch_flags ]]. The latter determines which compute capability the GPU in the system supports. To ensure that this is the CUDA arch being used, we could also set the `-Xopenmp-target -march=` flag.
In the absence of an NVIDIA GPU, LIBOMPTARGET_DEP_CUDA_ARCH will be 35. That is, in that case we are assuming unified_shared_memory is not available. CUDA plugin testing could be disabled entirely in this case, but this currently depends on `LIBOMPTARGET_CAN_LINK_LIBCUDA OR LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA`, not on whether the hardware is actually available.
For all other targets, nothing changes and we are assuming unified shared memory is available. This might need refinement if not the case.
This tries to fix the [[ http://meinersbur.de:8011/#/builders/143 | OpenMP Offloading Buildbot ]] that, although brand-new, only has a Pascal-generation (sm_61) GPU installed. Hence, tests that require unified shared memory are currently failing. I wish I had known in advance.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D101498
[libomptarget][amdgpu][nfc] Expand errorcheck macros
These macros expand to continue, which is confusing, or exit,
which is incompatible with continuing execution on offloading fail.
Expanding the macros in place makes the code look untidy but the
control flow obvious and amenable to improving. In particular, exit
becomes easier to eliminate.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102230
This is the first in a series of changes to the OpenMP runtime
that have been done internally by Microsoft. This patch makes
the necessary changes to enable libomp.dll to build with
the MSVC compiler targeting ARM64.
Differential Revision: https://reviews.llvm.org/D101173
[libomptarget][nfc] Add hook to easily disable building amdgcn bclib
This is useful when building LLVM with a toolchain that can't emit code
for amdgcn, e.g. because it overrides the include search path with headers
from another architecture, or the clang compiler is missing builtins.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102229
When KMP_AFFINITY is set, each worker thread's gtid value is used as an
index into the place list to determine the thread's placement. With hidden
helpers enabled, this gtid value is shifted down leading to unexpected
shifted thread placement. This patch restores the previous behavior by
adjusting the mask index to take the number of hidden helper threads
into account.
Hidden helper threads are given the full initial mask and do not
participate in any of the other affinity mechanisms (place partitioning,
balanced affinity). Their affinity is only printed for debug builds.
Differential Revision: https://reviews.llvm.org/D101882
[libomptarget] Add support for target allocators to dynamic cuda RTL
Follow on to D102000 which introduced new calls into libcuda. This patch adds
the corresponding entry points to dynamic_cuda, fixing the build for systems
that do not have the cuda toolkit installed.
Function types and enum from https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102169
This patch prevents runtime tests running on systems without amdgpu.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D102054
I want to start using LLVM component libraries in libomptarget
to stop duplicating implementations already available in LLVM
(e.g. LLVMObject, LLVMSupport, etc.). Without relying on LLVM
in all libomptarget builds one has to provide fallback implementation
for each used LLVM feature.
This is an attempt to stop supporting out-of-llvm-tree builds of libomptarget.
I understand that I may need to revert this,
if this affects downstream projects in a bad way.
Differential Revision: https://reviews.llvm.org/D101509
Summary:
The allocator interface added in D97883 allows the RTL to allocate shared and
host-pinned memory from the cuda plugin. This patch adds support for these to
the runtime.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D102000
[libomptarget][nfc] Refactor amdgpu partial barrier to simplify adding a second one
D101976 would require a second barrier instance. This NFC to amdgpu makes it
simpler to add one (an extra global, one more line in init). Also renames the
current barrier to L0.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102016
[libomptarget][amdgpu][nfc] Remove dead code from amdgpu plugin
Drops an enum that was identical to a HSA one, localises some functions where
they were only called from one TU. Covers everything internalize + adce can
identify as dead, except for msgpack::dump which is useful when debugging.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102014
This patch does the following:
1) Introduce kmp_topology_t as the runtime-friendly structure (the
corresponding global variable is __kmp_topology) to determine the
exact machine topology which can vary widely among current and future
architectures. The current design is not easy to expand beyond the assumed
three layer topology: sockets, cores, and threads so a rework capable of
using the existing KMP_AFFINITY mechanisms is required.
This new topology structure has:
* The depth and types of the topology
* Ratio count for each consecutive level (e.g., number of cores per
socket, number of threads per core)
* Absolute count for each level (e.g., 2 sockets, 16 cores, 32 threads)
* Equivalent topology layer map (e.g., Numa domain is equivalent to
socket, L1/L2 cache equivalent to core)
* Whether it is uniform or not
The hardware threads are represented with the kmp_hw_thread_t
structure. This structure contains the ids (e.g., socket 0, core 1,
thread 0) and other information grabbed from the previous Address
structure. The kmp_topology_t structure contains an array of these.
2) Generalize the KMP_HW_SUBSET envirable for the new
kmp_topology_t structure. The algorithm doesn't assume any order with
tiles,numa domains,sockets,cores,threads. Instead it just parses the
envirable, makes sure it is consistent with the detected topology
(including taking into account equivalent layers) and then trims away
the unneeded subset of hardware threads. To enable this, a new
kmp_hw_subset_t structure is introduced which contains a vector of
items (hardware type, number user wants, offset). Any keyword within
__kmp_hw_get_keyword() can be used as a name and can be shortened as
well. e.g.,
KMP_HW_SUBSET=1s,2numa,4tile,2c,3t can be used on the KNL SNC-4 machine.
3) Simplify topology detection functions so they only do the singular
task of detecting the machine's topology. Printing, and all
canonicalizing functionality is now done afterwards. So many lines of
duplicated code are eliminated.
4) Add new ll_caches and numa_domains to OMP_PLACES, and
consequently, KMP_AFFINITY's granularity setting. All the names within
__kmp_hw_get_keyword() are available for use in OMP_PLACES or
KMP_AFFINITY's granularity setting.
5) Simplify and future-proof code where explicit lists of allowed
affinity settings keywords inside if() conditions.
6) Add x86 CPUID leaf 4 cache detection to existing x2apic id method
so equivalent caches could be detected (in particular for the ll_caches
place).
Differential Revision: https://reviews.llvm.org/D100997
This enables the runtime tests on amdgpu targets.
10 tests have been marked as XFAIL on amdgcn currently mostly due to
missing printf.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D99656
Getting my feet wet here as a new committer.
Correct misspelling in check-depends.pl.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D101552
If available, use the clang that is already built in the same project as
CUDA compiler unless another executable is explicitly defined. This also
ensures the generated deviceRTL IR will be consistent with the version
of Clang.
This patch is required to reliably test OpenMP offloading in a buildbot
without either a two-stage build (e.g. with LLVM_ENABLE_RUNTIMES) or a
separately installed clang on the worker that will eventually become
outdated.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101265
The OpenMP runtime can be compiled using a CUDA installed at non-default
location with the -DCUDA_TOOLKIT_ROOT_DIR setting. However, check-openmp
will fail afterwards because Clang needs to know where to find the CUDA
headers.
Fix by passing -cuda-path to Clang using the value of
CUDA_TOOLKIT_ROOT_DIR which has been determined by CMake. Also set
LD_LIBRARY_PATH such that it can find the cuda runtime when executing.
This will ensure that the regression test do not depend on the current
environment, but use the environment it was configured for.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101266
This patch fuses the RUN lines for most libomptarget tests. The previous patch
D101315 created separate test targets for each supported offloading triple.
This patch updates the RUN lines in libomptarget tests to use a generic run
line independent of the offloading target selected for the lit instance.
In cases, where no RUN line was defined for a specific offloading target,
the corresponding target is declared as XFAIL. If it turns out that a test
actually supports the target, the XFAIL line can be removed.
Differential Revision: https://reviews.llvm.org/D101326
This patch creates a separate test directory for each offloading target to be
tested. This allows to test multiple architectures in one configuration, while
still see all failing tests separately. The lit test names include the target
triple, so that it will be easier to spot the failing target.
This patch also allows to mark expected failing tests based on the
target-triple, as the currently used triple is added to the lit "features":
```
// XFAIL: nvptx64-nvidia-cuda
```
Differential Revision: https://reviews.llvm.org/D101315
[libomptarget] Enable AMDGPU devicertl
The amdgpu devicertl is written in freestanding openmp and compiles to a
bitcode library (per listed gfx arch) with no unresolved symbols. It requires
a recent clang, preferably the one from the same monorepo checkout.
This is D98658, with printf explicitly stubbed out, after patching clang to no
longer require an llvm with the amdgpu target enabled.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D101213
Summary:
This patch improves the implementation of D100774 by replacing the global
variable introduced with a function that returns a reference to an internal
one. This removes the need to define the variable in every plugin that uses it.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D101102
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this will require an extern function, but will eventually be
added into an auxilliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
The implicitly generated mappings for allocation/deallocation in mappers
runtime should be mapped as implicit, also no need to clear member_of
flag to avoid ref counter increment. Also, the ref counter should not be
incremented for the very first element that comes from the mapper
function.
Differential Revision: https://reviews.llvm.org/D100673
Implement the remaining GOMP_* functions to support task reductions
in taskgroup, parallel, loop, and taskloop constructs. The unused mem
argument to many of the work-sharing constructs has to do with the
scan() directive/ inscan() modifier. If mem is set, each function
will call KMP_FATAL() and tell the user scan/inscan is unsupported. The
GOMP reduction implementation is kept separate from our implementation
because of how GOMP presents reduction data and computes the reductions.
GOMP expects the privatized copies to be present even after a #pragma
omp parallel reduction(task:...) region has ended so the data is stored
inside GOMP's uintptr_t* data pseudo-structure. This style is tightly
coupled with GCC compiler codegen. There also isn't any init(),
combiner(), fini() functions in GOMP's codegen so the two
implementations were to disparate to try to wrap GOMP's around our own.
Differential Revision: https://reviews.llvm.org/D98806
Current atfork() handler for child processes does not reset
the affinity masks array which prevents users from setting their own
affinity in child processes.
Differential Revision: https://reviews.llvm.org/D99218
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
omp_is_initial_device() is marked as a built-in function in the current
compiler, and user code guarded by this call may be optimized away,
resulting in undesired behavior in some cases. This patch provides a
possible fix for such cases by defining the routine as a variant
function and removing it from builtin list.
Differential Revision: https://reviews.llvm.org/D99447
The second argument to the strnlen_s(str, size) function should be
sizeof(str) when str is a true array of characters with known size
(instead of just a char*). Use type traits to determine if first
parameter is a character array and use the correct size based on that
trait.
Differential Revision: https://reviews.llvm.org/D98209
Summary:
Remove some of the error messages printed when the CUDA plugin fails. The current error messages can be confusing because they are the first error messages printed after the async stream finds an error. This means that the printed values aren't related to what caused the issue, but are simply the last asyncronous operation that succeeded on the device. Remove these as they can be misleading.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99510
Summary:
If the call to `synchronize` fails, it will currently block the stream indefinitely if execution is continued from this point. Additionally, if the program exits it will trigger an assertion on the non-null value of the async queue and prevent the runtime from printing debugging information.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99443
-- Added or moved checks to appropriate places.
-- Removed ineffective null check where the pointer is already being
dereferenced around the code.
-- Initialized variables that can be used without definitions.
-- Added call to dlclose/FreeLibrary in OMPT tool activation.
-- Added a new build compiler definition.
Differential Revision: https://reviews.llvm.org/D98584
It is reported that after enabling hidden helper thread, the program
can hit the assertion `new_gtid < __kmp_threads_capacity` sometimes. The root
cause is explained as follows. Let's say the default `__kmp_threads_capacity` is
`N`. If hidden helper thread is enabled, `__kmp_threads_capacity` will be offset
to `N+8` by default. If the number of threads we need exceeds `N+8`, e.g. via
`num_threads` clause, we need to expand `__kmp_threads`. In
`__kmp_expand_threads`, the expansion starts from `__kmp_threads_capacity`, and
repeatedly doubling it until the new capacity meets the requirement. Let's
assume the new requirement is `Y`. If `Y` happens to meet the constraint
`(N+8)*2^X=Y` where `X` is the number of iterations, the new capacity is not
enough because we have 8 slots for hidden helper threads.
Here is an example.
```
#include <vector>
int main(int argc, char *argv[]) {
constexpr const size_t N = 1344;
std::vector<int> data(N);
#pragma omp parallel for
for (unsigned i = 0; i < N; ++i) {
data[i] = i;
}
#pragma omp parallel for num_threads(N)
for (unsigned i = 0; i < N; ++i) {
data[i] += i;
}
return 0;
}
```
My CPU is 20C40T, then `__kmp_threads_capacity` is 160. After offset,
`__kmp_threads_capacity` becomes 168. `1344 = (160+8)*2^3`, then the assertions
hit.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D98838
Add register usage information to the runtime metadata so that it can be used during kernel launch (that change will be in a different commit). Add this information to the kernel trace.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D98829
[libomptarget] Build amdgcn devicertl by default
The cmake for this looks for an llvm install and does the right thing when
building as part of enable_runtimes. It will probably do the right thing
in other settings - at least, it won't try to build this with gcc.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98658
[libomptarget] Build amdgpu plugin by default
This will build the amdgpu plugin if cmake is able to find the hsa
runtime library, which will be the case if rocm is installed or if
the hsa library has been installed somewhere cmake looks.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98654
[libomptarget] Fix devicertl build
The target specific functions in target_interface are extern C, but the
implementations for nvptx were mostly C++ mangling. That worked out as
a quirk of DEVICE macro expanding to nothing, except for shuffle.h which
only forward declared the functions with C++ linkage.
Also implements GetWarpSize, as used by shuffle, and includes target_interface
in nvptx target_impl.cu to help catch future divergence between interface and
implementation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98651
[libomptarget] Drop assert.h, use freestanding for amdgcn devicertl
Promotes the runtime assert to a link time error for the unimplemented
fallback functions. Enables amdgcn to build with only clang provided
headers, which makes it less likely to break other builds when enabled.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D98649
[libomptarget][amdgcn] Drop use of inttypes.h, moving closer to freestanding
The glibc headers are a periodic source of problems compiling the devicertl.
This patch resolves the following error run into while building llvm on a slightly
different linux system.
```
In file included from .../lib/clang/13.0.0/include/inttypes.h:21:
In file included from /usr/include/inttypes.h:25:
/usr/include/features.h:461:12: fatal error: 'sys/cdefs.h' file not found
# include <sys/cdefs.h>
^~~~~~~~~~~~~
```
As a second patch, removing assert.h from shuffle will let amdgcn build as
-ffreestanding, at which point only the headers that clang itself provides are
used and interactions with the host glibc are eliminated. Doing the same for
nvptx is complicated by printf handling but also seems worthwhile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D98565
This patch adds the infrastructure for allocator support for target memory.
Three allocators are introduced for device, host and shared memory.
The corresponding API functions have the llvm_ prefix temporarily, until they become part of the OpenMP standard.
Differential Revision: https://reviews.llvm.org/D97883
The shuffle idiom is differently implemented in our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom in it's
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95752
Summary:
The changes introduced in D87946 changed the API for libomptarget
functions. `__kmpc_push_target_tripcount` was a function in Clang 11.x
but was not given a backward-compatible interface. This change will
require people using Clang 13.x or 12.x to recompile their offloading
programs.
Reviewed By: jdoerfert cchen
Differential Revision: https://reviews.llvm.org/D98358
For clang this change is NFC cleanup, because clang
never calls atomic functions from runtime library.
Basically, pause is good in spin-loops waiting for something.
Atomic CAS loops do not wait for anything,
each CAS failure means some other thread progressed.
Performance experiments show that the pause only causes unnecessary slowdown
on CPUs with slow pause instruction, no difference on CPUs with fast pause
instruction, removal of the pause gives lesser binary size which is good.
Differential Revision: https://reviews.llvm.org/D97079
In D97003, CUDA 9.2 is the minimum requirement for OpenMP offloading on
NVPTX target. We don't need to have macros in source code to select right functions
based on CUDA version. we don't need to compile multiple bitcode libraries of
different CUDA versions for each SM. We don't need to worry about future
compatibility with newer CUDA version.
`-target-feature +ptx61` is used in this patch, which corresponds to the highest
PTX version that CUDA 9.2 can support.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97198
Restrict the chunk_size * chunk_num to only occur for valid
chunk_nums and reimplement calculating the limit to avoid overflow.
Differential Revision: https://reviews.llvm.org/D96747
and __kmpc_end_masked. The "master" construct is deprecated. Changed
proc-bind keyword from "master" to "primary". Use of both master
construct and master as proc-bind keyword is still allowed, but
deprecated.
Remove references to "master" in comments and strings, and replace
with "primary" or "primary thread". Function names and variables were
not touched, nor were references to deprecated master construct. These
can be updated over time. No new code should refer to master.
This patch just encapsulates some repeated code. To do so, it
relocates some functions from interface.cpp to omptarget.cpp. It also
adjusts them to the LLVM coding style.
This patch is almost NFC except some `DP` messages are a bit
different. For example, messages like "Entering target region" are
now emitted even if offload is disabled, but a subsequent "Offload is
disabled" is then emitted.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D97908
Without this patch, an `omp target exit data` before the runtime is
initialized produces a runtime error. This patch fixes that by
changing `__tgt_target_data_end_mapper` to call `CheckDeviceAndCtors`
like many other runtime routines.
Discussed at
<https://lists.llvm.org/pipermail/openmp-dev/2021-March/003920.html>.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D97907
Without this patch, when the offload device is set to
`omp_get_initial_device()`, the runtime fails with an error diagnostic
when entering target regions or target data regions.
However, OpenMP 5.1, sec. 2.14.5 "target Construct", "Restrictions",
p. 203, L3-5 states:
> The device clause expression must evaluate to a non-negative integer
> value that is less than or equal to the value of
> omp_get_num_devices().
Sec. 3.7.7 "omp_get_initial_device", p. 412, L2-3 states:
> The value of the device number is the value returned by the
> omp_get_num_devices routine.
Similarly, OpenMP 5.0, sec. 2.12.5 "target Construct", "Restrictions",
p. 174 L30-32 states:
> The device clause expression must evaluate to a non-negative integer
> value less than the value of omp_get_num_devices() or to the value
> of omp_get_initial_device().
This patch fixes this behavior by changing the runtime to behave as if
offloading is disabled whenever it finds the offload device (either
from a `device` clause or the default device) is set to the host
device. In the case of mandatory offloading when
`omp_get_num_devices() == 0`, it incorporates the behavior proposed
for OpenMP 5.2 in OpenMP spec github issue 2669.
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D97616
This is a preview of allocator support for target memory that depends on the
offload runtime API which allocates memory as described below.
llvm_omp_target_alloc_host(size_t size, int device_num);
-- Returns non-migratable memory owned by host.
-- Memory is accessible by host and device(s).
llvm_omp_target_alloc_shared(size_t size, int device_num);
-- Returns migratable memory owned by host and device.
-- Memory is accessible by host and device.
llvm_omp_target_alloc_device(size_t size, int device_num);
-- Returns memory owned by device.
-- Memory is only accessible by device.
New memory space and predefined allocator names are
-- llvm_omp_target_host_mem_space
-- llvm_omp_target_shared_mem_space
-- llvm_omp_target_device_mem_space
-- llvm_omp_target_host_mem_alloc
-- llvm_omp_target_shared_mem_alloc
-- llvm_omp_target_device_mem_alloc
Differential Revision: https://reviews.llvm.org/D96669
If the mapped structure has data members, which have 'default' mappers,
need to map these members individually using their 'default' mappers.
Differential Revision: https://reviews.llvm.org/D92195
Cleanup changes:
- check value read from file;
- remove dead code;
- make unsigned variable to read hexadecimal number to;
- add debug assertion to check ref count.
Differential Revision: https://reviews.llvm.org/D96893
Stitching id could be overridden causing reference of destroyed object
when number of teams is 1. The patch separates stitching id store
location for teams and parallel nested in teams.
Differential Revision: https://reviews.llvm.org/D96562
Allow users to use a non-system version of perl, python and awk, which is useful
in certain package managers.
Reviewed By: JDevlieghere, MaskRay
Differential Revision: https://reviews.llvm.org/D95119
PR#49334 reports a crash when offloading to x86_64 with `target nowait`,
which is caused by referencing a nullptr. The root cause of the issue is, when
pushing a hidden helper task in `__kmp_push_task`, it also maps the gtid to its
shadow gtid, which is wrong.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97329
This makes sure that images are loaded in the order in which they are registered with libomptarget.
If a target can load multiple images and these images depend on each other (for example if one image contains the programs target regions and one image contains library code), then the order in which images are loaded can be important for symbol resolution (for example, in the VE plugin).
In this case: because the same code exist in the host binaries, the order in which the host linker loads them (which is also the order in which images are registered with libomptarget) is the order in which the images have to be loaded onto the device.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95530
When including <ostream>, the register_callback macro of the OMPT callback.h
clashes with a function defined in ostream. This patch renames the macro
and includes ompt into the macro name.
`ptx71` is not supported in release version of LLVM yet. As a result,
the support of CUDA 11.2 and CUDA 11.1 caused a compilation error as mentioned
in D97004. Since the support in D97004 is just a WA for releease, and we'll not
use it in the near future, using `ptx70` for CUDA 11 is feasible.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97195
This code alleviates some pathological loop parameters (lower,
upper, stride) within calculations involved in the static loop code. It
bounds the chunk size to the trip count if it is greater than the trip
count and also minimizes problematic code for when trip count < nth.
Differential Revision: https://reviews.llvm.org/D96426
Only attempt shutdown if lpReserved is NULL. The Windows documentation
states:
When handling DLL_PROCESS_DETACH, a DLL should free resources such as
heap memory only if the DLL is being unloaded dynamically (the
lpReserved parameter is NULL). If the process is terminating (the
lpReserved parameter is non-NULL), all threads in the process except the
current thread either have exited already or have been explicitly
terminated by a call to the ExitProcess function, which might leave some
process resources such as heaps in an inconsistent state. In this case,
it is not safe for the DLL to clean up the resources. Instead, the DLL
should allow the operating system to reclaim the memory.
Differential Revision: https://reviews.llvm.org/D96750
This patch limits the number of dispatch buffers (used for
loop worksharing construct) to between 1 and 4096.
Differential Revision: https://reviews.llvm.org/D96749
As mentioned in PR#49250, without this patch, ptxas for CUDA 9.1 fails
in the following two tests:
- openmp/libomptarget/test/mapping/lambda_mapping.cpp
- openmp/libomptarget/test/offloading/bug49021.cpp
The error looks like:
```
ptxas /tmp/lambda_mapping-081ea9.s, line 828; error : Not a name of any known instruction: 'activemask'
```
The problem is that our cmake script converts CUDA version strings
incorrectly: 9.1 becomes 9100, but it should be 9010, as shown in
`getCudaVersion` in `clang/lib/Driver/ToolChains/Cuda.cpp`. Thus,
`openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu`
inadvertently enables `activemask` because it apparently becomes
available in 9.2. This patch fixes the conversion.
This patch does not fix the other two tests in PR#49250.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97012
Without this patch, there's a runtime error for those map types at
exit from an "omp target data" or at "omp target exit data", but the
spec says the list item should be ignored.
This patch tests that fix in data_absent_at_exit.c, and it also
improves other testing for data that is not fully present at exit.
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D96999
allow bit masking to select various trace features.
bit 0 => Launch tracing (stderr)
bit 1 => timing of runtime (stdout)
bit 2 => detailed launch tracing (stderr)
bit 3 => timing goes to stdout instead of stderr
example: LIBOMPTARGET_KERNEL_TRACE=7 does it all
LIBOMPTARGET_KERNEL_TRACE=5 Launch + details
LIBOMPTARGET_KERNEL_TRACE=2 timings + launch to stderr
LIBOMPTARGET_KERNEL_TRACE=10 timings + launch to stdout
Differential Revision: https://reviews.llvm.org/D96998
OpenMP 5.0 removed a lot of restriction for overlapped mapped items
comparing to OpenMP 4.5. Patch restricts the checks for overlapped data
mappings only for OpenMP 4.5 and less and reorders mapping of the
arguments so, that present and alloc mappings are processed first and
then all others.
Differential Revision: https://reviews.llvm.org/D86119
As reported by Guilherme Valarini [0], we used to pass stack allocations
to calls that can nowadays be asynchronous. This is arguably a problem
and it will inevitably result in UB. To remedy the situation we
allocate the locations as part of the AsyncInfoTy object. The lifetime
of that object matches what we need for now. If the synchronization is
not tied to the AsyncInfoTy object anymore we might need to have a
different buffer construct in global space.
This should be back-ported to LLVM 12 but needs slight modifications as
it is based on refactoring patches we do not need to backport.
[0] https://lists.llvm.org/pipermail/openmp-dev/2021-February/003867.html
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D96667
This patch unifies our libomptarget API in two ways:
- always pass a `__tgt_async_info` object, the Queue member decides if
it is in use or not.
- (almost) always synchronize in the interface layer and not in the
omptarget layer.
A side effect is that we now put all constructor and static initializer
kernels in a stream too, if the device utilizes `__tgt_async_info`.
The patch contains a TODO which can be addressed as we add support for
asynchronous malloc and free in the plugin API. This is the only
`synchronizeAsyncInfo` left in the omptarget layer.
Site note: On a V100 system the GridMini performance for small sizes
more than doubled.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D96379
The AsyncInfo should be passed everywhere and it should offer a way to
ensure synchronization, given a libomptarget Device.
This replaces D96431.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D96438
This unifies the API of `target` relative to `targetUpdateData` and
such.
Reviewed By: tianshilei1992, grokos
Differential Revision: https://reviews.llvm.org/D96429
The struct and enum alignments are kept by disabling clang-format for
that code region.
Reviewed By: tianshilei1992, JonChesterfield, grokos
Differential Revision: https://reviews.llvm.org/D96428
This silences warnings like these, in mingw builds with clang:
runtime/src/kmp_atomic.h:1021:13: warning: '__kmpc_atomic_cmplx8_rd' has C-linkage specified, but returns user-defined type 'kmp_cmplx64' (aka '__kmp_cmplx64_t') which is incompatible with C [-Wreturn-type-c-linkage]
runtime/src/z_Windows_NT_util.cpp:479:17: warning: cast from 'volatile void *' to 'type-parameter-0-0 *' drops volatile qualifier [-Wcast-qual]
flag = (C *)th->th.th_sleep_loc;
runtime/src/z_Windows_NT_util.cpp:1321:14: warning: cast to 'void *' from smaller integer type 'DWORD' (aka 'unsigned long') [-Wint-to-void-pointer-cast]
} else if ((void *)exit_val != (void *)th) {
Differential Revision: https://reviews.llvm.org/D96585
Add ifdefs around one function that only is used in unix build
configurations.
Add a void cast for a windows specific function that currently is
unused but may be intended to be used at some point.
Differential Revision: https://reviews.llvm.org/D96584
These variables are used only in certain build configurations,
or marked with a todo comment indicating that they should be
used/checked/reported.
Differential Revision: https://reviews.llvm.org/D96582
MinGW build configurations don't support this pragma (unless
compiling with clang, with -fms-extensions, and linking with
lld), and at least clang warns about it.
This library does end up linked by the cmake files anyway (as
long as the check works properly).
Differential Revision: https://reviews.llvm.org/D96581
check_library_exists fails for stdcall functions, because that
check doesn't include the necessary headers (and thus fails with
an undefined reference to _EnumProcessModules, when the import
library symbol actually is called _EnumProcessModules@16).
Merge the two previous checks check_include_files and
check_library_exists into one with check_c_source_compiles, and
merge the variables that indicate whether it succeeded.
Differential Revision: https://reviews.llvm.org/D96580
[libomptarget][amdgcn] Build amdgcn devicertl as openmp
Change cmake to build as openmp and fix up some minor errors in the code.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D96533
Three minor changes in this patch:
- added UNLIKELY hint to few rarely executed branches;
- replaced couple of run time checks with debug assertions;
- moved check of presence of ittnotify tool from inside the function call.
Differential Revision: https://reviews.llvm.org/D95816
This patch enables omp_get_num_devices() and omp_get_initial_device() on
Windows by providing an alternative to dlsym on Windows, and proposes to
add a new libomptarget entry, __tgt_get_num_devices().
Differential Revision: https://reviews.llvm.org/D96182
This patch adds lower-bound and upper-bound to num_teams clause
according to OpenMP 5.1 specification. The initial number of teams
created is implementation defined, but it will be greater than or
equal to lower-bound and less than or equal to upper-bound. If
num_teams clause is not specified, the number of teams created is
implementation defined, but it will be greater or equal to 1.
Differential Revision: https://reviews.llvm.org/D95820
[libomptarget][amdgcn] Tolerate deadstripped device_state variable
The device_state variable may have been deadstripped. Similar to
device_environment, leave detection of missing but used symbol to loader.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D96330
[libomptarget][amdgcn] Tolerate deadstripped env variable
Discovered by Pushpinder. If the device_environment variable is unused
it can be deadstripped, in which case we should not abort due to it
missing. This change is safe in that a missing symbol which is actually
used can be reported by both linker and loader, and a missing unused
symbol is better deadstripped than left in the image.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D96329
Currently if there is not kernel argument, device synchronization will
be skipped. This can lead to two issues:
1. If there is any device error, it will not be captured;
2. The target region might end before the kernel is done, which is not spec
conformant.
The test added in this patch only runs on NVPTX platform, although it will not
be executed by Phab at all. It also requires `not` which is not available on most
systems.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D96067
The header `assert.h` needs to be included in order to use `assert` in the code.
When building NVPTX `deviceRTLs` on a CUDA free system, it requires headers from
`gcc-multilib`, which some systems don't have. This patch drops the use of
`assert` in common parts of `deviceRTLs`. In light of
`openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h`, a code block
```
if (!cond)
__builtin_trap();
```
is being used. The builtin will be translated to `call void @llvm.trap()`, and
the corresponding PTX is `trap;`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95986
New affinity patch introduced legitimate sign-compare warnings that
clang doesn't report but GCC-10 does. This removes the warnings by
changing two variables types to unsigned.
Differential Revision: https://reviews.llvm.org/D95818
From https://bugs.llvm.org/show_bug.cgi?id=48973, we know that
`std::call_once(PM->RTLs.initFlag, &RTLsTy::LoadRTLs, PM->RTLs)` causes compile
time problems in libstdc++v3 5.3.1. This is because there was a defect in the
standard regarding the `call_once` (LWG 2442). This was fixed in libstdc++ soon
thereafter, but there are likely other standard libraries where this will fail.
By matching this function call with the other one, we fix this bug.
Differential Revision: https://reviews.llvm.org/D95769
FileCheck is required for OpenMP tests. The current detection can fail
if building OpenMP in-tree when user sets `LLVM_INSTALL_TOOLCHAIN_ONLY=ON`. As a
result, CMake will raise an error and the compilation will be broken. This patch
fixed the issue. When `FileCheck` is not a target, tests will just be skipped.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D95689
Summary:
One option for the LIBOMPTARGET_INFO environment variable is to print the current status of the device's data mappings. These are a shared resource among threads so this needs to be protected when using multiple streams.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95786
This patch refines the logic to choose compute capabilites via the
environment variable `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES`. It supports the
following values (all case insensitive):
- "all": Build `deviceRTLs` for all supported compute capabilites;
- "auto": Only build for the compute capability auto detected. Note that this
requires CUDA. If CUDA is not found, a CMake fatal error will be raised.
- "xx,yy" or "xx;yy": Build for compute capabilities `xx` and `yy`.
If `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES` is not set, it is equivalent to set
it to `all`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95687
This patch introduces a new environment variable to force monotonic
behavior for users that absolutely need it. This is in anticipation
of 5.0 change that uses non-monotonic behavior for dynamic scheduling
by default. Fixes for that and the actual switch are coming soon.
Differential Revision: https://reviews.llvm.org/D95263
Added release note for new `deviceRTLs` and hidden helper task for LLVM
12.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95584
This patch created a new header file `target_interface.h` for declarations of all target dependent functions. All future targets can get things work by simply implementing all functions declared in the header and macros/data same as each `target_impl.h`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95300
In the past `-O1` was used when building NVPTX bitcode libraries. After
we switched to OpenMP, `-O1` was missing by mistake, leading to a huge performance
regression.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95545
`[[clang::loader_uninitialized]]` is in macro `SHARED` but it doesn't
work for array like `parallelLevel`, so the variable will be zero initialized.
There is also a similar issue for `omptarget_nvptx_device_State` which is in
global address space. Its c'tor is also generated, which was not in the past when
building the `deviceRTLs` with CUDA. In this patch, we added the attribute to
the two variables explicitly.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95550
Link error occurred when time profiling in libomp is enabled by default
because `libomp` is assumed to be a C library but the dependence on
`libLLVMSupport` for profiling is a C++ library. Currently the issue blocks all
OpenMP tests in Phabricator.
This patch set a new CMake option `OPENMP_ENABLE_LIBOMP_PROFILING` to
enable/disable the feature. By default it is disabled. Note that once time
profiling is enabled for `libomp`, it becomes a C++ library.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95585
The remote offloading plugin's CMakeLists was trying to build if its
flag was enabled even if it didn't find gRPC/protobuf. The conditional
was wrong, it's fixed by this.
Differential Revision: https://reviews.llvm.org/D95574
D95466 dropped CUDA to build NVPTX deviceRTL and enabled it by default.
However, the building requires some libraries that are not available on non-CUDA
system by default, which could break the compilation. This patch disabled the
build by default. It can be enabled with `LIBOMPTARGET_BUILD_NVPTX_BCLIB=ON`.
Reviewed By: kparzysz
Differential Revision: https://reviews.llvm.org/D95556
When OMP_PLACES contains an invalid value, the warning informs the user
that the fallback is OMP_PLACES=threads, but the actual internal setting
is OMP_PLACES=cores and is detected as such with KMP_SETTINGS=1.
This patch informs the user that OMP_PLACES=cores is being used instead
of OMP_PLACES=threads.
Differential Revision: https://reviews.llvm.org/D95170
This patch adds the new algorithm for topology discovery using cpuid
leaf 1f. Only the new die level is detected and integrated into the
current affinity mechanisms including KMP_AFFINITY (granularity level
and compact/scatter algorithm), OMP_PLACES=dies, and KMP_HW_SUBSET.
Differential Revision: https://reviews.llvm.org/D95157
HWLOC 2.0 has numa nodes as separate children and are not in the main
parent/child topology tree anymore. This change takes this into
account. The main topology detection loop in the create_hwloc_map()
routine starts at a hardware thread within the initial affinity mask and
goes up the topology tree setting the socket/core/thread labels
correctly.
This change also introduces some of the more generic changes that the
future kmp_topology_t structure will take advantage of including a
generic ratio & count array (finding all ratios of topology layers like
threads/core cores/socket and finding all counts of each topology
layer), generic radix1 reduction step, generic uniformity check, and
generic printing of topology (en_US.txt)
Differential Revision: https://reviews.llvm.org/D95156
The check-libomptarget fails when building with LLVM_ENABLE_PROJECTS. This is because test configuration misses the path to libomp.so and libLLVMSupport.so when time profiling is enabled (both libraries have the same path when building). This patch add the path to the configuration.
Reviewed By: vzakhari
Differential Revision: https://reviews.llvm.org/D95376
Problem reported by Joseph Shen <joseph.smeng@gmail.com>.
The patch changes *(&<atomic-var>) to (&<atomic-var>)->load().
Differential Revision: https://reviews.llvm.org/D95485
With D94745, we no longer use CUDA SDK to compile `deviceRTLs`. Therefore,
many CMake code in the project is useless. This patch cleans up unnecessary code
and also drops the requirement to build NVPTX `deviceRTLs`. CUDA detection is
still being used however to determine whether we need to involve the tests. Auto
detection of compute capability is enabled by default and can be disabled by
setting CMake variable `LIBOMPTARGET_NVPTX_AUTODETECT_COMPUTE_CAPABILITY=OFF`.
If auto detection is enabled, and CUDA is also valid, it will only build the
bitcode library for the detected version; otherwise, all variants supported will
be generated. One drawback of this patch is, we now generate 96 variants of
bitcode library, and totally 1485 files to be built with a clean build on a
non-CUDA system. `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=""` can be used to
disable building NVPTX `deviceRTLs`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95466
This patch sets the def-allocator-var ICV based on the environment variables
provided in OMP_ALLOCATOR. Previously, only allowed value for OMP_ALLOCATOR
was a predefined memory allocator. OpenMP 5.1 specification allows predefined
memory allocator, predefined mem space, or predefined mem space with traits in
OMP_ALLOCATOR. If an allocator can not be created using the provided environment
variables, the def-allocator-var is set to omp_default_mem_alloc.
Differential Revision: https://reviews.llvm.org/D94985
[libomptarget][cuda] Handle missing _v2 symbols gracefully
Follow on from D95367. Dlsym the _v2 symbols if present, otherwise use the
unsuffixed version. Builds a hashtable for the check, can revise for zero
heap allocations later if necessary.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95415
Requiring 3.15 causes a build breakage, I'm sure none of the contents actually require
3.15 or above.
Differential Revision: https://reviews.llvm.org/D95474
[libomptarget][cuda] Gracefully handle missing cuda library
If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95412
[libomptarget][cuda] Only run tests when sure there is cuda available
Prior to D95155, building the cuda plugin implied cuda was installed locally.
With that change, every machine can build a cuda plugin, but they won't all have
cuda and/or an nvptx card installed locally.
This change enables the nvptx tests when either:
- libcuda is present
- the user has forced use of the dlopen stub
The default case when there is no cuda detected will no longer attempt to
run the tests on nvptx hardware, as was the case before D95155.
Reviewed By: jdoerfert, ronlieb
Differential Revision: https://reviews.llvm.org/D95467
This introduces a remote offloading plugin for libomptarget. This
implementation relies on gRPC and protobuf, so this library will only
build if both libraries are available on the system. The corresponding
server is compiled to `openmp-offloading-server`.
This is a large change, but the only way to split this up is into RTL/server
but I fear that could introduce an inconsistency amongst them.
Ideally, tests for this should be added to the current ones that but that is
problematic for at least one reason. Given that libomptarget registers plugin
on a first-come-first-serve basis, if we wanted to offload onto a local x86
through a different process, then we'd have to either re-order the plugin list
in `rtl.cpp` (which is what I did locally for testing) or find a better
solution for runtime plugin registration in libomptarget.
Differential Revision: https://reviews.llvm.org/D95314
In order to support remote execution, we need to be able to send the
target binary description to the remote host for registration (and
consequent deregistration). To support this, I added these two
optional new functions to the plugin API:
- `__tgt_rtl_register_lib`
- `__tgt_rtl_unregister_lib`
These functions will be called to properly manage the instance of
libomptarget running on the remote host.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D93293
From this patch (plus some landed patches), `deviceRTLs` is taken as a regular OpenMP program with just `declare target` regions. In this way, ideally, `deviceRTLs` can be written in OpenMP directly. No CUDA, no HIP anymore. (Well, AMD is still working on getting it work. For now AMDGCN still uses original way to compile) However, some target specific functions are still required, but they're no longer written in target specific language. For example, CUDA parts have all refined by replacing CUDA intrinsic and builtins with LLVM/Clang/NVVM intrinsics.
Here're a list of changes in this patch.
1. For NVPTX, `DEVICE` is defined empty in order to make the common parts still work with AMDGCN. Later once AMDGCN is also available, we will completely remove `DEVICE` or probably some other macros.
2. Shared variable is implemented with OpenMP allocator, which is defined in `allocator.h`. Again, this feature is not available on AMDGCN, so two macros are redefined properly.
3. CUDA header `cuda.h` is dropped in the source code. In order to deal with code difference in various CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports will be used, just as what we currently use for CUDA compilation.
4. Correspondingly, compiler driver is also updated to support CUDA version encoded in the name of bitcode library. Now the bitcode library for NVPTX is named as `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, such as `libomptarget-nvptx-cuda_80-sm_20.bc`.
With this change, there are also multiple features to be expected in the near future:
1. CUDA will be completely dropped when compiling OpenMP. By the time, we also build bitcode libraries for all supported SM, multiplied by all supported CUDA version.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` if OpenMP 5.1 feature is fully supported. For now, the IR generated is totally wrong.
3. Target specific parts will be wrapped into `declare variant` with `isa` selector if it can work properly. No target specific macro is needed anymore.
4. (Maybe more...)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94745
In much of the libomptarget interface we have an ident_t object now, if
it is not null we can use it to improve the profile output. For now, we
simply use the ident_t "source information string" as generated by the
FE.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95282
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
[libomptarget][cuda] Gracefully handle missing cuda library
If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95412
`omp_is_initial_device` in device code was implemented as a builtin
function in D38968 for a better performance. Therefore there is no chance that
this function will be called to `deviceRTLs`. As we're moving to build `deviceRTLs`
with OpenMP compiler, this function can lead to a compilation error. This patch
just simply removes it.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95397
This patch makes prep for dropping CUDA when compiling `deviceRTLs`.
CUDA intrinsics are replaced by NVVM intrinsics which refers to code in
`__clang_cuda_intrinsics.h`. We don't want to directly include it because in the
near future we're going to switch to OpenMP and by then the header cannot be
used anymore.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95327
D95161 removed the option `--libomptarget-nvptx-path`, which is used in
the tests for `libomptarget-nvptx`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95293
[libomptarget][nvptx] Replace cuda atomic primitives with clang intrinsics
Tested by diff of IR generated for target_impl.cu before and after. NFC. Part
of removing deviceRTL build time dependency on cuda SDK.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95294
[libomptarget][cuda] Call v2 functions explicitly
rtl.cpp calls functions like cuMemFree that are replaced by a macro
in cuda.h with cuMemFree_v2. This patch changes the source to use
the v2 names consistently.
See also D95104, D95155 for the idea. Alternatives are to use a mixture,
e.g. call the macro names and explictly dlopen the _v2 names, or to keep
the current status where the symbols are replaced by macros in both files
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95274
[libomptarget] Build cuda plugin without cuda installed locally
Compiles a new file, `plugins/cuda/dynamic_cuda/cuda.cpp`, to an object file that exposes the same symbols that the plugin presently uses from libcuda. The object file contains dlopen of libcuda and cached dlsym calls. Also provides a cuda.h containing the subset that is used.
This lets the cmake file choose between the system cuda and a dlopen shim, with no changes to rtl.cpp.
The corresponding change to amdgpu is postponed until after a refactor of the plugin to reduce the size of the hsa.h stub required
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95155
The buckets are initialized in __kmp_dephash_create but when they are extended
the memory is allocated but not NULL'd, potentially leaving some buckets
uninitialized after all entries have been copied into the new allocation.
This commit makes sure the buckets are properly initialized with NULL before
copying the entries.
Differential Revision: https://reviews.llvm.org/D95167
[libomptarget][devicertl] Drop templated atomic functions
The five __kmpc_atomic templates are instantiated a total of seven times.
This change replaces the template with explictly typed functions, which
have the same prototype for amdgcn and nvptx, and implements them with
the same code presently in use.
Rolls in the accepted but not yet landed D95085.
The unsigned long long type can be replaced with uint64_t when replacing
the cuda function. Until then, clang warns on casting a pointer to one to
a pointer to the other.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95093
Summary:
Prior to D91261 the information checked the OMP_MAP_TARGET_PARAM flag, change this as it has been removed. The INFO macro was changed to accept a flag as input to make conditionally printing information easier.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95133
Profiling has been recently implemented in libomptarget (D93055). This patch enables time profiling support for libomptarget in libomp, to support profiling of multi-threaded execution of offloaded regions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94855
Pretty similar to D95058, this patch added forward declaration for
CUDA atomic functions. We already have definitions with right mangled names in
internal CUDA headers so the forward declaration here can work properly.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D95085
Summary:
The custom mapper API did not previously support the mapping names added previously. This means they were not present if a user requested debugging information while using the mapper functions. This adds basic support for passing the mapped names to the runtime library.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D94806
Once we switch to build deviceRTLs with OpenMP, primitives and CUDA
intrinsics cannot be used directly anymore because `__device__` is not recognized
by OpenMP compiler. To avoid involving all CUDA internal headers we had in `clang`,
we forward declared these functions. Eventually they will be transformed into
right LLVM instrinsics.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95058
[libomptarget][devicertl][nfc] Simplify target_atomic abstraction
Atomic functions were implemented as a shim around cuda's atomics, with
amdgcn implementing those symbols as a shim around gcc style intrinsics.
This patch folds target_atomic.h into target_impl.h and folds amdgcn.
Further work is likely to be useful here, either changing to openmp's atomic
interface or instantiating the templates on the few used types in order to
move them into a cuda/c++ implementation file. This change is mostly to
group the remaining uses of the cuda api under nvptx' target_impl abstraction.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95062
[libomptarget][devicertl][nfc] Remove some cuda intrinsics, simplify
Replace __popc, __ffs with clang intrinsics. Move kmpc_impl_min to only file
that uses it and replace template with explictly typed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95060
Replaced CUDA builtin vars with LLVM intrinsics such that we don't need
definitions of those intrinsics.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95013
[libomptarget][devicertl] Wrap source in declare target pragmas
Factored out of D93135 / D94745. C++ and cuda ignore unknown pragmas
so this is a NFC for the current implementation language. Removes noise
from patches for building deviceRTL as openmp.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95048