* Add GOMP versioned pause functions
* Add GOMP versioned affinity format functions
To do the affinity format functions, only attach versioned symbols
to the APPEND Fortran entries (e.g., omp_set_affinity_format_) since
GOMP only exports two symbols (one for Fortran, one for C). Our
affinity format functions have three symbols.
e.g., with omp_set_affinity_format:
1) omp_set_affinity_format (Fortran interface)
2) omp_set_affinity_format_ (Fortran interface)
3) ompc_set_affinity_format (C interface)
Have the GOMP version of the C symbol alias the ompc_* 3) version
instead of the Fortran unappended version 1).
Differential Revision: https://reviews.llvm.org/D103647
Remove strange checks for syscall() arguments where mask is NULL.
Valgrind reports these as error usages for the syscall.
Instead, just check if CACHE_LINE bytes is long enough. If not, then
search for the size. Also, by limiting the first size detection
attempt to CACHE_LINE bytes, instead of 1MB, we don't use more than one
cache line for the mask size. Before this patch, sometimes the returned
mask size was 640 bytes (10 cache lines) because the initial call to
getaffinity() was limited only by the internal kernel mask size
which can be very large.
Differential Revision: https://reviews.llvm.org/D103637
Lazily set affinity for root threads. Previously, the root thread
executing middle initialization would attempt to assign affinity
to other existing root threads. This was not working properly as the
set_system_affinity() function wasn't setting the affinity for the
target thread. Instead, the middle init thread was resetting the
its own affinity using the target thread's affinity mask.
Differential Revision: https://reviews.llvm.org/D103625
This patch includes some changes which deletes the code accessing
g_atl_machine global. Some accesses related to memory_pools are
still remaining.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103813
The current handling of dependencies in Archer has two flaws:
- annotation of dependency synchronization is not limited to sibling tasks
- annotation of in/out dependencies is based on the assumption, that dependency
variables will rarely be byte-sized variables.
This patch introduces a map in the generating task to manage the dependency
variables for the child tasks. The map is only accesses from the generating
task, so no locking is necessary. This also limits the dependency-based
synchronization to sibling tasks.
This patch also introduces proper handling for new dependency types such as
mutexinoutset and inoutset.
Differential Revision: https://reviews.llvm.org/D103608
The main motivation for reusing objects is that it helps to avoid creating and
leaking synchronization clocks in TSan. The reused object will reuse the
synchronization clock in TSan.
Before, new and delete operators were overloaded to get and return memory for
the object from/to the object pool.
This patch replaces the operator overloading with explicit static New/Delete
functions.
Objects for parallel regions and implicit tasks will always be recruited and
returned to the thread-local object pool. Only for explicit task, there is a
chance that an other thread completes the task and will free the object. This
patch optimizes the thread-local New/Delete calls by avoiding locks and only
lock if the pool is empty. Remote threads return the object into a separate
queue.
The chunk size for allocations is now decided based on page size. The objects
will also be aligned to cache lines avoiding false sharing.
This is the first patch in a series to provide better tasking support.
Differential Revision: https://reviews.llvm.org/D103606
Archer uses weak symbol overloads of TSan functions to enable loading the tool
even if the application is not built with TSan. For MACOS the tool collects
the function pointer at runtime.
When adding the function entry/exit markers, we missed to add the functions
in the MACOS codepath.
This patch also replaces the repeated function lookup by a single initial
function lookup and fixes the disabling logic in RunningOnValgrind.
Differential Revision: https://reviews.llvm.org/D103607
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
This is the first of seven patches that implements OMPD, a debugging interface to support debugging of OpenMP programs.
It contains support code required in "openmp/runtime" for OMPD implementation.
Reviewed By: @hbae
Differential Revision: https://reviews.llvm.org/D100181
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
Size of type of the dependence flag changed from 1 to 4 bytes in clang.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
The ident_t * argument in __kmp_get_monotonicity was being used without
a customary NULL check, causing the function to crash in a Debug build.
Release builds were not affected thanks to dead store elimination.
This global struct used to hold various flags for monitoring the
initialization of hsa.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103795
Previous logic was to always use the first kernarg pool found to allocate
kernel args. This patch changes this to use only the kernarg pool which
has non-zero size. This logic is also reworked to not use any globals.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103600
Nesting mode is a new experimental feature in the OpenMP
runtime. It allows a user to set up nesting for an application in a
way that corresponds to the hardware topology levels on the machine an
application is being run on. For example, if a machine has 2 sockets,
each with 12 cores, then use of nesting mode could set up an outer
level of nesting that uses 2 threads per parallel region, and an inner
level of nesting that uses 12 threads per parallel region.
Nesting mode is controlled with the KMP_NESTING_MODE environment
variable as follows:
1) KMP_NESTING_MODE = 0: Nesting mode is off (default); max-active-levels-var
is set to 1 (the default -- nesting is off, nested parallel regions
are serialized).
2) KMP_NESTING_MODE = 1: Nesting mode is on, and a number of threads
will be assigned for each level discovered in the machine topology;
max-active-levels-var is set to the number of levels discovered.
3) KMP_NESTING_MODE = n, n>1: [Note: this option is experimental and may change
or be removed in the future.] Nesting mode is on, and a number of
threads will be assigned for each topology level discovered on the
machine, up to k<=n levels (since there may be fewer than n levels
discovered in the topology), and beyond the kth level, nested parallel
regions will be serialized; NOTE: max-active-levels-var is 1 (the default --
nesting is off, and nested parallel regions are serialized until the
user changes max-active-levels-var.
If the user sets OMP_NUM_THREADS or OMP_MAX_ACTIVE_LEVELS, they will
override KMP_NESTING_MODE settings for the associated environment
variables. The detected topology may be limited by an affinity mask
setting on the initial thread, or if the user sets KMP_HW_SUBSET. See
also: KMP_HOT_TEAMS_MAX_LEVEL for controlling use of hot teams for
nested parallel regions. Note that this feature only sets numbers of
threads used at nesting levels. The user should make use of
OMP_PLACES and OMP_PROC_BIND or KMP_AFFINITY for affinitizing those
threads, if desired.
Differential Revision: https://reviews.llvm.org/D102188
When on KNL and L2 or Tile layer is detected, manually add
the corresponding layer which is equivalent.
Differential Revision: https://reviews.llvm.org/D102865
Turns out the only purpose of this class was verify if device ID
was in range or not which could be done easily by using g_atl_machine.
Still getting rid of g_atl_machine is pending which would be done in
a later patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103443
This struct was used to specify the device on which memory was
being allocated/free in atmi_malloc/free. It has now been replaced
with int DeviceId.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103239
Refactor suggested in D103037 to help avoid similar copy-paste errors.
Change is mechanical. Some parts of this would be more robust with unsigned.
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D103090
Suggested in D103059. Use a single lookup instead of two, more const, less mutation.
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D103093
ATMI_STATUS_UNKNOWN was unused, deleted references to it.
Replaced ATMI_STATUS_{SUCCESS,ERROR} with HSA_STATUS_{SUCCESS,ERROR}
Replaced atmi_status_t with hsa_status_t
Otherwise no change. In particular, conversions between atmi_status_t and
hsa_status_t will now be conversions between hsa_status_t and itself.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D103115
This patch drops g_atmi_initialized and inlines the Initialize &
Finalize methods from Runtime class.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102847
Two globals KernelInfoTable & SymbolInfoTable are moved
into RTLDeviceInfoTy class.
This builds on the top of D102691.
[2/2]
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102692
[libomptarget][nfc] Move hostcall required test to rtl
Remove a global, fix minor race. First of N patches to bring up hostcall.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103058
Fix the case where NumTeams was set incorrectly instead of NumThreads
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D103037
KernelNameMap contains entries like "key.kd" => key which clearly
could be replaced by simple logic of removing suffix from the key.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102691
Warnings on deprecated api cannot be suppressed if the library is not initialized.
With this change it is possible to set KMP_WARNINGS=false to suppress the warnings.
Differential Revision: https://reviews.llvm.org/D102676
The check for the TO flag when processing firstprivates is missing. As a result,
sometimes the device copy of a firstprivate never gets initialized. Currectly we
try to force lambda structs to be allocated immediately by marking them as a
non-firstprivate, so that PrivateArgumentManagerTy::addArg allocates memory for
them immediately. However, calling addArg with IsFirstPrivate=false makes the
function skip initializing the device copy. Whether an argument is firstprivate
and whether we need to allocate memory immediately are not synonyms, so this
patch introduces one more control variable for immediate allocation and sets it
apart from initialization.
Differential Revision: https://reviews.llvm.org/D102890
[libomptarget][amdgpu] Mark alloc, free weak to facilitate local experimentation
There are a lot of different ways we might implement the devicertl local alloc
and free functions. Via host, local buffers (stack or arena), specialising per
kernel etc. It is not yet clear what the right design is. This change makes the
alloc and free functions weak, so one can override them from local tests while
comparing options.
Not strictly necessary, as a comparable patch can be applied locally each time,
but would be convenient for out of tree dev. Plan would be to drop the weak
attribute at the same time as introducing a working allocator to trunk.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102499
[libomptarget] Improve dlwrap compile time error diagnostic
The dlwrap interface takes an explict arity, e.g. DLWRAP(cuAlloc, 2);
This probably can't be eliminated as it controls the argument list of an
external symbol, not an inline header function. If the arity given is too
big, the error from clang referring to the line is in the middle of
implementation details.
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1277:7: error: static_assert failed
due to requirement '0UL < tuple_size<std::tuple<>>::value' "tuple index is in range"
static_assert(__i < tuple_size<tuple<>>::value,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/usr/lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/tuple:1260:7: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:93:27 ...
/home/amd/llvm-project/openmp/libomptarget/plugins/cuda/dynamic_cuda/cuda.cpp:34:1: note: in
instantiation of template class 'dlwrap::trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::arg<2>' requested here
DLWRAP(cuMemAlloc, 3);
^
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:51:31: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:166:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:133:3: ...
/home/amd/llvm-project/openmp/libomptarget/include/dlwrap.h:186:37: ...
If the arity is too small, the diagnostic is better:
cuda/dynamic_cuda/cuda.cpp:34:1: error: too few
arguments to function call, expected 2, have 1
DLWRAP(cuMemAlloc, 1);
This patch changes the diagnostic to:
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '1 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 1);
or
cuda/dynamic_cuda/cuda.cpp:34:1: error:
static_assert failed due to requirement '3 == trait<cudaError_enum (*)(unsigned long *, unsigned
long)>::nargs' "Arity Error"
DLWRAP(cuMemAlloc, 3);
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D102858
[libomptarget][amdgpu] Remove majority of fatal errors
Replaces most calls to exit() with returning an error to the library entry
point. Minor changes to error handling for clear bugs, remove some dead code.
Each exit() call site replaced is either in a library entry point or a
function that already returns error codes on some paths. The existing handling
is not well tested but replacing exit() with a fallback path should be a strict
improvement.
Remaining two early exit points are an abort() from a callback and exit() from
within msgpack. Fixes for those are less obvious and left for a later patch.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D102346
[libomptarget] Disable test bug49334 on amdgpu
Hangs on amdgpu, do not know why. Disable to unblock build.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D102017
This patch moves g_executables to private member of Runtime class
and is renamed to HSAExecutables following LLVM naming convention.
This movement required making Runtime::Initialize and Runtime::Finalize
non-static. Verified the correctness of this change by running
libomptarget tests on gfx906.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102600