Declarations of 5.1 atomic entries were added under
"#if KMP_ARCH_X86 || KMP_ARCH_X86_64" in kmp_atomic.h,
but definitions of the functions missed architecture guard in kmp_atomic.cpp.
As a result mangled symbols were available on non-x86 architecture.
The patch eliminates these unexpected symbols from the library.
Differential Revision: https://reviews.llvm.org/D112261
__ompt_get_task_info_internal function is adapted to support thread_num
determination during the execution of multiple nested serialized
parallel regions enclosed by a regular parallel region.
Consider the following program that contains parallel region R1 executed
by two threads. Let the worker thread T of region R1 executes serialized
parallel regions R2 that encloses another serialized parallel region R3.
Note that the thread T is the master thread of both R2 and R3 regions.
Assume that __ompt_get_task_info_internal function is called with the
argument "ancestor_level == 1" during the execution of region R3.
The function should determine the "thread_num" of the thread T inside
the team of region R2, whose implicit task is at level 1 inside the
hierarchy of active tasks. Since the thread T is the master thread of
region R2, one should expected that "thread_num" takes a value 0.
After the while loop finishes, the following stands: "lwt != NULL",
"prev_lwt == NULL", "prev_team" represents the team information about
the innermost serialized parallel region R3. This results in executing
the assignment "thread_num = prev_team->t.t_master_tid". Note that
"prev_team->t.t_master_tid" was initialized at the moment of
R2’s creation and represents the "thread_num" of the thread T inside
the region R1 which encloses R2. Since the thread T is the worker thread
of the region R1, "the thread_num" takes value 1, which is a contradiction.
This patch proposes to use "lwt" instead of "prev_lwt" when determining
the "thread_num". If "lwt" exists, the task at the requested level belongs
to the serialized parallel region. Since the serialized parallel region
is executed by one thread only, the "thread_num" takes value 0.
Similarly, assume that __ompt_get_task_info_internal function is called
with the argument "ancestor_level == 2" during the execution of region R3.
The function should determine the "thread_num" of the thread T inside the
team of region R1. Since the thread is the worker inside the region R1,
one should expected that "thread_num" takes value 1. After the loop finishes,
the following stands: "lwt == NULL", "prev_lwt != NULL", "prev_team" represents
the team information about the innermost serialized parallel region R3.
This leads to execution of the assignment "thread_num = 0", which causes
a contradiction.
Ignoring the "prev_lwt" leads to executing the assignment
"thread_num = prev_team->t.t_master_tid" instead. From the previous explanation,
it is obvious that "thread_num" takes value 1.
Note that the "prev_lwt" variable is marked as unnecessary and thus removed.
This patch introduces the test case which represents the OpenMP program
described earlier in the summary.
Differential Revision: https://reviews.llvm.org/D110699
__kmp_fork_call sets the enter_frame of the active task (th_curren_task)
before new parallel region begins. After the region is finished, the
enter_frame is cleared.
The old implementation of __kmpc_fork_call didn’t clear the enter_frame of
active task.
Also, the way of initializing the enter_frame of the active task was wrong.
Consider the following two OpenMP programs.
The first program: Let R1 be the serialized parallel region that encloses
another serialized parallel region R2. Assume that thread that executes R2 is
going to create a new serialized parallel region R3 by executing
__kmpc_fork_call. This thread is responsible to set enter_frame of R2's
implicit task. Note that the information about R2's implicit task is present
inside master_th->th.th_current_task at this moment, while lwt represents the
information about R1's implicit task. The old implementation uses lwt and
resets enter_frame of R1's implicit task instead of R2's implicit task. The
new implementation uses master_th->th.th_current_task instead.
The second program: Consider the OpenMP program that contains parallel region
R1 which encloses an explicit task T. Assume that thread should create another
parallel region R2 during the execution of the T. The __kmpc_fork_call is
responsible to create R2 and set enter frame of T whose information is present
inside the master_th->th.th_current_task.
Old implementation tries to set the frame of
parent_team->t.t_implicit_task_taskdata[tid] which corresponds to the implicit
task of the R1, instead of T.
Differential Revision: https://reviews.llvm.org/D112419
As discussed in D108488, testing for invariants of omp_get_wtime would be more
reliable than testing for duration of sleep, as return from sleep might be
delayed due to system load.
Alternatively/in addition, we could compare the time measured by omp_get_wtime
to time measured with C++11 chrono (for portability?).
Differential Revision: https://reviews.llvm.org/D112458
The CHECK: line in the test had no effect, because the test does not
pipe to FileCheck. Since the test only checks for a single value,
encode the result in the return value of the test.
Where possible change to declare the variable before the loop.
Where not possible, specifically request -std=c99 (could be limited to
specific compilers like icc).
Implemented by patching python config instead of modifying all
the tests so that -generic and XFAIL work as usual. Expectation is for
this to be reverted once the old runtime is deleted.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D112225
KMP_API_NAME_GOMP_PARALLEL_SECTIONS function was missing the task frame support.
This patch introduced a fix responsible to set properly the exit_frame of
the innermost implicit task that corresponds to the parallel section construct,
as well as the enter_frame of the task that encloses the mentioned implicit task.
This patch also introduced a simple test case sections_serialized.c that contains
serialized parallel section construct and validates whether the mentioned
task frames are set correctly.
Differential Revision: https://reviews.llvm.org/D112205
Step towards building the DeviceRTL for amdgpu.
Mostly replaces cuda-specific toolchain finding logic with the
generic logic currently found in the amdgpu deviceRTL cmake. Also
deletes dead code and changes the default to build on systems
without cuda installed, as the library doesn't use cuda and the
amdgpu-only systems generally won't have cuda installed.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D111983
The plugin currently uses a macro to check if this is a debug built
before assigning the debug kind variable to the device environment
struct. This is being deprecated because the new device runtime does not
maintain separate debug builds and should always be availible.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112083
This patch allows to simplify compiler implementation on "taskwait nowait"
construct. The "taskwait nowait" is semantically equivalent to the empty task.
Instead of creating an empty routine as a task entry, compiler can just send
NULL pointer to the runtime. Then the runtime will make all the work with
dependences and return because of the absent task routine.
Differential Revision: https://reviews.llvm.org/D112015
__ompt_get_task_info_internal is now able to determine the right value of the
“thread_num” argument during the execution of an explicit task.
During the execution of a while loop that iterates over the ancestor tasks
hierarchy, the “prev_team” variable was always set to “team” variable at the
beginning of each loop iteration.
Assume that the program contains a parallel region which encloses an explicit
task executed by the worker thread of the region. Also assume that the tool
inquires the “thread_num” of a worker thread for the implicit task that
corresponds to the region (task at “ancestor_level == 1”) and expects to
receive the value of “thread_num > 0”.
After the loop finishes, both “team” and “prev_team” variables are equal and
point to the team information of the parallel region.
The “thread_num” is set to “prev_team->t.t_master_tid”, that is equal to
“team->t.t_master_tid”. In this case, “team->t.t_master_tid” is 0, since
the master thread of the region is the initial master thread of the program.
This leads to a contradiction.
To prevent this, “prev_team” variable is set to “team” variable only at the
time when the loop that has already encountered the implicit task (“taskdata”
variable contains the information about an implicit task) continues iterating
over the implicit task’s ancestors, if any.
After the mentioned loop finishes, the “prev_team” variable might be equal to
NULL. This means that the task at requested “ancestor_level” belongs to the
innermost parallel region, so the “thread_num” will be determined by calling
the “__kmp_get_tid”.
To prove that this patch works, the test case “explicit_task_thread_num.c” is
provided.
It contains the example of the program explained earlier in the summary.
Differential Revision: https://reviews.llvm.org/D110473
Older intel compilers miss the privatization of nested loop variables for
doacross loops. Declaring the variable in the loop makes the test more
robust.
D110279 introduced a bug to the device runtime. In `__kmpc_parallel_51`, we detect
whether we are already in parallel region by `__kmpc_parallel_level() > __kmpc_is_spmd_exec_mode()`.
It is based on the assumption that:
- In SPMD mode, parallel level is initialized to 1.
- In generic mode, parallel level is initialized to 0.
- `__kmpc_is_spmd_exec_mode` returns `1` for SPMD mode, 0 otherwise.
Because the return value type of `__kmpc_is_spmd_exec_mode` is `int8_t`, there
was an implicit cast from `bool` to `int8_t`. We can make sure it is either 0 or
1 since C++14. In D110279, the return value is the result of an `and` operation,
which is 2 in SPMD mode. This breaks the assumption in `__kmpc_parallel_51`.
Reviewed By: carlo.bertolli, dpalermo
Differential Revision: https://reviews.llvm.org/D111905
Detect, through CPUID.1A, and show user different core types through
KMP_AFFINITY=verbose mechanism. Offer future runtime optimizations
__kmp_is_hybrid_cpu() to know whether running on a hybrid system or not.
Differential Revision: https://reviews.llvm.org/D110435
This patch implements teams affinity on the host.
The default is spread. A user can specify either spread, close, or
primary using KMP_TEAMS_PROC_BIND environment variable. Unlike
OMP_PROC_BIND, KMP_TEAMS_PROC_BIND is only a single value and is not a
list of values. The values follow the same semantics under the OpenMP
specification for parallel regions except T is the number of teams in
a league instead of the number of threads in a parallel region.
Differential Revision: https://reviews.llvm.org/D109921
Added functions those implement "atomic compare".
Though clang does not use library interfaces to implement OpenMP atomics,
the functions added for consistency.
Also added missed functions for 80-bit floating min/max atomics.
Differential Revision: https://reviews.llvm.org/D110109
Replaced storing of ittnotify domain array index into
location info structure (which is now read-only) with storing of
(location info address + ittnotify domain + team size) into hash map.
Replaced __kmp_itt_barrier_domains and __kmp_itt_imbalance_domains arrays with
__kmp_itt_barrier_domains hash map; __kmp_itt_region_domains and
__kmp_itt_region_team_size arrays with __kmp_itt_region_domains hash map.
Basic functionality did not change (at least tried to not change).
The patch fixes https://bugs.llvm.org/show_bug.cgi?id=48644.
Differential Revision: https://reviews.llvm.org/D111580
This patch adds support for the
`__kmpc_get_hardware_num_threads_in_block` function that returns the
number of threads. This was missing in the new runtime and was used by
the AMDGPU plugin which prevented it from using the new runtime. This
patchs also unified the interface for getting the thread numbers in the
frontend.
Originally authored by jdoerfert.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D111475
Until we hit the first barrier we should not call `mapping::isSPMDMode`
with all threads. Instead, we now have (and use during initialization) a
`mapping::isMainThreadInGenericMode` overload that takes the known
SPMD-mode state and one that queries it.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111381
This patch adds an external interface to access the dynamic shared
memory buffer in the device runtime. The function introduced is
``llvm_omp_get_dynamic_shared``. This includes a host-side
definition that only returns a null pointer so that it can be used when
host-fallback is enabled without crashing. Support for dynamic shared
memory was also ported to the old device runtime.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110957
For NVPTX, `printf` can be used just with a function declaration. For AMDGCN, an
function definition is added, but it simply returns.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109728
We need to synchronize the threads *before* we destroy the RAII objects
that hold the old values and not after to avoid threads executing the
parallel region but seeing an inconsistent state.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D111369
Follow on to D110006, related to D110957
Where implementations have diverged this resolves to match the new DeviceRTL
- replaces definitions of this struct in deviceRTL and plugins with include
- changes the dynamic_shared_size field from D110006 to 32 bits
- handles stdint being unavailable in DeviceRTL
- adds a zero initializer for the field to amdgpu
- moves the extern declaration for deviceRTL to target_interface
(omptarget.h is more natural, but doesn't work due to include order
with debug.h)
- Renames the fields everywhere to match the LLVM format used in DeviceRTL
- Makes debug_level uint32_t everywhere (previously sometimes int32_t)
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111069
The hand-rolled linking logic in elf_common does not account for
the possibility of using LLVM dylib rather than a dozen static
libraries. Since it does not seem to be easily convertible
to add_llvm_library, just hand-roll support for LLVM_LINK_LLVM_DYLIB.
This is necessary to support stand-alone builds against installed LLVM.
Differential Revision: https://reviews.llvm.org/D111038
Add a FAQ entry about the names of openmp offloading components
and how they are searched for.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D109619
Fixes 51982. Adds a missing CreatePointerCast and allocates a global in
the correct address space.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
Use enum for execution mode.
This is partly a port from ROCm and partly a port from D110029. Attempted to
make the same choices as ROCm as far as comments etc go to reduce the merge
conflicts.
There is some cleanup warranted here - in particular I like the cuda patch
factoring out the comparisons into named variables - but I'd like to leave
that for a follow up patch, keeping this one minimal.
Reviewed By: carlo.bertolli
Differential Revision: https://reviews.llvm.org/D110845
Fixes: SWDEV-275232 (With contributions from Ammar Elwazir, Laurent Morichetti, and Tony Tye)
The current code is racy. After the packet is submitted, the GPU will increment the read index. If this wraps around before the memory is read from it'll refer to a signal from an unrelated packet. Change avoids reading from the packet post-submission.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110679
Fixes 51982. Minor refactor to remove `return x = y` construct.
Test case derived from https://github.com/ROCm-Developer-Tools/aomp/\
blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting
parts while checking the assertion failure still occurred.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
The minor code refactorization introduces the TASK_TIED constant inside
kmp_gsupprot.cpp as a replacement for the literal value 1.
The mentioned constant is now used in both kmp_tasking.cpp and
kmp_gsupport.cpp files.
Differential Revision: https://reviews.llvm.org/D110441
This path defines the newly added `__kmpc_disitrute_static_init`
functions in the device runtime library. These functions are currently
exact copies of the current worksharing method but can be tuned later.
Depends on D110429
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110430
Use the in-project clang, llvm-link and opt if available and unless
CMake cache variables specify to use a different compiler. This applies
D101265 to the new DeviceRTL's CMakeLists.txt which was copied before
D101265 was applied.
Fixes the openmp-offloading-cuda-runtime builder which was failing
since D110006.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110251
Store queues in unique_ptr so they are destroyed when the global DeviceInfo is. Currently they leak which raises an assert in debug builds of hsa.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109511
This patch fixes a data-race observed when using the new device runtime
library. The Internal control variable for the parallel level is read in
the `__kmpc_parallel_51` function while it could potentially be written
by other threads. This causes data corruption and will cause
nondetermistic behaviour in the runtime. This patch fixes this by adding
an explicit synchronization before the region starts.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110366
This is a follow-up of D110029, which uses bitset to indicate execution mode. This patches makes the changes in the function call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110279
This patch adds support for an RAII struct that will print function
traces when placed inside of a function declaration. Each successive
call will increase the indentation to make it easier to visually
inspect.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110202
The execution mode of a kernel is stored in a global variable, whose value means:
- 0 - SPMD mode
- 1 - indicates generic mode
- 2 - SPMD mode execution with generic mode semantics
We are going to add support for SIMD execution mode. It will be come with another
execution mode, such as SIMD-generic mode. As a result, this value-based indicator
is not flexible.
This patch changes to bitset based solution to encode execution mode. Each
position is:
[0] - generic mode
[1] - SPMD mode
[2] - SIMD mode (will be added later)
In this way, `0x1` is generic mode, `0x2` is SPMD mode, and `0x3` is SPMD mode
execution with generic mode semantics. In the future after we add the support for
SIMD mode, `0b1xx` will be in SIMD mode.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110029
Summary:
The thread ID function was reintroduced in D110195, but could
potentially be removed by the optimizer. Make the function noinline to
preserve the call sites and add it to the externalization RAII so its
definition is not removed by the attributor.
The new device runtime library currently lacks the
`kmpc_get_hardware_thread_id_in_block` function which is currently used
when doing the SPMDzation optimization. This call would be introduced
through the optimization and then cause a linking error because it was
not present. This patch adds support for this runtime call.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110195
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
Reviewed By: jdoerfert, jhuber6
Differential Revision: https://reviews.llvm.org/D102107
The indirect lock table can exhibit a race condition during initializing
and setting/unsetting locks. This occurs if the lock table is
resized by one thread (during an omp_init_lock) and accessed (during an
omp_set|unset_lock) by another thread.
The test runtime/test/lock/omp_init_lock.c test exposed this issue and
will fail if run enough times.
This patch restructures the lock table so pointer/iterator validity is
always kept. Instead of reallocating a single table to a larger size, the
lock table begins preallocated to accommodate 8K locks. Each row of the
table is allocated as needed with each row allowing 1K locks. If the 8K
limit is reached for the initial table, then another table, capable of
holding double the number of locks, is allocated and linked
as the next table. The indices stored in the user's locks take this
linked structure into account when finding the lock within the table.
Differential Revision: https://reviews.llvm.org/D109725
This patch adds support for using dynamic shared memory in the new
device runtime. The new function `__kmpc_get_dynamic_shared` will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.
In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110006
This patch adds fields for the device number and number of devices into
the device environment struct and debugging values.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110004
This patch implements the `__assert_fail` function in the new device
runtime. This allows users and developers to use the standars assert
function inside of the device.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D109886
The third-party ittnotify sources updated from https://github.com/intel/ittapi.
Changes applied:
- llvm license aded to all files; initial BSD license saved in LICENSE.txt;
- clang-formatted;
- renamed *.c to *.cpp, similar to what we did with all our sources;
- added #include "kmp_config.h" with definition of INTEL_ITTNOTIFY_PREFIX macro
into ittnotify_static.cpp.
Differential Revision: https://reviews.llvm.org/D109333
GOMP depobjs are represented as a two intptr_t array. The first
element is the base address of the dependency and the second element
is the flag indicating the type the depobj represents.
Differential Revision: https://reviews.llvm.org/D108790
OPENMP_INSTALL_LIBDIR is set to the installation path of shared and static
libompd.This should avoid the mixing of 32 and 64 bit on same path in
multi-lib set-up.
Reviewed By: @mceier
Differential Revision: https://reviews.llvm.org/D109352
We peform runtime folding, but do not currently emit remarks when it is
performed. This is because it comes from the runtime library and is
beyond the users control. However, people may still wish to view this
and similar information easily, so we can enable this behaviour using a
special flag to enable verbose remarks.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D109627
The defintion of OFFLOAD_SUCCESS and OFFLOAD_FAIL used in plugin APIs and libomptarget public APIs are not consistent.
Create __tgt_target_return_t for libomptarget public APIs.
Differential Revision: https://reviews.llvm.org/D109304
This should have happened a long time ago, now that openmp.llvm.org
redirects to openmp.llvm.org/docs we completely switched over to the
sphinx documentation page instead.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D108588
The hsa library must be initialized before any calls into it and
destructed after the last call into it. There have been a number of bugs in
this area related to member variables which would like to use raii to manage
resources acquired from hsa.
This patch moves the init/shutdown of hsa into a class, such that when used as
the first member variable (could be a base), the lifetime of other member
variables are reliably scoped within it. This will allow other classes to use
raii reliably when used as member variables within the global.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D109512
Given D109057, change test runner to use the libomptarget-x-bc-path
argument instead of the LIBRARY_PATH environment variable to find the device
library.
Also drop the use of LIBRARY_PATH environment variable as it is far
too easy to pull in the device library from an unrelated toolchain by accident
with the current setup. No loss in flexibility to developers as the clang
commandline used here is still available.
Reviewed By: jdoerfert, tianshilei1992
Differential Revision: https://reviews.llvm.org/D109061
New omp_all_memory task dependence type is implemented.
Library recognizes the new type via either
(dependence_address == NULL && dependence_flag == 0x80)
or
(dependence_address == SIZE_MAX).
A task with new dependence type depends on each preceding task
with any dependence type (kind of a dependence barrier).
Differential Revision: https://reviews.llvm.org/D108574
The new interface only marks begin/end of a scope construct for
corresponding OMPT events, and we can use existing interfaces for
reduction operations.
Differential Revision: https://reviews.llvm.org/D108062
This patch changes the default monotonicity of dynamic schedule from
monotonic to non-monotonic when no modifier is specified.
Differential Revision: https://reviews.llvm.org/D109026
Using std::vector<DeviceTy> requires implementing copy constructor and copied assign operator for DeviceTy.
Indeed DeviceTy should never be copied. After changing to std::vector<std::unique_ptr<DeviceTy>>,
All the unsafe copy constructor and copy assign operator implementations can be removed.
Compilers mark them deleted due to mutex or underlying objects and this is the desired behavior.
Differential Revision: https://reviews.llvm.org/D109276
Use the same debug print as the rest of libomptarget plugins with
the same environment control. Also drop the max queue size debugging hook as
I don't believe it is still in use, can bring it back near the rest of the env
handling in rtl.cpp if someone objects.
That makes most of rt.h and all of utils.cpp unused. Clean that up and simplify
control flow in a couple of places.
Behaviour change is that debug prints that used to use the old environment
variable now use the new one and print in slightly different format, and the
removal of the max queue size variable.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D108784
Use unique_ptr to achieve the effect of mutable.
Remove mutable keyword of DynRefCount and HoldRefCount
Remove std::shared_ptr from UpdateMtx
Reviewed By: tianshilei1992, grokos
Differential Revision: https://reviews.llvm.org/D109007