This patch allows the user to request all resources of a particular
layer (or core-attribute). The syntax of KMP_HW_SUBSET is modified
so the number of units requested is optional or can be replaced with an
'*' character.
e.g., KMP_HW_SUBSET=c:intel_atom@3 will use all the cores after offset 3
e.g., KMP_HW_SUBSET=*c:intel_core will use all the big cores
e.g., KMP_HW_SUBSET=*s,*c,1t will use all the sockets, all cores per
each socket and 1 thread per core.
Differential Revision: https://reviews.llvm.org/D115826
I missed the async info parameter in the first version of this API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115887
This patch extends the AMDGPU plugin for OpenMP target offloading from using a single HSA queue to multiple queues (four in this patch) per device. This enables concurrent threads to concurrently submit kernel launches to the same GPU.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115771
Just defensive CMake-ing. I pulled this from D115544 and D99484 which
are blocked on some lldb CI failures I don't yet understand. Hoping to land
something smaller in the meantime.
Reviewed By: #libc, ldionne
Differential Revision: https://reviews.llvm.org/D115566
The `not` program is used to test executions prefixed with `%libomptarget-run-fail-`. Currently `not` is not used for libomp tests, but might be used in the future and its dependency does not add any additional burden over the already established `FileCheck` dependency.
Required to add libomptarget testing to the Phabricator pre-merge check (see https://github.com/google/llvm-premerge-checks/issues/368)
Reviewed By: jdenny, JonChesterfield
Differential Revision: https://reviews.llvm.org/D115454
This reverts commit 492de35df4.
I tried to apply John's changes in 8d897ec915 that were expected to
fix his patch but that didn't work unfortunately.
Reverting this again to fix the macOS bots and leave him more time to
investigate the issue.
This reverts commit 797b50d4be.
See the original D99484. @mib who noticed the original problem could not longer
reproduce it, after I tried and also failed. We are threfore hoping it went
away on its own!
Reviewed By: mib
Differential Revision: https://reviews.llvm.org/D115544
Allow filtering of resources based on core attributes. There are two new
attributes added:
1) Core Type (intel_atom, intel_core)
2) Core Efficiency (integer) where the higher the efficiency, the more
performant the core
On hybrid architectures , e.g., Alder Lake, users can specify
KMP_HW_SUBSET=4c:intel_atom,4c:intel_core to select the first four Atom
and first four Big cores. The can also use the efficiency syntax. e.g.,
KMP_HW_SUBSET=2c:eff0,2c:eff1
Differential Revision: https://reviews.llvm.org/D114901
In the OpenMC app we saw `omp target update` spending an awful lot of
time in the shadow map traversal without ever doing any update there.
There are two cases that allow us to avoid the traversal completely.
The simplest thing is that small updates cannot (reasonably) contain
an attached pointer part. The other case requires to track in the
mapping table if an entry might contain an attached pointer as part.
Given that we have a single location shadow map entries are created,
the latter is actually fairly easy as well.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D113124
and synchronous kernel launch implementations into a single
synchronous version. This patch prepares the plugin for asynchronous
implementation by:
Privatizing actual kernel launch code (valid in both cases) into
an anonymous namespace base function (submitted at D115267)
- Separating the control flow path of asynchronous and synchronous
kernel launch functions** (this diff)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115273
D113602 broke the custom state machine when a reduction is present, as
revealed by the reproducer this patch adds to the test suite. In that
case, openmp-opts changes the return value to undef in
`__kmpc_get_warp_size` (which the custom state machine calls as of
D113602). Later optimizations then optimize away the custom state
machine code as if all threads are outside the thread block, so the
target region does not execute. D114802 fixed that but didn't add a
reproducer.
This patch also adds a `__OMP_RTL_ATTRS` entry for
`__kmpc_get_warp_size` to OMPKinds.def, which D113602 missed. This
change does not seem to have any impact on the reduction problem.
Reviewed By: JonChesterfield, jdoerfert
Differential Revision: https://reviews.llvm.org/D113824
The problem with the old scheme is that we would need to keep track of
the "next region" and reset the num_threads value after it. The new RT
doesn't do it and an assertion is triggered. The old RT doesn't do it
either, I haven't tested it but I assume a num_threads clause might
impact multiple parallel regions "accidentally". Further, in SPMD mode
num_threads was simply ignored, for some reason beyond me.
In any case, parallel_51 is designed to take the clause value directly,
so let's do that instead.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D113623
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
Regardless that specification requires thread_limit to be positive,
it is better to warn user instead of crash in case the value is negative.
Differential Revision: https://reviews.llvm.org/D115340
Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115279
At present, amdgpu plugin merges both asynchronous and synchronous kernel launch implementations into a single synchronous version.
This patch prepares the plugin for asynchronous implementation by:
- Privatizing actual kernel launch code (valid in both cases) into an anonymous namespace base function
Actual separation of kernel launch code (async vs sync) is a following patch.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115267
In the "runtimes" setup, the runtime (e.g. OpenMP) can be built for
a target entirely different from the current host build (where LLVM
and Clang are built). If profiling is enabled, libomptarget links
against LLVMSupport (which only has been built for the host).
Thus, don't enable profiling by default in this setup.
This should allow relanding D113253.
Differential Revision: https://reviews.llvm.org/D114083
amdgpu plugin depends on libhsa-runtime64 library. Add runpath in case it is not on the LD_LIBRARY_PATH.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D115198
Minor fix to the lit.cfg. Currently, nvptx runs the tests twice on the new runtime.
Soon, amdgpu will run them on the new runtime as well as the old.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115150
These tests tend to hang or crash on hardware that doesn't
support USM. Disabling them helps diagnose other issues. To safely
enable we require a means of testing whether USM is expected to work.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D115144
When OpenMP is compiled as a part runtimes for multiple targets, openmp
is compiled under build/runtimes/runtimes-arch-unknown-linux-gnu-bins
directory. Old implementation treats this directory name as errors.
This patch adds a guard like "[Uu]known[^-]".
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D114346
This was trying to figure out the build path for amdgpu-arch, and
making assumptions about where it is which were not working on my
system. Whether a standalone build or not, we should have a proper
imported target to get the location from.
OpenMP (compiler) does not currently request any implicit kernel
arguments. OpenMP (runtime) allocates and initialises a reasonable guess at
the implicit kernel arguments anyway.
This change makes the plugin check the number of explicit arguments, instead
of all arguments, and puts the pointer to hostcall buffer in both the current
location and at the offset expected when implicit arguments are added to the
metadata by D113538.
This is intended to keep things running while fixing the oversight in the
compiler (in D113538). Once that patch lands, and a following one marks
openmp kernels that use printf such that the backend emits an args element
with the right type (instead of hidden_node), the over-allocation can be
removed and the hardcoded 8*e+3 offset replaced with one read from the
.offset of the corresponding metadata element.
Reviewed By: estewart08
Differential Revision: https://reviews.llvm.org/D114274
A function with no definition was left in the old runtime, causing
linker errors when trying to compile.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D114264
Removes a +x/-x pair on the only store/load of a variable
and deletes some nearby dead code. Also reduces the size of the implicit
struct to reflect the code currently emitted by clang.
Differential Revision: https://reviews.llvm.org/D114270
Eachempati.
This patch adds clang (parsing, sema, serialization, codegen) support for the 'depend' clause on the 'taskwait' directive.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D113540
Teach the HWLOC topology method how to detect Atom and Core
types so hybrid CPUs are properly detected and represented when using
the HWLOC topology method.
Differential Revision: https://reviews.llvm.org/D112270
The current implementation of Windows Processor Groups has
a separate topology method to handle them. This patch deprecates
that specific method and uses the regular CPUID topology
method by default and inserts the Windows Processor Group objects
in the topology manually.
Notes:
* The preference for processor groups is lowered to a value less than
socket so that the user will see sockets in the KMP_AFFINITY=verbose
output instead of processor groups when sockets=processor groups.
* The topology's capacity is modified to handle additional topology layers
without the need for reallocation.
* If a user asks for a granularity setting that is "above" the processor
group layer, then the granularity is adjusted "down" to the processor
group since this is the coarsest layer available for threads.
Differential Revision: https://reviews.llvm.org/D112273
If some CPUs are offline, then make sure they are not included in the
fullMask even if norespect is given to KMP_AFFINITY.
Differential Revision: https://reviews.llvm.org/D112274
Remove restriction forcing users to specify the KMP_HW_SUBSET value in
topology order. This patch sorts the user KMP_HW_SUBSET value before
trying to apply it. For example: 1s,4c,2t is equivalent to 2t,1s,4c
Differential Revision: https://reviews.llvm.org/D112027
The RAII class used for debugging RTL entry used a shared variable to
keep track of the current depth. This used a global initializer, which
isn't supported on AMDGPU. This patch removes the initializer and
instead sets it to zero when the state is initialized in the runtime.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D113963