This is a preview of allocator support for target memory. It depends on the
offload runtime API, which allocates memory as described below; a short usage
sketch follows the list.
llvm_omp_target_alloc_host(size_t size, int device_num);
-- Returns non-migratable memory owned by host.
-- Memory is accessible by host and device(s).
llvm_omp_target_alloc_shared(size_t size, int device_num);
-- Returns migratable memory owned by host and device.
-- Memory is accessible by host and device.
llvm_omp_target_alloc_device(size_t size, int device_num);
-- Returns memory owned by device.
-- Memory is only accessible by device.
New memory space and predefined allocator names are
-- llvm_omp_target_host_mem_space
-- llvm_omp_target_shared_mem_space
-- llvm_omp_target_device_mem_space
-- llvm_omp_target_host_mem_alloc
-- llvm_omp_target_shared_mem_alloc
-- llvm_omp_target_device_mem_alloc
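A rough usage sketch, assuming the predefined allocator handles listed above
are exposed through <omp.h> by this implementation:
```
#include <cstddef>
#include <omp.h>

int main() {
  const size_t N = 1024;
  int *Host = (int *)omp_alloc(N * sizeof(int), llvm_omp_target_host_mem_alloc);
  int *Shared =
      (int *)omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc);
  int *Device =
      (int *)omp_alloc(N * sizeof(int), llvm_omp_target_device_mem_alloc);

#pragma omp target is_device_ptr(Device, Shared)
  {
    Device[0] = 1; // device-only memory: accessible here, not on the host
    Shared[0] = 2; // migratable memory: accessible on host and device
  }
  Host[0] = 3; // pinned host memory: accessible from host and device(s)

  omp_free(Device, llvm_omp_target_device_mem_alloc);
  omp_free(Shared, llvm_omp_target_shared_mem_alloc);
  omp_free(Host, llvm_omp_target_host_mem_alloc);
  return 0;
}
```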
Differential Revision: https://reviews.llvm.org/D96669
If the mapped structure has data members that have 'default' mappers, these
members need to be mapped individually using their 'default' mappers.
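For illustration, a minimal sketch of the situation (the types and the mapper
are hypothetical):
```
struct T {
  int Len;
  int *Data;
};
// A "default" (unnamed) mapper for T that deep-copies the Data array.
#pragma omp declare mapper(T t) map(to : t.Len) map(tofrom : t.Data[0 : t.Len])

struct S {
  T Member; // must be mapped through T's default mapper
  int X;    // mapped bitwise
};

void useS(S &s) {
#pragma omp target map(tofrom : s)
  { s.Member.Data[0] += s.X; }
}
```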
Differential Revision: https://reviews.llvm.org/D92195
Cleanup changes:
- check the value read from the file;
- remove dead code;
- use an unsigned variable to read a hexadecimal number into;
- add a debug assertion to check the ref count.
Differential Revision: https://reviews.llvm.org/D96893
The stitching id could be overridden, causing a reference to a destroyed object
when the number of teams is 1. The patch separates the stitching id storage
locations for teams and for parallel regions nested in teams.
Differential Revision: https://reviews.llvm.org/D96562
Allow users to use a non-system version of perl, python and awk, which is useful
in certain package managers.
Reviewed By: JDevlieghere, MaskRay
Differential Revision: https://reviews.llvm.org/D95119
PR#49334 reports a crash when offloading to x86_64 with `target nowait`,
which is caused by dereferencing a nullptr. The root cause of the issue is that,
when pushing a hidden helper task in `__kmp_push_task`, the gtid is also mapped
to its shadow gtid, which is wrong.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97329
This makes sure that images are loaded in the order in which they are registered with libomptarget.
If a target can load multiple images and these images depend on each other (for example, if one image contains the program's target regions and one image contains library code), then the order in which images are loaded can be important for symbol resolution (for example, in the VE plugin).
In this case, because the same code exists in the host binaries, the order in which the host linker loads them (which is also the order in which images are registered with libomptarget) is the order in which the images have to be loaded onto the device.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95530
When including <ostream>, the register_callback macro of the OMPT callback.h
clashes with a function defined in <ostream>. This patch renames the macro
and includes ompt in the macro name.
`ptx71` is not supported in the release version of LLVM yet. As a result,
the support for CUDA 11.2 and CUDA 11.1 caused a compilation error as mentioned
in D97004. Since the support in D97004 is just a workaround for the release, and
we'll not use it in the near future, using `ptx70` for CUDA 11 is feasible.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D97195
This code alleviates some pathological loop parameters (lower,
upper, stride) within the calculations involved in the static loop code. It
bounds the chunk size to the trip count if it is greater than the trip
count and also minimizes problematic code for when the trip count is less than
the number of threads.
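A minimal sketch of the kind of clamping involved, with illustrative names
rather than the actual static-loop code:
```
typedef unsigned long long UT; // unsigned trip-count type

static UT clamp_chunk(UT chunk, UT trip_count, UT nth) {
  if (chunk > trip_count)
    chunk = trip_count; // never hand out more iterations than exist
  if (trip_count < nth) {
    // Fewer iterations than threads: give each thread at most one iteration
    // to avoid problematic arithmetic in the general path.
    chunk = 1;
  }
  return chunk;
}
```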
Differential Revision: https://reviews.llvm.org/D96426
Only attempt shutdown if lpReserved is NULL. The Windows documentation
states:
When handling DLL_PROCESS_DETACH, a DLL should free resources such as
heap memory only if the DLL is being unloaded dynamically (the
lpReserved parameter is NULL). If the process is terminating (the
lpReserved parameter is non-NULL), all threads in the process except the
current thread either have exited already or have been explicitly
terminated by a call to the ExitProcess function, which might leave some
process resources such as heaps in an inconsistent state. In this case,
it is not safe for the DLL to clean up the resources. Instead, the DLL
should allow the operating system to reclaim the memory.
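A sketch of the resulting pattern (not the actual libomp DllMain, just the
shape the documentation implies):
```
#include <windows.h>

BOOL WINAPI DllMain(HINSTANCE hInst, DWORD fdwReason, LPVOID lpReserved) {
  switch (fdwReason) {
  case DLL_PROCESS_DETACH:
    if (lpReserved == NULL) {
      // Dynamic unload (FreeLibrary): safe to run the runtime shutdown and
      // free heaps, threads, and other resources here.
    }
    // Process termination (lpReserved != NULL): do nothing and let the
    // operating system reclaim the memory.
    break;
  default:
    break;
  }
  return TRUE;
}
```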
Differential Revision: https://reviews.llvm.org/D96750
This patch limits the number of dispatch buffers (used for
loop worksharing construct) to between 1 and 4096.
Differential Revision: https://reviews.llvm.org/D96749
As mentioned in PR#49250, without this patch, ptxas for CUDA 9.1 fails
in the following two tests:
- openmp/libomptarget/test/mapping/lambda_mapping.cpp
- openmp/libomptarget/test/offloading/bug49021.cpp
The error looks like:
```
ptxas /tmp/lambda_mapping-081ea9.s, line 828; error : Not a name of any known instruction: 'activemask'
```
The problem is that our cmake script converts CUDA version strings
incorrectly: 9.1 becomes 9100, but it should be 9010, as shown in
`getCudaVersion` in `clang/lib/Driver/ToolChains/Cuda.cpp`. Thus,
`openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu`
inadvertently enables `activemask` because it apparently becomes
available in 9.2. This patch fixes the conversion.
This patch does not fix the other two tests in PR#49250.
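For reference, CUDA versions are encoded as major * 1000 + minor * 10 (the
CUDA_VERSION convention used by `getCudaVersion`); a sketch of the arithmetic
the CMake conversion has to reproduce:
```
// 9.1 -> 9010, 9.2 -> 9020, 11.0 -> 11000.
static int encodeCudaVersion(int Major, int Minor) {
  return Major * 1000 + Minor * 10;
}
// encodeCudaVersion(9, 1) == 9010, not the 9100 the old conversion produced.
```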
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97012
Without this patch, there's a runtime error for those map types at exit from
an "omp target data" or at "omp target exit data" when the data is not present,
but the spec says the list item should be ignored.
This patch tests that fix in data_absent_at_exit.c, and it also
improves other testing for data that is not fully present at exit.
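A minimal example of the scenario (the `from` map type here is just
illustrative of a list item that is absent at exit):
```
#include <cstdio>

int main() {
  int X = 0;
  // X was never mapped, so it is not present on the device at this point.
  // Per the spec the list item should be ignored rather than triggering a
  // runtime error.
#pragma omp target exit data map(from : X)
  printf("%d\n", X);
  return 0;
}
```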
Reviewed By: grokos, RaviNarayanaswamy
Differential Revision: https://reviews.llvm.org/D96999
Allow bit masking to select various trace features.
bit 0 => Launch tracing (stderr)
bit 1 => timing of runtime (stdout)
bit 2 => detailed launch tracing (stderr)
bit 3 => timing goes to stdout instead of stderr
example: LIBOMPTARGET_KERNEL_TRACE=7 does it all
LIBOMPTARGET_KERNEL_TRACE=5 Launch + details
LIBOMPTARGET_KERNEL_TRACE=2 timings + launch to stderr
LIBOMPTARGET_KERNEL_TRACE=10 timings + launch to stdout
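A sketch of how such a bitmask could be decoded (illustrative only, not the
plugin code):
```
#include <cstdlib>

struct KernelTraceFlags {
  bool Launch;         // bit 0: launch tracing (stderr)
  bool Timing;         // bit 1: timing of the runtime
  bool DetailedLaunch; // bit 2: detailed launch tracing (stderr)
  bool TimingToStdout; // bit 3: timing goes to stdout instead of stderr
};

static KernelTraceFlags readKernelTrace() {
  const char *Env = std::getenv("LIBOMPTARGET_KERNEL_TRACE");
  unsigned long Mask = Env ? std::strtoul(Env, nullptr, 10) : 0;
  return {(Mask & 1) != 0, (Mask & 2) != 0, (Mask & 4) != 0, (Mask & 8) != 0};
}
```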
Differential Revision: https://reviews.llvm.org/D96998
OpenMP 5.0 removed a lot of restrictions on overlapped mapped items
compared to OpenMP 4.5. The patch restricts the checks for overlapped data
mappings to OpenMP 4.5 and earlier and reorders the mapping of the
arguments so that present and alloc mappings are processed first and
then all others.
Differential Revision: https://reviews.llvm.org/D86119
As reported by Guilherme Valarini [0], we used to pass stack allocations
to calls that can nowadays be asynchronous. This is arguably a problem
and it will inevitably result in UB. To remedy the situation we
allocate the locations as part of the AsyncInfoTy object. The lifetime
of that object matches what we need for now. If the synchronization is
not tied to the AsyncInfoTy object anymore we might need to have a
different buffer construct in global space.
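A rough sketch of the idea, with simplified names rather than the actual
libomptarget classes:
```
#include <deque>

// Instead of handing the address of a stack variable to a call that may
// complete asynchronously, the storage lives in the async-info object, whose
// lifetime extends until synchronization.
class AsyncInfoSketch {
  std::deque<void *> BufferLocations; // deque: addresses stay stable
public:
  void *&getVoidPtrLocation() {
    BufferLocations.push_back(nullptr);
    return BufferLocations.back();
  }
};
```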
This should be back-ported to LLVM 12 but needs slight modifications as
it is based on refactoring patches we do not need to backport.
[0] https://lists.llvm.org/pipermail/openmp-dev/2021-February/003867.html
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D96667
This patch unifies our libomptarget API in two ways:
- always pass a `__tgt_async_info` object, the Queue member decides if
it is in use or not.
- (almost) always synchronize in the interface layer and not in the
omptarget layer.
A side effect is that we now put all constructor and static initializer
kernels in a stream too, if the device utilizes `__tgt_async_info`.
The patch contains a TODO which can be addressed as we add support for
asynchronous malloc and free in the plugin API. This is the only
`synchronizeAsyncInfo` left in the omptarget layer.
Side note: On a V100 system the GridMini performance for small sizes
more than doubled.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D96379
The AsyncInfo should be passed everywhere and it should offer a way to
ensure synchronization, given a libomptarget Device.
This replaces D96431.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D96438
This unifies the API of `target` relative to `targetUpdateData` and
such.
Reviewed By: tianshilei1992, grokos
Differential Revision: https://reviews.llvm.org/D96429
The struct and enum alignments are kept by disabling clang-format for
that code region.
Reviewed By: tianshilei1992, JonChesterfield, grokos
Differential Revision: https://reviews.llvm.org/D96428
This silences warnings like these, in mingw builds with clang:
runtime/src/kmp_atomic.h:1021:13: warning: '__kmpc_atomic_cmplx8_rd' has C-linkage specified, but returns user-defined type 'kmp_cmplx64' (aka '__kmp_cmplx64_t') which is incompatible with C [-Wreturn-type-c-linkage]
runtime/src/z_Windows_NT_util.cpp:479:17: warning: cast from 'volatile void *' to 'type-parameter-0-0 *' drops volatile qualifier [-Wcast-qual]
flag = (C *)th->th.th_sleep_loc;
runtime/src/z_Windows_NT_util.cpp:1321:14: warning: cast to 'void *' from smaller integer type 'DWORD' (aka 'unsigned long') [-Wint-to-void-pointer-cast]
} else if ((void *)exit_val != (void *)th) {
Differential Revision: https://reviews.llvm.org/D96585
Add ifdefs around one function that is only used in Unix build
configurations.
Add a void cast for a Windows-specific function that currently is
unused but may be intended to be used at some point.
Differential Revision: https://reviews.llvm.org/D96584
These variables are used only in certain build configurations,
or marked with a todo comment indicating that they should be
used/checked/reported.
Differential Revision: https://reviews.llvm.org/D96582
MinGW build configurations don't support this pragma (unless
compiling with clang, with -fms-extensions, and linking with
lld), and at least clang warns about it.
This library does end up linked by the cmake files anyway (as
long as the check works properly).
Differential Revision: https://reviews.llvm.org/D96581
check_library_exists fails for stdcall functions, because that
check doesn't include the necessary headers (and thus fails with
an undefined reference to _EnumProcessModules, when the import
library symbol actually is called _EnumProcessModules@16).
Merge the two previous checks check_include_files and
check_library_exists into one with check_c_source_compiles, and
merge the variables that indicate whether it succeeded.
Differential Revision: https://reviews.llvm.org/D96580
[libomptarget][amdgcn] Build amdgcn devicertl as openmp
Change cmake to build as openmp and fix up some minor errors in the code.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D96533
Three minor changes in this patch:
- added UNLIKELY hints to a few rarely executed branches;
- replaced a couple of runtime checks with debug assertions;
- moved the check for the presence of the ittnotify tool out of the function call.
Differential Revision: https://reviews.llvm.org/D95816
This patch enables omp_get_num_devices() and omp_get_initial_device() on
Windows by providing an alternative to dlsym, and proposes to add a new
libomptarget entry, __tgt_get_num_devices().
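Example usage that this enables on Windows hosts:
```
#include <cstdio>
#include <omp.h>

int main() {
  printf("devices: %d, initial (host) device: %d\n", omp_get_num_devices(),
         omp_get_initial_device());
  return 0;
}
```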
Differential Revision: https://reviews.llvm.org/D96182
This patch adds lower-bound and upper-bound to the num_teams clause
according to the OpenMP 5.1 specification. The initial number of teams
created is implementation defined, but it will be greater than or
equal to the lower-bound and less than or equal to the upper-bound. If the
num_teams clause is not specified, the number of teams created is
implementation defined, but it will be greater than or equal to 1.
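A small example of the new clause form, using the OpenMP 5.1
`num_teams(lower : upper)` syntax:
```
#include <cstdio>
#include <omp.h>

int main() {
  // The implementation picks a team count in the range [2, 8].
#pragma omp teams num_teams(2 : 8)
  {
    if (omp_get_team_num() == 0)
      printf("got %d teams\n", omp_get_num_teams());
  }
  return 0;
}
```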
Differential Revision: https://reviews.llvm.org/D95820
[libomptarget][amdgcn] Tolerate deadstripped device_state variable
The device_state variable may have been deadstripped. Similar to
device_environment, leave detection of missing but used symbol to loader.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D96330
[libomptarget][amdgcn] Tolerate deadstripped env variable
Discovered by Pushpinder. If the device_environment variable is unused
it can be deadstripped, in which case we should not abort due to it
missing. This change is safe in that a missing symbol which is actually
used can be reported by both linker and loader, and a missing unused
symbol is better deadstripped than left in the image.
Reviewed By: pdhaliwal
Differential Revision: https://reviews.llvm.org/D96329
Currently, if there is no kernel argument (a case sketched below), device
synchronization will be skipped. This can lead to two issues:
1. If there is any device error, it will not be captured;
2. The target region might end before the kernel is done, which is not spec
conformant.
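The problematic case is as simple as a target region with no captured
arguments, e.g.:
```
int main() {
  // No captured variables, so no kernel arguments are passed. Without the
  // fix, the runtime skipped device synchronization after launching this
  // region, so a device-side error here could go unreported.
#pragma omp target
  {
  }
  return 0;
}
```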
The test added in this patch only runs on the NVPTX platform, although it will
not be executed by Phab at all. It also requires `not`, which is not available
on most systems.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D96067
The header `assert.h` needs to be included in order to use `assert` in the code.
When building NVPTX `deviceRTLs` on a CUDA-free system, it requires headers from
`gcc-multilib`, which some systems don't have. This patch drops the use of
`assert` in common parts of `deviceRTLs`. In light of
`openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.h`, a code block
```
if (!cond)
__builtin_trap();
```
is being used. The builtin will be translated to `call void @llvm.trap()`, and
the corresponding PTX is `trap;`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95986
The new affinity patch introduced legitimate sign-compare warnings that
clang doesn't report but GCC 10 does. This removes the warnings by
changing two variables' types to unsigned.
Differential Revision: https://reviews.llvm.org/D95818
From https://bugs.llvm.org/show_bug.cgi?id=48973, we know that
`std::call_once(PM->RTLs.initFlag, &RTLsTy::LoadRTLs, PM->RTLs)` causes compile
time problems in libstdc++v3 5.3.1. This is because there was a defect in the
standard regarding the `call_once` (LWG 2442). This was fixed in libstdc++ soon
thereafter, but there are likely other standard libraries where this will fail.
By matching this function call with the other one, we fix this bug.
Differential Revision: https://reviews.llvm.org/D95769
FileCheck is required for OpenMP tests. The current detection can fail
when building OpenMP in-tree if the user sets `LLVM_INSTALL_TOOLCHAIN_ONLY=ON`.
As a result, CMake will raise an error and the compilation will be broken. This
patch fixes the issue. When `FileCheck` is not a target, tests will just be skipped.
Reviewed By: jdoerfert, JonChesterfield
Differential Revision: https://reviews.llvm.org/D95689
One option for the LIBOMPTARGET_INFO environment variable is to print the current status of the device's data mappings. These are a shared resource among threads, so access needs to be protected when using multiple streams.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95786
This patch refines the logic to choose compute capabilities via the
environment variable `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES`. It supports the
following values (all case insensitive):
- "all": Build `deviceRTLs` for all supported compute capabilities;
- "auto": Only build for the auto-detected compute capability. Note that this
requires CUDA. If CUDA is not found, a CMake fatal error will be raised.
- "xx,yy" or "xx;yy": Build for compute capabilities `xx` and `yy`.
If `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES` is not set, it is equivalent to
setting it to `all`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95687
This patch introduces a new environment variable to force monotonic
behavior for users that absolutely need it. This is in anticipation
of the 5.0 change that uses non-monotonic behavior for dynamic scheduling
by default. Fixes for that and the actual switch are coming soon.
Differential Revision: https://reviews.llvm.org/D95263
Added release note for new `deviceRTLs` and hidden helper task for LLVM
12.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95584
This patch creates a new header file, `target_interface.h`, for declarations of all target-dependent functions. All future targets can get things working by simply implementing all functions declared in the header, plus the same macros/data as in each `target_impl.h`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95300
In the past `-O1` was used when building NVPTX bitcode libraries. After
we switched to OpenMP, `-O1` was missing by mistake, leading to a huge performance
regression.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95545
`[[clang::loader_uninitialized]]` is in the macro `SHARED`, but it doesn't
work for an array like `parallelLevel`, so the variable will be zero-initialized.
There is a similar issue for `omptarget_nvptx_device_State`, which is in the
global address space: its constructor is also generated, which was not the case
in the past when building the `deviceRTLs` with CUDA. In this patch, we add the
attribute to the two variables explicitly.
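An illustrative sketch of the intent, with simplified names and types rather
than the actual deviceRTLs declarations:
```
// Simplified stand-ins for the real deviceRTLs globals.
struct DeviceStateSketch {
  int Data[64];
};

// With the attribute, clang emits no zero-initialization for the array and no
// initializer for the global object, matching the old CUDA behavior.
[[clang::loader_uninitialized]] static int parallelLevelSketch[16];
[[clang::loader_uninitialized]] static DeviceStateSketch deviceStateSketch;
```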
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95550
A link error occurred when time profiling in libomp was enabled by default
because `libomp` is assumed to be a C library but the profiling dependency,
`libLLVMSupport`, is a C++ library. Currently the issue blocks all
OpenMP tests in Phabricator.
This patch adds a new CMake option, `OPENMP_ENABLE_LIBOMP_PROFILING`, to
enable/disable the feature. By default it is disabled. Note that once time
profiling is enabled for `libomp`, it becomes a C++ library.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95585
The remote offloading plugin's CMakeLists was trying to build if its
flag was enabled even if it didn't find gRPC/protobuf. The conditional
was wrong; this patch fixes it.
Differential Revision: https://reviews.llvm.org/D95574
D95466 dropped CUDA to build the NVPTX deviceRTL and enabled it by default.
However, the build requires some libraries that are not available on non-CUDA
systems by default, which could break the compilation. This patch disables the
build by default. It can be enabled with `LIBOMPTARGET_BUILD_NVPTX_BCLIB=ON`.
Reviewed By: kparzysz
Differential Revision: https://reviews.llvm.org/D95556
When OMP_PLACES contains an invalid value, the warning informs the user
that the fallback is OMP_PLACES=threads, but the actual internal setting
is OMP_PLACES=cores and is detected as such with KMP_SETTINGS=1.
This patch informs the user that OMP_PLACES=cores is being used instead
of OMP_PLACES=threads.
Differential Revision: https://reviews.llvm.org/D95170
This patch adds a new algorithm for topology discovery using cpuid
leaf 1f. Only the new die level is detected and integrated into the
current affinity mechanisms, including KMP_AFFINITY (granularity level
and compact/scatter algorithm), OMP_PLACES=dies, and KMP_HW_SUBSET.
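For reference, a sketch of enumerating topology levels with cpuid leaf 0x1f
(illustrative, not the kmp_affinity code; the level-type values follow the
Intel SDM):
```
#include <cpuid.h>
#include <cstdio>

int main() {
  unsigned Eax, Ebx, Ecx, Edx;
  // Level types: 0 = end of enumeration, 1 = SMT, 2 = core, 3 = module,
  // 4 = tile, 5 = die.
  for (unsigned Level = 0;; ++Level) {
    if (!__get_cpuid_count(0x1f, Level, &Eax, &Ebx, &Ecx, &Edx))
      break; // leaf 0x1f not supported on this CPU
    unsigned LevelType = (Ecx >> 8) & 0xff;
    if (LevelType == 0)
      break; // no more topology levels
    printf("level %u: type %u, x2APIC shift %u\n", Level, LevelType,
           Eax & 0x1f);
  }
  return 0;
}
```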
Differential Revision: https://reviews.llvm.org/D95157
HWLOC 2.0 represents NUMA nodes as separate children that are no longer part
of the main parent/child topology tree. This change takes this into
account. The main topology detection loop in the create_hwloc_map()
routine starts at a hardware thread within the initial affinity mask and
goes up the topology tree, setting the socket/core/thread labels
correctly.
This change also introduces some of the more generic changes that the
future kmp_topology_t structure will take advantage of, including a
generic ratio & count array (finding all ratios of topology layers, such as
threads/core and cores/socket, and finding all counts of each topology
layer), a generic radix1 reduction step, a generic uniformity check, and
generic printing of the topology (en_US.txt).
Differential Revision: https://reviews.llvm.org/D95156
The check-libomptarget target fails when building with LLVM_ENABLE_PROJECTS. This is because the test configuration is missing the path to libomp.so and libLLVMSupport.so when time profiling is enabled (both libraries have the same path when building). This patch adds the path to the configuration.
Reviewed By: vzakhari
Differential Revision: https://reviews.llvm.org/D95376
Problem reported by Joseph Shen <joseph.smeng@gmail.com>.
The patch changes *(&<atomic-var>) to (&<atomic-var>)->load().
Differential Revision: https://reviews.llvm.org/D95485
With D94745, we no longer use the CUDA SDK to compile `deviceRTLs`. Therefore,
a lot of CMake code in the project is useless. This patch cleans up unnecessary
code and also drops that requirement for building the NVPTX `deviceRTLs`. CUDA
detection is still used, however, to determine whether we need to involve the
tests. Auto-detection of the compute capability is enabled by default and can be
disabled by setting the CMake variable
`LIBOMPTARGET_NVPTX_AUTODETECT_COMPUTE_CAPABILITY=OFF`.
If auto-detection is enabled and CUDA is also valid, it will only build the
bitcode library for the detected version; otherwise, all supported variants will
be generated. One drawback of this patch is that we now generate 96 variants of
the bitcode library, for a total of 1485 files built in a clean build on a
non-CUDA system. `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=""` can be used to
disable building the NVPTX `deviceRTLs`.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D95466
This patch sets the def-allocator-var ICV based on the environment variable
OMP_ALLOCATOR. Previously, the only allowed value for OMP_ALLOCATOR was a
predefined memory allocator. The OpenMP 5.1 specification allows a predefined
memory allocator, a predefined memory space, or a predefined memory space with
traits in OMP_ALLOCATOR. If an allocator cannot be created using the provided
environment variable, the def-allocator-var is set to omp_default_mem_alloc.
Differential Revision: https://reviews.llvm.org/D94985
[libomptarget][cuda] Handle missing _v2 symbols gracefully
Follow on from D95367. Dlsym the _v2 symbols if present, otherwise use the
unsuffixed version. Builds a hashtable for the check, can revise for zero
heap allocations later if necessary.
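A sketch of the fallback described above (illustrative, not the actual plugin
code):
```
#include <dlfcn.h>

// Prefer the _v2 entry point if present, otherwise fall back to the
// unsuffixed symbol.
static void *dlsymPreferV2(void *Handle, const char *Name,
                           const char *NameV2) {
  if (void *Ptr = dlsym(Handle, NameV2))
    return Ptr;
  return dlsym(Handle, Name);
}
// e.g. dlsymPreferV2(Handle, "cuMemFree", "cuMemFree_v2");
```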
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95415
Requiring 3.15 causes a build breakage; I'm sure none of the contents actually
require 3.15 or above.
Differential Revision: https://reviews.llvm.org/D95474
[libomptarget][cuda] Gracefully handle missing cuda library
If using dynamic cuda, and it failed to load, it is not safe to call
cuGetErrorString.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95412
[libomptarget][cuda] Only run tests when sure there is cuda available
Prior to D95155, building the cuda plugin implied cuda was installed locally.
With that change, every machine can build a cuda plugin, but they won't all have
cuda and/or an nvptx card installed locally.
This change enables the nvptx tests when either:
- libcuda is present
- the user has forced use of the dlopen stub
The default case when there is no cuda detected will no longer attempt to
run the tests on nvptx hardware, as was the case before D95155.
Reviewed By: jdoerfert, ronlieb
Differential Revision: https://reviews.llvm.org/D95467
This introduces a remote offloading plugin for libomptarget. This
implementation relies on gRPC and protobuf, so this library will only
build if both libraries are available on the system. The corresponding
server is compiled to `openmp-offloading-server`.
This is a large change, but the only way to split it up is into RTL/server,
and I fear that could introduce an inconsistency between them.
Ideally, tests for this should be added alongside the current ones, but that is
problematic for at least one reason. Given that libomptarget registers plugins
on a first-come-first-served basis, if we wanted to offload onto a local x86
through a different process, then we'd have to either re-order the plugin list
in `rtl.cpp` (which is what I did locally for testing) or find a better
solution for runtime plugin registration in libomptarget.
Differential Revision: https://reviews.llvm.org/D95314
In order to support remote execution, we need to be able to send the
target binary description to the remote host for registration (and
consequent deregistration). To support this, I added these two
optional new functions to the plugin API:
- `__tgt_rtl_register_lib`
- `__tgt_rtl_unregister_lib`
These functions will be called to properly manage the instance of
libomptarget running on the remote host.
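A sketch of the declarations (the signatures here follow the pattern of the
other `__tgt_rtl_*` entry points and should be treated as an approximation):
```
#include <cstdint>

struct __tgt_bin_desc; // the binary descriptor defined in omptarget.h

extern "C" {
// Optional: called when a binary descriptor is (un)registered with
// libomptarget, so the plugin can forward it to the remote instance.
int32_t __tgt_rtl_register_lib(__tgt_bin_desc *Desc);
int32_t __tgt_rtl_unregister_lib(__tgt_bin_desc *Desc);
}
```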
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D93293
From this patch (plus some landed patches), `deviceRTLs` is treated as a regular OpenMP program with just `declare target` regions. In this way, ideally, `deviceRTLs` can be written in OpenMP directly: no CUDA, no HIP anymore. (Well, AMD is still working on getting it to work; for now AMDGCN still uses the original way to compile.) However, some target-specific functions are still required, but they're no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics.
Here're a list of changes in this patch.
1. For NVPTX, `DEVICE` is defined empty in order to make the common parts still work with AMDGCN. Later once AMDGCN is also available, we will completely remove `DEVICE` or probably some other macros.
2. Shared variables are implemented with an OpenMP allocator, which is defined in `allocator.h`. Again, this feature is not available on AMDGCN, so two macros are redefined properly.
3. CUDA header `cuda.h` is dropped in the source code. In order to deal with code difference in various CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports will be used, just as what we currently use for CUDA compilation.
4. Correspondingly, compiler driver is also updated to support CUDA version encoded in the name of bitcode library. Now the bitcode library for NVPTX is named as `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, such as `libomptarget-nvptx-cuda_80-sm_20.bc`.
With this change, there are also multiple features to be expected in the near future:
1. CUDA will be completely dropped when compiling OpenMP. By then, we will also build bitcode libraries for all supported SMs, multiplied by all supported CUDA versions.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` if OpenMP 5.1 feature is fully supported. For now, the IR generated is totally wrong.
3. Target specific parts will be wrapped into `declare variant` with `isa` selector if it can work properly. No target specific macro is needed anymore.
4. (Maybe more...)
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D94745
In much of the libomptarget interface we now have an ident_t object; if
it is not null, we can use it to improve the profile output. For now, we
simply use the ident_t "source information string" as generated by the FE.
FE.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D95282
The basic design is to create an outermost parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and it is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread; let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what the RTL does during initialization. After that, it directly calls `__kmpc_fork_call`, as if the initial thread had encountered a parallel region. In the wrapped function for this team, the main thread (the initial thread that we create via `pthread_create` on Linux) waits on a condition variable. The condition variable can only be signaled when the RTL is being destroyed. The other worker threads just do nothing. The reason the main thread needs to wait there is that, in the current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team, which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also added to configure the number of threads and to enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleep when the initialization is finished. As Andrey mentioned, we might need it to be awakened from time to time to do some work. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609