CUDA requires that static variables be visible to the host when
offloading. However, The standard semantics of a stiatc variable dictate
that it should not be visible outside of the current file. In order to
access it from the host we need to perform "externalization" on the
static variable on the device. This requires generating a semi-unique
name that can be affixed to the variable as to not cause linker errors.
This is currently done using the CUID functionality, an MD5 hash value
set up by the clang driver. This allows us to achieve is mostly unique
ID that is unique even between multiple compilations of the same file.
However, this is not always availible. Instead, this patch uses the
unique ID from the file to generate a unique symbol name. This will
create a unique name that is consistent between the host and device side
compilations without requiring the CUID to be entered by the driver. The
one downside to this is that we are no longer stable under multiple
compilations of the same file. However, this is a very niche use-case
and is not supported by Nvidia's CUDA compiler so it likely to be good
enough.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D125904
The changes made in D123460 generalized the code generation for OpenMP's
offloading entries. We can use the same scheme to register globals for
CUDA code. This patch adds the code generation to create these
offloading entries when compiling using the new offloading driver mode.
The offloading entries are simple structs that contain the information
necessary to register the global. The struct used is as follows:
```
Type struct __tgt_offload_entry {
void *addr; // Pointer to the offload entry info.
// (function or global)
char *name; // Name of the function or global.
size_t size; // Size of the entry info (0 if it a function).
int32_t flags;
int32_t reserved;
};
```
Currently CUDA handles RDC code generation by deferring the registration
of globals in the current TU to a callback function containing the
modules ID. Later all the module IDs will be used to register all of the
globals at once. Rather than mimic this, offloading entries allow us to
mimic the way OpenMP registers globals. That is, we create a simple
global struct for each device global to be registered. These are placed
at a special section `cuda_offloading_entires`. Because this section is
a valid C-identifier, the linker will profide a `__start` and `__stop`
pointer that we can use to iterate and register all globals at runtime.
the registration requires a flag variable to indicate which registration
function to use. I have assigned the flags somewhat arbitrarily, but
these use the following values.
Kernel: 0
Variable: 0
Managed: 1
Surface: 2
Texture: 3
Depends on D120272
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D123471
CUDA/HIP needs to mangle for aux target. When mangling for aux target,
the mangler should use mangling number for aux target. Previously
in https://reviews.llvm.org/D122734 a state was introduced in
ASTContext to let the mangler get mangling number for aux target
from ASTContext. This patch removes that state from ASTConext
and add an IsAux member to MangleContext to indicate that
the mangle context is for aux target. This reflects the reality that
the mangle context is created for mangling aux target and makes
ASTContext cleaner.
Reviewed by: Artem Belevich, Reid Kleckner
Differential Revision: https://reviews.llvm.org/D124842
MSVC and Itanium mangling use different mangling numbers
for function-scope structs, which causes inconsistent
mangled kernel names in device and host compilations.
This patch uses Itanium mangling number for structs
in for mangling device side names in CUDA/HIP host
compilation on Windows to fix this issue.
A state is added to ASTContext to indicate whether the
current name mangling is for device side names in host
compilation. Device and host mangling number
are encoded/decoded as upper and lower half of 32 bit
unsigned integer to fit into the original mangling number
field for AST. Diagnostic will be emitted if a manglining
number exceeds limit.
Reviewed by: Artem Belevich, Reid Kleckner
Differential Revision: https://reviews.llvm.org/D122734
Fixes: SWDEV-328515
In theory, constructors can take arguments when called via .init_array
where at least glibc passes in (argc, argv, envp). This isn't used in
the generated code and if it was, the first argument should be an
integer, not a pointer. For destructors registered via atexit, the
function should never take an argument.
Differential Revision: https://reviews.llvm.org/D123370
Introduce -fgpu-default-stream={legacy|per-thread} option to
support per-thread default stream for HIP runtime.
When -fgpu-default-stream=per-thread, HIP kernels are
launched through hipLaunchKernel_spt instead of
hipLaunchKernel. Also HIP_API_PER_THREAD_DEFAULT_STREAM=1
is defined by the preprocessor to enable other per-thread stream
API's.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D120298
To make uses of the deprecated constructor easier to spot, and to
ensure that no new uses are introduced, rename it to
Address::deprecated().
While doing the rename, I've filled in element types in cases
where it was relatively obvious, but we're still left with 135
calls to the deprecated constructor.
This patch fixes issues for -fgpu-rdc for Windows MSVC
toolchain:
Fix COFF specific section flags and remove section types
in llvm-mc input file for Windows.
Escape fatbin path in llvm-mc input file.
Add -triple option to llvm-mc.
Put __hip_gpubin_handle in comdat when it has linkonce_odr
linkage.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D115039
Two identical instantiations of a template function can be emitted by two TU's
with linkonce_odr linkage without causing duplicate symbols in linker. MSVC
also requires these symbols be in comdat sections. Linux does not require
the symbols in comdat sections to be merged by linker but by default
clang puts them in comdat sections.
If a template kernel is instantiated identically in two TU's. MSVC requires
that them to be in comdat sections, otherwise MSVC linker will diagnose them as
duplicate symbols. However, currently clang does not put instantiated template
kernels in comdat sections, which causes link error for MSVC.
This patch allows putting instantiated template kernels into comdat sections.
Reviewed by: Artem Belevich, Reid Kleckner
Differential Revision: https://reviews.llvm.org/D112492
createReplacementInstr was a trivial wrapper around
ConstantExpr::getAsInstruction, which also inserted the new instruction
into a basic block. Implement this directly in getAsInstruction by
adding an InsertBefore parameter and change all callers to use it. NFC.
A follow-up patch will remove createReplacementInstr.
Differential Revision: https://reviews.llvm.org/D112791
Remove uses of to-be-deprecated API. I've fallen back to calling
getPointerElementType() in some cases where the correct type wasn't
immediately obvious to me.
The original version of this was reverted, and @rjmcall provided some
advice to architect a new solution. This is that solution.
This implements a builtin to provide a unique name that is stable across
compilations of this TU for the purposes of implementing the library
component of the unnamed kernel feature of SYCL. It does this by
running the Itanium mangler with a few modifications.
Because it is somewhat common to wrap non-kernel-related lambdas in
macros that aren't present on the device (such as for logging), this
uniquely generates an ID for all lambdas involved in the naming of a
kernel. It uses the lambda-mangling number to do this, except replaces
this with its own number (starting at 10000 for readabililty reasons)
for lambdas used to name a kernel.
Additionally, this implements itself as constexpr with a slight catch:
if a name would be invalidated by the use of this lambda in a later
kernel invocation, it is diagnosed as an error (see the Sema tests).
Differential Revision: https://reviews.llvm.org/D103112
Currently clang does not emit device template variables
instantiated only in host functions, however, nvcc is
able to do that:
https://godbolt.org/z/fneEfferY
This patch fixes this issue by refactoring and extending
the existing mechanism for emitting static device
var ODR-used by host only. Basically clang records
device variables ODR-used by host code and force
them to be emitted in device compilation. The existing
mechanism makes sure these device variables ODR-used
by host code are added to llvm.compiler-used, therefore
they are guaranteed not to be deleted.
It also fixes non-ODR-use of static device variable by host code
causing static device variable to be emitted and registered,
which should not.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D102237
Add device variables to llvm.compiler.used if they are
ODR-used by either host or device functions.
This is necessary to prevent them from being
eliminated by whole-program optimization
where the compiler has no way to know a device
variable is used by some host code.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D98814
Add builtin function __builtin_get_device_side_mangled_name
to get device side manged name for functions and global
variables, which can be used to get symbol address of kernels
or variables by mangled name in dynamically loaded
bundled code objects at run time.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D99301
Currently clang uses stub function to launch kernel. This is inconvenient
to interop with C++ programs since the stub function has different name
as kernel, which is required by ROCm debugger.
This patch emits a variable symbol which has the same name as the kernel
and uses it to register and launch the kernel. This allows C++ program to
launch a kernel by using the original kernel name.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D86376
For -fgpu-rdc mode, static device vars in different TU's may have the same name.
To support accessing file-scope static device variables in host code, we need to give them
a distinct name and external linkage. This can be done by postfixing each static device variable with
a distinct CUID (Compilation Unit ID) hash.
Since the static device variables have different name across compilation units, now we let
them have external linkage so that they can be looked up by the runtime.
Reviewed by: Artem Belevich, and Jon Chesterfield
Differential Revision: https://reviews.llvm.org/D85223
Currently managed variables are emitted as undefined symbols, which
causes difficulty for diagnosing undefined symbols for non-managed
variables.
This patch transforms managed variables in device compilation so that
they can be emitted as normal variables.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D96195
For -fgpu-rdc, shadow variables should not be internalized, otherwise
they cannot be accessed by other TUs. This is necessary because
the shadow variable of external device variables are always
emitted as undefined symbols, which need to resolve to a global
symbols.
Managed variables need to be emitted as undefined symbols
in device compilations.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D95901
- The failures are all cc1-based tests due to the missing `-aux-triple` options,
which is always prepared by the driver in CUDA/HIP compilation.
- Add extra check on the missing aux-targetinfo to prevent crashing.
[hip][cuda] Enable extended lambda support on Windows.
- On Windows, extended lambda has extra issues due to the numbering
schemes are different between the host compilation (Microsoft C++ ABI)
and the device compilation (Itanium C++ ABI. Additional device side
lambda number is required per lambda for the host compilation to
correctly mangle the device-side lambda name.
- A hybrid numbering context `MSHIPNumberingContext` is introduced to
number a lambda for both host- and device-compilations.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D69322
This reverts commit 4874ff0241.
- On Windows, extended lambda has extra issues due to the numbering
schemes are different between the host compilation (Microsoft C++ ABI)
and the device compilation (Itanium C++ ABI. Additional device side
lambda number is required per lambda for the host compilation to
correctly mangle the device-side lambda name.
- A hybrid numbering context `MSHIPNumberingContext` is introduced to
number a lambda for both host- and device-compilations.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D69322
Extract registering device variable to CUDA runtime codegen function since it
will be called in multiple places.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D95558
This patch implements codegen for __managed__ variable attribute for HIP.
Diagnostics will be added later.
Differential Revision: https://reviews.llvm.org/D94814
Followup to D85191.
This changes getTypeInfoInChars to return a TypeInfoChars
struct instead of a std::pair of CharUnits. This lets the
interface match getTypeInfo more closely.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D86447
To facilitate faster loading of device binaries and share them among processes,
HIP runtime favors their alignment being 4096 bytes. HIP runtime can load
unaligned device binaries, however, aligning them at 4096 bytes results in
faster loading and less shared memory usage.
This patch adds an option -bundle-align to clang-offload-bundler which allows
bundles to be aligned at specified alignment. By default it is 1, which is NFC
compared to existing format.
This patch then aligns embedded fat binary and device binary inside fat binary
at 4096 bytes.
It has been verified this change does not cause significant overall file size increase
for typical HIP applications (less than 1%).
Differential Revision: https://reviews.llvm.org/D88734
Do not detect device library by default in rocm detector.
Only detect device library in Rocm and HIP toolchain.
Separate detection of HIP runtime and Rocm device library.
Detect rocm path by version file in host toolchains.
Also added detecting rocm version and printing rocm
installation path and version with -v.
Fixed include path and device library detection for
ROCm 3.5.
Added --hip-version option. Renamed --hip-device-lib-path
to --rocm-device-lib-path.
Fixed default value for -fhip-new-launch-api.
Added default -std option for HIP.
Differential Revision: https://reviews.llvm.org/D82930
Summary:
- `RegisterVar` has `void` return type and `size_t` in its variable size
parameter in HIP or CUDA 9.0+.
Reviewers: tra, yaxunl
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77398
Summary:
- Even though the bindless surface/texture interfaces are promoted,
there are still code using surface/texture references. For example,
[PR#26400](https://bugs.llvm.org/show_bug.cgi?id=26400) reports the
compilation issue for code using `tex2D` with texture references. For
better compatibility, this patch proposes the support of
surface/texture references.
- Due to the absent documentation and magic headers, it's believed that
`nvcc` does use builtins for texture support. From the limited NVVM
documentation[^nvvm] and NVPTX backend texture/surface related
tests[^test], it's believed that surface/texture references are
supported by replacing their reference types, which are annotated with
`device_builtin_surface_type`/`device_builtin_texture_type`, with the
corresponding handle-like object types, `cudaSurfaceObject_t` or
`cudaTextureObject_t`, in the device-side compilation. On the host
side, that global handle variables are registered and will be
established and updated later when corresponding binding/unbinding
APIs are called[^bind]. Surface/texture references are most like
device global variables but represented in different types on the host
and device sides.
- In this patch, the following changes are proposed to support that
behavior:
+ Refine `device_builtin_surface_type` and
`device_builtin_texture_type` attributes to be applied on `Type`
decl only to check whether a variable is of the surface/texture
reference type.
+ Add hooks in code generation to replace that reference types with
the correponding object types as well as all accesses to them. In
particular, `nvvm.texsurf.handle.internal` should be used to load
object handles from global reference variables[^texsurf] as well as
metadata annotations.
+ Generate host-side registration with proper template argument
parsing.
---
[^nvvm]: https://docs.nvidia.com/cuda/pdf/NVVM_IR_Specification.pdf
[^test]: https://raw.githubusercontent.com/llvm/llvm-project/master/llvm/test/CodeGen/NVPTX/tex-read-cuda.ll
[^bind]: See section 3.2.11.1.2 ``Texture reference API` in [CUDA C Programming Guide](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf).
[^texsurf]: According to NVVM IR, `nvvm.texsurf.handle` should be used. But, the current backend doesn't have that supported. We may revise that later.
Reviewers: tra, rjmccall, yaxunl, a.sidorin
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76365
HIP emits a device stub function for each kernel in host code.
The HIP debugger requires device stub function to have a different unmangled name as the kernel.
Currently the name of the device stub function is the mangled name with a postfix .stub. However,
this does not work with the HIP debugger since the unmangled name is the same as the kernel.
This patch adds prefix __device__stub__ to the unmangled name of the device stub before mangling,
therefore the device stub function has a valid mangled name which is different than the device kernel
name. The device side kernel name is kept unchanged. kernels with extern "C" also gets the prefix added
to the corresponding device stub function.
Differential Revision: https://reviews.llvm.org/D68578
This is how it should've been and brings it more in line with
std::string_view. There should be no functional change here.
This is mostly mechanical from a custom clang-tidy check, with a lot of
manual fixups. It uncovers a lot of minor inefficiencies.
This doesn't actually modify StringRef yet, I'll do that in a follow-up.
Summary:
- Revise the interface to derive the stub name and simplify the
assertion of it.
Reviewers: yaxunl, tra
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D63335
llvm-svn: 363553
Also for CUDA, we need to disable producing these fat binary functions when there is no GPU code.
Reviewers: yaxunl, tra
Differential Revision: https://reviews.llvm.org/D60141
llvm-svn: 357526
Skip producing the fat binary functions for HIP when no device code is present.
Reviewers: yaxunl
Differential Review: https://reviews.llvm.org/D60141
llvm-svn: 357520
Add .stub to kernel stub function name so that it is different from kernel
name in device code. This is necessary to let debugger find correct symbol
for kernel.
Differential Revision: https://reviews.llvm.org/D58518
llvm-svn: 354948
Add .stub to kernel stub function name so that it is different from kernel
name in device code. This is necessary to let debugger find correct symbol
for kernel
Differential Revision: https://reviews.llvm.org/D58518
llvm-svn: 354615
__hipRegisterFunction and __hipRegisterVar need to accept device side kernel and variable names
so that HIP runtime can associate kernel stub functions in host code with kernel symbols in fat binaries,
and associate shadow variables in host code with device variables in fat binaries.
Currently, clang assumes kernel functions and device variables have the same name as the kernel
stub functions and shadow variables. However, when host is compiled in windows with MSVC C++
ABI and device is compiled with Itanium C++ ABI (e.g. AMDGPU), kernels and device symbols in fat
binary are mangled differently than host.
This patch gets the device side kernel and variable name by mangling them in the mangle context
of aux target.
Differential Revision: https://reviews.llvm.org/D58163
llvm-svn: 354004