To facilitate faster loading of device binaries and share them among processes,
HIP runtime favors their alignment being 4096 bytes. HIP runtime can load
unaligned device binaries, however, aligning them at 4096 bytes results in
faster loading and less shared memory usage.
This patch adds an option -bundle-align to clang-offload-bundler which allows
bundles to be aligned at specified alignment. By default it is 1, which is NFC
compared to existing format.
This patch then aligns embedded fat binary and device binary inside fat binary
at 4096 bytes.
It has been verified this change does not cause significant overall file size increase
for typical HIP applications (less than 1%).
Differential Revision: https://reviews.llvm.org/D88734
Do not detect device library by default in rocm detector.
Only detect device library in Rocm and HIP toolchain.
Separate detection of HIP runtime and Rocm device library.
Detect rocm path by version file in host toolchains.
Also added detecting rocm version and printing rocm
installation path and version with -v.
Fixed include path and device library detection for
ROCm 3.5.
Added --hip-version option. Renamed --hip-device-lib-path
to --rocm-device-lib-path.
Fixed default value for -fhip-new-launch-api.
Added default -std option for HIP.
Differential Revision: https://reviews.llvm.org/D82930
Summary:
- `RegisterVar` has `void` return type and `size_t` in its variable size
parameter in HIP or CUDA 9.0+.
Reviewers: tra, yaxunl
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77398
Summary:
- Even though the bindless surface/texture interfaces are promoted,
there are still code using surface/texture references. For example,
[PR#26400](https://bugs.llvm.org/show_bug.cgi?id=26400) reports the
compilation issue for code using `tex2D` with texture references. For
better compatibility, this patch proposes the support of
surface/texture references.
- Due to the absent documentation and magic headers, it's believed that
`nvcc` does use builtins for texture support. From the limited NVVM
documentation[^nvvm] and NVPTX backend texture/surface related
tests[^test], it's believed that surface/texture references are
supported by replacing their reference types, which are annotated with
`device_builtin_surface_type`/`device_builtin_texture_type`, with the
corresponding handle-like object types, `cudaSurfaceObject_t` or
`cudaTextureObject_t`, in the device-side compilation. On the host
side, that global handle variables are registered and will be
established and updated later when corresponding binding/unbinding
APIs are called[^bind]. Surface/texture references are most like
device global variables but represented in different types on the host
and device sides.
- In this patch, the following changes are proposed to support that
behavior:
+ Refine `device_builtin_surface_type` and
`device_builtin_texture_type` attributes to be applied on `Type`
decl only to check whether a variable is of the surface/texture
reference type.
+ Add hooks in code generation to replace that reference types with
the correponding object types as well as all accesses to them. In
particular, `nvvm.texsurf.handle.internal` should be used to load
object handles from global reference variables[^texsurf] as well as
metadata annotations.
+ Generate host-side registration with proper template argument
parsing.
---
[^nvvm]: https://docs.nvidia.com/cuda/pdf/NVVM_IR_Specification.pdf
[^test]: https://raw.githubusercontent.com/llvm/llvm-project/master/llvm/test/CodeGen/NVPTX/tex-read-cuda.ll
[^bind]: See section 3.2.11.1.2 ``Texture reference API` in [CUDA C Programming Guide](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf).
[^texsurf]: According to NVVM IR, `nvvm.texsurf.handle` should be used. But, the current backend doesn't have that supported. We may revise that later.
Reviewers: tra, rjmccall, yaxunl, a.sidorin
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76365
HIP emits a device stub function for each kernel in host code.
The HIP debugger requires device stub function to have a different unmangled name as the kernel.
Currently the name of the device stub function is the mangled name with a postfix .stub. However,
this does not work with the HIP debugger since the unmangled name is the same as the kernel.
This patch adds prefix __device__stub__ to the unmangled name of the device stub before mangling,
therefore the device stub function has a valid mangled name which is different than the device kernel
name. The device side kernel name is kept unchanged. kernels with extern "C" also gets the prefix added
to the corresponding device stub function.
Differential Revision: https://reviews.llvm.org/D68578
This is how it should've been and brings it more in line with
std::string_view. There should be no functional change here.
This is mostly mechanical from a custom clang-tidy check, with a lot of
manual fixups. It uncovers a lot of minor inefficiencies.
This doesn't actually modify StringRef yet, I'll do that in a follow-up.
Summary:
- Revise the interface to derive the stub name and simplify the
assertion of it.
Reviewers: yaxunl, tra
Subscribers: cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D63335
llvm-svn: 363553
Also for CUDA, we need to disable producing these fat binary functions when there is no GPU code.
Reviewers: yaxunl, tra
Differential Revision: https://reviews.llvm.org/D60141
llvm-svn: 357526
Skip producing the fat binary functions for HIP when no device code is present.
Reviewers: yaxunl
Differential Review: https://reviews.llvm.org/D60141
llvm-svn: 357520
Add .stub to kernel stub function name so that it is different from kernel
name in device code. This is necessary to let debugger find correct symbol
for kernel.
Differential Revision: https://reviews.llvm.org/D58518
llvm-svn: 354948
Add .stub to kernel stub function name so that it is different from kernel
name in device code. This is necessary to let debugger find correct symbol
for kernel
Differential Revision: https://reviews.llvm.org/D58518
llvm-svn: 354615
__hipRegisterFunction and __hipRegisterVar need to accept device side kernel and variable names
so that HIP runtime can associate kernel stub functions in host code with kernel symbols in fat binaries,
and associate shadow variables in host code with device variables in fat binaries.
Currently, clang assumes kernel functions and device variables have the same name as the kernel
stub functions and shadow variables. However, when host is compiled in windows with MSVC C++
ABI and device is compiled with Itanium C++ ABI (e.g. AMDGPU), kernels and device symbols in fat
binary are mangled differently than host.
This patch gets the device side kernel and variable name by mangling them in the mangle context
of aux target.
Differential Revision: https://reviews.llvm.org/D58163
llvm-svn: 354004
Emit{Nounwind,}RuntimeCall{,OrInvoke} have been modified to take a
FunctionCallee as an argument, and CreateRuntimeFunction has been
modified to return a FunctionCallee. All callers have been updated.
Additionally, CreateBuiltinFunction is removed, as it was redundant
with CreateRuntimeFunction after some previous changes.
Differential Revision: https://reviews.llvm.org/D57668
llvm-svn: 353184
This argument was added in r254554 in order to support the
pass_object_size attribute. However, in r296076, the attribute's
presence is now also represented in FunctionProtoType's
ExtParameterInfo, and thus it's unnecessary to pass along a separate
FunctionDecl.
The functions modified are:
RequiredArgs::forPrototype{,Plus}, and
CodeGenTypes::ConvertFunctionType.
After this, it's also (again) unnecessary to have a separate
ConvertFunctionType function ConvertType, so convert callers back to
the latter, leaving the former as an internal helper function.
llvm-svn: 352946
Instead of calling CUDA runtime to arrange function arguments,
the new API constructs arguments in a local array and the kernels
are launched with __cudaLaunchKernel().
The old API has been deprecated and is expected to go away
in the next CUDA release.
Differential Revision: https://reviews.llvm.org/D57488
llvm-svn: 352799
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
getGUID() returns an uint64_t and "%x" only prints 32 bits of it.
Use PRIx64 format string to print all 64 bits.
Differential Revision: https://reviews.llvm.org/D52938
llvm-svn: 343875
This patch renames -f{no-}cuda-rdc to -f{no-}gpu-rdc and keeps the original
options as aliases. When -fgpu-rdc is off,
clang will assume the device code in each translation unit does not call
external functions except those in the device library, therefore it is possible
to compile the device code in each translation unit to self-contained kernels
and embed them in the host object, so that the host object behaves like
usual host object which can be linked by lld.
The benefits of this feature is: 1. allow users to create static libraries which
can be linked by host linker; 2. amortized device code linking time.
This patch modifies HIP action builder to insert actions for linking device
code and generating HIP fatbin, and pass HIP fatbin to host backend action.
It extracts code for constructing command for generating HIP fatbin as
a function so that it can be reused by early finalization. It also modifies
codegen of HIP host constructor functions to embed the device fatbin
when it is available.
Differential Revision: https://reviews.llvm.org/D52377
llvm-svn: 343611
Different shared libraries contain different fat binary, which is stored in a global variable
__hip_gpubin_handle. Since different compilation units share the same fat binary, this
variable has linkonce linkage. However, it should not be merged across different shared
libraries.
This patch set the visibility of the global variable to be hidden, which will make it invisible
in the shared library, therefore preventing it from being merged.
Differential Revision: https://reviews.llvm.org/D50596
llvm-svn: 340056
HIP generates one fat binary for all devices after linking. However, for each compilation
unit a ctor function is emitted which register the same fat binary. Measures need to be
taken to make sure the fat binary is only registered once.
Currently each ctor function calls __hipRegisterFatBinary and stores the returned value
to __hip_gpubin_handle. This patch changes the linkage of __hip_gpubin_handle to be linkonce
so that they are shared between LLVM modules. Then this patch adds check of value of
__hip_gpubin_handle to make sure __hipRegisterFatBinary is only called once. The code
is equivalent to
void *_gpubin_handle;
void ctor() {
if (__hip_gpubin_handle == 0) {
__hip_gpubin_handle = __hipRegisterFatBinary(...);
}
// register kernels and variables.
}
The patch also does similar change to dtors so that __hipUnregisterFatBinary
is called once.
Differential Revision: https://reviews.llvm.org/D49083
llvm-svn: 337631
This matches the way NVCC does it. Doing module cleanup at global
destructor phase used to work, but is, apparently, too late for
the CUDA runtime in CUDA-9.2, which ends up crashing with double-free.
Differential Revision: https://reviews.llvm.org/D48613
llvm-svn: 335763
CGM.GetAddrOfConstantCString() sets the adress of the created GlobalValue
to unnamed. When emitting the object file LLVM will mark the surrounding
section as SHF_MERGE iff the string is nul-terminated and contains no
other nuls (see IsNullTerminatedString). This results in problems when
saving temporaries because LLVM doesn't set an EntrySize, so reading in
the serialized assembly file fails.
This never happened for the GPU binaries because they usually contain
a nul-character somewhere. Instead this only affected the module ID
when compiling relocatable device code.
However, this points to a potentially larger problem: If we put a
constant string into a named section, we really want the data to end
up in that section in the object file. To avoid LLVM merging sections
this patch unmarks the GlobalVariable's address as unnamed which also
fixes the problem of invalid serialized assembly files when saving
temporaries.
Differential Revision: https://reviews.llvm.org/D47902
llvm-svn: 334281
To support linking device code in different source files, it is necessary to
embed fat binary at host linking stage.
This patch emits an external symbol for fat binary in host codegen, then
embed the fat binary by lld through a linker script.
Differential Revision: https://reviews.llvm.org/D46472
llvm-svn: 332724
HIP is a language similar to CUDA (https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md ).
The language syntax is very similar, which allows a hip program to be compiled as a CUDA program by Clang. The main difference
is the host API. HIP has a set of vendor neutral host API which can be implemented on different platforms. Currently there is open source
implementation of HIP runtime on amdgpu target (https://github.com/ROCm-Developer-Tools/HIP).
This patch adds support of input kind and language standard hip.
When hip file is compiled, both LangOpts.CUDA and LangOpts.HIP is turned on. This allows compilation of hip program as CUDA
in most cases and only special handling of hip program is needed LangOpts.HIP is checked.
This patch also adds support of kernel launching of HIP program using HIP host API.
When -x hip is not specified, there is no behaviour change for CUDA.
Patch by Greg Rodgers.
Revised and lit test added by Yaxun Liu.
Differential Revision: https://reviews.llvm.org/D44984
llvm-svn: 330790
nvcc generates a unique registration function for each object file
that contains relocatable device code. Unique names are achieved
with a module id that is also reflected in the function's name.
Differential Revision: https://reviews.llvm.org/D42922
llvm-svn: 330425
Binaries for multiple architectures are combined by fatbinary,
so the current code was effectively not needed.
Differential Revision: https://reviews.llvm.org/D43461
llvm-svn: 326342
Use variadic templates instead of relying on <cstdarg> + sentinel.
This enforces better type checking and makes code more readable.
Differential revision: https://reviews.llvm.org/D32550
llvm-svn: 302572
This matches the way nvcc encapsulates GPU binaries into host object file.
Now cuobjdump can deal with clang-compiled object files.
Differential Revision: https://reviews.llvm.org/D23429
llvm-svn: 278549
Summary:
Before this patch, we computed the offsets in memory of args passed to
GPU kernel functions by throwing all of the args into an LLVM struct.
clang emits packed llvm structs basically whenever it feels like it, and
packed structs have alignment 1. So we cannot rely on the llvm type's
alignment matching the C++ type's alignment.
This patch fixes our codegen so we always respect the clang types'
alignments.
Reviewers: rnk
Subscribers: cfe-commits, tra
Differential Revision: https://reviews.llvm.org/D22879
llvm-svn: 276927
... and register them with CUDA runtime.
This is needed for commonly used cudaMemcpy*() APIs that use address of
host-side shadow to access their counterparts on device side.
Fixes PR26340
Differential Revision: http://reviews.llvm.org/D17779
llvm-svn: 262498
Summary:
Previously we compiled CUDA device code to PTX assembly and embedded
that asm as text in our host binary. Now we compile to PTX assembly and
then invoke ptxas to assemble the PTX into a cubin file. We gather the
ptx and cubin files for each of our --cuda-gpu-archs and combine them
using fatbinary, and then embed that into the host binary.
Adds two new command-line flags, -Xcuda_ptxas and -Xcuda_fatbinary,
which pass args down to the external tools.
Reviewers: tra, echristo
Subscribers: cfe-commits, jhen
Differential Revision: http://reviews.llvm.org/D16082
llvm-svn: 257809
Introduce an Address type to bundle a pointer value with an
alignment. Introduce APIs on CGBuilderTy to work with Address
values. Change core APIs on CGF/CGM to traffic in Address where
appropriate. Require alignments to be non-zero. Update a ton
of code to compute and propagate alignment information.
As part of this, I've promoted CGBuiltin's EmitPointerWithAlignment
helper function to CGF and made use of it in a number of places in
the expression emitter.
The end result is that we should now be significantly more correct
when performing operations on objects that are locally known to
be under-aligned. Since alignment is not reliably tracked in the
type system, there are inherent limits to this, but at least we
are no longer confused by standard operations like derived-to-base
conversions and array-to-pointer decay. I've also fixed a large
number of bugs where we were applying the complete-object alignment
to a pointer instead of the non-virtual alignment, although most of
these were hidden by the very conservative approach we took with
member alignment.
Also, because IRGen now reliably asserts on zero alignments, we
should no longer be subject to an absurd but frustrating recurring
bug where an incomplete type would report a zero alignment and then
we'd naively do a alignmentAtOffset on it and emit code using an
alignment equal to the largest power-of-two factor of the offset.
We should also now be emitting much more aggressive alignment
attributes in the presence of over-alignment. In particular,
field access now uses alignmentAtOffset instead of min.
Several times in this patch, I had to change the existing
code-generation pattern in order to more effectively use
the Address APIs. For the most part, this seems to be a strict
improvement, like doing pointer arithmetic with GEPs instead of
ptrtoint. That said, I've tried very hard to not change semantics,
but it is likely that I've failed in a few places, for which I
apologize.
ABIArgInfo now always carries the assumed alignment of indirect and
indirect byval arguments. In order to cut down on what was already
a dauntingly large patch, I changed the code to never set align
attributes in the IR on non-byval indirect arguments. That is,
we still generate code which assumes that indirect arguments have
the given alignment, but we don't express this information to the
backend except where it's semantically required (i.e. on byvals).
This is likely a minor regression for those targets that did provide
this information, but it'll be trivial to add it back in a later
patch.
I partially punted on applying this work to CGBuiltin. Please
do not add more uses of the CreateDefaultAligned{Load,Store}
APIs; they will be going away eventually.
llvm-svn: 246985