Avoid unnecessary spills of byval arguments of device functions to
local space on SASS level and subsequent pointer conversion to generic
address space that follows. Instead, make a local copy in IR, provide
a way to access arguments directly, and let LLVM optimize the copy away
when possible.
Differential Review: https://reviews.llvm.org/D21421
llvm-svn: 276153
Taking address of a byval variable in PTX is legal, but currently runs
into miscompilation by ptxas on sm_50+ (NVIDIA issue 1789042).
Work around the issue by enforcing minimum alignment on byval arguments
of device functions.
The change is a no-op on SASS level for sm_3x where ptxas already aligns
local copy by at least 4.
Differential Revision: https://reviews.llvm.org/D22428
llvm-svn: 275893
- Rename the ptx.read.* intrinsics to nvvm.read.ptx.sreg.* - some but
not all of these registers were already accessible via the nvvm
name.
- Rename ptx.bar.sync nvvm.bar.sync, to match nvvm.bar0.
There's a fair amount of code motion here, but it's all very
mechanical.
llvm-svn: 274769
Avoid unnecessary spills of such vars to local space on SASS level and
pointer space conversion.
Instead, make a local copy with appropriate addrspacecasts and let
LLVM optimize them away when possible.
This allows loading value of the argument using [symbol+offset]
instead of converting argument to general space pointer and using it
for indexing (which also implicitly converts param space pointer to
local space one on SASS level and triggers copying of argument into
local space in the process).
This reduces call overhead, uses less registers and reduces overall
SASS size by 2-4%.
Differential Review: http://reviews.llvm.org/D21421
llvm-svn: 273313
Summary:
Currently clang emits these instructions via inline (volatile) asm in
the CUDA headers. Switching to intrinsics will let the optimizer reason
across calls to these intrinsics.
Reviewers: tra
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D21160
llvm-svn: 272298
NVVMIntrRange adds !range metadata to calls of NVVM intrinsics
that return values within known limited range.
This allows LLVM to generate optimal code for indexing arrays
based on tid/ctaid which is a frequently used pattern in CUDA code.
Differential Revision: http://reviews.llvm.org/D20644
llvm-svn: 270872
Summary:
We don't have sign-/zero-extending ldg/ldu instructions defined,
so we need to emulate them with explicit CVTs. We were originally
handling the i8 case, but not any other cases.
Fixes PR26185
Reviewers: jingyue, jlebar
Subscribers: jholewinski
Differential Revision: http://reviews.llvm.org/D19615
llvm-svn: 268272
Currently each Function points to a DISubprogram and DISubprogram has a
scope field. For member functions the scope is a DICompositeType. DIScopes
point to the DICompileUnit to facilitate type uniquing.
Distinct DISubprograms (with isDefinition: true) are not part of the type
hierarchy and cannot be uniqued. This change removes the subprograms
list from DICompileUnit and instead adds a pointer to the owning compile
unit to distinct DISubprograms. This would make it easy for ThinLTO to
strip unneeded DISubprograms and their transitively referenced debug info.
Motivation
----------
Materializing DISubprograms is currently the most expensive operation when
doing a ThinLTO build of clang.
We want the DISubprogram to be stored in a separate Bitcode block (or the
same block as the function body) so we can avoid having to expensively
deserialize all DISubprograms together with the global metadata. If a
function has been inlined into another subprogram we need to store a
reference the block containing the inlined subprogram.
Attached to https://llvm.org/bugs/show_bug.cgi?id=27284 is a python script
that updates LLVM IR testcases to the new format.
http://reviews.llvm.org/D19034
<rdar://problem/25256815>
llvm-svn: 266446
Summary:
Previously the NVVMReflect pass would read its configuration from
command-line flags or a static configuration given to the pass at
instantiation time.
This doesn't quite work for clang's use-case. It needs to pass a value
for __CUDA_FTZ down on a per-module basis. We use a module flag for
this, so the NVVMReflect pass needs to be updated to read said flag.
Reviewers: tra, rnk
Subscribers: cfe-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D18672
llvm-svn: 265090
Summary:
The old address space inference pass (NVPTXFavorNonGenericAddrSpaces) is unable
to convert the address space of a pointer induction variable. This patch adds a
new pass called NVPTXInferAddressSpaces that overcomes that limitation using a
fixed-point data-flow analysis (see the file header comments for details).
The new pass is experimental and not enabled by default. Users can turn
it on by setting the -nvptx-use-infer-addrspace flag of llc.
Reviewers: jholewinski, tra, jlebar
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D17965
llvm-svn: 263916
Summary:
Calls sometimes need to be convergent. This is already handled at the
LLVM IR level, but it also needs to be handled at the MI level.
Ideally we'd propagate convergence from instructions, down through the
selection DAG, and into MIs. But this is Hard, and would affect
optimizations in the SDNs -- right now only SDNs with two operands have
any flags at all.
Instead, here's a much simpler hack: Add new opcodes for NVPTX for
convergent calls, and generate these when lowering convergent LLVM
calls.
Reviewers: jholewinski
Subscribers: jholewinski, chandlerc, joker.eph, jhen, tra, llvm-commits
Differential Revision: http://reviews.llvm.org/D17423
llvm-svn: 262373
Summary:
Convergent instrs shouldn't be made control-dependent on other values,
but this is basically the whole point of tail duplication. So just bail
if we see a convergent instruction.
Reviewers: iteratee
Subscribers: jholewinski, jhen, hfinkel, tra, jingyue, llvm-commits
Differential Revision: http://reviews.llvm.org/D17320
llvm-svn: 261540
Summary:
The syncthreads MI is modeled as mayread/maywrite -- convergence doesn't
even come into play here. Nonetheless this property is highly implicit
in the tablegen files, so a test seems appropriate.
Reviewers: jingyue
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D17319
llvm-svn: 261114
Summary:
Otherwise we'll try to do unsafe optimizations on these MIs, such as
sinking loads below calls.
(I suspect that this is not the only bug in the NVPTX instruction
tablegen files; I need to comb through them.)
Reviewers: jholewinski, tra
Subscribers: jingyue, jhen, llvm-commits
Differential Revision: http://reviews.llvm.org/D17315
llvm-svn: 261113
Summary:
Previously, we would just output "foo = bar" in the assembly, and then
ptxas would choke. Now we die before emitting any invalid code.
Reviewers: echristo
Subscribers: jholewinski, llvm-commits, jhen, tra
Differential Revision: http://reviews.llvm.org/D16490
llvm-svn: 258638
We had two code paths. One would create names like "foo.1" and the other
names like "foo1".
For globals it is important to use "foo.1" to help C++ name demangling.
For locals there is no strong reason to go one way or the other so I
kept the most common mangling (foo1).
llvm-svn: 253804
Note, this was reviewed (and more details are in) http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
These intrinsics currently have an explicit alignment argument which is
required to be a constant integer. It represents the alignment of the
source and dest, and so must be the minimum of those.
This change allows source and dest to each have their own alignments
by using the alignment attribute on their arguments. The alignment
argument itself is removed.
There are a few places in the code for which the code needs to be
checked by an expert as to whether using only src/dest alignment is
safe. For those places, they currently take the minimum of src/dest
alignments which matches the current behaviour.
For example, code which used to read:
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %src, i32 500, i32 8, i1 false)
will now read:
call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 8 %dest, i8* align 8 %src, i32 500, i1 false)
For out of tree owners, I was able to strip alignment from calls using sed by replacing:
(call.*llvm\.memset.*)i32\ [0-9]*\,\ i1 false\)
with:
$1i1 false)
and similarly for memmove and memcpy.
I then added back in alignment to test cases which needed it.
A similar commit will be made to clang which actually has many differences in alignment as now
IRBuilder can generate different source/dest alignments on calls.
In IRBuilder itself, a new argument was added. Instead of calling:
CreateMemCpy(Dst, Src, getInt64(Size), DstAlign, /* isVolatile */ false)
you now call
CreateMemCpy(Dst, Src, getInt64(Size), DstAlign, SrcAlign, /* isVolatile */ false)
There is a temporary class (IntegerAlignment) which takes the source alignment and rejects
implicit conversion from bool. This is to prevent isVolatile here from passing its default
parameter to the source alignment.
Note, changes in future can now be made to codegen. I didn't change anything here, but this
change should enable better memcpy code sequences.
Reviewed by Hal Finkel.
llvm-svn: 253511
Summary:
Let NVPTX backend detect integer min and max patterns during isel and emit intrinsics that enable hardware support.
Reviewers: jholewinski, meheff, jingyue
Subscribers: arsenm, llvm-commits, meheff, jingyue, eliben, jholewinski
Differential Revision: http://reviews.llvm.org/D12377
llvm-svn: 246107
Summary:
__shared__ variable may now emit undef value as initializer, do not
throw error on that.
Test Plan: test/CodeGen/NVPTX/global-addrspace.ll
Patch by Xuetian Weng
Reviewers: jholewinski, tra, jingyue
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D12242
llvm-svn: 245785
For NVPTX, try to use 32-bit division instead of 64-bit division when the dividend and divisor
fit in 32 bits. This speeds up some internal benchmarks significantly. The underlying reason
is that many index computations are carried out in 64-bits but never actually exceed the
capacity of a 32-bit word.
llvm-svn: 244684
Summary:
For example:
s6 = s0*s5;
s2 = s6*s6 + s6;
...
s4 = s6*s3;
We notice that it is possible for s2 is folded to fma (s0, s5, fmul (s6 s6)).
This only happens when Aggressive is true, otherwise hasOneUse() check
already prevents from folding the multiplication with more uses.
Test Plan: test/CodeGen/NVPTX/fma-assoc.ll
Patch by Xuetian Weng
Reviewers: hfinkel, apazos, jingyue, ohsallen, arsenm
Subscribers: arsenm, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D11855
llvm-svn: 244649
I looked into adding a warning / error for this to FileCheck, but there doesn't
seem to be a good way to avoid it triggering on the instances of it in RUN lines.
llvm-svn: 244481
More specifically, make NVPTXISelDAGToDAG able to emit cached loads (LDG) for pointer induction variables.
Also fix latent bug where LDG was not restricted to kernel functions. I believe that this could not be triggered so far since we do not currently infer that a pointer is global outside a kernel function, and only loads of global pointers are considered for cached loads.
llvm-svn: 244166
Summary:
For example, in
struct S {
int *x;
int *y;
};
__global__ void foo(S s) {
int *b = s.y;
// use b
}
"b" is guaranteed to point to global. NVPTX should emit ld.global/st.global for
accessing "b".
Reviewers: jholewinski
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11505
llvm-svn: 243790
Summary:
MCRegAliasIterator only works for physical registers. So, do not run it
on virtual registers.
With this issue fixed, we can resurrect the BranchFolding pass in NVPTX
backend.
Reviewers: jholewinski, bkramer
Subscribers: henryhu, meheff, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11174
llvm-svn: 242871
Summary:
[NVPTX] make load on global readonly memory to use ldg
Summary:
As describe in [1], ld.global.nc may be used to load memory by nvcc when
__restrict__ is used and compiler can detect whether read-only data cache
is safe to use.
This patch will try to check whether ldg is safe to use and use them to
replace ld.global when possible. This change can improve the performance
by 18~29% on affected kernels (ratt*_kernel and rwdot*_kernel) in
S3D benchmark of shoc [2].
Patched by Xuetian Weng.
[1] http://docs.nvidia.com/cuda/kepler-tuning-guide/#read-only-data-cache
[2] https://github.com/vetter/shoc
Test Plan: test/CodeGen/NVPTX/load-with-non-coherent-cache.ll
Reviewers: jholewinski, jingyue
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D11314
llvm-svn: 242713