The patch disabled unrolling in loop vectorization pass when VF==1 on x86 architecture,
by setting MaxInterleaveFactor to 1. Unrolling in loop vectorization pass may introduce
the cost of overflow check, memory boundary check and extra prologue/epilogue code when
regular unroller will unroll the loop another time. Disable it when VF==1 remove the
unnecessary cost on x86. The same can be done for other platforms after verifying
interleaving/memory bound checking to be not perf critical on those platforms.
Differential Revision: http://reviews.llvm.org/D9515
llvm-svn: 236613
Summary:
Some optimizations such as jump threading and loop unswitching can negatively
affect performance when applied to divergent branches. The divergence analysis
added in this patch conservatively estimates which branches in a GPU program
can diverge. This information can then help LLVM to run certain optimizations
selectively.
Test Plan: test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
Reviewers: resistor, hfinkel, eliben, meheff, jholewinski
Subscribers: broune, bjarke.roune, madhur13490, tstellarAMD, dberlin, echristo, jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8576
llvm-svn: 234567
Summary:
DataLayout keeps the string used for its creation.
As a side effect it is no longer needed in the Module.
This is "almost" NFC, the string is no longer
canonicalized, you can't rely on two "equals" DataLayout
having the same string returned by getStringRepresentation().
Get rid of DataLayoutPass: the DataLayout is in the Module
The DataLayout is "per-module", let's enforce this by not
duplicating it more than necessary.
One more step toward non-optionality of the DataLayout in the
module.
Make DataLayout Non-Optional in the Module
Module->getDataLayout() will never returns nullptr anymore.
Reviewers: echristo
Subscribers: resistor, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D7992
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 231270
This patch adds the isProfitableToHoist API. For AArch64, we want to prevent a
fmul from being hoisted in cases where it is more profitable to form a
fmsub/fmadd.
Phabricator Review: http://reviews.llvm.org/D7299
Patch by Lawrence Hu <lawrence@codeaurora.org>
llvm-svn: 230241
Summary: When evaluating floating point instructions in the inliner, ask the TTI whether it is an expensive operation. By default, it's not an expensive operation. This keeps the default behavior the same as before. The ARM TTI has been updated to return back TCC_Expensive for targets which don't have hardware floating point.
Reviewers: chandlerc, echristo
Reviewed By: echristo
Subscribers: t.p.northover, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D6936
llvm-svn: 228263
terms of the new pass manager's TargetIRAnalysis.
Yep, this is one of the nicer bits of the new pass manager's design.
Passes can in many cases operate in a vacuum and so we can just nest
things when convenient. This is particularly convenient here as I can
now consolidate all of the TargetMachine logic on this analysis.
The most important change here is that this pushes the function we need
TTI for all the way into the TargetMachine, and re-creates the TTI
object for each function rather than re-using it for each function.
We're now prepared to teach the targets to produce function-specific TTI
objects with specific subtargets cached, etc.
One piece of feedback I'd love here is whether its worth renaming any of
this stuff. None of the names really seem that awesome to me at this
point, but TargetTransformInfoWrapperPass is particularly ... odd.
TargetIRAnalysisWrapper might make more sense. I would want to do that
rename separately anyways, but let me know what you think.
llvm-svn: 227731
produce it.
This adds a function to the TargetMachine that produces this analysis
via a callback for each function. This in turn faves the way to produce
a *different* TTI per-function with the correct subtarget cached.
I've also done the necessary wiring in the opt tool to thread the target
machine down and make it available to the pass registry so that we can
construct this analysis from a target machine when available.
llvm-svn: 227721
base which it adds a single analysis pass to, to instead return the type
erased TargetTransformInfo object constructed for that TargetMachine.
This removes all of the pass variants for TTI. There is now a single TTI
*pass* in the Analysis layer. All of the Analysis <-> Target
communication is through the TTI's type erased interface itself. While
the diff is large here, it is nothing more that code motion to make
types available in a header file for use in a different source file
within each target.
I've tried to keep all the doxygen comments and file boilerplate in line
with this move, but let me know if I missed anything.
With this in place, the next step to making TTI work with the new pass
manager is to introduce a really simple new-style analysis that produces
a TTI object via a callback into this routine on the target machine.
Once we have that, we'll have the building blocks necessary to accept
a function argument as well.
llvm-svn: 227685
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
Specifically, gc.result benefits from this greatly. Instead of:
gc.result.int.*
gc.result.float.*
gc.result.ptr.*
...
We now have a gc.result.* that can specialize to literally any type.
Differential Revision: http://reviews.llvm.org/D7020
llvm-svn: 226857
I'm recommiting the codegen part of the patch.
The vectorizer part will be send to review again.
Masked Vector Load and Store Intrinsics.
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
llvm-svn: 223348
The statepoint intrinsics are intended to enable precise root tracking through the compiler as to support garbage collectors of all types. The addition of the statepoint intrinsics to LLVM should have no impact on the compilation of any program which does not contain them. There are no side tables created, no extra metadata, and no inhibited optimizations.
A statepoint works by transforming a call site (or safepoint poll site) into an explicit relocation operation. It is the frontend's responsibility (or eventually the safepoint insertion pass we've developed, but that's not part of this patch series) to ensure that any live pointer to a GC object is correctly added to the statepoint and explicitly relocated. The relocated value is just a normal SSA value (as seen by the optimizer), so merges of relocated and unrelocated values are just normal phis. The explicit relocation operation, the fact the statepoint is assumed to clobber all memory, and the optimizers standard semantics ensure that the relocations flow through IR optimizations correctly.
This is the first patch in a small series. This patch contains only the IR parts; the documentation and backend support will be following separately. The entire series can be seen as one combined whole in http://reviews.llvm.org/D5683.
Reviewed by: atrick, ributzka
llvm-svn: 223078
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot. I'll respond to the commit on the
list with a reproduction of one of the failures.
Conflicts:
lib/Target/X86/X86TargetTransformInfo.cpp
llvm-svn: 222936
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
llvm-svn: 222632
These are named following the IEEE-754 names for these
functions, rather than the libm fmin / fmax to avoid
possible ambiguities. Some languages may implement something
resembling fmin / fmax which return NaN if either operand is
to propagate errors. These implement the IEEE-754 semantics
of returning the other operand if either is a NaN representing
missing data.
llvm-svn: 220341
The annotation instructions are dropped during codegen and have no
impact on size. In some cases, the annotations were preventing the
unroller from unrolling a loop because the annotation calls were
pushing the cost over the unrolling threshold.
Differential Revision: http://reviews.llvm.org/D5335
llvm-svn: 218525
shim between the TargetTransformInfo immutable pass and the Subtarget
via the TargetMachine and Function. Migrate a single call from
BasicTargetTransformInfo as an example and provide shims where TargetMachine
begins taking a Function to determine the subtarget.
No functional change.
llvm-svn: 218004
"Unroll" is not the appropriate name for this variable. Clang already uses
the term "interleave" in pragmas and metadata for this.
Differential Revision: http://reviews.llvm.org/D5066
llvm-svn: 217528
This patch adds support to recognize division by uniform power of 2 and modifies the cost table to vectorize division by uniform power of 2 whenever possible.
Updates Cost model for Loop and SLP Vectorizer.The cost table is currently only updated for X86 backend.
Thanks to Hal, Andrea, Sanjay for the review. (http://reviews.llvm.org/D4971)
llvm-svn: 216371
Some types, such as 128-bit vector types on AArch64, don't have any callee-saved registers. So if a value needs to stay live over a callsite, it must be spilled and refilled. This cost is now taken into account.
llvm-svn: 214859
This is the first commit in a series that add an @llvm.assume intrinsic which
can be used to provide the optimizer with a condition it may assume to be true
(when the control flow would hit the intrinsic call). Some basic properties are added here:
- llvm.invariant(true) is dead.
- llvm.invariant(false) is unreachable (this directly corresponds to the
documented behavior of MSVC's __assume(0)), so is llvm.invariant(undef).
The intrinsic is tagged as writing arbitrarily, in order to maintain control
dependencies. BasicAA has been updated, however, to return NoModRef for any
particular location-based query so that we don't unnecessarily block code
motion.
llvm-svn: 213973
definition below all the header #include lines, lib/Analysis/...
edition.
This one has a bit extra as there were *other* #define's before #include
lines in addition to DEBUG_TYPE. I've sunk all of them as a block.
llvm-svn: 206843
The implementation of getUserCost had duplicated (and hard-coded) the default
logic in getGEPCost. Instead, it is better to use getGEPCost directly, which
limits the default logic to the implementation of one function, and allows
targets to override the behavior.
No functionality change intended.
llvm-svn: 205346
Extend the target hook to take also the operand index into account when
calculating the cost of the constant materialization.
Related to <rdar://problem/16381500>
llvm-svn: 204435
This commit extends the coverage of the constant hoisting pass, adds additonal
debug output and updates the function names according to the style guide.
Related to <rdar://problem/16381500>
llvm-svn: 204389
the stack of the analysis group because they are all immutable passes.
This is made clear by Craig's recent work to use override
systematically -- we weren't overriding anything for 'finalizePass'
because there is no such thing.
This is kind of a lame restriction on the API -- we can no longer push
and pop things, we just set up the stack and run. However, I'm not
invested in building some better solution on top of the existing
(terrifying) immutable pass and legacy pass manager.
llvm-svn: 203437
This commit caused -Woverloaded-virtual warnings. The two new
TargetTransformInfo::getIntImmCost functions were only added to the superclass,
and to the X86 subclass. The other targets were not updated, and the
warning highlighted this by pointing out that e.g. ARMTTI::getIntImmCost was
hiding the two new getIntImmCost variants.
We could pacify the warning by adding "using TargetTransformInfo::getIntImmCost"
to the various subclasses, or turning it off, but I suspect that it's wrong to
leave the functions unimplemnted in those targets. The default implementations
return TCC_Free, which I don't think is right e.g. for ARM.
llvm-svn: 200058
Retry commit r200022 with a fix for the build bot errors. Constant expressions
have (unlike instructions) module scope use lists and therefore may have users
in different functions. The fix is to simply ignore these out-of-function uses.
llvm-svn: 200034
This pass identifies expensive constants to hoist and coalesces them to
better prepare it for SelectionDAG-based code generation. This works around the
limitations of the basic-block-at-a-time approach.
First it scans all instructions for integer constants and calculates its
cost. If the constant can be folded into the instruction (the cost is
TCC_Free) or the cost is just a simple operation (TCC_BASIC), then we don't
consider it expensive and leave it alone. This is the default behavior and
the default implementation of getIntImmCost will always return TCC_Free.
If the cost is more than TCC_BASIC, then the integer constant can't be folded
into the instruction and it might be beneficial to hoist the constant.
Similar constants are coalesced to reduce register pressure and
materialization code.
When a constant is hoisted, it is also hidden behind a bitcast to force it to
be live-out of the basic block. Otherwise the constant would be just
duplicated and each basic block would have its own copy in the SelectionDAG.
The SelectionDAG recognizes such constants as opaque and doesn't perform
certain transformations on them, which would create a new expensive constant.
This optimization is only applied to integer constants in instructions and
simple (this means not nested) constant cast experessions. For example:
%0 = load i64* inttoptr (i64 big_constant to i64*)
Reviewed by Eric
llvm-svn: 200022
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
Upcoming SLP vectorization improvements will want to be able to estimate costs
of horizontal reductions. Add infrastructure to support this.
We model reductions as a series of (shufflevector,add) tuples ultimately
followed by an extractelement. For example, for an add-reduction of <4 x float>
we could generate the following sequence:
(v0, v1, v2, v3)
\ \ / /
\ \ /
+ +
(v0+v2, v1+v3, undef, undef)
\ /
((v0+v2) + (v1+v3), undef, undef)
%rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
%bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
%rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef,
<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
%r = extractelement <4 x float> %bin.rdx8, i32 0
This commit adds a cost model interface "getReductionCost(Opcode, Ty, Pairwise)"
that will allow clients to ask for the cost of such a reduction (as backends
might generate more efficient code than the cost of the individual instructions
summed up). This interface is excercised by the CostModel analysis pass which
looks for reduction patterns like the one above - starting at extractelements -
and if it sees a matching sequence will call the cost model interface.
We will also support a second form of pairwise reduction that is well supported
on common architectures (haddps, vpadd, faddp).
(v0, v1, v2, v3)
\ / \ /
(v0+v1, v2+v3, undef, undef)
\ /
((v0+v1)+(v2+v3), undef, undef, undef)
%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>
%rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
%bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
%rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
%rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
%r = extractelement <4 x float> %bin.rdx.1, i32 0
llvm-svn: 190876
Allow targets to customize the default behavior of the generic loop unrolling
transformation. This will be used by the PowerPC backend when targeting the A2
core (which is in-order with a deep pipeline), and using more aggressive
defaults is important.
llvm-svn: 190542