We would crash if we couldn't locate a Function that either Location's
Value belonged to. Now we just print out a debug message and return
conservatively.
llvm-svn: 228901
Without a valid data layout, deferenceable(N) doesn't get parsed or
propagated. Since this is the key item we are testing, add a dependency
on the pass.
Differential Revision: http://reviews.llvm.org/D7508
llvm-svn: 228611
While a theoretical GC might change dereferenceability on collection,
there is no such known collector and no need to account for the case
with a flag yet.
Differential Revision: http://reviews.llvm.org/D7454
llvm-svn: 228606
When creating a scev for sext({X,+,Y}), scev checks if the expression
is equivalent to {sext X,+,zext Y}. If it can prove that, it also
tags the original {X,+,Y} as <nsw>, which is not correct.
In the test case I run `-scalar-evolution` twice because the bug
manifests only once SCEV has run through and seen the `sext`
expressions (and then does a in-place mutation on {X,+,Y}).
Differential Revision: http://reviews.llvm.org/D7495
llvm-svn: 228586
For the attached test case different types are used in the ICmpInst
and SelectInst that represent the min/max expressions. However, if the
ICmpInst type is smaller a comparison with the sign/zero extended
operands would have yielded the same result. This situation might
arise after the instruction combination pass was applied.
Differential Revision: http://reviews.llvm.org/D7338
llvm-svn: 228572
add recurrences don't overflow.
This change makes the optimization more restrictive. It still assumes
that an overflowing `add nsw` is undefined behavior; and this change
will need revisiting once we have a consistent semantics for poison
values.
Differential Revision: http://reviews.llvm.org/D7331
llvm-svn: 228552
different fields.
We can show that two GEPs off of the same (possibly multidimensional)
array of structs, into different fields, can't alias. Quoting:
For two GEPOperators GEP1 and GEP2, if we find that:
- both GEPs begin indexing from the exact same pointer;
- the last indices in both GEPs are constants, indexing into a struct;
- said indices are different, hence,the pointed-to fields are different;
- and both GEPs only index through arrays prior to that;
this lets us determine that the struct that GEP1 indexes into and the
struct that GEP2 indexes into must either precisely overlap or be
completely disjoint. Because they cannot partially overlap, indexing
into different non-overlapping fields of the struct will never alias.
The other BasicAA::aliasGEP rules worked in some cases, but not all
(for example, the i32x3 struct in the testcase).
We can add this simple ad-hoc rule to complement them.
rdar://19717375
Differential Revision: http://reviews.llvm.org/D7453
llvm-svn: 228498
Since testing the function indirectly is tricky, introduce a direct
print-memderefs pass, in the same spirit as print-memdeps, which prints
dereferenceability information matched by FileCheck.
Differential Revision: http://reviews.llvm.org/D7075
llvm-svn: 228369
Fixes PR22462: two of the tests have regressed for a while,
but were using CHECK-NOT to match "May:". The actual output
was changed to "MayAlias:" at some point, which made the tests
useless.
Two others return MayAlias only because of a lack of analysis;
BasicAA returns PartialAlias in those cases, when a datalayout
is present.
llvm-svn: 228346
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
I had already factored this analysis specifically to enable doing this,
but hadn't actually committed the necessary wiring to get at this from
the new pass manager. This also nicely shows how the separate cache
object can be directly managed by the new pass manager.
This analysis didn't have any direct tests and so I've added a printer
pass and a boring test case. I chose to print the i1 value which is
being assumed rather than the call to llvm.assume as that seems much
more useful for testing... but suggestions on an even better printing
strategy welcome. My main goal was to make sure things actually work. =]
llvm-svn: 226868
ScalarEvolution currently lowers a subtraction recurrence to an add
recurrence with the same no-wrap flags as the subtraction. This is
incorrect because `sub nsw X, Y` is not the same as `add nsw X, -Y`
and `sub nuw X, Y` is not the same as `add nuw X, -Y`. This patch
fixes the issue, and adds two test cases demonstrating the bug.
Differential Revision: http://reviews.llvm.org/D7081
llvm-svn: 226755
pass and a LoopPrinterPass with the expected associated wiring.
I've added a RUN line to the only test case (!!!) we have that actually
prints loops. Everything seems to be working.
This is somewhat exciting as this is the first analysis using another
analysis to go in for the new pass manager. =D I also believe it is the
last analysis necessary for porting instcombine, but of course I may yet
discover more.
llvm-svn: 226560
This adds the domtree analysis to the new pass manager. The analysis
returns the same DominatorTree result entity used by the old pass
manager and essentially all of the code is shared. We just have
different boilerplate for running and printing the analysis.
I've converted one test to run in both modes just to make sure this is
exercised while both are live in the tree.
llvm-svn: 225969
Correct, we have *zero* basic testing of the dominator tree in the
regression test suite. There is a single test that even prints it out,
and that test only checks a single line of the output. There are
a handful of tests that check post dominators, but all of those are
looking for bugs rather than just exercising the basic machinery.
This test is super boring and unexciting. But hey, it's something.
I needed there to be something so I could switch the basic test to run
with both the old and new pass manager.
llvm-svn: 225936
doing Load PRE"
It's not really expected to stick around, last time it provoked a weird LTO
build failure that I can't reproduce now, and the bot logs are long gone. I'll
re-revert it if the failures recur.
Original description: Perform Scalar PRE on gep indices that feed loads before
doing Load PRE.
llvm-svn: 225536
Now that `Metadata` is typeless, reflect that in the assembly. These
are the matching assembly changes for the metadata/value split in
r223802.
- Only use the `metadata` type when referencing metadata from a call
intrinsic -- i.e., only when it's used as a `Value`.
- Stop pretending that `ValueAsMetadata` is wrapped in an `MDNode`
when referencing it from call intrinsics.
So, assembly like this:
define @foo(i32 %v) {
call void @llvm.foo(metadata !{i32 %v}, metadata !0)
call void @llvm.foo(metadata !{i32 7}, metadata !0)
call void @llvm.foo(metadata !1, metadata !0)
call void @llvm.foo(metadata !3, metadata !0)
call void @llvm.foo(metadata !{metadata !3}, metadata !0)
ret void, !bar !2
}
!0 = metadata !{metadata !2}
!1 = metadata !{i32* @global}
!2 = metadata !{metadata !3}
!3 = metadata !{}
turns into this:
define @foo(i32 %v) {
call void @llvm.foo(metadata i32 %v, metadata !0)
call void @llvm.foo(metadata i32 7, metadata !0)
call void @llvm.foo(metadata i32* @global, metadata !0)
call void @llvm.foo(metadata !3, metadata !0)
call void @llvm.foo(metadata !{!3}, metadata !0)
ret void, !bar !2
}
!0 = !{!2}
!1 = !{i32* @global}
!2 = !{!3}
!3 = !{}
I wrote an upgrade script that handled almost all of the tests in llvm
and many of the tests in cfe (even handling many `CHECK` lines). I've
attached it (or will attach it in a moment if you're speedy) to PR21532
to help everyone update their out-of-tree testcases.
This is part of PR21532.
llvm-svn: 224257
When a loop gets bundled up, its outgoing edges are quite large, and can
just barely overflow 64-bits. If one successor has multiple incoming
edges -- and that successor is getting all the incoming mass --
combining just its edges can overflow. Handle that by saturating rather
than asserting.
This fixes PR21622.
llvm-svn: 223500
Summary:
Several places in DependenceAnalysis assumes both SCEVs in a subscript pair
share the same integer type. For instance, isKnownPredicate calls
SE->getMinusSCEV(X, Y) which asserts X and Y share the same type. However,
DependenceAnalysis fails to ensure this assumption when producing a subscript
pair, causing tests such as NonCanonicalizedSubscript to crash. With this
patch, DependenceAnalysis runs unifySubscriptType before producing any
subscript pair, ensuring the assumption.
Test Plan:
Added NonCanonicalizedSubscript.ll on which DependenceAnalysis before the fix
crashed because subscripts have different types.
Reviewers: spop, sebpop, jingyue
Reviewed By: jingyue
Subscribers: eliben, meheff, llvm-commits
Differential Revision: http://reviews.llvm.org/D6289
llvm-svn: 222100
HowFarToZero was supposed to use unsigned division in order to calculate
the backedge taken count. However, SCEVDivision::divide performs signed
division. Unless I am mistaken, no users of SCEVDivision actually want
signed arithmetic: switch to udiv and urem.
This fixes PR21578.
llvm-svn: 222093
doing Load PRE"
This commit updates the failing test in
Analysis/TypeBasedAliasAnalysis/gvn-nonlocal-type-mismatch.ll
The failing test is sensitive to the order in which we process loads. This
version turns on the RPO traversal instead of the while DT traversal in GVN.
The new test code is functionally same just the order of loads that are
eliminated is swapped.
This new version also fixes an issue where GVN splits a critical edge and
potentially invalidate the RPO/DT iterator.
llvm-svn: 222039
Let's try this again...
This reverts r219432, plus a bug fix.
Description of the bug in r219432 (by Nick):
The bug was using AllPositive to break out of the loop; if the loop break
condition i != e is changed to i != e && AllPositive then the
test_modulo_analysis_with_global test I've added will fail as the Modulo will
be calculated incorrectly (as the last loop iteration is skipped, so Modulo
isn't updated with its Scale).
Nick also adds this comment:
ComputeSignBit is safe to use in loops as it takes into account phi nodes, and
the == EK_ZeroEx check is safe in loops as, no matter how the variable changes
between iterations, zero-extensions will always guarantee a zero sign bit. The
isValueEqualInPotentialCycles check is therefore definitely not needed as all
the variable analysis holds no matter how the variables change between loop
iterations.
And this patch also adds another enhancement to GetLinearExpression - basically
to convert ConstantInts to Offsets (see test_const_eval and
test_const_eval_scaled for the situations this improves).
Original commit message:
This reverts r218944, which reverted r218714, plus a bug fix.
Description of the bug in r218714 (by Nick):
The original patch forgot to check if the Scale in VariableGEPIndex flipped the
sign of the variable. The BasicAA pass iterates over the instructions in the
order they appear in the function, and so BasicAliasAnalysis::aliasGEP is
called with the variable it first comes across as parameter GEP1. Adding a
%reorder label puts the definition of %a after %b so aliasGEP is called with %b
as the first parameter and %a as the second. aliasGEP later calculates that %a
== %b + 1 - %idxprom where %idxprom >= 0 (if %a was passed as the first
parameter it would calculate %b == %a - 1 + %idxprom where %idxprom >= 0) -
ignoring that %idxprom is scaled by -1 here lead the patch to incorrectly
conclude that %a > %b.
Revised patch by Nick White, thanks! Thanks to Lang to isolating the bug.
Slightly modified by me to add an early exit from the loop and avoid
unnecessary, but expensive, function calls.
Original commit message:
Two related things:
1. Fixes a bug when calculating the offset in GetLinearExpression. The code
previously used zext to extend the offset, so negative offsets were converted
to large positive ones.
2. Enhance aliasGEP to deduce that, if the difference between two GEP
allocations is positive and all the variables that govern the offset are also
positive (i.e. the offset is strictly after the higher base pointer), then
locations that fit in the gap between the two base pointers are NoAlias.
Patch by Nick White!
llvm-svn: 221876
AVX2 is available.
According to IACA, the new lowering has a throughput of 8 cycles instead of 13
with the previous one.
Althought this lowering kicks in some SPECs benchmarks, the performance
improvement was within the noise.
Correctness testing has been done for the whole range of uint32_t with the
following program:
uint4 v = (uint4) {0,1,2,3};
uint32_t i;
//Check correctness over entire range for uint4 -> float4 conversion
for( i = 0; i < 1U << (32-2); i++ )
{
float4 t = test(v);
float4 c = correct(v);
if( 0xf != _mm_movemask_ps( t == c ))
{
printf( "Error @ %vx: %vf vs. %vf\n", v, c, t);
return -1;
}
v += 4;
}
Where "correct" is the old lowering and "test" the new one.
The patch adds a test case for the two custom lowering instruction.
It also modifies the vector cost model, which is why cast.ll and uitofp.ll are
modified.
2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7
instructions instead of 4 (3 more constant loads).
rdar://problem/18153096>
llvm-svn: 221657
In a case where we have a no {un,}signed wrap flag on the increment, if
RHS - Start is constant then we can avoid inserting a max operation bewteen
the two, since we can statically determine which is greater.
This allows us to unroll loops such as:
void testcase3(int v) {
for (int i=v; i<=v+1; ++i)
f(i);
}
llvm-svn: 220960
DSE's overlap checking contained special logic, used only when no DataLayout
was available, which inferred a complete overwrite when the pointee types were
equal. This logic seems fine for regular loads/stores, but does not work for
memcpy and friends. Instead of fixing this, I'm just removing it.
Philosophically, transformations should not contain enhanced behavior used only
when data layout is lacking (data layout should be strictly additive), and
maintaining these rarely-tested code paths seems not worthwhile at this stage.
Credit to Aliaksei Zasenka for the bug report and the diagnosis. The test case
(slightly reduced from that provided by Aliaksei) replaces the original
contents of test/Transforms/DeadStoreElimination/no-targetdata.ll -- a few
other tests have been updated to have a data layout.
llvm-svn: 220035
The CFL-AA implementation was missing a visit* routine for va_arg instructions,
causing it to assert when run on a function that had one. For now, handle these
in a conservative way.
Fixes PR20954.
llvm-svn: 219718
It also makes it more aggressive in querying range information by
adding a call to isKnownPredicateWithRanges to
isLoopBackedgeGuardedByCond and isLoopEntryGuardedByCond.
phabricator: http://reviews.llvm.org/D5638
Reviewed by: atrick, hfinkel
llvm-svn: 219532
ScalarEvolution in the presence of multiple exits. Previously all
loops exits had to have identical counts for a loop trip count to be
considered computable. This pessimization was implemented by calling
getBackedgeTakenCount(L) rather than getExitCount(L, ExitingBlock)
inside of ScalarEvolution::getSmallConstantTripCount() (see the FIXME
in the comments of that function). The pessimization was added to fix
a corner case involving undefined behavior (pr/16130). This patch more
precisely handles the undefined behavior case allowing the pessimization
to be removed.
ControlsExit replaces IsSubExpr to more precisely track the case where
undefined behavior is expected to occur. Because undefined behavior is
tracked more precisely we can remove MustExit from ExitLimit. MustExit
was used to track the case where the limit was computed potentially
assuming undefined behavior even if undefined behavior didn't necessarily
occur.
llvm-svn: 219517
This reverts r218944, which reverted r218714, plus a bug fix.
Description of the bug in r218714 (by Nick)
The original patch forgot to check if the Scale in VariableGEPIndex flipped the
sign of the variable. The BasicAA pass iterates over the instructions in the
order they appear in the function, and so BasicAliasAnalysis::aliasGEP is
called with the variable it first comes across as parameter GEP1. Adding a
%reorder label puts the definition of %a after %b so aliasGEP is called with %b
as the first parameter and %a as the second. aliasGEP later calculates that %a
== %b + 1 - %idxprom where %idxprom >= 0 (if %a was passed as the first
parameter it would calculate %b == %a - 1 + %idxprom where %idxprom >= 0) -
ignoring that %idxprom is scaled by -1 here lead the patch to incorrectly
conclude that %a > %b.
Revised patch by Nick White, thanks! Thanks to Lang to isolating the bug.
Slightly modified by me to add an early exit from the loop and avoid
unnecessary, but expensive, function calls.
Original commit message:
Two related things:
1. Fixes a bug when calculating the offset in GetLinearExpression. The code
previously used zext to extend the offset, so negative offsets were converted
to large positive ones.
2. Enhance aliasGEP to deduce that, if the difference between two GEP
allocations is positive and all the variables that govern the offset are also
positive (i.e. the offset is strictly after the higher base pointer), then
locations that fit in the gap between the two base pointers are NoAlias.
Patch by Nick White!
llvm-svn: 219135
We used to return PartialAlias if *either* variable being queried interacted
with arguments or globals. AFAICT, we can change this to only returning
MayAlias iff *both* variables being queried interacted with arguments or
globals.
Also, adding some basic functionality tests: some basic IPA tests, checking
that we give conservative responses with arguments/globals thrown in the mix,
and ensuring that we trace values through stores and loads.
Note that saying that 'x' interacted with arguments or globals means that the
Attributes of the StratifiedSet that 'x' belongs to has any bits set.
Patch by George Burgess IV, thanks!
llvm-svn: 219122
This patch broke 447.dealII on Darwin. I'm currently working on a reduced
test-case, but reverting for now to keep the bots happy.
<rdar://problem/18530107>
llvm-svn: 218944
argument of the llvm.dbg.declare/llvm.dbg.value intrinsics.
Previously, DIVariable was a variable-length field that has an optional
reference to a Metadata array consisting of a variable number of
complex address expressions. In the case of OpPiece expressions this is
wasting a lot of storage in IR, because when an aggregate type is, e.g.,
SROA'd into all of its n individual members, the IR will contain n copies
of the DIVariable, all alike, only differing in the complex address
reference at the end.
By making the complex address into an extra argument of the
dbg.value/dbg.declare intrinsics, all of the pieces can reference the
same variable and the complex address expressions can be uniqued across
the CU, too.
Down the road, this will allow us to move other flags, such as
"indirection" out of the DIVariable, too.
The new intrinsics look like this:
declare void @llvm.dbg.declare(metadata %storage, metadata %var, metadata %expr)
declare void @llvm.dbg.value(metadata %storage, i64 %offset, metadata %var, metadata %expr)
This patch adds a new LLVM-local tag to DIExpressions, so we can detect
and pretty-print DIExpression metadata nodes.
What this patch doesn't do:
This patch does not touch the "Indirect" field in DIVariable; but moving
that into the expression would be a natural next step.
http://reviews.llvm.org/D4919
rdar://problem/17994491
Thanks to dblaikie and dexonsmith for reviewing this patch!
Note: I accidentally committed a bogus older version of this patch previously.
llvm-svn: 218787
argument of the llvm.dbg.declare/llvm.dbg.value intrinsics.
Previously, DIVariable was a variable-length field that has an optional
reference to a Metadata array consisting of a variable number of
complex address expressions. In the case of OpPiece expressions this is
wasting a lot of storage in IR, because when an aggregate type is, e.g.,
SROA'd into all of its n individual members, the IR will contain n copies
of the DIVariable, all alike, only differing in the complex address
reference at the end.
By making the complex address into an extra argument of the
dbg.value/dbg.declare intrinsics, all of the pieces can reference the
same variable and the complex address expressions can be uniqued across
the CU, too.
Down the road, this will allow us to move other flags, such as
"indirection" out of the DIVariable, too.
The new intrinsics look like this:
declare void @llvm.dbg.declare(metadata %storage, metadata %var, metadata %expr)
declare void @llvm.dbg.value(metadata %storage, i64 %offset, metadata %var, metadata %expr)
This patch adds a new LLVM-local tag to DIExpressions, so we can detect
and pretty-print DIExpression metadata nodes.
What this patch doesn't do:
This patch does not touch the "Indirect" field in DIVariable; but moving
that into the expression would be a natural next step.
http://reviews.llvm.org/D4919
rdar://problem/17994491
Thanks to dblaikie and dexonsmith for reviewing this patch!
llvm-svn: 218778
Two related things:
1. Fixes a bug when calculating the offset in GetLinearExpression. The code
previously used zext to extend the offset, so negative offsets were converted
to large positive ones.
2. Enhance aliasGEP to deduce that, if the difference between two GEP
allocations is positive and all the variables that govern the offset are also
positive (i.e. the offset is strictly after the higher base pointer), then
locations that fit in the gap between the two base pointers are NoAlias.
Patch by Nick White!
llvm-svn: 218714
The default implementation of getCmpSelInstrCost, which provides the cost of
icmp/fcmp/select instructions, did not deal sensibly with illegal vector types
that were scalarized. We'd ask for the legalization cost of the vector type,
which would return something like (4, f64) given an input of <4 x double>, and
we'd then check the TLI status of the ISD opcode on that scalar type. This would
result in querying (ISD::VSELECT, f64), for example. Amusingly enough,
ISD::VSELECT on scalar types is marked as Legal by default (as with most other
operations), and most backends never change this because VSELECT is never
generated on scalars. However, seeing the resulting operation as Legal, we'd
neglect to add the scalarization cost before returning. The result is that we'd
grossly under-estimate the cost of cmps/selects on illegal vector types.
Now, if type legalization clearly results in scalarization, we skip the early
return and add the scalarization cost.
llvm-svn: 217859
Cross-class copies being expensive is actually a trait of the microarchitecture, but as I haven't yet seen an example of a microarchitecture where they're cheap it seems best to just enable this by default, covering the non-mcpu build case.
llvm-svn: 217674
This adds a basic (but important) use of @llvm.assume calls in ScalarEvolution.
When SE is attempting to validate a condition guarding a loop (such as whether
or not the loop count can be zero), this check should also include dominating
assumptions.
llvm-svn: 217348
This provides an implementation of CFL alias analysis (including some
supporting data structures). Currently, we don't have any extremely fancy
features, sans some interprocedural analysis (i.e. no field sensitivity, etc.),
and we do best sitting behind BasicAA + TBAA. In such a configuration, we take
~0.6-0.8% of total compile time, and give ~7-8% NoAlias responses to queries
TBAA and BasicAA couldn't answer when bootstrapping LLVM. In testing this on
other projects, we've seen up to 10.5% of queries dropped by BasicAA+TBAA
answered with NoAlias by this algorithm.
Patch by George Burgess IV (with minor modifications by me -- mostly adapting
some BasicAA tests), thanks!
llvm-svn: 216970
This is the first commit in a series that add an @llvm.assume intrinsic which
can be used to provide the optimizer with a condition it may assume to be true
(when the control flow would hit the intrinsic call). Some basic properties are added here:
- llvm.invariant(true) is dead.
- llvm.invariant(false) is unreachable (this directly corresponds to the
documented behavior of MSVC's __assume(0)), so is llvm.invariant(undef).
The intrinsic is tagged as writing arbitrarily, in order to maintain control
dependencies. BasicAA has been updated, however, to return NoModRef for any
particular location-based query so that we don't unnecessarily block code
motion.
llvm-svn: 213973
This functionality is currently turned off by default.
Part of the motivation for introducing scoped-noalias metadata is to enable the
preservation of noalias parameter attribute information after inlining.
Sometimes this can be inferred from the code in the caller after inlining, but
often we simply lose valuable information.
The overall process if fairly simple:
1. Create a new unqiue scope domain.
2. For each (used) noalias parameter, create a new alias scope.
3. For each pointer, collect the underlying objects. Add a noalias scope for
each noalias parameter from which we're not derived (and has not been
captured prior to that point).
4. Add an alias.scope for each noalias parameter from which we might be
derived (or has been captured before that point).
Note that the capture checks apply only if one of the underlying objects is not
an identified function-local object.
llvm-svn: 213949
In the process of fixing the noalias parameter -> metadata conversion process
that will take place during inlining (which will be committed soon, but not
turned on by default), I have come to realize that the semantics provided by
yesterday's commit are not really what we want. Here's why:
void foo(noalias a, noalias b, noalias c, bool x) {
*q = x ? a : b;
*c = *q;
}
Generically, we know that *c does not alias with *a and with *b (so there is an
'and' in what we know we're not), and we know that *q might be derived from *a
or from *b (so there is an 'or' in what we know that we are). So we do not want
the semantics currently, where any noalias scope matching any alias.scope
causes a NoAlias return. What we want to know is that the noalias scopes form a
superset of the alias.scope list (meaning that all the things we know we're not
is a superset of all of things the other instruction might be).
Making that change, however, introduces a composibility problem. If we inline
once, adding the noalias metadata, and then inline again adding more, and we
append new scopes onto the noalias and alias.scope lists each time. But, this
means that we could change what was a NoAlias result previously into a MayAlias
result because we appended an additional scope onto one of the alias.scope
lists. So, instead of giving scopes the ability to have parents (which I had
borrowed from the TBAA implementation, but seems increasingly unlikely to be
useful in practice), I've given them domains. The subset/superset condition now
applies within each domain independently, and we only need it to hold in one
domain. Each time we inline, we add the new scopes in a new scope domain, and
everything now composes nicely. In addition, this simplifies the
implementation.
llvm-svn: 213948
This commit adds scoped noalias metadata. The primary motivations for this
feature are:
1. To preserve noalias function attribute information when inlining
2. To provide the ability to model block-scope C99 restrict pointers
Neither of these two abilities are added here, only the necessary
infrastructure. In fact, there should be no change to existing functionality,
only the addition of new features. The logic that converts noalias function
parameters into this metadata during inlining will come in a follow-up commit.
What is added here is the ability to generally specify noalias memory-access
sets. Regarding the metadata, alias-analysis scopes are defined similar to TBAA
nodes:
!scope0 = metadata !{ metadata !"scope of foo()" }
!scope1 = metadata !{ metadata !"scope 1", metadata !scope0 }
!scope2 = metadata !{ metadata !"scope 2", metadata !scope0 }
!scope3 = metadata !{ metadata !"scope 2.1", metadata !scope2 }
!scope4 = metadata !{ metadata !"scope 2.2", metadata !scope2 }
Loads and stores can be tagged with an alias-analysis scope, and also, with a
noalias tag for a specific scope:
... = load %ptr1, !alias.scope !{ !scope1 }
... = load %ptr2, !alias.scope !{ !scope1, !scope2 }, !noalias !{ !scope1 }
When evaluating an aliasing query, if one of the instructions is associated
with an alias.scope id that is identical to the noalias scope associated with
the other instruction, or is a descendant (in the scope hierarchy) of the
noalias scope associated with the other instruction, then the two memory
accesses are assumed not to alias.
Note that is the first element of the scope metadata is a string, then it can
be combined accross functions and translation units. The string can be replaced
by a self-reference to create globally unqiue scope identifiers.
[Note: This overview is slightly stylized, since the metadata nodes really need
to just be numbers (!0 instead of !scope0), and the scope lists are also global
unnamed metadata.]
Existing noalias metadata in a callee is "cloned" for use by the inlined code.
This is necessary because the aliasing scopes are unique to each call site
(because of possible control dependencies on the aliasing properties). For
example, consider a function: foo(noalias a, noalias b) { *a = *b; } that gets
inlined into bar() { ... if (...) foo(a1, b1); ... if (...) foo(a2, b2); } --
now just because we know that a1 does not alias with b1 at the first call site,
and a2 does not alias with b2 at the second call site, we cannot let inlining
these functons have the metadata imply that a1 does not alias with b2.
llvm-svn: 213864
In order to enable the preservation of noalias function parameter information
after inlining, and the representation of block-level __restrict__ pointer
information (etc.), additional kinds of aliasing metadata will be introduced.
This metadata needs to be carried around in AliasAnalysis::Location objects
(and MMOs at the SDAG level), and so we need to generalize the current scheme
(which is hard-coded to just one TBAA MDNode*).
This commit introduces only the necessary refactoring to allow for the
introduction of other aliasing metadata types, but does not actually introduce
any (that will come in a follow-up commit). What it does introduce is a new
AAMDNodes structure to hold all of the aliasing metadata nodes associated with
a particular memory-accessing instruction, and uses that structure instead of
the raw MDNode* in AliasAnalysis::Location, etc.
No functionality change intended.
llvm-svn: 213859
This reverts, "r213024 - Revert r212572 "improve BasicAA CS-CS queries", it
causes PR20303." with a fix for the bug in pr20303. As it turned out, the
relevant code was both wrong and over-conservative (because, as with the code
it replaced, it would return the overall ModRef mask even if just Ref had been
implied by the argument aliasing results). Hopefully, this correctly fixes both
problems.
Thanks to Nick Lewycky for reducing the test case for pr20303 (which I've
cleaned up a little and added in DSE's test directory). The BasicAA test has
also been updated to check for this error.
Original commit message:
BasicAA contains knowledge of certain intrinsics, such as memcpy and memset,
and uses that information to form more-accurate answers to CallSite vs. Loc
ModRef queries. Unfortunately, it did not use this information when answering
CallSite vs. CallSite queries.
Generically, when an intrinsic takes one or more pointers and the intrinsic is
marked only to read/write from its arguments, the offset/size is unknown. As a
result, the generic code that answers CallSite vs. CallSite (and CallSite vs.
Loc) queries in AA uses UnknownSize when forming Locs from an intrinsic's
arguments. While BasicAA's CallSite vs. Loc override could use more-accurate
size information for some intrinsics, it did not do the same for CallSite vs.
CallSite queries.
This change refactors the intrinsic-specific logic in BasicAA into a generic AA
query function: getArgLocation, which is overridden by BasicAA to supply the
intrinsic-specific knowledge, and used by AA's generic implementation. This
allows the intrinsic-specific knowledge to be used by both CallSite vs. Loc and
CallSite vs. CallSite queries, and simplifies the BasicAA implementation.
Currently, only one function, Mac's memset_pattern16, is handled by BasicAA
(all the rest are intrinsics). As a side-effect of this refactoring, BasicAA's
getModRefBehavior override now also returns OnlyAccessesArgumentPointees for
this function (which is an improvement).
llvm-svn: 213219
BasicAA contains knowledge of certain intrinsics, such as memcpy and memset,
and uses that information to form more-accurate answers to CallSite vs. Loc
ModRef queries. Unfortunately, it did not use this information when answering
CallSite vs. CallSite queries.
Generically, when an intrinsic takes one or more pointers and the intrinsic is
marked only to read/write from its arguments, the offset/size is unknown. As a
result, the generic code that answers CallSite vs. CallSite (and CallSite vs.
Loc) queries in AA uses UnknownSize when forming Locs from an intrinsic's
arguments. While BasicAA's CallSite vs. Loc override could use more-accurate
size information for some intrinsics, it did not do the same for CallSite vs.
CallSite queries.
This change refactors the intrinsic-specific logic in BasicAA into a generic AA
query function: getArgLocation, which is overridden by BasicAA to supply the
intrinsic-specific knowledge, and used by AA's generic implementation. This
allows the intrinsic-specific knowledge to be used by both CallSite vs. Loc and
CallSite vs. CallSite queries, and simplifies the BasicAA implementation.
Currently, only one function, Mac's memset_pattern16, is handled by BasicAA
(all the rest are intrinsics). As a side-effect of this refactoring, BasicAA's
getModRefBehavior override now also returns OnlyAccessesArgumentPointees for
this function (which is an improvement).
llvm-svn: 212572
This patch:
1) Improves the cost model for x86 alternate shuffles (originally
added at revision 211339);
2) Teaches the Cost Model Analysis pass how to analyze alternate shuffles.
Alternate shuffles are a special kind of blend; on x86, we can often
easily lowered alternate shuffled into single blend
instruction (depending on the subtarget features).
The existing cost model didn't take into account subtarget features.
Also, it had a couple of "dead" entries for vector types that are never
legal (example: on x86 types v2i32 and v2f32 are not legal; those are
always either promoted or widened to 128-bit vector types).
The new x86 cost model takes into account what target features we have
before returning the shuffle cost (i.e. the number of instructions
after the blend is lowered/expanded).
This patch also teaches the Cost Model Analysis how to identify and analyze
alternate shuffles (i.e. 'SK_Alternate' shufflevector instructions):
- added function 'isAlternateVectorMask';
- added some logic to check if an instruction is a alternate shuffle and, in
case, call the target specific TTI to get the corresponding shuffle cost;
- added a test to verify the cost model analysis on alternate shuffles.
llvm-svn: 212296
Before, we where looking at the size of the pointer type that specifies the
location from which to load the element. This did not make any sense at all.
This change fixes a bug in the delinearization where we failed to delinerize
certain load instructions.
llvm-svn: 210435
The delinearization is needed only to remove the non linearity induced by
expressions involving multiplications of parameters and induction variables.
There is no problem in dealing with constant times parameters, or constant times
an induction variable.
For this reason, the current patch discards all constant terms and multipliers
before running the delinearization algorithm on the terms. The only thing
remaining in the term expressions are parameters and multiply expressions of
parameters: these simplified term expressions are passed to the array shape
recognizer that will not recognize constant dimensions anymore: these will be
recognized as different strides in parametric subscripts.
The only important special case of a constant dimension is the size of elements.
Instead of relying on the delinearization to infer the size of an element,
compute the element size from the base address type. This is a much more precise
way of computing the element size than before, as we would have mixed together
the size of an element with the strides of the innermost dimension.
llvm-svn: 209691
This commit starts with a "git mv ARM64 AArch64" and continues out
from there, renaming the C++ classes, intrinsics, and other
target-local objects for consistency.
"ARM64" test directories are also moved, and tests that began their
life in ARM64 use an arm64 triple, those from AArch64 use an aarch64
triple. Both should be equivalent though.
This finishes the AArch64 merge, and everyone should feel free to
continue committing as normal now.
llvm-svn: 209577
This is a follow-up to r209358: PR19799: Indvars miscompile due to an
incorrect max backedge taken count from SCEV.
That fix was incomplete as pointed out by Arnold and Michael Z. The
code was also too confusing. It needed a careful rewrite with more
unit tests. This version will also happen to optimize more cases.
<rdar://17005101> PR19799: Indvars miscompile...
llvm-svn: 209545
This has to do with the trip count computation for loops with multiple
exits, which is quite subtle. Most passes just ask for a single trip
count number, so we must be conservative assuming any exit could be
taken. Normally, we rely on the "exact" trip count, which was
correctly given as "unknown". However, SCEV also gives a "max"
back-edge taken count. The loops max BE taken count is conservatively
a maximum over the max of each exit's non-exiting iterations
count. Note that some exit tests can be skipped so the max loop
back-edge taken count can actually exceed the max non-exiting
iterations for some exits. However, when we know the loop *latch*
cannot be skipped, we can directly use its max taken count
disregarding other exits. I previously took the minimum here without
checking whether the other exit could be skipped. The correct, and
simpler thing to do here is just to directly use the loop latch's max
non-exiting iterations as the loops max back-edge count.
In the problematic test case, the first loop exit had a max of zero
non-exiting iterations, but could be skipped. The loop latch was known
not to be skipped but had max of one non-exiting iteration. We
incorrectly claimed the loop back-edge could be taken zero times, when
it is actually taken one time.
Fixes Loop %for.body.i: <multiple exits> Unpredictable backedge-taken count.
Loop %for.body.i: max backedge-taken count is 1.
llvm-svn: 209358
Tested by comparing make check VERBOSE=1 before and after to make sure
no tests are missed. (VERBOSE=1 prints the list of tests.)
Only one test :( remains where .cpp is required:
tools/llvm-cov/range_based_for.cpp:// RUN: llvm-cov range_based_for.cpp | FileCheck %s --check-prefix=STDOUT
The topic was discussed in this thread:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140428/214905.html
llvm-svn: 208621
To compute the dimensions of the array in a unique way, we split the
delinearization analysis in three steps:
- find parametric terms in all memory access functions
- compute the array dimensions from the set of terms
- compute the delinearized access functions for each dimension
The first step is executed on all the memory access functions such that we
gather all the patterns in which an array is accessed. The second step reduces
all this information in a unique description of the sizes of the array. The
third step is delinearizing each memory access function following the common
description of the shape of the array computed in step 2.
This rewrite of the delinearization pass also solves a problem we had with the
previous implementation: because the previous algorithm was by induction on the
structure of the SCEV, it would not correctly recognize the shape of the array
when the memory access was not following the nesting of the loops: for example,
see polly/test/ScopInfo/multidim_only_ivs_3d_reverse.ll
; void foo(long n, long m, long o, double A[n][m][o]) {
;
; for (long i = 0; i < n; i++)
; for (long j = 0; j < m; j++)
; for (long k = 0; k < o; k++)
; A[i][k][j] = 1.0;
Starting with this patch we no longer delinearize access functions that do not
contain parameters, for example in test/Analysis/DependenceAnalysis/GCD.ll
;; for (long int i = 0; i < 100; i++)
;; for (long int j = 0; j < 100; j++) {
;; A[2*i - 4*j] = i;
;; *B++ = A[6*i + 8*j];
these accesses will not be delinearized as the upper bound of the loops are
constants, and their access functions do not contain SCEVUnknown parameters.
llvm-svn: 208232
This reverts commit r207287, reapplying r207286.
I'm hoping that declaring an explicit struct and instantiating
`addBlockEdges()` directly works around the GCC crash from r207286.
This is a lot more boilerplate, though.
llvm-svn: 207438
This reverts commit r207286. It causes an ICE on the
cmake-llvm-x86_64-linux buildbot [1]:
llvm/lib/Analysis/BlockFrequencyInfo.cpp: In lambda function:
llvm/lib/Analysis/BlockFrequencyInfo.cpp:182:1: internal compiler error: in get_expr_operands, at tree-ssa-operands.c:1035
[1]: http://bb.pgr.jp/builders/cmake-llvm-x86_64-linux/builds/12093/steps/build_llvm/logs/stdio
llvm-svn: 207287
Previously, irreducible backedges were ignored. With this commit,
irreducible SCCs are discovered on the fly, and modelled as loops with
multiple headers.
This approximation specifies the headers of irreducible sub-SCCs as its
entry blocks and all nodes that are targets of a backedge within it
(excluding backedges within true sub-loops). Block frequency
calculations act as if we insert a new block that intercepts all the
edges to the headers. All backedges and entries to the irreducible SCC
point to this imaginary block. This imaginary block has an edge (with
even probability) to each header block.
The result is now reasonable enough that I've added a number of
testcases for irreducible control flow. I've outlined in
`BlockFrequencyInfoImpl.h` ways to improve the approximation.
<rdar://problem/14292693>
llvm-svn: 207286
Remove the concepts of "forward" and "general" mass distributions, which
was wrong. The split might have made sense in an early version of the
algorithm, but it's definitely wrong now.
<rdar://problem/14292693>
llvm-svn: 207195
Strip irreducible testcases to pure control flow. The function calls
made the branch weights more believable but cluttered it up a lot.
There isn't going to be any constant analysis here, so just use dumb
branch logic to clarify the important parts.
<rdar://problem/14292693>
llvm-svn: 207192
The branch that skips irreducible backedges was only active when
propagating mass at the top-level. In particular, when propagating mass
through a loop recognized by `LoopInfo` with irreducible control flow
inside, irreducible backedges would not be skipped.
Not sure where that idea came from, but the result was that mass was
lost until after loop exit. Added a testcase that covers this case.
llvm-svn: 206860
This reverts commit r206707, reapplying r206704. The preceding commit
to CalcSpillWeights should have sorted out the failing buildbots.
<rdar://problem/14292693>
llvm-svn: 206766
This reverts commit r206677, reapplying my BlockFrequencyInfo rewrite.
I've done a careful audit, added some asserts, and fixed a couple of
bugs (unfortunately, they were in unlikely code paths). There's a small
chance that this will appease the failing bots [1][2]. (If so, great!)
If not, I have a follow-up commit ready that will temporarily add
-debug-only=block-freq to the two failing tests, allowing me to compare
the code path between what the failing bots and what my machines (and
the rest of the bots) are doing. Once I've triggered those builds, I'll
revert both commits so the bots go green again.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
<rdar://problem/14292693>
llvm-svn: 206704
This reverts commit r206666, as planned.
Still stumped on why the bots are failing. Sanitizer bots haven't
turned anything up. If anyone can help me debug either of the failures
(referenced in r206666) I'll owe them a beer. (In the meantime, I'll be
auditing my patch for undefined behaviour.)
llvm-svn: 206677
This reverts commit r206628, reapplying r206622 (and r206626).
Two tests are failing only on buildbots [1][2]: i.e., I can't reproduce
on Darwin, and Chandler can't reproduce on Linux. Asan and valgrind
don't tell us anything, but we're hoping the msan bot will catch it.
So, I'm applying this again to get more feedback from the bots. I'll
leave it in long enough to trigger builds in at least the sanitizer
buildbots (it was failing for reasons unrelated to my commit last time
it was in), and hopefully a few others.... and then I expect to revert a
third time.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
llvm-svn: 206666
This reverts commit r206622 and the MSVC fixup in r206626.
Apparently the remotely failing tests are still failing, despite my
attempt to fix the nondeterminism in r206621.
llvm-svn: 206628
This reverts commit r206556, effectively reapplying commit r206548 and
its fixups in r206549 and r206550.
In an intervening commit I've added target triples to the tests that
were failing remotely [1] (but passing locally). I'm hoping the mystery
is solved? I'll revert this again if the tests are still failing
remotely.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
llvm-svn: 206622
LazyCallGraph. This is the start of the whole point of this different
abstraction, but it is just the initial bits. Here is a run-down of
what's going on here. I'm planning to incorporate some (or all) of this
into comments going forward, hopefully with better editing and wording.
=]
The crux of the problem with the traditional way of building SCCs is
that they are ephemeral. The new pass manager however really needs the
ability to associate analysis passes and results of analysis passes with
SCCs in order to expose these analysis passes to the SCC passes. Making
this work is kind-of the whole point of the new pass manager. =]
So, when we're building SCCs for the call graph, we actually want to
build persistent nodes that stick around and can be reasoned about
later. We'd also like the ability to walk the SCC graph in more complex
ways than just the traditional postorder traversal of the current CGSCC
walk. That means that in addition to being persistent, the SCCs need to
be connected into a useful graph structure.
However, we still want the SCCs to be formed lazily where possible.
These constraints are quite hard to satisfy with the SCC iterator. Also,
using that would bypass our ability to actually add data to the nodes of
the call graph to facilite implementing the Tarjan walk. So I've
re-implemented things in a more direct and embedded way. This
immediately makes it easy to get the persistence and connectivity
correct, and it also allows leveraging the existing nodes to simplify
the algorithm. I've worked somewhat to make this implementation more
closely follow the traditional paper's nomenclature and strategy,
although it is still a bit obtuse because it isn't recursive, using
an explicit stack and a tail call instead, and it is interruptable,
resuming each time we need another SCC.
The other tricky bit here, and what actually took almost all the time
and trials and errors I spent building this, is exactly *what* graph
structure to build for the SCCs. The naive thing to build is the call
graph in its newly acyclic form. I wrote about 4 versions of this which
did precisely this. Inevitably, when I experimented with them across
various use cases, they became incredibly awkward. It was all
implementable, but it felt like a complete wrong fit. Square peg, round
hole. There were two overriding aspects that pushed me in a different
direction:
1) We want to discover the SCC graph in a postorder fashion. That means
the root node will be the *last* node we find. Using the call-SCC DAG
as the graph structure of the SCCs results in an orphaned graph until
we discover a root.
2) We will eventually want to walk the SCC graph in parallel, exploring
distinct sub-graphs independently, and synchronizing at merge points.
This again is not helped by the call-SCC DAG structure.
The structure which, quite surprisingly, ended up being completely
natural to use is the *inverse* of the call-SCC DAG. We add the leaf
SCCs to the graph as "roots", and have edges to the caller SCCs. Once
I switched to building this structure, everything just fell into place
elegantly.
Aside from general cleanups (there are FIXMEs and too few comments
overall) that are still needed, the other missing piece of this is
support for iterating across levels of the SCC graph. These will become
useful for implementing #2, but they aren't an immediate priority.
Once SCCs are in good shape, I'll be working on adding mutation support
for incremental updates and adding the pass manager that this analysis
enables.
llvm-svn: 206581
Rewrite the shared implementation of BlockFrequencyInfo and
MachineBlockFrequencyInfo entirely.
The old implementation had a fundamental flaw: precision losses from
nested loops (or very wide branches) compounded past loop exits (and
convergence points).
The @nested_loops testcase at the end of
test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating. This
function has three nested loops, with branch weights in the loop headers
of 1:4000 (exit:continue). The old analysis gives non-sensical results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
---- Block Freqs ----
entry = 1.0
for.cond1.preheader = 1.00103
for.cond4.preheader = 5.5222
for.body6 = 18095.19995
for.inc8 = 4.52264
for.inc11 = 0.00109
for.end13 = 0.0
The new analysis gives correct results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
block-frequency-info: nested_loops
- entry: float = 1.0, int = 8
- for.cond1.preheader: float = 4001.0, int = 32007
- for.cond4.preheader: float = 16008001.0, int = 128064007
- for.body6: float = 64048012001.0, int = 512384096007
- for.inc8: float = 16008001.0, int = 128064007
- for.inc11: float = 4001.0, int = 32007
- for.end13: float = 1.0, int = 8
Most importantly, the frequency leaving each loop matches the frequency
entering it.
The new algorithm leverages BlockMass and PositiveFloat to maintain
precision, separates "probability mass distribution" from "loop
scaling", and uses dithering to eliminate probability mass loss. I have
unit tests for these types out of tree, but it was decided in the review
to make the classes private to BlockFrequencyInfoImpl, and try to shrink
them (or remove them entirely) in follow-up commits.
The new algorithm should generally have a complexity advantage over the
old. The previous algorithm was quadratic in the worst case. The new
algorithm is still worst-case quadratic in the presence of irreducible
control flow, but it's linear without it.
The key difference between the old algorithm and the new is that control
flow within a loop is evaluated separately from control flow outside,
limiting propagation of precision problems and allowing loop scale to be
calculated independently of mass distribution. Loops are visited
bottom-up, their loop scales are calculated, and they are replaced by
pseudo-nodes. Mass is then distributed through the function, which is
now a DAG. Finally, loops are revisited top-down to multiply through
the loop scales and the masses distributed to pseudo nodes.
There are some remaining flaws.
- Irreducible control flow isn't modelled correctly. LoopInfo and
MachineLoopInfo ignore irreducible edges, so this algorithm will
fail to scale accordingly. There's a note in the class
documentation about how to get closer. See also the comments in
test/Analysis/BlockFrequencyInfo/irreducible.ll.
- Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
the 64-bit integer precision used downstream.
- The "bias" calculation proposed on llvmdev is *not* incorporated
here. This will be added in a follow-up commit, once comments from
this review have been handled.
llvm-svn: 206548
Previously, BranchProbabilityInfo::calcLoopBranchHeuristics would determine the weights of basic blocks inside loops even when it didn't have enough information to estimate the branch probabilities correctly. This patch fixes the function to exit early if it doesn't see any exit edges or back edges and let the later heuristics determine the weights.
This fixes PR18705 and <rdar://problem/15991090>.
Differential Revision: http://reviews.llvm.org/D3363
llvm-svn: 206194
BasicTTI::getMemoryOpCost must explicitly check for non-simple types; setting
AllowUnknown=true with TLI->getSimpleValueType is not sufficient because, for
example, non-power-of-two vector types return non-simple EVTs (not MVT::Other).
llvm-svn: 206150
This provides more realistic costs for the insert/extractelement instructions
(which are load/store pairs), accounts for the cheap unaligned Altivec load
sequence, and for unaligned VSX load/stores.
Bad news:
MultiSource/Applications/sgefa/sgefa - 35% slowdown (this will require more investigation)
SingleSource/Benchmarks/McGill/queens - 20% slowdown (we no longer vectorize this, but it was a constant store that was scalarized)
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2 - 2% slowdown
Good news:
SingleSource/Benchmarks/Shootout/ary3 - 54% speedup
SingleSource/Benchmarks/Shootout-C++/ary - 40% speedup
MultiSource/Benchmarks/Ptrdist/ks/ks - 35% speedup
MultiSource/Benchmarks/FreeBench/neural/neural - 30% speedup
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt - 20% speedup
Unfortunately, estimating the costs of the stack-based scalarization sequences
is hard, and adjusting these costs is like a game of whac-a-mole :( I'll
revisit this again after we have better codegen for vector extloads and
truncstores and unaligned load/stores.
llvm-svn: 205658
When a vector type legalizes to a larger vector type, and the target does not
support the associated extending load (or truncating store), then legalization
will scalarize the load (or store) resulting in an associated scalarization
cost. BasicTTI::getMemoryOpCost needs to account for this.
Between this, and r205487, PowerPC on the P7 with VSX enabled shows:
MultiSource/Benchmarks/PAQ8p/paq8p: 43% speedup
SingleSource/Benchmarks/BenchmarkGame/puzzle: 51% speedup
SingleSource/UnitTests/Vectorizer/gcc-loops 28% speedup
(some of these are new; some of these, such as PAQ8p, just reverse regressions
that VSX support would trigger)
llvm-svn: 205495
For an cast (extension, etc.), the currently logic predicts a low cost if the
associated operation (keyed on the destination type) is legal (or promoted).
This is not true when the number of values required to legalize the type is
changing. For example, <8 x i16> being sign extended by <8 x i32> is not
generically cheap on PPC with VSX, even though sign extension to v4i32 is
legal, because two output v4i32 values are required compared to the single
v8i16 input value, and without custom logic in the target, this conversion will
scalarize.
llvm-svn: 205487