It's a no-op, no overflow happens ever: https://alive2.llvm.org/ce/z/Zw89rZ
While I generally don't like such hacks,
we have a very good reason to do this: here we are expanding
a run-time correctness check for the vectorization,
and said `umul_with_overflow` will not be optimized out
before we query the cost of the checks we've generated.
Which means, the cost of run-time checks would be artificially inflated,
and after https://reviews.llvm.org/D109368 that will affect
the minimal trip count for which these checks are even evaluated.
And if they aren't even evaluated, then the vectorized code
certainly won't be run.
We could consider doing this in IRBuilder, but then we'd need to
also teach `CreateExtractValue()` to look into chains of `insertvalue`s,
and I'm not sure there's precedent for that.
Refs. https://reviews.llvm.org/D109368#3089809
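As a rough illustration of why the multiply-by-one case is safe to skip, here is a
minimal standalone C++ sketch. It models `umul_with_overflow` on 64-bit integers
instead of using the LLVM API; `umulWithOverflow` and `mulOrShortCircuit` are
hypothetical names introduced only for this example.

```cpp
#include <cstdint>
#include <limits>
#include <utility>

// Models `umul_with_overflow` on 64-bit values: returns {product, overflowed}.
std::pair<std::uint64_t, bool> umulWithOverflow(std::uint64_t a, std::uint64_t b) {
  std::uint64_t product = a * b; // unsigned multiply wraps modulo 2^64
  bool overflowed =
      a != 0 && b > std::numeric_limits<std::uint64_t>::max() / a;
  return {product, overflowed};
}

// Hypothetical helper: when the multiplier is known to be one, the product is
// the value itself and the overflow bit is always false, so nothing needs to
// be emitted (or costed) for the multiply.
std::pair<std::uint64_t, bool> mulOrShortCircuit(std::uint64_t value,
                                                 std::uint64_t step) {
  if (step == 1)
    return {value, false};
  return umulWithOverflow(value, step);
}
```

For every input, the short-circuited path agrees with the full multiply when the
step is one, which is the property the Alive2 link above checks for the IR.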
Gathered loads/extractelements/extractvalue instructions should be
checked if they can represent a vector reordering node too and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.
Differential Revision: https://reviews.llvm.org/D112454
While we could emit such a tautological `select`,
it will stick around until the next instsimplify invocation,
which may happen after we count the cost of this redundant `select`.
Which is precisely what happens with loop vectorization legality checks,
and that artificially increases the cost of said checks,
which is bad.
There is prior art for this in `IRBuilderBase::CreateAnd()`/`IRBuilderBase::CreateOr()`.
Refs. https://reviews.llvm.org/D109368#3089809
Combine FADD and FMUL intrinsics into FMA when the result of the FMUL is an FADD operand
with only one use, and both use the same predicate.
Differential Revision: https://reviews.llvm.org/D111638
Need to emit select(cmp) instructions for poison-safe forms of select
ops. Currently Alive2 reports that `Target is more poisonous than source`
for the operations we generate for such instructions.
https://alive2.llvm.org/ce/z/FiNiAA
Differential Revision: https://reviews.llvm.org/D112562
I have removed LoopVectorizationPlanner::setBestPlan, because this
function is quite aggressive: it deletes all plans except the one
containing the required <VF,UF> pair. The code is currently written
to assume that all <VF,UF> pairs will live in the same vplan. This is
overly restrictive, since scalable VFs live in different plans from
fixed-width VFs. When we add support for vectorising epilogue loops
when the main loop uses scalable vectors, the vplan for the main loop
will be different from the vplan for the epilogue.
Instead I have added a new function called
LoopVectorizationPlanner::getBestPlanFor
that returns the best vplan for the <VF,UF> pair requested and leaves
all the vplans untouched. We then pass this best vplan to
LoopVectorizationPlanner::executePlan
which now takes an additional VPlanPtr argument.
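The difference between a destructive "set" and a non-destructive lookup can be
sketched in standalone C++ as follows. This is a simplified model, not the LLVM
classes: `VF`, `Plan`, `Planner` and `numPlans` are hypothetical stand-ins, and
the UF half of the pair is left out for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a vectorization factor (fixed-width or scalable).
struct VF {
  unsigned lanes;
  bool scalable;
  bool operator==(const VF &other) const {
    return lanes == other.lanes && scalable == other.scalable;
  }
};

// Hypothetical stand-in for a plan that covers a set of VFs.
struct Plan {
  std::vector<VF> vfs;
  bool hasVF(VF vf) const {
    return std::find(vfs.begin(), vfs.end(), vf) != vfs.end();
  }
};

class Planner {
  std::vector<std::unique_ptr<Plan>> plans;

public:
  void addPlan(std::vector<VF> vfs) {
    plans.push_back(std::make_unique<Plan>(Plan{std::move(vfs)}));
  }

  // Non-destructive lookup: returns the plan containing `vf`, or nullptr,
  // and leaves every other plan in place for later queries.
  Plan *getBestPlanFor(VF vf) const {
    for (const auto &plan : plans)
      if (plan->hasVF(vf))
        return plan.get();
    return nullptr;
  }

  std::size_t numPlans() const { return plans.size(); }
};
```

With this shape, a scalable VF for the main loop and a fixed-width VF for the
epilogue can each be looked up in turn, because neither lookup discards the
other's plan.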
Differential revision: https://reviews.llvm.org/D111125
The final reduction nodes should not be reordered, since the order does
not matter for reductions. Also, it might be profitable to vectorize
smaller reduction trees, since the reduction cost may compensate for the
small tree cost.
Part of D111574
Differential Revision: https://reviews.llvm.org/D112467
The recently added logic to canonicalize exit conditions to unsigned relies on facts which hold about the use (i.e. the exit test). Applying this blindly to the icmp is not legal, as there may be another use which never reaches the exit. Restrict ourselves to the case where we have a single use.
Need to change the order of the reduction/binops args pair vectorization
attempts. Need to try to find the reduction first and postpone
vectorization of binops args. This may help to find more reduction
patterns and vectorize them.
Part of D111574.
Differential Revision: https://reviews.llvm.org/D112224
Fixes a crash observed by oss-fuzz in 39934. The issue at hand is that the code expects a pattern match on m_Mul to imply the operand is a mul instruction; however, mul constexprs are also valid here.
The logic in this patch is that if we find a comparison which would be unsigned except for when the loop is infinite, and we can prove that an infinite loop must be ill defined, we can still make the predicate unsigned.
The eventual goal (combined with a follow on patch) is to use the fact the loop exits to remove the zext (see tests) entirely.
A couple of points worth noting:
* We lose the ability to prove the loop unreachable by committing to the must exit interpretation. If instead, we later proved that rhs was definitely outside the range required for finiteness, we could have killed the loop entirely. (We don't currently implement this transform, but could, in theory, do so.)
* simplifyAndExtend has a very limited list of users it walks. In particular, in the examples it stops at the zext and never visits the icmp. (Because we can't fold the zext to an addrec yet in SCEV.) Being willing to visit when we haven't simplified regresses multiple tests (seemingly because of less optimal results when computing trip counts). D112170 explores fixing that, but - at least so far - appears to be too expensive compile time wise.
Differential Revision: https://reviews.llvm.org/D111836
This follows up on D111023 by exporting the generic "load value
from constant at given offset as given type" and using it in the
store to load forwarding code. We now need to make sure that the
load size is smaller than the store size; previously this was
implicitly ensured by ConstantFoldLoadThroughBitcast().
Differential Revision: https://reviews.llvm.org/D112260
Make use of the getGEPIndicesForOffset() helper for creating GEPs.
This handles arrays as well, uses correct GEP index types and
reduces code duplication.
Differential Revision: https://reviews.llvm.org/D112263
This patch adds more complex test cases with redundant stores of an
existing memset, with other stores in between.
It also makes a few of the existing tests more robust.
The math here is:
Cost of 1 load = cost of n loads / n
Cost of live loads = num live loads * Cost of 1 load
Cost of live loads = num live loads * (cost of n loads / n)
Cost of live loads = cost of n loads * (num live loads / n)
But, all the variables here are integers,
and integer division rounds down,
but this calculation clearly expects float semantics.
Instead multiply upfront, and then perform round-up-division.
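To make the rounding problem concrete, here is a small standalone C++ sketch.
The function names and the sample numbers are made up for illustration and are
not taken from the patch; `ceilDiv` is a local helper, not an LLVM API.

```cpp
#include <cstdint>

// Divides first: the per-load cost is truncated before it is scaled, so any
// fractional part of (cost of n loads / n) is lost.
std::uint64_t liveLoadCostTruncating(std::uint64_t costOfAllLoads,
                                     std::uint64_t numLoads,
                                     std::uint64_t numLiveLoads) {
  return numLiveLoads * (costOfAllLoads / numLoads);
}

// Round-up integer division, written to avoid overflow in `n + d - 1`.
std::uint64_t ceilDiv(std::uint64_t n, std::uint64_t d) {
  return n / d + (n % d != 0 ? 1 : 0);
}

// Multiplies up front, then performs a single round-up division.
std::uint64_t liveLoadCostRoundedUp(std::uint64_t costOfAllLoads,
                                    std::uint64_t numLoads,
                                    std::uint64_t numLiveLoads) {
  return ceilDiv(costOfAllLoads * numLiveLoads, numLoads);
}
```

For example, if 4 loads cost 3 in total and 2 of them are live, the exact
answer is 1.5. The divide-first version computes 2 * (3 / 4) = 0, while the
multiply-first version computes ceil(6 / 4) = 2.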
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112302
When I was playing with coroutines, I found that it is possible to generate
the following IR:
```
%struct = alloca ...
%sub.element = getelementptr %struct, i64 0, i64 index ; index is not zero
lifetime.marker.start(%sub.element)
% use of %sub.element
lifetime.marker.end(%sub.element)
store %struct to xxx ; %struct is escaping!
<suspend points>
```
Then the AllocaUseVisitor would collect the lifetime markers for
sub.element and treat them as the lifetime markers of the whole alloca!
As a result, it decides, from the lifetime markers alone, that the alloca
can be put on the stack instead of the frame.
The root cause of the bug is that AllocaUseVisitor collects the wrong
lifetime markers.
This patch fixes this.
Reviewed By: lxfind
Differential Revision: https://reviews.llvm.org/D112216
As discussed in D112016, our current requirement of speculatability
for ephemeral values is overly strict: What we really care about is that
the instruction will be DCEd once the assume is dropped. For that
it is sufficient that the instruction is side-effect free and not
a terminator.
In particular, this allows non-dereferenceable loads to be ephemeral
values.
Differential Revision: https://reviews.llvm.org/D112179
shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
This is motivated by an example in D111800
(although that patch avoids the problem for that particular example).
The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D
There is no difference on the PhaseOrdering test from D111800
because the aarch64 cost model says that the shuffle cost is 3 while
the fadd cost is 2.
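The equivalence behind the fold can be checked on concrete values with a small
standalone C++ model. This is only a sketch: it uses 4-element integer vectors
and addition as the binop, and `shuffle`, `shuffleOfBinops` and
`binopOfShuffles` are hypothetical names, not LLVM functions.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;
using Vec = std::array<int, N>;
using Mask = std::array<std::size_t, N>;

// Element-wise binop (addition stands in for `bo`).
Vec add(const Vec &a, const Vec &b) {
  Vec r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = a[i] + b[i];
  return r;
}

// Two-operand shuffle: mask indices [0, N) select from `a`,
// indices [N, 2N) select from `b`.
Vec shuffle(const Vec &a, const Vec &b, const Mask &mask) {
  Vec r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = mask[i] < N ? a[mask[i]] : b[mask[i] - N];
  return r;
}

// Original form: shuf (bo X, Y), (bo X, W)
Vec shuffleOfBinops(const Vec &x, const Vec &y, const Vec &w, const Mask &mask) {
  return shuffle(add(x, y), add(x, w), mask);
}

// Rewritten form: bo (shuf X), (shuf Y, W)
// The shared operand X only needs a one-source shuffle with the mask taken
// modulo N, because both binops read X at the same lane.
Vec binopOfShuffles(const Vec &x, const Vec &y, const Vec &w, const Mask &mask) {
  Mask unaryMask{};
  for (std::size_t i = 0; i < N; ++i)
    unaryMask[i] = mask[i] % N;
  return add(shuffle(x, x, unaryMask), shuffle(y, w, mask));
}
```

In this model the rewrite trades one binop for one extra shuffle, so it can only
pay off when the binop costs more than the shuffle, which is consistent with the
aarch64 numbers quoted above.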
Differential Revision: https://reviews.llvm.org/D111901
Vectorization of PHIs and stores is very similar, so it might be beneficial
to try to revectorize stores (like PHIs) if the total number of stores with
the same/alternate opcode is less than the vector size but the number of
stores with the same type is larger than the vector size.
Differential Revision: https://reviews.llvm.org/D109831