LoopVectorize has a fairly deeply baked in design problem where it will try to query analysis (primarily SCEV, but also ValueTracking) in the midst of mutating IR. In particular, the intermediate IR state does not represent the semantics of the original (or final) program.
Fixing this for real is hard, but all of the cases seen so far share a common symptom. In cases seen to date, the analysis being queried is the computation of the original loop's trip count. We can fix this particular instance of the issue by simply computing the trip count early, and caching it.
I want to be really clear that this is nothing but a workaround. It does nothing to fix the root issue, and at best, delays the time until we have to fix this for real. Florian and I have discussed an eventual solution in the review comments for https://reviews.llvm.org/D100663, but it's a lot of work.
Test taken from https://reviews.llvm.org/D100663.
Differential Revision: https://reviews.llvm.org/D101487
This patch updates the code that sinks recipes required for first-order
recurrences to properly handle replicate-regions. At the moment, the
code would just move the replicate recipe out of its replicate-region,
producing an invalid VPlan.
When sinking a recipe in a replicate-region, we have to sink the whole
region. To do that, we first need to split the block at the target
recipe and move the region in between.
This patch also adds a splitAt helper to VPBasicBlock to split a
VPBasicBlock at a given iterator.
Fixes PR50009.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D100751
This patch is to address https://bugs.llvm.org/show_bug.cgi?id=49916.
When the size of an alloca is 0, it will trigger an assertion in OptimizedStructLayout when being added to the frame.
Fix it by not adding it at all. We return index 0 (beginning of the frame) for all 0-sized allocas.
Differential Revision: https://reviews.llvm.org/D101841
We need to use a logical or instead of a bitwise or to preserve
poison behavior. Poison from the second condition should not
propagate if the first condition is true.
We were already handling this correctly in FoldBranchToCommonDest(),
but not in this fold. (There are still other folds with this issue.)
This fixes https://llvm.org/PR48900 , but as seen in the
regression tests prevents some optimizations.
There are a few options to restore those (switch to min/max
intrinsics, add larger pattern matching for select with
dominating condition, improve CVP), but we need to prevent
the bug 1st.
This patch updates the code handling reduction recipes to also keep
track of the incoming value from the latch in the recipe. This is needed
to model the def-use chains completely in VPlan, so that it is possible
to replace the incoming value with an arbitrary VPValue.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D99294
We were missing bitreverse matches in cases where InstCombine had seen a byte-level rotation at the end of a bitreverse sequence (replacing or() with fshl()), hindering the exhaustive bitreverse matching in CodeGenPrepare later on.
Summary:
Add the AAExecutionDomainInfo attributor instance to OpenMPOpt.
This will infer information relating to domain information that an
instruction might be expecting in. Right now this only includes a very
crude check for instructions that will be executed by the master thread
by comparing a thread-id function with a constant zero.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101578
When passingValueIsAlwaysUndefined scans for an instruction between an
inst with a null or undef argument and its first use, it was checking
for instructions that may have side effects, which is a superset of the
instructions it intended to find (as per the comments, control flow
changing instructions that would prevent reaching the uses). Switch
to using isGuaranteedToTransferExecutionToSuccessor() instead.
Without this change, when enabling -fwhole-program-vtables, which causes
assumes to be inserted by clang, we can get different simplification
decisions. In particular, when building with instrumentation FDO it can
affect the optimizations decisions before FDO matching, leading to some
mismatches.
I had to modify d83507-knowledge-retention-bug.ll since this fix enables
more aggressive optimization of that code such that it no longer tested
the original bug it was meant to test. I removed the undef which still
provokes the original failure (confirmed by temporarily reverting the
fix) and also changed it to just invoke the passes of interest to narrow
the testing.
Similarly I needed to adjust code for UnreachableEliminate.ll to avoid
an undef which was causing the function body to get optimized away with
this fix.
Differential Revision: https://reviews.llvm.org/D101507
There's a TODO comment in the code and discussion in D99912
about generalizing this, but I wasn't sure how to implement that,
so just going with a potential minimal fix to avoid crashing.
The test is a reduction beyond useful code (there's no user of
%user...), but it is based on https://llvm.org/PR50191, so this
is asserting on real code.
Differential Revision: https://reviews.llvm.org/D101772
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Originally submitted as 3338290c18.
Reverted in c7df6b1223.
Need to check if target allows/supports masked gathers before trying to
estimate its cost, otherwise we may fail to vectorize some of the
patterns because of too pessimistic cost model.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101297
Need to check if target allows/supports masked gathers before trying to
estimate its cost, otherwise we may fail to vectorize some of the
patterns because of too pessimistic cost model.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101297
This update supports the following transformation:
```
select(extract(mul_with_overflow(a, _), _), (a == 0), false)
=>
and(extract(mul_with_overflow(a, _), _), (a == 0))
```
which is correct because if `a` was poison the select's condition was
also poison.
This update is splitted from D101423.
Apply the same logic used to check if CMPXCHG nodes should be expanded
at -O0: the register allocator may end up spilling some register in
between the atomic load/store pairs, breaking the atomicity and possibly
stalling the execution.
Fixes PR48017
Reviewed By: efriedman
Differential Revision: https://reviews.llvm.org/D101163
If the extracts from the non-power-2 vectors are recognized as shuffles,
need some extra checks to not crash cost calculations if trying to gext
the ecost for subvector extracts. In this case need to check carefully
that we do not exit out of bounds of the original vector, otherwise the
TTI's cost model will crash on assert.
Differential Revision: https://reviews.llvm.org/D101477
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Previous attempt to fix infinite recursion in min/max reassociation was not fully successful (D100170). Newly discovered failing case is due to not properly handled when there is a single use. It should be processed separately from 2 uses case.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D101359
Hoisting and sinking instructions out of conditional blocks enables
additional vectorization by:
1. Executing memory accesses unconditionally.
2. Reducing the number of instructions that need predication.
After disabling early hoisting / sinking, we miss out on a few
vectorization opportunities. One of those is causing a ~10% performance
regression in one of the Geekbench benchmarks on AArch64.
This patch tires to recover the regression by running hoisting/sinking
as part of a SimplifyCFG run after LoopRotate and before LoopVectorize.
Note that in the legacy pass-manager, we run LoopRotate just before
vectorization again and there's no SimplifyCFG run in between, so the
sinking/hoisting may impact the later run on LoopRotate. But the impact
should be limited and the benefit of hosting/sinking at this stage
should outweigh the risk of not rotating.
Compile-time impact looks slightly positive for most cases.
http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=instructions
NewPM-O3: geomean -0.19%
NewPM-ReleaseThinLTO: geoman -0.54%
NewPM-ReleaseLTO-g: geomean -0.03%
With a few benchmarks seeing a notable increase, but also some
improvements.
Alternative to D101290.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D101468
It seems incorrect to use TTI data in some places,
and override it in others. In this case, TTI says
that `extractvalue` are free, yet we bill them.
While this doesn't address https://bugs.llvm.org/show_bug.cgi?id=50099 yet,
it reduces the cost from 55 to 50 while the threshold is 45.
Differential Revision: https://reviews.llvm.org/D101228
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
The profitability check is: we don't want to create more than a single PHI
per instruction sunk. We need to create the PHI unless we'll sink
all of it's would-be incoming values.
But there is a caveat there.
This profitability check doesn't converge on the first iteration!
If we first decide that we want to sink 10 instructions,
but then determine that 5'th one is unprofitable to sink,
that may result in us not sinking some instructions that
resulted in determining that some other instruction
we've determined to be profitable to sink becoming unprofitable.
So we need to iterate until we converge, as in determine
that all leftover instructions are profitable to sink.
But, the direct approach of just re-iterating seems dumb,
because in the worst case we'd find that the last instruction
is unprofitable, which would result in revisiting instructions
many many times.
Instead, i think we can get away with just two passes - forward and backward.
However then it isn't obvious what is the most performant way to update
InstructionsToSink.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument which is size_t.
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and VPRs
Co-authored-by: Paul Walker <paul.walker@arm.com>
Depends on D101302
Differential Revision: https://reviews.llvm.org/D101167
As part of this the ptrue coalescing done in SVEIntrinsicOpts has been
modified to not introduce redundant converts, since the convert removal
will no longer run after that optimisation to clean up.
Differential Revision: https://reviews.llvm.org/D101302
While we have a known profitability issue for sinking in presence of
non-unconditional predecessors, there isn't any known issues
for having multiple such non-unconditional predecessors,
so said restriction appears to be artificial. Lift it.
This patch causes the loop vectorizer to not interleave loops that have
nounroll loop hints (llvm.loop.unroll.disable and llvm.loop.unroll_count(1)).
Note that if a particular interleave count is being requested
(through llvm.loop.interleave_count), it will still be honoured, regardless
of the presence of nounroll hints.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D101374
SCEV does not look through non-header PHIs inside the loop. Such phis
can be analyzed by adding separate accesses for each incoming pointer
value.
This results in 2 more loops vectorized in SPEC2000/186.crafty and
avoids regressions when sinking instructions before vectorizing.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D101286
Before this change LLVM cannot simplify printf in following cases:
printf("%s", "") --> noop
printf("%s", str"\n") --> puts(str)
From the other hand GCC can perform such transformations for many years:
https://godbolt.org/z/7nnqbedfe
Differential Revision: https://reviews.llvm.org/D100724
This patch fixes a crash encountered when vectorising the following loop:
void foo(float *dst, float *src, long long n) {
for (long long i = 0; i < n; i++)
dst[i] = -src[i];
}
using scalable vectors. I've added a test to
Transforms/LoopVectorize/AArch64/sve-basic-vec.ll
as well as cleaned up the other tests in the same file.
Differential Revision: https://reviews.llvm.org/D98054
If the first tree element is vectorize and the second is gather, it
still might be profitable to vectorize it if the gather node contains
less scalars to vectorize than the original tree node. It might be
profitable to use shuffles.
Differential Revision: https://reviews.llvm.org/D101397
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur
in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
variable that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.
Differential Revision: https://reviews.llvm.org/D99718
This patch also refactors the way the feasible max VF is calculated,
although this is NFC for fixed-width vectors.
After this change scalable VF hints are no longer truncated/clamped
to a shorter scalable VF, nor does it drop the 'scalable flag' from
the suggested VF to vectorize with a similar VF that is fixed.
Instead, the hint is ignored which means the vectorizer is free
to find a more suitable VF, using the CostModel to determine the
best possible VF.
Reviewed By: c-rhodes, fhahn
Differential Revision: https://reviews.llvm.org/D98509
There are post-commit notest for e4c61d5 that suggest
the test is failing on certain bots. It looks like
the code there isn't being moved, which suggests
cost-model involvement, which suggests that we need to
hardcode the target triple.
Hopefully this helps?
When using the -enable-strict-reductions flag where UF>1 we generate multiple
Phi nodes, though only one of these is used as an input to the vector.reduce.fadd
intrinsics. The unused Phi nodes are removed later by instcombine.
This patch changes widenPHIInstruction/fixReduction to only generate
one Phi, and adds an additional test for unrolling to strict-fadd.ll
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D100570
Prevent cases in which the start value of IV is bigger than bound for
increasing.
Prevent cases in which the start value of IV is smaller than bound for
decreasing.
Differential Revision: https://reviews.llvm.org/D101174
Solves PR11896
As noted, this can be improved futher (calloc -> malloc) in some cases. But for know, this is the first step.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D101391