Fold %x umin_seq %y to %x if %x ule %y. This also subsumes the
special handling for constant operands: if %y is constant, this
folds to umin via implied poison reasoning, and if %x is constant,
then either %x is non-zero and it folds to umin, or %x is known
to be zero, in which case it is ule anything.
Fold %x umin_seq %y to %x umin %y if %x cannot be zero; the two
only differ in semantics for %x == 0.
More generally, %x *_seq %y folds to %x * %y if %x cannot be the
saturation value (though currently we only have umin_seq).
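Both folds follow from the definition of the sequential min. A minimal
C++ model of the semantics, with poison represented as an empty optional
(helper names are illustrative; umin_seq is a SCEV expression, not an IR
instruction):

    #include <algorithm>
    #include <cstdint>
    #include <optional>

    using Val = std::optional<uint64_t>; // nullopt models poison

    // %x umin %y: poison if either operand is poison.
    Val umin(Val X, Val Y) {
      if (!X || !Y)
        return std::nullopt;
      return std::min(*X, *Y);
    }

    // %x umin_seq %y: %y is only "evaluated" when %x is non-zero, so a
    // poison %y cannot leak through when %x == 0 (the saturation value).
    Val umin_seq(Val X, Val Y) {
      if (!X)
        return std::nullopt;
      if (*X == 0)
        return 0;
      return umin(X, Y);
    }

If %x cannot be zero, the two functions agree on every input, which is
exactly the second fold.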
This adds some extra costs for reverse shuffles under AArch64, filling
in the i16/f16/i8 gaps in the cost model.
Differential Revision: https://reviews.llvm.org/D124786
Based off the script from D103695: on AVX1, Jaguar and Bulldozer both have low throughput for ymm select patterns (BLENDV and OR(AND,ANDN)), and even on AVX2, Haswell still struggles with BLENDV ops.
The new flag -aarch64-insert-extract-base-cost can be used to
set the value of AArch64Subtarget::getVectorInsertExtractBaseCost(),
for the purposes of experimentation.
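The usual shape of such a flag (a sketch under assumed names and wiring;
only the flag name comes from the patch):

    #include "llvm/Support/CommandLine.h"
    using namespace llvm;

    // Hidden experimentation flag; default handling below is an assumption.
    static cl::opt<unsigned> InsertExtractBaseCost(
        "aarch64-insert-extract-base-cost", cl::Hidden,
        cl::desc("Base cost of vector insert/extract element"));

    unsigned getVectorInsertExtractBaseCostImpl(unsigned TunedDefault) {
      // Use the subtarget's tuned value unless the flag was explicitly set.
      if (InsertExtractBaseCost.getNumOccurrences() > 0)
        return InsertExtractBaseCost;
      return TunedDefault;
    }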
Differential Revision: https://reviews.llvm.org/D124835
Similar to how we convert logical and/or to bitwise and/or, we should
also convert umin_seq to umin based on implied poison reasoning. In
%x umin_seq %y, if %y being poison implies %x being poison, then we
don't need the sequential evaluation: Having %y contribute towards
the result will never make the result more poisonous. An important
corollary of this is that if %y is never poison, we also don't need
the sequential evaluation.
This avoids some of the regressions in D124910.
Differential Revision: https://reviews.llvm.org/D124921
Added a motivating test case for D123400 where the loopnest has a
suboptimal loop order, j-i-k. After D123400 we ensure that the order
of the loop cache analysis output is i-j-k, despite the suboptimal
order in the original loopnest.
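A loop nest of roughly this shape illustrates the scenario (hypothetical;
the actual test case is part of the patch):

    // j-i-k order is suboptimal for a row-major A[i][j][k] access; the
    // analysis should nevertheless print the loops in the locality-friendly
    // i-j-k order (assumes N <= 64 to stay within the fixed bounds).
    void jik(int N, double A[64][64][64]) {
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
          for (int k = 0; k < N; ++k)
            A[i][j][k] += 1.0;
    }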
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D122776
This builds on top of the target-independent cost model added in D124269
to add AArch64-specific costs for fptoui_sat and fptosi_sat intrinsics.
For many common types these will be legal instructions, as the AArch64
instructions saturate naturally. For unsupported pairs of integer
and floating point types, an additional min/max clamp is needed.
Differential Revision: https://reviews.llvm.org/D124357
We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.
The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.
The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
The print output of loop cache analysis sometimes has a non-deterministic
order, so we have been using `CHECK-DAG` in its lit tests. This patch
changes the sorting of LoopCosts to llvm::stable_sort(), comparing loops
by their cost numbers. When loop cost numbers are equal, llvm::stable_sort()
now produces a deterministic loop order.
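The change amounts to something like this (a sketch; the element type and
the sort direction are assumptions):

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/ADT/SmallVector.h"

    // LoopCosts holds (loop, cost-number) pairs. llvm::stable_sort keeps
    // the relative order of equal-cost entries, so ties no longer produce
    // a non-deterministic printed order.
    template <typename LoopCostT>
    void sortLoopCosts(llvm::SmallVectorImpl<LoopCostT> &LoopCosts) {
      llvm::stable_sort(LoopCosts,
                        [](const LoopCostT &A, const LoopCostT &B) {
                          return A.second < B.second;
                        });
    }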
Reviewed By: Meinersbur, fhahn, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124725
If the legalized src/dst types are the same, assume the "truncation" is free.
This fixes some edge cases, such as mul lo/hi ops and bool vectors, which get legalized back to legal vector widths.
Based off the script from D103695: we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion by using instruction count instead of effective throughput.
Currently loop cache cost (LCC) cannot analyze fixed-size arrays,
since it cannot delinearize them. This patch adds the capability
to delinearize fixed-size arrays to LCC. Most of the code is ported
from DependenceAnalysis.cpp, and some refactoring will be done in a
follow-up patch.
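As an illustration of what the delinearization recovers (hypothetical
example, not code from the patch):

    enum { D1 = 64, D2 = 128 };
    double A[D1][D2];

    double load(int i, int j) {
      // A[i][j] lowers to *(&A[0][0] + (i * D2 + j)); with the constant
      // dimension sizes known from the array type, the flattened subscript
      // i*D2 + j can be split back into the per-dimension subscripts
      // {i, j} that the cache cost model reasons about.
      return A[i][j];
    }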
Reviewed By: #loopoptwg, Meinersbur
Differential Revision: https://reviews.llvm.org/D122857
Need to normalize the mask to avoid possible crashes during attempts
to estimate the cost of very long shuffles with a non-power-of-2 number
of elements in the masks.
The legacy LoopUnswitch pass is only used in the legacy pass manager
pipeline, which is deprecated.
The NewPM replacement is SimpleLoopUnswitch and I think it is time to
remove the legacy LoopUnswitch code.
Fixes #31000.
Reviewed By: aeubanks, Meinersbur, asbirlea
Differential Revision: https://reviews.llvm.org/D124376
Add a cost model for broadcast shuffles of scalable vectors in
RISCVTTIImpl::getShuffleCost. The cost model might not be the best.
For scalable vectors, BasicTTIImpl::getShuffleCost returns an invalid
cost, so this patch cannot simply rely on the existing cost model in
BasicTTIImpl.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124101
Introduced shuffle masks where they were not previously provided, and
improved target-dependent cost models to avoid returning incorrect cost
results after the masks were added.
Differential Revision: https://reviews.llvm.org/D100486
Renamed test/Analysis/CostModel/X86/splat-load.ll to shuffle-load.ll
to align it with AArch64's similar test.
Also added a complete list of checks for all vector combinations up to 512 bits.
Differential Revision: https://reviews.llvm.org/D124528
Given a larger-than-legal shuffle mask, the final codegen will split
into multiple sub-vectors. This attempts to model that in
AArch64TTIImpl::getShuffleCost, splitting masks up according to the size
of the legalized vectors. If the sub-masks have at most 2 input sources
we can call getShuffleCost on them and sum the costs, to get a more
accurate final cost for the entire shuffle. The call to
improveShuffleKindFromMask helps to improve the shuffle kind for the
sub-mask cost call.
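A sketch of the splitting step (illustrative only; the real code also
normalizes each sub-mask's input sources before the cost call):

    #include <algorithm>
    #include <vector>

    // Chop a wide shuffle mask into legal-width pieces; each piece can
    // then be costed on its own when it draws from at most two inputs.
    std::vector<std::vector<int>> splitMask(const std::vector<int> &Mask,
                                            size_t LegalNumElts) {
      std::vector<std::vector<int>> Pieces;
      for (size_t I = 0; I < Mask.size(); I += LegalNumElts) {
        size_t E = std::min(Mask.size(), I + LegalNumElts);
        Pieces.emplace_back(Mask.begin() + I, Mask.begin() + E);
      }
      return Pieces;
    }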
Differential Revision: https://reviews.llvm.org/D123414
Given a shuffle with 4 elements of size 16 or 32, we can use the costs
directly from the PerfectShuffle tables to get a slightly more accurate
cost for the resulting shuffle.
Differential Revision: https://reviews.llvm.org/D123409
This adds some basic fptosi_sat and fptoui_sat target independent cost
modelling. The saturation is modelled as an fmin/fmax clamp on the
value, followed by an fp convert. The signed versions then have an
additional fcmp+select for handling NaN correctly.
The AArch64/Arm costs may be less accurate, as the instructions exist
natively. This can be fixed with target specific cost updates.
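A rough scalar model of the sequence being costed (a sketch; not a
bit-exact implementation of llvm.fptosi.sat):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    int32_t fptosi_sat_model(float X) {
      if (std::isnan(X))                        // the extra fcmp + select
        return 0;
      const float Lo = -2147483648.0f;          // exactly INT32_MIN
      const float Hi = 2147483520.0f;           // largest float <= INT32_MAX
      float C = std::min(std::max(X, Lo), Hi);  // the fmin/fmax clamp
      return (int32_t)C;                        // the fp convert
    }

The upper clamp is slightly conservative because INT32_MAX is not exactly
representable as a float; the cost model only needs the shape of the
sequence, not bit-exactness.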
Differential Revision: https://reviews.llvm.org/D124269
Before this patch, `Args` was used by SLP to pass a broadcast's
arguments. This patch changes that: `Args` is now used for passing the
operands of the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
This test is no longer necessary as it is a subset of:
llvm/test/Analysis/CostModel/AArch64/shuffle-load.ll
Differential Revision: https://reviews.llvm.org/D124456
When constructing canonical relationships between two regions, the first instruction of a basic block from the first region is used to find the corresponding basic block from the second region. However, debug instructions are not included in similarity matching, and therefore do not have a canonical numbering. This patch makes sure to ignore the debug instructions when finding the first instruction in a basic block.
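The fix boils down to a helper of this shape (a sketch using LLVM's DbgInfoIntrinsic check; not necessarily the exact code):

    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/IntrinsicInst.h"
    using namespace llvm;

    // Return the first instruction that participates in similarity
    // matching, skipping debug intrinsics, which have no canonical number.
    static Instruction *getFirstCanonicalInst(BasicBlock &BB) {
      for (Instruction &I : BB)
        if (!isa<DbgInfoIntrinsic>(I))
          return &I;
      return nullptr; // block contains only debug instructions
    }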
Reviewer: paquette
Differential Revision: https://reviews.llvm.org/D123903