This was modeled to have a cost of 1, but since we do not have a MUL.2d this is
scalarized into vector inserts/extracts and scalar muls.
Motivating precommitted test is test/Transforms/SLPVectorizer/AArch64/mul.ll,
which we don't want to SLP vectorize.
Test Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
unfortunately needed changing, but the reason is documented in
LoopVectorize.cpp:6855:
// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.
which I will address next as a follow up of this.
Differential Revision: https://reviews.llvm.org/D92208
Add a basic implementation of getGatherScatterOpCost to BasicTTIImpl.
The implementation estimates the cost of scalarizing the loads/stores,
the cost of packing/extracting the individual lanes and the cost of
only selecting enabled lanes.
This more accurately reflects the current cost on targets like AArch64.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91984
For the scattered operands of load instructions it makes sense
to use gathering load intrinsic, which can lower to native instruction
for X86/AVX512 and ARM/SVE. This also enables building
vectorization tree with entries containing scattered operands.
The next step is to add scattered store.
Fixes PR47629 and PR47623
Differential Revision: https://reviews.llvm.org/D90445
Claim to not have any vector support to dissuade SLP, LV and friends
from generating SIMD IR for the VE target. We will take this back once
vector isel is stable.
Reviewed By: kaz7, fhahn
Differential Revision: https://reviews.llvm.org/D90462
This reverts the revert commit a1b53db324.
This patch includes a fix for a reported issue, caused by
matchSelectPattern returning UMIN for selects of pointers in
some cases by looking to some connected casts.
For now, ensure integer instrinsics are only returned for selects of
ints or int vectors.
This reverts commit 1922570489.
This appears to cause a crash in the following example
a, b, c;
l() {
int e = a, f = l, g, h, i, j;
float *d = c, *k = b;
for (;;)
for (; g < f; g++) {
k[h] = d[i];
k[h - 1] = d[j];
h += e << 1;
i += e;
}
}
clang -cc1 -triple i386-unknown-linux-gnu -emit-obj -target-cpu pentium-m -O1 -vectorize-loops -vectorize-slp reduced.c
llvm::Type *llvm::Type::getWithNewBitWidth(unsigned int) const: Assertion `isIntOrIntVectorTy() && "Original type expected to be a vector of integers or a scalar integer."' failed.
Use -0.0 instead of 0.0 as the start value. The previous use of 0.0
was fine for all existing uses of this function though, as it is
always generated with fast flags right now, and thus nsz.
Some architectures do not have general vector select instructions (e.g.
AArch64). But some cmp/select patterns can be vectorized using other
instructions/intrinsics.
One example is using min/max instructions for certain patterns.
This patch updates the cost calculations for selects in the SLP
vectorizer to consider using min/max intrinsics.
This patch does not change SLP vectorizer's codegen itself to actually
generate those intrinsics, but relies on the backends to lower the
vector cmps & selects. This keeps things simple on the SLP side and
works well in practice for AArch64.
This exposes additional SLP vectorization opportunities in some
benchmarks on AArch64 (-O3 -flto).
Metric: SLP.NumVectorInstructions
Program base slp diff
test-suite...ications/JM/ldecod/ldecod.test 502.00 697.00 38.8%
test-suite...ications/JM/lencod/lencod.test 1023.00 1414.00 38.2%
test-suite...-typeset/consumer-typeset.test 56.00 65.00 16.1%
test-suite...6/464.h264ref/464.h264ref.test 804.00 822.00 2.2%
test-suite...006/453.povray/453.povray.test 3335.00 3357.00 0.7%
test-suite...CFP2000/177.mesa/177.mesa.test 2110.00 2121.00 0.5%
test-suite...:: External/Povray/povray.test 2378.00 2382.00 0.2%
Reviewed By: RKSimon, samparker
Differential Revision: https://reviews.llvm.org/D89969
AArch64 does not have a flexible vector select instruction. In some
cases, the selects can be turned into min/max however, for which there
are dedicated vector instructions on AArch64.
This patch adds some tests for such cases.
D70365 allows us to make attributes default. This is a follow up to
actually make nosync, nofree and willreturn default. The approach we
chose, for now, is to opt-in to default attributes to avoid introducing
problems to target specific intrinsics. Intrinsics with default
attributes can be created using `DefaultAttrsIntrinsic` class.
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D45263
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D45263
This is one (small) part of improving PR41312:
https://llvm.org/PR41312
As shown there and in the smaller tests here, if we have some member of the
reduction values that does not match the others, we want to push it to the
end (bring the matching members forward and together).
In the regression tests, we have 5 candidates for the 4 slots of the reduction.
If the one "wrong" compare is grouped with the others, it prevents forming the
ideal v4i1 compare reduction.
Differential Revision: https://reviews.llvm.org/D87772
For scalable type, the aggregated size is unknown at compile-time.
Skip instructions with scalable type to ensure the list of instructions
for vectorizeSimpleInstructions does not contains any scalable-vector instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D87550
The test example based on PR47450 shows that we can
match non-byte-sized shifts, but those won't ever be
bswap opportunities. This isn't a full fix (we'd still
match if the shifts were by 8-bits for example), but
this should be enough until there's evidence that we
need to do more (this is a borderline case for
vectorization in the first place).
Previously we could match fcmp+select to a reduction if the fcmp had
the nonans fast math flag. But if the select had the nonans fast
math flag, InstCombine would turn it into a fminnum/fmaxnum intrinsic
before SLP gets to it. Seems fairly likely that if one of the
fcmp+select pair have the fast math flag, they both would.
My plan is to start vectorizing the fmaxnum/fminnum version soon,
but I wanted to get this code out as it had some of the strangest
fast math flag behaviors.
Other types can be handled in future patches but their uniform / non-uniform costs are more similar and don't appear to cause many vectorization issues.
The legacy SLPVectorizer has a dependency on InjectTLIMappingsLegacy.
That cannot be expressed in the new PM since they are both normal
passes. Explicitly add -inject-tli-mappings as a pass.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86492
The 1st attempt (rG557b890) was reverted because it caused miscompiles.
That bug is avoided here by changing the order of folds and as verified
in the new tests.
Original commit message:
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
InstCombine currently has odd rules for folding insert-extract chains to shuffles,
so we miss collapsing seemingly simple cases as shown in the tests here.
But poison makes this not quite as easy as we might have guessed. Alive2 tests to
show the subtle difference (similar to the regression tests):
https://alive2.llvm.org/ce/z/hp4hv3 (this is ok)
https://alive2.llvm.org/ce/z/ehEWaN (poison leakage)
SLP tends to create these patterns (as shown in the SLP tests), and this could
help with solving PR16739.
Differential Revision: https://reviews.llvm.org/D86460
The legacy PM alias analysis pipeline by default includes basic-aa.
When running `opt -foo-pass` under the NPM and -disable-basic-aa is not
specified, use basic-aa.
This decreases the number of check-llvm failures under NPM from 913 to 752.
Reviewed By: ychen, asbirlea
Differential Revision: https://reviews.llvm.org/D86167
This reverts commit 52b71aa8b1.
The problem was a missing lit.local.cfg file, which was causing the
test to be incorrectly run on bots that had not built the WebAssembly
target.