Motivating examples are seen in the PhaseOrdering tests based on:
https://bugs.llvm.org/show_bug.cgi?id=43953#c2 - if we have
intrinsics there, some pass can fold them.
The intrinsics are still named "experimental" at this point, but
if there is no fallout from this patch, that will be a good
indicator that it is safe to finalize them.
Differential Revision: https://reviews.llvm.org/D80867
This patch attempts to more accurately model the reduction of
power of 2 vectors of types we natively support. This takes into
account the narrowing of vectors that occur as we go from 512
bits to 256 bits, to 128 bits. It also takes into account the use
of wider elements in the shuffles for the first 2 steps of a
reduction from 128 bits. And uses a v8i16 shift for the final step
of vXi8 reduction.
The default implementation uses the legalized type for the arithmetic
for all levels. And uses the single source permute cost of the
legalized type for all levels. This penalizes things like
lack of v16i8 pshufb on pre-sse3 targets and the splitting and
joining that needs to be done for integer types on AVX1. We never
need v16i8 shuffle for a reduction and we only need split AVX1 ops
when type the type wide and needs to be split. I think we're still
over costing splits and joins for AVX1, but we're closer now.
I've also removed all pairwise special casing because I don't
think we ever want to generate that on X86. I've also adjusted
the add handling to more accurately account for any type splitting
that occurs before we reach a legal type.
Differential Revision: https://reviews.llvm.org/D76478
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.
Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel
Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy
Differential Revision: https://reviews.llvm.org/D29641
llvm-svn: 373166
Summary:
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.
Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel
Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy
Differential Revision: https://reviews.llvm.org/D29641
llvm-svn: 372626
As it's causing some bot failures (and per request from kbarton).
This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.
llvm-svn: 358546
We were overcounting the number of arithmetic operations needed at each level before we reach a legal type. We were using the full vector type for that level, but we are going to split the input vector at that level in half. So the effective arithmetic operation cost at that level is half the width.
So for example on 8i32 on an sse target. Were were calculating the cost of an 8i32 op which is likely 2 for basic integer. Then after the loop we count 2 more v4i32 ops. For a total arith cost of 4. But if you look at the assembly there would only be 3 arithmetic ops.
There are still more bugs in this code that I'm going to work on next. The non pairwise code shouldn't count extract subvectors in the loop. There are no extracts, the types are split in registers. For pairwise we need to use 2 two src permute shuffles.
Differential Revision: https://reviews.llvm.org/D55397
llvm-svn: 348621
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.
For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
Fixes PR37731
Rebased and reapplied after being reverted in rL347541 due to PR39774 - which was fixed by D54955/rL347759 and D55017/rL347997
Differential Revision: https://reviews.llvm.org/D54585
llvm-svn: 348076
This reverts commit r346970.
It was causing PR39774, a crash in slp-vectorizer on a rather simple loop
with just a bunch of 'and's in the body.
llvm-svn: 347541
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.
For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
Fixes PR37731
Differential Revision: https://reviews.llvm.org/D54585
llvm-svn: 346970
Expand arithmetic reduction to include mul/and/or/xor instructions.
This patch just fixes the SLPVectorizer - the effective reduction costs for AVX1+ are still poor (see rL344846) and will need to be improved before SLP sees this as a valid transform - but we can already see the effect on SSE2 tests.
This partially helps PR37731, but doesn't fix it all as it still falls over on the extraction/reduction order for some reason.
Differential Revision: https://reviews.llvm.org/D53473
llvm-svn: 345037
We miss arithmetic reduction for everything but Add/FAdd (I assume because that's the only cases which x86 has horizontal ops for.....)
llvm-svn: 344849
Currently when cost of scalar operations is evaluated the vector type is
used for scalar operations. Patch fixes this issue and fixes evaluation
of the vector operations cost.
Several test showed that vector cost model is too optimistic. It
allowed vectorization of 8 or less add/fadd operations, though scalar
code is faster. Actually, only for 16 or more operations vector code
provides better performance.
Differential Revision: https://reviews.llvm.org/D26277
llvm-svn: 288398
unrolling.
The next code is not vectorized by the SLPVectorizer:
```
int test(unsigned int *p) {
int sum = 0;
for (int i = 0; i < 8; i++)
sum += p[i];
return sum;
}
```
During optimization this loop is fully unrolled and SLPVectorizer is
unable to vectorize it. Patch tries to fix this problem.
Differential Revision: https://reviews.llvm.org/D24796
llvm-svn: 283535