binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
-->
vcmp = cmp Pred X, VecC
ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0
This is a larger pattern than the existing extractelement folds because we can't
reasonably vectorize the sub-patterns with constants based on cost model calcs
(it doesn't usually make sense to replace a single extracted scalar op with
constant operand with a vector op).
I salvaged as much of the existing logic as I could, but there might be better
ways to share and reduce code.
The motivating case from PR43745:
https://bugs.llvm.org/show_bug.cgi?id=43745
...is the special case of a 2-way reduction. We tried to get SLP to handle that
particular pattern in D59710, but that caused crashing and regressions.
This patch is more general, but hopefully safer.
The v2f64 test with SSE2 surprised me - the cost model accounting looks like this:
OldCost = 0 (free extract of f64 at index 0) + 1 (extract of f64 at index 1) + 2 (scalar fcmps) + 1 (and of bools) = 4
NewCost = 2 (vector fcmp) + 1 (shuffle) + 1 (vector 'and') + 1 (extract of bool) = 5
Differential Revision: https://reviews.llvm.org/D82474
This saves creating/destroying a builder every time we
perform some transform.
The tests show instruction ordering diffs resulting from
always inserting at the root instruction now, but those
should be benign.
Generalize scalarization (recently enhanced with D80885)
to allow compares as well as binops.
Similar to binops, we are avoiding scalarization of a loaded
value because that could avoid a register transfer in codegen.
This requires 1 extra predicate that I am aware of: we do not
want to scalarize the condition value of a vector select. That
might also invert a transform that we do in instcombine that
prefers a vector condition operand for a vector select.
I think this is the final step in solving PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463
Differential Revision: https://reviews.llvm.org/D81661
scalarizeBinop currently folds
vec_bo((inselt VecC0, V0, Index), (inselt VecC1, V1, Index))
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,V1), Index)
This patch extends this to account for cases where one of the vec_bo operands is already all-constant and performs similar cost checks to determine if the scalar binop with a constant still makes sense:
vec_bo((inselt VecC0, V0, Index), VecC1)
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,extractelt(V1,Index)), Index)
Fixes PR42174
Differential Revision: https://reviews.llvm.org/D80885
Goes with proposal in D80885.
This is adapted from the InstCombine tests that were added for
D50992
But these should be adjusted further to provide more interesting
scenarios for x86-specific codegen. Eg, vector types/sizes will
have different costs depending on ISA attributes.
We also need to add tests that include a load of the scalar
variable and add tests that include extra uses of the insert
to further exercise the cost model.
This is split off from D79799 - where I was proposing to fully iterate
over a function until there are no more transforms. I suspect we are
still going to want to do something like that eventually.
But we can achieve the same gains much more efficiently on the current
set of regression tests just by reversing the order that we visit the
instructions.
This may also reduce the motivation for D79078, but we are still not
getting the optimal pattern for a reduction.
As with the extractelement patterns that are currently in vector-combine,
there are going to be several possible variations on this theme. This
should be the clearest, simplest example.
Scalarization is the right direction for target-independent canonicalization,
and InstCombine has some of those folds already, but it doesn't do this.
I proposed a similar transform in D50992. Here in vector-combine, we can
check the cost model to be sure it's profitable, so there should be less risk.
Differential Revision: https://reviews.llvm.org/D79452
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
This is the widen shuffle elements enhancement to D76727.
It builds on the analysis and simplifications in
D77881 and rG6a7e958a423e.
The phase ordering tests show that we can simplify inverse
shuffles across a binop in both directions (widen/narrow or
narrow/widen) now.
There's another potential transform visible in some of the
remaining TODOs - move a bitcasted operand of a shuffle
after the shuffle.
Differential Revision: https://reviews.llvm.org/D78371
Extracting to the same index that we are going to insert back into
allows forming select ("blend") shuffles and enables further transforms.
Admittedly, this is a quick-fix for a more general problem that I'm
hoping to solve by adding transforms for patterns that start with an
insertelement.
But this might resolve some regressions known to be caused by the
extract-extract transform (although I have not gotten more details on
those yet).
In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724
The combination of subsequent instcombine and codegen transforms gets us this improvement:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm4
vmovshdup %xmm1, %xmm3 ## xmm3 = xmm1[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vaddps %xmm1, %xmm3, %xmm1
vshufps $200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
-->
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm1
vaddps %xmm0, %xmm2, %xmm0
vshufps $200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]
Differential Revision: https://reviews.llvm.org/D76623
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.
I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.
In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
Differential Revision: https://reviews.llvm.org/D76727
opcode (extelt V0, Ext0), (ext V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1
The first part of this patch generalizes the cost calculation to accept
different extraction indexes. The second part creates a shuffle+extract
before feeding into the existing code to create a vector op+extract.
The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc"
rather than "TargetTransformInfo::SK_Broadcast" (splat specifically
from element 0) because we do not have a more general "SK_Splat"
currently. That does not affect any of the current regression tests,
but we might be able to find some cost model target specialization where
that comes into play.
I suspect that we can expose some missing x86 horizontal op codegen with
this transform, so I'm speculatively adding a debug flag to disable the
binop variant of this transform to allow easier testing.
The test changes show that we're sensitive to cost model diffs (as we
should be), so that means that patches like D74976
should have better coverage.
Differential Revision: https://reviews.llvm.org/D75689
Code duplication (subsequently removed by refactoring) allowed
a logic discrepancy to creep in here.
We were being conservative about creating a vector binop -- but
not a vector cmp -- in the case where a vector op has the same
estimated cost as the scalar op. We want to be more aggressive
here because that can allow other combines based on reduced
instruction count/uses.
We can reverse the transform in DAGCombiner (potentially with a
more accurate cost model) if this causes regressions.
AFAIK, this does not conflict with InstCombine. We have a
scalarize transform there, but it relies on finding a constant
operand or a matching insertelement, so that means it eliminates
an extractelement from the sequence (so we won't have 2 extracts
by the time we get here if InstCombine succeeds).
Differential Revision: https://reviews.llvm.org/D75062
getOperationCost() is not the cost we wanted; that's not the
throughput value that the rest of the calculation uses.
We may want to switch everything in this code to use the
getInstructionThroughput() wrapper to avoid these kinds of
problems, but I'll look at that as a follow-up because that
can create other logical diffs via using optional parameters
(we'd need to speculatively create the vector instruction to
make a fair(er) comparison).
binop (extelt X, C), (extelt Y, C) --> extelt (binop X, Y), C
This is a transform that has been considered for canonicalization (instcombine)
in the past because it reduces instruction count. But as shown in the x86 tests,
it's impossible to know if it's profitable without a cost model. There are many
potential target constraints to consider.
We have implemented similar transforms in the backend (DAGCombiner and
target-specific), but I don't think we have this exact fold there either (and if
we did it in SDAG, it wouldn't work across blocks).
Note: this patch was intended to handle the more general case where the extract
indexes do not match, but it got too big, so I scaled it back to this pattern
for now.
Differential Revision: https://reviews.llvm.org/D74495
We want the extra-use tests to be consistent with the
earlier single-use tests and be as cheap as possible
in vector form to show cost model edge cases. So use
i8 and extract from element 0 since that should be
cheap for all x86 targets.
We have several bug reports that could be characterized as "reducing scalarization",
and this topic was also raised on llvm-dev recently:
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138157.html
...so I'm proposing that we deal with these patterns in a new, lightweight IR vector
pass that runs before/after other vectorization passes.
There are 4 alternate options that I can think of to deal with this kind of problem
(and we've seen various attempts at all of these), but they all have flaws:
InstCombine - can't happen without TTI, but we don't want target-specific
folds there.
SDAG - too late to assist other vectorization passes; TLI is not equipped
for these kind of cost queries; limited to a single basic block.
CGP - too late to assist other vectorization passes; would need to re-implement
basic cleanups like CSE/instcombine.
SLP - doesn't fit with existing transforms; limited to a single basic block.
This initial patch/transform is based on existing code in AggressiveInstCombine:
we walk backwards through the function looking for a pattern match. But we diverge
from that cost-independent IR canonicalization pass by using TTI to decide if the
vector alternative is profitable.
We probably have at least 10 similar bug reports/patterns (binops, constants,
inserts, cheap shuffles, etc) that would fit in this pass as follow-up enhancements.
It's possible that we could iterate on a worklist to fix-point like InstCombine does,
but it's safer to start with a most basic case and evolve from there, so I didn't
try to do anything fancy with this initial implementation.
Differential Revision: https://reviews.llvm.org/D73480