We cannot bitcast pointers across different address spaces, and VectorCombine
should be careful when it attempts to find the original source of the loaded
data.
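A minimal IR sketch of the kind of case this guards against (hypothetical names, not taken verbatim from the committed tests):

  ; The pointer feeding the load originates in a different address space, so the
  ; transform must not treat the addrspace(1) source pointer as interchangeable
  ; with the addrspace(0) pointer it actually loads through.
  define <4 x i32> @load_from_other_as(i32 addrspace(1)* align 16 dereferenceable(16) %p) {
    %asc = addrspacecast i32 addrspace(1)* %p to i32*
    %s = load i32, i32* %asc, align 16
    %r = insertelement <4 x i32> undef, i32 %s, i32 0
    ret <4 x i32> %r
  }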
Differential Revision: https://reviews.llvm.org/D89577
As discussed in:
https://llvm.org/PR47558
...there are several potential fixes/follow-ups visible
in the test case, but this is the quickest and safest
fix of the perf regression.
Similar to the tsan suppression in
`Utils/VNCoercion.cpp:getLoadLoadClobberFullWidthSize` (rL175034; load widening used by GVN),
the D81766 optimization should be suppressed under tsan due to potential
spurious data race reports:
struct A {
  int i;
  const short s; // the load cannot be vectorized because
  int modify;    // it overlaps with bytes being concurrently modified
  long pad1, pad2;
};
// __tsan_read16 does not know that some bytes are undef and accessing them is safe
Similarly, under asan, users can mark memory regions with
`__asan_poison_memory_region`. A widened load can lead to a spurious
use-after-poison error. hwasan/memtag should be similarly suppressed.
`mustSuppressSpeculation` suppresses asan/hwasan/tsan but not memtag, so
we need to exclude memtag in `vectorizeLoadInsert`.
Note, memtag suppression can be relaxed if the load is aligned to its
granule (usually 16), but that is out of scope for this patch.
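As a hedged sketch (hypothetical function, not one of the committed tests), a sanitized function should keep the original narrow load:

  ; With sanitize_thread, sanitize_address, sanitize_hwaddress, or sanitize_memtag
  ; on the function, vectorizeLoadInsert must keep the 4-byte load instead of
  ; widening it to a 16-byte vector load.
  define <4 x i32> @no_widen(i32* align 16 dereferenceable(16) %p) sanitize_thread {
    %s = load i32, i32* %p, align 16
    %i = insertelement <4 x i32> undef, i32 %s, i32 0
    ret <4 x i32> %i
  }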
Reviewed By: spatel, vitalybuka
Differential Revision: https://reviews.llvm.org/D87538
First, the shuffle cost for a scalable type is not known;
Second, we cannot determine whether the narrowed shuffle mask for a scalable type
is a splat or not.
E.g., bitcasting a splat vector from type <vscale x 4 x i32> to <vscale x 8 x i16>
would involve narrowing the shuffle mask <vscale x 4 x i32> zeroinitializer to a
<vscale x 8 x i32> mask with the element sequence <0, 1, 0, 1, ...>, which cannot
be proven to be a valid splat.
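A small IR illustration of that case (hypothetical, for exposition only):

  ; The splat shuffle itself is legal for scalable vectors (zeroinitializer mask),
  ; but the narrowed <0, 1, 0, 1, ...> mask for the bitcast result cannot be
  ; expressed or proven to be a splat, so the fold must be skipped.
  define <vscale x 8 x i16> @bitcast_scalable_splat(<vscale x 4 x i32> %v) {
    %splat = shufflevector <vscale x 4 x i32> %v, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %bc = bitcast <vscale x 4 x i32> %splat to <vscale x 8 x i16>
    ret <vscale x 8 x i16> %bc
  }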
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86995
This is an enhancement to D81766 to allow loading the minimum target
vector type into an IR vector with a different number of elements.
In one of the motivating tests from PR16739, SLP creates <2 x float>
load ops mixed with <4 x float> insert ops, so we want to handle that
pattern in addition to potential oversized vectors created by the
vectorizers.
For now, we assume the insert/extract subvector with undef is
free because there is no exact corresponding TTI modeling for it.
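A sketch of the intended handling (hypothetical example; the committed tests may differ in detail), assuming the minimum legal vector type is <4 x float>:

  define <8 x float> @load_f32_insert_v8f32(float* align 16 dereferenceable(16) %p) {
    %s = load float, float* %p, align 4
    %r = insertelement <8 x float> undef, float %s, i32 0
    ret <8 x float> %r
  }
  ; --> (roughly)
  ;   %bc = bitcast float* %p to <4 x float>*
  ;   %ld = load <4 x float>, <4 x float>* %bc, align 4
  ;   %r  = shufflevector <4 x float> %ld, <4 x float> undef,
  ;         <8 x i32> <i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>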
Differential Revision: https://reviews.llvm.org/D86160
This is a fixup to commit 43bdac2906, to make sure the
address space from the original load pointer is retained in the
vector pointer.
Resolves a problem with:
  Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
due to an address space mismatch.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D85912
This patch was adjusted to match the most basic pattern that starts with an insertelement
(so there's no extract created here). Hopefully, that removes any concern about
interfering with other passes. I.e., the transform should almost always be profitable.
We could make an argument that this could be part of canonicalization, but we
conservatively try not to create vector ops from scalar ops in passes like instcombine.
If the transform is not profitable, the backend should be able to re-scalarize the load.
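For reference, a minimal before/after sketch of the base pattern (hypothetical example; the exact output depends on the cost model):

  define <4 x float> @load_insert(float* align 16 dereferenceable(16) %p) {
    %s = load float, float* %p, align 4
    %r = insertelement <4 x float> undef, float %s, i32 0
    ret <4 x float> %r
  }
  ; --> (roughly)
  ;   %bc = bitcast float* %p to <4 x float>*
  ;   %ld = load <4 x float>, <4 x float>* %bc, align 4
  ;   %r  = shufflevector <4 x float> %ld, <4 x float> undef,
  ;         <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>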
Differential Revision: https://reviews.llvm.org/D81766
binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
-->
vcmp = cmp Pred X, VecC
ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0
This is a larger pattern than the existing extractelement folds because we can't
reasonably vectorize the sub-patterns with constants based on cost model calcs
(it doesn't usually make sense to replace a single extracted scalar op that has a
constant operand with a vector op).
I salvaged as much of the existing logic as I could, but there might be better
ways to share and reduce code.
The motivating case from PR43745:
https://bugs.llvm.org/show_bug.cgi?id=43745
...is the special case of a 2-way reduction. We tried to get SLP to handle that
particular pattern in D59710, but that caused crashing and regressions.
This patch is more general, but hopefully safer.
The v2f64 test with SSE2 surprised me - the cost model accounting looks like this:
OldCost = 0 (free extract of f64 at index 0) + 1 (extract of f64 at index 1) + 2 (scalar fcmps) + 1 (and of bools) = 4
NewCost = 2 (vector fcmp) + 1 (shuffle) + 1 (vector 'and') + 1 (extract of bool) = 5
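A hypothetical instance of the 2-way reduction shape (constants and predicate chosen only for illustration):

  define i1 @fcmp_and_v2f64(<2 x double> %x) {
    %e0 = extractelement <2 x double> %x, i32 0
    %e1 = extractelement <2 x double> %x, i32 1
    %c0 = fcmp ogt double %e0, 42.0
    %c1 = fcmp ogt double %e1, -8.0
    %r = and i1 %c0, %c1
    ret i1 %r
  }
  ; --> (roughly)
  ;   %vcmp = fcmp ogt <2 x double> %x, <double 42.0, double -8.0>
  ;   %shuf = shufflevector <2 x i1> %vcmp, <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
  ;   %vand = and <2 x i1> %vcmp, %shuf
  ;   %r    = extractelement <2 x i1> %vand, i32 0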
Differential Revision: https://reviews.llvm.org/D82474
This saves creating/destroying a builder every time we
perform a transform.
The tests show instruction ordering diffs resulting from
always inserting at the root instruction now, but those
should be benign.
Generalize scalarization (recently enhanced with D80885)
to allow compares as well as binops.
Similar to binops, we avoid scalarizing a loaded value because keeping it in a
vector can avoid a register transfer in codegen.
This requires one extra predicate that I am aware of: we do not want to scalarize
the condition value of a vector select. Doing so might also invert a transform
that we do in instcombine that prefers a vector condition operand for a vector
select.
I think this is the final step in solving PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463
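A hedged sketch of the new compare fold (hypothetical constants, not copied from the committed tests):

  define <4 x i1> @icmp_inselt_constant(i32 %x) {
    %v = insertelement <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i32 %x, i32 0
    %c = icmp sgt <4 x i32> %v, zeroinitializer
    ret <4 x i1> %c
  }
  ; --> (roughly)
  ;   %vcmp = icmp sgt <4 x i32> <i32 1, i32 2, i32 3, i32 4>, zeroinitializer
  ;   %scmp = icmp sgt i32 %x, 0
  ;   %c    = insertelement <4 x i1> %vcmp, i1 %scmp, i32 0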
Differential Revision: https://reviews.llvm.org/D81661
scalarizeBinop currently folds
vec_bo((inselt VecC0, V0, Index), (inselt VecC1, V1, Index))
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,V1), Index)
This patch extends that fold to account for cases where one of the vec_bo operands is already all-constant, and it performs similar cost checks to determine if the scalar binop with a constant still makes sense:
vec_bo((inselt VecC0, V0, Index), VecC1)
->
inselt(vec_bo(VecC0, VecC1), scl_bo(V0,extractelt(V1,Index)), Index)
Fixes PR42174
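A hypothetical instance of the extended fold, where only one operand has an insertelement:

  define <4 x i32> @add_inselt_constant(i32 %x) {
    %v0 = insertelement <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i32 %x, i32 0
    %r = add <4 x i32> %v0, <i32 10, i32 20, i32 30, i32 40>
    ret <4 x i32> %r
  }
  ; --> (roughly)
  ;   %vadd = add <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <i32 10, i32 20, i32 30, i32 40>
  ;   %sadd = add i32 %x, 10
  ;   %r    = insertelement <4 x i32> %vadd, i32 %sadd, i32 0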
Differential Revision: https://reviews.llvm.org/D80885
Goes with proposal in D80885.
This is adapted from the InstCombine tests that were added for
D50992, but these should be adjusted further to provide more interesting
scenarios for x86-specific codegen. E.g., vector types/sizes will
have different costs depending on ISA attributes.
We also need to add tests that include a load of the scalar
variable and add tests that include extra uses of the insert
to further exercise the cost model.
This is split off from D79799 - where I was proposing to fully iterate
over a function until there are no more transforms. I suspect we are
still going to want to do something like that eventually.
But we can achieve the same gains much more efficiently on the current
set of regression tests just by reversing the order that we visit the
instructions.
This may also reduce the motivation for D79078, but we are still not
getting the optimal pattern for a reduction.
As with the extractelement patterns that are currently in vector-combine,
there are going to be several possible variations on this theme. This
should be the clearest, simplest example.
Scalarization is the right direction for target-independent canonicalization,
and InstCombine has some of those folds already, but it doesn't do this.
I proposed a similar transform in D50992. Here in vector-combine, we can
check the cost model to be sure it's profitable, so there should be less risk.
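A minimal sketch of the fold (hypothetical constants), where both operands insert a scalar into a constant vector at the same index:

  define <2 x i64> @add_two_inserts(i64 %x, i64 %y) {
    %i0 = insertelement <2 x i64> <i64 1, i64 2>, i64 %x, i32 1
    %i1 = insertelement <2 x i64> <i64 3, i64 4>, i64 %y, i32 1
    %r = add <2 x i64> %i0, %i1
    ret <2 x i64> %r
  }
  ; --> (roughly)
  ;   %vadd = add <2 x i64> <i64 1, i64 2>, <i64 3, i64 4>
  ;   %sadd = add i64 %x, %y
  ;   %r    = insertelement <2 x i64> %vadd, i64 %sadd, i32 1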
Differential Revision: https://reviews.llvm.org/D79452
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
This is the widen shuffle elements enhancement to D76727.
It builds on the analysis and simplifications in
D77881 and rG6a7e958a423e.
The phase ordering tests show that we can simplify inverse
shuffles across a binop in both directions (widen/narrow or
narrow/widen) now.
There's another potential transform visible in some of the
remaining TODOs - move a bitcasted operand of a shuffle
after the shuffle.
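A hypothetical widening example (not taken from the committed tests): the i8 mask moves whole i32-sized chunks, so it can be rewritten with a 4-element mask after moving the bitcast.

  define <4 x i32> @bitcast_shuf_widen(<16 x i8> %v) {
    %shuf = shufflevector <16 x i8> %v, <16 x i8> undef,
            <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3,
                        i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11>
    %bc = bitcast <16 x i8> %shuf to <4 x i32>
    ret <4 x i32> %bc
  }
  ; --> (roughly)
  ;   %bc   = bitcast <16 x i8> %v to <4 x i32>
  ;   %shuf = shufflevector <4 x i32> %bc, <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>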
Differential Revision: https://reviews.llvm.org/D78371
Extracting to the same index that we are going to insert back into
allows forming select ("blend") shuffles and enables further transforms.
Admittedly, this is a quick-fix for a more general problem that I'm
hoping to solve by adding transforms for patterns that start with an
insertelement.
But this might resolve some regressions known to be caused by the
extract-extract transform (although I have not gotten more details on
those yet).
In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724
The combination of subsequent instcombine and codegen transforms gets us this improvement:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm4
vmovshdup %xmm1, %xmm3 ## xmm3 = xmm1[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vaddps %xmm1, %xmm3, %xmm1
vshufps $200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
-->
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm1
vaddps %xmm0, %xmm2, %xmm0
vshufps $200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]
Differential Revision: https://reviews.llvm.org/D76623
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.
I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.
In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
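A hypothetical illustration of one profitable case, where moving the bitcast before the shuffle lets the mask be rewritten with narrower elements:

  define <8 x i16> @bitcast_shuf_narrow(<4 x i32> %v) {
    %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
    %bc = bitcast <4 x i32> %shuf to <8 x i16>
    ret <8 x i16> %bc
  }
  ; --> (roughly)
  ;   %bc   = bitcast <4 x i32> %v to <8 x i16>
  ;   %shuf = shufflevector <8 x i16> %bc, <8 x i16> undef,
  ;           <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>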
Differential Revision: https://reviews.llvm.org/D76727
opcode (extelt V0, Ext0), (extelt V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1
The first part of this patch generalizes the cost calculation to accept
different extraction indexes. The second part creates a shuffle+extract
before feeding into the existing code to create a vector op+extract.
The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc"
rather than "TargetTransformInfo::SK_Broadcast" (splat specifically
from element 0) because we do not have a more general "SK_Splat"
currently. That does not affect any of the current regression tests,
but we might be able to find some cost model target specialization where
that comes into play.
I suspect that we can expose some missing x86 horizontal op codegen with
this transform, so I'm speculatively adding a debug flag to disable the
binop variant of this transform to allow easier testing.
The test changes show that we're sensitive to cost model diffs (as we
should be), which means that patches like D74976
should have better coverage.
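A hedged sketch of the new mismatched-index case (hypothetical function; the actual tests cover more variations):

  define double @ext0_ext1_fadd(<2 x double> %x, <2 x double> %y) {
    %e0 = extractelement <2 x double> %x, i32 0
    %e1 = extractelement <2 x double> %y, i32 1
    %r = fadd double %e0, %e1
    ret double %r
  }
  ; --> (roughly)
  ;   %splat = shufflevector <2 x double> %x, <2 x double> undef, <2 x i32> zeroinitializer
  ;   %vadd  = fadd <2 x double> %splat, %y
  ;   %r     = extractelement <2 x double> %vadd, i32 1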
Differential Revision: https://reviews.llvm.org/D75689