Commit Graph

62 Commits

Author SHA1 Message Date
Artem Belevich c36c0fabd1 [VectorCombine] Avoid crossing address space boundaries.
We cannot bitcast pointers across different address spaces, and VectorCombine
should be careful when it attempts to find the original source of the loaded
data.
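
A hand-written IR sketch of the kind of case that must be rejected (names
are hypothetical, not taken from the review):

  define <4 x float> @load_insert_crossing_as(float addrspace(1)* %p) {
    %g = addrspacecast float addrspace(1)* %p to float*
    %s = load float, float* %g, align 16
    %r = insertelement <4 x float> undef, float %s, i32 0
    ret <4 x float> %r
  }
  ; Widening this load must not end up bitcasting the original addrspace(1)
  ; pointer to a plain <4 x float>*; that cast would cross address spaces
  ; and is invalid IR.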

Differential Revision: https://reviews.llvm.org/D89577
2020-10-16 13:19:31 -07:00
Sanjay Patel 48a23bccf3 [VectorCombine] limit load+insert transform to one-use
As discussed in:
https://llvm.org/PR47558
...there are several potential fixes/follow-ups visible
in the test case, but this is the quickest and safest
fix of the perf regression.
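
A hand-written sketch (hypothetical names) of the multi-use shape that is
now left alone; with another use of the loaded scalar, widening the load
does not clearly pay for itself:

  define <4 x float> @load_insert_multiuse(float* %p, float* %q) {
    %s = load float, float* %p, align 16
    store float %s, float* %q                 ; extra use keeps the load scalar
    %r = insertelement <4 x float> undef, float %s, i32 0
    ret <4 x float> %r
  }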
2020-09-17 14:29:15 -04:00
Sanjay Patel e06914b59b [VectorCombine] add test for multi-use load (PR47558); NFC 2020-09-17 13:50:37 -04:00
Fangrui Song 4452cc4086 [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan
Similar to the tsan suppression in
`Utils/VNCoercion.cpp:getLoadLoadClobberFullWidthSize` (rL175034; load widening used by GVN),
the D81766 optimization should be suppressed under tsan due to potential
spurious data race reports:

  struct A {
    int i;
    const short s; // the load cannot be vectorized because
    int modify;    // it overlaps with bytes being concurrently modified
    long pad1, pad2;
  };
  // __tsan_read16 does not know that some bytes are undef and accessing is safe

Similarly, under asan, users can mark memory regions with
`__asan_poison_memory_region`. A widened load can lead to a spurious
use-after-poison error. hwasan/memtag should be similarly suppressed.

`mustSuppressSpeculation` suppresses asan/hwasan/tsan but not memtag, so
we need to exclude memtag in `vectorizeLoadInsert`.

Note, memtag suppression can be relaxed if the load is aligned to its
granule (usually 16 bytes), but that is out of scope for this patch.

Reviewed By: spatel, vitalybuka

Differential Revision: https://reviews.llvm.org/D87538
2020-09-15 09:47:21 -07:00
Huihui Zhang b4f04d7135 [VectorCombine][SVE] Do not fold bitcast shuffle for scalable type.
First, the shuffle cost for a scalable type is not known;
second, we cannot reason about whether the narrowed shuffle mask for a
scalable type is a splat.

E.g., bitcasting a splat vector from type <vscale x 4 x i32> to <vscale x 8 x i16>
would involve narrowing the <vscale x 4 x i32> zeroinitializer shuffle mask to a
<vscale x 8 x i32> mask with the element sequence <0, 1, 0, 1, ...>, which we
cannot prove is a valid splat.
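
A hand-written sketch (hypothetical names) of an input that is now skipped:

  define <vscale x 8 x i16> @bitcast_scalable_splat(<vscale x 4 x i32> %v) {
    %splat = shufflevector <vscale x 4 x i32> %v, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer
    %cast = bitcast <vscale x 4 x i32> %splat to <vscale x 8 x i16>
    ret <vscale x 8 x i16> %cast
  }
  ; Folding would require a <vscale x 8 x i32> mask with the sequence
  ; <0, 1, 0, 1, ...>, which cannot be written or proven to be a splat.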

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D86995
2020-09-02 15:02:16 -07:00
Sanjay Patel 8fb055932c [VectorCombine] allow vector loads with mismatched insert type
This is an enhancement to D81766 to allow loading the minimum target
vector type into an IR vector with a different number of elements.

In one of the motivating tests from PR16739, SLP creates <2 x float>
load ops mixed with <4 x float> insert ops, so we want to handle that
pattern in addition to potential oversized vectors created by the
vectorizers.
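
One shape this now covers, as a hand-written sketch (hypothetical names;
assuming the wider load is known to be safe to speculate):

  define <2 x float> @load_f32_insert_v2f32(float* %p) {
    %s = load float, float* %p, align 16
    %r = insertelement <2 x float> undef, float %s, i32 0
    ret <2 x float> %r
  }
  ; --> (sketch) load the minimum target vector type, then shuffle to the IR type:
  ;   %bc = bitcast float* %p to <4 x float>*
  ;   %ld = load <4 x float>, <4 x float>* %bc, align 16
  ;   %r  = shufflevector <4 x float> %ld, <4 x float> undef, <2 x i32> <i32 0, i32 1>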

For now, we are assuming the insert/extract subvector with undef is
free because there is no exact corresponding TTI modeling for that.

Differential Revision: https://reviews.llvm.org/D86160
2020-09-02 08:11:36 -04:00
Sanjay Patel 9cea682faa [VectorCombine] adjust test for better coverage; NFC
A >2x insert might crash if we do not generate the shuffle mask carefully.

D86160
2020-08-26 16:52:48 -04:00
Sanjay Patel 0b98a59fed [VectorCombine] add tests for vector loads; NFC 2020-08-18 16:23:33 -04:00
Bjorn Pettersson 11446b02c7 [VectorCombine] Fix for non-zero addrspace when creating vector load from scalar load
This is a fixup to commit 43bdac2906, to make sure the
address space from the original load pointer is retained in the
vector pointer.

Resolves a problem with
  Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
due to an address space mismatch.
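
A hand-written sketch (hypothetical names) of the shape that triggered the
assertion; the created vector pointer has to stay in the original address
space:

  define <4 x float> @load_f32_insert_v4f32_as5(float addrspace(5)* %p) {
    %s = load float, float addrspace(5)* %p, align 16
    %r = insertelement <4 x float> undef, float %s, i32 0
    ret <4 x float> %r
  }
  ; --> (sketch) cast to <4 x float> addrspace(5)*, not to a plain <4 x float>*:
  ;   %bc = bitcast float addrspace(5)* %p to <4 x float> addrspace(5)*
  ;   %ld = load <4 x float>, <4 x float> addrspace(5)* %bc, align 16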

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D85912
2020-08-13 18:25:32 +02:00
Sanjay Patel cc892fd9f4 [VectorCombine] early exit if target has no vector registers
Based on post-commit discussion in:
D81766

Other vectorization passes (SLP and Loop) use this TTI API similarly.
2020-08-12 09:22:31 -04:00
Sanjay Patel 89a7f64afc [VectorCombine] add test for x86 target with SSE disabled; NFC 2020-08-12 09:22:31 -04:00
Sanjay Patel b97e402ca5 [VectorCombine] add test for Hexagon that would crash; NFC
This test verifies the code change from:
rGb0b95dab1ce2
(although that would not be true if PR47128 is fixed)
2020-08-12 08:38:20 -04:00
Sanjay Patel 43bdac2906 [VectorCombine] try to create vector loads from scalar loads
This patch was adjusted to match the most basic pattern that starts with an insertelement
(so there's no extract created here). Hopefully, that removes any concern about
interfering with other passes. I.e., the transform should almost always be profitable.

We could make an argument that this could be part of canonicalization, but we
conservatively try not to create vector ops from scalar ops in passes like instcombine.

If the transform is not profitable, the backend should be able to re-scalarize the load.
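
The basic shape of the transform, as a hand-written sketch (names are
hypothetical; the real code also checks speculation safety and cost):

  define <4 x float> @load_f32_insert_v4f32(float* %p) {
    %s = load float, float* %p, align 16
    %r = insertelement <4 x float> undef, float %s, i32 0
    ret <4 x float> %r
  }
  ; --> (sketch)
  ;   %bc = bitcast float* %p to <4 x float>*
  ;   %r  = load <4 x float>, <4 x float>* %bc, align 16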

Differential Revision: https://reviews.llvm.org/D81766
2020-08-09 09:05:06 -04:00
Sanjay Patel c9bcc237a2 [VectorCombine] add tests for load+insert; NFC 2020-08-06 15:45:02 -04:00
Sanjay Patel d620a6fe98 [VectorCombine] add tests for non-zero gep offsets; NFC 2020-08-01 10:18:37 -04:00
Sanjay Patel cfe40acd16 [VectorCombine] add tests for load vectorization; NFC 2020-07-23 11:24:04 -04:00
Sanjay Patel b6315aee5b [VectorCombine] try to form vector compare and binop to eliminate scalar ops
binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
-->
vcmp = cmp Pred X, VecC
ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0
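
For example (a hand-written instance of the pattern, not from the tests):

  define i1 @cmp_and_cmp(<4 x i32> %x) {
    %e0 = extractelement <4 x i32> %x, i32 0
    %e1 = extractelement <4 x i32> %x, i32 1
    %c0 = icmp sgt i32 %e0, 42
    %c1 = icmp sgt i32 %e1, -8
    %r = and i1 %c0, %c1
    ret i1 %r
  }
  ; --> (sketch)
  ;   %vcmp = icmp sgt <4 x i32> %x, <i32 42, i32 -8, i32 undef, i32 undef>
  ;   %sh   = shufflevector <4 x i1> %vcmp, <4 x i1> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  ;   %vand = and <4 x i1> %vcmp, %sh
  ;   %r    = extractelement <4 x i1> %vand, i32 0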

This is a larger pattern than the existing extractelement folds because we can't
reasonably vectorize the sub-patterns with constants based on cost model calcs
(it doesn't usually make sense to replace a single extracted scalar op that has
a constant operand with a vector op).

I salvaged as much of the existing logic as I could, but there might be better
ways to share and reduce code.

The motivating case from PR43745:
https://bugs.llvm.org/show_bug.cgi?id=43745
...is the special case of a 2-way reduction. We tried to get SLP to handle that
particular pattern in D59710, but that caused crashing and regressions.
This patch is more general, but hopefully safer.

The v2f64 test with SSE2 surprised me - the cost model accounting looks like this:
OldCost = 0 (free extract of f64 at index 0) + 1 (extract of f64 at index 1) + 2 (scalar fcmps) + 1 (and of bools) = 4
NewCost = 2 (vector fcmp) + 1 (shuffle) + 1 (vector 'and') + 1 (extract of bool) = 5

Differential Revision: https://reviews.llvm.org/D82474
2020-06-29 10:38:52 -04:00
Sanjay Patel 931411136a [VectorCombine] add test for scalable vectors; NFC 2020-06-28 12:44:44 -04:00
Sanjay Patel 2f3549f813 Revert "[VectorCombine] add test for scalable vectors; NFC"
This reverts commit 700ec6b848.
An extra test diff snuck in here.
2020-06-28 12:43:11 -04:00
Sanjay Patel 700ec6b848 [VectorCombine] add test for scalable vectors; NFC 2020-06-28 12:42:00 -04:00
Sanjay Patel 9e8afee47b [VectorCombine] add tests for extract + cmp + binop; NFC 2020-06-24 11:10:36 -04:00
Sanjay Patel 98c2f4eea5 [VectorCombine] add helper to replace uses and rename
The tests are regenerated to show a path that missed renaming,
but there should be no functional difference from this patch.
2020-06-22 09:58:49 -04:00
Sanjay Patel de65b356dc [VectorCombine] add/use pass-level IRBuilder
This saves creating/destroying a builder every time we
perform some transform.

The tests show instruction ordering diffs resulting from
always inserting at the root instruction now, but those
should be benign.
2020-06-22 09:01:29 -04:00
Sanjay Patel cce625f73d [VectorCombine] improve IR debugging by providing/salvaging value names
The tests are regenerated to show the diffs, but there should be no
functional change from this patch.
2020-06-22 08:35:47 -04:00
Sanjay Patel 741e20f3d6 [VectorCombine] fix assert for type of compare operand
As shown in the post-commit comment for D81661, we need to
loosen the type assertion to allow scalarization of a compare
for vectors of pointers.
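
A hand-written sketch (hypothetical names) of the newly-allowed shape:

  define <2 x i1> @cmp_insert_ptr(<2 x i8*> %v, i8* %p) {
    %i = insertelement <2 x i8*> %v, i8* %p, i32 0
    %c = icmp eq <2 x i8*> %i, zeroinitializer
    ret <2 x i1> %c
  }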
2020-06-20 15:20:17 -04:00
Sanjay Patel 6d864097a2 [VectorCombine] fix crash while transforming constants
This is a variation of the proposal in D82049 with an extra test.
2020-06-19 12:30:32 -04:00
Sanjay Patel ed67f5e7ab [VectorCombine] scalarize compares with insertelement operand(s)
Generalize scalarization (recently enhanced with D80885)
to allow compares as well as binops.
Similar to binops, we avoid scalarizing a loaded value because
keeping the vector op could avoid a register transfer in codegen.
This requires one extra predicate that I am aware of: we do not
want to scalarize the condition value of a vector select, since
that might invert a transform we do in instcombine that prefers
a vector condition operand for a vector select.
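
The shape of the transform, as a hand-written sketch (hypothetical names):

  define <4 x i1> @cmp_inselt(<4 x i32> %v, i32 %x) {
    %i = insertelement <4 x i32> %v, i32 %x, i32 0
    %c = icmp sgt <4 x i32> %i, <i32 1, i32 2, i32 3, i32 4>
    ret <4 x i1> %c
  }
  ; --> (sketch) compare the inserted scalar by itself, then insert the i1 result:
  ;   %sc = icmp sgt i32 %x, 1
  ;   %vc = icmp sgt <4 x i32> %v, <i32 1, i32 2, i32 3, i32 4>
  ;   %c  = insertelement <4 x i1> %vc, i1 %sc, i32 0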

I think this is the final step in solving PR37463:
https://bugs.llvm.org/show_bug.cgi?id=37463

Differential Revision: https://reviews.llvm.org/D81661
2020-06-16 13:48:10 -04:00
Sanjay Patel d386297c67 [VectorCombine] add tests for compare scalarization; NFC 2020-06-11 12:29:00 -04:00
Simon Pilgrim 5dc4e7c2b9 [VectorCombine] scalarizeBinop - support an all-constant src vector operand
scalarizeBinop currently folds

  vec_bo((inselt VecC0, V0, Index), (inselt VecC1, V1, Index))
  ->
  inselt(vec_bo(VecC0, VecC1), scl_bo(V0,V1), Index)

This patch extends this to account for cases where one of the vec_bo operands is already all-constant and performs similar cost checks to determine if the scalar binop with a constant still makes sense:

  vec_bo((inselt VecC0, V0, Index), VecC1)
  ->
  inselt(vec_bo(VecC0, VecC1), scl_bo(V0,extractelt(VecC1,Index)), Index)

Fixes PR42174
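
A hand-written instance of the extended pattern (hypothetical names):

  define <2 x i64> @shl_inselt_op0(i64 %x) {
    %i = insertelement <2 x i64> <i64 1, i64 8>, i64 %x, i32 1
    %s = shl <2 x i64> %i, <i64 2, i64 3>
    ret <2 x i64> %s
  }
  ; --> (sketch) with VecC0 = <1, 8>, VecC1 = <2, 3>, Index = 1:
  ;   %scl = shl i64 %x, 3
  ;   %s   = insertelement <2 x i64> <i64 4, i64 64>, i64 %scl, i32 1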

Differential Revision: https://reviews.llvm.org/D80885
2020-06-09 19:02:05 +01:00
Simon Pilgrim c2e27ac1ce [VectorCombine] Add multi-use shl test for D80885 2020-06-03 19:42:15 +01:00
Simon Pilgrim 9f8ea2e6cf [VectorCombine] Add multi-use multiply test for D80885 2020-06-03 18:54:03 +01:00
Simon Pilgrim 6ce6960b92 [VectorCombine][X86] Add loaded insert tests from D80885 2020-06-02 10:04:05 +01:00
Sanjay Patel e31f2a894a [VectorCombine] add tests for scalarizing binop-with-constant; NFC
Goes with proposal in D80885.

This is adapted from the InstCombine tests that were added for
D50992.

But these should be adjusted further to provide more interesting
scenarios for x86-specific codegen. Eg, vector types/sizes will
have different costs depending on ISA attributes.

We also need to add tests that include a load of the scalar
variable and add tests that include extra uses of the insert
to further exercise the cost model.
2020-05-31 09:11:30 -04:00
Sanjay Patel 81e9ede3a2 [VectorCombine] forward walk through instructions to improve chaining of transforms
This is split off from D79799, where I was proposing to fully iterate
over a function until there are no more transforms. I suspect we are
still going to want to do something like that eventually.

But we can achieve the same gains much more efficiently on the current
set of regression tests just by reversing the order that we visit the
instructions.

This may also reduce the motivation for D79078, but we are still not
getting the optimal pattern for a reduction.
2020-05-16 13:08:01 -04:00
Sanjay Patel 6211830fba [VectorCombine] add reduction-like patterns; NFC
These are based on tests originally included in:
D79078
2020-05-16 12:45:01 -04:00
Sanjay Patel 93bd696347 [VectorCombine] add test to check for iterative improvements; NFC 2020-05-12 12:49:25 -04:00
Sanjay Patel 5f730b645d [VectorCombine] account for extra uses in scalarization cost
Follow-up to D79452.
Mimics the extra use cost formula for the inverse transform with extracts.
2020-05-11 15:20:57 -04:00
Sanjay Patel 7c480c4385 [VectorCombine] add tests for possible scalarization with extra uses; NFC 2020-05-11 15:04:31 -04:00
Sanjay Patel 0d2a0b44c8 [VectorCombine] scalarize binop of inserted elements into vector constants
As with the extractelement patterns that are currently in vector-combine,
there are going to be several possible variations on this theme. This
should be the clearest, simplest example.

Scalarization is the right direction for target-independent canonicalization,
and InstCombine has some of those folds already, but it doesn't do this.
I proposed a similar transform in D50992. Here in vector-combine, we can
check the cost model to be sure it's profitable, so there should be less risk.
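
The simplest shape, as a hand-written sketch (hypothetical names):

  define <2 x i64> @add_inserts(i64 %x, i64 %y) {
    %i0 = insertelement <2 x i64> <i64 1, i64 2>, i64 %x, i32 0
    %i1 = insertelement <2 x i64> <i64 3, i64 4>, i64 %y, i32 0
    %b = add <2 x i64> %i0, %i1
    ret <2 x i64> %b
  }
  ; --> (sketch) do the scalar op and the constant-vector op separately:
  ;   %s = add i64 %x, %y
  ;   %b = insertelement <2 x i64> <i64 4, i64 6>, i64 %s, i32 0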

Differential Revision: https://reviews.llvm.org/D79452
2020-05-08 16:31:12 -04:00
Sanjay Patel 5b48f7d2fc [VectorCombine] adjust test to make intent clearer; NFC
Create a non-zero result to show that the other lane is computed correctly.
2020-05-07 16:21:17 -04:00
Sanjay Patel 5d0f2fdfa5 [VectorCombine] add tests with undefs; NFC
Goes with D79452.
2020-05-07 15:28:26 -04:00
Sanjay Patel 666c61db79 [VectorCombine] add tests for insert into arbitrary constant; NFC
Goes with D79452.
2020-05-07 10:27:25 -04:00
Sanjay Patel e3eb297deb [VectorCombine] add tests for possible scalarization; NFC 2020-05-06 09:58:27 -04:00
Sanjay Patel bef6e67e95 [VectorCombine] transform bitcasted shuffle to wider elements
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'

This is the widen shuffle elements enhancement to D76727.
It builds on the analysis and simplifications in
D77881 and rG6a7e958a423e.

The phase ordering tests show that we can simplify inverse
shuffles across a binop in both directions (widen/narrow or
narrow/widen) now.

There's another potential transform visible in some of the
remaining TODOs - move a bitcasted operand of a shuffle
after the shuffle.
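
A hand-written sketch of the widening direction (hypothetical names):

  define <2 x i64> @bitcast_shuf_widen(<4 x i32> %v) {
    %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
    %cast = bitcast <4 x i32> %shuf to <2 x i64>
    ret <2 x i64> %cast
  }
  ; --> (sketch) bitcast first, then shuffle with the widened mask:
  ;   %cast = bitcast <4 x i32> %v to <2 x i64>
  ;   %shuf = shufflevector <2 x i64> %cast, <2 x i64> undef, <2 x i32> <i32 1, i32 0>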

Differential Revision: https://reviews.llvm.org/D78371
2020-04-19 08:24:38 -04:00
Sanjay Patel ce97ce3a5d [VectorCombine] try to form a better extractelement
Extracting to the same index that we are going to insert back into
allows forming select ("blend") shuffles and enables further transforms.

Admittedly, this is a quick-fix for a more general problem that I'm
hoping to solve by adding transforms for patterns that start with an
insertelement.

But this might resolve some regressions known to be caused by the
extract-extract transform (although I have not gotten more details on
those yet).

In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724

The combination of subsequent instcombine and codegen transforms gets us this improvement:

  vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
  vhaddps	%xmm1, %xmm1, %xmm4
  vmovshdup	%xmm1, %xmm3    ## xmm3 = xmm1[1,1,3,3]
  vaddps	%xmm0, %xmm2, %xmm0
  vaddps	%xmm1, %xmm3, %xmm1
  vshufps	$200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
  vinsertps	$177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]

  -->

  vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
  vhaddps	%xmm1, %xmm1, %xmm1
  vaddps	%xmm0, %xmm2, %xmm0
  vshufps	$200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]

Differential Revision: https://reviews.llvm.org/D76623
2020-04-03 13:55:13 -04:00
Sanjay Patel b6050ca181 [VectorCombine] transform bitcasted shuffle to narrower elements
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'

We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.

I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.

In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
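
A hand-written sketch of the narrowing direction (hypothetical names):

  define <4 x i32> @bitcast_shuf_narrow(<2 x i64> %v) {
    %shuf = shufflevector <2 x i64> %v, <2 x i64> undef, <2 x i32> <i32 1, i32 0>
    %cast = bitcast <2 x i64> %shuf to <4 x i32>
    ret <4 x i32> %cast
  }
  ; --> (sketch) bitcast first, then shuffle with the narrowed mask:
  ;   %cast = bitcast <2 x i64> %v to <4 x i32>
  ;   %shuf = shufflevector <4 x i32> %cast, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>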

Differential Revision: https://reviews.llvm.org/D76727
2020-04-02 13:30:22 -04:00
Sanjay Patel f631b9dc36 [VectorCombine] add shuffle tests; NFC
Goes with D76727.
2020-03-25 10:35:03 -04:00
Sanjay Patel c84446f4e9 [VectorCombine] add tests for bitcast (shuffle); NFC 2020-03-24 15:18:32 -04:00
Sanjay Patel 5eeea337be [VectorCombine] add more tests for extract-extract patterns; NFC 2020-03-23 09:33:56 -04:00
Sanjay Patel a69158c12a [VectorCombine] fold extract-extract-op with different extraction indexes
opcode (extelt V0, Ext0), (extelt V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1

The first part of this patch generalizes the cost calculation to accept
different extraction indexes. The second part creates a shuffle+extract
before feeding into the existing code to create a vector op+extract.

The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc"
rather than "TargetTransformInfo::SK_Broadcast" (splat specifically
from element 0) because we do not have a more general "SK_Splat"
currently. That does not affect any of the current regression tests,
but we might be able to find some cost model target specialization where
that comes into play.

I suspect that we can expose some missing x86 horizontal op codegen with
this transform, so I'm speculatively adding a debug flag to disable the
binop variant of this transform to allow easier testing.

The test changes show that we're sensitive to cost model diffs (as we
should be), so that means that patches like D74976
should have better coverage.
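
A hand-written instance of the fold with different indexes (hypothetical
names):

  define i32 @ext0_ext1_add(<4 x i32> %x, <4 x i32> %y) {
    %e0 = extractelement <4 x i32> %x, i32 0
    %e1 = extractelement <4 x i32> %y, i32 1
    %r = add i32 %e0, %e1
    ret i32 %r
  }
  ; --> (sketch) splat element 0 of %x, add as vectors, extract once at index 1:
  ;   %sh = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> zeroinitializer
  ;   %va = add <4 x i32> %sh, %y
  ;   %r  = extractelement <4 x i32> %va, i32 1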

Differential Revision: https://reviews.llvm.org/D75689
2020-03-08 09:57:55 -04:00