Commit Graph

758 Commits

Author SHA1 Message Date
Sanjay Patel 142cbe73e9 [SLP] improve test readability; NFC 2019-11-13 12:59:00 -05:00
Sanjay Patel 2d06375c3f [SLP] add test for miscompile with reduction (PR43948); NFC 2019-11-12 11:36:41 -05:00
Vasileios Porpodas 6a18a95487 [SLP] Look-ahead operand reordering heuristic.
Summary: This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for examples).

Reviewers: RKSimon, ABataev, dtemirbulatov, Ayal, hfinkel, rnk

Reviewed By: RKSimon, dtemirbulatov

Subscribers: xbolva00, Carrot, hiraditya, phosek, rnk, rcorcs, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60897
2019-11-11 21:06:51 -08:00
Sanjay Patel 7ff57705ba [SLP] allow forming 2-way reduction patterns
We have a vector compare reduction problem seen in PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2

Or slightly reduced here:

define i1 @cmp2(<2 x double> %a0) {
  %a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
  %b = extractelement <2 x i1> %a, i32 0
  %c = extractelement <2 x i1> %a, i32 1
  %d = and i1 %b, %c
  ret i1 %d
}

SLP would not attempt to turn this into a vector reduction because there is an
artificial lower limit on that transform. We can not completely remove that limit
without inducing regressions though, so this patch just hacks an extra attempt at
creating a 2-way reduction to the end of the analysis.

As shown in the test file, we are still not getting some of the motivating cases,
so follow-on patches will be needed to solve those cases.

Differential Revision: https://reviews.llvm.org/D59710
2019-11-07 06:08:42 -05:00
Eric Christopher e511c4b0df Temporarily Revert:
"[SLP] Generalization of stores vectorization."
 "[SLP] Fix -Wunused-variable. NFC"
 "[SLP] Vectorize jumbled stores."

As they're causing significant (10-30x) compile time regressions on
vectorizable code.

The primary cause of the compile-time regression is f228b53716.

This reverts commits:

f228b53716
5503455ccb
21d498c9c0
2019-11-06 16:06:15 -08:00
Sanjay Patel 7effd37b00 [SLP] add tests for 2-wide reductions; NFC 2019-11-05 17:18:37 -05:00
Alexey Bataev b80c41cd3c [SLP]Fix PR43799: Crash on different sizes of GEP indices.
Summary:
If the GEP instructions are going to be vectorized, the indices in those
GEP instructions must be of the same type. Otherwise, the compiler may
crash when trying to build the vector constant.

Reviewers: RKSimon, spatel

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69627
2019-11-04 10:36:26 -05:00
Sanjay Patel fc98907535 [SLP] avoid 'tmp' value name conflict with auto-generated CHECK script; NFC
The script uses 'TMP#' as its substitute for nameless values,
so if a test already contains 'tmp#' *named* values, then
there could be trouble. We should probably just fix the
script to avoid this problem going forward, but it's easy
enough to change a test too (and explicitly naming variables
'tmp' is always a sad choice).
2019-11-01 09:27:35 -04:00
Sanjay Patel 7faf33484e [SLP] avoid 'tmp' value name conflict with auto-generated CHECK script; NFC
The script uses 'TMP#' as its substitute for nameless values,
so if a test already contains 'tmp#' *named* values, then
there could be trouble. We should probably just fix the
script to avoid this problem going forward, but it's easy
enough to change a test too (and explicitly naming variables
'tmp' is always a sad choice).
2019-11-01 09:27:35 -04:00
Sanjay Patel 37628802be [SLP] avoid 'tmp' value name conflict with auto-generated CHECK script; NFC
The script uses 'TMP#' as its substitute for nameless values,
so if a test already contains 'tmp#' *named* values, then
there could be trouble. We should probably just fix the
script to avoid this problem going forward, but it's easy
enough to change a test too (and explicitly naming variables
'tmp' is always a sad choice).
2019-11-01 09:27:35 -04:00
Alexey Bataev 70ad617dd6 [SLP] Vectorize jumbled stores.
Summary:
Patch adds support for vectorization of the jumbled stores. The value
operands are vectorized and then shuffled in the right order before
store.

Reviewers: RKSimon, spatel, hfinkel, mkuper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43339
2019-10-31 16:02:25 -04:00
Haojian Wu e65ddcafee Revert "[SLP] Vectorize jumbled stores."
This reverts commit 21d498c9c0.

This commit causes some crashes on some targets.
2019-10-31 10:21:24 +01:00
Alexey Bataev 21d498c9c0 [SLP] Vectorize jumbled stores.
Summary:
Patch adds support for vectorization of the jumbled stores. The value
operands are vectorized and then shuffled in the right order before
store.

Reviewers: RKSimon, spatel, hfinkel, mkuper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43339
2019-10-30 13:33:52 -04:00
Alexey Bataev f228b53716 [SLP] Generalization of stores vectorization.
Stores are vectorized with maximum vectorization factor of 16. Patch
tries to improve the situation and use maximal vectorization factor.

Reviewers: spatel, RKSimon, mkuper, hfinkel

Differential Revision: https://reviews.llvm.org/D43582
2019-10-29 11:46:36 -04:00
Sanjay Patel 8cc6d42e8d [SLP] avoid reduction transform on patterns that the backend can load-combine (2nd try)
The 1st attempt at this modified the cost model in a bad way to avoid the vectorization,
but that caused problems for other users (the loop vectorizer) of the cost model.

I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

We decided that load combining was unsuitable for IR because it could obscure other
optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.
Therefore, preventing SLP from destroying load combine opportunities requires that it
recognizes patterns that could be combined later, but not do the optimization itself (
it's not a vector combine anyway, so it's probably out-of-scope for SLP).

Here, we add a cost-independent bailout with a conservative pattern match for a
multi-instruction sequence that can probably be reduced later.

In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining
will produce a single instruction on these tests like:

  movbe   rax, qword ptr [rdi]

or:

  mov     rax, qword ptr [rdi]

Not some (half) vector monstrosity as we currently do using SLP:

  vpmovzxbq       ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,..
  vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
  movzx   eax, byte ptr [rdi]
  movzx   ecx, byte ptr [rdi + 5]
  shl     rcx, 40
  movzx   edx, byte ptr [rdi + 6]
  shl     rdx, 48
  or      rdx, rcx
  movzx   ecx, byte ptr [rdi + 7]
  shl     rcx, 56
  or      rcx, rdx
  or      rcx, rax
  vextracti128    xmm1, ymm0, 1
  vpor    xmm0, xmm0, xmm1
  vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
  vpor    xmm0, xmm0, xmm1
  vmovq   rax, xmm0
  or      rax, rcx
  vzeroupper
  ret

Differential Revision: https://reviews.llvm.org/D67841

llvm-svn: 375025
2019-10-16 18:06:24 +00:00
Simon Pilgrim 1385b27e92 [CostModel][X86] Add CTLZ scalar costs
Add specific scalar costs for CTLZ instructions, we can't discriminate between CTLZ and CTLZ_ZERO_UNDEF so we have to assume the worst. Given how BSR is often a microcoded nightmare on some older targets we might still be underestimating it.

For targets supporting LZCNT (Intel Haswell+ or AMD Fam10+), we provide overrides that assume 1cy costs.

llvm-svn: 374786
2019-10-14 16:30:17 +00:00
Simon Pilgrim 151bbba758 [CostModel][X86] Add CTPOP scalar costs (PR43656)
Add specific scalar costs for ctpop instructions, these are based on the llvm-mca's SLM throughput numbers (the oldest model we have).

For targets supporting POPCNT, we provide overrides that assume 1cy costs.

llvm-svn: 374775
2019-10-14 14:07:43 +00:00
Simon Pilgrim 1b59a16c0b [CostModel][X86] Improve sum reduction costs.
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.

I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.

llvm-svn: 374655
2019-10-12 13:21:50 +00:00
Sanjay Patel df14bd315d [SLP] respect target register width for GEP vectorization (PR43578)
We failed to account for the target register width (max vector factor)
when vectorizing starting from GEPs. This causes vectorization to
proceed to obviously illegal widths as in:
https://bugs.llvm.org/show_bug.cgi?id=43578

For x86, this also means that SLP can produce rogue AVX or AVX512
code even when the user specifies a narrower vector width.

The AArch64 test in ext-trunc.ll appears to be better using the
narrower width. I'm not exactly sure what getelementptr.ll is trying
to do, but it's testing with "-slp-threshold=-18", so I'm not worried
about those diffs. The x86 test is an over-reduction from SPEC h264;
this patch appears to restore the perf loss caused by SLP when using
-march=haswell.

Differential Revision: https://reviews.llvm.org/D68667

llvm-svn: 374183
2019-10-09 16:32:49 +00:00
Sanjay Patel 796a58107a [SLP] add test with prefer-vector-width function attribute; NFC (PR43578)
llvm-svn: 374090
2019-10-08 17:18:32 +00:00
Sanjay Patel 5cce533525 [SLP] add test with prefer-vector-width function attribute; NFC
llvm-svn: 374039
2019-10-08 12:43:46 +00:00
Martin Storsjo dfc1aee25b Revert "[SLP] avoid reduction transform on patterns that the backend can load-combine"
This reverts SVN r373833, as it caused a failed assert "Non-zero loop
cost expected" on building numerous projects, see PR43582 for details
and reproduction samples.

llvm-svn: 373882
2019-10-07 08:21:37 +00:00
Sanjay Patel e2321bb448 [SLP] avoid reduction transform on patterns that the backend can load-combine
I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

We decided that load combining was unsuitable for IR because it could obscure other
optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.
Therefore, preventing SLP from destroying load combine opportunities requires that it
recognizes patterns that could be combined later, but not do the optimization itself (
it's not a vector combine anyway, so it's probably out-of-scope for SLP).

Here, we add a scalar cost model adjustment with a conservative pattern match and cost
summation for a multi-instruction sequence that can probably be reduced later.
This should prevent SLP from creating a vector reduction unless that sequence is
extremely cheap.

In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining
will produce a single instruction on these tests like:

  movbe   rax, qword ptr [rdi]

or:

  mov     rax, qword ptr [rdi]

Not some (half) vector monstrosity as we currently do using SLP:

  vpmovzxbq       ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,..
  vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
  movzx   eax, byte ptr [rdi]
  movzx   ecx, byte ptr [rdi + 5]
  shl     rcx, 40
  movzx   edx, byte ptr [rdi + 6]
  shl     rdx, 48
  or      rdx, rcx
  movzx   ecx, byte ptr [rdi + 7]
  shl     rcx, 56
  or      rcx, rdx
  or      rcx, rax
  vextracti128    xmm1, ymm0, 1
  vpor    xmm0, xmm0, xmm1
  vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
  vpor    xmm0, xmm0, xmm1
  vmovq   rax, xmm0
  or      rax, rcx
  vzeroupper
  ret

Differential Revision: https://reviews.llvm.org/D67841

llvm-svn: 373833
2019-10-05 18:03:58 +00:00
Sanjay Patel 3f4726b818 [SLP] add test for vectorization of different widths (PR28457); NFC
llvm-svn: 373483
2019-10-02 16:12:42 +00:00
Alexey Bataev 8b1eeafb91 [SLP] Fix for PR31847: Assertion failed: (isLoopInvariant(Operands[i], L) && "SCEVAddRecExpr operand is not loop-invariant!")
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.

Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel

Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy

Differential Revision: https://reviews.llvm.org/D29641

llvm-svn: 373166
2019-09-29 14:18:06 +00:00
Simon Pilgrim 756f5cfc2a [SLPVectorizer][X86] Regenerate arith-fp tests
llvm-svn: 373063
2019-09-27 10:04:25 +00:00
Jordan Rupprecht f98d2c099a Revert [SLP] Fix for PR31847: Assertion failed: (isLoopInvariant(Operands[i], L) && "SCEVAddRecExpr operand is not loop-invariant!")
This reverts r372626 (git commit 6a278d9073)

llvm-svn: 373019
2019-09-26 22:09:17 +00:00
Simon Pilgrim fc82c7a1b0 [SLPVectorizer][X86] Add SSE common check prefix to let us merge SSE2+SLM checks
llvm-svn: 372955
2019-09-26 10:23:57 +00:00
Simon Pilgrim d7f0207d73 [CostModel][X86] Fix SLM <2 x i64> icmp costs
SLM is 2 x slower for <2 x i64> comparison ops than other vector types, we should account for this like we do for SLM <2 x i64> add/sub/mul costs.

This should remove some of the SLM codegen diffs in D43582

llvm-svn: 372954
2019-09-26 10:14:38 +00:00
Alexey Bataev 6a278d9073 [SLP] Fix for PR31847: Assertion failed: (isLoopInvariant(Operands[i], L) && "SCEVAddRecExpr operand is not loop-invariant!")
Summary:
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.

Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel

Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy

Differential Revision: https://reviews.llvm.org/D29641

llvm-svn: 372626
2019-09-23 16:25:03 +00:00
Simon Pilgrim 665ccbff60 [Cost][X86] Add v2i64 truncation costs
We are missing costs for a lot of truncation cases, I'm hoping to address all the 'zero cost' cases in trunc.ll

I thought this was a vector widening side effect, but even before this we had some interesting LV decisions (notably over indvars) being made due to these zero costs.

llvm-svn: 372498
2019-09-22 12:04:38 +00:00
Sanjay Patel 4896f7243d [SLPVectorizer] add tests for bogus reductions; NFC
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

llvm-svn: 372393
2019-09-20 14:17:00 +00:00
Sanjay Patel b6a0faaa0c [SLP] limit vectorization of Constant subclasses (PR33958)
This is a fix for:
https://bugs.llvm.org/show_bug.cgi?id=33958

It seems universally true that we would not want to transform this kind of
sequence on any target, but if that's not correct, then we could view this
as a target-specific cost model problem. We could also white-list ConstantInt,
ConstantFP, etc. rather than blacklist Global and ConstantExpr.

Differential Revision: https://reviews.llvm.org/D67362

llvm-svn: 371931
2019-09-15 13:03:24 +00:00
Sanjay Patel 4ba6717c7e [SLP] add test for vectorization of constant expressions; NFC
Goes with D67362.

llvm-svn: 371879
2019-09-13 18:33:02 +00:00
Sanjay Patel c0728eac15 [SLP] add test for over-vectorization (PR33958); NFC
llvm-svn: 371426
2019-09-09 17:16:03 +00:00
Craig Topper 8cfff1e1bc [X86] Add prefer-128-bit subtarget feature.
Summary:
Similar to the previous prefer-256-bit flag. We might want to
enable this by default some CPUs. This just starts the initial
work to implement and prove that it effects TTI's vector width.

Reviewers: RKSimon, echristo, spatel, atdt

Reviewed By: RKSimon

Subscribers: lebedev.ri, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67311

llvm-svn: 371319
2019-09-07 19:54:22 +00:00
Craig Topper a31112e357 [X86] Replace -mcpu with -mattr on some tests.
llvm-svn: 371260
2019-09-06 21:48:44 +00:00
Sanjay Patel 0f9b5f86f1 [SLP] add test that requires shuffle of scalars; NFC
llvm-svn: 369255
2019-08-19 12:41:09 +00:00
Sanjay Patel 144903310f [SLP] add tests for PR16739; NFC
llvm-svn: 369127
2019-08-16 17:01:26 +00:00
Craig Topper 30d3e9c395 [X86][CostModel] Adjust the costs of ZERO_EXTEND/SIGN_EXTEND with less than 128-bit inputs
Now that we legalize by widening, the element types here won't change. Previously these were modeled as the elements being widened and then the instruction might become an AND or SHL/ASHR pair. But now they'll become something like a ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG.

For AVX2, when the destination type is legal its clear the cost should be 1 since we have extend instructions that can produce 256 bit vectors from less than 128 bit vectors. I'm a little less sure about AVX1 costs, but I think the ones I changed were definitely too high, but they might still be too high.

Differential Revision: https://reviews.llvm.org/D66169

llvm-svn: 368858
2019-08-14 14:52:39 +00:00
Craig Topper 8b5f2ab2a4 Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.

Original commit message:

This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 368183
2019-08-07 16:24:26 +00:00
Mitch Phillips bd0d97e1c4 Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default."
This reverts commit 3de33245d2.

This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368107
2019-08-06 23:00:43 +00:00
Craig Topper 3de33245d2 [X86] Enable -x86-experimental-vector-widening-legalization by default.
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 367901
2019-08-05 18:25:36 +00:00
Hideto Ueno cc0a4cdc89 [FunctionAttrs] Annotate "willreturn" for intrinsics
Summary:
In D62801, new function attribute `willreturn` was introduced. In short, a function with `willreturn` is guaranteed to come back to the call site(more precise definition is in LangRef).

In this patch, willreturn is annotated for LLVM intrinsics.

Reviewers: jdoerfert

Reviewed By: jdoerfert

Subscribers: jvesely, nhaehnle, sstefan1, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64904

llvm-svn: 367184
2019-07-28 06:09:56 +00:00
Michael Liao 17a8a9277c [LAA] Re-check bit-width of pointers after stripping.
Summary:
- As the pointer stripping now tracks through `addrspacecast`, prepare
  to handle the bit-width difference from the result pointer.

Reviewers: jdoerfert

Subscribers: jvesely, nhaehnle, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64928

llvm-svn: 366470
2019-07-18 17:30:27 +00:00
Eric Christopher 93dfb93ad6 Temporarily Revert "[SLP] Recommit: Look-ahead operand reordering heuristic."
As there are some reported miscompiles with AVX512 and performance regressions
in Eigen. Verified with the original committer and testcases will be forthcoming.

This reverts commit r364964.

llvm-svn: 366154
2019-07-15 23:36:02 +00:00
Sanjay Patel c1c86adb16 [SLP] add tests for bitcasted vector pointer load; NFC
I'm not sure if this falls within the scope of SLP,
but we could create vector loads for some of these
patterns.

llvm-svn: 365055
2019-07-03 16:46:14 +00:00
Vasileios Porpodas cf47ff5ffb [SLP] Recommit: Look-ahead operand reordering heuristic.
Summary: This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).

Reviewers: RKSimon, ABataev, dtemirbulatov, Ayal, hfinkel, rnk

Reviewed By: RKSimon, dtemirbulatov

Subscribers: hiraditya, phosek, rnk, rcorcs, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60897

llvm-svn: 364964
2019-07-02 20:20:28 +00:00
Jordan Rupprecht a7972dc04a Revert [SLP] Look-ahead operand reordering heuristic.
This reverts r364478 (git commit 574cb0eb3a)

The patch is causing compilation timeouts.

llvm-svn: 364846
2019-07-01 21:10:43 +00:00
Vasileios Porpodas 574cb0eb3a [SLP] Look-ahead operand reordering heuristic.
Summary: This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).

Reviewers: RKSimon, ABataev, dtemirbulatov, Ayal, hfinkel, rnk

Reviewed By: RKSimon, dtemirbulatov

Subscribers: rnk, rcorcs, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60897

llvm-svn: 364478
2019-06-26 21:25:24 +00:00
Simon Pilgrim e98f8cf78f [SLPVectorizer] Precommit of supernode.ll test for D63661
This is a pre-commit of the tests introduced by the SuperNode SLP patch D63661.

Committed on behalf of @vporpo (Vasileios Porpodas)

Differential Revision: https://reviews.llvm.org/D63664

llvm-svn: 364320
2019-06-25 14:58:20 +00:00
Cameron McInally fe3f15cf90 [SLP] Support unary FNeg vectorization
Differential Revision: https://reviews.llvm.org/D63609

llvm-svn: 364219
2019-06-24 19:24:23 +00:00
Reid Kleckner 592a193285 Revert [SLP] Look-ahead operand reordering heuristic.
This reverts r364084 (git commit 5698921be2)

It caused crashes while compiling a file in Chrome. Reduction
forthcoming.

llvm-svn: 364111
2019-06-21 23:10:25 +00:00
Simon Pilgrim 5698921be2 [SLP] Look-ahead operand reordering heuristic.
This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).

Committed on behalf of @vporpo (Vasileios Porpodas)

Differential Revision: https://reviews.llvm.org/D60897

llvm-svn: 364084
2019-06-21 17:57:01 +00:00
Cameron McInally 9589db7a98 [NFC][SLP] Pre-commit unary FNeg test to X86/propagate_ir_flags.ll
llvm-svn: 363978
2019-06-20 20:53:51 +00:00
Cameron McInally 4452c3b490 [NFC][SLP] Pre-commit unary FNeg test to X86/phi3.ll
llvm-svn: 363937
2019-06-20 15:17:17 +00:00
Simon Pilgrim 72186a2494 [SLP][X86] Add lookahead reordering tests from D60897
llvm-svn: 363925
2019-06-20 12:52:58 +00:00
Fangrui Song ac14f7b10c [lit] Delete empty lines at the end of lit.local.cfg NFC
llvm-svn: 363538
2019-06-17 09:51:07 +00:00
Sander de Smalen 51c2fa0e2a Improve reduction intrinsics by overloading result value.
This patch uses the mechanism from D62995 to strengthen the
definitions of the reduction intrinsics by letting the scalar
result/accumulator type be overloaded from the vector element type.

For example:

  ; The LLVM LangRef specifies that the scalar result must equal the
  ; vector element type, but this is not checked/enforced by LLVM.
  declare i32 @llvm.experimental.vector.reduce.or.i32.v4i32(<4 x i32> %a)

This patch changes that into:

  declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %a)

Which has the type-constraint more explicit and causes LLVM to check
the result type with the vector element type.

Reviewers: RKSimon, arsenm, rnk, greened, aemerson

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D62996

llvm-svn: 363240
2019-06-13 09:37:38 +00:00
Dinar Temirbulatov b2f45ba1e8 [SLP] Update propagate_ir_flags.ll test to check that we do retain the common subset, NFC.
llvm-svn: 363218
2019-06-13 00:19:50 +00:00
Dinar Temirbulatov 15c657d13d [SLP] Fix regression in broadcasts caused by operand reordering patch D59973.
This patch fixes a regression caused by the operand reordering refactoring patch https://reviews.llvm.org/D59973 .
The fix changes the strategy to Splat instead of Opcode, if broadcast opportunities are found.
Please see the lit test for some examples.

Committed on behalf of @vporpo (Vasileios Porpodas)
    
Differential Revision: https://reviews.llvm.org/D62427

llvm-svn: 362613
2019-06-05 15:26:28 +00:00
Simon Pilgrim e6d1a80370 [SLPVectorizer][X86] Add other tests described in PR28474
llvm-svn: 362297
2019-06-01 12:35:03 +00:00
Simon Pilgrim 2ef83571f2 [SLPVectorizer][X86] This test was from PR28474
llvm-svn: 362296
2019-06-01 12:10:29 +00:00
Simon Pilgrim 48c8bdad2a [SLPVectorizer][X86] Add broadcast test case from D62427
llvm-svn: 361805
2019-05-28 11:10:56 +00:00
Simon Pilgrim 6b10fde69b [CostModel][X86] Add min/max reduction costs for all SSE targets
The original costs stopped at SSE42, I've added conservative estimates for everything down to SSE1/SSE2 and moved some of the SSE42 costs to SSE41 (really only the addition of PCMPGT makes any difference).

I've also added missing vXi8 costs (we use PHMINPOSUW for i8/i16 for scarily quick results) and 256-bit vector costs for AVX1.

llvm-svn: 360528
2019-05-11 17:12:52 +00:00
Simon Pilgrim 83098d28a1 [SLP] Lit test that cannot get vectorized due to lack of look-ahead operand reordering heuristic.
The code in this test is not vectorized by SLP because its operand reordering cannot look beyond the immediate predecessors.
This will get fixed in a follow-up patch that introduces the look-ahead operand reordering heuristic.

Committed on behalf of @vporpo (Vasileios Porpodas)

Differential Revision: https://reviews.llvm.org/D61283

llvm-svn: 359553
2019-04-30 11:03:09 +00:00
Alexey Bataev ef3c1884ec [SLP] Fix crash after r358519, by V. Porpodas.
Summary: The code did not check if operand was undef before casting it to Instruction.

Reviewers: RKSimon, ABataev, dtemirbulatov

Reviewed By: ABataev

Subscribers: uabelho

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61024

llvm-svn: 359136
2019-04-24 20:21:32 +00:00
Eric Christopher cee313d288 Revert "Temporarily Revert "Add basic loop fusion pass.""
The reversion apparently deleted the test/Transforms directory.

Will be re-reverting again.

llvm-svn: 358552
2019-04-17 04:52:47 +00:00
Eric Christopher a863435128 Temporarily Revert "Add basic loop fusion pass."
As it's causing some bot failures (and per request from kbarton).

This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.

llvm-svn: 358546
2019-04-17 02:12:23 +00:00
Simon Pilgrim 82ffa88a04 [SLP] Refactoring of the operand reordering code.
This is a refactoring patch which should have all the functionality of the current code. Its goal is twofold:
i. Cleanup and simplify the reordering code, and
ii. Generalize reordering so that it will work for an arbitrary number of operands, not just 2.

This is the second patch in a series of patches that will enable operand reordering across chains of operations. An example of this was presented in EuroLLVM'18 https://www.youtube.com/watch?v=gIEn34LvyNo .

Committed on behalf of @vporpo (Vasileios Porpodas)

Differential Revision: https://reviews.llvm.org/D59973

llvm-svn: 358519
2019-04-16 19:27:00 +00:00
Simon Pilgrim 5ad10f4df9 [SLP][X86] Regenerate operandorder tests with arguments on same line. NFCI.
Stops update_test_checks.py from splitting the later arguments after the CHECKs.

llvm-svn: 357679
2019-04-04 09:31:12 +00:00
Sanjay Patel b276dd195a [InstCombine] canonicalize select shuffles by commuting
In PR41304:
https://bugs.llvm.org/show_bug.cgi?id=41304
...we have a case where we want to fold a binop of select-shuffle (blended) values.

Rather than try to match commuted variants of the pattern, we can canonicalize the
shuffles and check for mask equality with commuted operands.

We don't produce arbitrary shuffle masks in instcombine, but select-shuffles are a
special case that the backend is required to handle because we already canonicalize
vector select to this shuffle form.

So there should be no codegen difference from this change. It's possible that this
improves CSE in IR though.

Differential Revision: https://reviews.llvm.org/D60016

llvm-svn: 357366
2019-03-31 15:01:30 +00:00
Simon Pilgrim 6a75c36ea9 [SLP] Add support for commutative icmp/fcmp predicates
For the cases where the icmp/fcmp predicate is commutative, use reorderInputsAccordingToOpcode to collect and commute the operands.

This requires a helper to recognise commutativity in both general Instruction and CmpInstr types - the CmpInst::isCommutative doesn't overload the Instruction::isCommutative method for reasons I'm not clear on (maybe because its based on predicate not opcode?!?).

Differential Revision: https://reviews.llvm.org/D59992

llvm-svn: 357266
2019-03-29 15:28:25 +00:00
Simon Pilgrim 62f0d1650a [SLP] Add support for swapping icmp/fcmp predicates to permit vectorization
We should be able to match elements with the swapped predicate as well - as long as we commute the source operands.

Differential Revision: https://reviews.llvm.org/D59956

llvm-svn: 357243
2019-03-29 10:41:00 +00:00
Simon Pilgrim ceb3de5d25 [SLP][X86] Add tests showing failure to commute icmp/fcmp by swapping predicate
By swapping icmp/fcmp predicates we can commute their operands to improve vectorization

llvm-svn: 357204
2019-03-28 19:13:38 +00:00
Simon Pilgrim 66b5e322fc [SLP][X86] Add tests showing failure to commute icmp/fcmp operands
Some predicates are fully commutative - we should be able to easily commute their operands to improve vectorization

llvm-svn: 357202
2019-03-28 19:03:53 +00:00
Simon Pilgrim 6f96795b88 [SLPVectorizer] Merge reorderAltShuffleOperands into reorderInputsAccordingToOpcode
As discussed on D59738, this generalizes reorderInputsAccordingToOpcode to handle multiple + non-commutative instructions so we can get rid of reorderAltShuffleOperands and make use of the extra canonicalizations that reorderInputsAccordingToOpcode brings.

Differential Revision: https://reviews.llvm.org/D59784

llvm-svn: 356939
2019-03-25 20:05:27 +00:00
Simon Pilgrim 77749567a1 [SLPVectorizer] Update file missed in rL356913
Differential Revision: https://reviews.llvm.org/D59738

llvm-svn: 356915
2019-03-25 16:14:21 +00:00
Simon Pilgrim ff3abef395 [SLPVectorizer] reorderInputsAccordingToOpcode - remove non-Instruction canonicalization
Remove attempts to commute non-Instructions to the LHS - the codegen changes appear to rely on chance more than anything else and also have a tendency to fight existing instcombine canonicalization which moves constants to the RHS of commutable binary ops.

This is prep work towards:
(a) reusing reorderInputsAccordingToOpcode for alt-shuffles and removing the similar reorderAltShuffleOperands
(b) improving reordering to optimized cases with commutable and non-commutable instructions to still find splat/consecutive ops.

Differential Revision: https://reviews.llvm.org/D59738

llvm-svn: 356913
2019-03-25 15:53:55 +00:00
Simon Pilgrim 9eb0de8573 [X86][SLP] Show example of failure to uniformly commute splats for 'alt' shuffles.
If either the main/alt opcodes isn't commutable we may end up with the splats not correctly commuted to the same side.

llvm-svn: 356837
2019-03-23 16:14:04 +00:00
Sanjay Patel a0aaa11afc [SLP] fix variables names in test; NFC
'tmpXXX' conflicts with the auto-generated script regex names.
That could cause mask a bug or fail if the output changes.

llvm-svn: 356790
2019-03-22 18:33:11 +00:00
Dinar Temirbulatov f95351b918 [SLPVectorizer] Add test related to SLP Throttling support, NFCI.
llvm-svn: 356754
2019-03-22 14:50:53 +00:00
Florian Hahn add2d2e304 [SLP] Fix invalid triple in X86 tests
x86-64 is an invalid architecture in triples. Changing it to the correct
triple (x86_64) changes some tests, because SLP is not deemed profitable
any more.

Reviewers: ABataev, RKSimon, spatel

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D58931

llvm-svn: 355420
2019-03-05 17:56:35 +00:00
Simon Pilgrim a066f1f9e6 [Vectorizer] Add vectorization support for fixed smul/umul intrinsics
This requires a couple of tweaks to existing vectorization functions as they were assuming that only the second call argument (ctlz/cttz/powi) could ever be the 'always scalar' argument, but for smul.fix + umul.fix its the third argument.

Differential Revision: https://reviews.llvm.org/D58616

llvm-svn: 354790
2019-02-25 15:42:02 +00:00
Simon Pilgrim f54186abb6 [SLPVectorizer][X86] Add fixed smul/umul tests
Baseline tests - fixed mul intrinsics aren't flagged as vectorizable yet

llvm-svn: 354783
2019-02-25 13:26:30 +00:00
Simon Pilgrim 9921e73d95 [SLPVectorizer][X86] Add add/sub/mul overflow tests
Baseline tests - overflow intrinsics aren't flagged as vectorizable yet

llvm-svn: 354454
2019-02-20 12:04:54 +00:00
Eric Christopher 2534592b9f Temporarily Revert "[X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)"
As this has broken the lto bootstrap build for 3 days and is
showing a significant regression on the Dither_benchmark results (from
the LLVM benchmark suite) -- specifically, on the
BENCHMARK_FLOYD_DITHER_128, BENCHMARK_FLOYD_DITHER_256, and
BENCHMARK_FLOYD_DITHER_512; the others are unchanged.  These have
regressed by about 28% on Skylake, 34% on Haswell, and over 40% on
Sandybridge.

This reverts commit r353923.

llvm-svn: 354434
2019-02-20 04:42:07 +00:00
Anton Afanasyev ca9aff9353 [X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)
Try to use 64-bit SLP vectorization. In addition to horizontal instrs
this change triggers optimizations for partial vector operations (for instance,
using low halfs of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by
<2 x float>).

Fixes llvm.org/PR32433

llvm-svn: 353923
2019-02-13 08:26:43 +00:00
Simon Pilgrim adca820927 [TTI] Add generic SADDSAT/SSUBSAT costs
Add generic costs calculation for SADDSAT/SSUBSAT intrinsics, this uses generic costs for sadd_with_overflow/ssub_with_overflow, an extra sign comparison + a selects based on the sign/overflow.

This completes PR40316

Differential Revision: https://reviews.llvm.org/D57239

llvm-svn: 352315
2019-01-27 13:51:59 +00:00
Nemanja Ivanovic 7d007ddedf [PowerPC] Update Vector Costs for P9
For the power9 CPU, vector operations consume a pair of execution units rather
than one execution unit like a scalar operation. Update the target transform
cost functions to reflect the higher cost of vector operations when targeting
Power9.

Patch by RolandF.

Differential revision: https://reviews.llvm.org/D55461

llvm-svn: 352261
2019-01-26 01:18:48 +00:00
Simon Pilgrim a131e4e296 [TTI] Add generic UADDSAT/USUBSAT costs
Add generic costs calculation for UADDSAT/USUBSAT intrinsics, this fallbacks to using generic costs for uadd_with_overflow/usub_with_overflow + a select.

Differential Revision: https://reviews.llvm.org/D56907

llvm-svn: 352044
2019-01-24 12:27:10 +00:00
Simon Pilgrim c934d3a01b [CostModel][X86] Add explicit vector select costs
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.

Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).

The increase pre-SSE41 selection costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.

llvm-svn: 351685
2019-01-20 13:55:01 +00:00
Simon Pilgrim 1231904c48 [CostModel][X86] Add explicit fcmp costs for pre-SSE42 targets
Typical throughputs: cmpss/cmpps = 1cy and cmpsd/cmppd = 2cy before the Core2 era

llvm-svn: 351684
2019-01-20 13:21:43 +00:00
Alexey Bataev 18809a6bbb [SLP] Fix PR40310: The reduction nodes should stay scalar.
Summary:
Sometimes the SLP vectorizer tries to vectorize the horizontal reduction
nodes during regular vectorization. This may happen inside of the loops,
when there are some vectorizable PHIs. Patch fixes this by checking if
the node is the reduction node and thus it must not be vectorized, it must
be gathered.

Reviewers: RKSimon, spatel, hfinkel, fedor.sergeev

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56783

llvm-svn: 351349
2019-01-16 15:39:52 +00:00
Alexey Bataev 9514b1c6b4 [SLP] Added test for PR40310, NFC.
llvm-svn: 351240
2019-01-15 20:54:44 +00:00
Nikita Popov d3b86b79fa Reapply "[CodeGen][X86] Expand USUBSAT to UMAX+SUB, also for vectors"
Related to https://bugs.llvm.org/show_bug.cgi?id=40123.

Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB,
which produces much better code for X86.

Reapplying with updated SLPVectorizer tests.

Differential Revision: https://reviews.llvm.org/D56636

llvm-svn: 351219
2019-01-15 18:43:41 +00:00
Simon Pilgrim 4e38b8f8bd [SLP][X86] Split prefer-256-bit 'AVX256BW' tests from AVX2 checks
Fixes SLP test issue with D56636

llvm-svn: 351199
2019-01-15 16:13:37 +00:00
James Y Knight 786558882c Update GettingStarted guide to recommend that people use the new
official Git repository.

Remove the directions for using git-svn, and demote the prominence of
the svn instructions.

Also, fix a few other issues while I'm in there:

* Mention LLVM_ENABLE_PROJECTS more.
* Getting started doesn't need to mention test-suite, but should
  mention clang and the other projects.
* Remove mentions of "configure", since that's long gone.

I've also adjusted a few other mentions of svn to point to github, but
have not done so comprehensively.

Differential Revision: https://reviews.llvm.org/D56654

llvm-svn: 351130
2019-01-14 22:27:32 +00:00
Alexey Bataev 8da9a7538e [SLP]Moved NVPTX test under NVPTX directory, NFC.
llvm-svn: 350969
2019-01-11 20:42:48 +00:00
Alexey Bataev ce2c8b3360 [SLP]Update test checks for the SPL vectorizer, NFC.
llvm-svn: 350967
2019-01-11 20:21:14 +00:00
Simon Pilgrim c2aadfaaad [SLPVectorizer] Flag ADD/SUB SSAT/USAT intrinsics trivially vectorizable (PR40123)
Enables SLP vectorization for the SSE2 PADDS/PADDUS/PSUBS/PSUBUS style intrinsics

llvm-svn: 350300
2019-01-03 12:18:23 +00:00
Simon Pilgrim c5e22b29e4 [SLPVectorizer][X86] Add ADD/SUB SSAT/USAT tests (PR40123)
llvm-svn: 350297
2019-01-03 12:02:14 +00:00
Craig Topper 84d62ce6c1 [CostModel][X86][AArch64] Adjust cost of the scalarization part of min/max reduction.
Summary: The comment says we need 3 extracts and a select at the end. But didn't we just account for the select in the vector cost above. Aren't we just extracting the single element after taking the min/max in the vector register?

Reviewers: RKSimon, spatel, ABataev

Reviewed By: RKSimon

Subscribers: javed.absar, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D55480

llvm-svn: 348739
2018-12-10 06:58:58 +00:00
Craig Topper d1498ed8df [CostModel][X86] Fix overcounting arithmetic cost in illegal types in getArithmeticReductionCost/getMinMaxReductionCost
We were overcounting the number of arithmetic operations needed at each level before we reach a legal type. We were using the full vector type for that level, but we are going to split the input vector at that level in half. So the effective arithmetic operation cost at that level is half the width.

So for example on 8i32 on an sse target. Were were calculating the cost of an 8i32 op which is likely 2 for basic integer. Then after the loop we count 2 more v4i32 ops. For a total arith cost of 4. But if you look at the assembly there would only be 3 arithmetic ops.

There are still more bugs in this code that I'm going to work on next. The non pairwise code shouldn't count extract subvectors in the loop. There are no extracts, the types are split in registers. For pairwise we need to use 2 two src permute shuffles.

Differential Revision: https://reviews.llvm.org/D55397

llvm-svn: 348621
2018-12-07 18:20:56 +00:00
Simon Pilgrim 924f98e579 Add common check prefix. NFCI.
llvm-svn: 348265
2018-12-04 14:32:42 +00:00
Simon Pilgrim 102854f4d4 [TTI] Reduction costs only need to include a single extract element cost (REAPPLIED)
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.

For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.

Fixes PR37731

Rebased and reapplied after being reverted in rL347541 due to PR39774 - which was fixed by D54955/rL347759 and D55017/rL347997

Differential Revision: https://reviews.llvm.org/D54585

llvm-svn: 348076
2018-12-01 14:18:31 +00:00
Craig Topper 502fc1bdd5 [X86] Split skylake-avx512 run lines in SLP vectorizer tests to cover -mprefer=vector-width=256 and -mprefer-vector-width=512.
This will make these tests immune if we ever change the default behavior of -march=skylake-avx512 to prefer 256 bit vectors.

llvm-svn: 348046
2018-11-30 22:53:21 +00:00
Alexey Bataev 3689747619 [SLP]PR39774: Update references of the replaced external instructions.
Summary:
An additional fix for PR39774. Need to update the references for the
RedcutionRoot instruction when it is replaced during the vectorization
phase to avoid compiler crash on reduction vectorization.

Reviewers: RKSimon, spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D55017

llvm-svn: 347997
2018-11-30 15:14:20 +00:00
Alexey Bataev 579c2d9d64 [SLP]Fix PR39774: Set ReductionRoot if the original instruction is vectorized.
Summary:
If the original reduction root instruction was vectorized, it might be
removed from the tree. It means that the insertion point may become
invalidated and the whole vectorization of the reduction leads to the
incorrect output result.
The ReductionRoot instruction must be marked as externally used so it
could not be removed. Otherwise it might cause inconsistency with the
cost model and we may end up with too optimistic optimization.

Reviewers: RKSimon, spatel, hfinkel, mkuper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54955

llvm-svn: 347759
2018-11-28 14:34:11 +00:00
Fedor Sergeev 8cd9d1b5ce Revert "[TTI] Reduction costs only need to include a single extract element cost"
This reverts commit r346970.
It was causing PR39774, a crash in slp-vectorizer on a rather simple loop
with just a bunch of 'and's in the body.

llvm-svn: 347541
2018-11-26 10:17:27 +00:00
Simon Pilgrim 924f193419 [TTI] Reduction costs only need to include a single extract element cost
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.

For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.

Fixes PR37731

Differential Revision: https://reviews.llvm.org/D54585

llvm-svn: 346970
2018-11-15 17:42:53 +00:00
Simon Pilgrim 5a1b7cea91 [SLPVectorizer][X86] Regenerate reduction minmax tests and cleanup check prefixes
llvm-svn: 346965
2018-11-15 16:34:15 +00:00
Simon Pilgrim 4dd692ec2a [SLPVectorizer][X86] Regenerate reduction tests and add PR37731 test
Cleanup check prefixes

llvm-svn: 346964
2018-11-15 16:08:25 +00:00
Simon Pilgrim fc8f1d7da7 [CostModel][X86] SK_ExtractSubvector is free if the subvector is at the start of the source vector
llvm-svn: 346538
2018-11-09 19:04:27 +00:00
Simon Pilgrim 53e8e145e9 [CostModel][X86] Add realistic vXi64 uitofp vXf64 costs
Match codegen improvements from D53649/rL345256

llvm-svn: 345263
2018-10-25 13:06:20 +00:00
Simon Pilgrim 0573b8d8b6 [CostModel][X86] Add realistic i64 uitofp f64 scalar costs
llvm-svn: 345261
2018-10-25 12:42:10 +00:00
Simon Pilgrim 532a0f122e [SLPVectorizer] Add basic support for mul/and/or/xor horizontal reductions
Expand arithmetic reduction to include mul/and/or/xor instructions.

This patch just fixes the SLPVectorizer - the effective reduction costs for AVX1+ are still poor (see rL344846) and will need to be improved before SLP sees this as a valid transform - but we can already see the effect on SSE2 tests.

This partially helps PR37731, but doesn't fix it all as it still falls over on the extraction/reduction order for some reason.

Differential Revision: https://reviews.llvm.org/D53473

llvm-svn: 345037
2018-10-23 15:13:09 +00:00
Simon Pilgrim 0c35aa114d [SLPVectorizer][X86] Add mul/and/or/xor unrolled reduction tests
We miss arithmetic reduction for everything but Add/FAdd (I assume because that's the only cases which x86 has horizontal ops for.....)

llvm-svn: 344849
2018-10-20 15:17:27 +00:00
Sanjay Patel 007416acc8 [SLPVectorizer] regenerate test checks; NFC
llvm-svn: 344848
2018-10-20 14:53:07 +00:00
Sam Clegg 81abca32fb [SLPVectorizer] Check that lowered type is floating point before calling isFabsFree
In the case of soft-fp (e.g. fp128 under wasm) the result of
getTypeLegalizationCost() can be an integer type even if the input is
floating point (See LegalizeTypeAction::TypeSoftenFloat).

Before calling isFabsFree() (which asserts if given a non-fp
type) we need to check that that result is fp.  This is safe since in
fabs is certainly not free in the soft-fp case.

Fixes PR39168

Differential Revision: https://reviews.llvm.org/D52899

llvm-svn: 344069
2018-10-09 18:41:17 +00:00
Matt Arsenault 80ea6dd1d5 Fix vectorization of canonicalize
llvm-svn: 342390
2018-09-17 13:24:30 +00:00
Matt Arsenault c807ce0ee4 SLPVectorizer: Fix assert with different sized address spaces
llvm-svn: 341215
2018-08-31 14:34:53 +00:00
Alexey Bataev 0edcd0278d [SLP] Fix insert point for reused extract instructions.
Summary:
Reworked the previously committed patch to insert shuffles for reused
extract element instructions in the correct position. Previous logic was
incorrect, and might lead to the crash with PHIs and EH instructions.

Reviewers: efriedma, javed.absar

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D50143

llvm-svn: 339166
2018-08-07 19:21:05 +00:00
Alexey Bataev c0c3a6ed5e [SLP] Fix PR38339: Instruction does not dominate all uses!
Summary:
If the ExtractElement instructions can be optimized out during the
vectorization and we need to reshuffle the parent vector, this
ShuffleInstruction may be inserted in the wrong place causing compiler
to produce incorrect code.

Reviewers: spatel, RKSimon, mkuper, hfinkel, javed.absar

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D49928

llvm-svn: 338380
2018-07-31 14:02:43 +00:00
Roman Tereshin ed047b0184 [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
as well as sext(C + x + ...) -> (D + sext(C-D + x + ...))<nuw><nsw>
similar to the equivalent transformation for zext's

if the top level addition in (D + (C-D + x * n)) could be proven to
not wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRec's

(indeed, there are 2^(2w) different expressions in `B1 + ext(B2 + Y)` form for
the same Y, but only 2^(2w - k) different expressions in the resulting `B3 +
ext((B4 * 2^k) + Y)` form, where w is the bit width of the integral type)

This patch generalizes sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformations added in
r209568 relaxing the requirements the following way:

1. C2 doesn't have to be a power of 2, it's enough if it's divisible by 2
 a sufficient number of times;
2. C1 doesn't have to be less than C2, instead of extracting the entire
  C1 we can split it into 2 terms: (00...0XXX + YY...Y000), keep the
  second one that may cause wrapping within the extension operator, and
  move the first one that doesn't affect wrapping out of the extension
  operator, enabling further simplifications;
3. C1 and C2 don't have to be positive, splitting C1 like shown above
 produces a sum that is guaranteed to not wrap, signed or unsigned;
4. in AddExpr case there could be more than 2 terms, and in case of
  AddExpr the 2nd and following terms and in case of AddRecExpr the
  Step component don't have to be in the C2*X form or constant
  (respectively), they just need to have enough trailing zeros,
  which in turn could be guaranteed by means other than arithmetics,
  e.g. by a pointer alignment;
5. the extension operator doesn't have to be a sext, the same
  transformation works and profitable for zext's as well.

Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:

 double bar(double *a, unsigned n) {
   double x = 0.0;
   double y = 0.0;
   for (unsigned i = 0; i < n; i += 2) {
     x += a[i];
     y += a[i + 1];
   }
   return x * y;
 }

If compiled with `clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm`
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"})

it produces scalar code with the loop not unrolled with the unsigned `n` and
`i` (like shown above), but vectorized and unrolled loop with signed `n` and
`i`. With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).

How it all works:

Let say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
minimum number of trailing zeroes guaranteed of that sum w/o the
constant term: (x + y + ...). If, for example, those terms look like
follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it can not possibly change the bits to the
right. So we can compute the number of trailing zeroes by taking a
minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:

         j
CCCC...CCCC
XXXX...XX00  // this is X = (x + y + ...)

Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).

So let's split C to 2 constants like follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------  // let's add them up
YYYY...YY00

The sum above (let's call it Y)) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that that final add does is
setting the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example, let's say we're working in i8's and the original
expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes
like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>

This approach could be improved if we move away from using trailing
zeroes and use KnownBits instead. For instance, with KnownBits we could
have the following picture:

    i
10 1110...0011  // this is C
XX X1XX...XX00  // this is X = (x + y + ...)

Notice that some of the bits of X are known ones, also notice that
known bits of X are interspersed with unknown bits and not grouped on
the rigth or left.

We can see at the position i that C(i) and X(i) are both known ones,
therefore the (i + 1)th carry bit is guaranteed to be 1 regardless of
the bits of C to the right of i. For instance, the C(i - 1) bit only
affects the bits of the sum at positions i - 1 and i, and does not
influence if the sum is going to wrap or not. Therefore we could split
the constant C the following way:

    i
00 0010...0011  = D
10 1100...0000  = (C - D)

Let's compute the KnownBits of (C - D) + X:

XX1 1            = carry bit, blanks stand for known zeroes
 10 1100...0000  = (C - D)
 XX X1XX...XX00  = X
--- -----------
 XX X0XX...XX00

Will this add wrap or not essentially depends on bits of X. Adding D
to this sum, however, is guaranteed to not to wrap:

0    X
 00 0010...0011  = D
 sX X0XX...XX00  = (C - D) + X
--- -----------
 sX XXXX   XX11

As could be seen above, adding D preserves the sign bit of (C - D) +
X, if any, and has a guaranteed 0 carry out, as expected.

The more bits of (C - D) we constrain, the better the transformations
introduced here canonicalize expressions as it leaves less freedom to
what values the constant part of ((C - D) + x + y + ...) can take.

Reviewed By: mzolotukhin, efriedma

Differential Revision: https://reviews.llvm.org/D48853

llvm-svn: 337943
2018-07-25 18:01:41 +00:00
Simon Pilgrim 1a4f3c93fb [SLPVectorizer] Don't attempt horizontal reduction on pointer types (PR38191)
TTI::getMinMaxReductionCost typically can't handle pointer types - until this is changed its better to limit horizontal reduction to integer/float vector types only.

llvm-svn: 337280
2018-07-17 13:43:33 +00:00
Simon Pilgrim 36b944e778 [SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED-2)
We currently only support binary instructions in the alternate opcode shuffles.

This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:

1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.

Reapplied with fix to only accept 2 different casts if they come from the same source type (PR38154).

Differential Revision: https://reviews.llvm.org/D49135

llvm-svn: 336989
2018-07-13 11:09:52 +00:00
Martin Storsjo 86b95489fd Revert "[SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED)"
This reverts commit r336812, which broke compilation of a number
of projects, see PR38154.

llvm-svn: 336949
2018-07-12 21:33:42 +00:00
Simon Pilgrim 876e99bf2c [SLPVectorizer] Add initial alternate opcode support for cast instructions. (REAPPLIED)
We currently only support binary instructions in the alternate opcode shuffles.

This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:

1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.

Reapplied with fix to only accept 2 different casts if they come from the same source type.

Differential Revision: https://reviews.llvm.org/D49135

llvm-svn: 336812
2018-07-11 15:05:10 +00:00
Simon Pilgrim 09832e33b4 [SLPVectorizer] Ensure alternate/passthrough doesn't vectorize sdiv with undef elts
llvm-svn: 336809
2018-07-11 14:34:43 +00:00
Simon Pilgrim 0c7bea0dc0 [SLPVectorizer] Add some additional alternate cast tests
Initial attempt at D49135 failed as we weren't correctly handling casts with different source types.

llvm-svn: 336808
2018-07-11 14:29:13 +00:00
Simon Pilgrim 1edde95abd Revert rL336804: [SLPVectorizer] Add initial alternate opcode support for cast instructions.
Reverting due to buildbot failures

llvm-svn: 336806
2018-07-11 14:08:16 +00:00
Simon Pilgrim 2f963a7e83 [SLPVectorizer] Add initial alternate opcode support for cast instructions.
We currently only support binary instructions in the alternate opcode shuffles.

This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:

1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.

Differential Revision: https://reviews.llvm.org/D49135

llvm-svn: 336804
2018-07-11 13:34:09 +00:00
Farhana Aleen 3b416db19b [SLP] Recognize min/max pattern using instructions producing same values.
Summary: It is common to have the following min/max pattern during the intermediate stages of SLP since we only optimize at the end. This patch tries to catch such patterns and allow more vectorization.

         %1 = extractelement <2 x i32> %a, i32 0
         %2 = extractelement <2 x i32> %a, i32 1
         %cond = icmp sgt i32 %1, %2
         %3 = extractelement <2 x i32> %a, i32 0
         %4 = extractelement <2 x i32> %a, i32 1
         %select = select i1 %cond, i32 %3, i32 %4

Author: FarhanaAleen

Reviewed By: ABataev, RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D47608

llvm-svn: 336130
2018-07-02 17:55:31 +00:00
Simon Pilgrim 35f196c179 [SLPVectorizer][X86] Begin adding alternate tests for call operators
Alternate opcode handling only supports binary operators, these tests demonstrate a missed opportunity to vectorize ceil/floor calls

llvm-svn: 336125
2018-07-02 17:23:45 +00:00
Simon Pilgrim 265793d52a [SLPVectorizer] Fix alternate opcode + shuffle cost function to correct handle SK_Select patterns.
We were always using the opcodes of the first 2 scalars for the costs of the alternate opcode + shuffle. This made sense when we used SK_Alternate and opcodes were guaranteed to be alternating, but this fails for the more general SK_Select case.

This fix exposes an issue demonstrated by the fmul_fdiv_v4f32_const test - the SLM model has v4f32 fdiv costs which are more than twice those of the f32 scalar cost, meaning that the cost model determines that the vectorization is not performant. Unfortunately it completely ignores the fact that the fdiv by a constant will be changed into a fmul by InstCombine for a much lower cost vectorization. But at least we're seeing this now...

llvm-svn: 336095
2018-07-02 11:28:01 +00:00
Simon Pilgrim 84f77ecba9 [SLPVectorizer][X86] Add some alternate tests for cast operators
Alternate opcode handling only supports binary operators, these tests demonstrate missed opportunities to vectorize some sitofp/uitofp and fptosi/fptoui style casts as well as some (successful) float bits manipulations

llvm-svn: 336060
2018-07-01 11:29:46 +00:00
Simon Pilgrim bbfc18b5b5 [SLPVectorizer] Recognise non uniform power of 2 constants
Since D46637 we are better at handling uniform/non-uniform constant Pow2 detection; this patch tweaks the SLP argument handling to support them.

As SLP works with arrays of values I don't think we can easily use the pattern match helpers here.

Differential Revision: https://reviews.llvm.org/D48214

llvm-svn: 335621
2018-06-26 16:20:16 +00:00
Simon Pilgrim 9d3ef8ee2b [SLPVectorizer] Support alternate opcodes in tryToVectorizeList
Enable tryToVectorizeList to support InstructionsState alternate opcode patterns at a root (build vector etc.) as well as further down the vectorization tree.

NOTE: This patch reduces some of the debug reporting if there are opcode mismatches - I can try to add it back if it proves a problem. But it could get rather messy trying to provide equivalent verbose debug strings via getSameOpcode etc.

Differential Revision: https://reviews.llvm.org/D48488

llvm-svn: 335364
2018-06-22 16:37:34 +00:00
Simon Pilgrim 1e564504bb [SLPVectorizer] Relax alternate opcodes to accept any BinaryOperator pair
SLP currently only accepts (F)Add/(F)Sub alternate counterpart ops to be merged into an alternate shuffle.

This patch relaxes this to accept any pair of BinaryOperator opcodes instead, assuming the target's cost model accepts the vectorization+shuffle.

Differential Revision: https://reviews.llvm.org/D48477

llvm-svn: 335349
2018-06-22 14:04:06 +00:00
Simon Pilgrim 229a781214 [SLPVectorizer][X86] Add alternate opcode tests for simple build vector cases
llvm-svn: 335348
2018-06-22 13:53:58 +00:00
Simon Pilgrim 9c8f9374b5 [CostModel][AArch64] Add some initial costs for SK_Select and SK_PermuteSingleSrc
AArch64 was only setting costs for SK_Transpose, which meant that many of the simpler shuffles (e.g. SK_Select and SK_PermuteSingleSrc for larger vector elements) was being severely overestimated by the default shuffle expansion.

This patch adds costs to help improve SLP performance and avoid a regression in reductions introduced by D48174.

I'm not very knowledgeable about AArch64 shuffle lowering so I've kept the extra costs to a minimum - someone who knows this code can add extra costs which should improve vectorization a lot more.

Differential Revision: https://reviews.llvm.org/D48172

llvm-svn: 335329
2018-06-22 09:45:31 +00:00
Simon Pilgrim 2a9cde026c [X86][AVX] Reduce v4f64/v4i64 shuffle costs (PR37882)
These were being over cautious for costs for one/two op general shuffles - VSHUFPD doesn't have to replicate the same shuffle in both lanes like VSHUFPS does. 

llvm-svn: 335216
2018-06-21 11:37:13 +00:00
Simon Pilgrim d08fbf6486 [SLPVectorizer][X86] Add horizontal add/sub tests
Shows PR37882 perf regression

llvm-svn: 335215
2018-06-21 11:16:10 +00:00
Simon Pilgrim 2e2f20a949 [SLPVectorizer] Relax "alternate" opcode vectorisation to work with any SK_Select shuffle pattern
D47985 saw the old SK_Alternate 'alternating' shuffle mask replaced with the SK_Select mask which accepts either input operand for each lane, equivalent to a vector select with a constant condition operand.

This patch updates SLPVectorizer to make full use of this SK_Select shuffle pattern by removing the 'isOdd()' limitation.

The AArch64 regression will be fixed by D48172.

Differential Revision: https://reviews.llvm.org/D48174

llvm-svn: 335130
2018-06-20 14:26:28 +00:00
Simon Pilgrim 180497ea11 [SLP][X86] Add AVX2 run to POW2 SDIV Tests
Non-uniform pow2 tests are only make sense on targets with fast (low cost) non-uniform shifts

llvm-svn: 334821
2018-06-15 10:29:37 +00:00
Simon Pilgrim ca6215f8c8 [SLP][X86] Regenerate POW2 SDIV Tests
Added non-uniform pow2 test as well

llvm-svn: 334819
2018-06-15 10:07:03 +00:00
Farhana Aleen 078cd48a39 [SLP] Add testcases of min/max reduction pattern for AMDGPU.
Author: FarhanaAleen
llvm-svn: 334435
2018-06-11 20:29:31 +00:00
Matt Arsenault 1349a04ef5 AMDGPU: Make v2i16/v2f16 legal on VI
This usually results in better code. Fixes using
inline asm with short2, and also fixes having a different
ABI for function parameters between VI and gfx9.

Partially cleans up the mess used for lowering of the d16
operations. Making v4f16 legal will help clean this up more,
but this requires additional work.

llvm-svn: 332953
2018-05-22 06:32:10 +00:00
Farhana Aleen e24f3ff8de [AMDGPU] Support horizontal vectorization of min/max.
Author: FarhanaAleen

Reviewed By: rampitec

Subscribers: AMDGPU

Differential Revision: https://reviews.llvm.org/D46604

llvm-svn: 331920
2018-05-09 21:18:34 +00:00
Shiva Chen 2c864551df [DebugInfo] Add DILabel metadata and intrinsic llvm.dbg.label.
In order to set breakpoints on labels and list source code around
labels, we need collect debug information for labels, i.e., label
name, the function label belong, line number in the file, and the
address label located. In order to keep these information in LLVM
IR and to allow backend to generate debug information correctly.
We create a new kind of metadata for labels, DILabel. The format
of DILabel is

!DILabel(scope: !1, name: "foo", file: !2, line: 3)

We hope to keep debug information as much as possible even the
code is optimized. So, we create a new kind of intrinsic for label
metadata to avoid the metadata is eliminated with basic block.
The intrinsic will keep existing if we keep it from optimized out.
The format of the intrinsic is

llvm.dbg.label(metadata !1)

It has only one argument, that is the DILabel metadata. The
intrinsic will follow the label immediately. Backend could get the
label metadata through the intrinsic's parameter.

We also create DIBuilder API for labels to be used by Frontend.
Frontend could use createLabel() to allocate DILabel objects, and use
insertLabel() to insert llvm.dbg.label intrinsic in LLVM IR.

Differential Revision: https://reviews.llvm.org/D45024

Patch by Hsiangkai Wang.

llvm-svn: 331841
2018-05-09 02:40:45 +00:00
Farhana Aleen e2dfe8a853 [AMDGPU] Support horizontal vectorization.
Author: FarhanaAleen

Reviewed By: rampitec, arsenm

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D46213

llvm-svn: 331313
2018-05-01 21:41:12 +00:00
Matthew Simpson 661e6a02bd [SLP] Add additional test for transposable binary operations with reuse
llvm-svn: 331274
2018-05-01 15:59:26 +00:00
Davide Italiano bd3bf1660b [SLPVectorizer] Debug info shouldn't impact spill cost computation.
<rdar://problem/39794738>

(Also, PR32761).

Differential Revision:  https://reviews.llvm.org/D46199

llvm-svn: 331199
2018-04-30 16:57:33 +00:00
Benjamin Kramer 733c7fc55d [NVPTX] Turn on Loop/SLP vectorization
Since PTX has grown a <2 x half> datatype vectorization has become more
important. The late LoadStoreVectorizer intentionally only does loads
and stores, but now arithmetic has to be vectorized for optimal
throughput too.

This is still very limited, SLP vectorization happily creates <2 x half>
if it's a legal type but there's still a lot of register moving
happening to get that fed into a vectorized store. Overall it's a small
performance win by reducing the amount of arithmetic instructions.

I haven't really checked what the loop vectorizer does to PTX code, the
cost model there might need some more tweaks. I didn't see it causing
harm though.

Differential Revision: https://reviews.llvm.org/D46130

llvm-svn: 331035
2018-04-27 13:36:05 +00:00
Matthew Simpson cfdec0ff70 [SLP] Add tests for transposable binary operations
These test cases are vectorizable, but we are currently unable to vectorize
them effectively.

llvm-svn: 330945
2018-04-26 14:50:04 +00:00
Craig Topper 60c7e0d587 [X86] Remove unnecessary -mattr to enable avx512bw when the -mcpu already enabled it. NFC
This makes the test similar to the arith-sub.ll and arith-mul.ll tests.

llvm-svn: 330144
2018-04-16 18:14:19 +00:00
Haicheng Wu f7466f3164 [SLP] Use getExtractWithExtendCost() to compute the scalar cost of extractelement/ext pair
We use getExtractWithExtendCost to calculate the cost of extractelement and
s|zext together when computing the extract cost after vectorization, but we
calculate the cost of extractelement and s|zext separately when computing the
scalar cost which is larger than it should be.

Differential Revision: https://reviews.llvm.org/D45469

llvm-svn: 330143
2018-04-16 18:09:49 +00:00
Haicheng Wu 5ba379557d [SLP] update a test case. NFC.
llvm-svn: 329818
2018-04-11 15:09:49 +00:00
Alexey Bataev 2f67dbb73e [SLP] Additional tests for reorder reuse vectorization, NFC.
llvm-svn: 329603
2018-04-09 19:02:34 +00:00
Simon Pilgrim f1e668830f [SLPVectorizer][X86] Regenerate some tests. NFCI
llvm-svn: 329196
2018-04-04 13:53:51 +00:00
Alexey Bataev 428e9d9d87 [SLP] Fix PR36481: vectorize reassociated instructions.
Summary:
If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.

Patch does not support reordering of the repeated instruction, this must
be handled in the separate patch.

Reviewers: RKSimon, spatel, hfinkel, mkuper, Ayal, ashahid

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43776

llvm-svn: 329085
2018-04-03 17:14:47 +00:00
Alexey Bataev 976aff148a [SLP] Added tests for checks of reordering of the repeated instructions,
NFC.

llvm-svn: 329080
2018-04-03 16:31:26 +00:00
Benjamin Kramer 2fc3b18922 Revert "[SLP] Fix PR36481: vectorize reassociated instructions."
This reverts commit r328980 and r329046. Makes the vectorizer crash.

llvm-svn: 329071
2018-04-03 14:40:33 +00:00
Haicheng Wu 7f0daaeb86 [SLP] Distinguish "demanded and shrinkable" from "demanded and not shrinkable" values when determining the minimum bitwidth
We use two approaches for determining the minimum bitwidth.

   * Demanded bits
   * Value tracking

If demanded bits doesn't result in a narrower type, we then try value tracking.
We need this if we want to root SLP trees with the indices of getelementptr
instructions since all the bits of the indices are demanded.

But there is a missing piece though. We need to be able to distinguish "demanded
and shrinkable" from "demanded and not shrinkable". For example, the bits of %i
in

%i = sext i32 %e1 to i64
%gep = getelementptr inbounds i64, i64* %p, i64 %i

are demanded, but we can shrink %i's type to i32 because it won't change the
result of the getelementptr. On the other hand, in

%tmp15 = sext i32 %tmp14 to i64
%tmp16 = insertvalue { i64, i64 } undef, i64 %tmp15, 0

it doesn't make sense to shrink %tmp15 and we can skip the value tracking.

Ideas are from Matthew Simpson!

Differential Revision: https://reviews.llvm.org/D44868

llvm-svn: 329035
2018-04-03 00:05:10 +00:00
Alexey Bataev 3decaf4275 [SLP] Fix PR36481: vectorize reassociated instructions.
Summary:
If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.

Reviewers: RKSimon, spatel, hfinkel, mkuper, Ayal, ashahid

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43776

llvm-svn: 328980
2018-04-02 14:51:37 +00:00
Dinar Temirbulatov c326c1c582 [SLPVectorizer] Add tests related to PR30787, NFCI.
llvm-svn: 328813
2018-03-29 18:57:03 +00:00
Haicheng Wu b45f921678 [SLP] Add more checks to a test case. NFC.
llvm-svn: 328572
2018-03-26 18:59:28 +00:00
Haicheng Wu 0ec1dbe417 [SLP] Add a test case. NFC.
llvm-svn: 328546
2018-03-26 16:47:37 +00:00
Matthew Simpson 6c289a1c74 [SLP] Stop counting cost of gather sequences with multiple uses
When building the SLP tree, we look for reuse among the vectorized tree
entries. However, each gather sequence is represented by a unique tree entry,
even though the sequence may be identical to another one. This means, for
example, that a gather sequence with two uses will be counted twice when
computing the cost of the tree. We should only count the cost of the definition
of a gather sequence rather than its uses. During code generation, the
redundant gather sequences are emitted, but we optimize them away with CSE. So
it looks like this problem just affects the cost model.

Differential Revision: https://reviews.llvm.org/D44742

llvm-svn: 328316
2018-03-23 14:18:27 +00:00
Matthew Simpson b17fff79f0 [SLP] Add test case for a gather sequence with multiple uses
llvm-svn: 328133
2018-03-21 19:13:14 +00:00
Matthew Simpson eacfefd056 [AArch64] Implement getArithmeticReductionCost
This patch provides an implementation of getArithmeticReductionCost for
AArch64. We can specialize the cost of add reductions since they are computed
using the 'addv' instruction.

Differential Revision: https://reviews.llvm.org/D44490

llvm-svn: 327702
2018-03-16 11:34:15 +00:00
Alexey Bataev 625ce229b1 [SLP] Additional tests for stores vectorization, NFC.
llvm-svn: 326740
2018-03-05 20:20:12 +00:00
Mohammad Shahid ddeee12f59 [SLP] Added new tests and updated existing for jumbled load, NFC.
llvm-svn: 326303
2018-02-28 04:19:34 +00:00
Sanjay Patel 04d1d79ee5 [AArch64] add SLP test based on TSVC; NFC
This is a slight reduction of one of the benchmarks
that suffered with D43079. Cost model changes should
not cause this test to remain scalarized.

llvm-svn: 326217
2018-02-27 18:06:15 +00:00
Simon Pilgrim 9929f90740 [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280)
Agner's tables indicate that for SSE42+ targets (Core2 and later) we can reduce the FADD/FSUB/FMUL costs down to 1, which should fix the Himeno benchmark.

Note: the AVX512 FDIV costs look rather dodgy, but this isn't part of this patch.

Differential Revision: https://reviews.llvm.org/D43733

llvm-svn: 326133
2018-02-26 22:10:17 +00:00
Alexey Bataev b44e2b75e8 [SLP] Added new test + fixed some checks, NFC.
llvm-svn: 326117
2018-02-26 20:01:24 +00:00
Simon Pilgrim 864949d5e9 [SLPVectorizer][X86] Add load extend tests (PR36091)
llvm-svn: 325772
2018-02-22 12:19:34 +00:00
Sanjay Patel d53da082a0 [AArch64] fix IR names to not be 'tmp' because that gives the CHECK script problems
llvm-svn: 325718
2018-02-21 20:48:14 +00:00
Sanjay Patel ffe51e450f [AArch64] add SLP test for matmul (PR36280); NFC
This is a slight reduction of one of the benchmarks
that suffered with D43079. Cost model changes should
not cause this test to remain scalarized.

llvm-svn: 325717
2018-02-21 20:34:16 +00:00
Alexey Bataev cdd0675ddc [SLP] Fix test checks, NFC.
llvm-svn: 325689
2018-02-21 15:32:58 +00:00
Sanjay Patel e6143904b9 revert r325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280)
There are too many perf regressions resulting from this, so we need to 
investigate (and add tests for) targets like ARM and AArch64 before 
trying to reinstate.

llvm-svn: 325658
2018-02-21 01:42:52 +00:00
Alexey Bataev 47dfd249f0 [SLP] Fix tests checks, NFC.
llvm-svn: 325605
2018-02-20 18:11:50 +00:00
Sanjay Patel 3e8a76abfd [TTI CostModel] change default cost of FP ops to 1 (PR36280)
This change was mentioned at least as far back as:
https://bugs.llvm.org/show_bug.cgi?id=26837#c26
...and I found a real program that is harmed by this: 
Himeno running on AMD Jaguar gets 6% slower with SLP vectorization:
https://bugs.llvm.org/show_bug.cgi?id=36280
...but the change here appears to solve that bug only accidentally.

The div/rem costs for x86 look very wrong in some cases, but that's already true, 
so we can fix those in follow-up patches. There's also evidence that more cost model
changes are needed to solve SLP problems as shown in D42981, but that's an independent 
problem (though the solution may be adjusted after this change is made).

Differential Revision: https://reviews.llvm.org/D43079

llvm-svn: 325515
2018-02-19 16:11:44 +00:00
Alexey Bataev 862c476fc2 [SLP] Fix the test for the reversed stores, NFC.
llvm-svn: 325268
2018-02-15 17:11:50 +00:00
Alexey Bataev ac619599d8 [SLP] Added test for reversed stores, NFC.
llvm-svn: 325265
2018-02-15 16:56:49 +00:00
Alexey Bataev 7f246e003a [SLP] Allow vectorization of reversed loads.
Summary:
Reversed loads are handled as gathering. But we can just reshuffle
these values. Patch adds support for vectorization of reversed loads.

Reviewers: RKSimon, spatel, mkuper, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D43022

llvm-svn: 325134
2018-02-14 15:29:15 +00:00
Alexey Bataev ca2396e673 [SLP] Take user instructions cost into consideration in insertelement vectorization.
Summary:
For better vectorization result we should take into consideration the
cost of the user insertelement instructions when we try to
vectorize sequences that build the whole vector. I.e. if we have the
following scalar code:
```
<Scalar code>
insertelement <ScalarCode>, ...
```
we should consider the cost of the last `insertelement ` instructions as
the cost of the scalar code.

Reviewers: RKSimon, spatel, hfinkel, mkuper

Subscribers: javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D42657

llvm-svn: 324893
2018-02-12 14:54:48 +00:00
Sanjay Patel 574fb73c89 [SLPVectorizer] auto-generate complete checks; NFC
llvm-svn: 324616
2018-02-08 15:32:28 +00:00
Sanjay Patel 124392f038 [SLPVectorizer] auto-generate complete checks; NFC
llvm-svn: 324615
2018-02-08 15:30:39 +00:00
Sanjay Patel e2c5e9a970 [SLPVectorizer] move RUN line to top-of-file; NFC
I was confused what we were checking because the RUN line was
in the middle of the file.

llvm-svn: 324614
2018-02-08 15:28:49 +00:00
Sanjay Patel cfa5c03039 [SLPVectorizer] auto-generate complete checks; NFC
llvm-svn: 324612
2018-02-08 15:16:26 +00:00
Alexey Bataev cd8d6de381 [SLP] Add a tests for PR36280, NFC.
llvm-svn: 324510
2018-02-07 20:11:37 +00:00
Alexey Bataev 1e593fe73e [SLP] Update test checks, NFC.
llvm-svn: 324387
2018-02-06 20:00:05 +00:00
Alexey Bataev 1c8f53f47d [SLP] Add extra test for extractelement shuffle, NFC.
llvm-svn: 323815
2018-01-30 21:06:06 +00:00
Alexey Bataev 9c5c103283 [SLP] Fix for PR32086: Count InsertElementInstr of the same elements as shuffle.
Summary:
If the same value is going to be vectorized several times in the same
tree entry, this entry is considered to be a gather entry and cost of
this gather is counter as cost of InsertElementInstrs for each gathered
value. But we can consider these elements as ShuffleInstr with
SK_PermuteSingle shuffle kind.

Reviewers: spatel, RKSimon, mkuper, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38697

llvm-svn: 323662
2018-01-29 16:08:52 +00:00
Alexey Bataev 10f5c9e765 [SLP] Add a test with extract for PR32086, NFC.
llvm-svn: 323661
2018-01-29 15:56:52 +00:00
Alexey Bataev f86be12182 Revert "[SLP] Fix for PR32086: Count InsertElementInstr of the same elements as shuffle."
This reverts commit r323530 to fix possible problems in users code.

llvm-svn: 323581
2018-01-27 02:42:21 +00:00
Alexey Bataev 7ad4e31c3b [SLP] Test for trunc vectorization, NFC.
llvm-svn: 323556
2018-01-26 20:07:55 +00:00
Alexey Bataev 167003df28 [SLP] Fix for PR32086: Count InsertElementInstr of the same elements as shuffle.
Summary:
If the same value is going to be vectorized several times in the same
tree entry, this entry is considered to be a gather entry and cost of
this gather is counter as cost of InsertElementInstrs for each gathered
value. But we can consider these elements as ShuffleInstr with
SK_PermuteSingle shuffle kind.

Reviewers: spatel, RKSimon, mkuper, hfinkel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38697

llvm-svn: 323530
2018-01-26 14:31:09 +00:00