Commit Graph

Simon Pilgrim 71a39bcf68 [X86] isHorizontalBinOp - add extract_subvector(shuffle(x)) handling (PR39921)
Lets us match horizontal op patterns on fast-variable-shuffle targets (Haswell etc.)

llvm-svn: 362327
2019-06-02 15:47:49 +00:00
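
As a rough illustration of the shape isHorizontalBinOp matches, here is a sketch (mine, not from the commit; the function name is illustrative) of the pairwise-add pattern that PHADD-style horizontal ops implement:

```
#include <stdint.h>

/* Pairwise sums of adjacent elements: out[i] = a[2i] + a[2i+1].
 * This is the shape a horizontal add (e.g. PHADDD) implements, and the
 * extract_subvector(shuffle(x)) handling above lets it be matched on
 * fast-variable-shuffle targets. */
void pairwise_sums(const int32_t a[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = a[2 * i] + a[2 * i + 1];
}
```
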
Simon Pilgrim b0dc262ffb [X86] Add AVX2 'fast-variable-shuffle' PHADD tests (PR39921)
Haswell etc. will combine shuffles to an extract_subvector(permd(x)) before isHorizontalBinOp can match it.

llvm-svn: 362326
2019-06-02 15:33:28 +00:00
Roman Lebedev 2065ddfd79 [NFC][X86] extract-lowbits.ll: add one more pattern with truncation
We are also free to interpret this as 'BZHI'/'BEXTR'.
https://rise4fun.com/Alive/dD6

llvm-svn: 362325
2019-06-02 15:07:49 +00:00
Roman Lebedev 0bfa9359b0 [NFC][X86] extract-lowbits.ll: add patterns with truncation too
If we look past truncations of X too eagerly (D62786), we may
end up with a 64-bit 'BEXTR', even though a 32-bit one would suffice.

llvm-svn: 362319
2019-06-02 08:05:24 +00:00
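
A hedged sketch of the truncation pattern these tests cover (my own code, not the test file): for n < 32, the low-bits extract of a truncated 64-bit value fits a 32-bit BZHI/BEXTR.

```
#include <stdint.h>

/* Extract the low n bits (n < 32 assumed) of a truncated 64-bit value.
 * A 32-bit BZHI/BEXTR suffices here; looking through the truncation too
 * eagerly could instead produce a 64-bit one. */
uint32_t lowbits_of_trunc(uint64_t x, unsigned n) {
    uint32_t t = (uint32_t)x;            /* the truncation */
    return t & (((uint32_t)1 << n) - 1); /* keep low n bits */
}
```
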
Craig Topper fe699c32a2 [X86] Simplify the CHECK lines in vector-reduce-and/or/xor-widen.ll in similar way to r362308.
Forgot to do the widen forms when I was doing the others.

llvm-svn: 362310
2019-06-02 00:43:02 +00:00
Craig Topper 396a915c26 [X86] Add the SSE versions of PMULLW and PMULLD to isAssociativeAndCommutative.
llvm-svn: 362309
2019-06-02 00:42:58 +00:00
Craig Topper 4721fad972 [X86] Simplify the CHECK lines in vector-reduce-and/or/xor.
The AVX512BW and AVX512VL checks were never used. And AVX512 is the same
as AVX on all tests that weren't already split for AVX1 and AVX2.

llvm-svn: 362308
2019-06-02 00:07:52 +00:00
Craig Topper eeaecc63e9 [X86] Add avx512 command lines and test cases to machine-combiner.ll
llvm-svn: 362307
2019-06-02 00:07:48 +00:00
Simon Pilgrim 0d4a040510 [X86][AVX] Add tests for CONCAT(MOVDDUP(x),MOVDDUP(y))
llvm-svn: 362300
2019-06-01 14:05:46 +00:00
Roman Lebedev 1aaa23c0fc [NFC][Codegen] shift-amount-mod.ll: drop innermost operation
I initially added it so the tests would display, in both cases, whether
the binop w/ constant is sunk or hoisted. But as can be seen from the
'sub (sub C, %x), %y' test, that actually conceals the issues it is
supposed to test.

At least two more patterns are unhandled:
* 'add (sub C, %x), %y' - D62266
* 'sub (sub C, %x), %y'

llvm-svn: 362295
2019-06-01 11:08:29 +00:00
Craig Topper c288a19bb7 [X86] Add AVX512BF16 and AVX512VP2INTERSECT instructions to the load folding tables.
llvm-svn: 362288
2019-06-01 06:20:59 +00:00
Roman Lebedev 7c1ac8269a [NFC][Codegen] Add/sub constant-folding: add scalar tests too
Just for completeness.

llvm-svn: 362208
2019-05-31 08:23:48 +00:00
Craig Topper 31d00d80a2 [X86] Remove patterns for X86VSintToFP/X86VUintToFP+loadv4f32 to v2f64.
These patterns can incorrectly narrow a volatile load from 128-bits to 64-bits.
Similar to PR42079.

Switch to using (v4i32 (bitcast (v2i64 (scalar_to_vector (loadi64))))) as the
load pattern used in the instructions.

This probably still has issues in 32-bit mode where loadi64 isn't legal. Maybe
we should use VZMOVL for widened loads even when we don't need the upper bits
as zeroes?

llvm-svn: 362203
2019-05-31 07:38:26 +00:00
Craig Topper cded573710 [X86] Add test cases for failure to use 128-bit masked vcvtdq2pd when load starts as v2i32.
llvm-svn: 362202
2019-05-31 07:38:22 +00:00
Craig Topper 67d43e0744 [X86] Add test cases for a volatile load shrinking bug involving cvtdq2pd. NFC
Similar to PR42079

llvm-svn: 362201
2019-05-31 07:38:18 +00:00
Craig Topper cb0ad5accb [X86] Copy a test case from avx512-cvt.ll to avx512-cvt-widen.ll. NFC
llvm-svn: 362200
2019-05-31 07:38:14 +00:00
Craig Topper b79cc5f802 [X86] Remove avx512 isel patterns for fpextend+load. Prefer to only match fp extloads instead.
DAG combine will usually fold fpextend+load to an fp extload anyway. So the
256 and 512 patterns were probably unnecessary. The 128 bit pattern was special
in that it looked for a v4f32 load, but then used it in an instruction that
only loads 64-bits. This is bad if the load happens to be volatile. We could
probably make the patterns volatile aware, but that's more work for something
that's probably rare. The peephole pass might kick in and save us anyway. We
might also be able to fix this with some additional DAG combines.

This also adds patterns for vselect+extload to enable masked vcvtps2pd to be
used. Previously we looked for the unlikely vselect+fpextend+load.

llvm-svn: 362199
2019-05-31 06:21:53 +00:00
Craig Topper 73b07284df [X86] Add test to show missed opportunity to use masked vcvtps2pd for vselect+extload.
llvm-svn: 362198
2019-05-31 06:21:49 +00:00
Craig Topper 8cb076ec6e [X86] Add test case for PR42079. NFC
llvm-svn: 362197
2019-05-31 06:21:45 +00:00
Craig Topper 23066033a1 [X86] Correct the ins operand order for MASKPAIR16STORE to match other store instructions.
This makes the 5 address operands come first and the data operand come last.

This matches the operand order the instruction is created with. It's also the
expected order in X86MCInstLower. So everything appeared to work, but the
operands didn't match their declared type.

Fixes a -verify-machineinstrs failure.

Also remove the isel patterns from these instructions since they should only
be used for stack spills and reloads. I'm not even sure what types the patterns
were looking for to match.

llvm-svn: 362193
2019-05-31 05:20:27 +00:00
Pengfei Wang 2e67d0c842 [X86] Add VP2INTERSECT instructions
Support Intel AVX512 VP2INTERSECT instructions in llvm

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D62366

llvm-svn: 362188
2019-05-31 02:50:41 +00:00
Roman Lebedev a4e3b50e26 [DAGCombiner][X86][AArch64] (x - C) + y -> (x + y) - C fold. Try 2
Summary:
Only vector tests are being affected here,
since subtraction by scalar constant is rewritten
as addition by negated constant.

No surprising test changes.

https://rise4fun.com/Alive/pbT

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: javed.absar, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62257

llvm-svn: 362146
2019-05-30 20:37:49 +00:00
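
As a sanity check of this fold (and the sibling folds recommitted below), a small brute-force program; this is my own sketch rather than part of the commit, with uint8_t standing in for any wrapping integer width:

```
#include <assert.h>
#include <stdint.h>

/* Verify (x - C) + y == (x + y) - C exhaustively modulo 2^8. */
int main(void) {
    for (unsigned x = 0; x < 256; x++)
        for (unsigned y = 0; y < 256; y++)
            for (unsigned C = 0; C < 256; C++)
                assert((uint8_t)((uint8_t)(x - C) + y) ==
                       (uint8_t)((uint8_t)(x + y) - C));
    return 0;
}
```
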
Roman Lebedev 57aa36ff91 [DAGCombine] (x - C) - y -> (x - y) - C fold. Try 3
Summary:
Again, only vectors are affected. Frustrating. Let me look into that...

https://rise4fun.com/Alive/AAq

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs, and then reverted in
rL362109 to fix missing constant folds that were causing
endless combine loops.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: javed.absar, JDevlieghere, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62294

llvm-svn: 362145
2019-05-30 20:37:39 +00:00
Roman Lebedev 63b4741534 [DAGCombine][X86][AArch64][AMDGPU] (x - y) + -1 -> add (xor y, -1), x fold. Try 3
Summary:
This prevents regressions in next patch,
and somewhat recovers from the regression to AMDGPU test in D62223.

It is indeed not great that we leave the vector decrement as-is
and don't transform it into a vector add of all-ones.

https://rise4fun.com/Alive/ZRl

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs, and then reverted in
rL362109 to fix missing constant folds that were causing
endless combine loops.

Reviewers: RKSimon, craig.topper, spatel, arsenm

Reviewed By: RKSimon, arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62263

llvm-svn: 362144
2019-05-30 20:37:29 +00:00
Roman Lebedev 05ad5fd213 [DAGCombiner][X86][AArch64][SPARC][SystemZ] y - (x + C) -> (y - x) - C fold. Try 3
Summary:
Direct sibling of D62223 patch.
While I don't have a direct motivating pattern for this,
it would seem to make sense to handle both patterns (or none),
for symmetry.

The AArch64 changes look neutral;
SPARC and SystemZ look like an improvement (one less instruction each);
for x86, the 32-bit case improves, and the 64-bit case shows that an LEA no longer
gets constructed, which may be because that whole test uses `-mattr=+slow-lea,+slow-3ops-lea`.

https://rise4fun.com/Alive/ffh

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs, and then reverted in
rL362109 to fix missing constant folds that were causing
endless combine loops.

Reviewers: RKSimon, craig.topper, spatel, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, jyknight, javed.absar, kristof.beyls, fedor.sergeev, jrtc27, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62252

llvm-svn: 362143
2019-05-30 20:37:18 +00:00
Roman Lebedev 1d9ec7a81b [DAGCombiner][X86][AArch64][AMDGPU] (x + C) - y -> (x - y) + C fold. Try 3
Summary:
The main motivation is shown by all these `neg` instructions that are now created.
In particular, the `@reg32_lshr_by_negated_unfolded_sub_b` test.

AArch64 test changes all look good (`neg` created), or neutral.

X86 changes look neutral (vectors), or good (`neg` / `xor eax, eax` created).

I'm not sure about `X86/ragreedy-hoist-spill.ll`: it looks like the spill
is now hoisted into the preheader (which should still be good?),
and two 4-byte reloads become one 8-byte reload placed elsewhere,
but I'm not sure how that affects that loop.

I'm unable to interpret the AMDGPU change; it looks neutral-ish.

This is hopefully a step towards solving [[ https://bugs.llvm.org/show_bug.cgi?id=41952 | PR41952 ]].

https://rise4fun.com/Alive/pkdq (we are missing more patterns, i'll submit them later)

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs, and then reverted in
rL362109 to fix missing constant folds that were causing
endless combine loops.

Reviewers: craig.topper, RKSimon, spatel, arsenm

Reviewed By: RKSimon

Subscribers: bjope, qcolombet, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62223

llvm-svn: 362142
2019-05-30 20:36:54 +00:00
Roman Lebedev 7eb8b5b5dd [DAGCombine] ((c1-A)-c2) -> ((c1-c2)-A) constant-fold
Summary: https://rise4fun.com/Alive/B0A

Reviewers: t.p.northover, RKSimon, spatel, craig.topper

Reviewed By: RKSimon

Subscribers: javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62691

llvm-svn: 362135
2019-05-30 19:27:51 +00:00
Roman Lebedev 691b5e2ecc [DAGCombine] (A-C1)-C2 -> A-(C1+C2) constant-fold
Summary: https://rise4fun.com/Alive/Mb1M

Reviewers: RKSimon, craig.topper, spatel, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62689

llvm-svn: 362134
2019-05-30 19:27:42 +00:00
Roman Lebedev 0a3dbbcdfb [DAGCombine] (A+C1)-C2 -> A+(C1-C2) constant-fold
Summary:
Direct sibling of D62662, the root cause of the endless combine loop in D62257

https://rise4fun.com/Alive/d3W

Reviewers: RKSimon, craig.topper, spatel, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62664

llvm-svn: 362133
2019-05-30 19:27:32 +00:00
Roman Lebedev cc9a9cf237 [DAGCombine] ((A-c1)+c2) -> (A+(c2-c1)) constant-fold
Summary:
This was the root cause of the endless combine loop in D62257

https://rise4fun.com/Alive/d3W

Reviewers: RKSimon, spatel, craig.topper, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62662

llvm-svn: 362131
2019-05-30 19:27:19 +00:00
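
The four constant folds above can be checked the same way; this brute-force program is my own sketch (uint8_t as a stand-in for any wrapping width) and covers all of them:

```
#include <assert.h>
#include <stdint.h>

/* Verify all four constant-fold identities exhaustively modulo 2^8. */
int main(void) {
    for (unsigned A = 0; A < 256; A++)
        for (unsigned c1 = 0; c1 < 256; c1++)
            for (unsigned c2 = 0; c2 < 256; c2++) {
                assert((uint8_t)((c1 - A) - c2) == (uint8_t)((c1 - c2) - A));
                assert((uint8_t)((A - c1) - c2) == (uint8_t)(A - (c1 + c2)));
                assert((uint8_t)((A + c1) - c2) == (uint8_t)(A + (c1 - c2)));
                assert((uint8_t)((A - c1) + c2) == (uint8_t)(A + (c2 - c1)));
            }
    return 0;
}
```
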
Roman Lebedev 2ae4b33181 [NFC][Codegen] Potential add/sub constant folding: fixup non-splat tests
llvm-svn: 362114
2019-05-30 16:50:43 +00:00
Roman Lebedev 700fdb1070 [NFC][Codegen] Add better test coverage for potential add/sub constant folding
This adds hopefully-full test coverage for all the possible permutations:
First op is one of:
* x + c1
* x - c1
* c1 - x

Second op is one of:
* + c2
* - c2
* c2 -

And thus 3*3=9 patterns.
Some of them show missed constant-folds.

Without the previous patch (the revert), these tests were causing an endless
dagcombine loop. I really should have thought about this first :S

llvm-svn: 362110
2019-05-30 16:07:19 +00:00
Roman Lebedev 019d270e43 [DAGCombine] Revert of recommit of "binop-with-const hoisting" patches
I was looking into an endless combine loop the uncommitted follow-up patch
was causing, and it appears even these patches can exhibit such an
endless loop. The root cause is that we try to hoist one binop (add/sub) with
a constant operand, and if we get two such binops both of which are
eligible for this hoisting, we get stuck.

Some cases may highlight missing constant-folds.

Reverts r361871,r361872,r361873,r361874.

llvm-svn: 362109
2019-05-30 16:07:11 +00:00
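
A reconstruction (my reading of the message above, not taken from the log) of how two eligible hoists can get stuck:

```
/* With two binops-with-constant eligible for hoisting and no constant
 * fold to collapse the intermediate, the combiner can rewrite back and
 * forth forever:
 *
 *   ((A - c1) + c2)  ->  ((A + c2) - c1)  ->  ((A - c1) + c2)  ->  ...
 *
 * The constant folds committed above turn the first form directly into
 * A + (c2 - c1), breaking the cycle. */
int hoist_example(int A) { return (A - 1) + 2; /* should fold to A + 1 */ }
```
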
Roman Lebedev 8f220a5d2c [NFC][Codegen] Add add+sub/sub+add constant-fold tests from D62257
add+sub / sub+add where the second operands are constants should be folded
into a single add, just like add+add.

llvm-svn: 362093
2019-05-30 13:02:11 +00:00
Simon Pilgrim 32aac1727a [X86][SSE] Improve bool vector extload (PR26091)
We already have good codegen for (vXiY *ext(vXi1 bitcast(iX))) cases; this patch uses it for loads of vXi1 types as well, changing the load into an iX integer load and bitcasting so that combineToExtendBoolVectorInReg can then use it.

Differential Revision: https://reviews.llvm.org/D62449

llvm-svn: 362081
2019-05-30 10:25:20 +00:00
Pengfei Wang 1f67d94279 [X86] Add ENQCMD instructions
For more details about these instructions, please refer to the latest
ISE document:
https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference.

Patch by Tianqing Wang (tianqing)

Differential Revision: https://reviews.llvm.org/D62281

llvm-svn: 362053
2019-05-30 03:59:16 +00:00
Benjamin Kramer 107f8d9873 [DAGCombiner] Replace gathers with a zero mask with the passthru value
These can be created by the legalizer when splitting a larger gather.

See https://llvm.org/PR42055 for a motivating example.

Differential Revision: https://reviews.llvm.org/D62613

llvm-svn: 362015
2019-05-29 19:24:19 +00:00
Craig Topper e3a76fa1e2 [X86] Fix machineverifier error on avx512f-256-set0.mir
Previously the test ran the entire llc pipeline, which caused the IR to be codegened again.

This commit restricts it to just running the postrapseudos pass and checking the results of that instead of the final assembly.

llvm-svn: 361991
2019-05-29 17:02:27 +00:00
Peter Collingbourne 31fda09b2d Add IR support, ELF section and user documentation for partitioning feature.
The partitioning feature was proposed here:
http://lists.llvm.org/pipermail/llvm-dev/2019-February/130583.html

This is mostly just documentation. The feature itself will be contributed
in subsequent patches.

Differential Revision: https://reviews.llvm.org/D60242

llvm-svn: 361923
2019-05-29 03:29:01 +00:00
Adhemerval Zanella 6d7bf5e8df [CodeGen] Add lrint/llrint builtins
This patch adds the ISD::LRINT and ISD::LLRINT nodes along with new
intrinsics.  The changes are straightforward, as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an integer.

The idea is to optimize lrint/llrint generation for AArch64
in a subsequent patch.  The current semantics just route it to the libm
symbol.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D62017

llvm-svn: 361875
2019-05-28 20:47:44 +00:00
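
For reference, a minimal sketch of the libm functions the new nodes correspond to; the ISD mapping in the comments is per the commit message, and until the AArch64 optimization lands both simply lower to libm calls:

```
#include <math.h>

long to_long(double x)      { return lrint(x);   /* ISD::LRINT  */ }
long long to_llong(float x) { return llrintf(x); /* ISD::LLRINT */ }
```
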
Roman Lebedev dfc34f0211 [DAGCombine] (x - C) - y -> (x - y) - C fold. Try 2
Summary:
Again, only vectors are affected. Frustrating. Let me look into that...

https://rise4fun.com/Alive/AAq

This is a recommit, originally committed in rL361856, but reverted
to investigate test-suite compile-time hangs.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: javed.absar, JDevlieghere, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62294

llvm-svn: 361874
2019-05-28 20:40:10 +00:00
Roman Lebedev d485c6bc9f [DAGCombine][X86][AArch64][AMDGPU] (x - y) + -1 -> add (xor y, -1), x fold. Try 2
Summary:
This prevents regressions in next patch,
and somewhat recovers from the regression to AMDGPU test in D62223.

It is indeed not great that we leave the vector decrement as-is
and don't transform it into a vector add of all-ones.

https://rise4fun.com/Alive/ZRl

This is a recommit, originally committed in rL361855, but reverted
to investigate test-suite compile-time hangs.

Reviewers: RKSimon, craig.topper, spatel, arsenm

Reviewed By: RKSimon, arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62263

llvm-svn: 361873
2019-05-28 20:40:03 +00:00
Roman Lebedev 96c9986199 [DAGCombiner][X86][AArch64][SPARC][SystemZ] y - (x + C) -> (y - x) - C fold. Try 2
Summary:
Direct sibling of D62223 patch.
While I don't have a direct motivating pattern for this,
it would seem to make sense to handle both patterns (or none),
for symmetry.

The AArch64 changes look neutral;
SPARC and SystemZ look like an improvement (one less instruction each);
for x86, the 32-bit case improves, and the 64-bit case shows that an LEA no longer
gets constructed, which may be because that whole test uses `-mattr=+slow-lea,+slow-3ops-lea`.

https://rise4fun.com/Alive/ffh

This is a recommit, originally committed in rL361853, but reverted
to investigate test-suite compile-time hangs.

Reviewers: RKSimon, craig.topper, spatel, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, jyknight, javed.absar, kristof.beyls, fedor.sergeev, jrtc27, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62252

llvm-svn: 361872
2019-05-28 20:39:55 +00:00
Roman Lebedev 2feb7e56e2 [DAGCombiner][X86][AArch64][AMDGPU] (x + C) - y -> (x - y) + C fold. Try 2
Summary:
The main motivation is shown by all these `neg` instructions that are now created.
In particular, the `@reg32_lshr_by_negated_unfolded_sub_b` test.

AArch64 test changes all look good (`neg` created), or neutral.

X86 changes look neutral (vectors), or good (`neg` / `xor eax, eax` created).

I'm not sure about `X86/ragreedy-hoist-spill.ll`: it looks like the spill
is now hoisted into the preheader (which should still be good?),
and two 4-byte reloads become one 8-byte reload placed elsewhere,
but I'm not sure how that affects that loop.

I'm unable to interpret the AMDGPU change; it looks neutral-ish.

This is hopefully a step towards solving [[ https://bugs.llvm.org/show_bug.cgi?id=41952 | PR41952 ]].

https://rise4fun.com/Alive/pkdq (we are missing more patterns, i'll submit them later)

This is a recommit, originally committed in rL361852, but reverted
to investigate test-suite compile-time hangs.

Reviewers: craig.topper, RKSimon, spatel, arsenm

Reviewed By: RKSimon

Subscribers: bjope, qcolombet, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62223

llvm-svn: 361871
2019-05-28 20:39:39 +00:00
Roman Lebedev 272d70c366 Revert DAGCombine "hoist binop with const" folds
These appear to introduce a test-suite compile-time hang.

http://lab.llvm.org:8011/builders/clang-cmake-x86_64-sde-avx512-linux/builds/22825

This reverts r361852,r361853,r361854,r361855,r361856

llvm-svn: 361865
2019-05-28 19:04:21 +00:00
Roman Lebedev 7669665432 [DAGCombine] (x - C) - y -> (x - y) - C fold
Summary:
Again, only vectors are affected. Frustrating. Let me look into that...

https://rise4fun.com/Alive/AAq

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: javed.absar, JDevlieghere, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62294

llvm-svn: 361856
2019-05-28 17:54:21 +00:00
Roman Lebedev 8c9b3e4e4a [DAGCombine][X86][AArch64][AMDGPU] (x - y) + -1 -> add (xor y, -1), x fold
Summary:
This prevents regressions in next patch,
and somewhat recovers from the regression to AMDGPU test in D62223.

It is indeed not great that we leave the vector decrement as-is
and don't transform it into a vector add of all-ones.

https://rise4fun.com/Alive/ZRl

Reviewers: RKSimon, craig.topper, spatel, arsenm

Reviewed By: RKSimon, arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62263

llvm-svn: 361855
2019-05-28 17:54:13 +00:00
Roman Lebedev 6a24c9b9ab [DAGCombiner][X86][AArch64] (x - C) + y -> (x + y) - C fold
Summary:
Only vector tests are being affected here,
since subtraction by scalar constant is rewritten
as addition by negated constant.

No surprising test changes.

https://rise4fun.com/Alive/pbT

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: javed.absar, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62257

llvm-svn: 361854
2019-05-28 17:54:04 +00:00
Roman Lebedev 1499f65ac1 [DAGCombiner][X86][AArch64][SPARC][SystemZ] y - (x + C) -> (y - x) - C fold
Summary:
Direct sibling of D62223 patch.
While I don't have a direct motivating pattern for this,
it would seem to make sense to handle both patterns (or none),
for symmetry.

The AArch64 changes look neutral;
SPARC and SystemZ look like an improvement (one less instruction each);
for x86, the 32-bit case improves, and the 64-bit case shows that an LEA no longer
gets constructed, which may be because that whole test uses `-mattr=+slow-lea,+slow-3ops-lea`.

https://rise4fun.com/Alive/ffh

Reviewers: RKSimon, craig.topper, spatel, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, jyknight, javed.absar, kristof.beyls, fedor.sergeev, jrtc27, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62252

llvm-svn: 361853
2019-05-28 17:53:54 +00:00
Roman Lebedev 19f51ec04a [DAGCombiner][X86][AArch64][AMDGPU] (x + C) - y -> (x - y) + C fold
Summary:
The main motivation is shown by all these `neg` instructions that are now created.
In particular, the `@reg32_lshr_by_negated_unfolded_sub_b` test.

AArch64 test changes all look good (`neg` created), or neutral.

X86 changes look neutral (vectors), or good (`neg` / `xor eax, eax` created).

I'm not sure about `X86/ragreedy-hoist-spill.ll`: it looks like the spill
is now hoisted into the preheader (which should still be good?),
and two 4-byte reloads become one 8-byte reload placed elsewhere,
but I'm not sure how that affects that loop.

I'm unable to interpret the AMDGPU change; it looks neutral-ish.

This is hopefully a step towards solving [[ https://bugs.llvm.org/show_bug.cgi?id=41952 | PR41952 ]].

https://rise4fun.com/Alive/pkdq (we are missing more patterns, i'll submit them later)

Reviewers: craig.topper, RKSimon, spatel, arsenm

Reviewed By: RKSimon

Subscribers: bjope, qcolombet, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, javed.absar, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62223

llvm-svn: 361852
2019-05-28 17:53:43 +00:00
Sanjay Patel f7980e727f Revert "[x86] split 256-bit store of concatenated vectors"
This reverts commit d5a8637072.

Most likely suspect for this bot failure:
http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/9684

llvm-svn: 361850
2019-05-28 17:37:58 +00:00
David Greene 561fcc0d63 [X86-64] Fix 256-bit SET0 lowering for non-VLX targets
If we don't have VLX then 256-bit SET0 should be lowered
to VPXOR with ZMM registers.  This restores functionality
accidentally removed by r309926.

Differential Revision: https://reviews.llvm.org/D62415

llvm-svn: 361843
2019-05-28 15:37:01 +00:00
Sanjay Patel d5a8637072 [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is the reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 361822
2019-05-28 13:54:17 +00:00
Sanjay Patel 6bf4ca9d2e [x86] fix 256-bit vector store splitting to honor 'volatile'
Forking this out of the discussion in D62498
(and assuming that will be committed later, so adding the helper function here).
The LangRef says:
"the backend should never split or merge target-legal volatile load/store instructions."

Differential Revision: https://reviews.llvm.org/D62506

llvm-svn: 361815
2019-05-28 12:58:07 +00:00
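
A sketch (using the GNU vector_size extension; the names are mine) of the access class this protects:

```
/* 256-bit vector of 8 floats via the GNU vector_size extension. */
typedef float v8sf __attribute__((vector_size(32)));

/* Per the LangRef rule quoted above, if this 256-bit store is target-legal
 * (e.g. with AVX), the backend must not split it into two 128-bit halves,
 * even where splitting would otherwise be profitable. */
void store256(volatile v8sf *p, v8sf v) { *p = v; }
```
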
Benjamin Kramer 57e267a2e9 [X86] Custom lower CONCAT_VECTORS of v2i1
The generic legalizer cannot handle this. Add an assert instead of
silently miscompiling vectors with elements smaller than 8 bits.

llvm-svn: 361814
2019-05-28 12:52:57 +00:00
Sanjay Patel 165663aeeb [x86] add test to show volatile store splitting; NFC
From the LangRef:
"the backend should never split or merge target-legal
volatile load/store instructions."

See also:
D62498

llvm-svn: 361785
2019-05-27 23:56:41 +00:00
Matt Arsenault ca84c4be4b RegAllocFast: Set MayLiveAcrossBlocks when allocating uses
Setting mayLiveOut based only on use instructions after allocating the
def block did not work if the use block was allocated before the def
block, since the virtual register uses were already removed.

Fixes bug 41973.

llvm-svn: 361781
2019-05-27 20:37:31 +00:00
David L. Jones 0ff41b8a5a Revert r361356: "[MIR] Add simple PRE pass to MachineCSE"
This is problematic on buildbots, as discussed here: https://reviews.llvm.org/rL361356

It seems like the plan already was to revert, but that hasn't happened yet.

llvm-svn: 361746
2019-05-27 06:00:00 +00:00
Simon Pilgrim a044410f37 [X86][SSE] Add shuffle combining support for ISD::ANY_EXTEND_VECTOR_INREG
Reuses what we already have in place for ISD::ZERO_EXTEND_VECTOR_INREG, just with a different sentinel.

llvm-svn: 361734
2019-05-26 16:00:35 +00:00
Simon Pilgrim 58a8541dcc [X86][AVX] combineBitcastvxi1 - peek through bitops to determine size of original vector
We were only testing for direct SETCC results - this allows us to peek through AND/OR/XOR combinations of the comparison results as well.

There's a missing SEXT(PACKSS) fold that I need to investigate for v8i1 cases before I can enable it there as well.

llvm-svn: 361716
2019-05-26 10:54:23 +00:00
Simon Pilgrim 40fa52b174 [X86] lowerBuildVectorToBitOp - support build_vector(shift()) -> shift(build_vector(),C)
Commonly occurs in sign-extension cases

llvm-svn: 361706
2019-05-25 18:02:17 +00:00
Nikita Popov d87eceda0e [X86] Combine fminnum/fmaxnum with non-nan operand to fmin/fmax
If we have a known non-NaN operand, place it in the second operand
of fmin/fmax, the one that is returned if either operand is NaN.

Differential Revision: https://reviews.llvm.org/D62448

llvm-svn: 361704
2019-05-25 16:44:29 +00:00
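
A small example of the case this targets (my own, hedged): fminf must return the non-NaN operand when the other is NaN, while x86's MIN instructions return their second source operand on NaN, so a known non-NaN constant belongs in that position.

```
#include <math.h>

/* If x is NaN, fminf(x, 100.0f) must return 100.0f. MINSS(x, 100.0f)
 * returns its second operand when either input is NaN, so with the
 * known non-NaN constant second, the cheap instruction matches the
 * libm semantics exactly. */
float clamp_to_100(float x) { return fminf(x, 100.0f); }
```
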
Simon Pilgrim 34d5a74b03 [X86][SSE] vector-sext - cleanup prefix lists
Add X32-SSE common prefix to merge some checks

llvm-svn: 361702
2019-05-25 16:33:17 +00:00
Sanjay Patel 3f0905e46f [SelectionDAG] define binops as a superset of commutative binops
The test diffs show improved vector narrowing for integer min/max opcodes because
those were all absent from the list. I'm not sure if we can expose functional diffs
for all of the moved/added opcodes though.

It seems like we are missing an AVX512 opportunity to use 256-bit ops in place of
512-bit ops on some tests/targets, but I think that can be a follow-up.

Preliminary steps to make sure the callers are not misusing these queries:
rL361268
rL361547

Differential Revision: https://reviews.llvm.org/D62191

llvm-svn: 361701
2019-05-25 15:28:55 +00:00
Nikita Popov c9de92ee76 [X86] Add tests for min/maxnum with const operand; NFC
llvm-svn: 361700
2019-05-25 15:06:54 +00:00
Craig Topper 46e5052b8e [X86FixupLEAs] Turn optIncDec into a generic two address LEA optimizer. Support LEA64_32r properly.
INC/DEC is really a special case of a more generic issue. We should also turn leas into add reg/reg or add reg/imm regardless of the slow lea flags.

This also supports LEA64_32 which has 64 bit input registers and 32 bit output registers. So we need to convert the 64 bit inputs to their 32 bit equivalents to check if they are equal to base reg.

One thing to note: the original code preserved the kill flags by adding operands to the new instruction instead of using addReg. But I think tied operands aren't supposed to have the kill flag set. I dropped the kill flags, but I could probably try to preserve them in the add reg/reg case if we think it's important. I'm not sure which operand it's supposed to go on for the LEA64_32r instruction due to the super-register implicit uses. Though I'm also not sure those are needed, since they were probably just created by an INSERT_SUBREG from a 32-bit input.

Differential Revision: https://reviews.llvm.org/D61472

llvm-svn: 361691
2019-05-25 06:17:47 +00:00
Simon Pilgrim 95b8d9bbf8 [SelectionDAG] computeKnownBits - support constant pool values from target
This patch adds the overridable TargetLowering::getTargetConstantFromLoad function which allows targets to return any constant value loaded by a LoadSDNode node - only X86 makes use of this so far but everything should be in place for other targets.

computeKnownBits then uses this function to improve codegen, notably vector code after legalization.

A future commit will do the same for ComputeNumSignBits but computeKnownBits sees the bigger benefit.

This required a couple of fixes:
* SimplifyDemandedBits must early-out for getTargetConstantFromLoad cases to prevent infinite loops of constant regeneration (similar to what we already do for BUILD_VECTOR).
* Fix a DAGCombiner::visitTRUNCATE issue as we had trunc(shl(v8i32),v8i16) <-> shl(trunc(v8i16),v8i32) infinite loops after legalization on AVX512 targets.

Differential Revision: https://reviews.llvm.org/D61887

llvm-svn: 361620
2019-05-24 10:03:11 +00:00
Craig Topper af0add6c39 [X86] Add test case that was supposed to go with r360102.
Found in my working area. Guess I forgot 'git add' before committing.

llvm-svn: 361599
2019-05-24 04:46:56 +00:00
Robert Lougher 170dfeb2ff Resubmit r360436 "[X86] Avoid SFB - Fix inconsistent codegen with/without debug info"
Fixes https://bugs.llvm.org/show_bug.cgi?id=40969

The functions findPotentiallyBlockedCopies and buildCopy are currently not
accounting for the presence of debug instructions. In the former this results
in the optimization not being triggered, and in the latter results in
inconsistent codegen.

This patch enables the optimization to be performed in a debug build and
ensures the codegen is consistent with non-debug builds.

Patch by Chris Dawson.

Differential Revision: https://reviews.llvm.org/D61680

llvm-svn: 361527
2019-05-23 18:15:12 +00:00
Shoaib Meenai 87226a7202 [AsmPrinter] Treat a narrowing PtrToInt like Trunc
When printing assembly for PtrToInt, AsmPrinter::lowerConstant
incorrectly assumed that if PtrToInt was not converting to an
int with exactly the same number of bits, it must be widening
to a larger int. But this isn't necessarily true; PtrToInt can
also shrink the size, which is useful when you want to produce
a known 32-bit pointer on a 64-bit platform (on x86_64 ELF
this yields a R_X86_64_32 relocation).

The old behavior of falling through to the widening case for a
narrowing PtrToInt yields bogus assembly code like this, which
fails to assemble because the no-op bitwise AND it accidentally
creates is not a valid relocation:

```
        .long   a&-1
```

The fix is to treat a narrowing PtrToInt exactly the same as
it already treats Trunc: just emit the expression and let
the assembler deal with truncating it in the appropriate way.

Patch by Mat Hostetter <mjh@fb.com>.

Differential Revision: https://reviews.llvm.org/D61325

llvm-svn: 361508
2019-05-23 16:29:09 +00:00
Simon Pilgrim 46806749ac [X86] Regenerate LZCNT tests on x86/x32/x64 targets
llvm-svn: 361495
2019-05-23 13:30:10 +00:00
Roman Lebedev 32d976bac1 [NFC][X86] Fix check prefixes and autogenerate fold-pcmpeqd-2.ll test
This test is affected in an uncertain way by the
(sub %x, c) -> (add %x, (sub 0, c)) patch.

llvm-svn: 361483
2019-05-23 10:55:13 +00:00
Fangrui Song 86c9ca48c3 [X86] Support -fno-plt __tls_get_addr calls
In general dynamic/local dynamic TLS models, with -fno-plt,

* x86: emit `calll *___tls_get_addr@GOT(%ebx)` instead of `calll ___tls_get_addr@PLT`
  Note, on x86, if we can get rid of %ebx as the PIC register,
  it may be better to use a register not preserved across function calls.
* x86_64: emit `callq *__tls_get_addr@GOTPCREL(%rip)` instead of `callq __tls_get_addr@PLT`

Reorganize the code by separating 32-bit and 64-bit.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D62106

llvm-svn: 361453
2019-05-23 01:05:13 +00:00
Craig Topper 93f38e1f1a [X86] Explicitly disable VEXTRACT instruction matching for an immediate of 0. Remove a bunch of isel patterns that become unnecessary.
We effectively had a second set of isel patterns that tried to use a
regular store instruction and an extract_subreg instruction. Or a masked move
and an extract_subreg. These patterns were intended to override the
matching of VEXTRACT instructions by taking advantage of the priority
of the explicit immediate 0 for the index.

This patch instead just disables the immediate 0 matching in the VEXTRACT
patterns. This way each of the component pieces of the larger patterns will
match by itself.

This found a bug of sorts where we didn't use a 128-bit store for 512->128
extracts on KNL. It's unclear what the right thing here should be.
Using the vextract avoids constraining the register allocator to use
xmm0-15. But it always results in a longer encoding if the register
allocator ends up choosing xmm0-15 anyway.

llvm-svn: 361431
2019-05-22 21:00:18 +00:00
Craig Topper 9816d55776 [X86][InstCombine] Remove InstCombine code that turns X86 round intrinsics into llvm.ceil/floor. Remove some isel patterns that existed because that was happening.
We were turning roundss/sd/ps/pd intrinsics with immediates of 1 or 2 into
llvm.floor/ceil.  The llvm.ceil/floor intrinsics are supposed to correspond
to the libm functions.  For the libm functions we need to disable the
precision exception so the llvm.floor/ceil functions should always map to
encodings 0x9 and 0xA.

We had a mix of isel patterns where some used 0x9 and 0xA and others used
0x1 and 0x2. We need to be consistent and always use 0x9 and 0xA.

Since we have no way in isel of knowing where the llvm.ceil/floor came
from, we can't map X86 specific intrinsics with encodings 1 or 2 to it.
We could map 0x9 and 0xA to llvm.ceil/floor instead, but I'd really like
to see a use case and optimization advantage first.

I've left the backend test cases to show the blend we now emit without
the extra isel patterns. But I've removed the InstCombine tests completely.

llvm-svn: 361425
2019-05-22 20:04:55 +00:00
Roman Lebedev 5e1ce15c5d [NFC][X86][AArch64] Add tests for missing (x - y) + -1 -> not(y) + x fold
https://rise4fun.com/Alive/OaY

llvm-svn: 361409
2019-05-22 16:58:26 +00:00
Roman Lebedev 6a53135698 [NFC][X86] Autogenerate negative-offset.ll test
This test is affected by an upcoming patch.

llvm-svn: 361396
2019-05-22 15:34:43 +00:00
Roman Lebedev 406421b332 [NFC][X86][AArch64] Rewrite sink-addsub-of-const.ll tests to have full permutation coverage
Somehow I missed some patterns initially...
While there, add comments.

llvm-svn: 361390
2019-05-22 14:42:41 +00:00
Anton Afanasyev df00c6a54f [MIR] Add simple PRE pass to MachineCSE
This is the second part of the commit fixing PR38917 (hoisting
partially redundant machine instructions). Most of the PRE (partial
redundancy elimination) and CSE work is done on LLVM IR, but some
redundancy arises during DAG legalization. Machine CSE is not enough
to deal with it. This simple PRE implementation works a little bit
intricately: it runs before CSE, looking for partial redundancy
and transforming it into full redundancy, anticipating that the next
CSE step will eliminate this created redundancy. If CSE doesn't
eliminate it, the created instruction will remain dead and be eliminated
later by the Remove Dead Machine Instructions pass.

The third part of the commit is supposed to refactor MachineCSE,
to make it clearer and to merge MachinePRE with MachineCSE,
so one need not rely on the later Remove Dead pass to clear instructions
not eliminated by CSE.

First step: https://reviews.llvm.org/D54839

Fixes llvm.org/PR38917

llvm-svn: 361356
2019-05-22 07:41:34 +00:00
Nikita Popov 15df05152d [X86] Don't compare i128 through vector if construction not cheap (PR41971)
Fix for https://bugs.llvm.org/show_bug.cgi?id=41971. Make the
combineVectorSizedSetCCEquality() transform more conservative by
checking that the bitcast to the vector type will be cheap/free
for both operands. I'm considering it cheap if it's a constant,
a load or already a vector. I've dropped the explicit check for
f128 because it should fall out naturally (in the cases where
it'd be detrimental).

Differential Revision: https://reviews.llvm.org/D62220

llvm-svn: 361352
2019-05-22 06:47:06 +00:00
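
The shape of the comparison in question, sketched in C (using the compiler's __int128 extension; my example, not from the patch):

```
/* An i128 equality. combineVectorSizedSetCCEquality() can turn this into
 * a 128-bit vector compare, but per this fix only when the bitcasts into
 * vector registers are cheap (constants, loads, or values already in
 * vectors); otherwise a direct pair of 64-bit compares is preferred. */
int eq128(__int128 a, __int128 b) { return a == b; }
```
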
Pengfei Wang 6a0d432e9e [X86] [CET] Deal with return-twice functions such as vfork, setjmp when CET-IBT enabled

Return-twice functions will indirectly jump back to a point after the call site.
So when CET-IBT is enabled, we should make sure there are endbr*
instructions following the callers of these return-twice functions, like GCC does.

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D61881

llvm-svn: 361342
2019-05-22 00:50:21 +00:00
Roman Lebedev 21e8ec8d4f [NFC][X86] Autogenerate ragreedy-hoist-spill.ll test
llvm-svn: 361319
2019-05-21 21:49:10 +00:00
Nikita Popov d34d96770e [X86] Add large integer comparison tests for PR41971; NFC
In these cases we would prefer a direct comparison over going through
a vector type.

llvm-svn: 361315
2019-05-21 21:27:08 +00:00
Roman Lebedev a7e88f8570 [NFC][X86][AArch64] Add tests for sinking of add/sub by constant through add/sub
Looks like we can transform all 8 variants of the pattern:
https://rise4fun.com/Alive/auH

This comes up as an issue on the path towards
https://bugs.llvm.org/show_bug.cgi?id=41952

llvm-svn: 361303
2019-05-21 20:14:54 +00:00
Leonard Chan 0bada7ce6c [Intrinsic] Signed Fixed Point Saturation Multiplication Intrinsic
Add an intrinsic that takes 2 signed integers, with their scale provided
as the third argument, and performs fixed-point multiplication on them. The
result is saturated and clamped between the largest and smallest representable
values of the first 2 operands.

This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.

Differential Revision: https://reviews.llvm.org/D55720

llvm-svn: 361289
2019-05-21 19:17:19 +00:00
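
A scalar model of the intrinsic's semantics for 32-bit operands, written under the description above (my sketch, not the in-tree lowering; scale is assumed to be less than the bit width):

```
#include <stdint.h>

int32_t smul_fix_sat_i32(int32_t a, int32_t b, unsigned scale) {
    /* Widen, multiply, then drop 'scale' fractional bits (arithmetic
     * shift; implementation-defined for negative values in pre-C23 C,
     * but universal in practice). */
    int64_t prod = ((int64_t)a * (int64_t)b) >> scale;
    /* Saturate to the representable 32-bit range. */
    if (prod > INT32_MAX) return INT32_MAX;
    if (prod < INT32_MIN) return INT32_MIN;
    return (int32_t)prod;
}
```
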
Simon Pilgrim 4b82e50315 [X86][SSE] computeKnownBitsForTargetNode - add X86ISD::ANDNP support
Fixes PACKSS-PSHUFB shuffle regressions mentioned on D61692

llvm-svn: 361270
2019-05-21 15:20:24 +00:00
Roman Lebedev d8db224ecb [NFC][X86][AArch64] Shift amount masking: tests that show that 'neg' doesn't last
Meaning, if we were to produce 'neg' in dagcombine, we would get an
endless cycle; some inverse transform would need to be guarded somehow.

Also, the 'and (sub 0, x), 31' variant is sticky;
it doesn't get optimized in any way.

https://bugs.llvm.org/show_bug.cgi?id=41952

llvm-svn: 361254
2019-05-21 13:04:56 +00:00
Simon Pilgrim bc03bee66b [X86][SSE] Add shuffle tests for 'splat3' patterns.
Test codegen from shuffles for { dst[0] = dst[1] = dst[2] = *src++; dst += 3 } 'splatting' memcpy patterns generated by loop-vectorizer.

llvm-svn: 361243
2019-05-21 11:42:28 +00:00
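
The 'splat3' pattern from the message, as a compilable function for reference:

```
/* dst[0] = dst[1] = dst[2] = *src++; dst += 3 -- the 'splatting' memcpy
 * shape the loop vectorizer generates shuffles for. */
void splat3(char *dst, const char *src, int n) {
    for (int i = 0; i < n; i++) {
        dst[0] = dst[1] = dst[2] = *src++;
        dst += 3;
    }
}
```
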
Roman Lebedev 2aee73f591 [NFC][X86][AArch64] Add some more tests for shift amount masking
The negation creation should be more eager:
https://bugs.llvm.org/show_bug.cgi?id=41952

llvm-svn: 361241
2019-05-21 11:14:01 +00:00
Craig Topper e97e52757c [X86] Add test case for r361177.
That commit makes sure we flush PendingExports in SelectionDAGBuilder
before we create INLINEASM_BR. Unfortunately, I haven't yet found
a CodeGen failure without that change.

This commit uses the debug output from SelectionDAG to at least
ensure we build the DAG correctly.

llvm-svn: 361179
2019-05-20 17:37:52 +00:00
Nikita Popov 9060b6df97 [SDAG] Vector op legalization for overflow ops
Fixes issue reported by aemerson on D57348. Vector op legalization
support is added for uaddo, usubo, saddo and ssubo (umulo and smulo
were already supported). As usual, by extracting TargetLowering methods
and calling them from vector op legalization.

Vector op legalization doesn't really deal with multiple result nodes,
so I'm explicitly performing a recursive legalization call on the
result value that is not being legalized.

There are some existing test changes because expansion happens
earlier, so we don't get a DAG combiner run in between anymore.

Differential Revision: https://reviews.llvm.org/D61692

llvm-svn: 361166
2019-05-20 16:09:22 +00:00
Sander de Smalen f83cccf917 Match types of accumulator and result for llvm.experimental.vector.reduce.fadd/fmul
The scalar start/accumulator value of the fadd- and fmul reduction
should match the result type of the reduction, as well as the vector
element-type of the input vector. Although this was not explicitly
specified in the LangRef, it was taken for granted in code implementing
the reductions. The patch also fixes the LangRef by adding this
constraint.

Reviewed By: aemerson, nikic

Differential Revision: https://reviews.llvm.org/D60260

llvm-svn: 361133
2019-05-20 09:54:06 +00:00
Simon Pilgrim 065431c82b [X86][SSE] Fold movmsk(not(x)) -> not(movmsk)
Helps to improve folding of comparisons with movmsk results.

llvm-svn: 361056
2019-05-17 17:56:25 +00:00
Simon Pilgrim 2c2f8e74b9 [X86][SSE] Match all-of bool scalar reductions into a bitcast/movmsk + cmp.
Same as what we do for vector reductions in combineHorizontalPredicateResult: use movmsk+cmp for scalar and(extract(x,0),extract(x,1)) reduction patterns.

llvm-svn: 361052
2019-05-17 17:25:55 +00:00
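
An intrinsics-level sketch (mine, not from the patch) of the kind of all-of reduction that can become a single movmsk + cmp:

```
#include <immintrin.h>

/* True iff every lane compares less-than: the per-lane compare results
 * collapse into a 4-bit sign mask, and 'all of' is one equality check. */
int all_less(__m128 x, __m128 y) {
    __m128 m = _mm_cmplt_ps(x, y);
    return _mm_movemask_ps(m) == 0xF;
}
```
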
Roman Lebedev 64c756b991 [DAGCombiner] visitShiftByConstant(): drop bogus signbit check
Summary:
That check claims that the transform is illegal otherwise.
That isn't true:
1. For `ISD::ADD`, we only process `ISD::SHL` outer shift => sign bit does not matter
   https://rise4fun.com/Alive/K4A
2. For `ISD::AND`, there is no restriction on constants:
   https://rise4fun.com/Alive/Wy3
3. For `ISD::OR`, there is no restriction on constants:
   https://rise4fun.com/Alive/GOH
4. For `ISD::XOR`, there is no restriction on constants:
   https://rise4fun.com/Alive/ml6

So, why is it there then?

This changes the testcase that was touched by @spatel in rL347478,
but I'm not sure that test tests anything in particular.

Reviewers: RKSimon, spatel, craig.topper, jojo, rengolin

Reviewed By: spatel

Subscribers: javed.absar, llvm-commits, spatel

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61918

llvm-svn: 361044
2019-05-17 15:52:58 +00:00
Simon Pilgrim 62c7032c18 [X86][AVX] isNOT - add extract_subvector(xor X, -1) -> extract_subvector(X) fold.
Prep work for the removal of the remaining x86 CTTZ vector lowering.

llvm-svn: 361035
2019-05-17 14:04:56 +00:00
Clement Courbet 632dfdda16 Re-land r360859: "[MergeICmps] Simplify the code."
With a fix for PR41917: The predecessor list was changing under our feet.

-  for (BasicBlock *Pred : predecessors(EntryBlock_)) {
+  while (!pred_empty(EntryBlock_)) {
+    BasicBlock* const Pred = *pred_begin(EntryBlock_);

llvm-svn: 361009
2019-05-17 09:43:45 +00:00
Nico Weber d764e7c660 Revert r360859: "Reland r360771 "[MergeICmps] Simplify the code.""
It caused PR41917.

llvm-svn: 360963
2019-05-17 00:43:53 +00:00
Craig Topper f09b9d419f [X86] Use 0x9 instead of 0x1 as the immediate in some masked floor pattern. Similarly change 0x2 to 0xA for ceil.
This suppresses exceptions, which is what we should be doing for ceil and floor. We already use the correct immediate
in patterns without masking.

llvm-svn: 360915
2019-05-16 16:53:50 +00:00
Adhemerval Zanella 73643b5041 [CodeGen] Add lround/llround builtins
This patch add the ISD::LROUND and ISD::LLROUND along with new
intrinsics.  The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.

The idea is to optimize lround/llround generation for AArch64
in a subsequent patch.  Current semantic is just route it to libm
symbol.

llvm-svn: 360889
2019-05-16 13:15:27 +00:00
Matt Arsenault 828b685ebe RegAllocFast: Improve hinting heuristic
Trace through multiple COPYs when looking for a physreg source. Add
hinting for vregs that will be copied into physregs (we only hinted
for vregs getting copied to a physreg previously).  Give a hinted
register a bonus when deciding which value to spill.  This is part of
my rewrite regallocfast series. In fact this one doesn't even have an
effect unless you also flip the allocation to happen from back to
front of a basic block. Nonetheless it helps to split this up to ease
review of D52010

Patch by Matthias Braun

llvm-svn: 360887
2019-05-16 12:50:39 +00:00
Roman Lebedev 62650cf464 [NFC] Fixup FileCheck option name in tests added in rL360881
llvm-svn: 360884
2019-05-16 12:39:34 +00:00
Roman Lebedev ec6608d547 [NFC][CodeGen] Add some more tests for pulling binops through shifts
The ashr variant may see relaxation in https://reviews.llvm.org/D61918

llvm-svn: 360881
2019-05-16 12:26:53 +00:00
Clement Courbet c4fdd717ef Reland r360771 "[MergeICmps] Simplify the code."
This revision does not seem to be the culprit.

llvm-svn: 360859
2019-05-16 06:18:02 +00:00
Craig Topper e43bdf144c [X86] Delay creating index register negations during address matching until after we know for sure the match will succeed
If we're trying to match an LEA, it's possible the LEA match will be deemed unprofitable, in which case the negation we created in matchAddress would be left dangling in the SelectionDAG. This could artificially increase use counts for other nodes in the DAG, though I don't have an example of that. It just seems like bad form to have dangling nodes in isel.

Differential Revision: https://reviews.llvm.org/D61047

llvm-svn: 360823
2019-05-15 21:59:53 +00:00
Reid Kleckner 4882490349 [codeview] Fix SDNode representation of annotation labels
Before this change, they were erroneously constructed with the EH_LABEL
SDNode opcode, which caused other passes to interact with them in
incorrect ways. See the FIXME about fastisel that this addresses in the
existing test case.

Fixes PR41890

llvm-svn: 360818
2019-05-15 21:46:05 +00:00
Nicolai Haehnle 664ceeda68 RegAlloc: try to fail more gracefully when out of registers
Summary:
The emitError path allows the program to continue, unlike report_fatal_error.
This is friendlier to use cases where LLVM is embedded in a larger program,
because the caller may be able to deal with the error somewhat gracefully.

Change the number of requested NOP bytes in the AArch64 and PowerPC
test cases to avoid triggering an unrelated assertion. The compilation
still fails, as verified by the test.

Change-Id: Iafb9ca341002a597b82e59ddc7a1f13c78758e3d

Reviewers: arsenm, MatzeB

Subscribers: qcolombet, nemanjai, wdng, javed.absar, kristof.beyls, kbarton, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61489

llvm-svn: 360786
2019-05-15 17:29:58 +00:00
Clement Courbet eaf4413d2d Revert r360771 "[MergeICmps] Simplify the code."
Breaks a bunch of builbdots.

llvm-svn: 360776
2019-05-15 14:21:59 +00:00
Clement Courbet 157ae639fa [MergeICmps] Simplify the code.
Instead of patching the original blocks, we now generate new blocks and
delete the old blocks. This results in simpler code with a less twisted
control flow (see the change in `entry-block-shuffled.ll`).

This will make https://reviews.llvm.org/D60318 simpler by making it more
obvious where control flow is created and deleted.

Reviewers: gchatelet

Subscribers: hiraditya, llvm-commits, spatel

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61736

llvm-svn: 360771
2019-05-15 13:04:24 +00:00
Craig Topper 384d46c0d5 [X86] Use OR32mi8Locked instead of LOCK_OR32mi8 in emitLockedStackOp.
They encode the same way, but OR32mi8Locked has hasUnmodeledSideEffects set,
which should be stronger than the mayLoad/mayStore on LOCK_OR32mi8. I think
this makes sense since we are using it as a fence.

This also seems to hide the operation from the speculative load hardening pass
so I've reverted r360511.

llvm-svn: 360747
2019-05-15 04:15:46 +00:00
Fangrui Song f4dfd63c74 [IR] Disallow llvm.global_ctors and llvm.global_dtors of the 2-field form in textual format
The 3-field form was introduced by D3499 in 2014 and the legacy 2-field
form was planned to be removed in LLVM 4.0

For the textual format, this patch migrates the existing 2-field form to
use the 3-field form and deletes the compatibility code.
test/Verifier/global-ctors-2.ll checks we have a friendly error message.

For bitcode, lib/IR/AutoUpgrade UpgradeGlobalVariables will upgrade the
2-field form (add i8* null as the third field).

Reviewed By: rnk, dexonsmith

Differential Revision: https://reviews.llvm.org/D61547

llvm-svn: 360742
2019-05-15 02:35:32 +00:00
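
For reference, the source-level construct that populates llvm.global_ctors; after this change its textual IR entry always carries the third (associated-data) field, null here:

```
/* Emits an llvm.global_ctors entry; in the 3-field form the entry is
 * { i32 65535, void ()* @init, i8* null }. */
__attribute__((constructor)) static void init(void) {
    /* runs before main() */
}
```
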
Philip Reames 445f942fc4 Use an offset from TOS for idempotent rmw locked op lowering
This was the portion split off D58632 so that it could follow the redzone API cleanup. Note that I changed the preferred offset from -8 to -64. The difference should be very minor, but I thought it might help address one concern which had been previously raised.

Differential Revision: https://reviews.llvm.org/D61862

llvm-svn: 360719
2019-05-14 22:32:42 +00:00
Roman Lebedev 7baf528aba [NFC][CodeGen][X86][AArch64] Add and-const-mask + const-shift pattern tests
Unlike instcombine, we currently don't turn and+shift into shift+and.
We probably should, likely unconditionally.

While I'm adding only an all-ones (potentially shifted) mask,
this obviously isn't limited to any particular mask pattern:
https://rise4fun.com/Alive/kmX

Related to https://bugs.llvm.org/show_bug.cgi?id=41874

llvm-svn: 360706
2019-05-14 20:17:04 +00:00
Simon Pilgrim c2d9cfd925 [X86] Disable shouldFoldConstantShiftPairToMask for scalar shifts on AMD targets (PR40758)
D61068 handled vector shifts; this patch does the same for scalars where there is a similar number of pipes for shifts as for bit ops. This is true almost entirely for AMD targets, where the scalar ALUs are well balanced.

This combine avoids AND immediate mask which usually means we reduce encoding size.

Some tests show use of a (slow, scaled) LEA instead of SHL in some cases, but that's due to particular shift immediates; shift+mask generates these just as easily.

Differential Revision: https://reviews.llvm.org/D61830

llvm-svn: 360684
2019-05-14 15:21:28 +00:00
Philip Reames 3098e44daa [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit targets
This is a follow on to D58632, with the same logic. Given a memory operation which needs ordering, but doesn't need to modify any particular address, prefer to use a locked stack op over an mfence.

Differential Revision: https://reviews.llvm.org/D61863

llvm-svn: 360649
2019-05-14 04:43:37 +00:00
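
The scenario this changes, in source form (a sketch; on a 32-bit x86 target this store needs a trailing full fence, which now becomes a locked stack op instead of mfence):

```
#include <stdatomic.h>
#include <stdint.h>

void publish(_Atomic int64_t *p, int64_t v) {
    /* seq_cst 64-bit store: on 32-bit x86 the store itself must be
     * followed by a full barrier to get sequential consistency. */
    atomic_store_explicit(p, v, memory_order_seq_cst);
}
```
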
Craig Topper cc761e6fae [X86] Use X86 instead of X32 as a check prefix in atomic-idempotent.ll. NFC
X32 can refer to a 64-bit ABI that uses 32-bit ints, longs, and pointers.

I plan to add gnux32 command lines to this test so this prepares for that.

Also remove some check lines that have a prefix that is not in any run lines.

llvm-svn: 360642
2019-05-14 03:07:56 +00:00
Sanjay Patel 3a13d970aa [SDAG, x86] allow targets to override test for binop opcodes
This follows the pattern of the existing isCommutativeBinOp().

x86 shows improvements from vector narrowing for the min/max opcodes.

llvm-svn: 360639
2019-05-14 00:39:40 +00:00
Robert Lougher 91a9d4ef4b Revert [X86] Avoid SFB - Fix inconsistent codegen with/without debug info
Revert r360436 as it is causing clang-x64-windows-msvc buildbot to fail.

llvm-svn: 360606
2019-05-13 17:36:46 +00:00
Nick Desaulniers c33f754e74 [TargetLowering] Handle multi depth GEPs w/ inline asm constraints
Summary:
X86TargetLowering::LowerAsmOperandForConstraint had better support than
TargetLowering::LowerAsmOperandForConstraint for arbitrary depth
getelementpointers for "i", "n", and "s" extended inline assembly
constraints. Hoist its support from the derived class into the base
class.

Link: https://github.com/ClangBuiltLinux/linux/issues/469

Reviewers: echristo, t.p.northover

Reviewed By: t.p.northover

Subscribers: t.p.northover, E5ten, kees, jyknight, nemanjai, javed.absar, eraman, hiraditya, jsji, llvm-commits, void, craig.topper, nathanchance, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61560

llvm-svn: 360604
2019-05-13 17:27:44 +00:00
Simon Pilgrim 73aee29095 [X86][SSE] LowerBuildVectorv4x32 - don't insert MOVQ for undef elts
Fixes the regression noted in D61782 where a VZEXT_MOVL was being inserted because we weren't discriminating between 'zeroable' and 'all undef' for the upper elts.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360596
2019-05-13 16:10:11 +00:00
Simon Pilgrim cf5a8eb7cd [X86][SSE] Relax use limits for lowerAddSubToHorizontalOp (PR32433)
Now that we can use HADD/SUB for scalar additions from any pair of extracted elements (D61263), we can relax the one use limit as we will be able to merge multiple uses into using the same HADD/SUB op.

This exposes a couple of missed opportunities in LowerBuildVectorv4x32 which will be committed separately.

Differential Revision: https://reviews.llvm.org/D61782

llvm-svn: 360594
2019-05-13 16:02:45 +00:00
Simon Pilgrim d3cedee3c6 [TargetLowering] Add SimplifyDemandedBits support for ZERO_EXTEND_VECTOR_INREG
More work for PR39709.

llvm-svn: 360592
2019-05-13 15:51:26 +00:00
Craig Topper c6a6c10742 [X86] Add test case for mask register variant of PR41619 which should be fixed after r360552
llvm-svn: 360591
2019-05-13 15:45:20 +00:00
Sanjay Patel 05dafb1c97 [DAGCombiner] narrow vector binop with inserts/extract
We catch most of these patterns (on x86 at least) by matching
a concat vectors opcode early in combining, but the pattern may
emerge later using insert subvector instead.

The AVX1 diffs for add/sub overflow show another missed narrowing
pattern. That one may be falling though the cracks because of
combine ordering and multiple uses.

llvm-svn: 360585
2019-05-13 14:31:14 +00:00
Sanjay Patel 83e61bc5e2 [x86] add test for insert/extract binop; NFC
This pattern is visible in the c-ray benchmark with an AVX target.

llvm-svn: 360582
2019-05-13 13:32:16 +00:00
Kevin P. Neal 5987749e33 Add constrained fptrunc and fpext intrinsics.
The new fptrunc and fpext intrinsics are constrained versions of the
regular fptrunc and fpext instructions.

Reviewed by:	Andrew Kaylor, Craig Topper, Cameron McInally, Conner Abbot
Approved by:	Craig Topper
Differential Revision: https://reviews.llvm.org/D55897

llvm-svn: 360581
2019-05-13 13:23:30 +00:00
Clement Courbet 9afc4764dd [DAGCombiner] Fix invalid alias analysis.
Summary:
When we know for sure whether two addresses do or do not alias, we
should immediately return from DAGCombiner::isAlias().

I think this comes from a bad copy/paste; sorry for not catching that during the
code review.

Fixes PR41855.

Reviewers: niravd, gchatelet, EricWF

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61846

llvm-svn: 360566
2019-05-13 09:07:37 +00:00
Clement Courbet c4e37fd9b2 [DAGCombiner][NFC] Commit test to show fix in D61846.
llvm-svn: 360561
2019-05-13 08:15:34 +00:00
Craig Topper 61e556d2bd Recommit r358887 "[TargetLowering][AMDGPU][X86] Improve SimplifyDemandedBits bitcast handling"
I've included a new fix in X86RegisterInfo to prevent PR41619 without
reintroducing r359392. We might be able to improve that in the base class
implementation of shouldRewriteCopySrc somehow. But this hopefully enables
forward progress on SimplifyDemandedBits improvements for now.

Original commit message:

This patch adds support for BigBitWidth -> SmallBitWidth bitcasts, splitting the DemandedBits/Elts accordingly.

The AMDGPU backend needed an extra (srl (and x, c1 << c2), c2) -> (and (srl x, c2), c1) combine to encourage BFE creation. I investigated putting this in DAGCombine,
but it caused a lot of noise on other targets - some improvements, some regressions.

The X86 changes are all definite wins.

llvm-svn: 360552
2019-05-13 04:03:35 +00:00
Simon Pilgrim a7fc763082 [X86][AVX] Split VZEXT_MOVL ymm/zmm if the upper elements are not demanded.
Removes unnecessary vzeroupper noted in D61806

llvm-svn: 360543
2019-05-12 15:16:29 +00:00
Sanjay Patel a09e686821 [DAGCombiner] try to move bitcast after extract_subvector
I noticed that we were failing to narrow an x86 ymm math op in a case similar
to the 'madd' test diff. That is because a bitcast is sitting between the math
and the extract subvector and thwarting our pattern matching for narrowing:

       t56: v8i32 = add t59, t58
      t68: v4i64 = bitcast t56
    t73: v2i64 = extract_subvector t68, Constant:i64<2>
  t96: v4i32 = bitcast t73

There are a few wins and neutral diffs in the other tests.

Differential Revision: https://reviews.llvm.org/D61806

llvm-svn: 360541
2019-05-12 14:43:20 +00:00
Simon Pilgrim fda6bffd3b [X86][SSE] SimplifyDemandedBits - call PEXTRB/PEXTRW SimplifyDemandedVectorElts as well.
See if we can simplify the demanded vector elts from the extraction before trying to simplify the demanded bits.

This helps us with target shuffles and hops in particular.

llvm-svn: 360535
2019-05-11 21:35:50 +00:00
Simon Pilgrim 605a840747 [DAG] Add SimplifyDemandedBits support for BITREVERSE
Pulled out of D58017 while I continue to investigate the BSWAP regression on PPC

llvm-svn: 360534
2019-05-11 20:56:05 +00:00
Simon Pilgrim 3fa632a112 [X86] Updated shift-mask test targets for D61830
llvm-svn: 360533
2019-05-11 20:28:20 +00:00
Simon Pilgrim 91e697c145 [X86] Add scalar shl+lshr -> shift+mask tests (PR40758)
As discussed on D61068, many x86 targets can perform 2 immediate shifts quicker than a shift + mask

llvm-svn: 360530
2019-05-11 19:16:46 +00:00
Simon Pilgrim 6f7c62d70f [X86] Add avx512f tests for boolean reduction
llvm-svn: 360529
2019-05-11 19:14:19 +00:00
Simon Pilgrim e4c5b6d9bd [X86][SSE] Add SimplifyDemandedVectorElts HADD/HSUB handling.
Still missing PHADDW/PHSUBW tests because PEXTRW doesn't call SimplifyDemandedVectorElts

llvm-svn: 360526
2019-05-11 16:07:12 +00:00
Craig Topper c9d7484aa3 [X86] Add CMOV_FR32X/CMOV_FR64X pseudo instructions. Use them in fast isel to fix a machine verifier error after adding test cases.
Fast isel picks the FR32X/FR64X register classes when lowering pseudo select, but it didn't have the right opcode to go with it.

llvm-svn: 360524
2019-05-11 16:00:28 +00:00
Simon Pilgrim 4871a3057e [X86][SSE] Tweaked HADD/HSUB SimplifyDemandedVectorElts
Try to ensure we have LHS and RHS test coverage

llvm-svn: 360519
2019-05-11 14:47:54 +00:00
Simon Pilgrim 1db0cc9e1b [X86][SSE] Add integer HADD/HSUB SimplifyDemandedVectorElts tests
llvm-svn: 360518
2019-05-11 14:08:34 +00:00
Simon Pilgrim 67ad4c2f27 [X86][SSE] Add HADD/HSUB SimplifyDemandedVectorElts tests
Shows missed opportunities to simplify args.

Will add integer HADD/HSUB tests in a future commit.

llvm-svn: 360517
2019-05-11 12:46:38 +00:00
Craig Topper 31f7adb94f [X86] Don't emit MOVNTDQA loads from fast-isel without SSE4.1.
We were checking for SSE4.1 for FP types, but not integer 128-bit types.

Fixes PR41837.

llvm-svn: 360512
2019-05-11 04:19:33 +00:00
Craig Topper bdef12df8d [X86] Add a test case for idempotent atomic operations with speculative load hardening. Fix an additional issue found by the test.
This test covers the fix from r360475 as well.

llvm-svn: 360511
2019-05-11 04:00:27 +00:00
Mircea Trofin ff3bed0e61 Skip over prefetches
Summary: Skip over prefetches when assigning debug info to instructions with memory operands. This way, the debug info is stable after instrumenting a binary with prefetches, allowing for iterative profiling and instrumentation.

Reviewers: davidxl

Reviewed By: davidxl

Subscribers: aprantl, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61789

llvm-svn: 360471
2019-05-10 21:27:55 +00:00
Nikita Popov 9f7537bd48 [SDAG] Recursively legalize both vector mulo results
Split out from D61692 per RKSimon's suggestion. Vector op
legalization will automatically recursively legalize the returned
SDValue, but we need to take care of the other results ourselves.
Otherwise it will end up getting legalized only during op
legalization, by which point it might be too late (though I'm not
aware of any specific cases right now).

There are codegen differences because expansion occurs earlier now
and we don't get a DAGCombiner run in between.

Differential Revision: https://reviews.llvm.org/D61744

llvm-svn: 360470
2019-05-10 20:42:48 +00:00
Robert Lougher 986b6b86bb [X86] Avoid SFB - Fix inconsistent codegen with/without debug info
Fixes https://bugs.llvm.org/show_bug.cgi?id=40969

The functions findPotentiallyBlockedCopies and buildCopy are currently not
accounting for the presence of debug instructions. In the former this results
in the optimization not being triggered, and in the latter it results in
inconsistent codegen.

This patch enables the optimization to be performed in a debug build and
ensures the codegen is consistent with non-debug builds.

Patch by Chris Dawson.

Differential Revision: https://reviews.llvm.org/D61680

llvm-svn: 360436
2019-05-10 15:55:06 +00:00
Simon Pilgrim a0b1518a4a [X86][SSE] Add getHopForBuildVector vector splitting
If we only use the lower xmm of a ymm hop, then extract the xmm's (for free), perform the xmm hop and then insert back into a ymm (for free).

Fixes some of the regressions noted in D61782

llvm-svn: 360435
2019-05-10 15:46:04 +00:00
Mircea Trofin 5c31c05fbd [llvm] X86DiscriminateMemOps: insert debug info when missing
Reviewers: davidxl

Reviewed By: davidxl

Subscribers: aprantl, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61735

llvm-svn: 360396
2019-05-10 00:12:51 +00:00
Philip Reames bd588dfd59 [X86] Improve lowering of idempotent RMW operations
The current lowering uses an mfence. mfences are substantially higher latency than the locked operations originally requested, but we do want to avoid contention on the original cache line. As such, use a locked instruction on a cache line assumed to be thread local.
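
A minimal IR sketch of the case in question (my own example, not a test
from this commit):

```llvm
; 'or %p, 0' is idempotent - it returns the current value without
; changing memory - so it can be lowered to a cheap locked op on a
; (presumed) thread-local cache line instead of an mfence.
define i32 @idempotent_or(i32* %p) {
  %v = atomicrmw or i32* %p, i32 0 seq_cst
  ret i32 %v
}
```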

Differential Revision: https://reviews.llvm.org/D58632

llvm-svn: 360393
2019-05-09 23:23:42 +00:00
Simon Pilgrim 93bfa5af48 [X86][SSE] Fold add(shuffle(),shuffle()) to hadd on 'slow' targets (PR39920)
As reported on PR39920, "slow horizontal ops" targets tend to internally expand to 2*shuffle+add/sub - so if we can reduce 2*shuffle+add/sub to a hadd/sub then we should do it - similar port usage but reduced instruction count.

This works out in most cases, although the "PR22377" regression in vector-shuffle-combining.ll is annoying - going from 2*shuffle+add+shuffle to hadd+2*shuffle - I've opened PR41813 to cover this.
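
A hedged sketch of the shape being matched (illustrative IR of mine, not
a test from the commit):

```llvm
; Two interleaving shuffles feeding an add; the result lanes are
; [a0+a1, a2+a3, b0+b1, b2+b3], i.e. exactly a horizontal add.
define <4 x i32> @add_of_shuffles(<4 x i32> %a, <4 x i32> %b) {
  %s0 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %s1 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %r = add <4 x i32> %s0, %s1
  ret <4 x i32> %r
}
```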

Differential Revision: https://reviews.llvm.org/D61308

llvm-svn: 360360
2019-05-09 17:45:01 +00:00
Sanjay Patel 902b3ecdad [SelectionDAG] fold 'fneg undef' to undef
This is extracted from the original draft of D61419 with some additional tests.
We don't currently get this in IR (it's conservatively turned into a NaN),
but presumably that'll get updated as we add real IR support for 'fneg'
rather than 'fsub -0.0, x'.
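
A minimal sketch of the folded pattern (my own IR):

```llvm
; The 'fsub -0.0, x' fneg idiom applied to undef now folds to undef
; in SelectionDAG instead of being conservatively treated as a NaN.
define float @fneg_undef() {
  %r = fsub float -0.000000e+00, undef
  ret float %r
}
```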

The x86-32 run shows the following, and I haven't looked further to see why,
but that seems to be independent:
  Legalizing: t1: f32 = undef
  Trying to expand node
  Creating fp constant: t4: f32 = ConstantFP<0.000000e+00>

Differential Revision: https://reviews.llvm.org/D61516

llvm-svn: 360296
2019-05-08 22:19:52 +00:00
Craig Topper c5db081f8d [X86] Add a non-ambiguous check prefix to lwp-intrinsics.ll for the case when only the feature is specified and not the CPUs.
Not sure why the script doesn't notice this. We just weren't checking the +lwp command lines in 32-bit mode in 2 of the test cases.

llvm-svn: 360282
2019-05-08 19:20:53 +00:00
Quentin Colombet 157427245a [RegAllocFast] Scan physical reg definitions before assigning virtual reg definitions
When assigning the definitions of an instruction we were updating
the available registers while walking the definitions. Some of
those definitions may be from physical registers and thus, they are
not available for other definitions to take, but by the time we see
that we may have already assigned these registers to another
virtual register.

Fix that by walking through all the definitions, marking the physical
register definitions as unavailable, and then doing the virtual register
assignments.

PR41790

llvm-svn: 360278
2019-05-08 18:30:26 +00:00
Philip Reames 8186e39082 [Tests] Landing tests for D58632 to show diffs in review
llvm-svn: 360274
2019-05-08 17:28:38 +00:00
Craig Topper 493aec3ef5 [FastISel][X86] Support FNeg instruction in target independent fast isel handling
This patch adds support for calling selectFNeg for FNeg instructions in addition to the fsub idiom

Differential Revision: https://reviews.llvm.org/D61624

llvm-svn: 360273
2019-05-08 17:27:08 +00:00
Philip Reames b61eaebb6b [Tests] Expand coverage of small memset zero idioms
llvm-svn: 360210
2019-05-07 23:48:42 +00:00
Reid Kleckner 6bf108d77a [COFF] Use COFF stubs for extern_weak functions
Summary:
A COFF stub indirects the reference to a symbol through memory. A
.refptr.$sym global variable pointer is created to refer to $sym.
Typically mingw uses these for external global variable declarations,
but we can use them for weak function declarations as well.

Updates the dso_local classification to add a special case for
extern_weak symbols on COFF in both clang and LLVM.

Fixes PR37598

Reviewers: smeenai, mstorsjo

Subscribers: hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D61615

llvm-svn: 360207
2019-05-07 23:06:21 +00:00
Eric Christopher 4727221734 Make sure that the DAG combiner doesn't merge stores that we explicitly
asked not to be greater than the preferred vector width for the vectorizer.
Test for both 128 and 256 with a Skylake architecture.

llvm-svn: 360183
2019-05-07 19:25:34 +00:00
Philip Reames 800e6e34ae [Tests] Yet more combination of tests for unordered.atomic memset
llvm-svn: 360177
2019-05-07 17:45:52 +00:00
Simon Pilgrim b0f51266b8 [X86][AVX] Fold concat(packus(),packus()) -> packus(concat(),concat()) (PR34773)
Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining.

llvm-svn: 360134
2019-05-07 11:17:39 +00:00
Craig Topper c6d445f9c1 [FastISel][X86] If selectFNeg fails, fall back to SelectionDAG not treating it as an fsub.
Summary:
If fneg lowering for fsub -0.0, x fails we currently fall back to treating it as an fsub. This has different behavior for nans than the xor with sign bit trick we normally try to do. On X86, the xor trick for double fails fast-isel in 32-bit mode with sse2 due to 64 bit integer types not being available. With -O2 we would always use an xorpd for this case. If we use subsd, this creates an observable behavior difference between -O0 and -O2. So fall back to SelectionDAG if we can't fast-isel it, that way SelectionDAG will use the xorpd.

I believe this patch is restoring the behavior prior to r345295 from last October. This was missed then because our fast isel case in 32-bit mode aborted fast-isel earlier for another reason. But I've added new tests to cover that.

Reviewers: andrew.w.kaylor, cameron.mcinally, spatel, efriedma

Reviewed By: cameron.mcinally

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61622

llvm-svn: 360111
2019-05-07 04:25:24 +00:00
Amy Huang 987b969bab Fix bug in getCompleteTypeIndex in codeview debug info
Summary:
When there are multiple instances of a forward decl record type, only the first one is emitted with a type index, because
the type is added to a map with a null type index. Avoid this by reordering so that forward decl types aren't added to the map.

Reviewers: rnk

Subscribers: aprantl, hiraditya, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61460

llvm-svn: 360101
2019-05-06 23:37:03 +00:00
Craig Topper 39f1a97417 [FastISel] Pass the fneg input operand to hasTrivialKill in FastISel::selectFNeg.
We're trying to calculate the kill flag for OpReg which is the input so we need to pass the input here.

llvm-svn: 360097
2019-05-06 23:09:09 +00:00
Craig Topper 24cfb7a992 [X86] Add test case to show that we don't set the kill flag properly for fast isel handling of fneg.
llvm-svn: 360096
2019-05-06 23:08:17 +00:00
Philip Reames 1b31390fc6 [Tests] Add tests for optimized lowerings of element.unordered.atomic memset/memcmove/memcopy
llvm-svn: 360093
2019-05-06 22:25:59 +00:00
Philip Reames 03a979a45a [Tests] Rename tests before adding new ones
llvm-svn: 360092
2019-05-06 22:16:55 +00:00
Philip Reames 4bcf10fc0f [Tests] Autogen a test in advance of updates
llvm-svn: 360091
2019-05-06 22:12:07 +00:00
Philip Reames 2f53d79bff Fix pr33010, a 2 year old crashing regression
The problem was that we were creating a CMOV64rr <TargetFrameIndex>, <TargetFrameIndex>.  The entire point of a TFI is that address code is not generated, so there's no way to legalize/lower this.  Instead, simply prevent its creation.

Arguably, we shouldn't be using *Target*FrameIndices in StatepointLowering at all, but that's a much deeper change.  

llvm-svn: 360090
2019-05-06 22:09:31 +00:00
Craig Topper 77e69d8850 [X86] Add more test cases for fast-isel handling of fneg.
The fneg double case is falling back to a subsd in 32-bit mode if you write a test that doesn't trigger a fast-isel abort on the return value.

The subsd lowering has different behavior with respect to nans than using an xor. This is inconsistent with what we would do in SelectionDAG
and can lead to differences between -O0 and -O2.

llvm-svn: 360088
2019-05-06 22:04:26 +00:00
Craig Topper d10a200ceb [X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly printing.
We require d/q suffixes on the memory form of these instructions to disambiguate the memory size.
We don't require it on the register forms, but need to support parsing both with and without it.

Previously we always printed the d/q suffix on the register forms, but it's redundant and
inconsistent with gcc and objdump.

After this patch we should support the d/q for parsing, but not print it when it's unneeded.

llvm-svn: 360085
2019-05-06 21:39:51 +00:00
Craig Topper ad56843dd7 [SelectionDAG][X86] Support inline assembly returning an mmx register into a type with fewer than 64 bits.
It's possible to use the 'y' mmx constraint with a type narrower than 64-bits.

This patch supports this by bitcasting the mmx type to 64-bits and then
truncating to the desired type.
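
A hedged sketch of the kind of IR involved (my own example; the asm body
is arbitrary):

```llvm
; An inline asm 'y' (MMX) constraint whose result is consumed as a
; 32-bit value: the mm register is bitcast to 64 bits, then truncated.
define i32 @mmx_to_i32() {
  %r = call i32 asm "pxor $0, $0", "=y"()
  ret i32 %r
}
```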

There are probably other missing type combinations we need to support, but this
is the case we have a bug report for.

Fixes PR41748.

Differential Revision: https://reviews.llvm.org/D61582

llvm-svn: 360069
2019-05-06 19:50:14 +00:00
Craig Topper 55a71b575c Revert r359392 and r358887
Reverts "[X86] Remove (V)MOV64toSDrr/m and (V)MOVDI2SSrr/m. Use 128-bit result MOVD/MOVQ and COPY_TO_REGCLASS instead"
Reverts "[TargetLowering][AMDGPU][X86] Improve SimplifyDemandedBits bitcast handling"

Eric Christopher and Jorge Gorbe Moya reported some issues with these patches to me off list.

Removing the CodeGenOnly instructions has changed how fneg is handled during fast-isel with sse/sse2. We're now emitting fsub -0.0, x instead of
moving to the integer domain (in a GPR), xoring the sign bit, and then moving back to xmm. This is because the fast isel table no longer
contains an entry for (f32/f64 bitcast (i32/i64)), so the target independent fneg code fails. The use of fsub changes the behavior for nans
relative to -O2 codegen, which will always use a pxor. NOTE: We still have a difference with double with -m32 since the move to GPR doesn't work
there. I'll file a separate PR for that and add test cases.

Since removing the CodeGenOnly instructions was fixing PR41619, I'm reverting r358887 which exposed that PR. Though I wouldn't be surprised
if that bug can still be hit independent of that.

This should hopefully get Google back to green. I'll work with Simon and other X86 folks to figure out how to move forward again.

llvm-svn: 360066
2019-05-06 19:29:24 +00:00
Fangrui Song 4c3d579096 [CodeGen] Move X86 tests under the X86 directory
llvm-svn: 360029
2019-05-06 10:21:17 +00:00
Guillaume Chatelet 3cfb48b877 [NFC] Update memcpy tests
Summary: Runs utils/update_llc_test_checks.py on a few memcpy files

Reviewers: courbet

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61507

Remove cfi noise by adding nounwind

llvm-svn: 360023
2019-05-06 09:46:50 +00:00
Pengfei Wang b5d3430d3d [NFC] This is a test for commit access.
Summary: Signed-off-by: Pengfei Wang <pengfei.wang@intel.com>

Reviewers: LuoYuanke

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61580

llvm-svn: 360019
2019-05-06 08:31:18 +00:00
Sanjay Patel 5ab41a7a05 [CodeGenPrepare] limit overflow intrinsic matching to a single basic block (2nd try)
This is a subset of the original commit from rL359879
which was reverted because it could crash when using the 'RemovedInstructions'
structure that enables delayed deletion of dead instructions. The motivating
compile-time win does not require that change though. We should get most of
that win from this change alone.

Using/updating a dominator tree to match math overflow patterns may be very
expensive in compile-time (because of the way CGP uses a DT), so just handle
the single-block case.
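
A hedged sketch of the kind of single-block pattern that still gets
matched (my own IR):

```llvm
; add + unsigned compare against an operand in one basic block; CGP
; can form llvm.uadd.with.overflow without consulting a dominator tree.
define i1 @uadd_ov(i32 %a, i32 %b, i32* %p) {
  %s = add i32 %a, %b
  store i32 %s, i32* %p
  %ov = icmp ult i32 %s, %a   ; overflowed iff the sum wrapped below %a
  ret i1 %ov
}
```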

See post-commit thread for rL354298 for more details:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190422/646276.html

Differential Revision: https://reviews.llvm.org/D61075

llvm-svn: 359969
2019-05-04 12:46:32 +00:00
Sanjay Patel dd2e91a181 [x86] add tests for fneg IR with undef; NFC
llvm-svn: 359941
2019-05-03 22:47:29 +00:00
Matt Arsenault b6c599afd3 Reapply r359906, "RegAllocFast: Add heuristic to detect values not live-out of a block"
This reverts commit r359912.

This should pass now, since the clang test was made less fragile in
r359918.

llvm-svn: 359919
2019-05-03 19:06:57 +00:00
Nico Weber bb852a9672 Revert r359906, "RegAllocFast: Add heuristic to detect values not live-out of a block"
Makes clang/test/Misc/backend-stack-frame-diagnostics-fallback.cpp fail.

llvm-svn: 359912
2019-05-03 18:08:03 +00:00
Evgeniy Stepanov 46ec57e576 Revert "[CodeGenPrepare] limit overflow intrinsic matching to a single basic block"
This reverts commit r359879, which introduced a compiler crash.

llvm-svn: 359908
2019-05-03 17:31:49 +00:00
Matt Arsenault daf2d653fa RegAllocFast: Add heuristic to detect values not live-out of a block
Add an improved/new heuristic to catch more cases when values are not
live out of a basic block.

Patch by Matthias Braun

llvm-svn: 359906
2019-05-03 17:03:24 +00:00
Sanjay Patel d0336b1e3f [x86] add tests for fneg with undefs; NFC
This was originally part of D61419.

llvm-svn: 359896
2019-05-03 15:09:53 +00:00
Simon Pilgrim 4d4f779fa2 [X86] Add X64 common prefixes and regenerate mul i64 tests
Noticed while reviewing D61472

llvm-svn: 359886
2019-05-03 14:07:38 +00:00
Sanjay Patel 8ff072e48e [CodeGenPrepare] limit overflow intrinsic matching to a single basic block
Using/updating a dominator tree to match math overflow patterns may be very
expensive in compile-time (because of the way CGP uses a DT), so just handle
the single-block case.

Also, we were restarting the iterator loops when doing the overflow intrinsic
transforms by marking the dominator tree for update. That was done to prevent
iterating over a removed instruction. But we can postpone the deletion using
the existing "RemovedInsts" structure, and that means we don't need to update
the DT.

See post-commit thread for rL354298 for more details:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190422/646276.html

Differential Revision: https://reviews.llvm.org/D61075

llvm-svn: 359879
2019-05-03 13:09:18 +00:00
Anton Afanasyev 6d08b8dbae Revert "[MIR] Add simple PRE pass to MachineCSE"
This reverts commit 9c20156de3.
It breaks stage 2 of clang-ppc64be-linux-multistage.

llvm-svn: 359875
2019-05-03 12:36:22 +00:00
Anton Afanasyev 9c20156de3 [MIR] Add simple PRE pass to MachineCSE
This is the second part of the commit fixing PR38917 (hoisting
partially redundant machine instructions). Most of the PRE (partial
redundancy elimination) and CSE work is done on LLVM IR, but some
redundancy arises during DAG legalization. Machine CSE is not enough
to deal with it. This simple PRE implementation works a little bit
intricately: it runs before CSE, looking for partial redundancy
and transforming it into full redundancy, anticipating that the next
CSE step will eliminate the created redundancy. If CSE doesn't
eliminate it, the created instruction will remain dead and be eliminated
later by the Remove Dead Machine Instructions pass.

The third part of the commit is supposed to refactor MachineCSE,
to make it clearer and to merge MachinePRE with MachineCSE,
so one need not rely on a further Remove Dead pass to clear instructions
not eliminated by CSE.

First step: https://reviews.llvm.org/D54839

Fixes llvm.org/PR38917

Reviewers: RKSimon

Subscribers: hfinkel, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D56772

llvm-svn: 359870
2019-05-03 10:30:59 +00:00
Craig Topper e1e38d4248 [X86] Correct the register class for specific mask register constraints in getRegForInlineAsmConstraint when the VT is a scalar type
The default implementation in the base class for TargetLowering::getRegForInlineAsmConstraint doesn't work for mask registers when the VT is a scalar integer type, since the only legal mask types are vXi1. So we end up just getting whatever register class comes first that contains the register. Currently this appears to be VK1, but it's really dependent on the order tablegen outputs the register classes.

Some code in the caller ends up looking up the type for this register class, finds v1i1, and then generates a copyfromreg from the physical k-register with the v1i1 type. Then it generates an any_extend from v1i1 to the scalar VT, which isn't legal. This bad any_extend sticks around until isel, where it selects a MOVZX32rr8 with a v1i1 input or maybe an i8 input. Not sure, but eventually we pick up a copy from VK1 to GR8 in MachineIR, which isn't supported. This leads to a failure in physical register copying.

This patch uses the scalar type to find a VK class of the right size. In the attached test case this will be VK16. This causes a bitcast from vk16 to i16 to be generated instead of an any_extend. This will be properly iseled to a VK16 to GR32 copy and a GR32->GR16 extract_subreg.

Fixes PR41678

Differential Revision: https://reviews.llvm.org/D61453

llvm-svn: 359837
2019-05-02 22:26:40 +00:00
Sanjay Patel 1972826178 [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd try)
The original patch was committed at rL359398 and reverted at rL359695 because of
infinite looping.

This includes a fix to check for a vector splat of "1.0" to avoid the infinite loop.

Original commit message:

This was originally part of D61028, but it's an independent diff.

If we try the repeated divisor reciprocal transform before producing an estimate sequence,
then we have an opportunity to use scalar fdiv. On x86, the trade-off is 1 divss vs. 5
vector FP ops in the default estimate sequence. On recent chips (Skylake, Ryzen), the
full-precision division is only 3 cycle throughput, so that's probably the better perf
default option and avoids problems from x86's inaccurate estimates.

The last 2 tests show that users still have the option to override the defaults by using
the function attributes for reciprocal estimates, but those patterns are potentially made
faster by converting the vector ops (including ymm ops) to scalar math.

Differential Revision: https://reviews.llvm.org/D61149

llvm-svn: 359793
2019-05-02 15:02:08 +00:00
Simon Pilgrim df8daf0ef4 [X86][SSE] lowerAddSubToHorizontalOp - enable ymm extraction+fold
Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting an upper xmm either way, and we can remove a shuffle by using the hop, which is in line with what shouldUseHorizontalOp expects to happen anyway.

Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result:
https://godbolt.org/z/0R-U-K

Differential Revision: https://reviews.llvm.org/D61426

llvm-svn: 359786
2019-05-02 14:00:55 +00:00
Simon Pilgrim 9f04d97cd7 [X86][SSE] Fold scalar horizontal add/sub for non-0/1 element extractions
We already perform horizontal add/sub if we extract from elements 0 and 1; this patch extends it to non-0/1 element extraction indices (as long as they are from the lowest 128-bit vector).
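
A hedged illustration of the newly handled indices (my own IR, not a
test from the commit):

```llvm
; Elements 2 and 3 both come from the lowest 128-bit vector, so the
; sum can be computed with a horizontal add plus an element extract.
define float @hadd_elts_2_3(<4 x float> %v) {
  %a = extractelement <4 x float> %v, i32 2
  %b = extractelement <4 x float> %v, i32 3
  %r = fadd float %a, %b
  ret float %r
}
```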

Differential Revision: https://reviews.llvm.org/D61263

llvm-svn: 359707
2019-05-01 17:13:35 +00:00
Sanjay Patel 64d5751254 Revert "[DAGCombiner] try repeated fdiv divisor transform before building estimate"
This reverts commit fb9a5307a9 (rL359398)
because it can cause an infinite loop due to opposing combines.

llvm-svn: 359695
2019-05-01 16:06:21 +00:00
Simon Pilgrim 6711b9699a [X86][SSE] Add demanded elts support X86ISD::PMULDQ\PMULUDQ
Add to SimplifyDemandedVectorEltsForTargetNode and SimplifyDemandedBitsForTargetNode

llvm-svn: 359686
2019-05-01 14:50:50 +00:00
Simon Pilgrim 3d6899e369 [X86][SSE] Add SSE vector shift support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359680
2019-05-01 13:51:09 +00:00
Simon Pilgrim ba372c6e62 [X86][SSE] Split 512-bit -> 128-bit vector directly in SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 359678
2019-05-01 12:48:42 +00:00
Simon Pilgrim 951a6b4579 [X86][SSE] Add 512-bit vector support to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359677
2019-05-01 12:37:41 +00:00
Simon Pilgrim 37c2419cc7 [X86][SSE] Add X86ISD::PACKSS\PACKUS to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359673
2019-05-01 11:29:36 +00:00
Simon Pilgrim 7244437050 [X86][SSE] Add scalar horizontal add/sub tests for element extractions from upper lanes
As suggested on D61263

llvm-svn: 359671
2019-05-01 11:17:11 +00:00
Simon Pilgrim 3353cee06c [X86][SSE] Add X86ISD::UNPCKL\UNPCK to SimplifyDemandedVectorEltsForTargetNode vector splitting
llvm-svn: 359670
2019-05-01 11:08:03 +00:00
Simon Pilgrim f7b978a71b [X86][SSE] Move extract_subvector(pshufb) fold to SimplifyDemandedVectorEltsForTargetNode
This lets us hit more cases than combineExtractSubvector and allows us reuse more code.

llvm-svn: 359669
2019-05-01 10:58:38 +00:00
Simon Pilgrim 99eefe94b5 [X86][SSE] Extract i1 elements from vXi1 bool vectors
This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.

Differential Revision: https://reviews.llvm.org/D61189

llvm-svn: 359666
2019-05-01 10:02:22 +00:00
Fangrui Song e29e30b139 [llvm-readobj] Change -long-option to --long-option in tests. NFC
We use both -long-option and --long-option in tests. Switch to --long-option for consistency.

In the "llvm-readelf" mode, -long-option is discouraged as it conflicts with grouped short options and it is not accepted by GNU readelf.

While updating the tests, change llvm-readobj -s to llvm-readobj -S to reduce confusion ("s" is --section-headers in llvm-readobj but --symbols in llvm-readelf).

llvm-svn: 359649
2019-05-01 05:27:20 +00:00
Simon Pilgrim 07ab4e7db8 [X86][SSE] Fold extract_subvector(extend(x)) -> extend_vector_inreg(x)
This adds any extend support - folding to zero_extend_vector_inreg (PMOVZX) for legality

Minor improvement for PR39709
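
A hedged sketch of the fold's input shape (illustrative IR of mine):

```llvm
; Extracting the low half of an extended vector can become a single
; extend-in-register node (e.g. pmovzx) instead of extend + extract.
define <8 x i16> @ext_low_half(<16 x i8> %x) {
  %z = zext <16 x i8> %x to <16 x i16>
  %lo = shufflevector <16 x i16> %z, <16 x i16> undef,
                      <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %lo
}
```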

llvm-svn: 359608
2019-04-30 20:31:07 +00:00
Craig Topper 3958719dda [X86] If PreprocessISelDAG reorders a load before a call, make sure we remove dead nodes from the graph
The reordering can leave at least a dead TokenFactor in the graph. This causes the linearize scheduler to fail with something like the assert seen in PR22614. This is only one of many ways we can break the linearize scheduler today so I can't say for sure that any of the other failures in that bug were caused by this issue.

This takes the heavy hammer approach of just running RemoveDeadNodes unconditionally at the end of the PreprocessISelDAG. If this turns out to be a compile time hit, we can try to refine it.

Differential Revision: https://reviews.llvm.org/D61164

llvm-svn: 359582
2019-04-30 17:56:47 +00:00
Craig Topper 965d1306ae [X86] Initial cleanups on the FixupLEAs pass. Separate Atom LEA creation from other LEA optimizations.
This removes some of the class variables. Merge basic block processing into
runOnMachineFunction to keep the flags local.

Pass MachineBasicBlock around instead of an iterator. We can get the iterator in
the few places that need it. Allows a range-based outer for loop.

Separate the Atom optimization from the rest of the optimizations. This allows
fixupIncDec to create INC/DEC and still allow Atom to turn it back into LEA
when profitable by its heuristics.

I'd like to improve fixupIncDec to turn LEAs into ADD any time the base or index
register is equal to the destination register. This is profitable regardless of
the various slow flags. But again we would want Atom to be able to undo that.

Differential Revision: https://reviews.llvm.org/D60993

llvm-svn: 359581
2019-04-30 17:56:28 +00:00
Sanjay Patel 0387bf5269 [SelectionDAG] remove div-by-zero constant folding restriction
We don't have this restriction in IR, so it should not be here
either, simply for consistency. Code that wants to handle FP
exceptions is expected to use the 'strict' variants of these
nodes.

We don't get the frem case because frem by 0.0 produces NaN (invalid),
and that's the remaining check here (so the removed check for frem
was dead code AFAIK).

This is the only place in SDAG that uses "HasFPExceptions", so I
think we should remove that entirely as a follow-up patch.

llvm-svn: 359566
2019-04-30 14:37:15 +00:00
Simon Pilgrim 22641cc194 Fix for bug 41512: lower INSERT_VECTOR_ELT(ZeroVec, 0, Elt) to SCALAR_TO_VECTOR(Elt) for all SSE flavors
Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) instead of the much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is an idiomatic construct which is used e.g. for _mm_cvtsi32_si128(Elt) and for lowest-element initialization in _mm_set_epi32.
Such inefficient lowering leads to significant performance degradations in certain cases when switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512

Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable, since the latter is a closer match to the desired behavior and is always efficiently lowered to movd and the like.
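
A minimal sketch of the idiom (my own IR, equivalent to what
_mm_cvtsi32_si128 produces):

```llvm
; Inserting into element 0 of a zero vector should now lower to a
; single movd rather than pxor+pinsrd.
define <4 x i32> @cvtsi32_si128(i32 %x) {
  %v = insertelement <4 x i32> zeroinitializer, i32 %x, i32 0
  ret <4 x i32> %v
}
```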

Committed on behalf of @Serge_Preis (Serge Preis)

Differential Revision: https://reviews.llvm.org/D60852

llvm-svn: 359545
2019-04-30 10:18:25 +00:00
Markus Lavin a475da36eb [DebugInfo] DW_OP_deref_size in PrologEpilogInserter.
The PrologEpilogInserter need to insert a DW_OP_deref_size before
prepending a memory location expression to an already implicit
expression to avoid having the existing expression act on the memory
address instead of the value behind it.

The reason for using DW_OP_deref_size and not plain DW_OP_deref is that
big-endian targets need to read the right size as simply truncating a
larger read would yield the wrong result (LSB bytes are not at the lower
address).

This re-commit fixes issues reported in the first one. Namely deref was
inserted under wrong conditions and additionally the deref_size argument
was incorrectly encoded.

Differential Revision: https://reviews.llvm.org/D59687

llvm-svn: 359535
2019-04-30 07:58:57 +00:00
Martin Storsjo c0d138d147 [X86] Run CFIInstrInserter on Windows if Dwarf is used
This is necessary since SVN r330706, as tail merging can include
CFI instructions since then.

This fixes PR40322 and PR40012.

Differential Revision: https://reviews.llvm.org/D61252

llvm-svn: 359496
2019-04-29 20:25:51 +00:00
Simon Pilgrim 028485d7b9 [X86][SSE] isHorizontalBinOp - add support for target shuffles
Add target shuffle decoding to isHorizontalBinOp as well as ISD::VECTOR_SHUFFLE support.

This does mean we can go through bitcasts so we need to bitcast the extracted args to ensure they are the correct type

Fixes PR39936 and should help with PR39920/PR39921

Differential Revision: https://reviews.llvm.org/D61245

llvm-svn: 359491
2019-04-29 19:52:59 +00:00
Bjorn Pettersson 820994572c [DAG] Refactor DAGCombiner::ReassociateOps
Summary:
Extract the logic for doing reassociations
from DAGCombiner::reassociateOps into a helper
function DAGCombiner::reassociateOpsCommutative,
and use that helper to trigger reassociation
on the original operand order, or the commuted
operand order.

Codegen is not identical since the operand order will
be different when doing the reassociations for the
commuted case. That causes some unfortunate churn in
some test cases. Apart from that this should be NFC.

Reviewers: spatel, craig.topper, tstellar

Reviewed By: spatel

Subscribers: dmgreen, dschuff, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, hiraditya, aheejin, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61199

llvm-svn: 359476
2019-04-29 17:50:10 +00:00
Kevin P. Neal a25c928302 Add AVX support to this test.
Requested by Craig Topper and Andrew Kaylor as part of D55897.

llvm-svn: 359461
2019-04-29 16:06:04 +00:00
Simon Pilgrim 9d4ed24f25 [X86][SSE] Add scalar horizontal add/sub tests for non-0/1 element extractions
llvm-svn: 359454
2019-04-29 14:26:27 +00:00
Simon Pilgrim c570b2a2e5 [X86][SSE] Moved haddps test from phaddsub.ll to haddsub.ll (D61245)
Also merged duplicate PR39921 + PR39936 tests

llvm-svn: 359437
2019-04-29 11:30:47 +00:00
Simon Pilgrim 65f12f66f6 [X86] Add PR39921 HADD pairwise reduction test and AVX2 test coverage
llvm-svn: 359409
2019-04-28 21:04:47 +00:00
Simon Pilgrim 85bacd0f95 [X86][AVX] Add fast-hops target for add/fadd reduction tests
llvm-svn: 359408
2019-04-28 20:04:08 +00:00
Simon Pilgrim e375257e95 [X86] Add PR39936 HADD Tests
llvm-svn: 359407
2019-04-28 20:03:11 +00:00
Simon Pilgrim d394195221 [X86][AVX] Enabled AVX512F tests and add PR40815 test case
llvm-svn: 359401
2019-04-28 15:04:30 +00:00
Simon Pilgrim 22d1476bfa [X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.

llvm-svn: 359400
2019-04-28 14:31:01 +00:00
Sanjay Patel ce8cfe96f7 [SelectionDAG] include FP min/max variants as binary operators
The x86 test diffs don't look great because of extra move ops,
but FP min/max should clearly be included in the list.

llvm-svn: 359399
2019-04-28 13:19:29 +00:00
Sanjay Patel fb9a5307a9 [DAGCombiner] try repeated fdiv divisor transform before building estimate
This was originally part of D61028, but it's an independent diff.

If we try the repeated divisor reciprocal transform before producing an estimate sequence,
then we have an opportunity to use scalar fdiv. On x86, the trade-off is 1 divss vs. 5
vector FP ops in the default estimate sequence. On recent chips (Skylake, Ryzen), the
full-precision division is only 3 cycle throughput, so that's probably the better perf
default option and avoids problems from x86's inaccurate estimates.

The last 2 tests show that users still have the option to override the defaults by using
the function attributes for reciprocal estimates, but those patterns are potentially made
faster by converting the vector ops (including ymm ops) to scalar math.

Differential Revision: https://reviews.llvm.org/D61149

llvm-svn: 359398
2019-04-28 12:23:43 +00:00
Simon Pilgrim 93ad48210c [X86][SSE] Optimize llvm.experimental.vector.reduce.xor.vXi1 parity reduction (PR38840)
An xor reduction of a bool vector can be optimized to a parity check of the MOVMSK/BITCAST'd integer - if the population count is odd return 1, else return 0.
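
A hedged sketch of the source pattern (my own IR; the exact intrinsic
name mangling varies across LLVM versions):

```llvm
; An xor reduction of a bool vector is a parity check: movmsk the mask
; into an integer and test whether its population count is odd.
define i1 @parity(<16 x i8> %a, <16 x i8> %b) {
  %c = icmp eq <16 x i8> %a, %b
  %r = call i1 @llvm.experimental.vector.reduce.xor.v16i1(<16 x i1> %c)
  ret i1 %r
}

declare i1 @llvm.experimental.vector.reduce.xor.v16i1(<16 x i1>)
```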

Differential Revision: https://reviews.llvm.org/D61230

llvm-svn: 359396
2019-04-28 10:46:17 +00:00
Simon Pilgrim fed302ae37 [X86][AVX] Add AVX512DQ coverage for masked memory ops tests (PR34584)
llvm-svn: 359395
2019-04-28 10:02:34 +00:00
Craig Topper bd35a30940 [X86] Remove (V)MOV64toSDrr/m and (V)MOVDI2SSrr/m. Use 128-bit result MOVD/MOVQ and COPY_TO_REGCLASS instead
Summary:
The register forms of these instructions are CodeGenOnly instructions that cover
GR32->FR32 and GR64->FR64 bitcasts. There is a similar set of instructions for
the opposite bitcast. Due to the patterns using bitcasts these instructions get
marked as "bitcast" machine instructions as well. The peephole pass is able to
look through these as well as other copies to try to avoid register bank copies.

Because FR32/FR64/VR128 are all coalescable to each other we can end up in a
situation where a GR32->FR32->VR128->FR64->GR64 sequence can be reduced to
GR32->GR64 which the copyPhysReg code can't handle.

To prevent this, this patch removes one set of the 'bitcast' instructions. So
now we can only go GR32->VR128->FR32 or GR64->VR128->FR64. The instruction that
converts from GR32/GR64->VR128 has no special significance to the peephole pass
and won't be looked through.

I guess the other option would be to add support to copyPhysReg to just promote
the GR32->GR64 to a GR64->GR64 copy. The upper bits were basically undefined
anyway. But removing the CodeGenOnly instruction in favor of one that won't be
optimized seemed safer.

I deleted the peephole test because it couldn't be made to work with the bitcast
instructions removed.

The load versions of the instructions were unnecessary, as the pattern that selects
them contains a bitcasted load, which should never happen.

Fixes PR41619.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61223

llvm-svn: 359392
2019-04-28 06:25:33 +00:00
Simon Pilgrim 03c4e2663c Revert rL359389: [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.
........
Causing irregular buildbot failures.

llvm-svn: 359391
2019-04-27 20:44:08 +00:00
Simon Pilgrim 1a4a43250e [X86][AVX] Add additional SSE/AVX expandload and compressstore targets
llvm-svn: 359390
2019-04-27 20:20:02 +00:00
Simon Pilgrim 4118be3af6 [X86][SSE] Add support for <64 x i1> bool reduction
Minor generalization of the existing <32 x i1> pre-AVX2 split code.

llvm-svn: 359389
2019-04-27 20:04:44 +00:00
Simon Pilgrim 399746eaf6 [X86][AVX] Cleanup and add additional expandload and compressstore tests
sort order by types and add vXi32/vXi16/vXi8 test coverage

llvm-svn: 359388
2019-04-27 19:57:34 +00:00
Simon Pilgrim 2a2d422400 [X86][AVX512] Improve vector bool reductions
As predicate masks are legal on AVX512 targets, we avoid MOVMSK in these cases, but we can just bitcast the bool vector to the integer equivalent directly - avoiding expansion of the reduction to a shuffle pattern.

llvm-svn: 359386
2019-04-27 17:32:46 +00:00
Simon Pilgrim 913bfd3363 [X86] Add vector boolean reduction tests (PR38840)
AND/OR/XOR tests for the @llvm.experimental.vector.reduce intrinsics

AND/OR are pretty good (pre-AVX512), XOR (not so common but used for parity reduction) is still pretty bad.

llvm-svn: 359385
2019-04-27 16:49:54 +00:00
Simon Pilgrim 5cf616530a Fix check-prefixes typo
llvm-svn: 359382
2019-04-27 15:41:14 +00:00
Simon Pilgrim 3879b2cd45 [X86][SSE] Add initial test case for subvector insert/extract of illegal types
Suggested by @nikic on D59188

llvm-svn: 359379
2019-04-27 15:30:06 +00:00
Simon Pilgrim acc1e6d1c6 [X86][AVX] Merge mask select with shuffles across extract_subvector (PR40332)
Fixes PR40332 in the limited case where we're selecting between a target shuffle and a zero vector.

We can extend this in the future to handle more opcodes and non-zero selections.

llvm-svn: 359378
2019-04-27 13:35:32 +00:00
Craig Topper 063b471ff7 [X86] Use MOVQ for i64 atomic_stores when SSE2 is enabled
Summary: If we have SSE2 we can use a MOVQ to store 64 bits and avoid falling back to a cmpxchg8b loop. If it's a seq_cst store we need to insert an mfence after the store.
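
A hedged sketch of the affected IR shape (hypothetical test of mine,
targeting 32-bit x86 with SSE2):

```llvm
; With SSE2 this can be a single movq store, plus an mfence because the
; store is seq_cst, instead of a cmpxchg8b loop.
define void @store_i64(i64* %p, i64 %v) {
  store atomic i64 %v, i64* %p seq_cst, align 8
  ret void
}
```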

Reviewers: spatel, RKSimon, reames, jfb, efriedma

Reviewed By: RKSimon

Subscribers: hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60546

llvm-svn: 359368
2019-04-27 03:38:15 +00:00
Nick Desaulniers 7ab164c4a4 [AsmPrinter] refactor to support %c w/ GlobalAddress'
Summary:
Targets like ARM, MSP430, PPC, and SystemZ have complex behavior when
printing the address of a MachineOperand::MO_GlobalAddress. Move that
handling into a new overriden method in each base class. A virtual
method was added to the base class for handling the generic case.

Refactors a few subclasses to support the target independent %a, %c, and
%n.

The patch also contains small cleanups for AVRAsmPrinter and
SystemZAsmPrinter.

It seems that NVPTXTargetLowering is possibly missing some logic to
transform GlobalAddressSDNodes for
TargetLowering::LowerAsmOperandForConstraint to handle with "i" extended
inline assembly asm constraints.

Fixes:
- https://bugs.llvm.org/show_bug.cgi?id=41402
- https://github.com/ClangBuiltLinux/linux/issues/449

Reviewers: echristo, void

Reviewed By: void

Subscribers: void, craig.topper, jholewinski, dschuff, jyknight, dylanmckay, sdardis, nemanjai, javed.absar, sbc100, jgravelle-google, eraman, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, jrtc27, atanasyan, jsji, llvm-commits, kees, tpimh, nathanchance, peter.smith, srhines

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60887

llvm-svn: 359337
2019-04-26 18:45:04 +00:00
Simon Pilgrim 27e01e675c [X86][AVX] Fold extract_subvector(broadcast(x)) -> broadcast(x) iff x has one use
llvm-svn: 359332
2019-04-26 18:02:14 +00:00
Sanjay Patel 8224bc081c [x86] add tests for fmin/fmax; NFC
'maximum' and 'minimum' still crash, so they are commented out.

llvm-svn: 359306
2019-04-26 13:36:37 +00:00
Simon Pilgrim 5d6ef94c36 [X86][SSE] Disable shouldFoldConstantShiftPairToMask for btver1/btver2 targets (PR40758)
As detailed on PR40758, Bobcat/Jaguar can perform vector immediate shifts on the same pipes as vector ANDs with the same latency - so it doesn't make sense to replace a shl+lshr with a shift+and pair as it requires an additional mask (with the extra constant pool, loading and register pressure costs).

Differential Revision: https://reviews.llvm.org/D61068

llvm-svn: 359293
2019-04-26 10:49:13 +00:00
Simon Pilgrim 5e161df9f8 [X86][AVX] Combine shuffles extracted from a common vector
A small step towards combining shuffles across vector sizes - this recognizes when a shuffle's operands are all extracted from the same larger source and tries to combine to an unary shuffle of that source instead. Fixes one of the test cases from PR34380.

Differential Revision: https://reviews.llvm.org/D60512

llvm-svn: 359292
2019-04-26 09:56:14 +00:00
Sanjay Patel 7a2718181e [x86] add tests for vector fdiv reciprocal estimate; NFC
llvm-svn: 359238
2019-04-25 20:35:47 +00:00
Craig Topper f9c30eddd0 [SelectionDAG][X86] Use stack load/store in PromoteIntRes_BITCAST when the input needs to be be split and the output type is a vector.
We had special case handling here, but it uses a scalar any_extend for the
promotion then bitcasts to the final type. This won't split up the input data
into multiple promoted elements like we need.

This patch falls back to doing the conversion through memory.

Fixes PR41594 which I believe was reflected in the bitcast-vector-bool.ll
changes. The changes to vector-half-conversions.ll are fixing a previously
unknown miscompile from this issue.

Differential Revision: https://reviews.llvm.org/D61114

llvm-svn: 359219
2019-04-25 18:19:59 +00:00
Simon Pilgrim 0a7d1b3ce1 [X86][SSE] combineBitcastvxi1 - add support for bitcasting to non-scalar integers
Truncate the movmsk scalar integer result to the equivalent scalar integer width as before but then bitcast to the requested type.

We still have the issue identified in PR41594 but D61114 should handle this.

llvm-svn: 359176
2019-04-25 09:34:36 +00:00
Amy Huang 68c9199493 Recommitting r358783 and r358786 "[MS] Emit S_HEAPALLOCSITE debug info" with fixes for buildbot error (undefined assembler label).
Summary:
This emits labels around heapallocsite calls and S_HEAPALLOCSITE debug
info in codeview. Currently only changes FastISel, so emitting labels still
needs to be implemented in SelectionDAG.

Reviewers: rnk

Subscribers: aprantl, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D61083

llvm-svn: 359149
2019-04-24 23:02:48 +00:00
Sanjay Patel 6f41bf948b [DAGCombiner] scale repeated FP divisor by splat factor
If we have a vector FP division with a splatted divisor, use the existing transform
that converts 'x/y' into 'x * (1.0/y)' to allow more conversions. This can then
potentially be converted into a scalar FP division by existing combines (rL358984)
as seen in the tests here.

That can be a potentially big perf difference if scalar fdiv has better timing
(including avoiding possible frequency throttling for vector ops).
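
A hedged sketch of the input (my own IR; needs arcp/reassoc-style
fast-math for the transform to fire):

```llvm
; With a splatted divisor, x/y -> x * (1.0/y); the reciprocal is a
; single scalar fdiv that existing combines can then keep scalar.
define <4 x float> @div_by_splat(<4 x float> %x, float %y) {
  %ins = insertelement <4 x float> undef, float %y, i32 0
  %splat = shufflevector <4 x float> %ins, <4 x float> undef, <4 x i32> zeroinitializer
  %r = fdiv arcp <4 x float> %x, %splat
  ret <4 x float> %r
}
```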

Differential Revision: https://reviews.llvm.org/D61028

llvm-svn: 359147
2019-04-24 22:28:58 +00:00
Craig Topper af194e9380 [X86] Prevent folding a load into an AND if that AND is really a ZEXT_INREG that should use movzx.
This can save a 32-bit immediate move.

We would shrink the load and fold it if it was non-volatile, but that's trickier to check for.

llvm-svn: 359129
2019-04-24 19:28:38 +00:00
Craig Topper 882ca6d484 [X86] Remove dead nodes left after ReplaceAllUsesWith calls during address matching
ReplaceAllUsesWith doesn't remove the node that was replaced. So it's left around in the graph, messing up use counts on other nodes.

One thing to note is that this isn't valid if the node being deleted is the root node of an LEA match that gets rejected. In that case the node needs to stay alive because the isel table walking code would still have a reference to it that it's going to try to match next. I don't think that's the case here though, because the nodes being deleted here should be "and", "srl", and "zero_extend", none of which can be the root node of an LEA match.

Differential Revision: https://reviews.llvm.org/D61048

llvm-svn: 359121
2019-04-24 18:02:07 +00:00
Simon Pilgrim 10daecba1d [X86][SSE] Add tests for bitcasting vXi1 bool vectors to non-simple types.
llvm-svn: 359116
2019-04-24 17:25:45 +00:00
Sanjay Patel b1b3368907 [x86] make sure horizontal op and broadcast types match to simplify (PR41414)
If the types don't match, we can't just remove the shuffle.
There may be some other opportunity for optimization here,
but this should prevent the crashing seen in:
https://bugs.llvm.org/show_bug.cgi?id=41414

llvm-svn: 359095
2019-04-24 14:05:08 +00:00
Simon Pilgrim 039a563e6a [X86][SSE] Add masked bit test cases for PR26697
llvm-svn: 359082
2019-04-24 10:34:15 +00:00
Francis Visoiu Mistrih 7fee2b89fd [Remarks] Add string deduplication using a string table
* Add support for uniquing strings in the remark streamer and emitting the string table in the remarks section.

* Add parsing support for the string table in the RemarkParser.

From this remark:

```
--- !Missed
Pass:     inline
Name:     NoDefinition
DebugLoc: { File: 'test-suite/SingleSource/UnitTests/2002-04-17-PrintfChar.c',
            Line: 7, Column: 3 }
Function: printArgsNoRet
Args:
  - Callee:   printf
  - String:   ' will not be inlined into '
  - Caller:   printArgsNoRet
    DebugLoc: { File: 'test-suite/SingleSource/UnitTests/2002-04-17-PrintfChar.c',
                Line: 6, Column: 0 }
  - String:   ' because its definition is unavailable'
...
```

to:

```
--- !Missed
Pass: 0
Name: 1
DebugLoc: { File: 3, Line: 7, Column: 3 }
Function: 2
Args:
  - Callee:   4
  - String:   5
  - Caller:   2
    DebugLoc: { File: 3, Line: 6, Column: 0 }
  - String:   6
...
```

And the string table in the .remarks/__remarks section containing:

```
inline\0NoDefinition\0printArgsNoRet\0
test-suite/SingleSource/UnitTests/2002-04-17-PrintfChar.c\0printf\0
will not be inlined into \0 because its definition is unavailable\0
```

This is mostly supposed to be used for testing purposes, but it gives us
a 2x reduction in the remark size, and is an incremental change for the
updates to the remarks file format.

Differential Revision: https://reviews.llvm.org/D60227

llvm-svn: 359050
2019-04-24 00:06:24 +00:00
Francis Visoiu Mistrih 1646851b87 [CGP] Look through bitcasts when duplicating returns for tail calls
The simple case of:

```
int *callee();
void *caller(void *a) {
  if (a == NULL)
    return callee();
  return a;
}
```

would generate a regular call instead of a tail call because we don't
look through the bitcast of the call to `callee` when duplicating the
return blocks.

Differential Revision: https://reviews.llvm.org/D60837

llvm-svn: 359041
2019-04-23 21:57:46 +00:00
Francis Visoiu Mistrih eea9da5921 [X86] Add codegen prepare test exercising a bitcast + tail call
In preparation of https://reviews.llvm.org/D60837, add this test where
we don't perform a tail call because we don't look through a bitcast.

llvm-svn: 359040
2019-04-23 21:57:43 +00:00
Amy Huang fc79ab9857 Revert "[MS] Emit S_HEAPALLOCSITE debug info" because of ToTWin64(db)
buildbot failure.

This reverts commit d07d6d6177 and
c774f687b6.

llvm-svn: 359034
2019-04-23 21:12:58 +00:00
Craig Topper 26518466ef [X86] Autogenerate complete checks. NFC
Prep for D60993

llvm-svn: 359031
2019-04-23 20:52:00 +00:00
Sanjay Patel 7c0bd5a27c [x86] fix test checks for fdiv combine; NFC
Must have picked up some transient code changes when originally generating this.

llvm-svn: 359008
2019-04-23 16:31:30 +00:00
Sanjay Patel 171b74e31c [x86] add tests for vector fdiv with splat divisor; NFC
llvm-svn: 359006
2019-04-23 16:16:16 +00:00
Sanjay Patel 12a561fa1b [x86] use psubus for more vsetcc lowering (PR39859)
Circling back to a leftover bit from PR39859:
https://bugs.llvm.org/show_bug.cgi?id=39859#c1

...we have this counter-intuitive (based on the test diffs) opportunity to use 'psubus'.
This appears to be the better perf option for both Haswell and Jaguar based on llvm-mca.
We already do this transform for the SETULT predicate, so this makes the code more
symmetrical too. If we have pminub/pminuw, we prefer those, so this should not affect
anything but pre-SSE4.1 subtargets.

  $ cat before.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [32768,32768,32768,32768,32768,32768,32768,32768]
	pxor	%xmm0, %xmm2
	pcmpgtw	-32(%rip), %xmm2 ## xmm2 = [255,255,255,255,255,255,255,255]
	pand	%xmm2, %xmm0
	pandn	%xmm1, %xmm2
	por	%xmm2, %xmm0

  $ cat after.s
	movdqa	-16(%rip), %xmm2    ## xmm2 = [256,256,256,256,256,256,256,256]
	psubusw	%xmm0, %xmm2
	pxor	%xmm3, %xmm3
	pcmpeqw	%xmm2, %xmm3
	pand	%xmm3, %xmm0
	pandn	%xmm1, %xmm3
	por	%xmm3, %xmm0

  $ llvm-mca before.s -mcpu=haswell
  Iterations:        100
  Instructions:      600
  Total Cycles:      909
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    0.77
  IPC:               0.66
  Block RThroughput: 1.8

  $ llvm-mca after.s -mcpu=haswell
  Iterations:        100
  Instructions:      700
  Total Cycles:      409
  Total uOps:        700

  Dispatch Width:    4
  uOps Per Cycle:    1.71
  IPC:               1.71
  Block RThroughput: 1.8

Differential Revision: https://reviews.llvm.org/D60838

llvm-svn: 358999
2019-04-23 15:20:17 +00:00
Sanjay Patel 06ff5eae5b [DAGCombiner] generalize binop-of-splats scalarization
If we only match build vectors, we can miss some patterns
that use shuffles as seen in the affected tests.

Note that the underlying calls within getSplatSourceVector()
have the potential for compile-time explosion because of
exponential recursion looking through binop opcodes, but
currently the list of supported opcodes is very limited.
Both of those problems should be addressed in follow-up
patches.

llvm-svn: 358984
2019-04-23 13:16:41 +00:00
Bjorn Pettersson f97b29be88 [DAGCombiner] Combine OR as ADD when no common bits are set
Summary:
The DAGCombiner is rewriting (canonicalizing) an ISD::ADD
with no common bits set in the operands as an ISD::OR node.

This could sometimes result in "missing out" on some
combines that normally are performed for ADD. To be more
specific this could happen if we already have rewritten an
ADD into OR, and later (after legalizations or combines)
we expose patterns that could have been optimized if we
had seen the OR as an ADD (e.g. reassociations based on ADD).

To make the DAG combiner less sensitive to if ADD or OR is
used for these "no common bits set" ADD/OR operations we
now apply most of the ADD combines also to an OR operation,
when value tracking indicates that the operands have no
common bits set.
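
A hedged example of the situation (my own IR):

```llvm
; %hi has its low 4 bits known zero, so the OR below shares no set bits
; with it and is equivalent to an ADD; value tracking now lets the
; ADD combines (e.g. reassociation) apply to it.
define i32 @or_as_add(i32 %x) {
  %hi = shl i32 %x, 4
  %r = or i32 %hi, 3
  ret i32 %r
}
```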

Reviewers: spatel, RKSimon, craig.topper, kparzysz

Reviewed By: spatel

Subscribers: arsenm, rampitec, lebedev.ri, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59758

llvm-svn: 358965
2019-04-23 10:01:08 +00:00
Sanjay Patel bf8aacb715 [SelectionDAG] move splat util functions up from x86 lowering
This was supposed to be NFC, but the change in SDLoc
definitions causes instruction scheduling changes.

There's nothing x86-specific in this code, and it can
likely be used from DAGCombiner's simplifyVBinOp().

llvm-svn: 358930
2019-04-22 22:43:36 +00:00
Simon Pilgrim 6276ce0142 [TargetLowering][AMDGPU][X86] Improve SimplifyDemandedBits bitcast handling
This patch adds support for BigBitWidth -> SmallBitWidth bitcasts, splitting the DemandedBits/Elts accordingly.

The AMDGPU backend needed an extra (srl (and x, c1 << c2), c2) -> (and (srl x, c2), c1) combine to encourage BFE creation. I investigated putting this in DAGCombine, but it caused a lot of noise on other targets - some improvements, some regressions.

The X86 changes are all definite wins.

Differential Revision: https://reviews.llvm.org/D60462

llvm-svn: 358887
2019-04-22 14:04:35 +00:00
Craig Topper df02beb416 [X86] Add the rounding control operand to the printing for some scalar FMA instructions.
llvm-svn: 358844
2019-04-21 07:12:56 +00:00
Craig Topper 3980d1ca6b [X86] Disable argument copy elision for arguments passed via pointers
Summary:
If you pass two 1024-bit vectors in IR with AVX2 on Windows 64, both vectors will be split into four 256-bit pieces. The four pieces of the first argument will be passed indirectly using 4 gprs. The second argument will get passed via pointers in memory.

The PartOffsets stored for the second argument are all in terms of its original 1024-bit size, so the PartOffsets for each piece are 32 bytes apart. So if we consider it for copy elision we'll only load an 8-byte pointer, but we'll move the address 32 bytes. The stack object size we create for the first part is probably wrong too.

This issue was encountered by ISPC. I'm working on getting a reduced test case, but wanted to go ahead and get feedback on the fix.

Reviewers: rnk

Reviewed By: rnk

Subscribers: dbabokin, llvm-commits, hiraditya

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60801

llvm-svn: 358817
2019-04-20 15:26:44 +00:00
Nikita Popov b75c8fc6fb [X86] Fix stack probing on x32 (PR41477)
Fix for https://bugs.llvm.org/show_bug.cgi?id=41477. On the x32 ABI
with stack probing a dynamic alloca will result in a WIN_ALLOCA_32
with a 32-bit size. The current implementation tries to copy it into
RAX, resulting in a physreg copy error. Fix this by copying to EAX
instead. Also fix incorrect opcodes or registers used in subs.

llvm-svn: 358807
2019-04-20 07:25:46 +00:00
Craig Topper 4d4b5d952e [X86] Don't turn (and (shl X, C1), C2) into (shl (and X, (C2 >> C1)), C1) if the original AND can be represented by MOVZX.
The MOVZX doesn't require an immediate to be encoded at all, though it does
use a 2-byte opcode, so it's the same size as a 1-byte immediate. But it has
separate source and dest registers, so it can help avoid copies.
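
A hedged sketch of the case being protected, with invented constants
(C1 = 8, C2 = 0xFFFF): the AND is already a plain movzwl, so reordering
gains nothing:

%s = shl i32 %x, 8
%m = and i32 %s, 65535   ; 0xFFFF: movzwl, no immediate byte needed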

llvm-svn: 358805
2019-04-20 04:38:53 +00:00
Craig Topper 8b8264828c [X86] Turn (and (anyextend (shl X, C1)), C2) into (shl (and (anyextend X), (C2 >> C1)), C1) if the AND could match a movzx.
There's one slight regression in here because we don't check that the immediate
already allowed movzx before the shift. I'll fix that next.

llvm-svn: 358804
2019-04-20 04:38:49 +00:00
Craig Topper f919a2ece4 [X86] Add test case for D60801. NFC
llvm-svn: 358784
2019-04-19 21:39:16 +00:00
Amy Huang c774f687b6 [MS] Emit S_HEAPALLOCSITE debug info
Summary:
This emits labels around heapallocsite calls and S_HEAPALLOCSITE debug
info in CodeView. Currently this only changes FastISel, so emitting labels still
needs to be implemented in SelectionDAG.

Reviewers: hans, rnk

Subscribers: aprantl, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D60800

llvm-svn: 358783
2019-04-19 21:09:11 +00:00
Craig Topper bb769a2946 [X86] Turn (and (shl X, C1), C2) into (shl (and X, (C2 >> C1)), C1) if the AND could match a movzx.
Could get further improvements by recognizing (i64 and (anyext (i32 shl))).
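
A hedged before/after sketch with invented constants (C1 = 8, C2 = 0xFF00,
so C2 >> C1 = 0xFF):

%s = shl i32 %x, 8
%m = and i32 %s, 65280   ; 0xFF00 requires an immediate
=>
%m = and i32 %x, 255     ; 0xFF: just a movzbl
%s = shl i32 %m, 8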

llvm-svn: 358737
2019-04-19 05:48:13 +00:00
Craig Topper 2099ccbe1f [X86] Add test cases for turning (and (shl X, C1), C2) into (shl (and X, (C2 >> C1)), C1) when the AND could match a movzx.
We already reorder when C2 >> C1 would allow a smaller immediate encoding.

llvm-svn: 358736
2019-04-19 05:48:09 +00:00
Simon Pilgrim 4171a91e92 [X86] combineVectorTruncationWithPACKUS - remove split/concatenation of mask
combineVectorTruncationWithPACKUS is currently splitting the upper-bits masking into 128-bit subregs and then concatenating them back together.

This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).

This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.

My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.

Differential Revision: https://reviews.llvm.org/D60375#inline-539623

llvm-svn: 358692
2019-04-18 17:23:09 +00:00
Sanjay Patel 51fa60bcbb [x86] add tests for improved insertelement to index 0 (PR41512); NFC
Patch proposal in D60852.

llvm-svn: 358687
2019-04-18 16:58:50 +00:00
Simon Pilgrim 8f87e53462 [X86][SSE] Lower ICMP EQ(AND(X,C),C) -> SRA(SHL(X,LOG2(C)),BW-1) iff C is power-of-2.
This replaces the MOVMSK combine introduced at D52121/rL342326

(movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C))

with the more general icmp lowering so it can pick up more cases through bitcasts - notably vXi8 cases which use vXi16 shifts+masks, this patch can remove the mask and use pcmpgtb(0,x) for the sra.
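
A scalar sketch of the trick (my example; the shift amount is
BW-1-log2(C), here 7-6 = 1 for i8 and C = 64):

define i1 @bit_test(i8 %x) {
  %a = and i8 %x, 64
  %c = icmp eq i8 %a, 64
  ret i1 %c
}
=>
define i1 @bit_test(i8 %x) {
  %s = shl i8 %x, 1        ; move bit 6 into the sign bit
  %c = icmp slt i8 %s, 0   ; sign test: pcmpgtb(0,x) in the vector case
  ret i1 %c
}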

Differential Revision: https://reviews.llvm.org/D60625

llvm-svn: 358651
2019-04-18 09:58:59 +00:00
Sanjay Patel fb363a778f [x86] try to widen 'shl' as part of LEA formation
The test file has pairs of tests that are logically equivalent:
https://rise4fun.com/Alive/2zQ

%t4 = and i8 %t1, 8
%t5 = zext i8 %t4 to i16
%sh = shl i16 %t5, 2
%t6 = add i16 %sh, %t0
=>
%t4 = and i8 %t1, 8
%sh2 = shl i8 %t4, 2
%z5 = zext i8 %sh2 to i16
%t6 = add i16 %z5, %t0

...so if we can fold the shift op into LEA in the 1st pattern, then we
should be able to do the same in the 2nd pattern (unnecessary 'movzbl'
is a separate bug I think).

We don't want to do this any sooner though because that would conflict
with generic transforms that try to narrow the width of the shift.

Differential Revision: https://reviews.llvm.org/D60789

llvm-svn: 358622
2019-04-17 22:38:51 +00:00
Craig Topper 5ca2e04c7a [X86] Autogenerate complete checks. NFC
llvm-svn: 358556
2019-04-17 06:09:16 +00:00
Sanjay Patel d5bc5ca3e4 [x86] adjust LEA tests for better coverage; NFC
The scale can be 1, 2, or 3.

llvm-svn: 358539
2019-04-16 23:10:41 +00:00
Simon Pilgrim d769bb1e58 [X86][AVX] X86ISD::PERMV/PERMV3 node types can never fold index ops
Improves codegen demonstrated by D60512 - instructions represented by X86ISD::PERMV/PERMV3 can never memory fold the operand used for their index register.

This patch updates the 'isUseOfShuffle' helper to the more capable 'isFoldableUseOfShuffle', which recognises that the op is used for an X86ISD::PERMV/PERMV3 index mask and can't be folded - allowing us to use broadcast/subvector-broadcast ops to reduce the size of the mask constant pool data.

Differential Revision: https://reviews.llvm.org/D60562

llvm-svn: 358516
2019-04-16 19:18:53 +00:00
Sanjay Patel f136c46bd6 [x86] add more tests for LEA formation; NFC
Promoting the shift to the wider type should allow LEA.

llvm-svn: 358513
2019-04-16 18:58:03 +00:00
Craig Topper 0495f29e42 [X86] Limit the 'x' inline assembly constraint to zmm0-15 when used for a 512-bit type.
The 'v' constraint is used to select zmm0-31. This makes 512-bit consistent with 128/256-bit.
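
A hedged IR-level illustration of the constraint difference on a 512-bit
value (the asm template is my own invented example):

define <16 x float> @add512(<16 x float> %a, <16 x float> %b) {
  ; 'x' now selects only zmm0-15 for the 512-bit value; 'v' still allows zmm0-31
  %r = call <16 x float> asm "vaddps $2, $1, $0", "=x,x,x"(<16 x float> %a, <16 x float> %b)
  ret <16 x float> %r
}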

llvm-svn: 358450
2019-04-15 21:06:32 +00:00
Craig Topper 77439bb128 [X86] Fix a stack folding test to have a full xmm2-31 clobber list instead of stopping at xmm15. Add an additional dependency to keep the instruction below the inline asm block.
llvm-svn: 358449
2019-04-15 21:06:23 +00:00
Craig Topper 3d9b47c770 [X86] Block i32/i64 for 'k' and 'Yk' in getRegForInlineAsmConstraint without avx512bw.
32 and 64 bit k-registers require avx512bw. If we don't block this properly, it leads to a crash.

llvm-svn: 358436
2019-04-15 18:39:45 +00:00
Sanjay Patel 8ae68f2648 [x86] update test checks; NFC
llvm-svn: 358432
2019-04-15 17:38:47 +00:00
Craig Topper 8e364c680f [X86] Restore the pavg intrinsics.
The pattern we replaced these with may be too hard to match as demonstrated by
PR41496 and PR41316.
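
For reference, the open-coded rounding-average pattern that had replaced
them looks roughly like this (a scalar sketch of the vector form):

define i8 @avg_round(i8 %a, i8 %b) {
  %za = zext i8 %a to i16
  %zb = zext i8 %b to i16
  %s  = add i16 %za, %zb
  %s1 = add i16 %s, 1      ; round up
  %sh = lshr i16 %s1, 1    ; halve
  %r  = trunc i16 %sh to i8
  ret i8 %r
}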

This patch restores the intrinsics, and then we can start focusing
on optimizing them.

I've mostly reverted the original patch that removed them, though I modified
the avx512 intrinsics to not have masking built in.

Differential Revision: https://reviews.llvm.org/D60674

llvm-svn: 358427
2019-04-15 17:17:35 +00:00
Yevgeny Rouban 3992e9d229 Codegen: Fixed perf branch_weights in couple of tests. NFC.
This is needed to pass future checks of perf branch_weights metadata.

llvm-svn: 358384
2019-04-15 09:30:31 +00:00
Bjorn Pettersson 60569363a5 [SelectionDAG] Use KnownBits::computeForAddSub/computeForAddCarry
Summary:
Use KnownBits::computeForAddSub/computeForAddCarry
in SelectionDAG::computeKnownBits when doing value
tracking for addition/subtraction.

This should improve the precision of the known bits,
as we only used to make a simple estimate of known
zeroes. The KnownBits support functions are also
able to deduce bits that are known to be one in the
result.
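
A hedged example of the extra precision (mine, not from the patch): exact
addition tracking lets a known-one bit survive:

define i32 @low_bit_one(i32 %x) {
  %e = shl i32 %x, 1   ; bit 0 known zero
  %s = add i32 %e, 1   ; computeForAddSub proves bit 0 known one
  %r = and i32 %s, 1   ; now foldable to 1
  ret i32 %r
}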

Reviewers: spatel, RKSimon, nikic, lebedev.ri

Reviewed By: nikic

Subscribers: nikic, javed.absar, lebedev.ri, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60460

llvm-svn: 358372
2019-04-15 07:19:11 +00:00
Craig Topper abd87ff48b [X86] Regenerate checks for domain-reassignment.mir
Apparently there are some stray IMPLICIT_DEF operations that weren't in the
checks. Not sure if they've always been there or something changed at some
point.

llvm-svn: 358371
2019-04-15 05:22:47 +00:00
Amara Emerson 946b1246d6 [GlobalISel] Enable CSE in the IRTranslator & legalizer for -O0 with constants only.
Other opcodes shouldn't be CSE'd until we can be sure debug info quality won't
be degraded.

This change also improves the IRTranslator so that in most places, but not all,
it creates constants using the MIRBuilder directly instead of first creating a
new destination vreg and then creating a constant. By doing this, the
buildConstant() method can just return the vreg of an existing G_CONSTANT
instead of having to create a COPY from it.

I measured a 0.2% improvement in compile time and a 0.9% improvement in code
size at -O0 on ARM64.

Compile time:
Program                                        base   cse    diff
test-suite...ark/tramp3d-v4/tramp3d-v4.test     9.04   9.12  0.8%
test-suite...Mark/mafft/pairlocalalign.test     2.68   2.66 -0.7%
test-suite...-typeset/consumer-typeset.test     5.53   5.51 -0.4%
test-suite :: CTMark/lencod/lencod.test         5.30   5.28 -0.3%
test-suite :: CTMark/Bullet/bullet.test        25.82  25.76 -0.2%
test-suite...:: CTMark/ClamAV/clamscan.test     6.92   6.90 -0.2%
test-suite...TMark/7zip/7zip-benchmark.test    34.24  34.17 -0.2%
test-suite :: CTMark/SPASS/SPASS.test           6.25   6.24 -0.1%
test-suite...:: CTMark/sqlite3/sqlite3.test     1.66   1.66 -0.1%
test-suite :: CTMark/kimwitu++/kc.test         13.61  13.60 -0.0%
Geomean difference                                          -0.2%

Code size:
Program                                        base     cse      diff
test-suite...-typeset/consumer-typeset.test    1315632  1266480 -3.7%
test-suite...:: CTMark/ClamAV/clamscan.test    1313892  1297508 -1.2%
test-suite :: CTMark/lencod/lencod.test        1439504  1423112 -1.1%
test-suite...TMark/7zip/7zip-benchmark.test    2936980  2904172 -1.1%
test-suite :: CTMark/Bullet/bullet.test        3478276  3445460 -0.9%
test-suite...ark/tramp3d-v4/tramp3d-v4.test    8082868  8033492 -0.6%
test-suite :: CTMark/kimwitu++/kc.test         3870380  3853972 -0.4%
test-suite :: CTMark/SPASS/SPASS.test          1434904  1434896 -0.0%
test-suite...Mark/mafft/pairlocalalign.test    764528   764528   0.0%
test-suite...:: CTMark/sqlite3/sqlite3.test    782092   782092   0.0%
Geomean difference                                              -0.9%

Differential Revision: https://reviews.llvm.org/D60580

llvm-svn: 358369
2019-04-15 05:04:20 +00:00
Craig Topper 3c57976447 [X86] Move VPTESTM matching from the isel table to custom code in X86ISelDAGToDAG.
We had many tablegen patterns for these instructions. And due to the
commutability of the patterns, tablegen expands them to even more patterns. All
together, VPTESTMD patterns accounted for more than 50K of the 610K isel table.
This had gotten bad when we stopped canonicalizing AND to vXi64. This required
a pattern for every combination of bitcast input type.

This change moves the matching to custom code where it is easier to look through
the bitcasts without being concerned with the specific types.
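
The per-lane idiom being matched is the usual test-and-form-mask pattern; a
hedged sketch (with the bitcasts the custom code looks through omitted):

define <16 x i1> @testm(<16 x i32> %x, <16 x i32> %y) {
  %a = and <16 x i32> %x, %y
  %m = icmp ne <16 x i32> %a, zeroinitializer   ; selects to vptestmd
  ret <16 x i1> %m
}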

The test changes are because we are now stricter with one-use checks, as it's
required to make load folding legal. We now require the AND and any BITCAST to
only have a single use. This prevents forming VPTESTM and a VPAND with the same
inputs.

We now support broadcast loads for 128/256-bit patterns without VLX. We'll
widen them to 512-bit and still fold the broadcast, since the amount of memory
read doesn't change.

There are a few tests that got slightly longer because we are now preferring
load + VPTESTM over XOR+VPCMPEQ for (seteq (load), allzeros). Previously we were
able to share the XOR with multiple VPTESTM instructions.

llvm-svn: 358359
2019-04-14 18:26:11 +00:00
Craig Topper b17e5ec61b [X86] Don't form masked vpcmp/vcmp/vptestm operations if the setcc node has more than one use.
We're better off emitting a single compare + kand rather than a compare for the
other use and a masked compare.

I'm looking into using custom instruction selection for VPTESTM to reduce the
ridiculous number of permutations of patterns in the isel table. Putting a one
use check on all masked compare folding makes load fold matching in the custom
code easier.

llvm-svn: 358358
2019-04-14 18:26:06 +00:00
Craig Topper 476dd06854 [X86] Update bool_reduction_v8f32 test cases from vector-compare-any_of.ll and vector-compare-all_of.ll to be proper reductions.
One of the shuffles was used twice, while the intended shuffle wasn't connected.

llvm-svn: 358346
2019-04-14 04:20:42 +00:00
Bill Wendling 191f1487b6 [X86] Use PC-relative mode for the kernel code model
Summary:
The Linux kernel uses PC-relative mode, so allow that when the code model is
"kernel".

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits, kees, nickdesaulniers

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60643

llvm-svn: 358343
2019-04-13 21:39:28 +00:00
Sanjay Patel 5e4ad39af7 [DAGCombiner] narrow shuffle of concatenated vectors
// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)

The ARM changes with 'vtrn' and narrowed 'vuzp' are improvements.

The x86 changes look neutral or better. There's one test with an
extra instruction, but that could be reversed for a subtarget with
the right attributes. But by default, we want to avoid the 256-bit
op when possible (in my motivating benchmark, a handful of ymm ops
sprinkled into a sequence of xmm ops are triggering frequency
throttling on Haswell resulting in significantly worse perf).

Differential Revision: https://reviews.llvm.org/D60545

llvm-svn: 358291
2019-04-12 16:31:56 +00:00
Simon Pilgrim 6c8f4ada36 [X86][SSE] Recognise vXi1 boolean anyof/allof reduction patterns
Currently combineHorizontalPredicateResult only handles anyof/allof reduction patterns of legal types, which can be tricky to match as type legalization of bools can introduce bitcasts/truncs/extensions.

This patch extends combineHorizontalPredicateResult to recognise vXi1 bool reductions as well and uses the existing combineBitcastvxi1 helper to create the MOVMSK necessary to then compare the signmask result.
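
For instance, an allof-style vXi1 reduction (hedged sketch; the lane count
is invented):

define i1 @allof(<4 x i32> %x, <4 x i32> %y) {
  %c = icmp eq <4 x i32> %x, %y
  %b = bitcast <4 x i1> %c to i4
  %r = icmp eq i4 %b, 15   ; -> movmskps + cmp $0xf
  ret i1 %r
}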

This ensures the accuracy of the reduction costs added in D60403 which assume the MOVMSK generation.

Differential Revision: https://reviews.llvm.org/D60610

llvm-svn: 358286
2019-04-12 14:22:57 +00:00
Hans Wennborg 4e6b857922 Revert r358268 "[DebugInfo] DW_OP_deref_size in PrologEpilogInserter."
It causes clang to crash while building Chromium. See https://crbug.com/952230
for reproducer.

> The PrologEpilogInserter need to insert a DW_OP_deref_size before
> prepending a memory location expression to an already implicit
> expression to avoid having the existing expression act on the memory
> address instead of the value behind it.
>
> The reason for using DW_OP_deref_size and not plain DW_OP_deref is that
> big-endian targets need to read the right size as simply truncating a
> larger read would yield the wrong result (LSB bytes are not at the lower
> address).
>
> Differential Revision: https://reviews.llvm.org/D59687

llvm-svn: 358281
2019-04-12 12:54:52 +00:00
Markus Lavin 138c76129b [DebugInfo] DW_OP_deref_size in PrologEpilogInserter.
The PrologEpilogInserter need to insert a DW_OP_deref_size before
prepending a memory location expression to an already implicit
expression to avoid having the existing expression act on the memory
address instead of the value behind it.

The reason for using DW_OP_deref_size and not plain DW_OP_deref is that
big-endian targets need to read the right size as simply truncating a
larger read would yield the wrong result (LSB bytes are not at the lower
address).

Differential Revision: https://reviews.llvm.org/D59687

llvm-svn: 358268
2019-04-12 08:23:55 +00:00
Craig Topper 3b1239d2a8 [TargetLowering][X86] Teach SimplifyDemandedBits to use ShrinkDemandedOp on ISD::SHL nodes.
If the upper bits of the SHL result aren't used, we might be able to use a narrower shift. For example, on X86 this can turn a 64-bit shift into a 32-bit shift, enabling a smaller encoding.
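
A hedged before/after sketch (mine): only the low 32 bits of the shift are
demanded, so the whole operation can shrink to i32:

%s = shl i64 %x, 3
%t = trunc i64 %s to i32
=>
%x32 = trunc i64 %x to i32
%t   = shl i32 %x32, 3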

Differential Revision: https://reviews.llvm.org/D60358

llvm-svn: 358257
2019-04-12 06:49:28 +00:00
Aaron Smith 994023a3f1 [DebugInfo] Combine Trivial and NonTrivial flags
Summary:
Companion to https://reviews.llvm.org/D59347


Reviewers: rnk, zturner, probinson, dblaikie, deadalnix

Subscribers: aprantl, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59348

llvm-svn: 358220
2019-04-11 20:25:10 +00:00
Craig Topper 68a5d619a4 [X86] Restrict vselect handling in scalarizeExtEltFP to only the pre-type-legalization case where the setcc result type is vXi1.
If the vector setcc has been legalized then we will need to convert a vector boolean of 0 or -1 to a scalar boolean of 0 or 1.

The added test case previously crashed in 32-bit mode by creating a setcc with an i64 condition that type legalization couldn't expand.

llvm-svn: 358218
2019-04-11 19:57:44 +00:00
Craig Topper a3635b94c4 [X86] Add 32-bit command line to extractelement-fp.ll so I can add a test case for a 32-bit only crasher. NFC
This is a bit ugly because of ABI differences in how floats/doubles are returned.

llvm-svn: 358217
2019-04-11 19:57:24 +00:00
Craig Topper 586fad50ac [X86] Add patterns for using movss/movsd for atomic load/store of f32/64. Remove atomic fadd pseudos; use isel patterns instead.
This patch adds patterns for turning bitcasted atomic load/store into movss/sd.
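
A hedged sketch of the load half, in typed-pointer IR of the era:

define float @atomic_load_f32(i32* %p) {
  %i = load atomic i32, i32* %p seq_cst, align 4
  %f = bitcast i32 %i to float   ; now selects to movss
  ret float %f
}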

It also removes the pseudo instructions for atomic RMW fadd, instead just adding isel patterns for folding an atomic load into addss/sd and relying on the new movss/sd store pattern to handle the write part.

This also makes the fadd patterns use VEX and EVEX instructions when AVX or AVX512F are enabled.

Differential Revision: https://reviews.llvm.org/D60394

llvm-svn: 358215
2019-04-11 19:19:52 +00:00
Craig Topper f7e548c076 Recommit r358211 "[X86] Use FILD/FIST to implement i64 atomic load on 32-bit targets with X87, but no SSE2"
With correct test checks this time.

If we have X87, but not SSE2, we can atomically load an i64 value into the significand of an 80-bit extended precision x87 register using fild. We can then use a fist instruction to convert it back to an i64 integer.
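
A hedged sketch of the case this handles (typed-pointer IR of the era):

define i64 @atomic_load_i64(i64* %p) {
  %v = load atomic i64, i64* %p seq_cst, align 8   ; fild m64 / fistp m64 when only X87 is available
  ret i64 %v
}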

This matches what gcc and icc do for this case and removes an existing FIXME.

llvm-svn: 358214
2019-04-11 19:19:42 +00:00