Commit Graph

13181 Commits

Author SHA1 Message Date
Craig Topper 3f3b8ef442 [X86] Remove mask parameter from vpshufbitqmb intrinsics. Change result to a vXi1 vector.
The input mask can be represented with an AND in IR.

Fixes PR40258

llvm-svn: 351028
2019-01-14 00:03:50 +00:00
Simon Pilgrim 56ba1db933 [DAGCombiner] If add_sat(x,y) can't overflow -> add(x,y)
NOTE: We need more powerful signed overflow detection in computeOverflowKind
llvm-svn: 351026
2019-01-13 22:08:26 +00:00
Simon Pilgrim 897d4c6fe9 [DAGCombiner] Some very basic add/sub saturation combines.
Handle combines with zero and constant canonicalization for adds.

llvm-svn: 351024
2019-01-13 21:50:24 +00:00
Simon Pilgrim 9961c55e28 [X86] Add some basic add/sub saturation combine tests.
The actual combines will be added in a future commit.

llvm-svn: 351023
2019-01-13 21:21:46 +00:00
Simon Pilgrim a0069ba0db [X86] More aggressive shuffle mask widening in combineExtractWithShuffle
Use demanded extract index to set most of the shuffle mask to undef, making it easier to widen and peek through.

llvm-svn: 351013
2019-01-12 16:38:56 +00:00
Sanjay Patel 625d5aef62 [DAGCombiner] fold insert_subvector of insert_subvector
This pattern:

    t33: v8i32 = insert_subvector undef:v8i32, t35, Constant:i64<0>
  t21: v16i32 = insert_subvector undef:v16i32, t33, Constant:i64<0>

...shows up in PR33758:
https://bugs.llvm.org/show_bug.cgi?id=33758
...although this patch doesn't make any difference to the final result on that yet.

In the affected tests here, it looks like it just makes RA wiggle. But we might 
as well squash this to prevent it interfering with other pattern-matching.

Differential Revision:
https://reviews.llvm.org/D56604

llvm-svn: 351008
2019-01-12 15:12:28 +00:00
Nikita Popov 537b319860 [X86] Add more usub.sat vector tests; NFC
Add additional vXi32 and vXi64 tests.

llvm-svn: 351003
2019-01-12 11:43:04 +00:00
Simon Pilgrim a21e2bd682 [X86] Improve vXi64 ISD::ABS codegen with SSE41+
Make use of vblendvpd to select on the signbit

Differential Revision: https://reviews.llvm.org/D56544

llvm-svn: 350999
2019-01-12 10:28:12 +00:00
Simon Pilgrim ca0de0363b [X86][AARCH64] Improve ISD::ABS support
This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types.

Differential Revision: https://reviews.llvm.org/D56544

llvm-svn: 350998
2019-01-12 09:59:32 +00:00
Craig Topper 33b2cf50e3 [X86] Add ISD node for masked version of CVTPS2PH.
The 128-bit input produces 64-bits of output and fills the upper 64-bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do.

This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there.

Fixes another instruction for PR34877.

llvm-svn: 350994
2019-01-12 08:05:12 +00:00
Craig Topper bf61525e8c [X86] When lowering v1i1/v2i1/v4i1/v8i1 load/store with avx512f, but not avx512dq, use v16i1 as the intermediate mask type instead of v8i1.
We still use i8 for the load/store type. So we need to convert to/from i16 to around the mask type.

By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned.

llvm-svn: 350989
2019-01-12 02:22:10 +00:00
Craig Topper abe6ef8d09 [X86] Add ISD nodes for masked truncate so we can properly represent when the output has more elements than the input due to needing to be 128 bits.
We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask.

This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that.

This fixes some of PR34877

llvm-svn: 350985
2019-01-12 00:55:27 +00:00
Pirama Arumuga Nainar cc07dabdaa [Legalizer] Use correct ValueType of SELECT_CC node during Float promotion
Summary:
When legalizing the result of a SELECT_CC node by promoting the
floating-point type, use the promoted-to type rather than the original
type.

Fix PR40273.

Reviewers: efriedma, majnemer

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56566

llvm-svn: 350951
2019-01-11 18:46:02 +00:00
Sanjay Patel 40cd4b77e9 [x86] allow insert/extract when matching horizontal ops
Previously, we limited this transform to cases where the
extraction into the build vector happens from vectors of
the same type as the build vector, but that's not required.

There's a slight potential regression seen in the AVX512
result for phadd -- we're using the 256-bit flavor of the
instruction now even though the 128-bit subset is sufficient.
The same problem could already be seen in the AVX2 result.
Follow-up patches will attempt to narrow that back down.

llvm-svn: 350928
2019-01-11 14:27:59 +00:00
Craig Topper b97885cc2e [X86] Change vXi1 extract_vector_elt lowering to be legal if the index is 0. Add DAG combine to turn scalar_to_vector+extract_vector_elt into extract_subvector.
We were lowering the last step extract_vector_elt to a bitcast+truncate. Change it to use an extract_vector_elt of index 0 instead. Add isel patterns to do the equivalent of what the bitcast would have done. Plus an isel pattern for an any_extend+extract to prevent some regressions.

Finally add a DAG combine to turn v1i1 scalar_to_vector+extract_vector_elt of 0 into an extract_subvector.

This fixes some of the regressions from D350800.

llvm-svn: 350918
2019-01-11 05:44:56 +00:00
Craig Topper 844f989608 [X86] Call SimplifyDemandedBits on conditions of X86ISD::SHRUNKBLEND
This extends to combineVSelectToShrunkBlend to be able to resimplify SHRUNKBLENDS that have already been created.

This should help some of the regressions from D56387

Differential Revision: https://reviews.llvm.org/D56421

llvm-svn: 350875
2019-01-10 19:05:34 +00:00
Sanjay Patel 87ae1460f7 [x86] fix remaining miscompile bug in horizontal binop matching (PR40243)
When we use the partial-matching function on a 128-bit chunk, we must 
account for the possibility that we've matched undef halves of the
original source vectors, so the outputs may need to be reset.

This should allow closing PR40243:
https://bugs.llvm.org/show_bug.cgi?id=40243

llvm-svn: 350830
2019-01-10 15:27:23 +00:00
Sanjay Patel ed5cfc6792 [x86] fix horizontal binop matching for 256-bit vectors (PR40243)
This is a partial fix for:
https://bugs.llvm.org/show_bug.cgi?id=40243
...as seen in the integer test, we still need to correct the result when using the 
existing (old) horizontal op matching function because it does not model the way 
x86 256-bit horizontal ops return results (each 128-bit half is its own horizontal-op). 
A potential follow-up change for that is discussed in the bug report - see also D56490.

This generally duplicates a lot of the existing matching code, but we can't just remove 
that without introducing regressions, so the existing code is renamed and used less often. 
Follow-ups may try to reduce that overlap.

Differential Revision: https://reviews.llvm.org/D56450

llvm-svn: 350826
2019-01-10 15:04:52 +00:00
Simon Pilgrim 8c221b3f76 [X86] Add SSE41 vector abs tests
llvm-svn: 350822
2019-01-10 14:26:15 +00:00
Craig Topper 5d20eb240f [X86] Disable DomainReassignment pass when AVX512BW is disabled to avoid injecting VK32/VK64 references into the MachineIR
Summary:
This pass replaces GR8/GR16/GR32/GR64 with their equivalent sized mask register classes. But VK32/VK64 aren't legal without AVX512BW. Apparently this mostly appears to work if the register coalescer is able to remove the VK32/VK64 register class reference. Or if we don't ever spill it. But there's no guarantee of that.

Another Intel employee managed to trigger a crash due to this with ISPC. Unfortunately, I've lost the test case he sent me at the time. I'm trying to get him to reproduce it for me. I'd like to get this in before 8.0 branches since its a little scary.

The regressions here are unfortunate, but I think we can make some improvements to DAG combine, load folding, etc. to fix them. Just not sure if we can get that done for 8.0.

Fixes PR39741

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56460

llvm-svn: 350800
2019-01-10 07:43:54 +00:00
Sanjay Patel 15f2a4d1b9 [x86] use 'nounwind' to remove test noise; NFC
llvm-svn: 350745
2019-01-09 17:29:18 +00:00
Simon Pilgrim fcabfc8666 [X86][SSE] Cleanup shuffle combining test check prefixes
Share prefixes whenever possible, use X86 instead of X32.

llvm-svn: 350722
2019-01-09 13:46:14 +00:00
Simon Pilgrim 5a7132ff0f [X86] Enable combining shuffles to PACKSS/PACKUS for 256/512-bit vectors
llvm-svn: 350716
2019-01-09 13:23:28 +00:00
Simon Pilgrim 17eace47cb [X86] Add extra test coverage for combining shuffles to PACKSS/PACKUS
llvm-svn: 350707
2019-01-09 12:34:10 +00:00
Sanjay Patel 6a18c531ae [x86] add tests for PR40243; NFC
llvm-svn: 350646
2019-01-08 19:15:21 +00:00
Wei Mi 2645fd0ece [RegisterCoalescer] dst register's live interval needs to be updated when
merging a src register in ToBeUpdated set.

This is to fix PR40061 related with https://reviews.llvm.org/rL339035.

In https://reviews.llvm.org/rL339035, live interval of source pseudo register
in rematerialized copy may be saved in ToBeUpdated set and its update may be
postponed.

In PR40061, %t2 = %t1 is rematerialized and %t1 is added into toBeUpdated set
to postpone its live interval update. After the rematerialization, the live
interval of %t1 is larger than necessary. Then %t1 is merged into %t3 and %t1
gets removed. After the merge, %t3 contains live interval larger than necessary.
Because %t3 is not in toBeUpdated set, its live interval is not updated after
register coalescing and it will break some assumption in regalloc.

The patch requires the live interval of destination register in a merge to be
updated if the source register is in ToBeUpdated.

Differential revision: https://reviews.llvm.org/D55867

llvm-svn: 350586
2019-01-08 00:26:11 +00:00
Craig Topper 486313b5f7 Recommit r350554 "[X86] Remove AVX512VBMI2 concat and shift intrinsics. Replace with target independent funnel shift intrinsics."
The MSVC limit we hit on AutoUpgrade.cpp has been worked around for now.

llvm-svn: 350567
2019-01-07 21:00:32 +00:00
Craig Topper fad1589f39 Revert r350554 "[X86] Remove AVX512VBMI2 concat and shift intrinsics. Replace with target independent funnel shift intrinsics."
The AutoUpgrade.cpp if/else cascade hit an MSVC limit again.

llvm-svn: 350562
2019-01-07 19:39:05 +00:00
Craig Topper 9c4f7e9147 [X86] Remove AVX512VBMI2 concat and shift intrinsics. Replace with target independent funnel shift intrinsics.
Differential Revision: https://reviews.llvm.org/D56377

llvm-svn: 350554
2019-01-07 19:10:12 +00:00
Simon Pilgrim 32f77f2b52 [X86] Add OR(AND(X,C),AND(Y,~C)) bit select tests
Based off work for D55935

llvm-svn: 350548
2019-01-07 18:07:56 +00:00
Sanjay Patel 47f92d3270 [x86] add more tests for LowerToHorizontalOp(); NFC
These tests show missed optimizations and a miscompile
similar to PR40243 - https://bugs.llvm.org/show_bug.cgi?id=40243

llvm-svn: 350533
2019-01-07 16:10:14 +00:00
Craig Topper 1ac0839098 [X86] Update VBMI2 vshld/vshrd tests to use an immediate that doesn't require a modulo.
Planning to replace these with funnel shift intrinsics which would mask out the extra bits. This will help minimize test diffs.

llvm-svn: 350504
2019-01-07 05:58:53 +00:00
Craig Topper 6ffeeb705f [X86] Add support for matching vector funnel shift to AVX512VBMI2 instructions.
Summary: AVX512VBMI2 supports a funnel shift by immediate and a funnel shift by a variable vector.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56361

llvm-svn: 350498
2019-01-06 18:10:18 +00:00
Craig Topper d0ba531a0c [X86] Use two pmovmskbs in combineBitcastvxi1 for (i64 (bitcast (v64i1 (truncate (v64i8)))) on KNL.
llvm-svn: 350481
2019-01-05 22:42:58 +00:00
Craig Topper 46f8b4a11e [X86] Allow combinevxi1Bitcast to use pmovmskb on avx512 targets if the input is a truncate from v16i8/v32i8.
This is especially helpful on targets without avx512bw since we don't have a good way to convert from v16i8/v32i8 to v16i1/v32i1 for the truncate anyway. If we're just going to convert it to a GPR we might as well use pmovmskb to accomplish both.

llvm-svn: 350480
2019-01-05 21:40:07 +00:00
Craig Topper 27406e1f9e [X86] Regenerate test to merge 32-bit and 64-bit check lines. NFC
llvm-svn: 350474
2019-01-05 19:19:37 +00:00
Craig Topper 3f48dbf72e [X86] Allow LowerTRUNCATE to use PACKUS/PACKSS for v16i16->v16i8 truncate when -mprefer-vector-width-256 is in effect and BWI is not available.
llvm-svn: 350473
2019-01-05 18:48:11 +00:00
Vyacheslav Zakharin 0a6f86c54b Update the pr_datasz of .note.gnu.property section.
Patch by Xiang Zhang.

Differential Revision: https://reviews.llvm.org/D56080

llvm-svn: 350436
2019-01-04 21:25:01 +00:00
Craig Topper cfeb1cf9af [X86] Add INSERT_SUBVECTOR to ComputeNumSignBits
This adds support for calculating sign bits of insert_subvector. I based it on the computeKnownBits.

My motivating case is propagating sign bits information across basic blocks on AVX targets where concatenating using insert_subvector is common.

Differential Revision: https://reviews.llvm.org/D56283

llvm-svn: 350432
2019-01-04 20:50:59 +00:00
Sanjay Patel 6a5656703e [x86] add tests for potential horizontal vector ops; NFC
These are modified versions of the FP tests from rL349923.

llvm-svn: 350430
2019-01-04 20:14:53 +00:00
Sanjay Patel 6153565511 [x86] lower extracted fadd/fsub to horizontal vector math; 2nd try
The 1st try for this was at rL350369, but it caused IR-level diffs because
our cost models differentiate custom vs. legal/promote lowering. So that was
reverted at rL350373. The cost models were fixed independently at rL350403,
so this is effectively the same patch as last time.

Original commit message:
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.

We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.

We can extend this to integer ops in a follow-up patch.

Differential Revision: https://reviews.llvm.org/D56011

llvm-svn: 350421
2019-01-04 17:48:13 +00:00
Simon Pilgrim 9f4dea8c06 [X86] Add VPSLLI/VPSRLI ((X >>u C1) << C2) SimplifyDemandedBits combine
Repeat of the generic SimplifyDemandedBits shift combine

llvm-svn: 350399
2019-01-04 15:43:43 +00:00
Simon Pilgrim 7ee2285625 [X86] Split immediate shifts tests. NFCI.
A future patch will combine logical shifts more aggressively.

llvm-svn: 350396
2019-01-04 14:56:10 +00:00
Craig Topper 6265a15f2e [X86] Add post-isel peephole to fold KAND+KORTEST into KTEST if only the zero flag is used.
Doing this late so we will prefer to fold the AND into a masked comparison first. That can be better for the live range of the mask register.

Differential Revision: https://reviews.llvm.org/D56246

llvm-svn: 350374
2019-01-04 00:10:58 +00:00
Sanjay Patel 26ce9c38a7 revert r350369: [x86] lower extracted fadd/fsub to horizontal vector math
There are non-codegen tests that need to be updated with this code change.

llvm-svn: 350373
2019-01-04 00:02:02 +00:00
Sanjay Patel ef4afca2ad [x86] lower extracted fadd/fsub to horizontal vector math
This would show up if we fix horizontal reductions to narrow as they go along, 
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.

We need to do this late to not interfere with other pattern matching of larger 
horizontal sequences.

We can extend this to integer ops in a follow-up patch.

Differential Revision: https://reviews.llvm.org/D56011

llvm-svn: 350369
2019-01-03 23:16:19 +00:00
Sanjay Patel b8687c2168 [x86] add 512-bit vector tests for horizontal ops; NFC
llvm-svn: 350364
2019-01-03 22:55:18 +00:00
Sanjay Patel ac23c46883 [x86] add AVX512 runs for horizontal ops; NFC
llvm-svn: 350362
2019-01-03 22:42:32 +00:00
Craig Topper 58c61dce1d [X86] Add test case for D56283.
This tests a case where we need to be able to compute sign bits for two insert_subvectors that is a liveout of a basic block. The result is then used as a boolean vector in another basic block.

llvm-svn: 350359
2019-01-03 22:31:07 +00:00
Sanjay Patel 6b8a9dbfc4 [x86] remove dead CHECK lines from test file; NFC
llvm-svn: 350358
2019-01-03 22:30:36 +00:00
Sanjay Patel fd58d623ff [x86] split tests for FP and integer horizontal math
These are similar patterns, but when you throw AVX512 onto the pile,
the number of variations explodes. For FP, we really don't care about
AVX1 vs. AVX2 for FP ops. There may be some superficial shuffle diffs,
but that's not what we're testing for here, so I removed those RUNs.

Separating by type also lets us specify 'sse3' for the FP file vs. 'ssse3'
for the integer file...because x86.

llvm-svn: 350357
2019-01-03 22:26:51 +00:00
Sanjay Patel 8db27b31ac [x86] add common FileCheck prefix to reduce assert duplication; NFC
llvm-svn: 350356
2019-01-03 22:11:14 +00:00
Sanjay Patel 9633d76a40 [DAGCombiner][x86] scalarize binop followed by extractelement
As noted in PR39973 and D55558:
https://bugs.llvm.org/show_bug.cgi?id=39973
...this is a partial implementation of a fold that we do as an IR canonicalization in instcombine:

// extelt (binop X, Y), Index --> binop (extelt X, Index), (extelt Y, Index)

We want to have this in the DAG too because as we can see in some of the test diffs (reductions), 
the pattern may not be visible in IR.

Given that this is already an IR canonicalization, any backend that would prefer a vector op over 
a scalar op is expected to already have the reverse transform in DAG lowering (not sure if that's
a realistic expectation though). The transform is limited with a TLI hook because there's an
existing transform in CodeGenPrepare that tries to do the opposite transform.

Differential Revision: https://reviews.llvm.org/D55722

llvm-svn: 350354
2019-01-03 21:31:16 +00:00
Sanjay Patel 4e71ff234e [x86] add tests for buildvector with extracted element; NFC
llvm-svn: 350338
2019-01-03 17:55:32 +00:00
Simon Pilgrim 44d6b25d2c [X86] Cleanup saturated add/sub tests
Use X86/X64 check prefixes
Use nounwind to reduce cfi noise

llvm-svn: 350301
2019-01-03 12:31:13 +00:00
Markus Lavin 72b9deb21f [CodeGen] Skip over dbg-instr in twoaddr pass
A DBG_VALUE between a two-address instruction and a following COPY
would prevent rescheduleMIBelowKill optimization inside
TwoAddressInstructionPass.

Differential Revision: https://reviews.llvm.org/D55987

llvm-svn: 350289
2019-01-03 08:36:06 +00:00
Craig Topper 5ef47ad82e [X86] Add test cases for opportunities to use KTEST when check if the result of ANDing two mask registers is zero.
The test cases are constructed to avoid folding the AND into a masked compare operation.

Currently we emit a KAND and a KORTEST for these cases.

llvm-svn: 350287
2019-01-03 07:12:54 +00:00
Craig Topper df5304d8de [X86] Add load folding support to the custom isel we do for X86ISD::UMUL/SMUL.
The peephole pass isn't always able to fold the load because it can't commute the implicit usage of AL/AX/EAX/RAX.

llvm-svn: 350272
2019-01-02 23:24:08 +00:00
Craig Topper ce46bfa848 [X86] Add test cases to show that we fail to fold loads into i8 smulo and i8/i16/i32/i64 umulo lowering without the assistance of the peephole pass. NFC
llvm-svn: 350271
2019-01-02 23:24:03 +00:00
Craig Topper 9d4860ec4e [X86] Remove X86ISD::INC/DEC. Just select them from X86ISD::ADD/SUB at isel time
INC/DEC are pretty much the same as ADD/SUB except that they don't update the C flag.

This patch removes the special nodes and just pattern matches from ADD/SUB during isel if the C flag isn't being used.

I had to avoid selecting DEC is the result isn't used. This will become a SUB immediate which will turned into a CMP later by optimizeCompareInstr. This lead to the one test change where we use a CMP instead of a DEC for an overflow intrinsic since we only checked the flag.

This also exposed a hole in our RMW flag matching use of hasNoCarryFlagUses. Our root node for the match is a store and there's no guarantee that all the flag users have been selected yet. So hasNoCarryFlagUses needs to check copyToReg and machine opcodes, but it also needs to check for the pre-match SETCC, SETCC_CARRY, BRCOND, and CMOV opcodes.

Differential Revision: https://reviews.llvm.org/D55975

llvm-svn: 350245
2019-01-02 19:01:05 +00:00
Craig Topper 8dd7bd2cd7 [DAGCombiner] After performing the division by constant optimization for a DIV or REM node, replace the users of the corresponding REM or DIV node if it exists.
Currently we expand the two nodes separately. This gives DAG combiner an opportunity to optimize the expanded sequence taking into account only one set of users. When we expand the other node we'll create the expansion again, but might not be able to optimize it the same way. So the nodes won't CSE and we'll have two similarish sequences in the same basic block. By expanding both nodes at the same time we'll avoid prematurely optimizing the expansion until both the division and remainder have been replaced.

Improves the test case from PR38217. There may be additional opportunities after this.

Differential Revision: https://reviews.llvm.org/D56145

llvm-svn: 350239
2019-01-02 18:19:07 +00:00
Craig Topper 3109f3a4ab [LegalizeIntegerTypes] When promoting the result of an extract_vector_elt also promote the input type if necessary
By also promoting the input type we get a better idea for what scalar type to use. This can provide better results if the result of the extract is sign extended. What was previously happening is that the extract result would be legalized, sometime later the input of the sign extend would be legalized using the result of the extract. Then later the extract input would be legalized forcing a truncate into the input of the sign extend using a replace all uses. This requires DAG combine to combine out the sext/truncate pair. But sometimes we visited the truncate first and messed things up before the sext could be combined.

By creating the extract with the correct scalar type when we create legalize the result type, the truncate will be added right away. Then when the sign_extend input is legalized it will create an any_extend of the truncate which can be optimized by getNode to maybe remove the truncate. And then a sign_extend_inreg. Now DAG combine doesn't have to worry about getting rid of the extend.

This fixes the regression on X86 in D56156.

Differential Revision: https://reviews.llvm.org/D56176

llvm-svn: 350236
2019-01-02 17:58:30 +00:00
Craig Topper c562fae02b [DAGCombiner][X86][PowerPC] Teach visitSIGN_EXTEND_INREG to fold (sext_in_reg (aext/sext x)) -> (sext x) when x has more than 1 sign bit and the sext_inreg is from one of them.
If x has multiple sign bits than it doesn't matter which one we extend from so we can sext from x's msb instead.

The X86 setcc-combine.ll changes are a little weird. It appears we ended up with a (sext_inreg (aext (trunc (extractelt)))) after type legalization. The sext_inreg+aext now gets optimized by this combine to leave (sext (trunc (extractelt))). Then we visit the trunc before we visit the sext. This ends up changing the truncate to an extractvectorelt from a bitcasted vector. I have a follow up patch to fix this.

Differential Revision: https://reviews.llvm.org/D56156

llvm-svn: 350235
2019-01-02 17:58:27 +00:00
Simon Pilgrim d8125726d5 [X86] Support SHLD/SHRD masked shift-counts (PR34641)
Peek through shift modulo masks while matching double shift patterns.

I was hoping to delay this until I could remove the X86 code with generic funnel shift matching (PR40081) but this will do for now.

Differential Revision: https://reviews.llvm.org/D56199

llvm-svn: 350222
2019-01-02 17:05:37 +00:00
Sanjay Patel eafd481aad [x86] add more tests for potential horizontal ops; NFC
As discussed in D56011 - add runs for AVX512 and tests with extra uses.

llvm-svn: 350221
2019-01-02 16:36:04 +00:00
Craig Topper 8969720787 [X86] Add i8/i16 smulo/umulo test cases where the overflow indication is used by a mask.
llvm-svn: 350204
2019-01-02 05:46:02 +00:00
Craig Topper 6f2feb8293 [X86] Remove KNL specific check prefix from xmulo.ll test. NFC
This was added at a time when i1 was a legal type with avx512f and there was a bug. i1 is no longer considered a legal type with avx512f so there should be no codegen difference.

llvm-svn: 350203
2019-01-02 05:46:00 +00:00
Craig Topper 00b390a000 [X86] Factor the core code out of LowerXALUO into a helper function. Use it in LowerBRCOND and LowerSELECT to avoid some duplicated code.
This makes it easier to keep the LowerBRCOND and LowerSELECT code in sync with LowerXALUO so they always pick the same operation for overflowing instructions.

This is inspired by the helper functions used by ARM and AArch64 for the same purpose.

The test change is because LowerSELECT was not in sync with LowerXALUO with regard to INC/DEC for SADDO/SSUBO.

llvm-svn: 350198
2019-01-01 19:34:11 +00:00
Craig Topper a728214203 [X86] Remove KNL specific check prefix from xaluo.ll test. NFC
This was added at a time when i1 was a legal type with avx512f and there was a bug. i1 is no longer considered a legal type with avx512f so there should be no codegen difference.

llvm-svn: 350195
2019-01-01 18:44:44 +00:00
Craig Topper 9478492a80 [X86] Add test cases to show where LowerSELECT doesn't select SADDO/SSUBO to INC/DEC, but LowerXALUOOp does. Leading to duplicate code.
When SADDO/SSUBO is used as a part of a condition, the X86 backend has to lower the instruction twice. One for the flags use and then once for the data use. These two selections should be kept in sync so they end up with one node providing the data and the flags. This doesn't seem to be happening for INC/DEC.

llvm-svn: 350194
2019-01-01 18:44:42 +00:00
Simon Pilgrim 8b503c795e [X86] Add PR34641 masked shld/shrd test cases
llvm-svn: 350181
2018-12-31 19:46:18 +00:00
Craig Topper c25f1f8f17 [X86] Add additional RUN lines to prepare for D56156. NFC
llvm-svn: 350180
2018-12-31 19:09:32 +00:00
Craig Topper ed3ffae4a4 [SelectionDAG] Add SIGN_EXTEND_VECTOR_INREG support to computeKnownBits.
Differential Revision: https://reviews.llvm.org/D56168

llvm-svn: 350179
2018-12-31 19:09:30 +00:00
Craig Topper bb0873cf46 [X86] Add X86ISD::VSRAI to computeKnownBitsForTargetNode.
Differential Revision: https://reviews.llvm.org/D56169

llvm-svn: 350178
2018-12-31 19:09:27 +00:00
Craig Topper a32e353afa [X86] Don't mark SEXTLOAD from v4i8/v4i16/v8i8 as Custom on pre-sse4.1.
This seems to be getting in the way more than its helping. This does mean we stop scalarizing some cases, but I'm not convinced the scalarization was really better.

Some of the changes to vsel-cmp-load.ll are a regression but D56156 should fix it.

llvm-svn: 350159
2018-12-30 03:05:07 +00:00
Craig Topper f237ce159e [X86] Add custom type legalization for SIGN_EXTEND_VECTOR_INREG from 16i16/v32i8 to v4i64 when v4i64 needs splitting.
This allows us to sign extend to v4i32 first. And then share that extension to implement the final steps to v4i64 using a pcmpgt and punpckl and punpckh.

We already do something similar for SIGN_EXTEND with -x86-experimental-vector-widening-legalization.

llvm-svn: 350158
2018-12-30 02:30:34 +00:00
Craig Topper 7bb1d50455 [X86] Add test case from PR38217. NFC
llvm-svn: 350150
2018-12-29 07:14:30 +00:00
Craig Topper 0a6cec6f9f [X86] Don't mark SEXTLOAD v4i8->v4i64 and v8i8->v8i64 as custom under vector widening legalization.
This was tricking us into making these operations and then letting them get scalarized later. But I can't prove that the scalarized version is actually better.

llvm-svn: 350141
2018-12-29 01:17:11 +00:00
Craig Topper f814d28eb3 [X86] Directly emit X86ISD::PMULUDQ from the ReplaceNodeResults handling of v2i8/v2i16/v2i32 multiply.
Previously we emitted a multiply and some masking that was supposed to matched to PMULUDQ, but the masking could sometimes be removed before we got a chance to match it. So instead just emit the PMULUDQ directly.

Remove the DAG combine that was added when the ReplaceNodeResults code was originally added. Add a new DAG combine to avoid regressions in shrink_vmul.ll

Some of the shrink_vmul.ll test cases now pick PMULUDQ instead of PMADDWD/PMULLD, but I think this should be an improvement on most CPUs.

I think all of this can go away if/when we switch to -x86-experimental-vector-widening-legalization

llvm-svn: 350134
2018-12-28 19:19:39 +00:00
Craig Topper 787ad92bf6 [X86] Remove check that avoids creating PMULDQ with illegal types. Rely on SplitOpsAndApply to legalize it.
Create PMULDQ/PMULUDQ as long as the number of elements is a power of 2.

This seems to give some improvements in our ability to use SimplifyDemandedBits.

llvm-svn: 350084
2018-12-27 03:37:04 +00:00
Craig Topper 0229da8f07 [X86] Use GetDemandedBits to simplify the operands of PMULDQ/PMULUDQ.
This is an alternative to what I attempted in D56057.

GetDemandedBits is a special version of SimplifyDemandedBits that allows simplifications even when the operand has other uses. GetDemandedBits will only do simplifications that allow a node to be bypassed. It won't create new nodes or alter any of the other users.

I had to add support for bypassing SIGN_EXTEND_INREG to GetDemandedBits.

Based on a patch that Simon Pilgrim sent me in email.

Fixes PR40142.

llvm-svn: 350059
2018-12-24 19:40:20 +00:00
Craig Topper 6356ad940b [X86] Add test cases for PR40142. NFC
llvm-svn: 350058
2018-12-24 19:40:17 +00:00
Craig Topper 5eb5e2bc89 [X86] Autogenerate complete checks. NFC
llvm-svn: 350039
2018-12-24 01:59:31 +00:00
Sanjay Patel 93f1074677 [DAGCombiner] limit shuffle to extend transform (PR40146)
It's dangerous to knowingly create an illegal vector type
no matter what stage of combining we're in.

This prevents the missed folding/scalarization seen in:
https://bugs.llvm.org/show_bug.cgi?id=40146

llvm-svn: 350034
2018-12-23 20:48:31 +00:00
Sanjay Patel 9e5588e1df [x86] add test for vector shuffle --> extend transform (PR40146); NFC
llvm-svn: 350033
2018-12-23 20:36:52 +00:00
Sanjay Patel 9933574ac3 [DAGCombiner] allow hoisting vector bitwise logic ahead of extends
llvm-svn: 350032
2018-12-23 19:58:16 +00:00
Sanjay Patel 8bc612f63b [x86] add tests for vector extend + logic ops; NFC
llvm-svn: 350031
2018-12-23 18:37:44 +00:00
Craig Topper 006bac6880 [X86] Return false from hasAndNotCompare if the comparision value is a constant.
We won't end up using an ANDN instruction in this case so we should generate the same code we do for pre-BMI targets.

llvm-svn: 350018
2018-12-23 05:52:55 +00:00
Craig Topper 3cc92a28ce [X86] Fix an old FIXME about folding the zero constant into the OR instruction we use for sequentially consistent fence in 32-bit mode without SSE2.
llvm-svn: 350013
2018-12-23 01:54:43 +00:00
Craig Topper dfb8a427ff [X86] Autogenerate complete checks. NFC
llvm-svn: 350012
2018-12-23 01:54:41 +00:00
Sanjay Patel 4b537aaf6d [DAGCombiner] allow narrowing of add followed by truncate
trunc (add X, C ) --> add (trunc X), C'

If we're throwing away the top bits of an 'add' instruction, do it in the narrow destination type.
This makes the truncate-able opcode list identical to the sibling transform done in IR (in instcombine).

This change used to show regressions for x86, but those are gone after D55494. 
This gets us closer to deleting the x86 custom function (combineTruncatedArithmetic) 
that does almost the same thing.

Differential Revision: https://reviews.llvm.org/D55866

llvm-svn: 350006
2018-12-22 17:10:31 +00:00
Sanjay Patel 52c02d70e2 [x86] add load fold patterns for movddup with vzext_load
The missed load folding noticed in D55898 is visible independent of that change 
either with an adjusted IR pattern to start or with AVX2/AVX512 (where the build 
vector becomes a broadcast first; movddup is not produced until we get into isel 
via tablegen patterns).

Differential Revision: https://reviews.llvm.org/D55936

llvm-svn: 350005
2018-12-22 16:59:02 +00:00
Roman Lebedev da1df56e5d NFC][CodeGen][X86][AArch64] Tests for bit extract (pat. a/c/d) with trunc (PR36419)
llvm-svn: 350000
2018-12-22 10:38:05 +00:00
Roman Lebedev c90611db06 [NFC][CodeGen][X86][AArch64] Bit extract: add nounwind attr to drop .cfi noise
Forgot about that.

llvm-svn: 349999
2018-12-22 09:58:13 +00:00
Roman Lebedev 29d8af283a [NFC][CodeGen][X86][AArch64] Tests for bit extract (pat. b) with trunc (PR36419)
@bextr64_32_b1 is extracted from hotpath of real-world code
(RawSpeed BitStream<>::peekBitsNoFill()) after `clang -O3`.

@bextr64_32_b2/@bextr64_32_b0 is the same pattern,
but with trunc done last, showing how i think it can be handled:
https://rise4fun.com/Alive/K4B
https://rise4fun.com/Alive/qC9

It is possible that middle-end should do some of this, too.

https://bugs.llvm.org/show_bug.cgi?id=36419

llvm-svn: 349998
2018-12-22 09:40:14 +00:00
David Blaikie 9efb0153f0 llvm-dwarfdump: Remove extraneous space between '(' and 'indexed'
When dumping string or address indexes

llvm-svn: 349997
2018-12-22 08:43:08 +00:00
Craig Topper e58cd9cbc6 [X86] Add isel patterns to match BMI/TBMI instructions when lowering has turned the root nodes into one of the flag producing binops.
This fixes the patterns that have or/and as a root. 'and' is handled differently since thy usually have a CMP wrapped around them.

I had to look for uses of the CF flag because all these nodes have non-standard CF flag behavior. A real or/xor would always clear CF. In practice we shouldn't be using the CF flag from these nodes as far as I know.

Differential Revision: https://reviews.llvm.org/D55813

llvm-svn: 349962
2018-12-21 21:42:43 +00:00
Craig Topper 62ec024d3b [X86] Don't allow optimizeCompareInstr to replace a CMP with BEXTR if the sign flag is used.
The BEXTR instruction documents the SF bit as undefined.

The TBM BEXTR instruction has the same issue, but I'm not sure how to test it. With the control being an immediate we can determine the sign bit is 0 or the BEXTR would have been removed.

Fixes PR40060

Differential Revision: https://reviews.llvm.org/D55807

llvm-svn: 349956
2018-12-21 21:16:26 +00:00
Sanjay Patel 80187b8a17 [x86] add movddup specialization for build vector lowering (PR37502)
This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right. 
In the motivating example, note that we have movddup before SSE4.1 and 
again with AVX2. That's because insertps isn't available pre-SSE41 and 
vbroadcast is (more generally) available with AVX2 (and the splat is 
reduced to movddup via isel pattern).

Differential Revision: https://reviews.llvm.org/D55898

llvm-svn: 349937
2018-12-21 18:48:32 +00:00
Sanjay Patel a87fba4e92 [x86] remove excess check lines; NFC
Forgot that the integer variants have an extra 's'.

llvm-svn: 349929
2018-12-21 17:19:43 +00:00