Commit Graph

12722 Commits

Author SHA1 Message Date
Simon Pilgrim c5bb362b13 [X86][SSE] Add SimplifyDemandedBitsForTargetNode PMULDQ/PMULUDQ handling
Add X86 SimplifyDemandedBitsForTargetNode and use it to simplify PMULDQ/PMULUDQ target nodes.

This enables us to repeatedly simplify the node's arguments after the previous approach had to be reverted due to PR39398.

Differential Revision: https://reviews.llvm.org/D53643

llvm-svn: 345182
2018-10-24 19:11:28 +00:00
Craig Topper 2417273255 [X86] Bring back the MOV64r0 pseudo instruction
This patch brings back the MOV64r0 pseudo instruction for zeroing a 64-bit register. This replaces the SUBREG_TO_REG MOV32r0 sequence we use today. Post register allocation we will rewrite the MOV64r0 to a 32-bit xor with an implicit def of the 64-bit register similar to what we do for the various XMM/YMM/ZMM zeroing pseudos.

My main motivation is to enable the spill optimization in foldMemoryOperandImpl. As we were seeing some code that repeatedly did "xor eax, eax; store eax;" to spill several registers with a new xor for each store. With this optimization enabled we get a store of a 0 immediate instead of an xor. Though I admit the ideal solution would be one xor where there are multiple spills. I don't believe we have a test case that shows this optimization in here. I'll see if I can try to reduce one from the code were looking at.

There's definitely some other machine CSE(and maybe other passes) behavior changes exposed by this patch. So it seems like there might be some other deficiencies in SUBREG_TO_REG handling.

Differential Revision: https://reviews.llvm.org/D52757

llvm-svn: 345165
2018-10-24 17:32:09 +00:00
Robert Lougher 18bfb3a5ec [CodeGen] skip lifetime end marker in isInTailCallPosition
A lifetime end intrinsic between a tail call and the return should not
prevent the call from being tail call optimized.

Differential Revision: https://reviews.llvm.org/D53519

llvm-svn: 345163
2018-10-24 17:03:19 +00:00
Simon Pilgrim 84cc110732 [X86][SSE] Update PMULDQ schedule tests to survive more aggressive SimplifyDemandedBits
llvm-svn: 345136
2018-10-24 13:13:36 +00:00
Matthias Braun 4f82406c46 SelectionDAG: Reuse bigger sized constants in memset expansion.
When implementing memset's today we often see this pattern:
$x0 = MOV 0xXYXYXYXYXYXYXYXY
store $x0, ...
$w1 = MOV 0xXYXYXYXY
store $w1, ...

We first create a 64bit constant in a 64bit register with all bytes the
same and then create a 32bit constant with all bytes the same in a 32bit
register. In many targets we could just access the lower byte of the
64bit register instead.

- Ideally this would be handled by the ConstantHoist pass but it runs
  too early when memset isn't expanded yet.
- The memset expansion code already had this optimization implemented,
  however SelectionDAG constantfolding would constantfold the
  "trunc(bigconstnat)" pattern to "smallconstant".
- This patch makes the memset expansion mark the constant as Opaque and
  stop DAGCombiner from constant folding in this situation. (Similar to
  how ConstantHoisting marks things as Opaque to avoid folding
  ADD/SUB/etc.)

Differential Revision: https://reviews.llvm.org/D53181

llvm-svn: 345102
2018-10-23 23:19:23 +00:00
Craig Topper e01d516ac7 [X86] Autogenerate comple checks. NFC
llvm-svn: 345087
2018-10-23 21:58:49 +00:00
Simon Pilgrim b6c57075c0 [X86][SSE] Revert rL343922 combinePMULDQ AddToWorklist (PR39398)
We can't add the MULDQ node back to the worklist after the demanded bits change has been committed in case the node has been removed entirely. This will have to wait until we have SimplifyDemandedBitsForTargetNode.

llvm-svn: 345070
2018-10-23 19:07:53 +00:00
Simon Pilgrim d705ba97dd [LegalizeDAG] Share Vector/Scalar CTLZ Expansion
As suggested on D53258, this patch shares common CTLZ expansion code between VectorLegalizer and SelectionDAGLegalize by putting it in TargetLowering.

Extension to D53474

llvm-svn: 345060
2018-10-23 17:48:30 +00:00
Roman Lebedev 06e4db07af Experimental re-land of [X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern
This initially landed in rL345014, but was reverted in rL345017
due to sanitizer-x86_64-linux-fast buildbot failure in
check-lld (ELF/relocatable-versioned.s) test.

While i'm not yet quite sure what is the problem, one obvious
thing here is that extra truncation roundtrip.
Maybe that's it? If not, will re-revert.

Differential Revision: https://reviews.llvm.org/D53521

llvm-svn: 345027
2018-10-23 13:19:31 +00:00
Roman Lebedev c29dbbdb10 Revert "[X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern"
*Seems* to be breaking sanitizer-x86_64-linux-fast buildbot,
the ELF/relocatable-versioned.s test:

==17758==MemorySanitizer CHECK failed: /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:191 "((kBlockMagic)) == ((((u64*)addr)[0]))" (0x6a6cb03abcebc041, 0x0)
    #0 0x59716b in MsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/msan/msan.cc:393
    #1 0x586635 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79
    #2 0x57d5ff in __sanitizer::InternalFree(void*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*) /b/sanitizer-x86_64-linux-fast/build/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:191
    #3 0x7fc21b24193f  (/lib/x86_64-linux-gnu/libc.so.6+0x3593f)
    #4 0x7fc21b241999 in exit (/lib/x86_64-linux-gnu/libc.so.6+0x35999)
    #5 0x7fc21b22c2e7 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e7)
    #6 0x57c039 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_msan/bin/lld+0x57c039)

This reverts commit r345014.

llvm-svn: 345017
2018-10-23 10:34:57 +00:00
Roman Lebedev 1c95b2f779 [X86][BMI1] X86DAGToDAGISel: select BEXTR from x << (32 - y) >> (32 - y) pattern
Summary:
Continuation of D52348.

We also get the `c) x &  (-1 >> (32 - y))` pattern here, because of the D48768.
I will add extra-uses into those tests and follow-up with a patch to handle those patterns too.

Reviewers: RKSimon, craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53521

llvm-svn: 345014
2018-10-23 09:08:44 +00:00
Craig Topper f50f086743 [X86] Regenerate test checks to show fma comments. NFC
llvm-svn: 344999
2018-10-23 04:18:08 +00:00
Leonard Chan 0acfc6be38 [Intrinsic] Unigned Saturation Addition Intrinsic
Add an intrinsic that takes 2 integers and perform unsigned saturation
addition on them.

This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.

Differential Revision: https://reviews.llvm.org/D53340

llvm-svn: 344971
2018-10-22 23:08:40 +00:00
Matthias Braun a0beeffeed X86: Do not optimize branches with undef eflags inputs
analyzeBranch()/insertBranch() etc. do not properly deal with an undef
flag on the eflags input and used to produce invalid MIR.  I don't see
this ever affecting real world inputs (I don't think it is possible to
produce undef flags with llvm IR), so I simply changed the code to bail
out in this case.

rdar://42122367

llvm-svn: 344970
2018-10-22 22:52:23 +00:00
Craig Topper c8e183f9ee Recommit r344877 "[X86] Stop promoting integer loads to vXi64"
I've included a fix to DAGCombiner::ForwardStoreValueToDirectLoad that I believe will prevent the previous miscompile.

Original commit message:

Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to rem

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the lo

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344965
2018-10-22 22:14:05 +00:00
Sanjay Patel fb41544af8 [x86] add test for PR25498 and complete checks; NFC
Might as well test the actual codegen instead of just the absence of crashing.

llvm-svn: 344955
2018-10-22 21:11:15 +00:00
Craig Topper 8d8dcfe690 Revert r344877 "[X86] Stop promoting integer loads to vXi64"
Sam McCall reported miscompiles in some tensorflow code. Reverting while I try to figure out.

llvm-svn: 344921
2018-10-22 16:59:24 +00:00
Roman Lebedev 13c5ab2e27 [X86][BMI1]: X86DAGToDAGISel: select BEXTR from x & ((1 << nbits) + (-1)) pattern
Summary:
Trivial continuation of D52304.
While this pattern is not canonical, we do select it in the BZHI case,
so this should not be any different.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52348

llvm-svn: 344902
2018-10-22 13:54:17 +00:00
Craig Topper 290c081d91 [X86] Add patterns for vector and/or/xor/andn with other types than vXi64.
This makes fast isel treat all legal vector types the same way. Previously only vXi64 was in the fast-isel tables.

This unfortunately prevents matching of andn by fast-isel for these types since the requires SelectionDAG. But we already had this issue for vXi64. So at least we're consistent now.

Interestinly it looks like fast-isel can't handle instructions with constant vector arguments so the the not part of the andn patterns is selected with SelectionDAG. This explains why VPTERNLOG shows up in some of the tests.

This is a subset of D53268. As I make progress on that, I will try to reduce the number of lines in the tablegen files.

llvm-svn: 344884
2018-10-22 06:30:22 +00:00
Craig Topper 321df5b0d4 [X86] Stop promoting integer loads to vXi64
Summary:
Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.

I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits, RKSimon

Differential Revision: https://reviews.llvm.org/D53306

llvm-svn: 344877
2018-10-21 21:30:26 +00:00
Sanjay Patel e439cc2745 [DAGCombiner] reduce insert+bitcast+extract vector ops to truncate (PR39016)
This is a late backend subset of the IR transform added with:
D52439

We can confirm that the conversion to a 'trunc' is correct by running:
$ opt -instcombine -data-layout="e"
(assuming the IR transforms are correct; change "e" to "E" for big-endian)

As discussed in PR39016:
https://bugs.llvm.org/show_bug.cgi?id=39016
...the pattern may emerge during legalization, so that's we are waiting for an 
insertelement to become a scalar_to_vector in the pattern matching here.

The DAG allows for fun variations that are not possible in IR. Result types for 
extracts and scalar_to_vector don't necessarily match input types, so that means 
we have to be a bit more careful in the transform (see code comments).

The tests show that we don't handle cases that require a shift (as we did in the
IR version). I've left that as a potential follow-up because I'm not sure if 
that's a real concern at this late stage.

Differential Revision: https://reviews.llvm.org/D53201

llvm-svn: 344872
2018-10-21 20:13:29 +00:00
Simon Pilgrim eb806d5f30 [X86][AVX] Enable lowerVectorShuffleAsLanePermuteAndPermute v16i16/v32i8 unary shuffle lowering
llvm-svn: 344868
2018-10-21 17:07:50 +00:00
David Blaikie 2df23a4e2e DebugInfo: Use DW_OP_addrx in DWARFv5
Reuse addresses in the address pool, even in non-split cases.

llvm-svn: 344838
2018-10-20 08:54:05 +00:00
Kristina Brooks 312fcc116b [X86] Support for the mno-tls-direct-seg-refs flag
Allows to disable direct TLS segment access (%fs or %gs). GCC supports
a similar flag, it can be useful in some circumstances, e.g. when a thread
context block needs to be updated directly from user space. More info
and specific use cases: https://bugs.llvm.org/show_bug.cgi?id=16145

There is another revision for clang as well.
Related: D53102

All X86 CodeGen tests appear to pass:
```
[46/47] Running lit suite /SourceCache/llvm-trunk-8.0/test/CodeGen
Testing Time: 23.17s
  Expected Passes    : 3801
  Expected Failures  : 15
  Unsupported Tests  : 8021
```

Reviewed by: Craig Topper.

Patch by nruslan (Ruslan Nikolaev).

Differential Revision: https://reviews.llvm.org/D53103

llvm-svn: 344723
2018-10-18 03:14:37 +00:00
Craig Topper e0a992918b [X86] Match (cmp (and (shr X, C), mask), 0) to BEXTR+TEST.
Without this we match the CMP+AND to a TEST and then match the SHR separately. I'm trusting analyzeCompare to remove the TEST during the peephole pass. Otherwise we need to check the flag users to see if they only use the Z flag.

This recovers a case lost by r344270.

Differential Revision: https://reviews.llvm.org/D53310

llvm-svn: 344649
2018-10-16 22:29:36 +00:00
Leonard Chan 699b3b54da [Intrinsic] Signed Saturation Addition Intrinsic
Add an intrinsic that takes 2 integers and perform saturation addition on them.

This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.

Differential Revision: https://reviews.llvm.org/D53053

llvm-svn: 344629
2018-10-16 17:35:41 +00:00
Craig Topper 2909a3d9d0 [X86] Fix a bad bitcast in the load form of vXi16 uniform shift patterns for EVEX encoded instructions.
llvm-svn: 344563
2018-10-15 21:51:32 +00:00
Craig Topper 548e8b5b47 [X86] Add test cases showing failure to fold load into vpsrlw when EVEX encoded instructions are used.
There's a bad bitcast being used in the isel patterns for the vXi16 shift instructions.

llvm-svn: 344562
2018-10-15 21:51:29 +00:00
Craig Topper 125c54ba10 [X86] Disable the peephole pass on avx2-intrinsics-x86.ll and avx512bw-intrinsics.ll to ensure any load folding tests are testing isel not load folding tables.
llvm-svn: 344561
2018-10-15 21:51:26 +00:00
Craig Topper da0e544953 [X86] Regenerate avx2-intrinsics-x86.ll to compress the 32 vs 64 bit mode checks.
llvm-svn: 344560
2018-10-15 21:51:22 +00:00
Sanjay Patel 4cf1da0e02 [SelectionDAG] allow FP binops in SimplifyDemandedVectorElts
This is intended to make the backend on par with functionality that was 
added to the IR version of SimplifyDemandedVectorElts in:
rL343727
...and the original motivation is that we need to improve demanded-vector-elements 
in several ways to avoid problems that would be exposed in D51553.

Differential Revision: https://reviews.llvm.org/D52912

llvm-svn: 344541
2018-10-15 18:05:34 +00:00
Sanjay Patel 9e7e0fd828 [DAGCombiner] allow undef elts in vector fma matching
llvm-svn: 344528
2018-10-15 15:56:39 +00:00
Sanjay Patel 475a53649e [x86] add tests for fma with undef elts; NFC
llvm-svn: 344527
2018-10-15 15:47:37 +00:00
Sanjay Patel 4e970ff022 [DAGCombiner] allow undef elts in vector fma matching
llvm-svn: 344525
2018-10-15 15:38:38 +00:00
Sanjay Patel b06ac18ee9 [x86] add tests for fma with undef elts; NFC
llvm-svn: 344523
2018-10-15 15:28:44 +00:00
Craig Topper b44b22c6fe [X86] Autogenerate checks. NFC
llvm-svn: 344490
2018-10-15 05:31:24 +00:00
Craig Topper 06aea1720a [X86] Move promotion of vector and/or/xor from legalization to DAG combine
Summary:
I've noticed that the bitcasts we introduce for these make computeKnownBits and computeNumSignBits not work well in LegalizeVectorOps. LegalizeVectorOps legalizes bottom up while LegalizeDAG legalizes top down. The bottom up strategy for LegalizeVectorOps means operands are legalized before their uses. So we promote and/or/xor before we legalize the operands that use them making computeKnownBits/computeNumSignBits in places like LowerTruncate suboptimal. I looked at changing LegalizeVectorOps to be top down as well, but that was more disruptive and caused some regressions. I also looked at just moving promotion of binops to LegalizeDAG, but that had a few issues one around matching AND,ANDN,OR into VSELECT because I had to create ANDN as vXi64, but the other nodes hadn't legalized yet, I didn't look too hard at fixing that.

This patch seems to produce better results overall than my other attempts. We now form broadcasts of constants better in some cases. For at least some of them the AND was being introduced in LegalizeDAG, promoted to vXi64, and the BUILD_VECTOR was also legalized there. I think we got bad ordering of that. Now the promotion is out of the legalizer so we handle this better.

In the longer term I think we really should evaluate whether we should be doing this promotion at all. It's really there to reduce isel pattern count, but I'm wondering if we'd be better served just eating the pattern cost or doing C++ based isel for vector and/or/xor in X86ISelDAGToDAG. The masked and/or/xor will definitely be difficult in patterns if a bitcast gets between the vselect and the and/or/xor node. That becomes a lot of permutations to cover.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53107

llvm-svn: 344487
2018-10-15 01:51:58 +00:00
Craig Topper 671779456a [X86] Add 128 MOVDDUP to the constant pool printing in X86AsmPrinter::EmitInstruction.
We use this instruction to broadcast a single 64-bit value to a v2i64/v2f64 vector.

llvm-svn: 344486
2018-10-15 01:51:53 +00:00
Craig Topper b5000974fe [X86] Autogenerate complete checks. NFC
llvm-svn: 344485
2018-10-15 01:51:50 +00:00
Simon Pilgrim 861cd0ba44 [X86][AVX] Enable lowerVectorShuffleAsLanePermuteAndPermute v16i16/v32i8 shuffle lowering
Extends D53148 from v4f64 now that we have test coverage for v16i16/v32i8 shuffles.

llvm-svn: 344481
2018-10-14 17:34:20 +00:00
Craig Topper ec4b75f47a [X86] Type legalize v2f32 stores by widening to v4f32, casting to v2f64, extracting f64 and storing.
Summary: This is similar to what D52528 did for loads. It should match what generic type legalization does in 64-bit mode where it uses a v2i64 cast and an i64 store.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53173

llvm-svn: 344470
2018-10-14 03:36:27 +00:00
Craig Topper 189e5b4ab6 [LegalizeTypes] Prevent an assertion from PromoteIntRes_BSWAP and PromoteIntRes_BITREVERSE if the shift amount is too large for the VT returned by getShiftAmountTy
Summary:
getShiftAmountTy for X86 returns MVT::i8. If a BSWAP or BITREVERSE is created that requires promotion and the difference between the original VT and the promoted VT is more than 255 then we won't able to create the constant.

This patch adds a check to replace the result from getShiftAmountTy to MVT::i32 if the difference won't fit. This should get legalized later when the shift is ultimately expanded since its clearly an illegal type that we're only promoting to make it a power of 2 bit width. Alternatively we could base the decision completely on the largest shift amount the promoted VT could use.

Vectors should be immune here because getShiftAmountTy always returns the incoming VT for vectors. Only the scalar shift amount can be changed by the targets.

Reviewers: eli.friedman, RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53232

llvm-svn: 344460
2018-10-13 17:47:20 +00:00
Simon Pilgrim 1c6d320351 [X86][SSE] combineIncDecVector - use isConstantSplat
Use isConstantSplat instead of ISD::isConstantSplatVector to let us us peek through to illegal types (in this case for i686 targets to recognise i64 constants)

llvm-svn: 344452
2018-10-13 14:45:44 +00:00
Simon Pilgrim f64e654d62 [X86][SSE] Improve CTTZ lowering when CTLZ is legal
If we have better CTLZ support than CTPOP, then use cttz(x) = width - ctlz(~x & (x - 1)) - and remove the CTTZ_ZERO_UNDEF handling as it no longer gives better codegen.

Similar to rL344447, this is also closer to LegalizeDAG's approach

llvm-svn: 344448
2018-10-13 13:05:19 +00:00
Simon Pilgrim afead139cf [X86][SSE] Change CTTZ vector lowering to cttz(x) = ctpop(~x & (x - 1))
This patch changes the vector CTTZ lowering from:

cttz(x) = ctpop((x & -x) - 1)

to:

cttz(x) = ctpop(~x & (x - 1))

Not only does this make better use of the PANDN instruction, but it also matches the LegalizeDAG method which should allow us to remove the x86 specific code at some point in the future (we need to fix some issues with the bitcasted logic ops and CTPOP lowering first).

Differential Revision: https://reviews.llvm.org/D53214

llvm-svn: 344447
2018-10-13 12:12:06 +00:00
Simon Pilgrim f3952413f7 [X86][AVX] Add lowerVectorShuffleAsLanePermuteAndPermute for v4f64 shuffles (PR39161)
Add shuffle lowering for the case where we can shuffle the lanes into place followed by an in-lane permute.

This is mainly for cases where we can have non-repeating permutes in each lane, but for now I've just enabled it for v4f64 unary shuffles to fix PR39161 - there is no test coverage for other shuffles that might benefit yet.

We now have several cross-lane shuffle lowering methods that all do something similar - I've looked at merging some of these (notably by making the repeated mask mechanism in lowerVectorShuffleByMerging128BitLanes optional), but there is a lot of assertions/assumptions in the way that makes this tricky - I ended up going for adding yet another relatively simple method instead.

Differential Revision: https://reviews.llvm.org/D53148

llvm-svn: 344446
2018-10-13 11:38:10 +00:00
Craig Topper 3e76b2d736 [X86] Improve type legalization of (v2i32/v4i16/v8i16 (bitcast (v2f32))) to avoid a stack stack temporary.
llvm-svn: 344425
2018-10-12 22:00:04 +00:00
Craig Topper 1bb0c6041a [LegalizeVectorTypes] When widening the operands to a concat_vectors, see if we can use the widened operand 0 if the width matches and the other operands are undef.
This saves a conversion to extracts and build_vector. We already do this when both the result and the input need to be widened to the same type.

This changed the sse-intrinsics-fast-isel test because we don't lower (insert_vector_elt (scalar_to_vector X), Y, 1) well. We turn it into (vector_shuffle (scalar_to_vector X), (scalar_to_vector Y), <0, 4, 2, 3>) losing track of the fact that the upper elts could be undef.

We should probably find a way to prevent the scalarization of the <2 x f32> load on these tests.

llvm-svn: 344404
2018-10-12 19:37:49 +00:00
Simon Pilgrim 1d6b938132 Regenerate test. NFCI.
llvm-svn: 344399
2018-10-12 19:03:54 +00:00
Sanjay Patel e28c8ecd72 [x86] add and use fast horizontal vector math subtarget feature
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen 
by default. AMD Jaguar (btver2) should have no difference with this patch because it has 
fast-hops. (If we want to set that bit for other CPUs, let me know.)

The code changes are small, but there are many test diffs. For files that are specifically 
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences 
side-by-side. For files that are primarily concerned with codegen other than hops, I just 
updated the CHECK lines to reflect the new default codegen.

To recap the recent horizontal op story:

1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. 
   Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so 
   we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but 
   probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be 
   happy. I'm not sure if we have the best predicate here, but the intent is to undo the 
   extra hop-iness that was enabled by r344141.

Differential Revision: https://reviews.llvm.org/D53095

llvm-svn: 344361
2018-10-12 16:41:02 +00:00
Nick Desaulniers 47bab69a2e [MC][ELF] fix newly added test
Summary:
Reland of
- r344197 "[MC][ELF] compute entity size for explicit sections"
- r344206 "[MC][ELF] Fix section_mergeable_size.ll"
after being reverted in r344278 due to build breakages from not
specifying a target triple.

Move test from test/CodeGen/Generic/ to test/MC/ELF/.
Add explicit target triple so we don't try to run
this test on non ELF targets.

Reported: https://reviews.llvm.org/D53056#1261707

Reviewers: fhahn, rnk, espindola, NoQ

Reviewed By: fhahn, rnk

Subscribers: NoQ, MaskRay, rengolin, emaste, arichardson, llvm-commits, pirama, srhines

Differential Revision: https://reviews.llvm.org/D53146

llvm-svn: 344360
2018-10-12 16:35:44 +00:00
Sanjay Patel f5b1892348 [AArch64][x86] add tests for trunc disguised as vector ops (PR39016); NFC
These correspond to the IR transform from:
D52439

llvm-svn: 344353
2018-10-12 15:22:14 +00:00
Simon Pilgrim b8339c0167 [SelectionDAG] Move VectorLegalizer::ExpandCTLZ codegen into SelectionDAGLegalize
Generalize SelectionDAGLegalize's CTLZ expansion to handle vectors - lets VectorLegalizer::ExpandCTLZ to just pass the expansion on instead of repeating the same codegen.

llvm-svn: 344349
2018-10-12 14:45:57 +00:00
Simon Pilgrim 78b5a3c3ef [X86][SSE] LowerVectorCTPOP - pull out repeated byte sum stage.
Pull out repeated byte sum stage for popcount of vector elements > 8bits.

This allows us to simplify the LUT/BITMATH popcnt code to always assume vXi8 vectors, and also improves avx512bitalg codegen which only has access to vpopcntb/vpopcntw.

llvm-svn: 344348
2018-10-12 14:18:47 +00:00
Simon Pilgrim bb37e81b65 [X86][AVX] Regenerate tzcnt tests
llvm-svn: 344341
2018-10-12 13:24:51 +00:00
Simon Pilgrim 29279f29c8 [X86][SSE] Add extract_subvector(PSHUFB) -> PSHUFB(extract_subvector()) combine
Fixes PR32160 by reducing the size of PSHUFB if we only use one of the lanes.

This approach can probably be generalized to handle any target shuffle (and any subvector index) but we have no test coverage at the moment.

llvm-svn: 344336
2018-10-12 12:10:34 +00:00
Simon Pilgrim bc760b2dc5 [X86][AVX] Add examples of shuffles that can be reduced to a cross-lane shuffle followed by a in-lane permute
Suitable for lowering by D53148

llvm-svn: 344332
2018-10-12 10:26:59 +00:00
Simon Pilgrim c844bc84dd [X86] Ignore float/double non-temporal loads (PR39256)
Scalar non-temporal loads were asserting instead of just being ignored.

Reduced from https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10895

llvm-svn: 344331
2018-10-12 10:20:16 +00:00
Sanjay Patel 56b6660d2e [DAGCombiner] rearrange extract_element+bitcast fold; NFC
I want to add another pattern here that includes scalar_to_vector,
so this makes that patch smaller. I was hoping to remove the
hasOneUse() check because it shouldn't be necessary for common
codegen, but an AMDGPU test has a comment suggesting that the
extra check makes things better on one of those targets.

llvm-svn: 344320
2018-10-11 23:56:56 +00:00
Sanjay Patel 99c02f6301 [x86] add tests for extract_element; NFC
The transform for this pattern has an unnecessary one-use limitation.

llvm-svn: 344303
2018-10-11 22:04:36 +00:00
Sanjay Patel 7ad8e32f03 [x86] regenerate CHECKs; NFC
llvm-svn: 344301
2018-10-11 21:44:38 +00:00
Craig Topper 35d513c7e4 [X86] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector.
On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success.

This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency and to keep the load in the fp domain.

There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately.

Differential Revision: https://reviews.llvm.org/D52528

llvm-svn: 344291
2018-10-11 20:36:06 +00:00
Craig Topper fb2ac8969e [X86] Restore X86ISelDAGToDAG::matchBEXTRFromAnd. Teach address matching to create a BEXTR pattern from a (shl (and X, mask >> C1) if C1 can be folded into addressing mode.
This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about.

We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR.

I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that.

I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet.

Differential Revision: https://reviews.llvm.org/D53126

llvm-svn: 344270
2018-10-11 18:06:07 +00:00
Roman Lebedev 4225f4adff [X86][BMI1]: X86DAGToDAGISel: select BEXTR from x & ~(-1 << nbits) pattern
Summary:
As discussed in D48491, we can't really do this in the TableGen,
since we need to produce *two* instructions. This only implements
one single pattern. The other 3 patterns will be in follow-ups.

I'm not sure yet if we want to also fuse shift into here
(i.e `(x >> start) & ...`)

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D52304

llvm-svn: 344224
2018-10-11 07:51:13 +00:00
Roman Lebedev 62cd430602 [NFC][X86][AArch64] extract-bits.ll: add tests with constants+storing results.
As noted in https://reviews.llvm.org/D53080#inline-467678,
this *may* get pessimized by that diff.

llvm-svn: 344182
2018-10-10 20:50:52 +00:00
Roman Lebedev 33d84c6dac [X86] Move X86DAGToDAGISel::matchBEXTRFromAnd() into X86ISelLowering
Summary:
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 | PR38938 ]],
we fail to emit `BEXTR` if the mask is shifted.
We can't deal with that in `X86DAGToDAGISel` `before the address mode for the inc is selected`,
and we can't really do it in the normal DAGCombine, because we don't have generic `ISD::BitFieldExtract` node,
and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back.
So it would seem X86ISelLowering is the place to handle this.

This patch only moves the matchBEXTRFromAnd()
from X86DAGToDAGISel to X86ISelLowering.
It does not add support for the 'shifted mask' pattern.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D52426

llvm-svn: 344179
2018-10-10 20:40:12 +00:00
Volkan Keles da5578c5d0 [GlobalISel] Fix the artifact combiner to fold G_IMPLICIT_DEF properly
Summary:
GlobalISel generates incorrect code because the legalizer artifact
combiner assumes `G_[SZ]EXT (G_IMPLICIT_DEF)` is equivalent to
`G_IMPLICIT_DEF `.

Replace `G_[SZ]EXT (G_IMPLICIT_DEF)` with 0 because the top bits
will be 0 for G_ZEXT and 0/1 for the G_SEXT.

Reviewers: aditya_nandakumar, dsanders, aemerson, javed.absar

Reviewed By: aditya_nandakumar

Subscribers: rovka, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D52996

llvm-svn: 344163
2018-10-10 18:01:48 +00:00
Nirav Dave 07acc992dc [DAGCombine] Improve Load-Store Forwarding
Summary:
Extend analysis forwarding loads from preceeding stores to work with
extended loads and truncated stores to the same address so long as the
load is fully subsumed by the store.

Hexagon's swp-epilog-phis.ll and swp-memrefs-epilog1.ll test are
deleted as they've no longer seem to be relevant.

Reviewers: RKSimon, rnk, kparzysz, javed.absar

Subscribers: sdardis, nemanjai, hiraditya, atanasyan, llvm-commits

Differential Revision: https://reviews.llvm.org/D49200

llvm-svn: 344142
2018-10-10 14:15:52 +00:00
Sanjay Patel 6cca8af227 [x86] allow single source horizontal op matching (PR39195)
This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd 
solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature
bit to deal with it here in the DAG.

Differential Revision: https://reviews.llvm.org/D52997

llvm-svn: 344141
2018-10-10 13:39:59 +00:00
Carlos Alberto Enciso c0952c8a08 Revert "[DebugInfo][Dexter] Unreachable line stepped onto after SimplifyCFG."
This reverts commit r344120.

It was causing buildbot failures.

llvm-svn: 344135
2018-10-10 12:09:34 +00:00
Carlos Alberto Enciso e7a347e5f8 [DebugInfo][Dexter] Unreachable line stepped onto after SimplifyCFG.
When SimplifyCFG changes the PHI node into a select instruction, the debug line records becomes ambiguous. It causes the debugger to display unreachable source lines. 

Differential Revision: https://reviews.llvm.org/D52887

llvm-svn: 344120
2018-10-10 08:29:55 +00:00
Craig Topper 02c62aa58a [X86] Remove FeatureRTM from Skylake processor list
Summary:
There are a LOT of Skylakes and later without TSX-NI. Examples:
- SKL: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3-20-GHz-
- KBL: https://ark.intel.com/products/97540/Intel-Core-i7-7560U-Processor-4M-Cache-up-to-3-80-GHz-
- KBL-R: https://ark.intel.com/products/149091/Intel-Core-i7-8565U-Processor-8M-Cache-up-to-4-60-GHz-
- CNL: https://ark.intel.com/products/136863/Intel-Core-i3-8121U-Processor-4M-Cache-up-to-3_20-GHz

This feature seems to be present only on high-end desktop and server
chips (I can't find any SKX without). This commit leaves it disabled
for all processors, but can be re-enabled for specific builds with
-mrtm.

Patch by Thiago Macieira

Reviewers: erichkeane, craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D53041

llvm-svn: 344116
2018-10-10 07:43:35 +00:00
Nemanja Ivanovic 72d4866e57 [DAGCombiner] Expand combining of FP logical ops to sign-setting FP ops
We already do the following combines:
(bitcast int (and (bitcast fp X to int), 0x7fff...) to fp) -> fabs X
(bitcast int (xor (bitcast fp X to int), 0x8000...) to fp) -> fneg X

When the target has "bit preserving fp logic". This patch just extends it
to also combine:
(bitcast int (or (bitcast fp X to int), 0x8000...) to fp) -> fneg (fabs X)

As some targets have fnabs and even those that don't can efficiently lower
both the fabs and the fneg.

Differential revision: https://reviews.llvm.org/D44548

llvm-svn: 344093
2018-10-09 23:20:11 +00:00
Rong Xu 3d2efdfdea Recommit r343993: [X86] condition branches folding for three-way conditional codes
Fix the memory issue exposed by sanitizer.

llvm-svn: 344085
2018-10-09 22:03:40 +00:00
Craig Topper f6d8400869 [X86] When lowering unsigned v2i64 setcc without SSE42, flip the sign bits in the v2i64 type then bitcast to v4i32.
This may give slightly better opportunities for DAG combine to simplify with the operations before the setcc. It also matches the type the xors will eventually be promoted to anyway so it saves a legalization step.

Almost all of the test changes are because our constant pool entry is now v2i64 instead of v4i32 on 64-bit targets. On 32-bit targets getConstant should be emitting a v4i32 build_vector and a v4i32->v2i64 bitcast.

There are a couple test cases where it appears we now combine a bitwise not with one of these xors which caused a new constant vector to be generated. This prevented a constant pool entry from being shared. But if that's an issue we're concerned about, it seems we need to address it another way that just relying a bitcast to hide it.

This came about from experiments I've been trying with pushing the promotion of and/or/xor to vXi64 later than LegalizeVectorOps where it is today. We run LegalizeVectorOps in a bottom up order. So the and/or/xor are promoted before their users are legalized. The bitcasts added for the promotion act as a barrier to computeKnownBits if we try to use it during vector legalization of a later operation. So by moving the promotion out we can hopefully get better results from computeKnownBits/computeNumSignBits like in LowerTruncate on AVX512. I've also looked at running LegalizeVectorOps in a top down order like LegalizeDAG, but thats showing some other issues.

llvm-svn: 344071
2018-10-09 19:05:50 +00:00
Craig Topper 4ea569622a [X86] Autogenerate complete checks. NFC
llvm-svn: 344060
2018-10-09 17:52:07 +00:00
Sanjay Patel a875030ab3 [AArch64][x86] add tests for bitcasted fnabs; NFC
Alternate target coverage for  D44548.

llvm-svn: 344059
2018-10-09 17:20:26 +00:00
Sanjay Patel f5fac1826a [x86] use demanded bits to simplify masked store codegen
As noted in D52747, if we prefer IR to use trunc for bool vectors rather 
than and+icmp, we can expose codegen shortcomings as seen here with masked store.

Replace a hard-coded PCMPGT simplification with the more general demanded bits call
to improve things.

Differential Revision: https://reviews.llvm.org/D52964

llvm-svn: 344048
2018-10-09 14:04:14 +00:00
Simon Pilgrim 23f880317a [SelectionDAG] Add SIGN_EXTEND_VECTOR_INREG and CONCAT_VECTORS support to SimplifyDemandedBits
Fix for AVX1 masked load/store regression on D52964

llvm-svn: 344043
2018-10-09 13:13:35 +00:00
Simon Pilgrim 720db8ed7b [X86][AVX1] Enable *_EXTEND_VECTOR_INREG lowering of 256-bit vectors
As discussed on D52964, this adds 256-bit *_EXTEND_VECTOR_INREG lowering support for AVX1 targets to help improve SimplifyDemandedBits handling.

Differential Revision: https://reviews.llvm.org/D52980

llvm-svn: 344019
2018-10-09 07:42:01 +00:00
Matthias Braun 1c96c9687f ExpandPostRAPseudos: Fix alldefsAreDead() not removing operands
One case left around nonsensical operands for the KILL instruction
which the machine verifier checks for nowadays. While this should not
hurt in release builds we should fix the machine verifier errors anyway.

llvm-svn: 344008
2018-10-09 00:07:34 +00:00
Rong Xu 47fd015163 [X86] Revert r343993 condition branches folding for three-way conditional codes
Some buildbots failed.

llvm-svn: 343998
2018-10-08 22:08:43 +00:00
Sanjay Patel 6205ba0e7f [x86] add tests for phaddd/phaddw; NFC
More tests related to PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195

If we limit the horizontal codegen, it may require different
constraints for FP and integer.

llvm-svn: 343994
2018-10-08 19:48:18 +00:00
Rong Xu 67b1b328f7 [X86] condition branches folding for three-way conditional codes
This patch implements a pass that optimizes condition branches on x86 by
taking advantage of the three-way conditional code generated by compare
instructions.

Currently, it tries to hoisting EQ and NE conditional branch to a dominant
conditional branch condition where the same EQ/NE conditional code is
computed. An example:
bb_0:
  cmp %0, 19
  jg bb_1
  jmp bb_2
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_4
bb_4:
  cmp %0, 20
  je bb_5
  jmp bb_6
Here we could combine the two compares in bb_0 and bb_4 and have the
following code:

bb_0:
  cmp %0, 20
  jg bb_1
  jl bb_2
  jmp bb_5
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_6

For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control height
for bb_6 is also reduced. bb_4 is gone after the optimization.

This optimization is motivated by the branch pattern generated by the switch
lowering: we always have pivot-1 compare for the inner nodes and we do a pivot
compare again the leaf (like above pattern).

This pass currently is enabled on Intel's Sandybridge and later arches. Some
reviewers pointed out that on some arches (like AMD Jaguar), this pass may
increase branch density to the point where it hurts the performance of the
branch predictor.

Differential Revision: https://reviews.llvm.org/D46662

llvm-svn: 343993
2018-10-08 18:52:39 +00:00
Simon Pilgrim 6fc8d05565 [X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors
Some necessary yak shaving before lowering *_EXTEND_VECTOR_INREG 256-bit vectors on AVX1 targets as suggested by D52964.

Differential Revision: https://reviews.llvm.org/D52970

llvm-svn: 343991
2018-10-08 18:40:50 +00:00
Sanjay Patel 8459465a44 [x86] add hadd test with no undefs, remove duplicate tests; NFC
llvm-svn: 343975
2018-10-08 16:24:43 +00:00
Sanjay Patel d48789c0c4 [x86] simplify hadd tests; NFC
The tests from PR39195 don't use 2 parameters. That's the
root problem for the pattern matching in isHorizontalBinOp().

llvm-svn: 343974
2018-10-08 15:56:28 +00:00
Alexander Ivchenko 1aedf203dd [GlobalIsel][X86] Support G_UDIV/G_UREM/G_SREM
Support G_UDIV/G_UREM/G_SREM. The instruction selection
code is taken from FastISel with only minor tweaks to adapt
for GlobalISel.

Differential Revision: https://reviews.llvm.org/D49781

llvm-svn: 343966
2018-10-08 13:40:34 +00:00
Sanjay Patel 60badd7584 [x86] add 16 missed hadd patterns (PR39195); NFC
llvm-svn: 343965
2018-10-08 12:54:33 +00:00
Sanjay Patel ecc8af61e7 [DAGCombiner] allow undef elts in vector fadd matching
llvm-svn: 343945
2018-10-07 16:30:42 +00:00
Sanjay Patel f956840dbe [x86] add vector fadd with undef elts test; NFC
llvm-svn: 343944
2018-10-07 16:27:50 +00:00
Sanjay Patel 6c02c6a3a6 [x86] remove redundant tests; NFC
The equivalent tests were added to the file with related folds in rL343941.

llvm-svn: 343943
2018-10-07 16:13:38 +00:00
Sanjay Patel ef76e27985 [DAGCombiner] allow undefs when matching vector splats for fmul folds
llvm-svn: 343942
2018-10-07 16:05:37 +00:00
Sanjay Patel fcb1061c13 [x86] add vector fmul with undef elts tests; NFC
llvm-svn: 343941
2018-10-07 16:00:55 +00:00
Sanjay Patel 0b74c840dd [DAGCombiner] allow undef elts in vector fabs/fneg matching
This change is proposed as a part of D44548, but we
need this independently to avoid regressions from improved
undef propagation in SimplifyDemandedVectorElts().

llvm-svn: 343940
2018-10-07 15:32:06 +00:00
Sanjay Patel 31a3f2aaba [x86] add tests for FP logic folding for vectors with undefs; NFC
llvm-svn: 343938
2018-10-07 15:05:39 +00:00
Simon Pilgrim 3b04a4e322 [SelectionDAG] Respect multiple uses in SimplifyDemandedBits to SimplifyDemandedVectorElts simplification
rL343913 was using SimplifyDemandedBits's original demanded mask instead of the adjusted 'NewMask' that accounts for multiple uses of the op (those variable names really need improving....).

Annoyingly many of the test changes (back to pre-rL343913 state) are actually safe - but only because their multiple uses are all by PMULDQ/PMULUDQ.

Thanks to Jan Vesely (@jvesely) for bisecting the bug.

llvm-svn: 343935
2018-10-07 11:45:46 +00:00
Simon Pilgrim 012fda59a5 [AARCH64][X86] Remove _nonsplat from test names
As discussed on D50222 

llvm-svn: 343934
2018-10-07 11:24:04 +00:00
Simon Pilgrim 9fa1c66421 [X86] getFauxShuffleMask - Handle undef + sentinel values in subvector insertion
llvm-svn: 343926
2018-10-06 22:13:44 +00:00
Simon Pilgrim 0dcf1cea03 [X86][SSE] Add SSE41 vector int2fp tests
llvm-svn: 343925
2018-10-06 20:24:27 +00:00