18611 Commits

Author SHA1 Message Date
Craig Topper c5903c935c [X86] Use unsigned type for opcodes throughout X86FixupLEAs.
All of the interfaces related to opcode in MachineInstr and MCInstrInfo refer to opcodes as unsigned.

llvm-svn: 357444
2019-04-02 00:50:58 +00:00
Craig Topper 4307172b84 [X86] Classify the AVX512 rounding control operand as X86::OPERAND_ROUNDING_CONTROL instead of MCOI::OPERAND_IMMEDIATE. Add an assert on legal values of rounding control in the encoder and remove an explicit mask.
This should allow llvm-exegesis to intelligently constrain the rounding mode.

The mask in the encoder shouldn't be necessary any more. We used to allow codegen to use 8-11 for rounding mode and the assembler would use 0-3 to mean the same thing so we masked here and in the printer. Codegen now matches the assembler and the printer was updated, but I forgot to update the encoder.

llvm-svn: 357419
2019-04-01 19:08:15 +00:00
Matt Arsenault ebf90db084 X86: Fix override warning
llvm-svn: 357388
2019-04-01 14:08:26 +00:00
Clement Courbet 7e062c9b1f [X86] Make post-ra scheduling macrofusion-aware.
Subscribers: MatzeB, arsenm, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59688

llvm-svn: 357384
2019-04-01 13:48:50 +00:00
Craig Topper 2e1bf89e3a [X86] Use ISD::INTRINSIC_VOID in getTgtMemIntrinsic for truncating stores and scatter intrinsics.
This is the appropriate opcode for only having a chain output. Though I'm not
sure it matters much.

llvm-svn: 357375
2019-04-01 05:26:12 +00:00
Sanjay Patel e1bc360fc6 [x86] allow movmsk with 2-element reductions
One motivation for making this change is that the lack of using movmsk is likely
a main source of perf difference between clang and gcc on the C-Ray benchmark as
shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with
an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of'
examples look clearly better using movmsk because we've traded 2 vector instructions
for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.

If we examine the llvm-mca output for these cases, it appears that even though the
'all-of' movmsk variant looks worse on paper, it would perform better on both
Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0

  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5

  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400

  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0

  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600

  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if
that's true, then we're also almost certainly making the wrong transform already for
reductions with >2 elements, so that should be fixed independently.
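
The 'any-of' and 'all-of' reductions discussed above can be modeled in scalar C++ roughly as follows. This is an illustrative sketch only, not code from the patch: `Vec2`, `movmsk2`, `anyOf`, and `allOf` are hypothetical names, and the model assumes a 2-element vector of 64-bit lanes where movmsk collects one sign bit per lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec2 = std::array<std::int64_t, 2>;

// Scalar model of a 2-lane movmsk: bit i of the result is the sign bit of lane i.
inline unsigned movmsk2(const Vec2 &lanes) {
  unsigned mask = 0;
  for (unsigned i = 0; i < 2; ++i)
    mask |= static_cast<unsigned>(lanes[i] < 0) << i;
  return mask;
}

// 'any-of': at least one lane has its sign bit set, so the mask is nonzero.
inline bool anyOf(const Vec2 &lanes) { return movmsk2(lanes) != 0; }

// 'all-of': every lane has its sign bit set, so the mask equals 0b11.
inline bool allOf(const Vec2 &lanes) { return movmsk2(lanes) == 0x3; }

// Usage example with hypothetical inputs.
inline void example() {
  const Vec2 mixed{{-1, 0}};
  assert(anyOf(mixed) && !allOf(mixed));
}
```

In this model both reductions reduce to a single scalar compare of the mask, which is the shape of code the commit message argues for.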

Differential Revision: https://reviews.llvm.org/D59997

llvm-svn: 357367
2019-03-31 15:11:34 +00:00
Liang Zou 9f4a4d3974 fix typo: "\t" => " "
Reviewers: llvm.org, Jim

Reviewed By: Jim

Subscribers: arsenm, jvesely, nhaehnle, rupprecht, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59983

llvm-svn: 357365
2019-03-31 14:49:00 +00:00
Craig Topper e4a0fc7d75 [X86] Teach isel for RMW binops to handle negate
Negate updates flags like a subtract. We should be able to use the flags from the RMW form of negate when we have (store (X86ISD::SUB 0, load A), A)

Differential Revision: https://reviews.llvm.org/D60007

llvm-svn: 357353
2019-03-30 18:59:17 +00:00
Simon Pilgrim 10c9032c02 [X86][SSE] detectAVGPattern - Match zext(or(x,y)) 'add like' patterns (PR41316)
Fixes PR41316 where the expanded PAVG intrinsic had had one of its ADDs turned into an OR due to its operands having no conflicting bits.
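
A minimal sketch of why an OR can stand in for the ADD here, assuming 8-bit unsigned lanes and the usual rounding-average definition (x + y + 1) >> 1. The function names are hypothetical and this is not the matcher from the patch: it only shows that when the operands share no set bits, `x | y` equals `x + y`.

```cpp
#include <cassert>
#include <cstdint>

// Rounding average on one 8-bit lane: widen, add, add 1, shift right by 1.
inline std::uint8_t avgWithAdd(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>((static_cast<unsigned>(x) + y + 1u) >> 1);
}

// Same computation with the inner ADD replaced by OR.
// Only equivalent when x and y have no set bits in common.
inline std::uint8_t avgWithOr(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>((static_cast<unsigned>(x | y) + 1u) >> 1);
}

// True when no bit is set in both operands, so x + y produces no carries.
inline bool noCommonBits(std::uint8_t x, std::uint8_t y) {
  return (x & y) == 0;
}

// Usage example with hypothetical inputs.
inline void example() {
  assert(noCommonBits(0xF0, 0x0F));
  assert(avgWithAdd(0xF0, 0x0F) == avgWithOr(0xF0, 0x0F));
}
```

When the operands do overlap (for example 3 and 3), the two forms differ, which is why an 'add like' match needs the no-common-bits condition.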

llvm-svn: 357351
2019-03-30 17:12:29 +00:00
Simon Pilgrim 3293455595 [X86][SSE] detectAVGPattern - begin generalizing ADD matches
Move the ADD matching into a helper - first NFC stage towards supporting 'ADD like' cases such as in PR41316

llvm-svn: 357349
2019-03-30 15:31:53 +00:00
Amara Emerson d413f41de6 [X86] When using Win64 ABI, exit with error if SSE is disabled for varargs
We need XMM registers to handle varargs with the Win64 ABI. Before we would
silently generate bad code resulting in an assertion failure elsewhere in the
backend.

llvm-svn: 357317
2019-03-29 21:30:51 +00:00
Craig Topper 4ccb3b96b6 [X86] Use cached OptForSize in X86ISelDAGToDAG.cpp instead of pulling it from the function attribute. NFCI
llvm-svn: 357297
2019-03-29 18:36:40 +00:00
Simon Pilgrim aeaf7fcdde [X86] Add X86TargetLowering::isCommutativeBinOp override.
We currently just have test coverage for PMULUDQ - will add more in the future.

llvm-svn: 357244
2019-03-29 11:25:58 +00:00
Craig Topper c25c9b4d16 [X86] Teach the isel optimization for (x << C1) op C2 to (x op (C2>>C1)) << C1 to consider cases where C2>>C1 can fit an unsigned 32-bit immediate
For 64-bit operations we should consider if the immediate can be made to fit
in an unsigned 32-bit immediate. For OR/XOR this allows us to load the immediate
with MOV32ri instead of movabsq. For AND this allows us to fold the immediate.
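
The rewrite can be checked with a small scalar model. This is a sketch under stated assumptions, not the isel code: `LogicOp`, `shiftThenOp`, `opThenShift`, `rewriteIsValid`, and `fitsUnsigned32` are hypothetical names, shifts are logical on 64-bit unsigned values, and the validity condition for OR/XOR (low C1 bits of C2 are zero) is the arithmetic requirement for the identity rather than a statement of what the patch checks.

```cpp
#include <cassert>
#include <cstdint>

enum class LogicOp { And, Or, Xor };

inline std::uint64_t applyOp(LogicOp op, std::uint64_t a, std::uint64_t b) {
  switch (op) {
  case LogicOp::And: return a & b;
  case LogicOp::Or:  return a | b;
  case LogicOp::Xor: return a ^ b;
  }
  return 0;
}

// Original form: (x << c1) op c2.
inline std::uint64_t shiftThenOp(LogicOp op, std::uint64_t x, unsigned c1,
                                 std::uint64_t c2) {
  assert(c1 < 64);
  return applyOp(op, x << c1, c2);
}

// Rewritten form: (x op (c2 >> c1)) << c1.
inline std::uint64_t opThenShift(LogicOp op, std::uint64_t x, unsigned c1,
                                 std::uint64_t c2) {
  assert(c1 < 64);
  return applyOp(op, x, c2 >> c1) << c1;
}

// AND ignores the low c1 bits of c2 because (x << c1) has zeros there.
// OR/XOR need those bits of c2 to be zero for the two forms to agree.
inline bool rewriteIsValid(LogicOp op, std::uint64_t c2, unsigned c1) {
  assert(c1 < 64);
  if (op == LogicOp::And)
    return true;
  return (c2 & ((std::uint64_t{1} << c1) - 1)) == 0;
}

// True when the value can be encoded as an unsigned 32-bit immediate.
inline bool fitsUnsigned32(std::uint64_t v) { return v <= 0xFFFFFFFFull; }
```

For example, with C1 = 8 and C2 = 0xFFFFFFFF00, C2 itself needs more than 32 bits, while C2 >> C1 = 0xFFFFFFFF fits an unsigned 32-bit immediate.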

Differential Revision: https://reviews.llvm.org/D59867

llvm-svn: 357196
2019-03-28 18:05:37 +00:00
Sanjay Patel 5bbf6f0bd8 [x86] avoid cmov in movmsk reduction
This is probably the least important of our movmsk problems, but I'm starting
at the bottom to reduce distractions.

We were creating a select_cc which bypasses the select and bitmask codegen
optimizations that we have now. If we produce a compare+negate instead, we
allow things like neg/sbb carry bit hacks, and in all cases we avoid a cmov.
There's no partial register update danger in these sequences because we always
produce the zero-register xor ahead of the 'set' if needed.

There seems to be a missing fold for sext of a bool bit here:

negl %ecx
movslq %ecx, %rax

...but that's an independent transform.
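
As a rough scalar illustration of the select versus compare+negate forms (hypothetical function names, not the DAG code from the patch): both produce an all-ones or all-zeros result, but the second form does not need a conditional move.

```cpp
#include <cassert>
#include <cstdint>

// Select form: pick -1 or 0 depending on the comparison.
inline std::int32_t maskViaSelect(std::uint32_t bits, std::uint32_t want) {
  return bits == want ? -1 : 0;
}

// Compare+negate form: materialize the comparison as 0 or 1, then negate it.
inline std::int32_t maskViaNegate(std::uint32_t bits, std::uint32_t want) {
  return -static_cast<std::int32_t>(bits == want);
}

// Usage example with hypothetical inputs.
inline void example() {
  assert(maskViaSelect(0x3, 0x3) == maskViaNegate(0x3, 0x3));
  assert(maskViaSelect(0x1, 0x3) == maskViaNegate(0x1, 0x3));
}
```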

Differential Revision: https://reviews.llvm.org/D59818

llvm-svn: 357172
2019-03-28 14:16:13 +00:00
Clement Courbet 699dc025a6 [X86MacroFusion] Handle branch fusion (AMD CPUs).
Summary:
This adds a BranchFusion feature to replace the usage of the MacroFusion
for AMD CPUs.

See D59688 for context.

Reviewers: andreadb, lebedev.ri

Subscribers: hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59872

llvm-svn: 357171
2019-03-28 14:12:46 +00:00
Roman Lebedev c325be6cef [X86] AMD Piledriver (BdVer2): fine-tune some latencies
Based on llvm-exegesis measurements.

Now that llvm-exegesis is ~2 orders of magnitude faster, and is a bit smarter,
it is now possible to continue cleanup of the scheduler model.

With this, there are no more latency inconsistencies for the
opcodes that produce stable measurements, and only a few inconsistencies
for unstable measurements (MMX_* opcodes, opcodes that llvm-exegesis
measures by chaining - CMP, TEST, BT, SETcc, CVT, MOV, etc.)

llvm-svn: 357169
2019-03-28 13:40:34 +00:00
Clement Courbet 54c95e5172 [NFC] Format InlineFeatureIgnoreList.
To avoid more spurious clang-format changes when adding features (D59872).

llvm-svn: 357168
2019-03-28 13:38:58 +00:00
Simon Pilgrim 22be913ac0 [X86][AVX] Add missing vXi16 broadcast fold patterns
Now that D59484 has landed it's easier to add these.

Added missing AVX512BW v32i16 equivalents while I was at it.

llvm-svn: 357155
2019-03-28 10:25:13 +00:00
Sanjay Patel 1df0bb6264 [x86] improve AVX lowering of vector zext
If we know the 2 halves of an oversized zext-in-reg are the same,
don't create those halves independently.

I tried several different approaches to fold this, but it's difficult
to get right during legalization. In the default path, we are creating
a generic shuffle that looks like an unpack high, but it can get
transformed into a different mask (a blend), so it's not
straightforward to match that. If we try to fold after it actually
becomes an X86ISD::UNPCKH node, we can't be sure what the operand node
is - it might be a generic shuffle, or it could be some x86-specific op.

From the test output, we should be doing something like this for SSE4.1
as well, but I'd rather leave that as a follow-up since it involves
changing lowering actions.

Differential Revision: https://reviews.llvm.org/D59777

llvm-svn: 357129
2019-03-27 22:42:11 +00:00
Sanjay Patel 704817912a [x86] look through bitcast operand of MOVMSK
This is not exactly NFC because it should make further combines
of MOVMSK easier to match, but there should be no outward differences
because we have isel patterns in place specifically to allow this. See:
  // Also support integer VTs to avoid a int->fp bitcast in the DAG.

llvm-svn: 357128
2019-03-27 22:24:03 +00:00
Craig Topper 4bc38cfe29 [X86ISelDAGToDAG] Move initialization of OptForSize and OptForMinSize from PreprocessISelDAG to runOnMachineFunction. NFCI
This makes more sense as a place to initialize these. I don't think runOnMachineFunction was overridden when these cached values were originally created.

llvm-svn: 357123
2019-03-27 21:05:07 +00:00
Craig Topper 7c9afc35bc [X86] Add post-isel pseudos for rotate by immediate using SHLD/SHRD
Haswell CPUs have special support for SHLD/SHRD with the same register for both sources. Such an instruction will go to the rotate/shift unit on port 0 or 6. This gives it 1 cycle latency and 0.5 cycle reciprocal throughput. When the register is not the same, it becomes a 3 cycle operation on port 1. Sandybridge and Ivybridge always have 1 cycle latency and 0.5 cycle reciprocal throughput for any SHLD.

When FastSHLDRotate feature flag is set, we try to use SHLD for rotate by immediate unless BMI2 is enabled. But MachineCopyPropagation can look through a copy and change one of the sources to be different. This will break the hardware optimization.

This patch adds a pseudo instruction to hide the second source input until after register allocation and MachineCopyPropagation. I'm not sure if this is the best way to do this or if there's some other way we can make this work.
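
To see why SHLD with the same register for both sources is a rotate, here is a small scalar model. It is a sketch only: `shld32` and `rotl32` are hypothetical helper names, and the model assumes a 32-bit operand with a shift count strictly between 0 and 32.

```cpp
#include <cassert>
#include <cstdint>

// Model of 32-bit SHLD dst, src, n: shift dst left by n and fill the vacated
// low bits with the top n bits of src. Assumes 0 < n < 32.
inline std::uint32_t shld32(std::uint32_t dst, std::uint32_t src, unsigned n) {
  assert(n > 0 && n < 32);
  return (dst << n) | (src >> (32 - n));
}

// Plain rotate left by n, for comparison. Assumes 0 < n < 32.
inline std::uint32_t rotl32(std::uint32_t x, unsigned n) {
  assert(n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}
```

With `dst == src` the two functions compute the same value. Once a pass substitutes a different register for one source, the result is only the same if that register still holds the same value, and per the message above the hardware fast path is lost either way.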

Fixes PR41055

Differential Revision: https://reviews.llvm.org/D59391

llvm-svn: 357096
2019-03-27 17:29:34 +00:00
Simon Pilgrim ccb71b2985 Revert rL356864 : [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.
........
Causes PR41249

llvm-svn: 357057
2019-03-27 10:25:02 +00:00
Craig Topper 7da7b97487 [X86] When iselling (x << C1) and/or/xor C2 as (x and/or/xor (C2>>C1)) << C1, go through the isel table instead of manually selecting.
Previously we manually selected the AND/OR/XOR with immediate and the SHL(or ADD if the shift is 1). But this was missing out on the opportunity to use a 64 bit AND with a 32-bit immediate and possibly other isel tricks we have built into the tables.

Instead, insert the new nodes into the DAG using insertDAGNode and allow them each to be selected through the normal table.

llvm-svn: 357049
2019-03-27 04:45:58 +00:00
Craig Topper 22387a56fe [X86] Simplify some code in matchBitExtract by using ANY_EXTEND.
We were manually outputting the code we would get from selecting ANY_EXTEND. We
can save some code by just letting an ANY_EXTEND go through isel on its own.

llvm-svn: 357045
2019-03-27 02:08:03 +00:00
Craig Topper 4dcabf8ddf [X86] In matchBitExtract, place all of the new nodes before Node's position in the DAG for the topological sort.
We were using OrigNBits, but that put all the nodes before the node we used to start the control computation. This caused some node earlier than the sequence we inserted to be selected before the sequence we created. We want our new sequence to be selected first since it depends on OrigNBits.

I don't have a test case. Found by reviewing the code.

llvm-svn: 356979
2019-03-26 05:31:32 +00:00
Craig Topper 10576fea82 [X86] In matchBitExtract, if we need to truncate the BEXTR make sure we put the BEXTR at Node's position in the DAG for the topological sort.
We were using OrigNBits, but that doesn't guarantee that it will be selected before the nodes that make up X.

llvm-svn: 356978
2019-03-26 05:12:23 +00:00
Craig Topper 795ebe3bff [X86] Remove unneeded FIXME. NFC
We do fold loads right below this.

llvm-svn: 356977
2019-03-26 05:12:21 +00:00
Craig Topper fd880d30b1 X86Parser: Fix potential reference to deleted object
Within the MatchFPUWaitAlias function, Operands[0] is potentially overwritten leading to &Op referencing a deleted object. To fix this, assign the reference after the function.

Differential Revision: https://reviews.llvm.org/D57376

llvm-svn: 356973
2019-03-26 03:12:43 +00:00
Craig Topper 3dce29b8e9 X86AsmParser: Do not process a non-existent token
This error can only happen if an unfinished operation is at Eof.

Patch by Brandon Jones

Differential Revision: https://reviews.llvm.org/D57379

llvm-svn: 356972
2019-03-26 03:12:41 +00:00
Craig Topper a17287f084 [X86] Update some of the getMachineNode calls from X86ISelDAGToDAG to also include a VT for a EFLAGS result.
This makes the nodes consistent with how they would be emitted from the isel
table.

llvm-svn: 356870
2019-03-25 07:22:18 +00:00
Craig Topper 1cc01c3228 [X86] When selecting (x << C1) op C2 as (x op (C2>>C1)) << C1, use the operation VT for the target constant.
Normally when the nodes we use here(AND32ri8 for example) are selected their
immediates are just converted from ConstantSDNode to TargetConstantSDNode
without changing VT from the original operation VT. So we should still be
emitting them with the operation VT.

Theoretically this could expose more accurate opportunities for CSE.

llvm-svn: 356869
2019-03-25 06:53:45 +00:00
Craig Topper 3810e35d3f [X86] Remove GetLo8XForm and use GetLo32XForm instead. NFCI
We were using this to create an AND32ri8 node from a 64-bit and, but that node
normally still uses a 32-bit immediate. So we should just truncate the existing
immediate to i32. We already verified it has the same value in bits 31:7.

llvm-svn: 356868
2019-03-25 06:53:44 +00:00
Craig Topper 5b43446831 [X86] Remove a couple unused SDNodeXForms. NFC
llvm-svn: 356867
2019-03-25 06:53:43 +00:00
Craig Topper 7c2554dd92 Revert r356688 "[X86] Don't avoid folding multiple use sign extended 8-bit immediate into instructions under optsize."
Looking back over how the one use optimization works, I don't think this is the right way to fix this.

llvm-svn: 356866
2019-03-25 01:25:32 +00:00
Simon Pilgrim 87d4ab8b92 [X86][SSE41] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Enable SSE41 ZERO_EXTEND_VECTOR_INREG shuffle combines - for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern we reduce the shuffles (port5-bottleneck on Intel) at the expense of creating a zero (pxor v,v) and an extra register move - which is a good trade off as these are pretty cheap and in most cases it doesn't increase register pressure.

This also exposed a missed opportunity to use combine to ZERO_EXTEND_VECTOR_INREG with folded loads - even if we're in the float domain.

llvm-svn: 356864
2019-03-24 19:06:35 +00:00
Simon Pilgrim a71c0ed471 [X86][AVX] Start shuffle combining from ZERO_EXTEND_VECTOR_INREG (PR40685)
Just enable this for AVX for now as SSE41 introduces extra register moves for the PMOVZX(PSHUFD(V)) -> UNPCKH(V,0) pattern (but otherwise helps reduce port5 usage on Intel targets).

Only AVX support is required for PR40685 as the issue is due to 8i8->8i32 zext shuffle leftovers.

llvm-svn: 356858
2019-03-24 16:30:35 +00:00
Sanjay Patel 7d676dfd86 [x86] improve the default expansion of uaddsat/usubsat
This is yet another step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
usubsat X, Y --> (X >u Y) ? X - Y : 0

We can't count on a sane vector ISA, so override the default (umin/umax)
expansion of unsigned add/sub saturate in cases where we do not have umin/umax.
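
The two expansions above can be written out in scalar C++ as a sanity check. This is a minimal sketch assuming wrapping unsigned arithmetic on a single element; the function names mirror the pseudo-ops in the message and are not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// uaddsat X, Y --> (X >u (X + Y)) ? -1 : X + Y
// If the add wraps, the wrapped sum is smaller than X, so clamp to all-ones.
template <typename T>
inline T uaddsat(T x, T y) {
  static_assert(std::numeric_limits<T>::is_integer &&
                    !std::numeric_limits<T>::is_signed,
                "unsigned integer type required");
  const T sum = static_cast<T>(x + y);
  return x > sum ? std::numeric_limits<T>::max() : sum;
}

// usubsat X, Y --> (X >u Y) ? X - Y : 0
template <typename T>
inline T usubsat(T x, T y) {
  static_assert(std::numeric_limits<T>::is_integer &&
                    !std::numeric_limits<T>::is_signed,
                "unsigned integer type required");
  return x > y ? static_cast<T>(x - y) : T{0};
}

// Usage example with hypothetical inputs.
inline void example() {
  assert(uaddsat<std::uint8_t>(200, 100) == 255);
  assert(usubsat<std::uint8_t>(5, 10) == 0);
}
```

Neither form needs umin/umax: each uses one unsigned compare and one select.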

Differential Revision: https://reviews.llvm.org/D59006

llvm-svn: 356855
2019-03-24 13:55:54 +00:00
Sanjay Patel 2e92846d36 [x86] reduce code duplication; NFC
llvm-svn: 356836
2019-03-23 15:00:52 +00:00
Craig Topper ce1ed55a4a [X86] Use xmm registers to implement 64-bit popcnt on 32-bit targets if possible if popcnt instruction is not available
On 32-bit targets without popcnt, we currently expand 64-bit popcnt to sequences of arithmetic and logic ops for each 32-bit half and then add the 32-bit halves together. If we have xmm registers we can use those to implement the operation instead. This results in fewer instructions than doing two separate 32-bit popcnt sequences.

This mitigates some of PR41151 for the i64 on i686 case when we have SSE2.
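
For reference, the per-half expansion described above can be sketched with the well-known bit-parallel popcount. This is an assumption about the general shape of the sequence, not the exact instructions LLVM emits, and `popcount32` and `popcount64ViaHalves` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Classic bit-parallel popcount on one 32-bit half, using only shifts,
// masks, adds, and one multiply.
inline std::uint32_t popcount32(std::uint32_t v) {
  v = v - ((v >> 1) & 0x55555555u);                 // 2-bit partial sums
  v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // 4-bit partial sums
  v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
  return (v * 0x01010101u) >> 24;                   // add the four bytes
}

// 64-bit popcount on a 32-bit target without popcnt: run the sequence on
// each 32-bit half and add the results.
inline std::uint32_t popcount64ViaHalves(std::uint64_t v) {
  const std::uint32_t lo = static_cast<std::uint32_t>(v);
  const std::uint32_t hi = static_cast<std::uint32_t>(v >> 32);
  return popcount32(lo) + popcount32(hi);
}

// Usage example with hypothetical inputs.
inline void example() {
  assert(popcount64ViaHalves(0x8000000000000001ull) == 2);
}
```

Running the sequence twice doubles the instruction count, which is the cost the xmm-based lowering is meant to reduce.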

Differential Revision: https://reviews.llvm.org/D59662

llvm-svn: 356808
2019-03-22 20:47:02 +00:00
Craig Topper 1ffd8e8114 [X86] Use movq for i64 atomic load on 32-bit targets when sse2 is enabled
We used a lock cmpxchg8b to do i64 atomic loads. But if we have SSE2 we can do better and use a plain movq to do the load instead.

I tried to just use an f64 atomic load and add isel patterns to MOVSD(which the domain fixing pass can turn to MOVQ), but the atomic_load SDNode in TargetSelectionDAG.td requires the type to be integer.

So I've emitted VZEXT_LOAD instead which should be selected by isel to a MOVQ. Hopefully we don't need a specific atomic flavor of this. I kept the memory operand from the original AtomicSDNode. I wasn't sure if I might need to set the MOVolatile flag?

I've left some FIXMEs for improvements we can do without SSE2.

Differential Revision: https://reviews.llvm.org/D59679

llvm-svn: 356807
2019-03-22 20:46:56 +00:00
Simon Pilgrim 564392d752 [X86] lowerShuffleAsBitMask - ensure float bit masks are the correct width (PR41203)
llvm-svn: 356784
2019-03-22 17:23:55 +00:00
Craig Topper b3bad3dce3 [X86] Use LoadInst->getType() instead of LoadInst->getPointerOperandType()->getElementType(). NFCI
For the future day when pointers don't have element types, we should just use the type of the load result instead.

llvm-svn: 356721
2019-03-21 21:37:18 +00:00
Simon Pilgrim c2e4405475 [X86] canonicalizeBitSelect - don't attempt to canonicalize mask registers
We don't use X86ISD::ANDNP for mask registers.

Test case from @craig.topper (Craig Topper)

llvm-svn: 356696
2019-03-21 18:32:38 +00:00
Craig Topper c14f3e4222 [X86] Don't avoid folding multiple use sign extended 8-bit immediate into instructions under optsize.
Under optsize we try to avoid folding immediates into instructions. But if the immediate is 16 bits or 32 bits wide and can be encoded as an 8-bit immediate, we don't save enough from disabling the folding unless the immediate has enough uses to make up for the size of the move, which is either 3 bytes or 5 bytes since there are no sign-extended 8-bit moves. We would also save something if the immediate was live out of the basic block and thus a move was unavoidable, but that would require a more advanced heuristic than just counting uses.

Note we only avoid folding multiple use immediates into the patterns that use X86ISD::ADD/SUB/XOR/OR/AND/CMP/ADC/SBB nodes and not the more common ISD::ADD/SUB/XOR/OR/AND nodes.

Differential Revision: https://reviews.llvm.org/D59522

llvm-svn: 356688
2019-03-21 17:38:58 +00:00
Craig Topper 9f0b17a248 [ScalarizeMaskedMemIntrin] Add support for scalarizing expandload and compressstore intrinsics.
This adds support for scalarizing these intrinsics as well as the X86TargetTransformInfo support to avoid scalarizing them in the cases X86 can handle.

I've omitted handling special cases for constant masks for this first pass. Though CodeGenPrepare can constant fold the branch conditions and remove some of the control flow anyway.

Fixes PR40994 and covers most of PR3666. Might want to implement constant masks to close that.

Differential Revision: https://reviews.llvm.org/D59180

llvm-svn: 356687
2019-03-21 17:38:52 +00:00
Craig Topper 8d46403b8e [X86] Add CMPXCHG8B feature flag. Set it for all CPUs except i386/i486 including 'generic'. Disable use of CMPXCHG8B when this flag isn't set.
CMPXCHG8B was introduced on i586/pentium generation.

If it's not enabled, limit the atomic width to 32 bits so the AtomicExpandPass will expand to lib calls. Unclear if we should be using a different limit for other configs. The default is 1024 and experimentation shows that using an i256 atomic will cause a crash in SelectionDAG.

Differential Revision: https://reviews.llvm.org/D59576

llvm-svn: 356631
2019-03-20 23:35:49 +00:00
Craig Topper 0367553304 [X86] Call lowerShuffleAsBitMask for 512-bit vectors in lowerShuffleAsBlend.
This patch enables the use of lowerShuffleAsBitMask for 512-bit blends before
falling back to move immediate, GPR to k-register, and masked op.

I had to make some changes to support v8i64 when i64 is not a legal type. And to
support floating point types.

This trades a load for the move immediate and GPR move which is higher latency.
But it's probably better for register pressure not having to hop through other
register classes. The load+and should play better with LICM and
rematerialization I think.

Differential Revision: https://reviews.llvm.org/D59479

llvm-svn: 356618
2019-03-20 21:30:20 +00:00
Simon Pilgrim 2acca37a2d [X86] Use getConstantOperandAPInt to detect out-of-range shifts.
llvm-svn: 356549
2019-03-20 11:41:52 +00:00