Commit Graph

15315 Commits

Igor Breger 36d447d8a8 [GlobalISel][X86] Support variadic function calls.
Summary: Support variadic function calls. Port the implementation from X86FastISel.

Reviewers: zvi, guyblank, oren_ben_simhon

Reviewed By: guyblank

Subscribers: rovka, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D37261

llvm-svn: 312130
2017-08-30 15:10:15 +00:00
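
For readers unfamiliar with the feature being ported, here is a minimal C++ sketch (my example, not from the patch) of the kind of variadic call this lowering has to handle:

```
#include <cstdarg>
#include <cstdio>

// A classic variadic function: the fixed argument is passed normally and
// the variadic ones follow the x86-64 calling convention rules.
static int sum(int count, ...) {
  va_list args;
  va_start(args, count);
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += va_arg(args, int);
  va_end(args);
  return total;
}

int main() {
  printf("%d\n", sum(3, 1, 2, 3)); // prints 6
  return 0;
}
```
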
Sanjay Patel 7b8183fdab fix more typos; NFC
llvm-svn: 312120
2017-08-30 13:19:23 +00:00
Sanjay Patel 7e5af84cae fix typos; NFC
llvm-svn: 312119
2017-08-30 13:16:25 +00:00
Craig Topper 17854ecf24 [AVX512] Correct isel patterns to support selecting masked vbroadcastf32x2/vbroadcasti32x2
Summary:
This patch adjusts the patterns to make the result type of the broadcast node vXf64/vXi64, and then adds a bitcast to vXi32 after that. Intrinsic lowering was also adjusted to generate this new pattern.

Fixes PR34357

We should probably just drop the intrinsic entirely and use native IR, but I'll leave that for a future patch.

Any idea what instruction we should be lowering the floating point 128-bit result version of this pattern to?  There's a 128-bit v2i32 integer broadcast but not an fp one.

Reviewers: aymanmus, zvi, igorb

Reviewed By: aymanmus

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37286

llvm-svn: 312101
2017-08-30 07:48:39 +00:00
Craig Topper 48a7917079 [AVX512] Use 256-bit extract instructions for extracting bits [255:128] from a 512-bit register
This enables a smaller encoding by using a VEX instruction when possible.

Differential Revision: https://reviews.llvm.org/D37092

llvm-svn: 312100
2017-08-30 07:26:12 +00:00
Craig Topper ef1f71669e [X86] Apply SlowIncDec feature to Sandybridge/Ivybridge CPUs as well
Currently we start applying this on Haswell and newer. I don't believe anything changed in the Haswell architecture to make this the right cutoff point. The partial flag handling around this has been roughly the same since Sandybridge.

Differential Revision: https://reviews.llvm.org/D37250

llvm-svn: 312099
2017-08-30 05:00:35 +00:00
Craig Topper 641e2af9e8 [X86] Provide a separate feature bit for macro fusion support instead of basing it on the AVX flag
Summary:
Currently we determine if macro fusion is supported based on the AVX flag as a proxy for the processor being Sandy Bridge.

This is really strange now that AMD supports AVX. It also means that if the user explicitly disables AVX, we disable macro fusion.

This patch adds an explicit macro fusion feature. I've also enabled it for the generic 64-bit CPU (which doesn't have AVX).

This is probably another candidate for being in the MI layer, but for now I at least wanted to correct the overloading of the AVX feature.

Reviewers: spatel, chandlerc, RKSimon, zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37280

llvm-svn: 312097
2017-08-30 04:34:48 +00:00
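
A hedged sketch of the change in spirit; the names below are illustrative, not the actual X86Subtarget API:

```
// Before: AVX served as a proxy for "Sandy Bridge or newer", so explicitly
// disabling AVX also (incorrectly) disabled macro fusion.
// After: an explicit feature bit is queried instead.
struct SubtargetSketch {
  bool HasAVX = false;         // real feature, but a poor proxy
  bool HasMacroFusion = false; // hypothetical explicit feature bit

  bool supportsMacroFusionOld() const { return HasAVX; }
  bool supportsMacroFusionNew() const { return HasMacroFusion; }
};
```
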
Craig Topper 559f61e179 [X86] Finish the subtarget and predicate implementation of CLWB.
We don't have an intrinsic implemented for this instruction yet, but it looked odd that we were missing the accessor method from the subtarget.

llvm-svn: 312064
2017-08-29 23:13:36 +00:00
Craig Topper a54ca1a662 [X86] Fix copy pasto from r311841. Call getOnesVector instead of getZeroVector.
llvm-svn: 312006
2017-08-29 15:29:36 +00:00
Eric Christopher 5bea524091 Revert "The current version of the LLVM X86 disassembler incorrectly interprets some possible sets of x86 prefixes. This patch is the first step toward closing PR7709 and PR17697. Follow-up patches will close the related PRs." temporarily while some regressions are addressed.
This reverts commit r311882.

llvm-svn: 311987
2017-08-29 08:23:46 +00:00
Craig Topper 62c47a2aa5 Mark Knights Landing as having slow two-memory-operand instructions
Summary: Knights Landing, because it is Atom-derived, has slow two-memory-operand instructions. Mark the Knights Landing CPU model accordingly.

Patch by David Zarzycki.

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37224

llvm-svn: 311979
2017-08-29 05:14:27 +00:00
Craig Topper fa86fd928e [X86] Make 128/256-bit extract_subvector Legal instead of Custom. Move combining with BUILD_VECTOR from Legalization to DAG combine
EXTRACT_SUBVECTOR was marked Custom solely so we could combine it with BUILD_VECTOR operations to create smaller BUILD_VECTORS during Legalization. But that sort of combining should really be done by the DAG combiner.

This patch adds the last piece of DAG combine support needed to handle this. Once that's done, we can make the EXTRACT_SUBVECTOR operations Legal.

Differential Revision: https://reviews.llvm.org/D37197

llvm-svn: 311893
2017-08-28 15:32:50 +00:00
Andrew V. Tischenko 574962a3b3 The current version of the LLVM X86 disassembler incorrectly interprets some possible sets of x86 prefixes. This patch is the first step toward closing PR7709 and PR17697. Follow-up patches will close the related PRs.
Differential Revision: https://reviews.llvm.org/D36788

M    lib/Target/X86/Disassembler/X86DisassemblerDecoder.cpp
M    lib/Target/X86/Disassembler/X86DisassemblerDecoder.h
A    test/MC/Disassembler/X86/prefixes-i386.s
A    test/MC/Disassembler/X86/prefixes-x86_64.s
M    test/MC/Disassembler/X86/prefixes.txt

llvm-svn: 311882
2017-08-28 10:43:14 +00:00
Gadi Haber d76f7b824e [X86][Haswell] Updating HSW instruction scheduling information
This patch completely replaces the instruction scheduling information for the Haswell architecture target by modifying the file X86SchedHaswell.td located under the X86 Target.
We used the scheduling information retrieved from the Haswell architects in order to replace and modify the existing scheduling.
The patch continues the scheduling replacement effort started with the SNB target in r307529 and r310792.
Information includes latency, number of micro-Ops and used ports by each HSW instruction.

Please expect some performance fluctuations due to code alignment effects.

Reviewers: RKSimon, zvi, aymanmus, craig.topper, m_zuckerman, igorb, dim, chandlerc, aaboud

Differential Revision: https://reviews.llvm.org/D36663

llvm-svn: 311879
2017-08-28 10:04:16 +00:00
Craig Topper 33681161c4 [X86] Use getUnpackl helper to create an ISD::VECTOR_SHUFFLE instead of using X86ISD::UNPCKL in reduceVMULWidth.
This runs fairly early, so we should use target-independent nodes if possible.

llvm-svn: 311873
2017-08-28 05:14:38 +00:00
Craig Topper 2c77011d15 [X86] Add an early out to combineLoopMAddPattern and combineLoopSADPattern when SSE2 is disabled.
Without this the madd.ll and sad.ll test cases both throw assertions if you run them with SSE2 disabled.

llvm-svn: 311872
2017-08-28 04:29:08 +00:00
Craig Topper 80075a5fb7 [AVX512] Add more patterns for using masked moves for subvector extracts of the lowest subvector. This time with bitcasts between the vselect and the extract.
llvm-svn: 311856
2017-08-27 19:03:36 +00:00
Craig Topper 36bd247f64 [X86] Add a target-specific DAG combine to combine extract_subvector from all zero/one build_vectors.
llvm-svn: 311841
2017-08-27 05:39:57 +00:00
Craig Topper 71dab64a57 [X86] Use getOnesVector instead of using DAG.getConstant(-1).
llvm-svn: 311840
2017-08-27 03:26:04 +00:00
Craig Topper a088362e88 [AVX512] Add patterns to match masked extract_subvector with bitcasts between the vselect and the extract_subvector. Remove the late DAG combine.
We used to do a late DAG combine to move the bitcasts out of the way, but I'm starting to think that it's better to canonicalize extract_subvector's type to match the type of its input. I've seen some cases where we've formed two different extract_subvectors from the same node, where one had a bitcast and the other didn't.

Add some more test cases to ensure we've also got most of the zero masking covered too.

llvm-svn: 311837
2017-08-26 22:24:57 +00:00
Craig Topper 69ec201848 [X86] Qualify the RMW INC/DEC patterns with NotSlowIncDec.
We were suppressing most uses of INC/DEC, but this one seems to have been missed.

llvm-svn: 311828
2017-08-26 06:24:25 +00:00
Craig Topper d27386a9ed [AVX512] Add patterns to use masked moves to implement masked extract_subvector of the lowest subvector.
This only supports 32- and 64-bit element sizes for now, but we could probably do 16- and 8-bit elements with BWI.

llvm-svn: 311821
2017-08-25 23:34:59 +00:00
Chandler Carruth 4b611a896d [x86] Teach the backend to fold more read-modify-write memory operands
to instructions.

These can't be reasonably matched in tablegen due to the handling of
flags, so we have to do this in C++ code. We only did it for `inc` and
`dec` historically, this starts fleshing that out to more interesting
instructions. Notably, this handles transferring operands to `add` and
`sub`.

Currently this forces them into a register. The next patch will add
support for keeping immediate operands as immediates. Then I'll extend
this beyond just `add` and `sub`.

I'm not super thrilled by the repeated switches in the code but
everything else I tried was really ugly or problematic.

Many thanks to Craig Topper for the suggestions about where to even
begin here and how to make this stuff work.

Differential Revision: https://reviews.llvm.org/D37130

llvm-svn: 311806
2017-08-25 22:50:52 +00:00
Craig Topper c93d0556ae [X86] Use SDValue::getOpcode instead of calling getNode and calling getOpcode on that. NFC
llvm-svn: 311765
2017-08-25 05:36:29 +00:00
Craig Topper fc53dc2d43 [X86] Use isUInt and isShiftedUInt instead of using our own masking and compares. NFCI
While there, use a local variable instead of calling C->getZExtValue() repeatedly.

llvm-svn: 311764
2017-08-25 05:04:34 +00:00
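
For reference, a small usage sketch (mine) of the llvm::isUInt / llvm::isShiftedUInt helpers from Support/MathExtras.h that this commit switches to; the predicate names are made up for illustration:

```
#include "llvm/Support/MathExtras.h"
#include <cstdint>

// True iff V fits in an unsigned 8-bit immediate.
bool fitsUImm8(uint64_t V) { return llvm::isUInt<8>(V); }

// isShiftedUInt<N, S>: true iff V is an N-bit unsigned value shifted left
// by S bits, i.e. the low S bits are zero and the rest fits in N bits.
bool fitsUImm8ShiftedBy2(uint64_t V) { return llvm::isShiftedUInt<8, 2>(V); }
```
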
Chandler Carruth 96db308f03 [x86] NFC: More refactoring to pave the way to extending this ISel logic
to handle other x86 pseudos that carry flags and thus can't be matched
by our ISel patterns with fused memory accesses.

Differential Revision: https://reviews.llvm.org/D37088

llvm-svn: 311749
2017-08-25 02:06:36 +00:00
Chandler Carruth 03258f251f [x86] NFC - Refactor the custom lowering of `(load; op; store)` RMW sequences.
This extracts the code out of a giant switch in preparation for expanding it to
handle operations other than `inc` and `dec`. Add a FIXME indicating what's
coming here.

Differential Revision: https://reviews.llvm.org/D37045

llvm-svn: 311748
2017-08-25 02:04:03 +00:00
Craig Topper 355d8cff49 [X86] Add TBM instructions to X86InstrInfo::isDefConvertible.
This allows us to remove "test" instructions and use the flags from the TBM instructions directly.

llvm-svn: 311747
2017-08-25 01:59:06 +00:00
Chandler Carruth 5b491808f5 [x86] Back out one aspect of r311318: don't generically set
FeatureSlowUAMem32.

The idea was to mark things that are slow on widely available processors
as slow in the generic CPU so that the code generated for that CPU would
be fast across those processors. However, for this feature that doesn't
work out very well at all.

The problem here is that you can very easily enable AVX or AVX2 on top
of this generic CPU. For example, this can happen just by using AVX2
intrinsics from Clang within a region of code guarded by a dynamic CPU
feature test. When you do that, the generated code with SlowUAMem32 set
is ... amazingly slower. The problem is that there really aren't very
good alternatives to the unaligned loads, and so our vector codegen
regresses significantly.

The other issue is that there are plenty of AMD CPUs with AVX1 that
don't set FeatureSlowUAMem32 and so we shouldn't just check for AVX2
instead of this special feature. =/

It would be nice to have the target attribute logic be able to
enable/disable more than just one feature at a time and control this in
a more fine grained and useful way, but that doesn't seem easy. Given
that it is only Sandybridge and Ivybridge that set this feature, for now
I'm just backing it out of the generic CPU. That has the additional
advantage of going back to the previous state that people seemed vaguely
happy with.

llvm-svn: 311740
2017-08-25 00:56:05 +00:00
Chandler Carruth 8ac488b161 [x86] Fix an amazing goof in the handling of sub, or, and xor lowering.
The comment for this code indicated that it should work similar to our
handling of add lowering above: if we see uses of an instruction other
than flag usage and store usage, it tries to avoid the specialized
X86ISD::* nodes that are designed for flag+op modeling and emits an
explicit test.

Problem is, only the add case actually did this. In all the other cases,
the logic was incomplete and inverted. Any time the value was used by
a store, we bailed on the specialized X86ISD node. All of this appears
to be historical, from when we had different logic here. =/

Turns out, we have quite a few patterns designed around these nodes. We
should actually form them. I fixed the code to match what we do for add,
and it has quite a positive effect just within some of our test cases.
The only thing close to a regression I see is using:

  notl %r
  testl %r, %r

instead of:

  xorl -1, %r

But we can add a pattern or something to fold that back out. The
improvements seem more than worth this.

I've also worked with Craig to update the comments to no longer be
actively contradicted by the code. =[ Some of this still remains
a mystery to both Craig and myself, but this seems like a large step in
the direction of consistency and slightly more accurate comments.

Many thanks to Craig for help figuring out this nasty stuff.

Differential Revision: https://reviews.llvm.org/D37096

llvm-svn: 311737
2017-08-25 00:34:07 +00:00
Sanjay Patel e404cbff66 [DAG] convert vector select-of-constants to logic/math
This goes back to a discussion about IR canonicalization. We'd like to preserve and convert
more IR to 'select' than we currently do because that's likely the best choice in IR:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105335.html
...but that's often not true for codegen, so we need to account for this pattern coming into
the backend and transform it to better DAG ops.

Steps in this patch:

  1. Add an EVT param to the existing convertSelectOfConstantsToMath() TLI hook to more finely
     enable this transform. Other targets will probably want that anyway to distinguish scalars
     from vectors. We're using that here to exclude AVX512 targets, but it may not be necessary.

  2. Convert a vselect to ext+add. This eliminates a constant load/materialization, and the
     vector ext is often free.

Implementing a more general fold using xor+and can be a follow-up for targets that don't have
a legal vselect. It's also possible that we can remove the TLI hook for the special case fold
implemented here because we're eliminating a constant, but it needs to be tested on other
targets.

Differential Revision: https://reviews.llvm.org/D36840

llvm-svn: 311731
2017-08-24 23:24:43 +00:00
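
A scalarized illustration (my example, not from the patch) of the vselect-to-ext+add idea when the two constants differ by one:

```
#include <cstdint>

// Per-lane view of: vselect mask, splat(4), splat(3).
void selectOfConstants(const bool *mask, int32_t *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = mask[i] ? 4 : 3;
  // The backend can compute out[i] = 3 + (int32_t)mask[i] instead: an
  // extend of the mask plus an add, eliminating the load/materialization
  // of one of the constant vectors.
}
```
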
Coby Tayree ee1bc325c0 [fixup][rL311639]
rL311639 made X86AsmParser a dependency of X86AsmPrinter, which broke builds.
This fix adds the necessary dependency.

llvm-svn: 311657
2017-08-24 14:10:50 +00:00
Tobias Grosser d7eb619299 Model cache size and associativity in TargetTransformInfo
Summary:
We add the precise cache sizes and associativity for the following Intel
architectures:

  - Penryn
  - Nehalem
  - Westmere
  - Sandy Bridge
  - Ivy Bridge
  - Haswell
  - Broadwell
  - Skylake
  - Kabylake

For several months now, Polly has been using a performance model for BLAS computations
that derives optimal cache and register tile sizes from cache and latency
information (based on ideas from "Analytical Modeling Is Enough for High-Performance BLIS" by Tze Meng Low, published in TOMS 2016).
While bootstrapping this model, these target values have been kept in Polly.
However, as our implementation is now rather mature, it seems time to teach
LLVM itself about cache sizes.

Interestingly, L1 and L2 cache sizes are pretty constant across
micro-architectures, hence a set of architecture specific default values
seems like a good start. They can be expanded to more target specific values,
in case certain newer architectures require different values. For now a set
of Intel architectures are provided.

Just as a little teaser, for a simple gemm kernel this model allows us to
improve performance from 1.2s to 0.27s. For gemm kernels with less optimal
memory layouts, even larger speedups have been observed.

Reviewers: Meinersbur, bollu, singam-sanjay, hfinkel, gareevroman, fhahn, sebpop, efriedma, asb

Reviewed By: fhahn, asb

Subscribers: lsaba, asb, pollydev, llvm-commits

Differential Revision: https://reviews.llvm.org/D37051

llvm-svn: 311647
2017-08-24 09:46:25 +00:00
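
A hedged sketch of how a transform might consume the new interface; the signature is quoted from memory and may differ in detail from the tree:

```
#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Derive a tile size from the precise per-target L1 data cache size,
// falling back to a conservative guess if the target reports nothing.
unsigned l1TileBytes(const TargetTransformInfo &TTI) {
  if (auto Size = TTI.getCacheSize(TargetTransformInfo::CacheLevel::L1D))
    return *Size;
  return 32 * 1024; // assumed fallback, not from the commit
}
```
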
Coby Tayree 21c312d8c6 [LLVM][x86][Inline Asm] support for GCC style inline asm - Y<x> constraints
This patch is intended to enable the use of the basic double-letter constraints used in GCC extended inline asm {Yi Y2 Yz Y0 Ym Yt}.
Supersedes D35204
Clang counterpart: D36371

Differential Revision: https://reviews.llvm.org/D36369

llvm-svn: 311644
2017-08-24 09:08:33 +00:00
Coby Tayree d89128925b [X86AsmParser] Refactoring, (almost) NFC.
Some refactoring to X86AsmParser, mostly regarding the way rewrites are conducted.
Mainly, we try to concentrate all the rewrite effort under one hood, so it'll hopefully be less of a mess and easier to maintain and understand.
Naturally, some frontend tests were affected: D36794

Differential Revision: https://reviews.llvm.org/D36793

llvm-svn: 311639
2017-08-24 08:46:25 +00:00
Igor Breger 47be5fbbe9 [GlobalISel][X86] Support G_IMPLICIT_DEF.
Summary: Support G_IMPLICIT_DEF.

Reviewers: zvi, guyblank, t.p.northover

Reviewed By: guyblank

Subscribers: rovka, llvm-commits, kristof.beyls

Differential Revision: https://reviews.llvm.org/D36733

llvm-svn: 311633
2017-08-24 07:06:27 +00:00
Aditya Nandakumar efd8a84cd5 [GISel]: Translate phi into G_PHI
G_PHI has the same semantics as PHI but also has types.
This lets us verify that the types in the G_PHI are consistent.
This also allows specifying legalization actions for G_PHIs.

https://reviews.llvm.org/D36990

llvm-svn: 311596
2017-08-23 20:45:48 +00:00
Benjamin Kramer 3c56b0bb8f [X86] Fix -Wenum-compare warning
lib/Target/X86/X86ISelLowering.cpp:34613:25: error: enumeral mismatch in
conditional expression: 'llvm::ISD::NodeType' vs
'llvm::X86ISD::NodeType'

llvm-svn: 311580
2017-08-23 17:50:46 +00:00
Craig Topper 853a8d9ffc [AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors
There are no 512-bit blend instructions so we shouldn't create SHRUNKBLEND for them.

On a side note, it looks like there may be a missed opportunity for constant folding TESTM when LHS and RHS are equal.

This fixes PR34139.

Differential Revision: https://reviews.llvm.org/D36992

llvm-svn: 311572
2017-08-23 16:41:02 +00:00
Craig Topper f1417ca625 [X86] Remove X86ISD::FMADD in favor of ISD::FMA
There's no reason to have a target specific node with the same semantics as a target independent opcode.

This should simplify D36335 so that it doesn't need to touch X86ISelDAGToDAG.cpp

Differential Revision: https://reviews.llvm.org/D36983

llvm-svn: 311568
2017-08-23 16:28:04 +00:00
Dean Michael Berris 0884b73220 [XRay][CodeGen] Use PIC-friendly code in XRay sleds; remove synthetic references in .text
Summary:
This change achieves two things:

  - Redefine the Custom Event handling instrumentation points emitted by
    the compiler to not require dynamic relocation of references to the
    __xray_CustomEvent trampoline.

  - Remove the synthetic reference we emit at the end of a function that
    we used to keep auxiliary sections alive in favour of SHF_LINK_ORDER
    associated with the section where the function is defined.

To achieve the custom event handling change, we've had to introduce the
concept of sled versioning -- this will need to be supported by the
runtime to allow us to understand how to turn on/off the new version of
the custom event handling sleds. That change has to land first before we
change the way we write the sleds.

To remove the synthetic reference, we rely on a relatively new linker
feature that preserves the sections that are associated with each other.
This allows us to limit the effects on the .text section of ELF
binaries.

Because we're still using absolute references that are resolved at
runtime for the instrumentation map (and function index) maps, we mark
these sections writable. In the future we can re-define the entries in
the map to use relative relocations instead that can be statically
determined by the linker. That change will be a bit more invasive so we
defer this for later.

Depends on D36816.

Reviewers: dblaikie, echristo, pcc

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36615

llvm-svn: 311525
2017-08-23 04:49:41 +00:00
Craig Topper 35189d5221 [SelectionDAG] Make ISD::isConstantSplatVector always return an element sized APInt.
This partially reverts r311429 in favor of making ISD::isConstantSplatVector do something less confusing. It turns out the only other user of it was also having to deal with the weird property of it returning a smaller size.

So rather than continue to deal with this quirk everywhere, just make the interface do something sane.

Differential Revision: https://reviews.llvm.org/D37039

llvm-svn: 311510
2017-08-22 23:54:13 +00:00
Craig Topper b49f0893b2 [X86] Prevent several calls to ISD::isConstantSplatVector from returning a narrower APInt than the original scalar type
ISD::isConstantSplatVector can shrink to the smallest splat width. But we don't check the size of the resulting APInt at all. This can cause us to misinterpret the results.

This patch just adds a flag to prevent the APInt from changing width.

Fixes PR34271.

Differential Revision: https://reviews.llvm.org/D36996

llvm-svn: 311429
2017-08-22 05:40:17 +00:00
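
To make the pitfall concrete, a hedged llvm::APInt illustration (mine, not from the patch):

```
#include "llvm/ADT/APInt.h"

using llvm::APInt;

// A v4i32 all-ones splat can be reported as the 1-bit APInt "1";
// reinterpreted at 32 bits that is the value 1, not -1.
bool isAllOnesAtWidth(const APInt &Splat, unsigned EltBits) {
  // Check the width explicitly instead of trusting isAllOnesValue()
  // on an APInt of unknown (possibly shrunken) width.
  return Splat.getBitWidth() == EltBits && Splat.isAllOnesValue();
}
```
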
Craig Topper 8078dd2984 [X86] When selecting sse_load_f32/f64 pattern, make sure there's only one use of every node all the way back to the root of the match
Summary: With masked operations, it's possible for the operation node like fadd, fsub, etc. to be used by multiple different vselects. Since the pattern matching will start at the vselect, we need to make sure the operation node itself is only used once before we can fold a load. Otherwise we'll end up folding the same load into multiple instructions.

Reviewers: RKSimon, spatel, zvi, igorb

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36938

llvm-svn: 311342
2017-08-21 16:04:04 +00:00
Igor Breger 685889cf9b [GlobalISel][X86] Support G_BRCOND operation.
Summary: Support G_BRCOND operation. For now don't try to fold cmp/trunc instructions.

Reviewers: zvi, guyblank

Reviewed By: guyblank

Subscribers: rovka, llvm-commits, kristof.beyls

Differential Revision: https://reviews.llvm.org/D34754

llvm-svn: 311327
2017-08-21 10:51:54 +00:00
Igor Breger 03c2208d5f [GlobalISel][X86] InstructionSelector, for now use the fallback path for LOAD_STACK_GUARD and PHI nodes.
llvm-svn: 311323
2017-08-21 09:17:28 +00:00
Igor Breger 1b5e3d3e28 [GlobalISel][X86] LowerCall, for now don't handle ByValue function arguments.
llvm-svn: 311321
2017-08-21 08:59:59 +00:00
Chandler Carruth 98c51cbee1 [x86] Teach the "generic" x86 CPU to avoid patterns that are slow on
widely used processors.

This occurred to me when I saw that we were generating 'inc' and 'dec'
when we shouldn't for Haswell and newer. However, there were a few "X is
slow" things that we should probably just set.

I've avoided any of the "X is fast" features because most of those would
be pretty serious regressions on processors where X isn't actually fast.
The slow things are likely to be negligible costs on processors where
these aren't slow and a significant win when they are slow.

In retrospect this seems somewhat obvious. Not sure why we didn't do
this a long time ago.

Differential Revision: https://reviews.llvm.org/D36947

llvm-svn: 311318
2017-08-21 08:45:22 +00:00
Chandler Carruth 63dd5e0ef6 [x86] Handle more cases where we can re-use an atomic operation's flags
rather than doing a separate comparison.

This both saves an explicit comparison and avoids the use of `xadd`
which introduces register constraints and other challenges to the
generated code.

The motivating case is from atomic reference counts where `1` is the
sentinel rather than `0` for whatever reason. This can and should be
lowered efficiently on x86 by just using a different flag, however the
x86 code only handled the `0` case.

There remains some further opportunities here that are currently hidden
due to canonicalization. I've included test cases that show these and
FIXMEs. However, I don't at the moment have any production use cases and
they seem substantially harder to address.

Differential Revision: https://reviews.llvm.org/D36945

llvm-svn: 311317
2017-08-21 08:45:19 +00:00
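
An illustrative refcount in C++ (my example, not from the patch) where `1` is the sentinel; with this change the flags of the locked subtract can feed the branch directly rather than going through `xadd` plus a separate compare:

```
#include <atomic>

// Drop a reference; destroy the object when the last one is released.
void release(std::atomic<long> &refcount, void (*destroy)(void *), void *obj) {
  // fetch_sub returns the value *before* the decrement, so seeing the
  // sentinel 1 means we just released the final reference.
  if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
    destroy(obj);
}
```
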
Coby Tayree c54c5cbe67 [X86] Allow xacquire/xrelease prefixes
Allow these prefixes in assembly code
Differential Revision: https://reviews.llvm.org/D36845

llvm-svn: 311309
2017-08-21 07:50:15 +00:00
Craig Topper d6f4be97e6 [AVX-512] Don't change which instructions we use for unmasked subvector broadcasts when AVX512DQ is enabled.
There's no functional difference between the AVX512DQ instructions if we're not masking.

This change unifies test checks and removes extra isel entries. Something similar was done for subvector inserts and extracts recently.

llvm-svn: 311308
2017-08-21 05:29:02 +00:00
Craig Topper 702097dafc [AVX-512] Use a scalar load pattern for FPCLASSSS/FPCLASSSD patterns.
llvm-svn: 311297
2017-08-20 18:30:24 +00:00
Benjamin Kramer 49a49fe816 Move helper classes into anonymous namespaces.
No functionality change intended.

llvm-svn: 311288
2017-08-20 13:03:48 +00:00
Elena Demikhovsky f58f838495 Changed basic cost of store operation on X86
A store operation takes 2 uops on X86 processors. The exact cost calculation affects several optimization passes, including loop unrolling.
This change compensates performance degradation caused by https://reviews.llvm.org/D34458 and shows improvements on some benchmarks.

Differential Revision: https://reviews.llvm.org/D35888

llvm-svn: 311285
2017-08-20 12:34:29 +00:00
Igor Breger 88a3d5c855 [GlobalISel][X86] Support call ABI.
Summary: Support the call ABI. For now only the Linux C and X86_64_SysV calling conventions are supported. Variadic functions are not supported.

Reviewers: zvi, guyblank, oren_ben_simhon

Reviewed By: oren_ben_simhon

Subscribers: rovka, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D34602

llvm-svn: 311279
2017-08-20 09:25:22 +00:00
Igor Breger b3a860a5e8 [GlobalISel][X86] Support asymmetric copies from/to GPR physical registers.
Usually this case is generated by ABI lowering, and it requires performing a truncate/anyext.

llvm-svn: 311278
2017-08-20 07:14:40 +00:00
Chandler Carruth 9ef881efab [x86] Fix an even stranger corner case where we have multiple levels of
cmov self-referencing.

Pointed out by Amjad Aboud in code review; test case slightly simplified
from the one he posted.

llvm-svn: 311267
2017-08-19 23:35:50 +00:00
Craig Topper 4de6f583da [X86] Merge all of the vecload and alignedload predicates into single predicates.
We can load the memory VT and check for natural alignment. This also adds a new preferNonTemporalLoad helper that checks the correct subtarget feature based on the load size.

This shrinks the isel table by at least 5000 bytes by allowing more reordering and combining to occur.

llvm-svn: 311266
2017-08-19 23:21:22 +00:00
Craig Topper afa69eecbb [X86] Converge alignedstore/alignedstore256/alignedstore512 to a single predicate.
We can read the memoryVT and get its store size directly from the SDNode to check its alignment.

llvm-svn: 311265
2017-08-19 23:21:21 +00:00
Craig Topper a0319bb434 [AVX512] Use alignedstore256 in a pattern that's emitting a 256-bit movaps from an extract subvector operation.
llvm-svn: 311263
2017-08-19 22:02:02 +00:00
Craig Topper 6e70f7cd33 [X86] Remove an unnecessary alignment restriction from MOVDDUP pattern.
The SSE MOVDDUP instruction only loads 64-bits with no alignment restriction.

llvm-svn: 311253
2017-08-19 18:02:28 +00:00
Chandler Carruth 93a645525c [x86] Teach the cmov converter to aggressively convert cmovs with memory
operands into control flow.

We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.

Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.

Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles       Throughput Bottleneck: Port1

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     | CP | mov rcx, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.1       | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.8       | 0.6 |           |           |     | 0.6 | CP | cmovbe rax, rdi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jb 0xf
Total Num Of Uops: 9
```
After, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: Port5

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.5       | 0.5 | 1.0   1.0 |           |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     | 1.0 | CP | jnbe 0x39
|   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jnb 0x3c
Total Num Of Uops: 7
```
The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rcx, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.0       | 1.0 |           |           |     |     | 1.0 |     |    | cmovbe rax, rdi
|   2^   | 0.5       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.5 |     |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xf
Total Num Of Uops: 8
```
After, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           | 1.0 |           |           |     |     |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     |     | 1.0 |     |    | jnbe 0x39
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0x3c
Total Num Of Uops: 6
```

Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.

I have run a suite of internal benchmarks with this change. I saw a few
very significant improvements and a very few minor regressions,
but overall this change rarely has a significant effect. However, the
improvements were very significant, and in quite important routines
responsible for a great deal of our C++ CPU cycles. The gains pretty
clearly outweigh the regressions for us.

I also ran the test-suite and SPEC2006. Only 11 binaries changed at all
and none of them showed any regressions.

Amjad Aboud at Intel also ran this over their benchmarks and saw no
regressions.

Differential Revision: https://reviews.llvm.org/D36858

llvm-svn: 311226
2017-08-19 05:01:19 +00:00
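
For a minimal C++ shape of the pattern being converted (my example): a load feeding a cmov. As a cmov, the result is data-dependent on the load completing; as a branch, the predictor lets later work proceed speculatively:

```
#include <cstdint>

uint64_t pickOrZero(const uint64_t *p, bool take) {
  // cmov form: the load of *p must complete before the select resolves.
  // branch form: a well-predicted branch hides the load latency.
  return take ? *p : 0;
}
```
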
Chandler Carruth e3b3547e9f [x86] Refactor the CMOV conversion pass to be more flexible.
The primary thing that this accomplishes is to allow future re-use of
these routines in more contexts and clarify the behavior w.r.t. loops.
For example, if handling outer loops is desirable, doing so in
an inside-out order becomes straightforward because it walks the loop
nest itself (rather than walking the function's basic blocks) and
de-couples the CMOV rewriting from the loop structure as there isn't
actually anything loop-specific about this transformation.

This patch should be essentially a no-op. It potentially changes the
order in which we visit the inner loops, but otherwise should merely set
the stage for subsequent changes.

Differential Revision: https://reviews.llvm.org/D36783

llvm-svn: 311225
2017-08-19 04:28:20 +00:00
Craig Topper 1fae3ae6f0 [X86] Remove SSE/AVX patterns for AND/XOR/OR/ANDN that checked for the inputs being bitcasted from floating point types.
There's really no reason to do this; we should just let isel pick the integer version and let the execution dependency fixing pass take care of moving to FP if necessary.

It's not very reliable to look for bitcasts at the edges of patterns. If for some reason one input was bitcasted and the other wasn't, or if one was a v4f32 bitcast and one was a v2f64 bitcast, we would have fallen back to the integer pattern anyway.

llvm-svn: 311138
2017-08-17 23:20:57 +00:00
Craig Topper 3a622a14f9 [AVX512] Don't switch unmasked subvector insert/extract instructions when AVX512DQI is enabled.
There's no reason to switch instructions with and without DQI. It just creates extra isel patterns and test divergences.

There is however value in enabling the masked version of the instructions with DQI.

This required introducing some new multiclasses to enable this splitting.

Differential Revision: https://reviews.llvm.org/D36661

llvm-svn: 311091
2017-08-17 15:40:25 +00:00
Craig Topper 5960848060 [X86] Remove memopmmx pattern fragment
Summary: Just like the FIXME says, there is no alignment requirement for MMX.

Reviewers: RKSimon, zvi, igorb

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36815

llvm-svn: 311090
2017-08-17 15:25:05 +00:00
Amjad Aboud 19f15843ab [X86] Refactoring of X86TargetLowering::EmitLoweredSelect. NFC.
Authored by aivchenk
Differential Revision: https://reviews.llvm.org/D35685

llvm-svn: 311082
2017-08-17 12:12:30 +00:00
Craig Topper 2f9743d2ea [X86] Exchange the memory op predicate for PALIGNR/VPALIGNR. I accidentally swapped them.
llvm-svn: 311060
2017-08-17 02:34:35 +00:00
Craig Topper 5357526ce8 [X86] Cleanup multiclasses for SSE/AVX2 PALIGNR. Add missing load patterns.
We used to have a separate multiclass for AVX2 and SSE/AVX. Now we have one multiclass and pass the relevant differences.

We were also missing load patterns, though we had them for the AVX-512 version.

llvm-svn: 311059
2017-08-17 01:48:03 +00:00
Craig Topper bbe3e46bb9 [X86] Remove patterns for PALIGNR with non-vXi8 types.
llvm-svn: 311058
2017-08-17 01:48:00 +00:00
Craig Topper 42a535351e [X86] Put multiclass closer to its use and simplify slightly. NFC
llvm-svn: 311055
2017-08-16 23:38:25 +00:00
Craig Topper 9025579e8a [X86] Use a static array instead of a SmallVector for a small fixed size array. NFC
llvm-svn: 311054
2017-08-16 23:16:43 +00:00
Simon Pilgrim c63f93a197 [CostModel][X86][XOP] Improve costs for XOP shuffles
VPPERM/VPERMIL2PD/VPERMIL2PS all provide more effective 2-input shuffles than regular AVX instructions

llvm-svn: 311005
2017-08-16 13:50:20 +00:00
Quentin Colombet 61d71a138b Reapply "[GlobalISel] Remove the GISelAccessor API."
This reverts commit r310425, thus reapplying r310335 with a fix for link
issue of the AArch64 unittests on Linux bots when BUILD_SHARED_LIBS is ON.

Original commit message:
[GlobalISel] Remove the GISelAccessor API.

Its sole purpose was to avoid spreading around ifdefs related to
building global-isel. Since r309990, GlobalISel is not optional anymore,
thus, we can get rid of this mechanism all together.

NFC.

----
The fix for the link issue consists in adding the GlobalISel library in
the list of dependencies for the AArch64 unittests. This dependency
comes from the use of AArch64Subtarget, which needs to know how
to destroy the GISel-related APIs when it is itself destroyed.

Thanks to Bill Seurer and Ahmed Bougacha for helping me reproduce and
understand the problem.

llvm-svn: 310969
2017-08-15 22:31:51 +00:00
Sanjay Patel 92653865e6 [x86] fold the mask op on 8- and 16-bit rotates
Ref the post-commit thread for r310770:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170807/478507.html

The motivating cases as 'C' source examples can look like this:

unsigned char rotate_right_8(unsigned char v, int shift) {
  // shift &= 7;
  v = ( v >> shift ) | ( v << ( 8 - shift ) );
  return v;
}

https://godbolt.org/g/K6rc1A

Notice that the source doesn't contain UB-safe masked shift amounts, but instcombine created those 
in order to produce narrow rotate patterns. This should be the last step needed to resolve PR34046:
https://bugs.llvm.org/show_bug.cgi?id=34046

Differential Revision: https://reviews.llvm.org/D36644

llvm-svn: 310849
2017-08-14 15:55:43 +00:00
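
For comparison, a hedged reconstruction (mine) of the masked form instcombine hands to the backend; since rotating an 8-bit value is periodic mod 8, the masks do not change the result, and this fold lets the backend drop them when selecting a rotate instruction:

```
unsigned char rotate_right_8_masked(unsigned char v, int shift) {
  // The masks make the shift amounts UB-safe in IR; the rotate's value is
  // unchanged, so the backend can fold them away.
  return (unsigned char)((v >> (shift & 7)) | (v << ((8 - shift) & 7)));
}
```
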
Craig Topper 411f29de69 [X86] Fix a place that was mishandling X86ISD::UMUL.
According to X86ISelLowering.h, UMUL results are low, high, and flags. But this place was treating result 1 or 2 as flags.

Differential Revision: https://reviews.llvm.org/D36654

llvm-svn: 310846
2017-08-14 15:32:40 +00:00
Craig Topper c0471829b1 [X86] Remove flag setting ISD nodes from computeKnownBitsForTargetNode
Summary:
The flag result is an i32 type, but it's only really used for connectivity. I don't think anything even assumes a particular format. We don't ever do any real operations on it, so known bits don't help us optimize anything.

My main motivation is that the UMUL behavior is actually wrong. I was going to fix this in D36654, but then realized there was just no reason for it to be here.

Reviewers: RKSimon, zvi, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36657

llvm-svn: 310845
2017-08-14 15:28:49 +00:00
Craig Topper b9e3e111ad [AVX512] Make the itinerary parameter actually pass through the AVX512_maskable_common multiclass
Summary: This looks to have been disconnected about 3 years ago in r219358.

Reviewers: gadi.haber, RKSimon, zvi

Reviewed By: gadi.haber

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36658

llvm-svn: 310844
2017-08-14 15:28:48 +00:00
Craig Topper 2374de420b [AVX512] Remove leftover code for when i1 was a legal type from the fast isel load/store code.
Summary:
I don't think we need this code anymore. It only existed because i1 used to be legal.

There's probably more unneeded code in fast isel still.

Reviewers: guyblank, zvi

Reviewed By: guyblank

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36652

llvm-svn: 310843
2017-08-14 15:28:47 +00:00
Craig Topper 508aa97b25 [AVX-512] Add hasSideEffects = 0 to the 8-bit and 16-bit register broadcasts.
llvm-svn: 310813
2017-08-14 05:09:34 +00:00
Craig Topper ca98bb9c94 [X86] Remove unused argument from the vextract_for_size multiclass. NFC
llvm-svn: 310812
2017-08-14 05:09:33 +00:00
Craig Topper 2c8cb2fa87 [AVX512] Remove comment I should have removed in r310808. NFC
llvm-svn: 310811
2017-08-14 05:09:31 +00:00
Craig Topper aadec7078e [AVX512] Simplify the instruction defintion for VEXTRACT. NFCI
The comment about why we couldn't use avx512_maskable appears to have been incorrect.

llvm-svn: 310808
2017-08-14 01:53:10 +00:00
Craig Topper 5b59176abb [X86] Fix typo from r310794. Index = 0 should have been Index == 0.
llvm-svn: 310801
2017-08-13 20:21:12 +00:00
Craig Topper 2dfc7889fd [X86] Remove unused pattern fragment that referenced MVT::i1. NFC
llvm-svn: 310799
2017-08-13 20:04:05 +00:00
Craig Topper 43e3b788f4 [AVX512] Correct isExtractSubvectorCheap so that it will return the correct answers for extracting 128-bits from a 512-bit vector and for mask registers.
Previously it would not return true for extracting either of the upper quarters of a 512-bit register.

For mask registers we support extracting anything from index 0. And otherwise we only support extracting the upper half of a register.

Differential Revision: https://reviews.llvm.org/D36638

llvm-svn: 310794
2017-08-13 17:40:02 +00:00
Craig Topper 2251ef95a3 [X86][ARM][TargetLowering] Add SrcVT to isExtractSubvectorCheap
Summary:
Without the SrcVT it's hard to know what is really being asked for. For example, if your target has 128, 256, and 512 bit vectors. Maybe extracting 128 from 256 is cheap, but maybe extracting 128 from 512 is not.

For x86 we do support extracting a quarter of a 512-bit register. But for i1 vectors we don't have isel patterns for extracting arbitrary pieces. So we need this to have a correct implementation of isExtractSubvectorCheap for mask vectors.

Reviewers: RKSimon, zvi, efriedma

Reviewed By: RKSimon

Subscribers: aemerson, javed.absar, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D36649

llvm-svn: 310793
2017-08-13 17:29:07 +00:00
Gadi Haber bed2c50607 [X86][SandyBridge] Additional updates to the SNB instructions scheduling information
This is a continuation patch for commit r307529 which completely replaces the scheduling information for the SandyBridge architecture target by modifying the file X86SchedSandyBridge.td located under the X86 Target (see also https://reviews.llvm.org/D35019).

In this patch we added the scheduling information for additional SNB instructions that were missing from commit r307529, fixed the scheduling of several resource groups that included only port0 instead of port05 (i.e., port0 OR port5), and fixed the incorrect scheduling of several instructions from the r307529 commit.

The patch also includes the X87 instructions which were missing in previous patch commit r307529 as reported in bugzilla bug 34080.

Reviewers: zvi, RKSimon, chandlerc, igorb, m_zuckerman, craig.topper, aymanmus, dim

Differential Revision: https://reviews.llvm.org/D36388

llvm-svn: 310792
2017-08-13 13:59:24 +00:00
Coby Tayree 799fa2c76e [X86][AsmParser][AVX512] Error appropriately when K0 is tried as a write-mask
K0 isn't expected as a write-mask, so provide a detailed error here instead of the more generic one (invalid operand for instruction).
This conforms with gas.

Differential Revision: https://reviews.llvm.org/D36570

llvm-svn: 310789
2017-08-13 12:03:00 +00:00
Guy Blank de425ae753 [X86][AVX512] Add combine for TESTM
Add an X86 combine for TESTM when one of the operands is a BUILD_VECTOR(0,0,...).

TESTM op0, BUILD_VECTOR(0,0,...) -> BUILD_VECTOR(0,0,...)
TESTM BUILD_VECTOR(0,0,...), op1 -> BUILD_VECTOR(0,0,...)

Differential Revision: https://reviews.llvm.org/D36536

llvm-svn: 310787
2017-08-13 08:03:37 +00:00
Craig Topper 77dd140786 [X86] Early out of combineInsertSubvector for mask vectors.
The combines here shouldn't be done for mask vectors, but it wasn't clear anything was preventing that.

llvm-svn: 310786
2017-08-12 22:33:58 +00:00
Craig Topper dbca6d47f3 [X86] Fix bad comment. NFC
llvm-svn: 310785
2017-08-12 22:33:57 +00:00
Craig Topper 44cb1ffb6a [X86] When handling addcarry intrinsic, create the flag result with the correct type so we don't crash if we use a memory instruction
Summary:
Previously we were creating the flag result with MVT::Other, which is interpreted as a Chain node. If we used a memory form of the instruction we would end up with a copyToReg that consumed the chain result of the adcx instruction instead of the flag result.

Pretty sure we should be using MVT::i32 here; that's what we do in other places where we create these node types.

We should probably consider this for 5.0 as well.

Reviewers: RKSimon, zvi, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36645

llvm-svn: 310784
2017-08-12 20:19:44 +00:00
Craig Topper ac217b7aa3 [X86] Don't use fsin/fcos/fsincos instructions ever
Summary:
Previously we would use these instructions if sse was disabled and fastmath was enabled.

As mentioned in D28335, this is a bad idea.

Reviewers: efriedma, scanon, DavidKreitzer

Reviewed By: DavidKreitzer

Subscribers: zvi, llvm-commits

Differential Revision: https://reviews.llvm.org/D36344

llvm-svn: 310762
2017-08-11 20:55:29 +00:00
Rafael Espindola b8956a70d3 Fix access to undefined weak symbols in pic code
When the access to a weak symbol is not a call, the access has to be
able to produce the value 0 at runtime.

We were sometimes producing code sequences where that was not possible
if the code was loaded more than 4 GiB away from 0.

llvm-svn: 310756
2017-08-11 20:49:27 +00:00
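
An illustrative non-call access to an undefined weak symbol (my example, GCC/Clang syntax):

```
// Possibly left undefined at link time; the address computation must be
// able to produce a genuine 0 at runtime, even when the code is loaded
// more than 4 GiB away from address 0.
extern int optional_feature_flag __attribute__((weak));

int featureEnabled() {
  return &optional_feature_flag != nullptr ? optional_feature_flag : 0;
}
```
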
Craig Topper 561092f233 [AVX512] Remove and autoupgrade many of the broadcast intrinsics
Summary:
This autoupgrades most of the broadcast intrinsics. They've been unused in clang for some time.

This leaves the 32x2 intrinsics because they are still used in clang.

Reviewers: RKSimon, zvi, igorb

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36606

llvm-svn: 310725
2017-08-11 16:22:45 +00:00
Craig Topper 0f30fe9634 [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512
Summary:
This teaches 512-bit shuffles to detect unused halves in order to reduce shuffle size.

We may need to refine the 512-bit exit point. I couldn't remember if we had good cross-lane shuffles for 8/16-bit elements with AVX-512 or not.

I believe this is a step toward being able to handle D36454 without a special case.

From here we need to improve our ability to combine extract_subvector with insert_subvector and other extract_subvectors. And we need to support narrowing binary operations where we don't demand all elements. This may be improvements to DAGCombiner::narrowExtractedVectorBinOp (by recognizing an insert_subvector in addition to concat), or we may need a target-specific combiner.

Reviewers: RKSimon, zvi, delena, jbhateja

Reviewed By: RKSimon, jbhateja

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36601

llvm-svn: 310724
2017-08-11 16:20:05 +00:00
Sanjay Patel 169dae70a6 [x86] use more shift or LEA for select-of-constants (2nd try)
The previous rev (r310208) failed to account for overflow when subtracting the
constants to see if they're suitable for shift/lea. This version adds a check
for that, and more tests were added in r310490.

We can convert any select-of-constants to math ops:
http://rise4fun.com/Alive/d7d

For this patch, I'm enhancing an existing x86 transform that uses fake multiplies
(they always become shl/lea) to avoid cmov or branching. The current code misses
cases where we have a negative constant and a positive constant, so this is just
trying to plug that hole.

The DAGCombiner diff prevents us from hitting a terrible inefficiency: we can start
with a select in IR, create a select DAG node, convert it into a sext, convert it
back into a select, and then lower it to sext machine code.

Some notes about the test diffs:

1. 2010-08-04-MaskedSignedCompare.ll - We were creating control flow that didn't exist in the IR.
2. memcmp.ll - Choosing -1 or 1 is the case that got me looking at this again. We could avoid the
   push/pop in some cases if we used 'movzbl %al' instead of an xor on a different reg? That's a
   post-DAG problem though.
3. mul-constant-result.ll - The trade-off between sbb+not vs. setne+neg could be addressed if
   that's a regression, but those would always be nearly equivalent.
4. pr22338.ll and sext-i1.ll - These tests have undef operands, so we don't actually care about these diffs.
5. sbb.ll - This shows a win for what is likely a common case: choosing -1 or 0.
6. select.ll - There's another borderline case here: cmp+sbb+or vs. test+set+lea? Also, sbb+not vs. setae+neg shows up again.
7. select_const.ll - These are motivating cases for the enhancement; replace cmov with cheaper ops.

Assembly differences between movzbl and xor to avoid a partial reg stall are caused later by the X86 Fixup SetCC pass.

Differential Revision: https://reviews.llvm.org/D35340

llvm-svn: 310717
2017-08-11 15:44:14 +00:00
Nirav Dave d1b3f09faa [X86][DAG] Switch X86 Target to post-legalized store merge
Move store merge to happen after intrinsic lowering to allow lowered
stores to be merged.

Some regressions in MergeConsecutiveStores, due to missing
insert_subvector handling, are addressed in a follow-up patch.

Reviewers: craig.topper, efriedma, RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D34559

llvm-svn: 310710
2017-08-11 13:21:35 +00:00
Simon Pilgrim b59c2d9d73 [CostModel][X86] Add SSE2 two-src shuffle costs
llvm-svn: 310654
2017-08-10 19:32:35 +00:00