Commit Graph

10081 Commits

Author SHA1 Message Date
Chandler Carruth 46259260c7 [x86] NFC - normalize test case formatting of IR and generate CHECK
lines with the script rather than using manually written checks.

llvm-svn: 311753
2017-08-25 02:32:51 +00:00
Craig Topper 355d8cff49 [X86] Add TBM instructions to X86InstrInfo::isDefConvertible.
This allows us to remove "test" instructions and use the flags from the TBM instructions directly.
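
A hypothetical illustration (not from the original commit), assuming BLCFILL computes x & (x + 1):

```
// Hypothetical example: with TBM, `x & (x + 1)` can lower to BLCFILL, which
// sets EFLAGS itself. Once BLCFILL is in isDefConvertible, the zero test
// below no longer needs a separate `test` instruction.
bool blcfill_is_zero(unsigned x) {
  return (x & (x + 1)) == 0;
}
```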

llvm-svn: 311747
2017-08-25 01:59:06 +00:00
Chandler Carruth 5b491808f5 [x86] Back out one aspect of r311318: don't generically set
FeatureSlowUAMem32.

The idea was to mark things that are slow on widely available processors
as slow in the generic CPU so that the code generated for that CPU would
be fast across those processors. However, for this feature that doesn't
work out very well at all.

The problem here is that you can very easily enable AVX or AVX2 on top
of this generic CPU. For example, this can happen just by using AVX2
intrinsics from Clang within a region of code guarded by a dynamic CPU
feature test. When you do that, the generated code with SlowUAMem32 set
is ... amazingly slower. The problem is that there really aren't very
good alternatives to the unaligned loads, and so our vector codegen
regresses significantly.
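
For illustration only (names and structure are hypothetical, not from the commit), this is the kind of runtime-dispatched AVX2 code described above: compiled in a translation unit targeting the generic CPU, but using 32-byte unaligned loads once the dynamic feature check passes.

```
#include <immintrin.h>
#include <cstddef>

// Hypothetical sketch: AVX2 intrinsics inside a region guarded by a dynamic
// CPU-feature test. The translation unit targets the generic CPU, so a
// "slow unaligned 32-byte memory" tuning flag on that CPU would also shape
// the code generated for this AVX2 path.
__attribute__((target("avx2")))
static void add8_avx2(float *dst, const float *a, const float *b, size_t n) {
  for (size_t i = 0; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);            // unaligned 32-byte load
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
  }
}

void add8(float *dst, const float *a, const float *b, size_t n) {
  if (__builtin_cpu_supports("avx2"))
    add8_avx2(dst, a, b, n);
  else
    for (size_t i = 0; i < n; ++i)                 // scalar fallback
      dst[i] = a[i] + b[i];
}
```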

The other issue is that there are plenty of AMD CPUs with AVX1 that
don't set FeatureSlowUAMem32 and so we shouldn't just check for AVX2
instead of this special feature. =/

It would be nice to have the target attribute logic be able to
enable/disable more than just one feature at a time and control this in
a more fine-grained and useful way, but that doesn't seem easy. Given
that it is only Sandybridge and Ivybridge that set this feature, for now
I'm just backing it out of the generic CPU. That has the additional
advantage of going back to the previous state that people seemed vaguely
happy with.

llvm-svn: 311740
2017-08-25 00:56:05 +00:00
Chandler Carruth 8ac488b161 [x86] Fix an amazing goof in the handling of sub, or, and xor lowering.
The comment for this code indicated that it should work similar to our
handling of add lowering above: if we see uses of an instruction other
than flag usage and store usage, it tries to avoid the specialized
X86ISD::* nodes that are designed for flag+op modeling and emits an
explicit test.
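
A hypothetical source-level example of the case being discussed (not from the commit), where the xor result feeds both a store and a zero test, so the flags from the xor itself can serve the comparison:

```
// Hypothetical example: the xor result is stored and also compared against
// zero. With the fix, the backend can keep the flag-producing X86ISD node
// and reuse the xor's flags instead of emitting a separate `test`.
bool store_and_check(int a, int b, int *out) {
  int x = a ^ b;
  *out = x;
  return x == 0;
}
```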

Problem is, only the add case actually did this. In all the other cases,
the logic was incomplete and inverted. Any time the value was used by
a store, we bailed on the specialized X86ISD node. All of this appears
to be historical baggage from when we had different logic here. =/

Turns out, we have quite a few patterns designed around these nodes. We
should actually form them. I fixed the code to match what we do for add,
and it has quite a positive effect just within some of our test cases.
The only thing close to a regression I see is using:

  notl %r
  testl %r, %r

instead of:

  xorl -1, %r

But we can add a pattern or something to fold that back out. The
improvements seem more than worth this.

I've also worked with Craig to update the comments to no longer be
actively contradicted by the code. =[ Some of this still remains
a mystery to both Craig and myself, but this seems like a large step in
the direction of consistency and slightly more accurate comments.

Many thanks to Craig for help figuring out this nasty stuff.

Differential Revision: https://reviews.llvm.org/D37096

llvm-svn: 311737
2017-08-25 00:34:07 +00:00
Sanjay Patel e404cbff66 [DAG] convert vector select-of-constants to logic/math
This goes back to a discussion about IR canonicalization. We'd like to preserve and convert
more IR to 'select' than we currently do because that's likely the best choice in IR:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105335.html
...but that's often not true for codegen, so we need to account for this pattern coming into
the backend and transform it to better DAG ops.

Steps in this patch:

  1. Add an EVT param to the existing convertSelectOfConstantsToMath() TLI hook to more finely
     enable this transform. Other targets will probably want that anyway to distinguish scalars
     from vectors. We're using that here to exclude AVX512 targets, but it may not be necessary.

  2. Convert a vselect to ext+add. This eliminates a constant load/materialization, and the
     vector ext is often free.

Implementing a more general fold using xor+and can be a follow-up for targets that don't have
a legal vselect. It's also possible that we can remove the TLI hook for the special case fold
implemented here because we're eliminating a constant, but it needs to be tested on other
targets.
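
A scalar analogue, purely for illustration (the actual transform operates on vselect DAG nodes): selecting between two constants becomes an extension of the condition plus arithmetic, so no constant vector needs to be materialized.

```
// Hypothetical scalar analogue of the vector transform:
//   select c, 43, 42   ==>   zext(c) + 42
// In the vector case this trades a constant-pool load for an (often free)
// vector extension of the condition mask plus an add.
int select_consts(bool c) {
  return c ? 43 : 42;      // equivalent to static_cast<int>(c) + 42
}
```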

Differential Revision: https://reviews.llvm.org/D36840

llvm-svn: 311731
2017-08-24 23:24:43 +00:00
Michael Zuckerman 9ee61d9b00 Adding base lit test for x86interleaved
llvm-svn: 311658
2017-08-24 14:11:28 +00:00
Chandler Carruth dc2556934c [x86] NFC: Clean up two tests and generate precise checks for them.
Mostly this involved giving unnamed values names and running the IR
through `opt` to re-format it, while merging in any important comments
from the original. I then deleted pointless comments and inlined the
function attributes for ease of reading and editing.

All of this is to make it much easier to see the instructions being
generated here and evaluate any updates to the tests.

llvm-svn: 311634
2017-08-24 07:38:36 +00:00
Igor Breger 47be5fbbe9 [GlobalISel][X86] Support G_IMPLICIT_DEF.
Summary: Support G_IMPLICIT_DEF.

Reviewers: zvi, guyblank, t.p.northover

Reviewed By: guyblank

Subscribers: rovka, llvm-commits, kristof.beyls

Differential Revision: https://reviews.llvm.org/D36733

llvm-svn: 311633
2017-08-24 07:06:27 +00:00
Wei Ding a131d3fb29 Add 'llvm.experimental.constrained.fma' Intrinsic.
Differential Revision: http://reviews.llvm.org/D36335

llvm-svn: 311629
2017-08-24 04:18:24 +00:00
Hans Wennborg c39ec95d88 [DAG] Fix Node Replacement in PromoteIntBinOp
When one operand is a user of another in a promoted binary operation,
we may replace and delete the returned value before returning,
triggering an assertion. Reorder node replacements to prevent this.

Fixes PR34137.

Landing on behalf of Nirav.

Differential Revision: https://reviews.llvm.org/D36581

llvm-svn: 311623
2017-08-24 01:08:27 +00:00
Reid Kleckner 6d353348e5 Parse and print DIExpressions inline to ease IR and MIR testing
Summary:
Most DIExpressions are empty or very simple. When they are complex, they
tend to be unique, so checking them inline is reasonable.

This also avoids the need for CodeGen passes to append to the
llvm.dbg.mir named md node.

See also PR22780, for making DIExpression not be an MDNode.

Reviewers: aprantl, dexonsmith, dblaikie

Subscribers: qcolombet, javed.absar, eraman, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D37075

llvm-svn: 311594
2017-08-23 20:31:27 +00:00
Craig Topper 853a8d9ffc [AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors
There are no 512-bit blend instructions so we shouldn't create SHRUNKBLEND for them.

On a side note, it looks like there may be a missed opportunity for constant folding TESTM when LHS and RHS are equal.

This fixes PR34139.

Differential Revision: https://reviews.llvm.org/D36992

llvm-svn: 311572
2017-08-23 16:41:02 +00:00
Dean Michael Berris 0884b73220 [XRay][CodeGen] Use PIC-friendly code in XRay sleds; remove synthetic references in .text
Summary:
This change achieves two things:

  - Redefine the Custom Event handling instrumentation points emitted by
    the compiler to not require dynamic relocation of references to the
    __xray_CustomEvent trampoline.

  - Remove the synthetic reference we emit at the end of a function that
    we used to keep auxiliary sections alive in favour of SHF_LINK_ORDER
    associated with the section where the function is defined.

To achieve the custom event handling change, we've had to introduce the
concept of sled versioning -- this will need to be supported by the
runtime to allow us to understand how to turn on/off the new version of
the custom event handling sleds. That change has to land first before we
change the way we write the sleds.

To remove the synthetic reference, we rely on a relatively new linker
feature that preserves the sections that are associated with each other.
This allows us to limit the effects on the .text section of ELF
binaries.

Because we're still using absolute references that are resolved at
runtime for the instrumentation map (and function index) maps, we mark
these sections writable. In the future we can re-define the entries in
the map to use relative relocations instead that can be statically
determined by the linker. That change will be a bit more invasive so we
defer this for later.

Depends on D36816.

Reviewers: dblaikie, echristo, pcc

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36615

llvm-svn: 311525
2017-08-23 04:49:41 +00:00
Matthias Braun d6c0868da5 Fix tail-merge-after-mbp test
The output of this test changed after the fix in r311520 to have
-run-pass=block-placement behave like it does in a normal pipeline.
Adjust the test.

llvm-svn: 311521
2017-08-23 03:49:53 +00:00
Sanjay Patel 0ab50f6d68 [x86] auto-generate full checks; NFC
I don't see anything Darwin-specific here, so I made the target generic x86-64.

llvm-svn: 311465
2017-08-22 16:27:00 +00:00
Sanjay Patel 40b8e3bfe5 [x86] simplify runs and auto-generate full checks
I've replaced the two OS-specific runs with a generic run because
there's no functional difference in the resulting output that
we're checking. Also, the script still doesn't work with a Win
target.

llvm-svn: 311463
2017-08-22 16:21:45 +00:00
Craig Topper b49f0893b2 [X86] Prevent several calls to ISD::isConstantSplatVector from returning a narrower APInt than the original scalar type
ISD::isConstantSplatVector can shrink the returned splat value to the smallest splat width, but we don't check the size of the resulting APInt at all. This can cause us to misinterpret the results.

This patch just adds a flag to prevent the APInt from changing width.
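
To illustrate the hazard with a contrived, hypothetical example: once the splat value has been shrunk to a narrower width, widening its raw bits back without sign-extending misreads a splat of -1 as +1.

```
#include <cassert>
#include <cstdint>

int main() {
  // Hypothetical illustration: a v4i32 splat of -1 can be reported as a
  // 1-bit splat with value 1. Zero-extending that narrow value back to 32
  // bits gives +1, not the -1 the vector actually contains.
  bool narrow_splat_bit = true;                        // the shrunk 1-bit value
  uint32_t zero_extended = narrow_splat_bit ? 1u : 0u; // misread as +1
  int32_t sign_extended = narrow_splat_bit ? -1 : 0;   // the real element
  assert(zero_extended != static_cast<uint32_t>(sign_extended));
  return 0;
}
```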

Fixes PR34271.

Differential Revision: https://reviews.llvm.org/D36996

llvm-svn: 311429
2017-08-22 05:40:17 +00:00
Craig Topper 8078dd2984 [X86] When selecting sse_load_f32/f64 pattern, make sure there's only one use of every node all the way back to the root of the match
Summary: With masked operations, it's possible for an operation node like fadd, fsub, etc. to be used by multiple different vselects. Since the pattern matching will start at the vselect, we need to make sure the operation node itself is only used once before we can fold a load. Otherwise we'll end up folding the same load into multiple instructions.

Reviewers: RKSimon, spatel, zvi, igorb

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D36938

llvm-svn: 311342
2017-08-21 16:04:04 +00:00
Igor Breger 685889cf9b [GlobalISel][X86] Support G_BRCOND operation.
Summary: Support G_BRCOND operation. For now don't try to fold cmp/trunc instructions.

Reviewers: zvi, guyblank

Reviewed By: guyblank

Subscribers: rovka, llvm-commits, kristof.beyls

Differential Revision: https://reviews.llvm.org/D34754

llvm-svn: 311327
2017-08-21 10:51:54 +00:00
Igor Breger 1b5e3d3e28 [GlobalISel][X86] LowerCall, for now don't handle ByValue function arguments.
llvm-svn: 311321
2017-08-21 08:59:59 +00:00
Michael Zuckerman bdb6673151 [InterLeaved] Adding lit test for future work: interleaved load stride 3
llvm-svn: 311320
2017-08-21 08:56:39 +00:00
Chandler Carruth 98c51cbee1 [x86] Teach the "generic" x86 CPU to avoid patterns that are slow on
widely used processors.

This occurred to me when I saw that we were generating 'inc' and 'dec'
when, for Haswell and newer, we shouldn't. However, there were a few "X is
slow" things that we should probably just set.

I've avoided any of the "X is fast" features because most of those would
be pretty serious regressions on processors where X isn't actually fast.
The slow things are likely to be negligible costs on processors where
these aren't slow and a significant win when they are slow.

In retrospect this seems somewhat obvious. Not sure why we didn't do
this a long time ago.

Differential Revision: https://reviews.llvm.org/D36947

llvm-svn: 311318
2017-08-21 08:45:22 +00:00
Chandler Carruth 63dd5e0ef6 [x86] Handle more cases where we can re-use an atomic operation's flags
rather than doing a separate comparison.

This both saves an explicit comparison and avoids the use of `xadd`
which introduces register constraints and other challenges to the
generated code.

The motivating case is from atomic reference counts where `1` is the
sentinel rather than `0` for whatever reason. This can and should be
lowered efficiently on x86 by just using a different flag, however the
x86 code only handled the `0` case.
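
One plausible shape of the motivating pattern, sketched hypothetically (not taken from the commit): a reference count released by comparing the pre-decrement value against the sentinel 1, where the flags from the locked decrement can feed the branch.

```
#include <atomic>

// Hypothetical sketch: `fetch_sub` returns the previous value, so the sentinel
// here is 1 rather than 0. On x86 this can be lowered so the flags produced by
// the atomic decrement feed the branch directly, with no separate compare.
void release(std::atomic<long> &refcount, void (*destroy)(void *), void *obj) {
  if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
    destroy(obj);
}
```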

There remains some further opportunities here that are currently hidden
due to canonicalization. I've included test cases that show these and
FIXMEs. However, I don't at the moment have any production use cases and
they seem substantially harder to address.

Differential Revision: https://reviews.llvm.org/D36945

llvm-svn: 311317
2017-08-21 08:45:19 +00:00
Craig Topper d6f4be97e6 [AVX-512] Don't change which instructions we use for unmasked subvector broadcasts when AVX512DQ is enabled.
There's no functional difference between the AVX512DQ instructions if we're not masking.

This change unifies test checks and removes extra isel entries. Similar was done for subvector insert and extracts recently.

llvm-svn: 311308
2017-08-21 05:29:02 +00:00
Craig Topper 485cca1ecb [AVX512] Add 128->256 vbroadcastf64x2/vbroadcasti64x2 instructions to the EVEX->VEX table.
llvm-svn: 311307
2017-08-21 05:03:28 +00:00
Craig Topper d63b33f9c4 [AVX512] Add a test to check what happens when a load is referenced by two different masked scalar intrinsics with the same op inputs, but different masking node.
We're missing some single use checks in the sse_load_f32/f64 handling that cause us to replicate the load.

llvm-svn: 311300
2017-08-20 19:47:00 +00:00
Igor Breger 88a3d5c855 [GlobalISel][X86] Support call ABI.
Summary: Support call ABI. For now only the Linux C and X86_64_SysV calling conventions are supported. Variadic functions are not supported.

Reviewers: zvi, guyblank, oren_ben_simhon

Reviewed By: oren_ben_simhon

Subscribers: rovka, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D34602

llvm-svn: 311279
2017-08-20 09:25:22 +00:00
Igor Breger b3a860a5e8 [GlobalISel][X86] Support asymmetric copy from/to GPR physical registers.
Usually this case is generated by ABI lowering; it requires performing truncate/anyext.

llvm-svn: 311278
2017-08-20 07:14:40 +00:00
Chandler Carruth 9ef881efab [x86] Fix an even stranger corner case where we have multiple levels of
cmov self-referencing.

Pointed out by Amjad Aboud in code review; the test case is slightly
simplified from the one he posted.

llvm-svn: 311267
2017-08-19 23:35:50 +00:00
Craig Topper a0319bb434 [AVX512] Use alignedstore256 in a pattern that's emitting a 256-bit movaps from an extract subvector operation.
llvm-svn: 311263
2017-08-19 22:02:02 +00:00
Jatin Bhateja 6b4c205685 [DAGCombiner] Extending pattern detection for vector shuffle.
Summary:
    If all the operands of a BUILD_VECTOR extract elements from the same vector, then split the
    vector efficiently based on the maximum vector access index.

    Reviewers: zvi, delena, RKSimon, thakis

    Reviewed By: RKSimon

    Subscribers: chandlerc, eladcohen, llvm-commits

    Differential Revision: https://reviews.llvm.org/D35788

llvm-svn: 311255
2017-08-19 18:08:59 +00:00
Jatin Bhateja 66f7958e91 Revert rL311247 : To rectify commit message.
Summary: This reverts commit rL311247.

Differential Revision: https://reviews.llvm.org/D36927

llvm-svn: 311252
2017-08-19 17:59:58 +00:00
Jatin Bhateja 6f0d0d23b0 Merge branch 'arcpatch-D35788'
llvm-svn: 311247
2017-08-19 17:00:04 +00:00
Jatin Bhateja 1c56863739 Revert rL311242 "Extension of shuffle vector pattern detection, updating post rebase."
Summary:

This reverts commit rL311242.

Differential Revision: https://reviews.llvm.org/D36924

llvm-svn: 311246
2017-08-19 16:40:06 +00:00
Jatin Bhateja 313f97dd84 Extension of shuffle vector pattern detection, updating post rebase.
llvm-svn: 311242
2017-08-19 15:58:36 +00:00
Chandler Carruth 93a645525c [x86] Teach the cmov converter to aggressively convert cmovs with memory
operands into control flow.

We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.

Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.
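
A hypothetical source pattern of the kind the converter targets, where one cmov operand is a load (compare the `cmovbe rax, rdi` fed by `mov rax, qword ptr [rsi]` in the IACA dump below):

```
// Hypothetical example: one arm of the select comes from memory. As a cmov,
// the result cannot be produced until the load completes; as a branch, the
// predicted path can continue speculatively past the load.
long pick(bool in_bounds, long fallback, const long *p) {
  long loaded = *p;
  return in_bounds ? loaded : fallback;
}
```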

Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles       Throughput Bottleneck: Port1

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     | CP | mov rcx, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.1       | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.8       | 0.6 |           |           |     | 0.6 | CP | cmovbe rax, rdi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jb 0xf
Total Num Of Uops: 9
```
After, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: Port5

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.5       | 0.5 | 1.0   1.0 |           |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     | 1.0 | CP | jnbe 0x39
|   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jnb 0x3c
Total Num Of Uops: 7
```
The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rcx, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.0       | 1.0 |           |           |     |     | 1.0 |     |    | cmovbe rax, rdi
|   2^   | 0.5       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.5 |     |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xf
Total Num Of Uops: 8
```
After, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           | 1.0 |           |           |     |     |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     |     | 1.0 |     |    | jnbe 0x39
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0x3c
Total Num Of Uops: 6
```

Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.

I have run a suite of internal benchmarks with this change. I saw a few
very significant improvements and a very few minor regressions,
but overall this change rarely has a significant effect. However, the
improvements were very significant, and in quite important routines
responsible for a great deal of our C++ CPU cycles. The gains pretty
clearly outweigh the regressions for us.

I also ran the test-suite and SPEC2006. Only 11 binaries changed at all
and none of them showed any regressions.

Amjad Aboud at Intel also ran this over their benchmarks and saw no
regressions.

Differential Revision: https://reviews.llvm.org/D36858

llvm-svn: 311226
2017-08-19 05:01:19 +00:00
Simon Pilgrim f36cca88fb [X86][ADX] Regenerate ADX intrinsics tests
llvm-svn: 311198
2017-08-18 21:21:14 +00:00
Simon Pilgrim 879ce046ad [X86][BMI2] Added scheduling test for RORX/SARX/SHLX/SHRX instructions
llvm-svn: 311171
2017-08-18 16:26:39 +00:00
Simon Pilgrim 358aeae7b8 [X86][AES] Add scheduling latency/throughput tests for AES instructions
llvm-svn: 311167
2017-08-18 15:26:51 +00:00
Simon Pilgrim 9eb0869e91 [X86][PCLMUL] Add scheduling latency/throughput test for PCLMULQDQ instruction
Added it to the SSE42 tests, as targets seem to always have both.

llvm-svn: 311166
2017-08-18 15:08:30 +00:00
Simon Pilgrim ccaec26175 [X86][SHA] Add scheduling latency/throughput tests for SHA instructions
llvm-svn: 311164
2017-08-18 14:55:50 +00:00
Simon Pilgrim 7f506f7d72 [X86][MOVBE] Add scheduling latency/throughput tests for MOVBE instructions
llvm-svn: 311163
2017-08-18 14:44:31 +00:00
Simon Pilgrim 320f89782a [X86][BMI2] Added scheduling test for MULX instructions
llvm-svn: 311159
2017-08-18 13:22:18 +00:00
Geoff Berry bd47e8a4f7 Revert "[MachineCopyPropagation] Extend pass to do COPY source forwarding" round 2
This reverts commit r311135.

sanitizer-x86_64-linux-android buildbot is timing out with just this
patch applied.

llvm-svn: 311142
2017-08-18 01:43:11 +00:00
Richard Smith c0541dfa3e Increase tail dup threshold for -O3 from 3 to 4.
We see a modest performance improvement from this slightly higher tail dup threshold.

Differential Revision: https://reviews.llvm.org/D36775

llvm-svn: 311139
2017-08-17 23:38:41 +00:00
Craig Topper 1fae3ae6f0 [X86] Remove SSE/AVX patterns for AND/XOR/OR/ANDN that checked for the inputs being bitcasted from floating point types.
There's really no reason to do this; we should just let isel pick the integer version and let the execution-dependency fixing pass take care of moving to FP if necessary.

It's not very reliable to look for bitcasts at the edges of patterns. If for some reason one input was bitcasted and the other wasn't, or if one was a v4f32 bitcast and one was a v2f64 bitcast, we would have fallen back to the integer pattern anyway.

llvm-svn: 311138
2017-08-17 23:20:57 +00:00
Geoff Berry 51f52c4fca Re-enable "[MachineCopyPropagation] Extend pass to do COPY source forwarding"
Two issues identified by buildbots were addressed:
    - The pass no longer forwards COPYs to physical register uses, since
      doing so can break code that implicitly relies on the physical
      register number of the use.
    - The pass no longer forwards COPYs to undef uses, since doing so
      can break the machine verifier by creating LiveRanges that don't
      end on a use (since the undef operand is not considered a use).

    [MachineCopyPropagation] Extend pass to do COPY source forwarding

    This change extends MachineCopyPropagation to do COPY source forwarding.

    This change also extends the MachineCopyPropagation pass to be able to
    be run during register allocation, after physical registers have been
    assigned, but before the virtual registers have been re-written, which
    allows it to remove virtual register COPY LiveIntervals that become dead
    through the forwarding of all of their uses.
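
As a rough, hypothetical sketch of what "COPY source forwarding" means (a toy model only; the real pass must also respect register classes, sub-registers, implicit operands, and the physical-register and undef restrictions noted above):

```
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of COPY source forwarding over a straight-line block, assuming
// registers are plain names. While `%b = COPY %a` is still valid, later uses
// of %b are rewritten to use %a, which can leave the COPY dead.
struct Inst {
  std::string def;                // defined register ("" if none)
  std::vector<std::string> uses;  // used registers
  bool isCopy = false;            // true if this is `def = COPY uses[0]`
};

void forwardCopies(std::vector<Inst> &block) {
  std::unordered_map<std::string, std::string> copyOf;  // dest -> source
  for (Inst &inst : block) {
    // Rewrite uses through any still-valid earlier copy.
    for (std::string &use : inst.uses) {
      auto it = copyOf.find(use);
      if (it != copyOf.end())
        use = it->second;
    }
    // A new definition invalidates copies reading or writing that register.
    if (!inst.def.empty()) {
      copyOf.erase(inst.def);
      for (auto it = copyOf.begin(); it != copyOf.end();)
        it = (it->second == inst.def) ? copyOf.erase(it) : std::next(it);
    }
    // Record this copy so later uses of its destination can be forwarded.
    if (inst.isCopy)
      copyOf[inst.def] = inst.uses[0];
  }
}
```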

    Reviewers: qcolombet, javed.absar, MatzeB, jonpa

    Subscribers: jyknight, nemanjai, llvm-commits, nhaehnle, mcrosier, mgorny

    Differential Revision: https://reviews.llvm.org/D30751

llvm-svn: 311135
2017-08-17 23:06:55 +00:00
Sanjay Patel f2d67f7ecc [x86] add tests for vector select-of-constants; NFC
We've discussed canonicalizing to this form in IR, so the backend
should be prepared to lower these in ways better than what we see
here in most cases.

llvm-svn: 311103
2017-08-17 17:07:37 +00:00
Adrian Prantl 6a57daad81 Improve line debug info when translating a CaseBlock to SDNodes.
The SelectionDAGBuilder translates various conditional branches into
CaseBlocks which are then translated into SDNodes. If a conditional
branch results in multiple CaseBlocks, only the first CaseBlock is
translated into SDNodes immediately; the rest of the CaseBlocks are
put in a queue and processed when all LLVM IR instructions in the
basic block have been processed.

When a CaseBlock is transformed into SDNodes the SelectionDAGBuilder
is queried for the current LLVM IR instruction and the resulting
SDNodes are annotated with the debug info of the current
instruction (if it exists and has debug metadata).

When the deferred CaseBlocks are processed, the SelectionDAGBuilder
does not have a current LLVM IR instruction, and the resulting SDNodes
will not have any debug info. As DwarfDebug::beginInstruction() outputs
a .loc directive for the first instruction in a labeled
block (typically the case for something coming from a CaseBlock), this
tends to produce a line-0 directive.

This patch changes the handling of CaseBlocks to store the current
instruction's debug info into the CaseBlock when it is created (and the
SelectionDAGBuilder knows the current instruction) and to always use
the stored debug info when translating a CaseBlock to SDNodes.
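
A minimal conceptual sketch of the approach, with hypothetical names (the real CaseBlock and DebugLoc live in SelectionDAGBuilder and are richer than this):

```
// Hypothetical sketch: record the debug location when the CaseBlock is
// created, while the builder still knows the current instruction, and reuse
// it when the deferred block is lowered later.
struct DebugLoc {
  unsigned Line = 0, Column = 0;
};

struct CaseBlock {
  // ... comparison operands, true/false destinations, etc. ...
  DebugLoc DL;  // captured at creation time, used when emitting SDNodes
  explicit CaseBlock(DebugLoc CreationLoc) : DL(CreationLoc) {}
};
```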

Patch by Frej Drejhammar!

Differential Revision: https://reviews.llvm.org/D36671

llvm-svn: 311097
2017-08-17 16:57:13 +00:00
Craig Topper 3a622a14f9 [AVX512] Don't switch unmasked subvector insert/extract instructions when AVX512DQI is enabled.
There's no reason to switch between the instructions with and without DQI. It just creates extra isel patterns and test divergences.

There is however value in enabling the masked version of the instructions with DQI.

This required introducing some new multiclasses to enable this splitting.

Differential Revision: https://reviews.llvm.org/D36661

llvm-svn: 311091
2017-08-17 15:40:25 +00:00