Commit Graph

91 Commits

Author SHA1 Message Date
Craig Topper 42fc7852f5 [X86] Print k-mask in FMA3 comments. 2020-04-12 13:16:53 -07:00
Simon Pilgrim 7e9747b50b [X86][F16C] Remove cvtph2ps intrinsics and use generic half2float conversion (PR37554)
This removes everything but int_x86_avx512_mask_vcvtph2ps_512 which provides the SAE variant, but even this can use the fpext generic if the rounding control is the default.

Differential Revision: https://reviews.llvm.org/D75162
2020-02-29 18:57:35 +00:00
Craig Topper 57eb56b839 [X86] Swap the 0 and the fudge factor in the constant pool for the 32-bit mode i64->f32/f64/f80 uint_to_fp algorithm.
This allows us to generate better code for selecting the fixup
to load.

Previously when the sign was set we had to load offset 0. And
when it was clear we had to load offset 4. This required a testl,
setns, zero extend, and finally a mul by 4. By switching the offsets
we can just shift the sign bit into the lsb and multiply it by 4.
2020-01-14 17:05:23 -08:00
Simon Pilgrim 31ed36d044 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (REAPPLIED)
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts (see D66004).

This reapplies rL368307 (reverted at rL369167) after the fix for the infinite loop reported at PR43024 was applied at rG3f087e38a2e7b87a5adaaac1c1b61e51220e7ff3
2019-11-04 11:37:57 +00:00
Simon Pilgrim 6ada70d1b5 [X86][SSE] LowerUINT_TO_FP_i64 - only use HADDPD for size/fast-hops
We were always generating a single source HADDPD, but really we should only do this if shouldUseHorizontalOp says its a good idea.

Differential Revision: https://reviews.llvm.org/D69175

llvm-svn: 375341
2019-10-19 11:53:48 +00:00
Jordan Rupprecht d0797ece46 Revert [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (reapplied)
This reverts r368662 (git commit 1a8d790cf5)

The compile-time regression repro is in https://bugs.llvm.org/show_bug.cgi?id=43024

llvm-svn: 369167
2019-08-16 23:08:56 +00:00
Simon Pilgrim 1a8d790cf5 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask (reapplied)
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts. 

The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.

Reapplying this as rL368307 had to be reverted as part of rL368660 to revert rL368276

llvm-svn: 368662
2019-08-13 10:51:39 +00:00
Hans Wennborg 5390d25f2b Revert r368276 "[TargetLowering] SimplifyDemandedBits - call SimplifyMultipleUseDemandedBits for ISD::EXTRACT_VECTOR_ELT"
This introduced a false positive MemorySanitizer warning about use of
uninitialized memory in a vectorized crc function in Chromium. That suggests
maybe something is not right with this transformation. See
https://crbug.com/992853#c7 for a reproducer.

This also reverts the follow-up commits r368307 and r368308 which
depended on this.

> This patch attempts to peek through vectors based on the demanded bits/elt of a particular ISD::EXTRACT_VECTOR_ELT node, allowing us to avoid dependencies on ops that have no impact on the extract.
>
> In particular this helps remove some unnecessary scalar->vector->scalar patterns.
>
> The wasm shift patterns are annoying - @tlively has indicated that the wasm vector shift codegen are to be refactored in the near-term and isn't considered a major issue.
>
> Differential Revision: https://reviews.llvm.org/D65887

llvm-svn: 368660
2019-08-13 09:33:25 +00:00
Simon Pilgrim 67c246bbe6 [X86] SimplifyDemandedVectorElts - attempt to recombine target shuffle using DemandedElts mask
If we don't demand all elements, then attempt to combine to a simpler shuffle.

At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts. 

The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.

llvm-svn: 368307
2019-08-08 15:54:20 +00:00
Craig Topper 033774e144 [X86] Cleanups and safety checks around the isFNEG
This patch does a few things to start cleaning up the isFNEG function.

-Remove the Op0/Op1 peekThroughBitcast calls that seem unnecessary. getTargetConstantBitsFromNode has its own peekThroughBitcast inside. And we have a separate peekThroughBitcast on the return value.
-Add a check of the scalar size after the first peekThroughBitcast to ensure we haven't changed the element size and just did something like f32->i32 or f64->i64.
-Remove an unnecessary check that Op1's type is floating point after the peekThroughBitcast. We're just going to look for a bit pattern from a constant. We don't care about its type.
-Add VT checks on several places that consume the return value of isFNEG. Due to the peekThroughBitcasts inside, the type of the return value isn't guaranteed. So its not safe to use it to build other nodes without ensuring the type matches the type being used to build the node. We might be able to replace these checks with bitcasts instead, but I don't have a test case so a bail out check seemed better for now.

Differential Revision: https://reviews.llvm.org/D63683

llvm-svn: 364206
2019-06-24 17:28:26 +00:00
Cameron McInally 8608afa964 Revert "[NFC][CodeGen] Add unary FNeg tests to X86/avx512-intrinsics-fast-isel.ll"
This reverts commit 41e0b9f280.

llvm-svn: 363301
2019-06-13 19:24:24 +00:00
Cameron McInally 675be5db46 Revert "[NFC][CodeGen] Add unary FNeg tests to X86/avx512-intrinsics-fast-isel.ll"
This reverts commit aeb89f8b33.

llvm-svn: 363300
2019-06-13 19:24:21 +00:00
Cameron McInally aeb89f8b33 [NFC][CodeGen] Add unary FNeg tests to X86/avx512-intrinsics-fast-isel.ll
Patch 2 of n.

llvm-svn: 363275
2019-06-13 15:54:20 +00:00
Cameron McInally 41e0b9f280 [NFC][CodeGen] Add unary FNeg tests to X86/avx512-intrinsics-fast-isel.ll
Patch 1 of n.

llvm-svn: 363215
2019-06-12 22:50:44 +00:00
Craig Topper d10a200ceb [X86] Remove the suffix on vcvt[u]si2ss/sd register variants in assembly printing.
We require d/q suffixes on the memory form of these instructions to disambiguate the memory size.
We don't require it on the register forms, but need to support parsing both with and without it.

Previously we always printed the d/q suffix on the register forms, but it's redundant and
inconsistent with gcc and objdump.

After this patch we should support the d/q for parsing, but not print it when its unneeded.

llvm-svn: 360085
2019-05-06 21:39:51 +00:00
Craig Topper df02beb416 [X86] Add the rounding control operand to the printing for some scalar FMA instructions.
llvm-svn: 358844
2019-04-21 07:12:56 +00:00
Sanjay Patel 26e06e859e [x86] add x86-specific opcodes to extractelement scalarization list
llvm-svn: 355792
2019-03-10 18:56:21 +00:00
Sanjay Patel 7fc6ef7dd7 [x86] scalarize extract element 0 of FP math
This is another step towards ensuring that we produce the optimal code for reductions,
but there are other potential benefits as seen in the tests diffs:

  1. Memory loads may get scalarized resulting in more efficient code.
  2. Memory stores may get scalarized resulting in more efficient code.
  3. Complex ops like fdiv/sqrt get scalarized which may be faster instructions depending on uarch.
  4. Even simple ops like addss/subss/mulss/roundss may result in faster operation/less frequency throttling when scalarized depending on uarch.

The TODO comment suggests 1 or more follow-ups for opcodes that can currently result in regressions.

Differential Revision: https://reviews.llvm.org/D58282

llvm-svn: 355130
2019-02-28 19:47:04 +00:00
Craig Topper 5d20eb240f [X86] Disable DomainReassignment pass when AVX512BW is disabled to avoid injecting VK32/VK64 references into the MachineIR
Summary:
This pass replaces GR8/GR16/GR32/GR64 with their equivalent sized mask register classes. But VK32/VK64 aren't legal without AVX512BW. Apparently this mostly appears to work if the register coalescer is able to remove the VK32/VK64 register class reference. Or if we don't ever spill it. But there's no guarantee of that.

Another Intel employee managed to trigger a crash due to this with ISPC. Unfortunately, I've lost the test case he sent me at the time. I'm trying to get him to reproduce it for me. I'd like to get this in before 8.0 branches since its a little scary.

The regressions here are unfortunate, but I think we can make some improvements to DAG combine, load folding, etc. to fix them. Just not sure if we can get that done for 8.0.

Fixes PR39741

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D56460

llvm-svn: 350800
2019-01-10 07:43:54 +00:00
Craig Topper 0229da8f07 [X86] Use GetDemandedBits to simplify the operands of PMULDQ/PMULUDQ.
This is an alternative to what I attempted in D56057.

GetDemandedBits is a special version of SimplifyDemandedBits that allows simplifications even when the operand has other uses. GetDemandedBits will only do simplifications that allow a node to be bypassed. It won't create new nodes or alter any of the other users.

I had to add support for bypassing SIGN_EXTEND_INREG to GetDemandedBits.

Based on a patch that Simon Pilgrim sent me in email.

Fixes PR40142.

llvm-svn: 350059
2018-12-24 19:40:20 +00:00
Simon Pilgrim 2a25360ae3 [X86] Auto upgrade XOP/AVX512 rotation intrinsics to generic funnel shift intrinsics (llvm)
This emits FSHL/FSHR generic intrinsics for the XOP VPROT and AVX512 VPROL/VPROR rotation intrinsics.

Clang counterpart: https://reviews.llvm.org/D55937

Differential Revision: https://reviews.llvm.org/D55938

llvm-svn: 349795
2018-12-20 19:01:07 +00:00
Craig Topper aa5eb2fbaa [X86] Force floating point values in constant pool decoding to print in scientific notation so they can't be confused with integers.
When the floating point constants are whole numbers they have no decimal point so look like integers, but mean something very different in something like an 'and' instruction.

Ideally we would just print a decimal point and a 0, but I couldn't see how to make APFloat::toString do that.

llvm-svn: 345488
2018-10-29 04:52:04 +00:00
Craig Topper 8315d9990c [X86] Stop promoting vector and/or/xor/andn to vXi64.
These promotions add additional bitcasts to the SelectionDAG that can pessimize computeKnownBits/computeNumSignBits. It also seems to interfere with broadcast formation.

This patch removes the promotion and adds isel patterns instead.

The increased table size is more than I would like, but hopefully we can find some canonicalizations or other tricks to start pruning out patterns going forward.

Differential Revision: https://reviews.llvm.org/D53268

llvm-svn: 345408
2018-10-26 17:21:26 +00:00
Sanjay Patel 4cf1da0e02 [SelectionDAG] allow FP binops in SimplifyDemandedVectorElts
This is intended to make the backend on par with functionality that was 
added to the IR version of SimplifyDemandedVectorElts in:
rL343727
...and the original motivation is that we need to improve demanded-vector-elements 
in several ways to avoid problems that would be exposed in D51553.

Differential Revision: https://reviews.llvm.org/D52912

llvm-svn: 344541
2018-10-15 18:05:34 +00:00
Sanjay Patel e28c8ecd72 [x86] add and use fast horizontal vector math subtarget feature
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen 
by default. AMD Jaguar (btver2) should have no difference with this patch because it has 
fast-hops. (If we want to set that bit for other CPUs, let me know.)

The code changes are small, but there are many test diffs. For files that are specifically 
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences 
side-by-side. For files that are primarily concerned with codegen other than hops, I just 
updated the CHECK lines to reflect the new default codegen.

To recap the recent horizontal op story:

1. Before rL343727, we were producing hops for all subtargets for a variety of patterns. 
   Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so 
   we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but 
   probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be 
   happy. I'm not sure if we have the best predicate here, but the intent is to undo the 
   extra hop-iness that was enabled by r344141.

Differential Revision: https://reviews.llvm.org/D53095

llvm-svn: 344361
2018-10-12 16:41:02 +00:00
Sanjay Patel 6cca8af227 [x86] allow single source horizontal op matching (PR39195)
This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd 
solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature
bit to deal with it here in the DAG.

Differential Revision: https://reviews.llvm.org/D52997

llvm-svn: 344141
2018-10-10 13:39:59 +00:00
Simon Pilgrim 3b04a4e322 [SelectionDAG] Respect multiple uses in SimplifyDemandedBits to SimplifyDemandedVectorElts simplification
rL343913 was using SimplifyDemandedBits's original demanded mask instead of the adjusted 'NewMask' that accounts for multiple uses of the op (those variable names really need improving....).

Annoyingly many of the test changes (back to pre-rL343913 state) are actually safe - but only because their multiple uses are all by PMULDQ/PMULUDQ.

Thanks to Jan Vesely (@jvesely) for bisecting the bug.

llvm-svn: 343935
2018-10-07 11:45:46 +00:00
Simon Pilgrim 9c9c97bcf4 [SelectionDAG] Add SimplifyDemandedBits to SimplifyDemandedVectorElts simplification
This patch enables SimplifyDemandedBits to call SimplifyDemandedVectorElts in cases where the demanded bits mask covers entire elements of a bitcasted source vector.

There are a couple of cases here where simplification at a deeper level (such as through bitcasts) prevents further simplification - CommitTargetLoweringOpt only adds immediate uses/users back to the worklist when we might want to combine the original caller again to see what else it can simplify.

As well as that I had to disable handling of bool vector until SimplifyDemandedVectorElts better supports some of their opcodes (SETCC, shifts etc.).

Fixes PR39178

Differential Revision: https://reviews.llvm.org/D52935

llvm-svn: 343913
2018-10-06 10:20:04 +00:00
Simon Pilgrim 01ae462fef [X86][SSE] Combine (some) target shuffles with multiple uses
As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.

This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool.

However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.

This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit.

Differential Revision: https://reviews.llvm.org/D50328

llvm-svn: 339335
2018-08-09 12:30:02 +00:00
Craig Topper a7a12399a1 [X86] Remove all the vector NOP bitcast patterns. Use a few lines of code in the Select method in X86ISelDAGToDAG.cpp instead.
There are a lot of permutations of types here generating a lot of patterns in the isel table. It's more efficient to just ReplaceUses and RemoveDeadNode from the Select function.

The test changes are because we have a some shuffle patterns that have a bitcast as their root node. But the behavior is identical to another instruction whose pattern doesn't start with a bitcast. So this isn't a functional change.

llvm-svn: 338824
2018-08-03 07:01:10 +00:00
Craig Topper f0b164415c [X86] Prefer blendi over movss/sd when avx512 is enabled unless optimizing for size.
AVX512 doesn't have an immediate controlled blend instruction. But blend throughput is still better than movss/sd on SKX.

This commit changes AVX512 to use the AVX blend instructions instead of MOVSS/MOVSD. This constrains the register allocation since it won't be able to use XMM16-31, but hopefully the increased throughput and reduced port 5 pressure makes up for that.

llvm-svn: 337083
2018-07-14 02:05:08 +00:00
Craig Topper 9e17073c21 [X86] Enhance combineFMA to look for FNEG behind an EXTRACT_VECTOR_ELT.
llvm-svn: 336514
2018-07-08 18:04:00 +00:00
Craig Topper fdf3f1ff82 [X86] Add new scalar fma intrinsics with rounding mode that use f32/f64 types.
This allows us to handle masking in a very similar way to the default rounding version that uses llvm.fma.

I had to add new rounding mode CodeGenOnly instructions to support isel when we can't find a movss to grab the upper bits from to use the b_Int instruction.

Fast-isel tests have been updated to match new clang codegen.

We are currently having trouble folding fneg into the new intrinsic. I'm going to correct that in a follow up patch to keep the size of this one down.

A future patch will also remove the old intrinsics.

llvm-svn: 336506
2018-07-08 01:10:43 +00:00
Craig Topper d679d01a1f [X86] Use a rounding mode other than 4 in the scalar fma intrinsic fast-isel tests to match clang test cases.
llvm-svn: 336505
2018-07-08 00:32:56 +00:00
Craig Topper df99cdb95b [X86] Fix a few test names in avx512-intrinsics-fast-isel.ll to match their clang intrinsic names.
I thought I fixed these yesterday, but I guess I missed a few.

llvm-svn: 336071
2018-07-01 23:49:06 +00:00
Craig Topper 50a10ba6e0 [X86] Update some avx512 fast-isel tests to match their real clang IRgen.
Especially of note was the test_mm_mask_set1_epi64 and other set1 tests that were truncating the element to be broadcasted to i8 and broadcasting that instead of a whole 64 bit value.

Some of the others were just correcting mask sizes on parameters due to bugs in the clang test case they were generated from that have now been fixed.

Some were converting i8 to <4 x i1>/<2 x i1> by truncating to i4/i2 and then bitcasting. But the clang codegen is bitcast to <8 x i1>, then extract to <4 x i1>/<2 x i1>. This is likely to incur less trouble from the integer type legalizer in the backend.

llvm-svn: 336045
2018-06-30 07:25:29 +00:00
Craig Topper 59f2f38fe0 [X86] Remove masking from avx512 rotate intrinsics. Use select in IR instead.
llvm-svn: 336035
2018-06-30 01:32:04 +00:00
Craig Topper 875e9f8fa4 [X86] Remove masking from the avx512 packed sqrt intrinsics. Use select in IR instead.
While there improve the coverage of the intrinsic testing and add fast-isel tests.

llvm-svn: 335944
2018-06-29 05:43:26 +00:00
Craig Topper 8014053cbd [X86] Update fast-isel tests for clang r335253.
The new IR fixes a mismatch in the final extractelement for the i32 intrinsics. Previously we extracted a 64-bit element even though we only wanted 32 bits.

SimplifyDemandedElts isn't able to make FP elements undef now and the shuffle mask I used prevents the use of horizontal add we had before. Not sure we should have been using horizontal add anyway. It's implemented on Intel with two port 5 shuffles and an add. So we have on less shuffle now, but an additional instruction to decode.

Differential Revision: https://reviews.llvm.org/D48347

llvm-svn: 335256
2018-06-21 16:54:18 +00:00
Craig Topper 296526bf46 [X86] Remove masking from 512-bit floating max/min intrinsics. Use select instruction instead.
llvm-svn: 335199
2018-06-21 05:00:56 +00:00
Craig Topper 31a64ee76c [X86] Remove a fptosi from the test_mm512_mask_reduce_max_pd fast-isel test.
The clang test inadvertently turned a floating point value into a double by having the wrong return type on the test function relative to the intrinsic it was testing.

This resulted in an extra fptosi instruction that propagated into this test when I copied the clang output.

llvm-svn: 335094
2018-06-20 04:32:06 +00:00
Craig Topper 31961f051f [X86] Update fast-isel tests for clang's avx512f reduction intrinsics to match the codegen from r335070.
llvm-svn: 335071
2018-06-19 19:14:50 +00:00
Craig Topper 858afbd165 [X86] Add fast-isel tests for clang's AVX512F vector reduction intrinsics.
llvm-svn: 335068
2018-06-19 18:52:15 +00:00
Craig Topper 3c4cc01226 [X86] Add more instructions to the memory folding tables using the autogenerated table as a guide.
I think this covers most of the unmasked vector instructions. We're still missing a lot of the masked instructions.

There are some test changes here because of the new folding support. I don't think these particular cases should be folded because it creates an undef register dependency. I think the changes introduced in r334175 are not handling stack folding. They're only blocking the peephole pass.

llvm-svn: 334800
2018-06-15 05:49:19 +00:00
Craig Topper 3b060daba5 [X86] Fix some checks to use X86 instead of X32.
These tests were recently updated so it looks like gone wrong.

llvm-svn: 334786
2018-06-15 04:42:55 +00:00
Tomasz Krupa d8d66a6b28 [X86] Lowering Mask Scalar intrinsics to native IR (LLVM part)
Summary: Complementary patch to lowering add, sub, mul and div mask scalar
intrinsics in Clang.

Reviewers: craig.topper, sroland, spatel, RKSimon

Reviewed by: craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D47978

llvm-svn: 334740
2018-06-14 17:32:58 +00:00
Craig Topper 304bd747af [X86] Add expandload and compresstore fast-isel tests for avx512f and avx512vl. Update existing tests for avx512vbmi2 to use target independent intrinsics.
llvm-svn: 334368
2018-06-10 18:55:37 +00:00
Simon Pilgrim 1f60e2b41b [X86][AVX512] Cleanup intrinsics tests
Ensure we test on 32-bit and 64-bit targets, and strip -mcpu usage.

Part of ongoing work to ensure we test all intrinsic style tests on 32 and 64 bit targets where possible.

llvm-svn: 333843
2018-06-03 14:56:04 +00:00
Gabor Buella 890e363e11 [X86] Lowering FMA intrinsics to native IR (LLVM part)
Support for Clang lowering of fused intrinsics. This patch:

1. Removes bindings to clang fma intrinsics.
2. Introduces new LLVM unmasked intrinsics with rounding mode:
     int_x86_avx512_vfmadd_pd_512
     int_x86_avx512_vfmadd_ps_512
     int_x86_avx512_vfmaddsub_pd_512
     int_x86_avx512_vfmaddsub_ps_512
     supported with a new intrinsic type (INTR_TYPE_3OP_RM).
3. Introduces new x86 fmaddsub/fmsubadd folding.
4. Introduces new tests for code emitted by sequentions introduced in Clang part.

Patch by tkrupa

Reviewers: craig.topper, sroland, spatel, RKSimon

Reviewed By: craig.topper, RKSimon

Differential Revision: https://reviews.llvm.org/D47443

llvm-svn: 333554
2018-05-30 15:25:16 +00:00
Craig Topper 2adc7d956c [X86] Add unmasked vermi2var intrinsics so we can use explicit select instructions for masking in clang.
This will allow us to remove the 3 different flavors of masked intrinsics. I'm leaving the actual intrinsic removal for another patch.

llvm-svn: 333386
2018-05-29 03:26:30 +00:00