Commit Graph

5371 Commits

Author SHA1 Message Date
Craig Topper b8d7b1620b [X86] Custom legalize (v2i1 (fp_to_uint/fp_to_sint v2f64)) without AVX512VL.
Strangely the code was already present, just the setOperationAction wasn't being called without VLX.

llvm-svn: 324806
2018-02-10 08:39:31 +00:00
Craig Topper c3aab4bbe1 [X86] Legalize zero extends from vXi1 to vXi16/vXi32/vXi64 using a sign extend and a shift.
This avoids a constant pool load to create 1.

The int->float are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization.

llvm-svn: 324805
2018-02-10 08:06:52 +00:00
Craig Topper d34af6f636 [X86] Teach combineExtSetcc to handle ZERO_EXTEND by widening the setcc and then masking. A later DAG combine will convert to a shift.
This helps to avoid a constant pool load needed to zero extend from the mask.

llvm-svn: 324804
2018-02-10 08:06:49 +00:00
Craig Topper fa6113b3d7 [X86] Teach combineInsertSubvector how to combine some k-register insert_subvectors and extract_subvector sequences to remove extra zeroing.wq
llvm-svn: 324791
2018-02-10 01:00:41 +00:00
Craig Topper 99db883d55 [X86] Teach lower1BitVectorShuffle to recognize shuffles that are just filling upper elements with zero. Replace with insert_subvector.
There's still some extra kshifts in one of the modified test cases here, but hopefully that's only a DAG combine away.

llvm-svn: 324782
2018-02-09 23:32:27 +00:00
Craig Topper ca5841b4e4 [X86] Simplify some code in lowerV4X128VectorShuffle and lowerV2X128VectorShuffle
Previously we extracted two subvectors and concatenate. But the concatenate will be lowered to two insert subvectors. Then DAG combine will merge once of the inserts and one of the extracts back into the original vector. We might as well just directly use one extract and one insert.

llvm-svn: 324710
2018-02-09 05:54:36 +00:00
Craig Topper 28166a877d [X86] Teach shuffle lowering to recognize 128/256 bit insertions into a zero vector.
This regresses a couple cases in the shuffle combining test. But those cases use intrinsics that InstCombine knows how to turn into a generic shuffle earlier. This should give opportunities to fold this earlier in InstCombine or DAG combine.

llvm-svn: 324709
2018-02-09 05:54:34 +00:00
Craig Topper 9e030c9e00 [X86] Improve combineCastedMaskArithmetic to fold (bitcast (vXi1 (and/or/xor X, C)))->(vXi1 (and/or/xor (bitcast X), (bitcast C)) where C is a constant build_vector.
Most vxi1 constant build vectors have to be implemented in the scalar domain anyway so we'll probably end up with a cast there later. But by then its too late to do the combine to get rid of it.

llvm-svn: 324662
2018-02-08 22:26:39 +00:00
Craig Topper 1b5b4ccb77 [X86] Add DAG combine to constant fold a bitcast of a vXi1 constant build_vector into a scalar integer.
llvm-svn: 324661
2018-02-08 22:26:36 +00:00
Craig Topper dccf72b583 [X86] Remove kortest intrinsics and replace with native IR.
llvm-svn: 324646
2018-02-08 20:16:06 +00:00
Clement Courbet 1b8c08b633 [X86] Fix compilation of r324580.
@ctopper Can you check that the fix is correct ?

llvm-svn: 324586
2018-02-08 09:41:50 +00:00
Craig Topper 8d0c8c9be1 [X86] Support folding in a k-register OR when creating KORTEST from scalar compare of a bitcast from vXi1.
This should allow us to remove the kortest intrinsic from IR and use compare+bitcast+or in IR instead.

llvm-svn: 324580
2018-02-08 08:29:43 +00:00
Craig Topper 93505707b6 [X86] Allow KORTEST instruction to be used for testing if a mask is all ones
The KTEST instruction sets the C flag if the result of anding both operands together is all 1s. We can use this to lower (icmp eq/ne (bitcast (vXi1 X), -1)

Differential Revision: https://reviews.llvm.org/D42772

llvm-svn: 324577
2018-02-08 07:54:16 +00:00
Craig Topper f5465f98d2 [X86] Don't emit KTEST instructions unless only the Z flag is being used
Summary:
KTEST has weird flag behavior. The Z flag is set for all bits in the AND of the k-registers being 0, and the C flag is set for all bits being 1. All other flags are cleared.

We currently emit this instruction in EmitTEST and don't check the condition code. This can lead to strange things like using the S flag after a KTEST for a signed compare.

The domain reassignment pass can also transform TEST instructions into KTEST and is not protected against the flag usage either. For now I've disabled this part of the domain reassignment pass. I tried to comment out the checks in the mir test so that we could recover them later, but I couldn't figure out how to get that to work.

This patch moves the KTEST handling into LowerSETCC and now creates a ktest+x86setcc. I've chosen this approach because I'd like to add support for the C flag for all ones in a followup patch. To do that requires that I can rewrite the condition code going in the x86setcc to be different than the original SETCC condition code.

This fixes PR36182. I'll file a PR to fix domain reassignment once this goes in. Should this be merged to 6.0?

Reviewers: spatel, guyblank, RKSimon, zvi

Reviewed By: guyblank

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42770

llvm-svn: 324576
2018-02-08 07:45:55 +00:00
Craig Topper 37765ff326 [X86] Prune some unreachable 'return SDValue()' paths from LowerSIGN_EXTEND/LowerZERO_EXTEND/LowerANY_EXTEND.
We were doing a lot of whitelisting of what we handle in these routines, but setOperationAction constrains what we can get here. So just add some asserts and prune the unreachable paths.

llvm-svn: 324538
2018-02-07 22:45:38 +00:00
Craig Topper 1db5ebc016 [X86] Remove dead code from EmitTest that looked for an i1 type which should have already been type legalized away. NFC
llvm-svn: 324536
2018-02-07 22:19:26 +00:00
Simon Pilgrim b4e789e8f6 [X86][AVX] Add PACKSSDW/PACKUSDW support for truncation of clamped values
SSE and shorter vector sizes will have to wait until we can add support for general SMIN/SMAX matching.

llvm-svn: 324485
2018-02-07 15:48:44 +00:00
Chandler Carruth 282ae1632a [x86/retpoline] Make the external thunk names exactly match the names
that happened to end up in GCC.

This is really unfortunate, as the names don't have much rhyme or reason
to them. Originally in the discussions it seemed fine to rely on aliases
to map different names to whatever external thunk code developers wished
to use but there are practical problems with that in the kernel it turns
out. And since we're discovering this practical problems late and since
GCC has already shipped a release with one set of names, we are forced,
yet again, to blindly match what is there.

Somewhat rushing this patch out for the Linux kernel folks to test and
so we can get it patched into our releases.

Differential Revision: https://reviews.llvm.org/D42998

llvm-svn: 324449
2018-02-07 06:16:24 +00:00
Craig Topper 58ecffd857 [DAGCombiner][AMDGPU][X86] Turn cttz/ctlz into cttz_zero_undef/ctlz_zero_undef if we can prove the input is never zero
X86 currently has a late DAG combine after cttz/ctlz are turned into BSR+BSF+CMOV to detect this and remove the CMOV. But we should be able to do this much earlier and avoid creating the cmov all together.

For the changed AMDGPU test case it appears that previously the i8 cttz was type legalized to i16 which introduced an OR with 256 in order to limit the result to 8 on the widened type. At this point the result is known to never be zero, but nothing checked that. Then operation legalization is told to promote all i16 cttz to i32. This introduces an extend and a truncate and another OR with 65536 to limit the result to 16. With the DAG combiner change we are able to prevent the creation of the second OR since the opcode will have been changed to cttz_zero_undef after the first OR. I the lack of the OR caused the instruction to change to v_ffbl_b32_sdwa

Differential Revision: https://reviews.llvm.org/D42985

llvm-svn: 324427
2018-02-06 23:54:37 +00:00
Simon Pilgrim ae00a71f55 [X86][SSE] Add PACKUS support for truncation of clamped values
Followup to D42544 that matches PACKUSWB cases for non-AVX512, SSE and PACKUSDW cases will have to wait until we can add support for general SMIN/SMAX matching.

llvm-svn: 324347
2018-02-06 14:07:46 +00:00
Simon Pilgrim 90a237bf83 [X86][SSE] Add PACKSS support for truncation of clamped values
Followup to D42544 that matches PACKSSWB cases for non-AVX512, SSE and PACKSSDW cases will have to wait until we can add support for general SMIN/SMAX matching.

llvm-svn: 324339
2018-02-06 12:16:10 +00:00
Craig Topper 9c6c7c5e9b [X86] Relax restrictions on what setcc condition codes can be folded with a sext when AVX512 is enabled.
We now allow all signed comparisons and not equal. The complement that needs to be added for this is no worse than the extend. And the vector output forms of pcmpeq/pcmpgt have better latency than the k-register version on SKX.

llvm-svn: 324294
2018-02-05 23:57:01 +00:00
Craig Topper 5a2bd99a9e [X86] Add isel patterns for selecting masked SUBV_BROADCAST with bitcasts. Remove combineBitcastForMaskedOp.
Add test cases for the merge masked versions to make sure we have all those covered.

llvm-svn: 324210
2018-02-05 08:37:37 +00:00
Craig Topper 6ff5eb5dd5 [X86] Remove unused lambda. NFC
llvm-svn: 324206
2018-02-05 06:56:33 +00:00
Craig Topper 25ceba7f30 [X86] Remove X86ISD::SHUF128 from combineBitcastForMaskedOp. Use isel patterns instead.
We always created X86ISD::SHUF128 with a 64-bit element type so we can use isel patterns to detect a bitconvert to 32-bit to handle masking.

The test changes are because we also match the bitconvert even if there is no masking. This leads to unnecessary isel pattern, but it requires more multiclass hackery in tablegen to get rid of it.

llvm-svn: 324205
2018-02-05 06:00:23 +00:00
Craig Topper 8d511a65af [X86] Add DAG combine to turn (bitcast (and/or/xor (bitcast X), Y)) -> (and/or/xor X, (bitcast Y)) when casting between GPRs and mask operations.
This reduces the number of transitions between k-registers and GPRs, reducing the number of instructions.

There's still some room for improvement to remove more transitions, but this is a good start.

llvm-svn: 324184
2018-02-04 01:43:48 +00:00
Craig Topper 17d99f1df4 [X86] Remove unused function argument. NFC
llvm-svn: 324183
2018-02-04 01:43:44 +00:00
Craig Topper 071ad9c6e0 [X86] Remove and autoupgrade kand/kandn/kor/kxor/kxnor/knot intrinsics.
Clang already stopped using these a couple months ago.

The test cases aren't great as there is nothing forcing the operations to stay in k-registers so some of them moved back to scalar ops due to the bitcasts being moved around.

llvm-svn: 324177
2018-02-03 20:18:25 +00:00
Craig Topper fae8788cfa [X86] Prefer to create a ISD::SETCC over X86ISD::PCMPEQ in combineVectorSizedSetCCEquality.
This is running pre-legalize, we should try to use target independent nodes. This will give the best opportunity for target independent optimizations.

llvm-svn: 324147
2018-02-02 21:59:46 +00:00
Craig Topper 10aa254ecd [X86] Pass SDLoc by const reference in a few more places in X86ISelLowering.cpp. NFC
llvm-svn: 324135
2018-02-02 20:32:00 +00:00
Craig Topper 76c5ce5184 [X86] Legalize (v64i1 (bitcast (i64 X))) on 32-bit targets by extracting 32-bit halves from i32, bitcasting each to v32i1, and concatenating.
This prevents the scalarization that would otherwise occur.

llvm-svn: 324057
2018-02-02 05:59:33 +00:00
Craig Topper 5570e03b21 [X86] Legalize (i64 (bitcast (v64i1 X))) on 32-bit targets by extracting to v32i1 and bitcasting to i32.
This saves a trip through memory and seems to open up other combining opportunities.

llvm-svn: 324056
2018-02-02 05:59:31 +00:00
Craig Topper 2d67d1e2a8 [X86] Separate the call to LowerVectorAllZeroTest from EmitTest. NFCI
Every instruction that has the word TEST in its name seems to have been buried into EmitTest. But that code is largely concerned with trying to reuse the flags from instructions that update flags in a pretty normal way.

PTEST/TESTP/KTEST do not update flags in a normal way. They only update Z and C and the C flag update is non-standard. Rather than try to bend EmitTest's already complex logic to accomodate this, just move the call up to LowerSETCC and replicate the few pre-checks that are needed.

While there add a FIXME for using the C flag for checking for all 1s which we definitely couldn't do from EmitTEST.

llvm-svn: 324029
2018-02-01 23:21:20 +00:00
Simon Pilgrim 1a8cefc328 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - add support for scaling index vectors
This allows us to use PSHUFB for v8i16/v4i32 and VPERMD/PERMPS for v4i64/v4f64 variable shuffles.

Differential Revision: https://reviews.llvm.org/D42487

llvm-svn: 323987
2018-02-01 18:10:30 +00:00
Craig Topper a8a24232ee [X86] Remove custom lowering vXi1 extending loads and truncating stores.
Summary: Now that v2i1/v4i1 are legal without VLX. And v32i1 is legalized by splitting rather than widening. And isVectorLoadExtDesirable returns false for vXi1. It appears this handling is dead because the operations simply don't exist.

Reviewers: RKSimon, zvi, guyblank, delena, spatel

Reviewed By: delena

Subscribers: llvm-commits, rengolin

Differential Revision: https://reviews.llvm.org/D42781

llvm-svn: 323983
2018-02-01 17:08:41 +00:00
Craig Topper 7e910a9e85 [X86] Turn X86ISD::AND nodes that have no flag users back into ISD::AND just before isel to enable test instruction matching
Summary:
EmitTest sometimes creates X86ISD::AND specifically to hide the AND from DAG combine. But this prevents isel patterns that look for (cmp (and X, Y), 0) from being able to see it. So we end up with an AND and a TEST. The TEST gets removed by compare instruction optimization during the peephole pass.

This patch attempts to fix this by converting X86ISD::AND with no flag users back into ISD::AND during the DAG preprocessing just before isel.

In order to do this correctly I had to make the X86ISD::AND node created by EmitTest in this case really have a flag output. Which arguably it should have had anyway so that the number of operands would be consistent for the opcode in all cases. Then I had to modify the ReplaceAllUsesWith to understand that we might be looking at an instruction with 2 outputs. Though in this case there are no uses to replace since we just created the node, but that's what the code did before so I just made it keep working.

Reviewers: spatel, RKSimon, niravd, deadalnix

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42764

llvm-svn: 323982
2018-02-01 17:08:39 +00:00
Dean Michael Berris cdca0730be [XRay][compiler-rt+llvm] Update XRay register stashing semantics
Summary:
This change expands the amount of registers stashed by the entry and
`__xray_CustomEvent` trampolines.

We've found that since the `__xray_CustomEvent` trampoline calls can show up in
situations where the scratch registers are being used, and since we don't
typically want to affect the code-gen around the disabled
`__xray_customevent(...)` intrinsic calls, that we need to save and restore the
state of even the scratch registers in the handling of these custom events.

Reviewers: pcc, pelikan, dblaikie, eizan, kpw, echristo, chandlerc

Reviewed By: echristo

Subscribers: chandlerc, echristo, hiraditya, davide, dblaikie, llvm-commits

Differential Revision: https://reviews.llvm.org/D40894

llvm-svn: 323940
2018-02-01 02:21:54 +00:00
Craig Topper e44faf53c7 [X86] Make the type checks in detectAVX512USatPattern more robust
This code currently uses isSimple and getSizeInBits in an attempt to prune types. But isSimple will return true for any type that any target supports natively. I don't think that's a good way to prune types. I also don't think the dest element type checks are very robust since we didn't do an isSimple check on the dest type.

This patch adds a check for the input type being legal to the one caller that didn't already check that. Then we explicitly check the element types for the destination are i8, i16, or i32

Differential Revision: https://reviews.llvm.org/D42706

llvm-svn: 323924
2018-01-31 22:26:31 +00:00
Craig Topper d759f476e8 [X86] Remove redundant check for hasAVX512 before calling hasBWI. NFC
hasBWI implies hasAVX512.

llvm-svn: 323823
2018-01-30 21:53:35 +00:00
Simon Pilgrim 073f089c6e [X86][XOP] Update isVectorShiftByScalarCheap with cases covered by XOP
Similar to D42437, XOP supports variable shift for v16i8/v8i16/v4i32/v2i64 types.

Differential Revision: https://reviews.llvm.org/D42526

llvm-svn: 323797
2018-01-30 18:10:21 +00:00
Craig Topper eb13ebdb99 [X86] Don't create SHRUNKBLEND when the condition is used by the true or false operand of the vselect.
Fixes PR34592.

Differential Revision: https://reviews.llvm.org/D42628

llvm-svn: 323672
2018-01-29 17:56:57 +00:00
Hiroshi Inoue c8e9245816 [NFC] fix trivial typos in comments and documents
"to to" -> "to"

llvm-svn: 323628
2018-01-29 05:17:03 +00:00
Craig Topper 3913a4dd56 [X86] Fix a crash that can occur in combineExtractVectorElt due to not checking the width of a ConstantSDNode before calling getConstantOperandVal.
llvm-svn: 323614
2018-01-28 07:29:35 +00:00
Craig Topper 15d69739e2 [X86] Remove VPTESTM/VPTESTNM ISD opcodes. Use isel patterns matching cmpm eq/ne with immallzeros.
llvm-svn: 323612
2018-01-28 00:56:30 +00:00
Craig Topper 247016a735 [X86] Use vptestm/vptestnm for comparisons with zero to avoid creating a zero vector.
We can use the same input for both operands to get a free compare with zero.

We already use this trick in a couple places where we explicitly create PTESTM with the same input twice. This generalizes it.

I'm hoping to remove the ISD opcodes and move this to isel patterns like we do for scalar cmp/test.

llvm-svn: 323605
2018-01-27 20:19:09 +00:00
Craig Topper 513d3fa674 [X86] Remove X86ISD::PCMPGTM/PCMPEQM and instead just use X86ISD::PCMPM and pattern match the immediate value during isel.
Legalization is still biased to turn LT compares in to GT by swapping operands to avoid needing extra isel patterns to commute.

I'm hoping to remove TESTM/TESTNM next and this should simplify that by making EQ/NE more similar.

llvm-svn: 323604
2018-01-27 20:19:02 +00:00
Simon Pilgrim fe3fac805a [X86][SSE] Simplify demanded elements from BROADCAST shuffle source.
If broadcasting from another shuffle, attempt to simplify it.

We can probably generalize this a lot more (embedding in combineX86ShufflesRecursively), but BROADCAST is one of the more troublesome as it accepts inputs of different sizes to the result.

llvm-svn: 323602
2018-01-27 19:48:13 +00:00
Benjamin Kramer a03d3198ee [X86] Unbreak the build.
X86ISelLowering.cpp:34130:5: error: return type 'llvm::SDValue' must
match previous return type 'const llvm::SDValue' when lambda expression
has unspecified explicit return type

llvm-svn: 323557
2018-01-26 20:16:43 +00:00
Craig Topper d4795b700d [X86] Allow any_extend to be combined with setcc on VLX targets.
For VLX target getSetccResultType returns vXi1 which prevents the target independent DAG combine from doing this tranform itself.

llvm-svn: 323555
2018-01-26 20:02:52 +00:00
Simon Pilgrim 8e9becbd81 [X86][AVX512] Add combining support for X86ISD::VTRUNCS
Similar to the existing support for X86ISD::VTRUNCUS.

Differential Revision: https://reviews.llvm.org/D42544

llvm-svn: 323553
2018-01-26 20:01:12 +00:00
Sanjay Patel b8ae262bd3 [x86] fix typo in comment; NFC
llvm-svn: 323545
2018-01-26 18:44:32 +00:00
Simon Pilgrim 1b14bdc0b8 [X86][AVX] LowerBUILD_VECTORAsVariablePermute - add support for VPERMILPV to v4i32/v4f32
Extension to D42431, adding support for v4i32/v4f32 as well as v2i64/v2f64 now that D42308 has landed

llvm-svn: 323542
2018-01-26 17:19:59 +00:00
Simon Pilgrim 76ede609f6 [X86][SSE] Don't colaesce v4i32 extracts
We currently coalesce v4i32 extracts from all 4 elements to 2 v2i64 extracts + shifts/sign-extends.

This seems to have been added back in the days when we tended to spill vectors and reload scalars, or ended up with repeated shuffles moving everything down to 0'th index. I don't think either of these are likely these days as we have better EXTRACT_VECTOR_ELT and VECTOR_SHUFFLE handling, and the existing code tends to make it very difficult for various vector and load combines.

Differential Revision: https://reviews.llvm.org/D42308

llvm-svn: 323541
2018-01-26 17:11:34 +00:00
Simon Pilgrim d567c27c84 [X86][SSE] Drop PMADDWD in lowerMul
As mentioned in D42258, we don't need this any more

llvm-svn: 323540
2018-01-26 16:57:36 +00:00
Simon Pilgrim 445d7c0e5c [X86] Cleanup SDLoc arguments as mentioned on D42544
llvm-svn: 323526
2018-01-26 14:00:01 +00:00
Craig Topper 882f0d7955 [X86] Remove dead code from LowerBUILD_VECTOR that tried to handle i64 element type in 32-bit mode.
Type legalization would prevent any i64 operands to the build_vector from existing before we get here. The coverage bots show this code as uncovered.

llvm-svn: 323506
2018-01-26 07:30:44 +00:00
Craig Topper 77c5077585 [X86] Remove code from combineBitcastvxi1 that was needed to support the previous native IR for kunpck intrinsics.
The original autoupgrade for kunpck intrinsics used a bitcasted scalar shift, or, and. This combine would turn this into a concat_vectors. Now the kunpck intrinsics are autoupgraded to a vector shuffle that will become a concat_vectors.

llvm-svn: 323504
2018-01-26 07:15:21 +00:00
Craig Topper 95e8c9143e [X86] Remove unused intrinsic type handling. NFC
llvm-svn: 323503
2018-01-26 07:15:20 +00:00
Craig Topper ccb35dfda6 [X86] Simplify condition in VSETCC. NFC
This listed all legal 128-bit integer types individually, but since we already know we have a legal type and its integer, we can just check is128BitVector.

llvm-svn: 323502
2018-01-26 07:15:18 +00:00
Craig Topper faa56f7b08 [X86] Remove LowerVSETCC code for handling vXi1 setcc with vXi8/vXi16 input type. NFC
These kinds of setccs are promoted by a DAG combine before they ever get to legalization.

llvm-svn: 323501
2018-01-26 07:15:17 +00:00
Craig Topper ad8ce0b800 [X86] Remove some dead code from LowerVSETCC. NFC
This code was added in r321967, but ultimately I fixed the issue in the legalizer and this code was no longer required.

llvm-svn: 323500
2018-01-26 07:15:16 +00:00
Simon Pilgrim 09c56b799f [X86] Apply clang-format to detectUSatPattern. NFCI.
Cleanup from D42544

llvm-svn: 323439
2018-01-25 16:38:56 +00:00
Simon Pilgrim 9f551ad604 [X86][SSE] Aggressively use PMADDWD for v4i32 multiplies with 17 or more leading zeros
As discussed in D41484, PMADDWD for 'zero extended' vXi32 is nearly always a better option than PMULLD:
On SNB it will result in code that isn't any faster, but not any slower so we may as well keep it.
On KNL it only has half the throughput, so I've disabled it on there - ideally there'd be a better way than this.

Differential Revision: https://reviews.llvm.org/D42258

llvm-svn: 323367
2018-01-24 19:20:02 +00:00
Simon Pilgrim f26df47831 [X86][SSE] Avoid calls to combineX86ShufflesRecursively that can't combine to target shuffles (PR32037)
Don't bother making recursive calls to combineX86ShufflesRecursively if we have more shuffle source operands than will be combined together with the remaining recursive depth.

See https://bugs.llvm.org/show_bug.cgi?id=32037#c26 and https://bugs.llvm.org/show_bug.cgi?id=32037#c27 for the reduction in compile times from this patch.

Differential Revision: https://reviews.llvm.org/D42378

llvm-svn: 323320
2018-01-24 11:41:09 +00:00
Craig Topper 0321ebc054 [X86] Use ISD::SIGN_EXTEND instead of X86ISD::VSEXT for mask to xmm/ymm/zmm conversion
There are a couple tricky things with this patch.

I had to add an override of isVectorLoadExtDesirable to stop DAG combine from combining sign_extend with loads after legalization since we legalize sextload using a load+sign_extend. Overriding this hook actually prevents a lot sextloads from being created in the first place.

I also had to add isel patterns because DAG combine blindly combines sign_extend+truncate to a smaller sign_extend which defeats what legalization was trying to do.

Differential Revision: https://reviews.llvm.org/D42407

llvm-svn: 323301
2018-01-24 04:51:17 +00:00
Zvi Rackover b5447b1e7c X86: Update isVectorShiftByScalarCheap with cases covered by AVX512BW
Summary:
AVX512BW adds support for variable shift amount for 16-bit element
vectors.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: RKSimon

Subscribers: rengolin, tschuett, llvm-commits

Differential Revision: https://reviews.llvm.org/D42437

llvm-svn: 323292
2018-01-24 01:36:40 +00:00
Simon Pilgrim 2cc74ed2be [X86][AVX] LowerBUILD_VECTORAsVariablePermute - add support for VPERMILPV to v2i64/v2f64
Minor refactor to make it possible for LowerBUILD_VECTORAsVariablePermute to be used with a wider variety of shuffles op and types.

I'd have liked to add v4i32/v4f32 support as well but we don't see v4i32 index extractions at the moment (which is why I created D42308)

After this I intend to begin adding scaling support for PSHUFB (v8i16, v4i32, v2i64)) and VPERMPS (v4f64, v4i64).

Differential Revision: https://reviews.llvm.org/D42431

llvm-svn: 323260
2018-01-23 21:33:24 +00:00
Simon Pilgrim 6ff241fc99 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - extract subvector from oversized index vectors
llvm-svn: 323223
2018-01-23 17:02:15 +00:00
Craig Topper c58c2b5c9b [X86] Rewrite vXi1 element insertion by using a vXi1 scalar_to_vector and inserting into a vXi1 vector.
The existing code was already doing something very similar to subvector insertion so this allows us to remove the nearly duplicate code.

This patch is a little larger than it should be due to differences between the DQI handling between the two today.

llvm-svn: 323212
2018-01-23 15:56:36 +00:00
Simon Pilgrim 0c9f77a9f9 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the source vector is not larger than the destination
We might be able to support this in the future with VPERMV3, OR(PSHUFB, PSHUFB) etc.

llvm-svn: 323210
2018-01-23 15:51:03 +00:00
Simon Pilgrim 9b4a097f94 Use EVT::changeVectorElementTypeToInteger() to convert index type to integer
llvm-svn: 323207
2018-01-23 15:30:07 +00:00
Simon Pilgrim e2905c8a0c [X86][SSE] LowerBUILD_VECTORAsVariablePermute - ensure that the index vector has the correct number of elements
llvm-svn: 323206
2018-01-23 15:13:37 +00:00
Craig Topper 76adcc86cd [X86] Legalize v32i1 without BWI via splitting to v16i1 rather than the default of promoting to v32i8.
Summary:
For the most part its better to keep v32i1 as a mask type of a narrower width than trying to promote it to a ymm register.

I had to add some overrides to the methods that get the types for the calling convention so that we still use v32i8 for argument/return purposes.

There are still some regressions in here. I definitely saw some around shuffles. I think we probably should move vXi1 shuffle from lowering to a DAG combine where I think the extend and truncate we have to emit would be better combined.

I think we also need a DAG combine to remove trunc from (extract_vector_elt (trunc))

Overall this removes something like 13000 CHECK lines from lit tests.

Reviewers: zvi, RKSimon, delena, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42031

llvm-svn: 323201
2018-01-23 14:25:39 +00:00
Simon Pilgrim 8ea1a0c690 [X86][SSE] LowerBUILD_VECTORAsVariablePermute - fix PSHUFB source/index operand ordering
As detailed in rL317463, PSHUFB (like most variable shuffle instructions) uses Op[0] for the source vector and Op[1] for the shuffle index vector, VPERMV works in reverse which is probably where the confusion comes from.

Differential Revision: https://reviews.llvm.org/D42380

llvm-svn: 323190
2018-01-23 11:39:06 +00:00
Craig Topper c92edd994e [X86] Don't reorder (srl (and X, C1), C2) if (and X, C1) can be matched as a movzx
Summary:
If we can match as a zero extend there's no need to flip the order to get an encoding benefit. As movzx is 3 bytes with independent source/dest registers. The shortest 'and' we could make is also 3 bytes unless we get lucky in the register allocator and its on AL/AX/EAX which have a 2 byte encoding.

This patch was more impressive before r322957 went in. It removed some of the same Ands that got deleted by that patch.

Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42313

llvm-svn: 323175
2018-01-23 05:45:52 +00:00
Craig Topper 26a701f24f [X86] Various vXi1 insertion improvements.
Add missing patterns for inserting v1i1 into a zero vector. Use insert_subvector to zero upper bits before inserting an element into a vXi1 vector. Replace kshift based isel pattern with insert_subvector based pattern now that code that caused the pattern has been fixed to emit insert_subvector.

llvm-svn: 323173
2018-01-23 05:36:53 +00:00
Chandler Carruth c58f2166ab Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre..
Summary:
First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processors
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.

On 64-bit x86 ABIs, this is especially easily done in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

We also support a target feature that disables emission of the retpoline
thunk by the compiler to allow for custom thunks if users want them.
These are particularly useful in environments like kernels that
routinely do hot-patching on boot and want to hot-patch their thunk to
different code sequences. They can write this custom thunk and use
`-mretpoline-external-thunk` *in addition* to `-mretpoline`. In this
case, on x86-64 thu thunk names must be:
```
  __llvm_external_retpoline_r11
```
or on 32-bit:
```
  __llvm_external_retpoline_eax
  __llvm_external_retpoline_ecx
  __llvm_external_retpoline_edx
  __llvm_external_retpoline_push
```
And the target of the retpoline is passed in the named register, or in
the case of the `push` suffix on the top of the stack via a `pushl`
instruction.

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch `-mretpoline` is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with `-mretpoline` and additionally link the dynamic
executable and all shared libraries with LLD and pass `-z retpolineplt`
(or use similar functionality from some other linker). We strongly
recommend also using `-z now` as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually apply similar transformations to `-mretpoline` to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer

Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D41723

llvm-svn: 323155
2018-01-22 22:05:25 +00:00
Simon Pilgrim 17682a86da [X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding (Reapplied)
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.

Reapplied after rL322279 was reverted at rL322335 due to PR35918, underlying issue was fixed at rL322644.

llvm-svn: 323104
2018-01-22 12:05:17 +00:00
Marina Yatsina 6fc2aaae8d Separate ExecutionDepsFix into 4 parts:
1. ReachingDefsAnalysis - Allows to identify for each instruction what is the “closest” reaching def of a certain register. Used by BreakFalseDeps (for clearance calculation) and ExecutionDomainFix (for arbitrating conflicting domains).
2. ExecutionDomainFix - Changes the variant of the instructions in order to minimize domain crossings.
3. BreakFalseDeps - Breaks false dependencies.
4. LoopTraversal - Creatws a traversal order of the basic blocks that is optimal for loops (introduced in revision L293571). Both ExecutionDomainFix and ReachingDefsAnalysis use this to determine the order they will traverse the basic blocks.

This also included the following changes to ExcecutionDepsFix original logic:
1. BreakFalseDeps and ReachingDefsAnalysis logic no longer restricted by a register class.
2. ReachingDefsAnalysis tracks liveness of reg units instead of reg indices into a given reg class.

Additional changes in affected files:
1. X86 and ARM targets now inherit from ExecutionDomainFix instead of ExecutionDepsFix. BreakFalseDeps also was added to the passes they activate.
2. Comments and references to ExecutionDepsFix replaced with ExecutionDomainFix and BreakFalseDeps, as appropriate.

Additional refactoring changes will follow.

This commit is (almost) NFC.
The only functional change is that now BreakFalseDeps will break dependency for all register classes.
Since no additional instructions were added to the list of instructions that have false dependencies, there is no actual change yet.
In a future commit several instructions (and tests) will be added.

This is the first of multiple patches that fix bugzilla https://bugs.llvm.org/show_bug.cgi?id=33869
Most of the patches are intended at refactoring the existent code.

Additional relevant reviews:
https://reviews.llvm.org/D40331
https://reviews.llvm.org/D40332
https://reviews.llvm.org/D40333
https://reviews.llvm.org/D40334

Differential Revision: https://reviews.llvm.org/D40330

Change-Id: Icaeb75e014eff96a8f721377783f9a3e6c679275
llvm-svn: 323087
2018-01-22 10:05:23 +00:00
Craig Topper 7fddf2bfef [X86] Add an override of targetShrinkDemandedConstant to limit the damage that shrinkdemandedbits can do to zext_in_reg operations
Summary:
This patch adds an implementation of targetShrinkDemandedConstant that tries to keep shrinkdemandedbits from removing bits that would otherwise have been recognized as a movzx.

We still need a follow patch to stop moving ands across srl if the and could be represented as a movzx before the shift but not after. I think this should help with some of the cases that D42088 ended up removing during isel.

Reviewers: spatel, RKSimon

Reviewed By: spatel

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D42265

llvm-svn: 323048
2018-01-20 18:50:09 +00:00
Simon Pilgrim 89540d9665 [X86][SSE] Check for out of bounds PEXTR/PINSR indices during faux shuffle combining.
llvm-svn: 323045
2018-01-20 17:16:01 +00:00
Craig Topper 08bd14803c [X86] Teach X86 codegen to use vector width preference to avoid promoting to 512-bit types when VLX is enabled and the preference is for a smaller size.
This change applies to places where we would turn 128/256-bit code into 512-bit in order to get a wider element type through sext/zext. Any 512-bit types that already existed in the IR/DAG will be left that way.

The width preference has no effect on codegen behavior when the target does not have AVX512 enabled. So AVX/AVX2 codegen cannot be limited via this mechanism yet.

If the preference is lower than 256 we may still use a 256 bit type to do the operation. Constraining to 128 bits makes it much more difficult to support some operations. For many of these cases we need to change element width while keeping element count constant which is easiest done by switching between 256 and 128 bit.

The preference is only obeyed when AVX512 and VLX are available. This means the preference is not obeyed for KNL, but is obeyed for SKX, Cannonlake, and Icelake. For KNL, the only way to do masked operation is on 512-bit registers so we would have to completely disable masking to obey the preference. We would also lose support for gather, scatter, ctlz, vXi64 multiplies, etc. This may change in the future, but this simplifies the initial implementation.

Differential Revision: https://reviews.llvm.org/D41895

llvm-svn: 323016
2018-01-20 00:26:12 +00:00
Craig Topper b70ca5060f [X86] Teach LowerBUILD_VECTOR to recognize pair-wise splats of 32-bit elements and use a 64-bit broadcast
If we are splatting pairs of 32-bit elements, we can use a 64-bit broadcast to get the job done.

We could probably could probably do this with other sizes too, for example four 16-bit elements. Or we could broadcast pairs of 16-bit elements using a 32-bit element broadcast. But I've left that as a future improvement.

I've also restricted this to AVX2 only because we can only broadcast loads under AVX.

Differential Revision: https://reviews.llvm.org/D42086

llvm-svn: 322730
2018-01-17 18:58:22 +00:00
Craig Topper 279ace187a [X86] When legalizing (v64i1 select i8, v64i1, v64i1) make sure not to introduce bitcasts to i64 in 32-bit mode
We legalize selects of masks with scalar conditions using a bitcast to an integer type. But if we are in 32-bit mode we can't convert v64i1 to i64. So instead split the v64i1 to v32i1 and concat it back together. Each half will then be legalized by bitcasting to i32 which is fine.

The test case is a little indirect. If we have the v64i1 select in IR it will get legalized by legalize vector ops which has a run of type legalization after it. That type legalization run is able to fix this i64 bitcast. So in order to avoid that we need a build_vector of a splat which legalize vector ops will ignore. Legalize DAG will then turn that into a select via LowerBUILD_VECTORvXi1. And the select will get legalized. In this case there is no type legalizer run to cleanup the bitcast.

This fixes pr35972.

llvm-svn: 322724
2018-01-17 18:46:01 +00:00
Benjamin Kramer 8d073a2c2d [X86] Don't mutate shuffle arguments after early-out for AVX512
The match* functions have the annoying behavior of modifying its inputs.
Save and restore the inputs, just in case the early out for AVX512 is
hit. This is still not great and its only a matter of time this kind of
bug happens again, but I couldn't come up with a better pattern without
rewriting significant chunks of this code. Fixes PR35977.

llvm-svn: 322644
2018-01-17 13:01:06 +00:00
Benjamin Kramer 05dc3527de [X86] Constify DebugLoc parameters. No functionality change.
llvm-svn: 322643
2018-01-17 13:00:58 +00:00
Craig Topper 77ba1e7c08 [X86] In LowerBUILD_VECTOR, rename ExtVT to EltVT so it makes sense.
llvm-svn: 322616
2018-01-17 03:58:21 +00:00
Simon Pilgrim 3e0aafbfcc [X86][MMX] Accept UNDEF upper bits for MOVD GR32->MMX
llvm-svn: 322574
2018-01-16 17:01:31 +00:00
Simon Pilgrim 85e6139633 [X86][MMX] Improve MMX constant generation
Extend the MMX zero code to take any constant with zero'd upper 32-bits

llvm-svn: 322553
2018-01-16 14:21:28 +00:00
Simon Pilgrim 85bd9141ca [X86][MMX] Add support for MMX zero vector creation
As mentioned on PR35869, (and came up recently on D41517) we don't create a MMX zero register via the PXOR but instead perform a spill to stack from a XMM zero register.

This patch adds support for direct MMX zero vector creation and should make it easier to add better constant vector creation in the future as well.

Differential Revision: https://reviews.llvm.org/D41908

llvm-svn: 322525
2018-01-15 22:32:40 +00:00
Craig Topper 1393ccf949 [X86] Use MVT::getVectorVT instead of EVT::getVectorVT when splitting 256/512 bit build_vectors. NFC
We must be creating a legal type here which means it can be an MVT.

llvm-svn: 322512
2018-01-15 20:33:53 +00:00
Craig Topper aacc622564 [X86] Generalize some code in LowerBUILD_VECTOR. NFC
llvm-svn: 322511
2018-01-15 20:33:52 +00:00
Craig Topper 4f7fadd029 [X86] Remove unnecessary if statement from LowerBUILD_VECTOR. NFCI
We were checking for 128, 256, or 512 bit vectors, but those are the only types that can get here.

llvm-svn: 322510
2018-01-15 20:33:50 +00:00
Simon Pilgrim 9904fe77a0 [X86][SSE] Support combining MOVLHPS undef inputs
llvm-svn: 322459
2018-01-14 18:50:34 +00:00
Craig Topper b2868233b7 [X86] Use ISD::TRUNCATE instead of X86ISD::VTRUNC when input and output types have the same number of elements.
llvm-svn: 322455
2018-01-14 08:11:36 +00:00
Craig Topper 57d58051bb [X86] Add X86ISD::VTRUNC to computeKnownBitsForTargetNode.
We have to take special care to avoid the cases where the result of the truncate would be padded with zero elements.

Ideally we'd just use ISD::TRUNCATE for these cases instead.

llvm-svn: 322454
2018-01-14 08:11:33 +00:00
Craig Topper e9fc0cd920 [X86] Improve legalization of vXi16/vXi8 selects.
Extend vXi1 conditions of vXi8/vXi16 selects even before type legalization gets a chance to split wide vectors. Previously we would only extend 128 and 256 bit vectors. But if we start with a 512 bit vector or wider that needs to be split we wouldn't extend until after the split had taken place. By extending early we improve the results of type legalization.

Don't widen condition of 128/256 bit vXi16/vXi8 selects when we have BWI but not VLX. We can still use a mask register by widening the select to 512-bits instead. This is similar to what we do for compares already.

llvm-svn: 322450
2018-01-14 02:05:51 +00:00
Zvi Rackover 652f9a1896 X86: Add pattern matching for PMADDWD
In addition to the existing match as part of a loop-reduction, add a
straightforward pattern match for DAG-contained patterns.

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D41811

llvm-svn: 322446
2018-01-13 17:42:19 +00:00
Craig Topper 6f109f8c6c [X86] Add DAG combine to promote vXi1 result of a vXi8/vXi16 setcc when we have AVX512 but not BWI.
This avoids having the result type stick around until lowering where we have to extend the setcc and insert a truncate. If we get the types converted early we can do more to optimize it.

llvm-svn: 322432
2018-01-13 06:24:46 +00:00
David L. Jones 8c87213c26 Revert r322279 due to Skylake miscompile.
Summary:
This revision causes Skylake (and apparently, only Skylake) codegen to fail in
certain cases. Details: https://bugs.llvm.org/show_bug.cgi?id=35918

Subscribers: sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D41972

llvm-svn: 322335
2018-01-12 00:17:38 +00:00
Craig Topper 2aac3ee5bc [X86] Legalize 128/256 gathers/scatters on KNL by using widening rather than sign extending the index.
We can just widen the vectors with undef and zero extend the mask.

llvm-svn: 322308
2018-01-11 19:38:30 +00:00
Zvi Rackover 61beca9368 X86: Refactor type-splitting to target-legal size vector to a helper function
Summary: This is a preparatory step for D41811: refactoring code for breaking vector operands of binary operation to legal-types.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41925

llvm-svn: 322296
2018-01-11 17:29:47 +00:00
Simon Pilgrim 6e6da3f449 [X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.

llvm-svn: 322279
2018-01-11 14:25:18 +00:00
Zvi Rackover 3ee66d9cd1 X86: Fix LowerBUILD_VECTORAsVariablePermute for case Src is smaller than Indices
Summary:
As RKSimon suggested in pr35820, in the case that Src is smaller in
bit-size than Indices, need to widen Src to avoid type mismatch.

Fixes pr35820

Reviewers: RKSimon, craig.topper

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41865

llvm-svn: 322272
2018-01-11 12:26:52 +00:00
Craig Topper d1696e8d6c [X86] Fix unused variable in release builds.
llvm-svn: 322262
2018-01-11 07:19:29 +00:00
Craig Topper 0b59034b15 [X86] Optimize v2i32/v2f32 scatters.
If the index is v2i64 we can use the scatter instruction that has v4i32/v4f32 data register, v2i64 index, and v2i1 mask. Similar was already done for gather.

Implement custom widening for v2i32 data to remove the code that reverses type legalization during lowering.

llvm-svn: 322254
2018-01-11 06:31:28 +00:00
Craig Topper af4eb17223 [SelectionDAG][X86] Explicitly store the scale in the gather/scatter ISD nodes
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.

Most of this patch is just making sure we copy the scale around everywhere.

Differential Revision: https://reviews.llvm.org/D40055

llvm-svn: 322210
2018-01-10 19:16:05 +00:00
Simon Pilgrim 8b63227279 [X86][MMX] Pull out common MMX VT test. NFCI.
llvm-svn: 322195
2018-01-10 15:32:19 +00:00
Craig Topper c4d2dd80b6 [X86] Add a DAG combine to combine (sext (setcc)) with VLX
Normally target independent DAG combine would do this combine based on getSetCCResultType, but with VLX getSetCCResultType returns a vXi1 type preventing the DAG combining from kicking in.

But doing this combine can allow us to remove the explicit sign extend that would otherwise be emitted.

This patch adds a target specific DAG combine to combine the sext+setcc when the result type is the same size as the input to the setcc. I've restricted this to FP compares and things that can be represented with PCMPEQ and PCMPGT since we don't have full integer compare support on the older ISAs.

Differential Revision: https://reviews.llvm.org/D41850

llvm-svn: 322101
2018-01-09 18:14:22 +00:00
Craig Topper cc342d465e [X86] Remove llvm.x86.avx512.cvt*2mask.* intrinsics and autoupgrade to (icmp slt X, 0)
I had to drop fast-isel-abort from a test because we can't fast isel some of the mask stuff. When we used intrinsics we implicitly fell back to SelectionDAG for the intrinsic call without triggering the abort error. But with native IR that doesn't happen the same way.

llvm-svn: 322050
2018-01-09 00:50:47 +00:00
Craig Topper f090e8a89a [X86] Replace CVT2MASK ISD opcode with PCMPGTM compared to zero.
CVT2MASK is just checking the sign bit which can be represented with a comparison with zero.

llvm-svn: 321985
2018-01-08 06:53:54 +00:00
Craig Topper a2018e799a [X86] Add patterns to allow 512-bit BWI compare instructions to be used for 128/256-bit compares when VLX is not available.
llvm-svn: 321984
2018-01-08 06:53:52 +00:00
Craig Topper 9f5859e3ee [X86] Simplify some code in lower1BitVectorShuffle by relying on getNode's ability to constant fold vector SIGN_EXTEND.
llvm-svn: 321979
2018-01-07 23:56:37 +00:00
Craig Topper c1ec57c3e2 [X86] Remove unneeded code from combineGatherScatter that used to delte SIGN_EXTEND_INREG nodes created during legalization of v2i1/v4i1 masks on KNL.
v2i1/v4i1 are now legal on KNL so no sign_extend_inreg is generated.

llvm-svn: 321968
2018-01-07 18:34:08 +00:00
Craig Topper d58c165545 [X86] Make v2i1 and v4i1 legal types without VLX
Summary:
There are few oddities that occur due to v1i1, v8i1, v16i1 being legal without v2i1 and v4i1 being legal when we don't have VLX. Particularly during legalization of v2i32/v4i32/v2i64/v4i64 masked gather/scatter/load/store. We end up promoting the mask argument to these during type legalization and then have to widen the promoted type to v8iX/v16iX and truncate it to get the element size back down to v8i1/v16i1 to use a 512-bit operation. Since need to fill the upper bits of the mask we have to fill with 0s at the promoted type.

It would be better if we could just have the v2i1/v4i1 types as legal so they don't undergo any promotion. Then we can just widen with 0s directly in a k register. There are no real v4i1/v2i1 instructions anyway. Everything is done on a larger register anyway.

This also fixes an issue that we couldn't implement a masked vextractf32x4 from zmm to xmm properly.

We now have to support widening more compares to 512-bit to get a mask result out so new tablegen patterns got added.

I had to hack the legalizer for widening the operand of a setcc a bit so it didn't try create a setcc returning v4i32, extract from it, then try to promote it using a sign extend to v2i1. Now we create the setcc with v4i1 if the original setcc's result type is v2i1. Then extract that and don't sign extend it at all.

There's definitely room for improvement with some follow up patches.

Reviewers: RKSimon, zvi, guyblank

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41560

llvm-svn: 321967
2018-01-07 18:20:37 +00:00
Craig Topper 8c2ea74e74 [X86] Call lowerShuffleAsRepeatedMaskAndLanePermute from lowerV4I64VectorShuffle.
llvm-svn: 321929
2018-01-06 06:08:04 +00:00
Craig Topper e6e9c27510 [X86] Remove 'else' after 'return' I forgot to cleanup before committing D41691.
llvm-svn: 321755
2018-01-03 19:15:43 +00:00
Craig Topper 8232e88dd5 [X86] Remove useless custom inserter for 64-bit TAILJMP and TCRETURN opcodes
This custom inserter was added in r124272 at which time it added about bunch of Defs for Win64. In r150708, those defs were removed leaving only the "return BB". So I think this means the custom inserter is a NOP these days.

This patch removes the remaining code and stops tagging the instructions for custom insertion

Differential Revision: https://reviews.llvm.org/D41671

llvm-svn: 321747
2018-01-03 18:20:36 +00:00
Craig Topper cc6637b707 [X86] Use ANY_EXTEND instead of SIGN_EXTEND in lowerMasksToReg
Currently we use SIGN_EXTEND in lowerMasksToReg as part of calling convention setup, but we don't require a specific value for the upper bits.

This patch changes it to ANY_EXTEND which will be lowered as SIGN_EXTEND if it ends up sticking around.

llvm-svn: 321746
2018-01-03 18:11:01 +00:00
Sanjay Patel 9a80871ffe [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons (PR33325)
This is an extension of D31156 with the goal that we'll allow memcmp() == 0 expansion 
for x86 to use 2 pairs of loads per block.

The memcmp expansion pass (formerly part of CGP) will generate this kind of pattern 
with oversized integer compares, so we want to transform these into x86-specific vector
nodes before legalization splits things into scalar chunks.

See PR33325 for more details:
https://bugs.llvm.org/show_bug.cgi?id=33325

Differential Revision: https://reviews.llvm.org/D41618

llvm-svn: 321656
2018-01-02 16:38:29 +00:00
Simon Pilgrim 39f50e103b Strip trailing whitespace. NFCI
llvm-svn: 321644
2018-01-02 12:41:29 +00:00
Craig Topper c8898b3640 [X86] Promote vXi1 fp_to_uint/fp_to_sint to vXi32 to avoid scalarization.
llvm-svn: 321632
2018-01-01 21:12:18 +00:00
Craig Topper e5943bb337 [X86] Replace custom lowering of vXi1 SINT_TO_FP/UINT_TO_FP with promotion.
The custom lowering was just doing the same thing promotion would do.

llvm-svn: 321630
2018-01-01 20:08:43 +00:00
Craig Topper a4f9997675 [SelectionDAG][X86][AArch64] Require targets to specify the promotion type when using setOperationAction Promote for INT_TO_FP and FP_TO_INT
Currently the promotion for these ignores the normal getTypeToPromoteTo and instead just tries to double the element width. This is because the default behavior of getTypeToPromote to just adds 1 to the SimpleVT, which has the affect of increasing the element count while keeping the scalar size the same.

If multiple steps are required to get to a legal operation type, int_to_fp will be promoted multiple times. And fp_to_int will keep trying wider types in a loop until it finds one that works.

getTypeToPromoteTo does have the ability to query a promotion map to get the type and not do the increasing behavior. It seems better to just let the target specify the promotion type in the map explicitly instead of letting the legalizer iterate via widening.

FWIW, it's worth I think for any other vector operations that need to be promoted, we have to specify the type explicitly because the default behavior of getTypeToPromote isn't useful for vectors. The other types of promotion already require either the element count is constant or the total vector width is constant, but neither happens by incrementing the SimpleVT enum.

Differential Revision: https://reviews.llvm.org/D40664

llvm-svn: 321629
2018-01-01 19:21:35 +00:00
Craig Topper 0d35edda90 [X86] In LowerTruncateVecI1, don't add SHL if the input is known to be all sign bits.
If the input is all sign bits then the LSB through MSB are all the same so we don't need to be move the LSB to the MSB.

llvm-svn: 321617
2018-01-01 04:52:58 +00:00
Craig Topper f78b75fb59 [X86] Use CONCAT_VECTORS instead of INSERT_SUBVECTOR for padding v4i1/v2i1 vector to v8i1 pre-legalize.
The CONCAT_VECTORS will be lowered to INSERT_SUBVECTOR later. In the modified cases this seems to be enough to trick a later DAG combine into running in a different order than allows the ANDs to be removed.

I'll admit this is a bit of a hack that happens to work, but using CONCAT_VECTORS is more consistent with other legalization code anyway.

llvm-svn: 321611
2017-12-31 19:17:52 +00:00
Simon Pilgrim b000675374 [X86][AVX2] Combine extract(broadcast(scalar_value)) --> scalar_value
As it has a scalar source we don't treat it as a target shuffle so needs special handling.

llvm-svn: 321610
2017-12-31 18:59:30 +00:00
Simon Pilgrim f205ec716b [X86][SSE] Don't vectorize splat buildvector of binops (PR30780)
Don't combine buildvector(binop(),binop(),binop(),binop()) -> binop(buildvector(), buildvector()) if its a splat - keep the binop scalar and just splat the result to avoid large vector constants.

llvm-svn: 321607
2017-12-31 17:07:47 +00:00
Craig Topper f0f6eefb49 [X86] Add a DAG combine to widen (i4 (bitcast (v4i1))) before type legalization sees the i4 and changes to load/store.
Same for v2i1 and i2.

llvm-svn: 321602
2017-12-31 09:50:38 +00:00
Craig Topper 7f39623533 [X86] Add a DAG combine to fix (v4i1 (bitcast (i4))) before type legalization sees the i4 and changes to load/store.
Same for i2 and v2i1.

llvm-svn: 321601
2017-12-31 08:25:50 +00:00
Craig Topper 876ec0b558 [X86] Prevent combining (v8i1 (bitconvert (i8 load)))->(v8i1 load) if we don't have DQI.
We end up using an i8 load via an isel pattern from v8i1 anyway. This just makes it more explicit. This seems to improve codgen in some cases and I'd like to kill off some of the load patterns.

llvm-svn: 321598
2017-12-31 07:38:41 +00:00
Craig Topper 7ba1b76854 [X86] Fix a crash when returning a <1 x i1> value>
llvm-svn: 321595
2017-12-31 07:38:30 +00:00
Craig Topper 1d0e2e82bc [X86] Cleanup store splitting in LowerTruncatingStore
Use getMemBasePlusOffset and calculate proper pointer info and alignment for the second store.

llvm-svn: 321594
2017-12-31 07:38:26 +00:00
Craig Topper c5fd31a802 [X86] Custom legalize vXi1 extract_subvector with KSHIFTR.
This allows us to remove some isel patterns.

This is mostly NFC, but we now use KSHIFTB instead of KSHIFTW with DQI.

llvm-svn: 321576
2017-12-30 06:45:43 +00:00
Simon Pilgrim c701596e86 [X86][SSE] Match PSHUFLW/PSHUFHW + PSHUFD vXi16 shuffle patterns (PR34686)
As noted in PR34686, we are relying on a PSHUFD+PSHUFLW+PSHUFHW shuffle chain for most general vXi16 unary shuffles.

This patch checks for simpler PSHUFLW+PSHUFD and PSHUFHW+PSHUFD cases beforehand, building on some existing code that just handled splat shuffles.

By doing so we also prevent premature use of PSHUFB shuffles which can be slower and require the creation/loading of constant shuffle masks.

We now have the 'fast-variable-shuffle' option for hardware that prefers combining 2 or more shuffles to VPSHUFB etc.

Differential Revision: https://reviews.llvm.org/D38318

llvm-svn: 321553
2017-12-29 14:41:50 +00:00
Craig Topper 55cf880900 [X86] When lowering extending loads from v2i1/v4i1, if we have VLX, use a narrower extend.
Previously we used an extend from v8i1 to v8i32/v8i64. Then extracted to the final width. But if we have VLX we should extract first. This way we don't end up with an overly large extend.

This allows us to use vcmpeq to make all ones for the sign extend when DQI isn't available. Otherwise we get a VPTERNLOG.

If we make v2i1/v4i1 legal like proposed in D41560, we could always do this and rely on the lowering of the extend to widen when necessary.

llvm-svn: 321538
2017-12-28 19:46:11 +00:00
Craig Topper c0b6cb1e47 [X86] Use ISD::CONCAT_VECTORS when splitting 256-bit loads in combineLoad.
llvm-svn: 321537
2017-12-28 19:46:06 +00:00
Craig Topper 4b311da3a4 [X86] Fix inconsistencies in different places where we split loads/stores.
-Use MinAlign instead of std::min.
-Use SelectionDAG::getMemBasePlusOffset.
-Apply offset to the pointer info for the second load/store created.

llvm-svn: 321536
2017-12-28 19:46:03 +00:00
Craig Topper 05cf1f338f [X86] Emit ISD::TRUNCATE instead of X86ISD::VTRUNC from LowerZERO_EXTEND_Mask/LowerSIGN_EXTEND_Mask.
The truncate will be lowered X86ISD::VTRUNC later.

llvm-svn: 321534
2017-12-28 19:45:58 +00:00
Simon Pilgrim 62411e4d4f [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
If there are 17 or more leading zeros to the v4i32 elements, then we can use PMADD for the integer multiply when PMULLD is unavailable or slow.

The 17 bits need to be zero as the PMADDWD performs a v8i16 signed-mul-extend + pairwise-add - the upper 16 so we're adding a zero pair and the 17th bit so we don't incorrectly sign extend.

Differential Revision: https://reviews.llvm.org/D41484

llvm-svn: 321516
2017-12-28 10:05:49 +00:00
Craig Topper 72bbbeb2a7 [X86] Reimplement r321437 using custom lowering instead of as a DAG combine.
My original implementation ran as a DAG combine post type legalization, but it turns out we don't run that DAG combine step if type legalization didn't change anything. Attempts to make the combine run before type legalization as well hit other issues.

So just do it in LowerMUL where we can catch more cases.

llvm-svn: 321496
2017-12-27 19:09:40 +00:00
Benjamin Kramer 293f34301e [X86] Fix vmul combine for AVX1 targets.
v8i32 is legal von AVX1, but it doesn't have pmuludq for it.

llvm-svn: 321490
2017-12-27 13:31:50 +00:00
Craig Topper 428d87e559 [X86] Return SDValue(N, 0) instead of an SDValue() after a successful combine.
Returning SDValue() means nothing changed, SDValue(N,0) means there was a change but the worklist management was taken care of.

I don't know if this has a real effect other than making sure the combine counter in the DAG combiner gets updated, but it is the correct thing to do.

llvm-svn: 321463
2017-12-26 22:22:58 +00:00
Craig Topper e0b9b5ef2b [X86] Fix typo in assert message.
llvm-svn: 321450
2017-12-26 05:43:02 +00:00
Craig Topper 705fef3ef3 [X86] Add a DAG combines to turn vXi64 muls into VPMULDQ/VPMULUDQ if the upper bits are all sign bits or zeros.
Normally we catch this during lowering, but vXi64 mul is considered legal when we have AVX512DQ.

This DAG combine allows us to avoid PMULLQ with AVX512DQ if we can prove its unnecessary. PMULLQ is 3 uops that take 4 cycles each. While pmuldq/pmuludq is only one 4 cycle uop.

llvm-svn: 321437
2017-12-25 06:47:10 +00:00
Craig Topper fabeb27e36 [X86] Make some helper methods static functions instead. NFC
llvm-svn: 321433
2017-12-25 00:54:53 +00:00
Craig Topper b2cd8485dc [X86] Use SelectionDAG::getFPExtendOrRound to simplify some code.
llvm-svn: 321432
2017-12-25 00:54:51 +00:00
Craig Topper 2d1d9a11c1 [X86] Fix (v2f64 (s/uint_to_fp (v2i1))) to avoid scalarization without AVX512DQ.
Previously we extended v2i1 to v2f64 and then tried to use cvtuqq2pd/cvtqq2pd, but that only works with avx512dq. So we ended up scalarizing it. Now we widen to v4i1 first and extend to v4i32.

llvm-svn: 321420
2017-12-24 06:51:36 +00:00
Craig Topper 62fd123731 [X86] Teach WidenMaskArithmetic to handle any constant buildvector on the RHS not just all zeros/ones.
llvm-svn: 321415
2017-12-24 01:03:31 +00:00
Craig Topper 06dad14797 [X86] Remove type restrictions from WidenMaskArithmetic.
This can help AVX-512 code where mask types are legal allowing us to remove extends and truncates to/from mask types.

llvm-svn: 321408
2017-12-23 18:53:05 +00:00
Craig Topper e79a7a4b2e [X86] In WidenMaskArithmetic, make sure we check the input type of a truncate on N1.
Later in the code we explicitly bypass the truncate so we should be checking its type to make sure that it's safe.

llvm-svn: 321407
2017-12-23 18:53:03 +00:00
Craig Topper dbbbb8532c [X86] Remove unneeded EVT variable. NFC
Immediately after it is created we check if its equal to another EVT. Then we inconsistently use one or the other variables in the code below.

Instead do the equality check directly on the getValueType result and remove the variable. Use the origina VT variable throughout the remaining code.

llvm-svn: 321406
2017-12-23 18:53:01 +00:00
Craig Topper b8e7ab8231 [X86] Pass the right VT to the getZeroExtendInReg introduced in r321398
Apparently we don't have tests for this which I didn't realize before. I'll try to fix that but wanted to fix the obvious bug.

llvm-svn: 321399
2017-12-23 06:52:03 +00:00
Craig Topper ed4a87f6a8 [X86] Use SelectionDAG::getZeroExtendInReg instead of implementing it manually.
llvm-svn: 321398
2017-12-23 02:54:52 +00:00
Craig Topper d6a8f2e67d [SelectionDAG][X86] Don't use ->getValueType(0) after a call to getOperand to get the type of the operand.
getOperand returns an SDValue that contains the node and the result number. There is no guarantee that the result number if 0. By using the -> operator we are calling SDNode::getValueType rather than SDValue::getValueType. This requires supplying a result number and we shouldn't assume it was 0.

I don't have a test case. Just noticed while cleaning up some other code and saw that it occurred in other places.

llvm-svn: 321397
2017-12-23 02:54:50 +00:00
Craig Topper 576335f998 [X86] When lowering insert_vector_elt/extract_vector_elt of vXi1 with a non-constant index just use either a 128-bit type or the vXi8 type with the correct number of elements.
Despite what the comment said there isn't better codegen for 512-bit vectors. The 128/256/512 bit implementation jus stores to memory and loads an element. There's no advantage to doing that with a larger size. In fact in many cases it causes a stack realignment and generates worse code.

llvm-svn: 321369
2017-12-22 17:18:11 +00:00
Craig Topper e268598dd3 [X86] Add prefetchwt1 instruction and overhaul priorities and isel enabling for prefetch instructions.
Previously prefetch was only considered legal if sse was enabled, but it should be supported with 3dnow as well.

The prfchw flag now imply at least some form of prefetch without the write hint is available, either the sse or 3dnow version. This is true even if 3dnow and sse are explicitly disabled.

Similarly prefetchwt1 feature implies availability of prefetchw and the the prefetcht0/1/2/nta instructions. This way we can support _MM_HINT_ET0 using prefetchw and _MM_HINT_ET1 with prefetchwt1. And its assumed that if we have levels for the write hint we would have levels for the non-write hint, thus why we enable the sse prefetch instructions.

I believe this behavior is consistent with gcc. I've updated the prefetch.ll to test all of these combinations.

llvm-svn: 321335
2017-12-22 02:30:30 +00:00
Craig Topper 9befe89367 [X86] Use SIGN_EXTEND to implement ANY_EXTEND from vXi1.
llvm-svn: 321334
2017-12-22 02:30:26 +00:00
Craig Topper 8772228963 [X86] Use SIGN_EXTEND rather than ZERO_EXTEND for lowering extract_vector_elt from vXi1 with a non-const index.
We have a better range of instructions we can use if we can fill with the value i1 value rather than zeroing.

llvm-svn: 321315
2017-12-21 22:08:23 +00:00
Craig Topper 742ac98d01 [X86] When lowering truncates to vXi1, don't sign extend i16/i8 types to 512-bit if we have VLX.
This should only affect what we do for v8i16. Previously we went to v8i64, but if we have VLX we only need v8i32. This prevents an unnecessary zmm usage.

llvm-svn: 321303
2017-12-21 20:45:13 +00:00
Craig Topper 410a289b79 [X86] Promote v8i1 shuffles to v8i32 instead of v8i64 if we have VLX.
We should have equally good shuffle options for v8i32 with VLX. This was spotted during my attempts to remove 512-bit vectors from SKX.

We still use 512-bits for v16i1, v32i1, and v64i1. I'm less sure we can handle those well with narrower vectors. i32 and i64 element sizes get the best shuffle support.

llvm-svn: 321291
2017-12-21 18:44:06 +00:00
Simon Pilgrim 4de5bb093c [X86][SSE] Split large PAVGB/PAVGW vectors to legal widths
Patch to allow detectAVGPattern handle vectors larger than the legal size (128 SSE2, 256 AVX2, 512 AVX512BW), splitting the vectors accordingly.

Differential Revision: https://reviews.llvm.org/D41440

llvm-svn: 321288
2017-12-21 18:12:31 +00:00
Craig Topper 72c22f4366 [X86] Use PSHUFB for v32i16 shuffles before falling back to VPERMW/VPERMI2W.
PSHUFB has the ability to implicitly 0 elements which VPERMI2W can't do. So give a chance to use it first.

llvm-svn: 321251
2017-12-21 08:22:51 +00:00
Craig Topper 38af615b4c [X86] Use VPERMI2B for v16i8 shuffles if we have VBMI+VLX and would have otherwise used two PSHUFBs ORed together.
llvm-svn: 321249
2017-12-21 07:31:30 +00:00
Craig Topper 03b2bc4838 [X86] Use VPERMB/VPERMI2B for v32i8 shuffle lowering if VBMI and VLX are supported.
llvm-svn: 321248
2017-12-21 05:58:31 +00:00
Craig Topper 07820f2fe4 [X86] Remove zext from vXi32 to vXi64 on indices of gather/scatter instructions if we can prove the pre-extended value is positive.
Gather/scatter can implicitly sign extend from i32->i64 on indices. So if we know the sign bit of the input to a zext is 0 we can use the implicit extension.

llvm-svn: 321209
2017-12-20 19:25:33 +00:00
Craig Topper bc92e00f2e [X86] Implement the fusing of MUL+SUBADD to FMSUBADD
This patch turns shuffles of fadd/fsub with fmul into fmsubadd.

Patch by Dmitry Venikov

Differential Revision: https://reviews.llvm.org/D40335

llvm-svn: 321200
2017-12-20 18:05:15 +00:00
Craig Topper abed821c36 [X86] Optimize sign extends on index operand to gather/scatter to not sign extend past i32.
The gather instruction will implicitly sign extend to the pointer width, we don't need to further extend it. This can prevent unnecessary splitting in some cases.

There's still an issue that lowering on non-VLX can introduce another sign extend that doesn't get combined with shifts from a lowered sign_extend_inreg.

llvm-svn: 321152
2017-12-20 07:36:59 +00:00
Craig Topper 158d54d954 [X86] Add a missing return to combineGatherScatter after sucessful combine.
Not sure how to test this cause I think the worst that happens is that we don't revisit the node a second time to look for additional combines. We used UpdateNodeOperands so the updating the DAG work was already done.

llvm-svn: 321148
2017-12-20 06:44:50 +00:00
Craig Topper aee3acb9a8 [X86] Remove code from combineSext that looks for MVT::i1 after operation legalization which can never happen.
Type legalization guarantees this to be impossible since MVT::i1 isn't a legal type.

llvm-svn: 321132
2017-12-20 01:00:01 +00:00
Craig Topper fbdb236a8a [X86] Add an assert to indicate that there is only once specific VT allowed at a certain point in LowerMULH.
Helps with code readability a little.

llvm-svn: 321118
2017-12-19 22:38:09 +00:00
Simon Pilgrim d873b6f6ba [X86][AVX512] Attempt target shuffle combining to different types instead of early-out
We try to prevent shuffle combining to value types that would stop the folding of masked operations, but by just returning early, we were failing to try different shuffle types.

The TODOs are all still relevant here to improve codegen but we're lacking test examples.

llvm-svn: 321085
2017-12-19 16:54:07 +00:00
Simon Pilgrim fd5df639a3 [X86][SSE] Add cpu feature for aggressive combining to variable shuffles
As mentioned in D38318 and D40865, modern Intel processors prefer to combine multiple shuffles to a variable shuffle mask (PSHUFB/VPERMPS etc.) instead of having multiple stage 'fixed' shuffles which put more pressure on Port 5 (at the expense of extra shuffle mask loads).

This patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles to a single variable shuffle (default is 3 shuffles).

The long term aim is to drive more of this from schedule data (probably via the MC) but we're not close to being ready for that yet.

Differential Revision: https://reviews.llvm.org/D41323

llvm-svn: 321074
2017-12-19 13:16:43 +00:00
Simon Pilgrim f6d4ab6daf [X86][SSE] Use (V)PHMINPOSUW for vXi8 SMAX/SMIN/UMAX/UMIN horizontal reductions (PR32841)
Extension to D39729 which performed this for vXi16, with the same bit flipping to handle SMAX/SMIN/UMAX cases, vXi8 UMIN horizontal reductions can be performed.

This makes use of the fact that by performing a pair-wise i8 SHUFFLE/UMIN before PHMINPOSUW, we both get the UMIN of each pair but also zero-extend the upper bits ready for v8i16.

Differential Revision: https://reviews.llvm.org/D41294

llvm-svn: 321070
2017-12-19 12:02:40 +00:00
Craig Topper 13142b10d5 [X86] Don't extend v16i8 non-uniform shifts to v16i32 if we have BWI. Use v16i16 instead.
BWI supports shifting by word amounts. Even if VLX isn't support we can still widen to v32i16 and extract the lower half. For SKX its preferrable to not use 512-bit vector if we can.

llvm-svn: 321059
2017-12-19 06:59:10 +00:00
Craig Topper 6e3091c265 [X86] Use a specific list of MVTs in combineShiftRightArithmetic instead of iterating over every integer VT and checking their size.
Previously, we were checking for MVTs with sizes betwen 8 and 64 which only includes i8, i16, i32, and i64 today. But I don't think we should assume that and should list the types that are legal for x86. I also don't think we need i64 since type legalization is guaranteed to split those up.

llvm-svn: 321058
2017-12-19 06:29:00 +00:00
Craig Topper eb13a418e1 [X86] Remove unnecessary check for integer VT from combineShiftRightArithmetic.
I doubt there's any way to create a ashr for an FP type.

llvm-svn: 321057
2017-12-19 06:28:58 +00:00
Craig Topper da853a9c2f [X86] Remove dead code for turning vector shifts by large amounts into a zero vector.
Pretty sure these are handled by a target independent DAG combine that turns them into undef these days.

llvm-svn: 321056
2017-12-19 05:21:50 +00:00
Craig Topper ad3a554889 [X86] Use ZERO_EXTEND instead of ANY_EXTEND when extending the shift amount for a non-uniform shift.
My reading of the SDM says that all bits of the shift amount are used. If the value of the element is larger than the number of bits the result the shift result is zero. So I think we need to zero_extend here to avoid garbage in the upper bits.

In reality we lower any_extend as zero_extend so in most cases it would be hard to hit this.

llvm-svn: 321055
2017-12-19 04:52:04 +00:00
Matthias Braun a4852d2c19 X86/AArch64/ARM: Factor out common sincos_stret logic; NFCI
Note:
- X86ISelLowering: setLibcallName(SINCOS) was superfluous as
  InitLibcalls() already does it.
- ARMISelLowering: Setting libcallnames for sincos/sincosf seemed
  superfluous as in the darwin case it wouldn't be used while for all
  other cases InitLibcalls already does it.

llvm-svn: 321036
2017-12-18 23:19:42 +00:00
Craig Topper 8e2837cc6e [X86] Fix mistake that I made when splitting up the setOperationAction calls recently.
The block I moved things that need BWI and 512-bit or VLX is incorrectly qualified with just hasBWI || hasVLX. Here I've qualified it with hasBWI && (hasAVX512 || hasVLX) where the hasAVX512 will be replaced with allowing 512-bit vectors in an upcoming patch.

llvm-svn: 320957
2017-12-18 04:50:05 +00:00
Craig Topper fd8d040820 [X86] Make the code that creates fmaddsub from build_vector of extracts and inserts functional and add tests.
Summary:
We had no tests for this and we couldn't do the optimization because of a bad use count check. We need to know how many non-undef pieces of the build vector were filled in and ensure our use count is equal to that. But on the shuffle combine version we need the use count to be 2.

The missing coverage was noticed during the review of D40335.

Reviewers: RKSimon, zvi, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41133

llvm-svn: 320950
2017-12-17 18:23:45 +00:00
Craig Topper ee1e71e576 [X86] Use extract_vector_elt instead of X86ISD::VEXTRACT for isel of vXi1 extractions.
llvm-svn: 320937
2017-12-17 01:35:48 +00:00
Craig Topper c0c2d19e08 [X86] Canonicalize extract_vector_elt from vXi1 to always return MVT::i32.
This allows us to remove some isel patterns that allowed MVT::i8 result type.

llvm-svn: 320936
2017-12-17 01:35:47 +00:00
Craig Topper c609dc8f55 [X86] Don't create X86ISD::VEXTRACT nodes directly. Use EXTRACT_VECTOR_ELT and allow that to be legaized to VEXTRACT.
I think we can remove the VEXTRACT node completely and use a canonicalized EXTRACT_VECTOR_ELT instead. This is a first step.

llvm-svn: 320935
2017-12-17 01:35:44 +00:00
Simon Pilgrim 5c0c93ed4c Fix unused variable warning.
llvm-svn: 320934
2017-12-16 23:37:51 +00:00
Simon Pilgrim 4c9e8215e9 [X86][AVX] lowerVectorShuffleAsBroadcast - aggressively peek through BITCASTs
Assuming we can safely adjust the broadcast index for the new type to keep it suitably aligned, then peek through BITCASTs when looking for the broadcast source.

Fixes PR32007

llvm-svn: 320933
2017-12-16 23:32:18 +00:00
Simon Pilgrim 88c10bc969 [X86][AVX] Use extract128BitVector helper. NFCI.
llvm-svn: 320932
2017-12-16 23:09:57 +00:00
Simon Pilgrim f3b6da00f5 [X86][AVX] Fix failed broadcast fold
Strip excess BITCASTs from EXTRACT_SUBVECTOR input

llvm-svn: 320930
2017-12-16 22:57:17 +00:00
Craig Topper 849b717c86 [X86] Don't pass a zero input to the passthru operand of getVectorMaskingNode/getScalarMaskingNode when its going to emit an ISD::OR/ISD::AND. NFCI
In those cases, the pass thru operand of the methods isn't used. The calls to the scalar version were passing a MVT::i1 zero, which is an illegal type at the stage this code runs.

llvm-svn: 320928
2017-12-16 21:12:24 +00:00
Craig Topper 93253e189c [X86] Have getVectorMaskingNode return an ISD::AND for X86ISD::VPSHUFBITQMB instead of creating a select with one input being 0.
llvm-svn: 320927
2017-12-16 21:12:23 +00:00
Craig Topper 1260a4e826 [X86] When using vpopcntdq for ctpop of v8i16 vectors, only promote to v8i32.
Previously we promoted to v8i64, but we don't need to go all the way to 512-bits. If we have VLX we can use the 256-bit instruction. And even if we don't have VLX we can widen v8i32 to v16i32 and drop the upper half.

llvm-svn: 320926
2017-12-16 19:31:36 +00:00
Craig Topper 1c7d07c601 [X86] Remove unneeded code for handling the old kunpck intrinsics.
llvm-svn: 320917
2017-12-16 06:58:30 +00:00
Matthias Braun f1caa2833f MachineFunction: Return reference from getFunction(); NFC
The Function can never be nullptr so we can return a reference.

llvm-svn: 320884
2017-12-15 22:22:58 +00:00
Craig Topper 422ed23298 [X86] In LowerVectorCTPOP use ISD::ZERO_EXTEND/ISD::TRUNCATE instead of the target specific nodes.
The target independent nodes will get legalized to the target specific nodes by their own legalization process. Someday I'd like to stop using a target specific for zero extends and truncates of legal types so the less places we reference the target specific opcode the better.

llvm-svn: 320863
2017-12-15 21:18:05 +00:00
Craig Topper f08ab74ae3 [X86] Remove unnecessary TODO.
When I wrote it I thought we were missing a potential optimization for KNL. But investigating further shows that for KNL we still do the optimal thing by widening to v4f32 and then using special isel patterns to widen again to zmm a register.

llvm-svn: 320862
2017-12-15 20:57:18 +00:00
Craig Topper 3fb8386685 [SelectionDAG][X86] Fix insert_vector_elt lowering for v32i1/v64i1 with non-constant index
Summary:
Currently we don't handle v32i1/v64i1 insert_vector_elt correctly as we fail to look at the number of elements closely and assume it can only be v16i1 or v8i1.

We also can't type legalize v64i1 insert_vector_elt correctly on KNL due to the type not being byte addressable as required by the legalizing through memory accesses path requires.

For the first issue, the patch now tries to pick a 512-bit register with the correct number of elements and promotes to that.

For the second issue, we now extend the vector to a byte addressable type, do the stores to memory, load the two halves, and then truncate the halves back to the original type. Technically since we changed the type, we may not need two loads, but actually checking that is more work and for the v64i1 case we do need them.

Reviewers: RKSimon, delena, spatel, zvi

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D40942

llvm-svn: 320849
2017-12-15 19:35:22 +00:00
Craig Topper ad9221d684 [X86] Widen (v2i32 (fp_to_uint v2f64)) to (v8i32 (fp_to_uint v8f64)) during legalization if we have AVX512F, but not VLX. NFC
Previously we widened it using isel patterns.

llvm-svn: 320824
2017-12-15 16:22:20 +00:00
Craig Topper 7cfacbf6ea [X86] Fix a couple bugs in my recent changes to vXi1 insert_subvector lowering.
A couple places didn't use the same SDValue variables to connect everything all the way through.

I don't have a test case for a bug in insert into the lower bits of a non-zero, non-undef vector. Not sure the best way to create that. We don't create the case when lowering concat_vectors which is the main way to get insert_subvectors.

llvm-svn: 320790
2017-12-15 07:16:41 +00:00
Craig Topper 1a1e6d6cf6 [X86] Add a TODO about v8i1 CONCAT_VECTORS.
llvm-svn: 320784
2017-12-15 01:03:46 +00:00