Commit Graph

5371 Commits

Author SHA1 Message Date
Craig Topper 2aac3ee5bc [X86] Legalize 128/256 gathers/scatters on KNL by using widening rather than sign extending the index.
We can just widen the vectors with undef and zero extend the mask.

llvm-svn: 322308
2018-01-11 19:38:30 +00:00
Zvi Rackover 61beca9368 X86: Refactor type-splitting to target-legal size vector to a helper function
Summary: This is a preparatory step for D41811: refactoring code for breaking vector operands of binary operation to legal-types.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41925

llvm-svn: 322296
2018-01-11 17:29:47 +00:00
Simon Pilgrim 6e6da3f449 [X86][SSE] Add ISD::VECTOR_SHUFFLE to faux shuffle decoding
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.

llvm-svn: 322279
2018-01-11 14:25:18 +00:00
Zvi Rackover 3ee66d9cd1 X86: Fix LowerBUILD_VECTORAsVariablePermute for case Src is smaller than Indices
Summary:
As RKSimon suggested in pr35820, in the case that Src is smaller in
bit-size than Indices, need to widen Src to avoid type mismatch.

Fixes pr35820

Reviewers: RKSimon, craig.topper

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41865

llvm-svn: 322272
2018-01-11 12:26:52 +00:00
Craig Topper d1696e8d6c [X86] Fix unused variable in release builds.
llvm-svn: 322262
2018-01-11 07:19:29 +00:00
Craig Topper 0b59034b15 [X86] Optimize v2i32/v2f32 scatters.
If the index is v2i64 we can use the scatter instruction that has v4i32/v4f32 data register, v2i64 index, and v2i1 mask. Similar was already done for gather.

Implement custom widening for v2i32 data to remove the code that reverses type legalization during lowering.

llvm-svn: 322254
2018-01-11 06:31:28 +00:00
Craig Topper af4eb17223 [SelectionDAG][X86] Explicitly store the scale in the gather/scatter ISD nodes
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.

Most of this patch is just making sure we copy the scale around everywhere.

Differential Revision: https://reviews.llvm.org/D40055

llvm-svn: 322210
2018-01-10 19:16:05 +00:00
Simon Pilgrim 8b63227279 [X86][MMX] Pull out common MMX VT test. NFCI.
llvm-svn: 322195
2018-01-10 15:32:19 +00:00
Craig Topper c4d2dd80b6 [X86] Add a DAG combine to combine (sext (setcc)) with VLX
Normally target independent DAG combine would do this combine based on getSetCCResultType, but with VLX getSetCCResultType returns a vXi1 type preventing the DAG combining from kicking in.

But doing this combine can allow us to remove the explicit sign extend that would otherwise be emitted.

This patch adds a target specific DAG combine to combine the sext+setcc when the result type is the same size as the input to the setcc. I've restricted this to FP compares and things that can be represented with PCMPEQ and PCMPGT since we don't have full integer compare support on the older ISAs.

Differential Revision: https://reviews.llvm.org/D41850

llvm-svn: 322101
2018-01-09 18:14:22 +00:00
Craig Topper cc342d465e [X86] Remove llvm.x86.avx512.cvt*2mask.* intrinsics and autoupgrade to (icmp slt X, 0)
I had to drop fast-isel-abort from a test because we can't fast isel some of the mask stuff. When we used intrinsics we implicitly fell back to SelectionDAG for the intrinsic call without triggering the abort error. But with native IR that doesn't happen the same way.

llvm-svn: 322050
2018-01-09 00:50:47 +00:00
Craig Topper f090e8a89a [X86] Replace CVT2MASK ISD opcode with PCMPGTM compared to zero.
CVT2MASK is just checking the sign bit which can be represented with a comparison with zero.

llvm-svn: 321985
2018-01-08 06:53:54 +00:00
Craig Topper a2018e799a [X86] Add patterns to allow 512-bit BWI compare instructions to be used for 128/256-bit compares when VLX is not available.
llvm-svn: 321984
2018-01-08 06:53:52 +00:00
Craig Topper 9f5859e3ee [X86] Simplify some code in lower1BitVectorShuffle by relying on getNode's ability to constant fold vector SIGN_EXTEND.
llvm-svn: 321979
2018-01-07 23:56:37 +00:00
Craig Topper c1ec57c3e2 [X86] Remove unneeded code from combineGatherScatter that used to delte SIGN_EXTEND_INREG nodes created during legalization of v2i1/v4i1 masks on KNL.
v2i1/v4i1 are now legal on KNL so no sign_extend_inreg is generated.

llvm-svn: 321968
2018-01-07 18:34:08 +00:00
Craig Topper d58c165545 [X86] Make v2i1 and v4i1 legal types without VLX
Summary:
There are few oddities that occur due to v1i1, v8i1, v16i1 being legal without v2i1 and v4i1 being legal when we don't have VLX. Particularly during legalization of v2i32/v4i32/v2i64/v4i64 masked gather/scatter/load/store. We end up promoting the mask argument to these during type legalization and then have to widen the promoted type to v8iX/v16iX and truncate it to get the element size back down to v8i1/v16i1 to use a 512-bit operation. Since need to fill the upper bits of the mask we have to fill with 0s at the promoted type.

It would be better if we could just have the v2i1/v4i1 types as legal so they don't undergo any promotion. Then we can just widen with 0s directly in a k register. There are no real v4i1/v2i1 instructions anyway. Everything is done on a larger register anyway.

This also fixes an issue that we couldn't implement a masked vextractf32x4 from zmm to xmm properly.

We now have to support widening more compares to 512-bit to get a mask result out so new tablegen patterns got added.

I had to hack the legalizer for widening the operand of a setcc a bit so it didn't try create a setcc returning v4i32, extract from it, then try to promote it using a sign extend to v2i1. Now we create the setcc with v4i1 if the original setcc's result type is v2i1. Then extract that and don't sign extend it at all.

There's definitely room for improvement with some follow up patches.

Reviewers: RKSimon, zvi, guyblank

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41560

llvm-svn: 321967
2018-01-07 18:20:37 +00:00
Craig Topper 8c2ea74e74 [X86] Call lowerShuffleAsRepeatedMaskAndLanePermute from lowerV4I64VectorShuffle.
llvm-svn: 321929
2018-01-06 06:08:04 +00:00
Craig Topper e6e9c27510 [X86] Remove 'else' after 'return' I forgot to cleanup before committing D41691.
llvm-svn: 321755
2018-01-03 19:15:43 +00:00
Craig Topper 8232e88dd5 [X86] Remove useless custom inserter for 64-bit TAILJMP and TCRETURN opcodes
This custom inserter was added in r124272 at which time it added about bunch of Defs for Win64. In r150708, those defs were removed leaving only the "return BB". So I think this means the custom inserter is a NOP these days.

This patch removes the remaining code and stops tagging the instructions for custom insertion

Differential Revision: https://reviews.llvm.org/D41671

llvm-svn: 321747
2018-01-03 18:20:36 +00:00
Craig Topper cc6637b707 [X86] Use ANY_EXTEND instead of SIGN_EXTEND in lowerMasksToReg
Currently we use SIGN_EXTEND in lowerMasksToReg as part of calling convention setup, but we don't require a specific value for the upper bits.

This patch changes it to ANY_EXTEND which will be lowered as SIGN_EXTEND if it ends up sticking around.

llvm-svn: 321746
2018-01-03 18:11:01 +00:00
Sanjay Patel 9a80871ffe [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons (PR33325)
This is an extension of D31156 with the goal that we'll allow memcmp() == 0 expansion 
for x86 to use 2 pairs of loads per block.

The memcmp expansion pass (formerly part of CGP) will generate this kind of pattern 
with oversized integer compares, so we want to transform these into x86-specific vector
nodes before legalization splits things into scalar chunks.

See PR33325 for more details:
https://bugs.llvm.org/show_bug.cgi?id=33325

Differential Revision: https://reviews.llvm.org/D41618

llvm-svn: 321656
2018-01-02 16:38:29 +00:00
Simon Pilgrim 39f50e103b Strip trailing whitespace. NFCI
llvm-svn: 321644
2018-01-02 12:41:29 +00:00
Craig Topper c8898b3640 [X86] Promote vXi1 fp_to_uint/fp_to_sint to vXi32 to avoid scalarization.
llvm-svn: 321632
2018-01-01 21:12:18 +00:00
Craig Topper e5943bb337 [X86] Replace custom lowering of vXi1 SINT_TO_FP/UINT_TO_FP with promotion.
The custom lowering was just doing the same thing promotion would do.

llvm-svn: 321630
2018-01-01 20:08:43 +00:00
Craig Topper a4f9997675 [SelectionDAG][X86][AArch64] Require targets to specify the promotion type when using setOperationAction Promote for INT_TO_FP and FP_TO_INT
Currently the promotion for these ignores the normal getTypeToPromoteTo and instead just tries to double the element width. This is because the default behavior of getTypeToPromote to just adds 1 to the SimpleVT, which has the affect of increasing the element count while keeping the scalar size the same.

If multiple steps are required to get to a legal operation type, int_to_fp will be promoted multiple times. And fp_to_int will keep trying wider types in a loop until it finds one that works.

getTypeToPromoteTo does have the ability to query a promotion map to get the type and not do the increasing behavior. It seems better to just let the target specify the promotion type in the map explicitly instead of letting the legalizer iterate via widening.

FWIW, it's worth I think for any other vector operations that need to be promoted, we have to specify the type explicitly because the default behavior of getTypeToPromote isn't useful for vectors. The other types of promotion already require either the element count is constant or the total vector width is constant, but neither happens by incrementing the SimpleVT enum.

Differential Revision: https://reviews.llvm.org/D40664

llvm-svn: 321629
2018-01-01 19:21:35 +00:00
Craig Topper 0d35edda90 [X86] In LowerTruncateVecI1, don't add SHL if the input is known to be all sign bits.
If the input is all sign bits then the LSB through MSB are all the same so we don't need to be move the LSB to the MSB.

llvm-svn: 321617
2018-01-01 04:52:58 +00:00
Craig Topper f78b75fb59 [X86] Use CONCAT_VECTORS instead of INSERT_SUBVECTOR for padding v4i1/v2i1 vector to v8i1 pre-legalize.
The CONCAT_VECTORS will be lowered to INSERT_SUBVECTOR later. In the modified cases this seems to be enough to trick a later DAG combine into running in a different order than allows the ANDs to be removed.

I'll admit this is a bit of a hack that happens to work, but using CONCAT_VECTORS is more consistent with other legalization code anyway.

llvm-svn: 321611
2017-12-31 19:17:52 +00:00
Simon Pilgrim b000675374 [X86][AVX2] Combine extract(broadcast(scalar_value)) --> scalar_value
As it has a scalar source we don't treat it as a target shuffle so needs special handling.

llvm-svn: 321610
2017-12-31 18:59:30 +00:00
Simon Pilgrim f205ec716b [X86][SSE] Don't vectorize splat buildvector of binops (PR30780)
Don't combine buildvector(binop(),binop(),binop(),binop()) -> binop(buildvector(), buildvector()) if its a splat - keep the binop scalar and just splat the result to avoid large vector constants.

llvm-svn: 321607
2017-12-31 17:07:47 +00:00
Craig Topper f0f6eefb49 [X86] Add a DAG combine to widen (i4 (bitcast (v4i1))) before type legalization sees the i4 and changes to load/store.
Same for v2i1 and i2.

llvm-svn: 321602
2017-12-31 09:50:38 +00:00
Craig Topper 7f39623533 [X86] Add a DAG combine to fix (v4i1 (bitcast (i4))) before type legalization sees the i4 and changes to load/store.
Same for i2 and v2i1.

llvm-svn: 321601
2017-12-31 08:25:50 +00:00
Craig Topper 876ec0b558 [X86] Prevent combining (v8i1 (bitconvert (i8 load)))->(v8i1 load) if we don't have DQI.
We end up using an i8 load via an isel pattern from v8i1 anyway. This just makes it more explicit. This seems to improve codgen in some cases and I'd like to kill off some of the load patterns.

llvm-svn: 321598
2017-12-31 07:38:41 +00:00
Craig Topper 7ba1b76854 [X86] Fix a crash when returning a <1 x i1> value>
llvm-svn: 321595
2017-12-31 07:38:30 +00:00
Craig Topper 1d0e2e82bc [X86] Cleanup store splitting in LowerTruncatingStore
Use getMemBasePlusOffset and calculate proper pointer info and alignment for the second store.

llvm-svn: 321594
2017-12-31 07:38:26 +00:00
Craig Topper c5fd31a802 [X86] Custom legalize vXi1 extract_subvector with KSHIFTR.
This allows us to remove some isel patterns.

This is mostly NFC, but we now use KSHIFTB instead of KSHIFTW with DQI.

llvm-svn: 321576
2017-12-30 06:45:43 +00:00
Simon Pilgrim c701596e86 [X86][SSE] Match PSHUFLW/PSHUFHW + PSHUFD vXi16 shuffle patterns (PR34686)
As noted in PR34686, we are relying on a PSHUFD+PSHUFLW+PSHUFHW shuffle chain for most general vXi16 unary shuffles.

This patch checks for simpler PSHUFLW+PSHUFD and PSHUFHW+PSHUFD cases beforehand, building on some existing code that just handled splat shuffles.

By doing so we also prevent premature use of PSHUFB shuffles which can be slower and require the creation/loading of constant shuffle masks.

We now have the 'fast-variable-shuffle' option for hardware that prefers combining 2 or more shuffles to VPSHUFB etc.

Differential Revision: https://reviews.llvm.org/D38318

llvm-svn: 321553
2017-12-29 14:41:50 +00:00
Craig Topper 55cf880900 [X86] When lowering extending loads from v2i1/v4i1, if we have VLX, use a narrower extend.
Previously we used an extend from v8i1 to v8i32/v8i64. Then extracted to the final width. But if we have VLX we should extract first. This way we don't end up with an overly large extend.

This allows us to use vcmpeq to make all ones for the sign extend when DQI isn't available. Otherwise we get a VPTERNLOG.

If we make v2i1/v4i1 legal like proposed in D41560, we could always do this and rely on the lowering of the extend to widen when necessary.

llvm-svn: 321538
2017-12-28 19:46:11 +00:00
Craig Topper c0b6cb1e47 [X86] Use ISD::CONCAT_VECTORS when splitting 256-bit loads in combineLoad.
llvm-svn: 321537
2017-12-28 19:46:06 +00:00
Craig Topper 4b311da3a4 [X86] Fix inconsistencies in different places where we split loads/stores.
-Use MinAlign instead of std::min.
-Use SelectionDAG::getMemBasePlusOffset.
-Apply offset to the pointer info for the second load/store created.

llvm-svn: 321536
2017-12-28 19:46:03 +00:00
Craig Topper 05cf1f338f [X86] Emit ISD::TRUNCATE instead of X86ISD::VTRUNC from LowerZERO_EXTEND_Mask/LowerSIGN_EXTEND_Mask.
The truncate will be lowered X86ISD::VTRUNC later.

llvm-svn: 321534
2017-12-28 19:45:58 +00:00
Simon Pilgrim 62411e4d4f [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
If there are 17 or more leading zeros to the v4i32 elements, then we can use PMADD for the integer multiply when PMULLD is unavailable or slow.

The 17 bits need to be zero as the PMADDWD performs a v8i16 signed-mul-extend + pairwise-add - the upper 16 so we're adding a zero pair and the 17th bit so we don't incorrectly sign extend.

Differential Revision: https://reviews.llvm.org/D41484

llvm-svn: 321516
2017-12-28 10:05:49 +00:00
Craig Topper 72bbbeb2a7 [X86] Reimplement r321437 using custom lowering instead of as a DAG combine.
My original implementation ran as a DAG combine post type legalization, but it turns out we don't run that DAG combine step if type legalization didn't change anything. Attempts to make the combine run before type legalization as well hit other issues.

So just do it in LowerMUL where we can catch more cases.

llvm-svn: 321496
2017-12-27 19:09:40 +00:00
Benjamin Kramer 293f34301e [X86] Fix vmul combine for AVX1 targets.
v8i32 is legal von AVX1, but it doesn't have pmuludq for it.

llvm-svn: 321490
2017-12-27 13:31:50 +00:00
Craig Topper 428d87e559 [X86] Return SDValue(N, 0) instead of an SDValue() after a successful combine.
Returning SDValue() means nothing changed, SDValue(N,0) means there was a change but the worklist management was taken care of.

I don't know if this has a real effect other than making sure the combine counter in the DAG combiner gets updated, but it is the correct thing to do.

llvm-svn: 321463
2017-12-26 22:22:58 +00:00
Craig Topper e0b9b5ef2b [X86] Fix typo in assert message.
llvm-svn: 321450
2017-12-26 05:43:02 +00:00
Craig Topper 705fef3ef3 [X86] Add a DAG combines to turn vXi64 muls into VPMULDQ/VPMULUDQ if the upper bits are all sign bits or zeros.
Normally we catch this during lowering, but vXi64 mul is considered legal when we have AVX512DQ.

This DAG combine allows us to avoid PMULLQ with AVX512DQ if we can prove its unnecessary. PMULLQ is 3 uops that take 4 cycles each. While pmuldq/pmuludq is only one 4 cycle uop.

llvm-svn: 321437
2017-12-25 06:47:10 +00:00
Craig Topper fabeb27e36 [X86] Make some helper methods static functions instead. NFC
llvm-svn: 321433
2017-12-25 00:54:53 +00:00
Craig Topper b2cd8485dc [X86] Use SelectionDAG::getFPExtendOrRound to simplify some code.
llvm-svn: 321432
2017-12-25 00:54:51 +00:00
Craig Topper 2d1d9a11c1 [X86] Fix (v2f64 (s/uint_to_fp (v2i1))) to avoid scalarization without AVX512DQ.
Previously we extended v2i1 to v2f64 and then tried to use cvtuqq2pd/cvtqq2pd, but that only works with avx512dq. So we ended up scalarizing it. Now we widen to v4i1 first and extend to v4i32.

llvm-svn: 321420
2017-12-24 06:51:36 +00:00
Craig Topper 62fd123731 [X86] Teach WidenMaskArithmetic to handle any constant buildvector on the RHS not just all zeros/ones.
llvm-svn: 321415
2017-12-24 01:03:31 +00:00
Craig Topper 06dad14797 [X86] Remove type restrictions from WidenMaskArithmetic.
This can help AVX-512 code where mask types are legal allowing us to remove extends and truncates to/from mask types.

llvm-svn: 321408
2017-12-23 18:53:05 +00:00
Craig Topper e79a7a4b2e [X86] In WidenMaskArithmetic, make sure we check the input type of a truncate on N1.
Later in the code we explicitly bypass the truncate so we should be checking its type to make sure that it's safe.

llvm-svn: 321407
2017-12-23 18:53:03 +00:00
Craig Topper dbbbb8532c [X86] Remove unneeded EVT variable. NFC
Immediately after it is created we check if its equal to another EVT. Then we inconsistently use one or the other variables in the code below.

Instead do the equality check directly on the getValueType result and remove the variable. Use the origina VT variable throughout the remaining code.

llvm-svn: 321406
2017-12-23 18:53:01 +00:00
Craig Topper b8e7ab8231 [X86] Pass the right VT to the getZeroExtendInReg introduced in r321398
Apparently we don't have tests for this which I didn't realize before. I'll try to fix that but wanted to fix the obvious bug.

llvm-svn: 321399
2017-12-23 06:52:03 +00:00
Craig Topper ed4a87f6a8 [X86] Use SelectionDAG::getZeroExtendInReg instead of implementing it manually.
llvm-svn: 321398
2017-12-23 02:54:52 +00:00
Craig Topper d6a8f2e67d [SelectionDAG][X86] Don't use ->getValueType(0) after a call to getOperand to get the type of the operand.
getOperand returns an SDValue that contains the node and the result number. There is no guarantee that the result number if 0. By using the -> operator we are calling SDNode::getValueType rather than SDValue::getValueType. This requires supplying a result number and we shouldn't assume it was 0.

I don't have a test case. Just noticed while cleaning up some other code and saw that it occurred in other places.

llvm-svn: 321397
2017-12-23 02:54:50 +00:00
Craig Topper 576335f998 [X86] When lowering insert_vector_elt/extract_vector_elt of vXi1 with a non-constant index just use either a 128-bit type or the vXi8 type with the correct number of elements.
Despite what the comment said there isn't better codegen for 512-bit vectors. The 128/256/512 bit implementation jus stores to memory and loads an element. There's no advantage to doing that with a larger size. In fact in many cases it causes a stack realignment and generates worse code.

llvm-svn: 321369
2017-12-22 17:18:11 +00:00
Craig Topper e268598dd3 [X86] Add prefetchwt1 instruction and overhaul priorities and isel enabling for prefetch instructions.
Previously prefetch was only considered legal if sse was enabled, but it should be supported with 3dnow as well.

The prfchw flag now imply at least some form of prefetch without the write hint is available, either the sse or 3dnow version. This is true even if 3dnow and sse are explicitly disabled.

Similarly prefetchwt1 feature implies availability of prefetchw and the the prefetcht0/1/2/nta instructions. This way we can support _MM_HINT_ET0 using prefetchw and _MM_HINT_ET1 with prefetchwt1. And its assumed that if we have levels for the write hint we would have levels for the non-write hint, thus why we enable the sse prefetch instructions.

I believe this behavior is consistent with gcc. I've updated the prefetch.ll to test all of these combinations.

llvm-svn: 321335
2017-12-22 02:30:30 +00:00
Craig Topper 9befe89367 [X86] Use SIGN_EXTEND to implement ANY_EXTEND from vXi1.
llvm-svn: 321334
2017-12-22 02:30:26 +00:00
Craig Topper 8772228963 [X86] Use SIGN_EXTEND rather than ZERO_EXTEND for lowering extract_vector_elt from vXi1 with a non-const index.
We have a better range of instructions we can use if we can fill with the value i1 value rather than zeroing.

llvm-svn: 321315
2017-12-21 22:08:23 +00:00
Craig Topper 742ac98d01 [X86] When lowering truncates to vXi1, don't sign extend i16/i8 types to 512-bit if we have VLX.
This should only affect what we do for v8i16. Previously we went to v8i64, but if we have VLX we only need v8i32. This prevents an unnecessary zmm usage.

llvm-svn: 321303
2017-12-21 20:45:13 +00:00
Craig Topper 410a289b79 [X86] Promote v8i1 shuffles to v8i32 instead of v8i64 if we have VLX.
We should have equally good shuffle options for v8i32 with VLX. This was spotted during my attempts to remove 512-bit vectors from SKX.

We still use 512-bits for v16i1, v32i1, and v64i1. I'm less sure we can handle those well with narrower vectors. i32 and i64 element sizes get the best shuffle support.

llvm-svn: 321291
2017-12-21 18:44:06 +00:00
Simon Pilgrim 4de5bb093c [X86][SSE] Split large PAVGB/PAVGW vectors to legal widths
Patch to allow detectAVGPattern handle vectors larger than the legal size (128 SSE2, 256 AVX2, 512 AVX512BW), splitting the vectors accordingly.

Differential Revision: https://reviews.llvm.org/D41440

llvm-svn: 321288
2017-12-21 18:12:31 +00:00
Craig Topper 72c22f4366 [X86] Use PSHUFB for v32i16 shuffles before falling back to VPERMW/VPERMI2W.
PSHUFB has the ability to implicitly 0 elements which VPERMI2W can't do. So give a chance to use it first.

llvm-svn: 321251
2017-12-21 08:22:51 +00:00
Craig Topper 38af615b4c [X86] Use VPERMI2B for v16i8 shuffles if we have VBMI+VLX and would have otherwise used two PSHUFBs ORed together.
llvm-svn: 321249
2017-12-21 07:31:30 +00:00
Craig Topper 03b2bc4838 [X86] Use VPERMB/VPERMI2B for v32i8 shuffle lowering if VBMI and VLX are supported.
llvm-svn: 321248
2017-12-21 05:58:31 +00:00
Craig Topper 07820f2fe4 [X86] Remove zext from vXi32 to vXi64 on indices of gather/scatter instructions if we can prove the pre-extended value is positive.
Gather/scatter can implicitly sign extend from i32->i64 on indices. So if we know the sign bit of the input to a zext is 0 we can use the implicit extension.

llvm-svn: 321209
2017-12-20 19:25:33 +00:00
Craig Topper bc92e00f2e [X86] Implement the fusing of MUL+SUBADD to FMSUBADD
This patch turns shuffles of fadd/fsub with fmul into fmsubadd.

Patch by Dmitry Venikov

Differential Revision: https://reviews.llvm.org/D40335

llvm-svn: 321200
2017-12-20 18:05:15 +00:00
Craig Topper abed821c36 [X86] Optimize sign extends on index operand to gather/scatter to not sign extend past i32.
The gather instruction will implicitly sign extend to the pointer width, we don't need to further extend it. This can prevent unnecessary splitting in some cases.

There's still an issue that lowering on non-VLX can introduce another sign extend that doesn't get combined with shifts from a lowered sign_extend_inreg.

llvm-svn: 321152
2017-12-20 07:36:59 +00:00
Craig Topper 158d54d954 [X86] Add a missing return to combineGatherScatter after sucessful combine.
Not sure how to test this cause I think the worst that happens is that we don't revisit the node a second time to look for additional combines. We used UpdateNodeOperands so the updating the DAG work was already done.

llvm-svn: 321148
2017-12-20 06:44:50 +00:00
Craig Topper aee3acb9a8 [X86] Remove code from combineSext that looks for MVT::i1 after operation legalization which can never happen.
Type legalization guarantees this to be impossible since MVT::i1 isn't a legal type.

llvm-svn: 321132
2017-12-20 01:00:01 +00:00
Craig Topper fbdb236a8a [X86] Add an assert to indicate that there is only once specific VT allowed at a certain point in LowerMULH.
Helps with code readability a little.

llvm-svn: 321118
2017-12-19 22:38:09 +00:00
Simon Pilgrim d873b6f6ba [X86][AVX512] Attempt target shuffle combining to different types instead of early-out
We try to prevent shuffle combining to value types that would stop the folding of masked operations, but by just returning early, we were failing to try different shuffle types.

The TODOs are all still relevant here to improve codegen but we're lacking test examples.

llvm-svn: 321085
2017-12-19 16:54:07 +00:00
Simon Pilgrim fd5df639a3 [X86][SSE] Add cpu feature for aggressive combining to variable shuffles
As mentioned in D38318 and D40865, modern Intel processors prefer to combine multiple shuffles to a variable shuffle mask (PSHUFB/VPERMPS etc.) instead of having multiple stage 'fixed' shuffles which put more pressure on Port 5 (at the expense of extra shuffle mask loads).

This patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles to a single variable shuffle (default is 3 shuffles).

The long term aim is to drive more of this from schedule data (probably via the MC) but we're not close to being ready for that yet.

Differential Revision: https://reviews.llvm.org/D41323

llvm-svn: 321074
2017-12-19 13:16:43 +00:00
Simon Pilgrim f6d4ab6daf [X86][SSE] Use (V)PHMINPOSUW for vXi8 SMAX/SMIN/UMAX/UMIN horizontal reductions (PR32841)
Extension to D39729 which performed this for vXi16, with the same bit flipping to handle SMAX/SMIN/UMAX cases, vXi8 UMIN horizontal reductions can be performed.

This makes use of the fact that by performing a pair-wise i8 SHUFFLE/UMIN before PHMINPOSUW, we both get the UMIN of each pair but also zero-extend the upper bits ready for v8i16.

Differential Revision: https://reviews.llvm.org/D41294

llvm-svn: 321070
2017-12-19 12:02:40 +00:00
Craig Topper 13142b10d5 [X86] Don't extend v16i8 non-uniform shifts to v16i32 if we have BWI. Use v16i16 instead.
BWI supports shifting by word amounts. Even if VLX isn't support we can still widen to v32i16 and extract the lower half. For SKX its preferrable to not use 512-bit vector if we can.

llvm-svn: 321059
2017-12-19 06:59:10 +00:00
Craig Topper 6e3091c265 [X86] Use a specific list of MVTs in combineShiftRightArithmetic instead of iterating over every integer VT and checking their size.
Previously, we were checking for MVTs with sizes betwen 8 and 64 which only includes i8, i16, i32, and i64 today. But I don't think we should assume that and should list the types that are legal for x86. I also don't think we need i64 since type legalization is guaranteed to split those up.

llvm-svn: 321058
2017-12-19 06:29:00 +00:00
Craig Topper eb13a418e1 [X86] Remove unnecessary check for integer VT from combineShiftRightArithmetic.
I doubt there's any way to create a ashr for an FP type.

llvm-svn: 321057
2017-12-19 06:28:58 +00:00
Craig Topper da853a9c2f [X86] Remove dead code for turning vector shifts by large amounts into a zero vector.
Pretty sure these are handled by a target independent DAG combine that turns them into undef these days.

llvm-svn: 321056
2017-12-19 05:21:50 +00:00
Craig Topper ad3a554889 [X86] Use ZERO_EXTEND instead of ANY_EXTEND when extending the shift amount for a non-uniform shift.
My reading of the SDM says that all bits of the shift amount are used. If the value of the element is larger than the number of bits the result the shift result is zero. So I think we need to zero_extend here to avoid garbage in the upper bits.

In reality we lower any_extend as zero_extend so in most cases it would be hard to hit this.

llvm-svn: 321055
2017-12-19 04:52:04 +00:00
Matthias Braun a4852d2c19 X86/AArch64/ARM: Factor out common sincos_stret logic; NFCI
Note:
- X86ISelLowering: setLibcallName(SINCOS) was superfluous as
  InitLibcalls() already does it.
- ARMISelLowering: Setting libcallnames for sincos/sincosf seemed
  superfluous as in the darwin case it wouldn't be used while for all
  other cases InitLibcalls already does it.

llvm-svn: 321036
2017-12-18 23:19:42 +00:00
Craig Topper 8e2837cc6e [X86] Fix mistake that I made when splitting up the setOperationAction calls recently.
The block I moved things that need BWI and 512-bit or VLX is incorrectly qualified with just hasBWI || hasVLX. Here I've qualified it with hasBWI && (hasAVX512 || hasVLX) where the hasAVX512 will be replaced with allowing 512-bit vectors in an upcoming patch.

llvm-svn: 320957
2017-12-18 04:50:05 +00:00
Craig Topper fd8d040820 [X86] Make the code that creates fmaddsub from build_vector of extracts and inserts functional and add tests.
Summary:
We had no tests for this and we couldn't do the optimization because of a bad use count check. We need to know how many non-undef pieces of the build vector were filled in and ensure our use count is equal to that. But on the shuffle combine version we need the use count to be 2.

The missing coverage was noticed during the review of D40335.

Reviewers: RKSimon, zvi, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D41133

llvm-svn: 320950
2017-12-17 18:23:45 +00:00
Craig Topper ee1e71e576 [X86] Use extract_vector_elt instead of X86ISD::VEXTRACT for isel of vXi1 extractions.
llvm-svn: 320937
2017-12-17 01:35:48 +00:00
Craig Topper c0c2d19e08 [X86] Canonicalize extract_vector_elt from vXi1 to always return MVT::i32.
This allows us to remove some isel patterns that allowed MVT::i8 result type.

llvm-svn: 320936
2017-12-17 01:35:47 +00:00
Craig Topper c609dc8f55 [X86] Don't create X86ISD::VEXTRACT nodes directly. Use EXTRACT_VECTOR_ELT and allow that to be legaized to VEXTRACT.
I think we can remove the VEXTRACT node completely and use a canonicalized EXTRACT_VECTOR_ELT instead. This is a first step.

llvm-svn: 320935
2017-12-17 01:35:44 +00:00
Simon Pilgrim 5c0c93ed4c Fix unused variable warning.
llvm-svn: 320934
2017-12-16 23:37:51 +00:00
Simon Pilgrim 4c9e8215e9 [X86][AVX] lowerVectorShuffleAsBroadcast - aggressively peek through BITCASTs
Assuming we can safely adjust the broadcast index for the new type to keep it suitably aligned, then peek through BITCASTs when looking for the broadcast source.

Fixes PR32007

llvm-svn: 320933
2017-12-16 23:32:18 +00:00
Simon Pilgrim 88c10bc969 [X86][AVX] Use extract128BitVector helper. NFCI.
llvm-svn: 320932
2017-12-16 23:09:57 +00:00
Simon Pilgrim f3b6da00f5 [X86][AVX] Fix failed broadcast fold
Strip excess BITCASTs from EXTRACT_SUBVECTOR input

llvm-svn: 320930
2017-12-16 22:57:17 +00:00
Craig Topper 849b717c86 [X86] Don't pass a zero input to the passthru operand of getVectorMaskingNode/getScalarMaskingNode when its going to emit an ISD::OR/ISD::AND. NFCI
In those cases, the pass thru operand of the methods isn't used. The calls to the scalar version were passing a MVT::i1 zero, which is an illegal type at the stage this code runs.

llvm-svn: 320928
2017-12-16 21:12:24 +00:00
Craig Topper 93253e189c [X86] Have getVectorMaskingNode return an ISD::AND for X86ISD::VPSHUFBITQMB instead of creating a select with one input being 0.
llvm-svn: 320927
2017-12-16 21:12:23 +00:00
Craig Topper 1260a4e826 [X86] When using vpopcntdq for ctpop of v8i16 vectors, only promote to v8i32.
Previously we promoted to v8i64, but we don't need to go all the way to 512-bits. If we have VLX we can use the 256-bit instruction. And even if we don't have VLX we can widen v8i32 to v16i32 and drop the upper half.

llvm-svn: 320926
2017-12-16 19:31:36 +00:00
Craig Topper 1c7d07c601 [X86] Remove unneeded code for handling the old kunpck intrinsics.
llvm-svn: 320917
2017-12-16 06:58:30 +00:00
Matthias Braun f1caa2833f MachineFunction: Return reference from getFunction(); NFC
The Function can never be nullptr so we can return a reference.

llvm-svn: 320884
2017-12-15 22:22:58 +00:00
Craig Topper 422ed23298 [X86] In LowerVectorCTPOP use ISD::ZERO_EXTEND/ISD::TRUNCATE instead of the target specific nodes.
The target independent nodes will get legalized to the target specific nodes by their own legalization process. Someday I'd like to stop using a target specific for zero extends and truncates of legal types so the less places we reference the target specific opcode the better.

llvm-svn: 320863
2017-12-15 21:18:05 +00:00
Craig Topper f08ab74ae3 [X86] Remove unnecessary TODO.
When I wrote it I thought we were missing a potential optimization for KNL. But investigating further shows that for KNL we still do the optimal thing by widening to v4f32 and then using special isel patterns to widen again to zmm a register.

llvm-svn: 320862
2017-12-15 20:57:18 +00:00
Craig Topper 3fb8386685 [SelectionDAG][X86] Fix insert_vector_elt lowering for v32i1/v64i1 with non-constant index
Summary:
Currently we don't handle v32i1/v64i1 insert_vector_elt correctly as we fail to look at the number of elements closely and assume it can only be v16i1 or v8i1.

We also can't type legalize v64i1 insert_vector_elt correctly on KNL due to the type not being byte addressable as required by the legalizing through memory accesses path requires.

For the first issue, the patch now tries to pick a 512-bit register with the correct number of elements and promotes to that.

For the second issue, we now extend the vector to a byte addressable type, do the stores to memory, load the two halves, and then truncate the halves back to the original type. Technically since we changed the type, we may not need two loads, but actually checking that is more work and for the v64i1 case we do need them.

Reviewers: RKSimon, delena, spatel, zvi

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D40942

llvm-svn: 320849
2017-12-15 19:35:22 +00:00
Craig Topper ad9221d684 [X86] Widen (v2i32 (fp_to_uint v2f64)) to (v8i32 (fp_to_uint v8f64)) during legalization if we have AVX512F, but not VLX. NFC
Previously we widened it using isel patterns.

llvm-svn: 320824
2017-12-15 16:22:20 +00:00
Craig Topper 7cfacbf6ea [X86] Fix a couple bugs in my recent changes to vXi1 insert_subvector lowering.
A couple places didn't use the same SDValue variables to connect everything all the way through.

I don't have a test case for a bug in insert into the lower bits of a non-zero, non-undef vector. Not sure the best way to create that. We don't create the case when lowering concat_vectors which is the main way to get insert_subvectors.

llvm-svn: 320790
2017-12-15 07:16:41 +00:00
Craig Topper 1a1e6d6cf6 [X86] Add a TODO about v8i1 CONCAT_VECTORS.
llvm-svn: 320784
2017-12-15 01:03:46 +00:00
Craig Topper 5ebf3ac9c2 [X86] Further rearrange the setOperationAction calls to separate the ones that require 512-bit registers OR VLX into separate sections. NFCI
We have several instructions that were introduced in AVX512F that are only available in 512-bit form on KNL. We still make use of them for 128/256 by artificially widening and extracting during isel.

This commit separates these operations from the true 512-bit operations. This way we can qualify the normal 512-bit operations with needing 512-bit register support. And these special operations will get qualified with needing 512-bit registers OR VLX.

The 512-bit register qualification will be introduced in a future patch this just gets everything grouped to minimize deltas on that patch.

llvm-svn: 320782
2017-12-15 01:03:43 +00:00
Craig Topper 07a28f777e [X86] Group setOperationActions related to vXi1 masks together. NFCI
Previously they were sort of interleaved in with XMM/YMM/ZMM action related code.

Trying to separate things so its easier to split 512-bit vectors later.

llvm-svn: 320781
2017-12-15 01:03:42 +00:00
Craig Topper b89bc20a64 [X86] Make ISD::INSERT_SUBVECTOR v8i1 legal with AVX512F because we should be custom lowering inserting v1i1 into v8i1 under this.
I don't have a test case at the moment. Just noticed while auditing things.

llvm-svn: 320780
2017-12-15 01:03:40 +00:00
Craig Topper 212070486d [X86] Move some of the hasVLX qualified code out of the main hasAVX512 block in the X86ISelLowering constructor. NFCI
Move it into the separate hasVLX block later in the constructor.

I'm trying to separate 128/256 and 512-bit related code so we can eventually qualify the hasAVX512 block with support for 512-bit vectors required by the prefer-vector-width feature support being talked about in D41096.

llvm-svn: 320779
2017-12-15 01:03:38 +00:00
Craig Topper 4341a7b08c [X86] Remove an unnecessary SmallVector that was collecting chains for two SDNode's we're still holding SDValues for. NFCI
We can just get the chains from those SDValues to create the TokenFactor.

llvm-svn: 320757
2017-12-14 22:50:10 +00:00
Matt Arsenault 7d7adf4f2e TLI: Allow using PSV for intrinsic mem operands
llvm-svn: 320756
2017-12-14 22:34:10 +00:00
Zachary Turner 260fe3eca6 Fix many -Wsign-compare and -Wtautological-constant-compare warnings.
Most of the -Wsign-compare warnings are due to the fact that
enums are signed by default in the MS ABI, while the
tautological comparison warnings trigger on x86 builds where
sizeof(size_t) is 4 bytes, so N > numeric_limits<unsigned>::max()
is always false.

Differential Revision: https://reviews.llvm.org/D41256

llvm-svn: 320750
2017-12-14 22:07:03 +00:00
Matt Arsenault 1117133687 DAG: Expose all MMO flags in getTgtMemIntrinsic
Rather than adding more bits to express every
MMO flag you could want, just directly use the
MMO flags. Also fixes using a bunch of bool arguments to
getMemIntrinsicNode.

On AMDGPU, buffer and image intrinsics should always
have MODereferencable set, but currently there is no
way to do that directly during the initial intrinsic
lowering.

llvm-svn: 320746
2017-12-14 21:39:51 +00:00
Craig Topper 600f1ba333 [X86] Don't zero the upper bits of the k-register before extracting a single bit from a vXi1.
This doesn't match the semantics of the extract_vector_elt operation. Nothing downstream knows the bits were zeroed so they still get masked or sign extended after the extrat anyway.

llvm-svn: 320723
2017-12-14 18:35:25 +00:00
Michael Zuckerman 19fd217eaa [AVX512] Adding support for load truncate store of I1
store operation on a truncated memory (load) of vXi1 is poorly supported by LLVM and most of the time end with an assertion.
This patch fixes this issue.

Differential Revision: https://reviews.llvm.org/D39547

Change-Id: Ida5523dd09c1ad384acc0a27e9e59273d28cbdc9
llvm-svn: 320691
2017-12-14 11:55:50 +00:00
Craig Topper 8cdf7c0e68 [X86] Make ANY_EXTEND from vXi1 Custom for more types.
We should be able to support ANY_EXTEND for any types we support ZERO_EXTEND for.

llvm-svn: 320675
2017-12-14 08:26:00 +00:00
Craig Topper 271a5c72a0 [X86] Remove redundant setOperationAction calls.
These calls already exist earlier under AVX2 feature.

llvm-svn: 320673
2017-12-14 08:25:53 +00:00
Simon Pilgrim f51f4d3623 [X86][SSE] MOVMSK only uses the sign bit from each vector element
Pass the input vector through SimplifyDemandedBits as we only need the sign bit from each vector element of MOVMSK

We'd probably get more hits if SimplifyDemandedBits was better at handling vectors...

Differential Revision: https://reviews.llvm.org/D41119

llvm-svn: 320570
2017-12-13 11:43:14 +00:00
Craig Topper 712a209db9 [X86] Add a couple TODOs about missing coverage/features motivated by D40335
D40335 was wanting to add FMSUBADD support, but it discovered that there are two pieces of code to make FMADDSUB and only one of those is tested. So I've asked that review to implement the one path until we get tests that test the existing code.

llvm-svn: 320507
2017-12-12 18:39:04 +00:00
Nirav Dave 674d053d18 [X86] Cleanup type conversion of 64-bit load-store pairs.
Summary:
Simplify and generalize chain handling and search for 64-bit load-store pairs.
Nontemporal test now converts 64-bit integer load-store into f64 which it realizes directly instead of splitting into two i32 pairs.

Reviewers: craig.topper, spatel

Reviewed By: craig.topper

Subscribers: hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D40918

llvm-svn: 320505
2017-12-12 18:25:48 +00:00
Ayman Musa c2eed926b0 [X86] Recognize constant arrays with special values and replace loads from it with subtract and shift instructions, which then will be replaced by X86 BZHI machine instruction.
Recognize constant arrays with the following values:
  0x0, 0x1, 0x3, 0x7, 0xF, 0x1F, .... , 2^(size - 1) -1
where //size// is the size of the array.

the result of a load with index //idx// from this array is equivalent to the result of the following:
  (0xFFFFFFFF >> (sub 32, idx))             (assuming the array of type 32-bit integer).

And the result of an 'AND' operation on the returned value of such a load and another input, is exactly equivalent to the X86 BZHI instruction behavior.

See test cases in the LIT test for better understanding.

Differential Revision: https://reviews.llvm.org/D34141

llvm-svn: 320481
2017-12-12 14:13:51 +00:00
Craig Topper 5ac75d5628 [X86] Improve lowering of vXi1 insert_subvectors to better utilize (insert_subvector zero, vec, 0) for zeroing upper bits.
This can be better recognized during isel when the producer already zeroed the upper bits.

llvm-svn: 320267
2017-12-09 22:44:42 +00:00
Craig Topper 504534514c [X86] Don't use getTargetConstant for all 0s and all 1s mask vector.
llvm-svn: 320260
2017-12-09 19:18:30 +00:00
Craig Topper 6504a8f888 [X86] When inserting into the upper bits of a vXi1 vector, make sure we shift enough bits if we widened the vector.
We may need to widen the vector to make the shifts legal, but if we do that we need to make sure we shift left/right after accounting for the new size. If not we can't guarantee we are shifting in zeros.

The test cases affected actually show cases where we should move the shifts all together, but that's another problem.

llvm-svn: 320248
2017-12-09 08:19:07 +00:00
Craig Topper b3e14ce90c [X86] Improve lowering of concats of mask vectors to better optimize zero vector inputs.
We were previously using kunpck with zero inputs unnecessarily. And we had cases where we would insert into a zero vector and then insert into larger zero vector incurring two sets of shifts.

llvm-svn: 320244
2017-12-09 07:02:19 +00:00
Craig Topper 7f0d456ef8 [X86] Teach lowering to only let through (insert_subvector (vXi1 zeros), subvec, 0) for vector sizes that have native KSHIFT support.
For narrow sizes we'll widen the zero vector and widen the insert. Then do an extract_subvector to get back down to correct size.

This allows us to remove some patterns from the isel table that had to COPY_TO_REGCLASS to an oversized register, do the shift and then COPY_TO_REGCLASS back to the narrow register. Now this is represented explicitly in the DAG.

This seems to have perturbed the register allocation in one of the tests, but the number of instructions didn't change.

llvm-svn: 320190
2017-12-08 20:10:33 +00:00
Sanjay Patel d4468912b0 [x86] use hasAVX2() rather than hasInt256(); NFC
These are aliases, but the thing we're checking here is that the target has
vpsllv*, not that the data type is 256-bit. Those instructions exist for
128-bit vectors too...but sadly, not for all element sizes.

llvm-svn: 320170
2017-12-08 18:35:51 +00:00
Craig Topper 037115c29f [X86] Always consider inserting a vXi1 vector into the lsbs of a zero vector to be legal during lowering. Add isel patterns to emit shifts.
Previously we only allowed these through if the subvector came from a compare or test instruction which we would again check for during isel.

With this change we only check for the compare and test instructions during isel and have fallback patterns that emit the shifts if needed.

I noticed that in a lot of cases we don't actually see the compare during lowering and rely on an odd legalization of concat_vectors with a zero vector as the second argument. This keeps the concat_vectors around long enough for a later dag combine to expose the compare then we re-legalize the concat_vectors and catch the compare.

llvm-svn: 320134
2017-12-08 08:10:58 +00:00
Craig Topper 323ba39f10 [X86] Handle alls version of vXi1 insert_vector_elt with a constant index without falling back to shuffles.
We previously only supported inserting to the LSB or MSB where it was easy to zero to perform an OR to insert.

This change effectively extracts the old value and the new value, xors them together and then xors that single bit with the correct location in the original vector. This will cancel out the old value in the first xor leaving the new value in the position.

The way I've implemented this uses 3 shifts and two xors and uses an additional register. We can avoid the additional register at the cost of another shift.

llvm-svn: 320120
2017-12-08 00:16:09 +00:00
Craig Topper fd86b3cf22 [X86] Fix indentation. NFC
llvm-svn: 320119
2017-12-08 00:15:57 +00:00
Craig Topper dfc79c7c33 [X86] Fix InsertBitToMaskVector to only issue KSHIFTS of native size so that upper bits are properly zeroed.
There's no v2i1 or v4i1 kshift, and v8i1 is only supported with AVXDQ. Isel has fake patterns to extend these types to native shifts, but makes no guarantees about the value of any bits shifted in when shifting right.

This patch promotes the vector to a type that supports a native shift first and only allows inserting into the msb of a native sized shift.

I've constructed this in a way that doesn't do the promotion if we're going to fallback to using a xmm/ymm/zmm shuffle. I think I have a plan to remove the shuffle fall back entirely. In which case we this can be simplified, but I wanted to fix the correctness issue first.

llvm-svn: 320081
2017-12-07 20:10:04 +00:00
Craig Topper 7b8fa5f782 [X86] Fix typo in variable name. NFC
llvm-svn: 320080
2017-12-07 20:10:01 +00:00
Craig Topper b67e5da89b [X86] Make a couple helper lowering methods static.
llvm-svn: 320079
2017-12-07 20:09:55 +00:00
Benjamin Kramer 1e9bf765a1 [X86] Avoid unused variable warning in Release builds. NFCI.
llvm-svn: 319891
2017-12-06 13:32:36 +00:00
Craig Topper 3275eb7a68 [X86] Split 512-bit vector extends from types other than vXi1 out of LowerZERO_EXTEND_AVX512/LowerSIGN_EXTEND_AVX512. NFCI
Most of the code in these routines is for handling extends from vXi1 types. The 512-bit handling for other extends is very much like the AVX2 code. So make the special routines just do vXi1 types and move the other 512-bit handling to the place that handles AVX2.

llvm-svn: 319878
2017-12-06 07:37:20 +00:00
Craig Topper 647e4f590f [X86] Update to getSetCCResultType to be more robust to EVT types.
Attempt to determine what the type will be legalized to and then analyze that to see if we will be able to use a vXi1 compare.

llvm-svn: 319861
2017-12-06 00:15:17 +00:00
Hans Wennborg 5df9f0878b Re-commit r319490 "XOR the frame pointer with the stack cookie when protecting the stack"
The patch originally broke Chromium (crbug.com/791714) due to its failing to
specify that the new pseudo instructions clobber EFLAGS. This commit fixes
that.

> Summary: This strengthens the guard and matches MSVC.
>
> Reviewers: hans, etienneb
>
> Subscribers: hiraditya, JDevlieghere, vlad.tsyrklevich, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D40622

llvm-svn: 319824
2017-12-05 20:22:20 +00:00
Jina Nahias 51c1a627c2 [x86][AVX512] Lowering kunpack intrinsics to LLVM IR
This patch, together with a matching clang patch (https://reviews.llvm.org/D39719), implements the lowering of X86 kunpack intrinsics to IR.

Differential Revision: https://reviews.llvm.org/D39720

Change-Id: I4088d9428478f9457f6afddc90bd3d66b3daf0a1
llvm-svn: 319778
2017-12-05 15:42:56 +00:00
Craig Topper a404ce955a [X86] Use vector widening to support sign extend from i1 when the dest type is not 512-bits and vlx is not enabled.
Previously we used a wider element type and truncated. But its more efficient to keep the element type and drop unused elements.

If BWI isn't supported and we have a i16 or i8 type, we'll extend it to be i32 and still use a truncate.

llvm-svn: 319740
2017-12-05 06:37:21 +00:00
Craig Topper e1ba2450c2 [X86] Fix a crash if avx512bw and xop are both enabled when the IR contrains a v32i8 bitreverse.
llvm-svn: 319737
2017-12-05 04:47:12 +00:00
Craig Topper 276c770e57 [X86] Use vector widening to support zero extend from i1 when the dest type is not 512-bits and vlx is not enabled.
Previously we used a wider element type and truncated. But its more efficient to keep the element type and drop unused elements.

If BWI isn't supported and we have a i16 or i8 type, we'll extend it to be i32 and still use a truncate.

llvm-svn: 319728
2017-12-05 01:45:46 +00:00
Craig Topper 913b42b0e1 [X86] Don't use kunpck for vXi1 concat_vectors if the upper bits are undef.
This can be efficiently selected by a COPY_TO_REGCLASS without the need for an extra instruction.

llvm-svn: 319726
2017-12-05 01:28:06 +00:00
Craig Topper 6302012442 [X86] Use getZeroVector and remove an unnecessary creation of an APInt before calling getConstant. NFCI
The getConstant function can take care of creating the APInt internally.

getZeroVector will take care of using the correct type for the build vector to avoid re-lowering.

The test change here is because execution domain constraints apparently pass through undef inputs of a zeroing xor. So the different ordering of register allocation here caused the dependency to change.

llvm-svn: 319725
2017-12-05 01:28:04 +00:00
Craig Topper adadaae586 [X86] Rearrange some of the code around AVX512 sign/zero extends. NFCI
Move the AVX512 code out of LowerAVXExtend. LowerAVXExtend has two callers but one of them pre-checks for AVX-512 so the code is only live from the other caller. So move the AVX-512 checks up to that caller for symmetry.

Move all of the i1 input type code in Lower_AVX512ZeroExend together.

llvm-svn: 319724
2017-12-05 01:28:00 +00:00
Hans Wennborg 361d4392cf Revert r319490 "XOR the frame pointer with the stack cookie when protecting the stack"
This broke the Chromium build (crbug.com/791714). Reverting while investigating.

> Summary: This strengthens the guard and matches MSVC.
>
> Reviewers: hans, etienneb
>
> Subscribers: hiraditya, JDevlieghere, vlad.tsyrklevich, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D40622
>
> git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@319490 91177308-0d34-0410-b5e6-96231b3b80d8

llvm-svn: 319706
2017-12-04 22:21:15 +00:00
Craig Topper 4520d4f8ad [X86] Allow VPMAXUQ/VPMAXSQ/VPMINUQ/VPMINSQ to be used with 128/256 bit vectors when AVX512 is enabled.
These instructions can be used by widening to 512-bits and extracting back to 128/256. We do similar to several other instructions already.

llvm-svn: 319641
2017-12-04 07:21:01 +00:00
Craig Topper 1151facf76 [X86] Don't turn UINT_TO_FP into SINT_TO_FP during lowering.
We already do this as a DAG combine. The version during lowering can only trigger if known bits changes something that improves known bits analysis. But this means we should be improving known bits analysis to work on the unlowered form instead.

llvm-svn: 319640
2017-12-04 05:38:44 +00:00
Craig Topper f8470a6399 [X86] Custom legalize v2i32 gathers via widening rather than promoting.
The default legalization for v2i32 is promotion to v2i64. This results in a gather that reads 64-bit elements rather than 32. If one of the elements is near a page boundary this can cause an illegal access that can fault.

We also miscalculate the scale for the gather which is an even worse problem, but we probably could have found a separate way to fix that.

llvm-svn: 319521
2017-12-01 06:02:02 +00:00
Craig Topper 11f733df9b [X86] Add a DAG combine to simplify masks for AVX2 gather instructions.
AVX2 gathers only use the upper bit of the mask allowing us to simplify sign_extend_inreg to a shift left.

llvm-svn: 319514
2017-12-01 02:49:07 +00:00
Reid Kleckner ba4014e9dc XOR the frame pointer with the stack cookie when protecting the stack
Summary: This strengthens the guard and matches MSVC.

Reviewers: hans, etienneb

Subscribers: hiraditya, JDevlieghere, vlad.tsyrklevich, llvm-commits

Differential Revision: https://reviews.llvm.org/D40622

llvm-svn: 319490
2017-11-30 22:41:21 +00:00
Craig Topper d4257565cf [X86] Promote i8 CTPOP to i32 instead of i16 when we have the POPCNT instruction.
The 32-bit version is shorter to encode and the zext we emit for the promotion is likely going to be a 32-bit zero extend anyway.

llvm-svn: 319468
2017-11-30 20:15:31 +00:00
Francis Visoiu Mistrih 93ef145862 [CodeGen] Print "%vreg0" as "%0" in both MIR and debug output
As part of the unification of the debug format and the MIR format, avoid
printing "vreg" for virtual registers (which is one of the current MIR
possibilities).

Basically:

* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E "s/%vreg([0-9]+)/%\1/g"
* grep -nr '%vreg' . and fix if needed
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E "s/ vreg([0-9]+)/ %\1/g"
* grep -nr 'vreg[0-9]\+' . and fix if needed

Differential Revision: https://reviews.llvm.org/D40420

llvm-svn: 319427
2017-11-30 12:12:19 +00:00
Craig Topper a495744d2c [X86] Optimize avx2 vgatherqps for v2f32 with v2i64 index type.
Normal type legalization will widen everything. This requires forcing 0s into the mask register. We can instead choose the form that only reads 2 elements without zeroing the mask.

llvm-svn: 319406
2017-11-30 07:01:40 +00:00
Craig Topper 321a8b9b63 [X86] Make sure we don't remove sign extends of masks with AVX2 masked gathers.
We don't use k-registers and instead use the MSB so we need to make sure we sign extend the mask to the msb.

llvm-svn: 319405
2017-11-30 06:31:31 +00:00
Craig Topper 56a41d4b3a [X86] Remove some questionable looking code that seems to be looking through a VZEXT to create a larger VSEXT.
If the input the vzext was signed this would do the wrong thing.

Not sure how to test this.

llvm-svn: 319382
2017-11-29 23:08:25 +00:00
Craig Topper e3515001b9 [X86] Remove setOperationAction Promote for ISD::SINT_TO_FP MVT::v8i16/v16i8/v16i16.
A DAG combine ensures these ops are always promoted to vXi32.

llvm-svn: 319298
2017-11-29 08:19:36 +00:00
Craig Topper fbf7b3bf3e [X86] Promote fp_to_sint v16f32->v16i16/v16i8 to avoid scalarization.
llvm-svn: 319266
2017-11-29 00:32:09 +00:00
Craig Topper 88ffb5d4d5 [X86] Mark ISD::FP_TO_UINT v16i8/v16i16 as Promote under AVX512 instead of legal. Fix infinite loop in op legalization when promotion requires 2 steps.
Previously we had an isel pattern to add the truncate. Instead use Promote to add the truncate to the DAG before isel.

The Promote legalization code had to be updated to prevent an infinite loop if promotion took multiple steps because it wasn't remembering the previously tried value.

llvm-svn: 319259
2017-11-28 23:56:02 +00:00
Craig Topper ab9bfc904b [X86] Remove unused variable.
llvm-svn: 319239
2017-11-28 22:28:23 +00:00
Craig Topper a27f1e675a [X86] Remove code from combineUIntToFP that tried to favor UINT_TO_FP if legal when zero extending from vXi8/vX816.
The UINT_TO_FP is immediately converted to SINT_TO_FP when the node is re-evaluated because we'll detect that the sign bit is zero.

llvm-svn: 319234
2017-11-28 22:08:51 +00:00
Craig Topper 3aaa71f222 [X86] Remove custom lowering for uint_to_fp from vXi8/vXi16.
We have a DAG combine that uses a zero extend that should prevent this from ever occurring now.

llvm-svn: 319233
2017-11-28 22:08:48 +00:00
Craig Topper dd4295626b [X86] In lowerVectorShuffleAsElementInsertion, if were able to find a scalar i8 or i16 and need to zero extend it, make sure we use a vXi32 type of the full vector width.
Previously, this was hardcoded to v4i32, but if the input type is 256 bits we need to use v8i32.

Fixes PR35443

llvm-svn: 319208
2017-11-28 19:25:45 +00:00
Craig Topper ddbc340c20 [X86] Make zero extend from v16i1/v8i1 to v16i8/v8i16/v16i16 not scalarize under AVX512.
llvm-svn: 319136
2017-11-28 01:36:33 +00:00
Craig Topper 8b9cd03824 [X86] Remove unnecessary fp<->int setOperationAction lines from a hasVLX block. NFCI
These lines all exist identically either under SSE2, AVX2 or AVX512. Given that VLX implies all of those, these aren't providing anything new.

llvm-svn: 319124
2017-11-28 00:41:12 +00:00
Craig Topper ce732e7c30 [X86] Remove duplicate calls to setOperationAction. NFCI
These same calls exist a few lines down.

llvm-svn: 319122
2017-11-28 00:16:42 +00:00
Craig Topper 256cc48df6 [X86] Teach getSetCCResultType to handle more than just SimpleVTs when looking at larger than 512-bit vectors.
Which VTs are considered simple is determined by the superset of the legal types of all targets in LLVM. If we're looking at VTs that are going to be split down to 512-bits we should allow any VT not just simple ones since the simple list changes over time as new targets are added.

llvm-svn: 319110
2017-11-27 22:56:10 +00:00
Craig Topper 4aa519507d [X86] Remove lines that set v8f32 FP_ROUND/FP_EXTEND to Legal under AVX512. NFCI
We don't do this for narrow vectors under AVX or SSE features. We also don't set them to Expand like we do for many vectors op. Nor does TargetLoweringBase.cpp. This leads me to believe these default to Legal.

llvm-svn: 319103
2017-11-27 22:01:17 +00:00
Craig Topper a4120fc42c [X86] Teach combineX86ShuffleChain that AllowIntDomain requires at least SSE2.
I don't have a good test case for this at the moment. I was playing around with a change in legalizing and triggered this code to produce a PSHUFD with sse1 only.

llvm-svn: 319066
2017-11-27 18:15:14 +00:00
Craig Topper 62189f7ab3 [X86] Make getSetCCResultType return vXi1 for any vXi32/vXi64 vector over 512 bits long when AVX512 is enabled.
Similar for vXi16/vXi8 with BWI.

Any vector larger than 512 bits will be split to 512 bits during legalization. But without this we will fold sexts with them before that making it difficult to recover leading to scalarization.

llvm-svn: 319059
2017-11-27 17:51:55 +00:00
Craig Topper 074003c8e2 [X86] Fix an assert that was incorrectly checking for BMI instead of AVX512VBMI.
The check is actually unnecessary since AVX512VBMI implies AVX512BW which is the other part of the assert.

llvm-svn: 319006
2017-11-26 21:14:48 +00:00
Coby Tayree d8b17bedfa [x86][icelake]GFNI
galois field arithmetic (GF(2^8)) insns:
gf2p8affineinvqb
gf2p8affineqb
gf2p8mulb
Differential Revision: https://reviews.llvm.org/D40373

llvm-svn: 318993
2017-11-26 09:36:41 +00:00
Craig Topper e485631cd1 [X86] Add separate intrinsics for scalar FMA4 instructions.
Summary:
These instructions zero the non-scalar part of the lower 128-bits which makes them different than the FMA3 instructions which pass through the non-scalar part of the lower 128-bits.

I've only added fmadd because we should be able to derive all other variants using operand negation in the intrinsic header like we do for AVX512.

I think there are still some missed negate folding opportunities with the FMA4 instructions in light of this behavior difference that I hadn't noticed before.

I've split the tests so that we can use different intrinsics for scalar testing between the two. I just copied the tests split the RUN lines and changed out the scalar intrinsics.

fma4-fneg-combine.ll is a new test to make sure we negate the fma4 intrinsics correctly though there are a couple TODOs in it.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D39851

llvm-svn: 318984
2017-11-25 18:32:43 +00:00
Craig Topper a456f13af2 [X86] Simplify some code in combineSetCC. NFCI
Make the condition for doing a std::swap simpler so we don't have to repeat the full checks.

llvm-svn: 318970
2017-11-25 07:20:24 +00:00
Craig Topper 696bfc08d8 [X86] Qualify some vector specific code with VT.isVector(). NFCI
Other checks inside require a build_vector, but we this lets us stop earlier and makes the code more clear.

llvm-svn: 318969
2017-11-25 07:20:23 +00:00
Craig Topper c1b3269171 [X86] Support folding to andnps with SSE1 only.
With SSE1 only, we emit FAND and FXOR nodes for v4f32.

llvm-svn: 318968
2017-11-25 07:20:22 +00:00
Craig Topper 5b85df8605 [X86] Add some early DAG combines to turn v4i32 AND/OR/XOR into FAND/FOR/FXOR whe only SSE1 is available.
v4i32 isn't a legal type with sse1 only and would end up getting scalarized otherwise.

This isn't completely ideal as it doesn't handle cases like v8i32 that would get split to v4i32. But it at least helps with code written using the clang intrinsic header.

llvm-svn: 318967
2017-11-25 07:20:21 +00:00
Craig Topper 13ed01e635 [X86] Prevent using X * rsqrt(X) to approximate sqrt when only sse1 is enabled.
This optimization can occur after type legalization and emit a vselect with v4i32 type. But that type is not legal with sse1. This ultimately gets scalarized by the second type legalization that runs after vector op legalization, but that's really intended to handle the scalar types that might be introduced by legalizing vector ops.

For now just stop this from happening by disabling the optimization with sse1.

llvm-svn: 318965
2017-11-24 19:57:48 +00:00
Craig Topper f31b0b850b [X86] Teach isel that X86ISD::CMPM_RND zeros the upper bits of the mask register.
llvm-svn: 318933
2017-11-23 18:41:21 +00:00
Craig Topper 94b994972c [X86] Remove some unneeded opcodes from getVectorMaskingNode. NFC
We never reach here with these opcodes.

llvm-svn: 318932
2017-11-23 18:41:20 +00:00
Craig Topper b663adddb0 [X86] Add X86ISD::CMPM_RND to getVectorMaskingNode to select ISD::AND instead of ISD::VSELECT
A later DAG combine will turn the VSELECT into an AND, but we have the other mask compare opcodes here so add this one too.

llvm-svn: 318931
2017-11-23 18:41:19 +00:00
Craig Topper 27d182b7d4 [X86] Remove some dead code leftover from when i1 was a legal type. NFCI
llvm-svn: 318930
2017-11-23 18:41:18 +00:00
Craig Topper be9bf65d76 [X86] Remove some dead code. NFC
AVX512 code never reaches here so we don't need to handle X86ISD::CMPM as an opcode.

llvm-svn: 318929
2017-11-23 18:41:17 +00:00
Simon Pilgrim 90accbc5d9 [X86][SSE] Use (V)PHMINPOSUW for vXi16 SMAX/SMIN/UMAX/UMIN horizontal reductions (PR32841)
(V)PHMINPOSUW determines the UMIN element in an v8i16 input, with suitable bit flipping it can also be used for SMAX/SMIN/UMAX cases as well.

This patch matches vXi16 SMAX/SMIN/UMAX/UMIN horizontal reductions and reduces the input down to a v8i16 vector before calling (V)PHMINPOSUW.

A later patch will use this for v16i8 reductions as well (PR32841).

Differential Revision: https://reviews.llvm.org/D39729

llvm-svn: 318917
2017-11-23 13:50:27 +00:00
Coby Tayree e8bdd383e9 [x86][icelake]BITALG
2/3
vpshufbitqmb encoding
3/3
vpshufbitqmb intrinsics
Differential Revision: https://reviews.llvm.org/D40222

llvm-svn: 318904
2017-11-23 11:15:50 +00:00
Craig Topper a7864ed64a [X86] Turn an if condition that should always be true into an assert. NFCI
If Values.size() == 0, we should have returned 0 or undef earlier. If it was 1, it's a splat and we already handled that too.

llvm-svn: 318894
2017-11-23 03:24:01 +00:00
Craig Topper 6a0177bcf1 [X86] Remove unnecessary check for is128BitVector. NFC
256 and 512 bit vectors were picked off earlier in the function. Lots of code between there and here already assumed 128-bit vectors.

llvm-svn: 318893
2017-11-23 03:24:00 +00:00
Craig Topper 2a38887f28 [X86] Simplify some bitmasking and use llvm_unreachable to mark an impossible case. NFC
llvm-svn: 318892
2017-11-23 03:23:59 +00:00
Craig Topper ac4b0b1a2a [X86] Remove a ternary operator that can only ever be false. NFC
We are checking for AVX512 in an SSE1 only block.

llvm-svn: 318891
2017-11-23 03:23:58 +00:00
Craig Topper 726968d6a2 [X86] Support v32i16/v64i8 CTLZ using lookup table.
Had to tweak the setcc's used by the code to use a vXi1 result type with a sign extend back to vector size.

llvm-svn: 318871
2017-11-22 20:05:57 +00:00
Craig Topper 8ad818656a [X86] Move the BITALG setOperationAction code into the hasBWI section to match what is done for VPOPCNTDQ in the AVX512F block. NFC
llvm-svn: 318870
2017-11-22 20:05:54 +00:00
Craig Topper e15cc16873 [X86] Sink the MGATHER setOperationActions for AVX2 into the AVX block where most of the rest of the AVX2 legalization lives.
llvm-svn: 318869
2017-11-22 20:05:51 +00:00
Craig Topper ee74044f93 [X86] Add an X86ISD::MSCATTER node for consistency with the X86ISD::MGATHER.
This makes the fact that X86 needs an explicit mask output not part of the type constraint for the ISD::MSCATTER.

This also gives the X86ISD::MGATHER/MSCATTER nodes a common base class simplifying the address selection code in X86ISelDAGToDAG.cpp

llvm-svn: 318823
2017-11-22 08:10:54 +00:00
Craig Topper c1e7b3f6ca [X86] Lower all ISD::MGATHER nodes to X86ISD:MGATHER.
Now we consistently represent the mask result without relying on isel ignoring it.

We now have a more general SDNode and type constraints to represent these nodes in isel patterns. This allows us to present both both vXi1 and XMM/YMM mask types with a single set of constraints.

llvm-svn: 318821
2017-11-22 07:11:03 +00:00
Coby Tayree 5c7fe5df53 [x86][icelake]BITALG
vpopcnt{b,w}
Differential Revision: https://reviews.llvm.org/D40213

llvm-svn: 318748
2017-11-21 10:32:42 +00:00
Coby Tayree 3880f2a363 [x86][icelake]VNNI
Introducing Vector Neural Network Instructions, consisting of:
vpdpbusd{s}
vpdpwssd{s}
Differential Revision: https://reviews.llvm.org/D40208

llvm-svn: 318746
2017-11-21 10:04:28 +00:00
Coby Tayree 71e37cc9ff [x86][icelake]vbmi2
introducing vbmi2, consisting of
vpcompress{b,w}
vpexpand{b,w}
vpsh{l,r}d{w,d,q}
vpsh{l,r}dv{w,d,q}
Differential Revision: https://reviews.llvm.org/D40206

llvm-svn: 318745
2017-11-21 09:48:44 +00:00
Mohammed Agabaria 115f68ea3e [LV][X86] Support of AVX2 Gathers code generation and update the LV with this
This patch depends on: https://reviews.llvm.org/D35348

Support of pattern selection of masked gathers of AVX2 (X86\AVX2 code gen)
Update LoopVectorize to generate gathers for AVX2 processors.

Reviewers: delena, zvi, RKSimon, craig.topper, aaboud, igorb

Reviewed By: delena, RKSimon

Differential Revision: https://reviews.llvm.org/D35772

llvm-svn: 318641
2017-11-20 08:18:12 +00:00
Craig Topper 410bbcdcf1 [X86] Qualify a few places with ExperimentalVectorWideningLegalization.
I'm playing around with this flag and these places cause errors if not qualified.

llvm-svn: 318595
2017-11-18 18:49:16 +00:00
Simon Pilgrim c9bc55a08d [X86] Add todo comment for TRUNC(SUB(X,C)) -> SUB(TRUNC(X),C')
As discussed on PR35295, but it causes regressions in combineSubToSubus which need to be addressed first 

llvm-svn: 318594
2017-11-18 18:33:07 +00:00
Craig Topper 3a431cfb13 [X86] Fix typo in variable name. NFC
llvm-svn: 318590
2017-11-18 05:09:55 +00:00
David Blaikie b3bde2ea50 Fix a bunch more layering of CodeGen headers that are in Target
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).

llvm-svn: 318490
2017-11-17 01:07:10 +00:00
Craig Topper 089082378f [X86] Add DAG combine to remove sext i32->i64 from gather/scatter instructions.
Only do this pre-legalize in case we're using the sign extend to legalize for KNL.

This recovers all of the tests that changed when I stopped SelectionDAGBuilder from deleting sign extends.

There's more work that could be done here particularly to fix the i8->i64 test case that experienced split.

llvm-svn: 318468
2017-11-16 23:09:06 +00:00
Craig Topper e85ff4f732 [X86] Pre-truncate gather/scatter indices that have element sizes larger than 64-bits before Legalize.
The wider element type will normally cause legalize to try to split and scalarize the gather/scatter, but we can't handle that. Instead, truncate the index early so the gather/scatter node is insulated from the legalization.

This really shouldn't happen in practice since InstCombine will normalize index types to the same size as pointers.

llvm-svn: 318452
2017-11-16 20:23:22 +00:00
Craig Topper 04be793cec [X86] DAGCombinerInfo is in TargetLowering not X86TargetLowering.
llvm-svn: 318451
2017-11-16 20:23:17 +00:00
Craig Topper e6601fd30e [X86] Custom type legalize v2f32 masked gathers instead of trying to cleanup after type legalization.
llvm-svn: 318368
2017-11-16 02:07:45 +00:00
Craig Topper 54b57b0dd8 [X86] Add a return to the end of a switch to prevent an accidental fallthrough in the future.
llvm-svn: 318330
2017-11-15 20:42:47 +00:00
Craig Topper 16a91cee6c [X86] Redefine the 128-bit version of VPGATHERQD and VGATHERQPS to use a VK2 mask instead of a VK4 mask.
This allows us to remove extra extend creation during lowering and more accurately reflects the semantics of the instruction.

While there add an extra output VT to X86 masked gather node to better match the isel pattern predicate. Currently we're exploiting the fact that the isel table doesn't count how many output results a node actually has if the result type of any can be inferred from the first result and the type constraints defined in tablegen. I think we might ultimately want to lower all MGATHER/MSCATTER to an X86ISD node with the extra mask result and stop relying on this hole in the isel checking.

llvm-svn: 318278
2017-11-15 07:46:43 +00:00
Craig Topper 23493f3777 [X86] Attempt to fix signed and unsigned comparison warning.
llvm-svn: 318010
2017-11-13 02:19:13 +00:00
Craig Topper 63157c4784 [X86] Use EVEX encoded VRNDSCALE instructions to implement the legacy round intrinsics.
The VRNDSCALE instructions implement a superset of the (V)ROUND instructions. They are equivalent if the upper 4-bits of the immediate are 0.

This patch lowers the legacy intrinsics to the VRNDSCALE ISD node and masks the upper bits of the immediate to 0. This allows us to take advantage of the larger register encoding space.

We should maybe consider converting VRNDSCALE back to VROUND in the EVEX to VEX pass if the extended registers are not being used.

I notice some load folding opportunities being missed for the VRNDSCALESS/SD instructions that I'll try to fix in future patches.

llvm-svn: 318008
2017-11-13 02:03:00 +00:00
Craig Topper 0af48f1ad4 [X86] Split VRNDSCALE/VREDUCE/VGETMANT/VRANGE ISD nodes into versions with and without the rounding operand. NFCI
I want to reuse the VRNDSCALE node for the legacy SSE rounding intrinsics so that those intrinsics can use EVEX instructions. All of these nodes share tablegen multiclasses so I split them all so that they all remain similar in their implementations.

llvm-svn: 318007
2017-11-13 02:02:58 +00:00
Craig Topper b42a23ff8f [X86] Add an X86ISD::RANGES opcode to use for the scalar intrinsics.
This fixes a bug where we selected packed instructions for scalar intrinsics.

llvm-svn: 317999
2017-11-12 18:51:09 +00:00
Craig Topper 1382932c12 [X86] Remove some no longer needed intrinsic lowering code.
llvm-svn: 317997
2017-11-12 18:51:06 +00:00
Simon Pilgrim 294b87b432 [X86] Attempt to match multiple binary reduction ops at once. NFCI
matchBinOpReduction currently matches against a single opcode, but we already have a case where we repeat calls to try to match against AND/OR and I'll be shortly adding another case for SMAX/SMIN/UMAX/UMIN (D39729).

This NFCI patch alters matchBinOpReduction to try and pattern match against any of the provided list of candidate bin ops at once to save time.

Differential Revision: https://reviews.llvm.org/D39726

llvm-svn: 317985
2017-11-11 18:16:55 +00:00
Craig Topper 1a0da2db5f [X86] Add support for combining FMADDSUB(A, B, FNEG(C))->FMSUBADD(A, B, C)
Support the opposite direction as well. Also add a TODO for not being able to combine FMSUB/FNMADD/FNMSUB with FNEG.

llvm-svn: 317878
2017-11-10 08:22:37 +00:00
Craig Topper 93e27d2ecc [X86] Make sure we don't read too many operands from X86ISD::FMADDS1/FMADDS3 nodes when doing FNEG combine.
r317453 added new ISD nodes without rounding modes that were added to an existing if/else chain. But all the previous nodes handled there included a rounding mode. The final code after this if/else chain expected an extra operand that isn't present for the new nodes.

llvm-svn: 317748
2017-11-09 01:06:47 +00:00
Craig Topper cf8e6d0a76 [X86] Add support for using EVEX instructions for the legacy vcvtph2ps intrinsics.
Looks like there's some missed load folding opportunities for i64 loads.

llvm-svn: 317544
2017-11-07 07:13:03 +00:00
Craig Topper 428a4e6374 [X86] Make FeatureAVX512 imply FeatureF16C.
The EVEX to VEX pass is already assuming this is true under AVX512VL. We had special patterns to use zmm instructions if VLX and F16C weren't available.

Instead just make AVX512 imply F16C to make the EVEX to VEX behavior explicitly legal and remove the extra patterns.

All known CPUs with AVX512 have F16C so this should safe for now.

llvm-svn: 317521
2017-11-06 22:49:04 +00:00
Simon Pilgrim ad9b9720e8 [X86][SSE] Merge combineExtractVectorElt_SSE into combineExtractVectorElt. NFCI.
We still early-out for X86ISD::PEXTRW/X86ISD::PEXTRB so no actual change in behaviour, but it'll make it easier to add support in a future patch.

llvm-svn: 317485
2017-11-06 15:28:25 +00:00
Simon Pilgrim 14450720e6 [X86][SSE] Combine EXTRACT_VECTOR_ELT with combineExtractWithShuffle before XFormVExtractWithShuffleIntoLoad
combineExtractWithShuffle can handle more complex shuffles/bitcasts than we can with the equivalent code in XFormVExtractWithShuffleIntoLoad.

Mainly a compile time improvement now (combineExtractWithShuffle combines will have always failed late on inside XFormVExtractWithShuffleIntoLoad), and will let us merge combineExtractVectorElt_SSE in a future commit.

llvm-svn: 317481
2017-11-06 14:34:19 +00:00
Uriel Korach bb86686a8b [X86][AVX512] Improve lowering of AVX512 test intrinsics
Added TESTM and TESTNM to the list of instructions that already zeroing unused upper bits
and does not need the redundant shift left and shift right instructions afterwards.
Added a pattern for TESTM and TESTNM in iselLowering, so now icmp(neq,and(X,Y), 0) goes folds into TESTM
and icmp(eq,and(X,Y), 0) goes folds into TESTNM
This commit is a preparation for lowering the test and testn X86 intrinsics to IR.

Differential Revision: https://reviews.llvm.org/D38732

llvm-svn: 317465
2017-11-06 09:22:38 +00:00
Zvi Rackover 3122698040 X86 ISel: Basic support for variable-index vector permutations
Summary:
Try to lower a BUILD_VECTOR composed of extract-extract chains that can be
reasoned to be a permutation of a vector by indices in a non-constant vector.

We saw this pattern created by ISPC, which resolts to creating it due to the
requirement that shufflevector's mask operand be a *constant* vector.
I didn't check this but we could possibly use this pattern for lowering the X86 permute
C-instrinsics instead of llvm.x86 instrinsics.

This change can be followed by more improvements:
1. Handle vectors with undef elements.
2. Utilize pshufb and zero-mask-blending to support more effiecient
   construction of vectors with constant-0 elements.
3. Use smaller-element vectors of same width, and "interpolate" the indices,
   when no native operation available.

Reviewers: RKSimon, craig.topper

Reviewed By: RKSimon

Subscribers: chandlerc, DavidKreitzer

Differential Revision: https://reviews.llvm.org/D39126

llvm-svn: 317463
2017-11-06 08:25:46 +00:00
Jina Nahias 3844f1ad5c Revert "adding a pattern for broadcastm"
This reverts commit r317457.

Change-Id: If07f1fca1e3453d16c1dac906e87768661384e91
llvm-svn: 317462
2017-11-06 07:48:58 +00:00
Jina Nahias 7b705f1f91 [x86][AVX512] Lowering Broadcastm intrinsics to LLVM IR
This patch, together with a matching clang patch (https://reviews.llvm.org/D38683), implements the lowering of X86 broadcastm intrinsics to IR.

Differential Revision: https://reviews.llvm.org/D38684

Change-Id: I709ac0b34641095397e994c8ff7e15d1315b3540
llvm-svn: 317458
2017-11-06 07:09:24 +00:00
Jina Nahias 9c6561b648 adding a pattern for broadcastm
Change-Id: I6551fb13879e098aed74de410e29815cf37d9ab5
llvm-svn: 317457
2017-11-06 07:09:09 +00:00
Craig Topper 07dac55d95 [X86] Add scalar FMA ISD nodes without rounding mode. NFC
Next step is to use them for the legacy FMA scalar intrinsics as well. This will enable the legacy intrinsics to use EVEX encoded opcodes and the extended registers.

llvm-svn: 317453
2017-11-06 05:48:25 +00:00
Craig Topper 692c8efe30 [X86] Don't use RCP14 and RSQRT14 for reciprocal estimations or for legacy SSE rcp/rsqrt intrinsics when AVX512 features are enabled.
Summary:
AVX512 added RCP14 and RSQRT instructions which improve accuracy over the legacy RCP and RSQRT instruction, but not enough accuracy to remove the need for a Newton Raphson refinement.

Currently we use these new instructions for the legacy packed SSE instrinics, but not the scalar instrinsics. And we use it for fast math optimization of division and reciprocal sqrt.

I think switching the legacy instrinsics maybe surprising to the user since it changes the answer based on which processor you're using regardless of any fastmath settings. It's also weird that we did something different between scalar and packed.

As far at the reciprocal estimation, I think it creates unnecessary deltas in our output behavior (and prevents EVEX->VEX). A little playing around with gcc and icc and godbolt suggest they don't change which instructions they use here.

This patch adds new X86ISD nodes for the RCP14/RSQRT14 and uses those for the new intrinsics. Leaving the old intrinsics to use the old instructions.

Going forward I think our focus should be on
-Supporting 512-bit vectors, which will have to use the RCP14/RSQRT14.
-Using RSQRT28/RCP28 to remove the Newton Raphson step on processors with AVX512ER
-Supporting double precision.

Reviewers: zvi, DavidKreitzer, RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D39583

llvm-svn: 317413
2017-11-04 18:26:41 +00:00
Craig Topper a96d62b360 [X86] Teach shuffle lowering to use 256-bit SHUF128 when possible.
This allows masked operations to be used and allows the register allocator to use YMM16-31 if necessary.

As a follow up I'll look into teaching EVEX->VEX how to turn this back into PERM2X128 if any of the additional features don't work out.

llvm-svn: 317403
2017-11-04 06:44:47 +00:00
Craig Topper d21a53f246 [X86] Give unary PERMI priority over SHUF128 in lowerV8I64VectorShuffle to make it possible to fold a load.
llvm-svn: 317382
2017-11-03 22:48:13 +00:00
Simon Pilgrim ae1f013495 [X86][SSE] Add PACKUS support to combineVectorTruncation
Similar to the existing code to lower to PACKSS, we can use PACKUS if the input vector's leading zero bits extend all the way to the packed/truncated value.

We have to account for pre-SSE41 targets not supporting PACKUSDW

llvm-svn: 317315
2017-11-03 11:33:48 +00:00
Craig Topper 333897ec31 [X86] Remove PALIGNR/VALIGN handling from combineBitcastForMaskedOp and move to isel patterns instead. Prefer 128-bit VALIGND/VALIGNQ over PALIGNR during lowering when possible.
llvm-svn: 317299
2017-11-03 06:48:02 +00:00
Simon Pilgrim e152c2c447 [X86][SSE] Add PACKUS support to LowerTruncate
Similar to the existing code to lower to PACKSS, we can use PACKUS if the input vector's leading zero bits extend all the way to the packed/truncated value.

We have to account for pre-SSE41 targets not supporting PACKUSDW

llvm-svn: 317128
2017-11-01 21:52:29 +00:00
Simon Pilgrim 778810eb42 [X86][SSE] Begun generalizing truncateVectorWithPACKSS to work with PACKSS/PACKUS functions
Renamed to truncateVectorWithPACK

llvm-svn: 317098
2017-11-01 15:31:51 +00:00
Simon Pilgrim f657ba0cb6 [X86][SSE] Truncate with PACKSS any input with sufficient sign-bits
So far we've only been using PACKSS truncations with 'all-bits or zero-bits' patterns (vector comparison results etc.). When really we can safely use it for any case as long as the number of sign bits reach down to the last 16-bits (or 8-bits if we're truncating to bytes).

The next steps after this is add the equivalent support for PACKUS and to support packing to sub-128 bit vectors for truncating stores etc.

Differential Revision: https://reviews.llvm.org/D39476

llvm-svn: 317086
2017-11-01 11:47:44 +00:00
Simon Pilgrim f3c33ca83e [X86][SSE] Add VSRLI/VSRAI/VSLLI demanded elts support to computeKnownBits/ComputeNumSignBits
Mainly a perf improvements as most combines will have occurred before we lower to these instructions

llvm-svn: 317005
2017-10-31 16:06:21 +00:00
Jina Nahias 5bf6620b15 [X86][AVX512] Adding a pattern for broadcastm intrinsic.
Differential Revision: https://reviews.llvm.org/D38312

Change-Id: I71c8605a8e4c98013ef25289694afc5cfd46bb0b
llvm-svn: 316921
2017-10-30 16:37:28 +00:00
Craig Topper 4e13d4de52 [X86] Make sure we don't create locked inc/dec instructions when the carry flag is being used.
Summary:
INC/DEC don't update the carry flag so we need to make sure we don't try to use it.

This patch introduces new X86ISD opcodes for locked INC/DEC. Teaches lowerAtomicArithWithLOCK to emit these nodes if INC/DEC is not slow or the function is being optimized for size. An additional flag is added that allows the INC/DEC to be disabled if the caller determines that the carry flag is being requested.

The test_sub_1_cmp_1_setcc_ugt test is currently showing this bug. The other test case changes are recovering cases that were regressed in r316860.

This should fully fix PR35068 finishing the fix started in r316860.

Reviewers: RKSimon, zvi, spatel

Reviewed By: zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D39411

llvm-svn: 316913
2017-10-30 14:51:37 +00:00
Jina Nahias e63db55c67 Revert "[X86][AVX512] Adding a pattern for broadcastm intrinsic."
This reverts commit r316890.

Change-Id: I683cceee9848ef309b452293086b1f26a941950d
llvm-svn: 316894
2017-10-30 10:35:53 +00:00
Jina Nahias 70280f9a0d [X86][AVX512] Adding a pattern for broadcastm intrinsic.
Differential Revision: https://reviews.llvm.org/D38312

Change-Id: I6551fb13879e098aed74de410e29815cf37d9ab5
llvm-svn: 316890
2017-10-30 09:59:52 +00:00
Craig Topper 495a1bc893 [X86] Remove combine that turns X86ISD::LSUB into X86ISD::LADD. Update patterns that depended on this.
If the carry flag is being used, this transformation isn't safe.

This does prevent some test cases from using DEC now, but I'll try to look into that separately.

Fixes PR35068.

llvm-svn: 316860
2017-10-29 06:51:04 +00:00
Craig Topper 7a60e29185 [X86] Fix typo in comment. NFC
llvm-svn: 316859
2017-10-29 06:51:02 +00:00
Craig Topper 0692ca4bd2 [X86] Remove invalid code from LowerVSELECT.
This code attempted to say that v8i16/v16i16 VSELECT is legal if BWI and VLX are enabled, but the only way we could reach this point is if the condition was not a vXi1 type. Which means it really wasn't legal.

We don't have any tests that exercise this code. So I'm hoping it wasn't really reachable.

llvm-svn: 316851
2017-10-28 23:10:13 +00:00
Simon Pilgrim 294f88dfa0 [X86][SSE] Combine 128-bit target shuffles to PACKSS/PACKUS.
llvm-svn: 316845
2017-10-28 20:51:27 +00:00
Simon Pilgrim bd3852aa5e [X86][SSE] Split off matchVectorShuffleWithPACK. NFCI.
Split matchVectorShuffleWithPACK from lowerVectorShuffleWithPACK so that we can reuse it for target shuffle combines

llvm-svn: 316844
2017-10-28 20:27:22 +00:00
Simon Pilgrim 25808c303f [X86][SSE] Rename truncateVectorCompareWithPACKSS to truncateVectorWithPACKSS. NFC.
We no longer rely on the vector source being a comparison result, just have sufficient sign bits.

llvm-svn: 316834
2017-10-28 17:59:56 +00:00
Craig Topper b8d7d4d683 [X86] Improve handling of UDIVREM8_ZEXT_HREG/SDIVREM8_SEXT_HREG to support 64-bit extensions.
If the extend type is 64-bits, emit a 32-bit -> 64-bit extend after the UDIVREM8_ZEXT_HREG/UDIVREM8_SEXT_HREG operation.

This gives a shorter encoding for the second extend in the sext case, and allows us to completely remove the second extend in the zext case.

This also adds known bit and num sign bits support for UDIVREM8_ZEXT_HREG/SDIVREM8_SEXT_HREG.

Differential Revision: https://reviews.llvm.org/D38275

llvm-svn: 316702
2017-10-26 21:12:03 +00:00
Sanjay Patel ac50f3e907 [x86] use an insert op to put one variable element into a constant of vectors
Instead of loading (a potential ton of) scalar constants, load those as a vector and then insert into it.

Differential Revision: https://reviews.llvm.org/D38756

llvm-svn: 316685
2017-10-26 18:27:55 +00:00
Simon Pilgrim 5e8c3f328f [X86][AVX] ComputeNumSignBitsForTargetNode - add support for X86ISD::VTRUNC
llvm-svn: 316462
2017-10-24 17:04:57 +00:00
Simon Pilgrim 0a12c239b6 [X86] truncateVectorCompareWithPACKSS - use PACKSSDW/PACKSSWB instead of just PACKSSWB.
By using the widest type possible for PACKSS truncation we have a better chance of being able to peek through bitcasts and improves other combines driven by ComputeNumSignBits.

llvm-svn: 316448
2017-10-24 15:38:16 +00:00
Simon Pilgrim c36dd6ae9c [X86] truncateVectorCompareWithPACKSS - remove duplicate variables. NFCI.
llvm-svn: 316440
2017-10-24 14:18:32 +00:00
Simon Pilgrim 321e54f72d [X86][SSE] combineBitcastvxi1 - use PACKSSWB directly to pack v8i16 to v16i8
Avoid difficulties determining the number of sign bits later on in shuffle lowering to lower to PACKSS

llvm-svn: 316383
2017-10-23 22:05:02 +00:00
Simon Pilgrim 1dcb913be6 [X86][SSE] Remove AssertZext stage from PEXTRW/PEXTRB lowering. NFCI.
Remove AssertZext and instead add PEXTRW/PEXTRB support to computeKnownBitsForTargetNode to simplify instruction selection.

Differential Revision: https://reviews.llvm.org/D39169

llvm-svn: 316336
2017-10-23 16:00:57 +00:00
Craig Topper fcf27188d7 [X86] Do not generate __multi3 for mul i128 on X86
Summary: __multi3 is not available on x86 (32-bit). Setting lib call name for MULI_128 to nullptr forces DAGTypeLegalizer::ExpandIntRes_MUL to generate instructions for 128-bit multiply instead of a call to an undefined function.  This fixes PR20871 though it may be worth looking at why licm and indvars combine to generate 65-bit multiplies in that test.

Patch by Riyaz V Puthiyapurayil

Reviewers: craig.topper, schweitz

Reviewed By: craig.topper, schweitz

Subscribers: RKSimon, llvm-commits

Differential Revision: https://reviews.llvm.org/D38668

llvm-svn: 316254
2017-10-21 02:26:00 +00:00
Simon Pilgrim 29b32472b4 [X86][SSE] getTargetShuffleMask - check shuffle input value types. NFCI.
To help identify shuffle combine issues

llvm-svn: 316222
2017-10-20 18:07:50 +00:00
Craig Topper 7bce79a539 [X86] Remove LowerEXTRACT_SUBVECTOR handler. All EXTRACT_SUBVECTORs are marked as legal.
llvm-svn: 316182
2017-10-19 20:59:40 +00:00
Simon Pilgrim fdd63d1535 [X86] Replace custom scalar integer absolute matching with ISD::ABS lowering.
x86 has its own copy of integer absolute pattern matching to combine directly to a SUB+CMOV.

This patch removes the x86 combine and adds custom lowering support for ISD::ABS instead, allowing us to use the DAGCombiner version.

Additional test cases are already covered by iabs.ll (rL315706 and rL315711).

Differential Revision: https://reviews.llvm.org/D38895

llvm-svn: 316162
2017-10-19 15:02:24 +00:00
Krzysztof Parzyszek 72518eaa6f Add iterator range MachineRegisterInfo::liveins(), adopt users, NFC
llvm-svn: 315927
2017-10-16 19:08:41 +00:00
Craig Topper a5af4a64d0 [AVX512] Don't mark EXTLOAD as legal with AVX512. Continue using custom lowering.
Summary:
This was impeding our ability to combine the extending shuffles with other shuffles as you can see from the test changes.

There's one special case that needed to be added to use VZEXT directly for v8i8->v8i64 since the custom lowering requires v64i8.

Reviewers: RKSimon, zvi, delena

Reviewed By: delena

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38714

llvm-svn: 315860
2017-10-15 16:41:17 +00:00
Craig Topper a9cd59fb5d [X86] Lower vselect with constant condition to vector_shuffle even with AVX512 instructions.
Summary:
It's better to use our shuffle lowering code to handle these than loading an immediate into a k-register.

It really feels like this should be a DAG combine optimization rather than a lowering operation, but that's a problem for another day.

Reviewers: RKSimon, delena, zvi

Reviewed By: delena

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38932

llvm-svn: 315849
2017-10-15 06:39:07 +00:00
Simon Pilgrim 36fe00ee17 [X86][SSE] Don't attempt to reduce the imul vector width of odd sized vectors (PR34947)
llvm-svn: 315825
2017-10-14 19:57:19 +00:00
Simon Pilgrim f5b9f353c3 Pull out repeated calls to VT.getVectorNumElements(). NFCI.
llvm-svn: 315818
2017-10-14 17:37:42 +00:00
Simon Pilgrim cded82837d Use DAG::getBitcast() helper. NFCI.
llvm-svn: 315815
2017-10-14 17:14:42 +00:00
Simon Pilgrim f367c27d2d [X86][SSE] Support combining AND(EXTRACT(SHUF(X)), C) -> EXTRACT(SHUF(X))
If we are applying a byte mask to a value extracted from a shuffle, see if we can combine the mask into shuffle.

Fixes the last issue with PR22415

llvm-svn: 315807
2017-10-14 15:01:36 +00:00
Craig Topper f6c69564e7 [X86] Use X86ISD::VBROADCAST in place of v2f64 X86ISD::MOVDDUP when AVX2 is available
This is particularly important for AVX512VL where we are better able to recognize the VBROADCAST loads to fold with other operations.

For AVX512VL we now use X86ISD::VBROADCAST for all of the patterns and remove the 128-bit X86ISD::VMOVDDUP.

We may be able to use this for AVX1 as well which would allow us to remove more isel patterns.

I also had to add X86ISD::VBROADCAST as a node to call combineShuffle for so that we treat it similar to X86ISD::MOVDDUP.

Differential Revision: https://reviews.llvm.org/D38836

llvm-svn: 315768
2017-10-13 21:56:48 +00:00
Craig Topper 0817346aef [X86] Stop creating CMOV nodes with a second MVT::Glue result
Summary: We seem to inconsistently create CMOV nodes some with a Glue result and some without. But I can't find any cases that use the Glue result. So I've tried to remove all the place that did this.

Reviewers: RKSimon, spatel, zvi

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38664

llvm-svn: 315686
2017-10-13 15:28:35 +00:00
Sanjay Patel 3a72909b7e [x86] replace isEqualTo with == for efficiency
This is a follow-up suggested in D37534.
Patch by Yulia Koval.

llvm-svn: 315589
2017-10-12 16:15:38 +00:00
Simon Pilgrim 0903085ec3 [X86][SSE] Pull out repeated INSERT_VECTOR_ELT code from LowerBUILD_VECTOR v16i8/v8i16 insertion. NFCI.
llvm-svn: 315587
2017-10-12 15:52:01 +00:00
Sanjay Patel 6c0aef77aa [x86] avoid infinite loop from SoftenFloatOperand (PR34866)
Legalization of fp128 assumes things that we should have asserts for,
so that's another potential improvement.

Differential Revision: https://reviews.llvm.org/D38771

llvm-svn: 315485
2017-10-11 18:24:21 +00:00
Simon Pilgrim 7db366630c Spelling mistake in comment. NFCI.
llvm-svn: 315471
2017-10-11 16:10:05 +00:00
Craig Topper 3dc22bba47 [X86] Remove MVT::i1 handling code from LowerTRUNCATE
Summary: I don't think this is necessary with i1 being illegal now.

Reviewers: RKSimon, zvi, guyblank

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38784

llvm-svn: 315469
2017-10-11 16:05:05 +00:00
Uriel Korach 059e211aa1 after fixing the i386 case
Change-Id: If6fe0b6ec01f111115fb734fe31c0e152dbc165f
llvm-svn: 315311
2017-10-10 13:43:09 +00:00
Zvi Rackover c1d5955684 [X86] Unsigned saturation subtraction canonicalization [the backend part]
Summary:
On behalf of julia.koval@intel.com

The patch transforms canonical version of unsigned saturation, which is sub(max(a,b),a) or sub(a,min(a,b)) to special psubus insturuction on targets, which support it(8bit and 16bit uints).
umax(a,b) - b -> subus(a,b)
a - umin(a,b) -> subus(a,b)

There is also extra case handled, when right part of sub is 32 bit and can be truncated, using UMIN(this transformation was discussed in https://reviews.llvm.org/D25987).

The example of special case code:

```
void foo(unsigned short *p, int max, int n) {

  int i;
  unsigned m;
  for (i = 0; i < n; i++) {
    m = *--p;
    *p = (unsigned short)(m >= max ? m-max : 0);
  }
}
```
Max in this example is truncated to max_short value, if it is greater than m, or just truncated to 16 bit, if it is not. It is vaid transformation, because if max > max_short, result of the expression will be zero.

Here is the table of types, I try to support, special case items are bold:

| Size | 128 | 256 | 512
| -----  | -----  | -----   | -----
| i8 | v16i8 | v32i8 | v64i8
| i16 | v8i16 | v16i16 | v32i16
| i32 | | **v8i32** | **v16i32**
| i64 | | | **v8i64**

Reviewers: zvi, spatel, DavidKreitzer, RKSimon

Reviewed By: zvi

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D37534

llvm-svn: 315237
2017-10-09 20:01:10 +00:00
Craig Topper c88883b07d [X86] Remove a setLoadExtAction from the AVX512 section that uses an AVX512BW type and is alraedy present in the AVX512BW section.
llvm-svn: 315202
2017-10-09 01:05:16 +00:00
Craig Topper 4f8656a7af [X86] Enable extended comparison predicate support for SETUEQ/SETONE when targeting AVX instructions.
We believe that despite AMD's documentation, that they really do support all 32 comparision predicates under AVX.

Differential Revision: https://reviews.llvm.org/D38609

llvm-svn: 315201
2017-10-09 01:05:15 +00:00
Simon Pilgrim 2c742f919a [X86][SSE] Don't call combineTo inside combineX86ShufflesRecursively. NFCI.
Return the combined shuffle from combineX86ShufflesRecursively and perform the combineTo in the caller.

Makes it easier for future patches to use this in functions that aren't actually shuffles themselves.

llvm-svn: 315195
2017-10-08 20:58:14 +00:00
Simon Pilgrim 6abbd33ec0 Tidyup with clang-format. NFCI.
llvm-svn: 315187
2017-10-08 19:24:30 +00:00
Simon Pilgrim dc32c844f9 [X86] getTargetConstantBitsFromNode - add support for decoding scalar constants
llvm-svn: 315182
2017-10-08 17:21:18 +00:00
Craig Topper c97775c03c [X86] Prefer MOVSS/SD over BLENDI during legalization. Remove BLENDI versions of scalar arithmetic patterns
Summary:
We currently disable some converting of shuffles to MOVSS/MOVSD during legalization if SSE41 is enabled. But later during shuffle combining we go back to prefering MOVSS/MOVSD.

Additionally we have patterns that look for BLENDIs to detect scalar arithmetic operations. I believe due to the combining using MOVSS/MOVSD these are unnecessary.

Interestingly, we still codegen blend instructions even though lowering/isel emit movss/movsd instructions. Turns out machine CSE commutes them to blend, and then commuting those blends back into blends that are equivalent to the original movss/movsd.

This patch fixes the inconsistency in legalization to prefer MOVSS/MOVSD. The one test change was caused by this change. The problem is that we have integer types and are mostly selecting integer instructions except for the shufps. This shufps forced the execution domain, but the vpblendw couldn't have its domain changed with a naive instruction swap. We could fix this by special casing VPBLENDW based on the immediate to widen the element type.

The rest of the patch is removing all the excess scalar patterns.

Long term we should probably add isel patterns to make MOVSS/MOVSD emit blends directly instead of relying on the double commute. We may also want to consider emitting movss/movsd for optsize. I also wonder if we should still use the VEX encoded blendi instructions even with AVX512. Blends have better throughput, and that may outweigh the register constraint.

Reviewers: RKSimon, zvi

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38023

llvm-svn: 315181
2017-10-08 16:57:23 +00:00
Craig Topper bbca2f2978 [X86] Stop LowerSIGN_EXTEND_AVX512 from creating v8i16/v16i16/v16i8 vselects with a v8i1/v16i1 condition when BWI is not available.
Some of the tests in vector-shuffle-v1.ll would get into an infinite loop without this.

llvm-svn: 315172
2017-10-08 08:50:59 +00:00
Craig Topper 27170fee8d [X86] If we see an insert of a bitcast into zero vector, canonicalize it to move the bitcast to the other side of the insert.
This improves detection of zeroing of upper bits during isel.

llvm-svn: 315161
2017-10-08 01:33:41 +00:00
Craig Topper f7a19db649 [X86] Remove ISD::INSERT_SUBVECTOR handling from combineBitcastForMaskedOp. Add isel patterns to make up for it.
This will allow for some flexibility in canonicalizing bitcasts around insert_subvector.

llvm-svn: 315160
2017-10-08 01:33:40 +00:00
Craig Topper 16f2044fa8 [X86] Use getConstantOperandVal to simplify some code. NFC
llvm-svn: 315159
2017-10-08 01:33:38 +00:00
Simon Pilgrim 9508fe7924 [X86][SSE] Match bitcasted BUILD_VECTOR of constants for v2i64 shifts on 64-bit targets (PR34855)
Extension to rL315155, generate constant shifts on 64-bits as well as 32-bits.

llvm-svn: 315156
2017-10-07 17:57:22 +00:00
Simon Pilgrim 70e1db78db [X86][SSE] Match bitcasted v4i32 BUILD_VECTORS for v2i64 shifts on 64-bit targets (PR34855)
We were already doing this for 32-bit targets, but we can generate these on 64-bits as well.

llvm-svn: 315155
2017-10-07 17:42:17 +00:00
Craig Topper 2f60295364 [X86] Add X86ISD::CMOV to computeKnownBitsForTargetNode and ComputeNumSignBitsForTargetNode.
Summary: Implementations based on ISD::SELECT.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D38663

llvm-svn: 315153
2017-10-07 16:51:19 +00:00
Simon Pilgrim 73f143e774 [X86][SSE] Improve shuffling combining with horizontal operations
Recognise cases when we can merge the shuffles with their horizontal (HADD/HSUB/PACK) instruction inputs.

Replaces an older implementation which performed some of this during lowering, expanding an existing target shuffle combine stage instead.

Differential Revision: https://reviews.llvm.org/D38506

llvm-svn: 315150
2017-10-07 12:42:23 +00:00
Martin Storsjo 5e9d482b0a [X86] Update an outdated comment about SjLj
The SjLj intrinsics in the X86 backend are intended for use with
SjLj exception handling as well, since SVN r271244.

Differential Revision: https://reviews.llvm.org/D38532

llvm-svn: 315146
2017-10-07 06:00:32 +00:00
Craig Topper e79eff3bb5 [X86] Correct result type for the flag result of RDSEED and RDRAND nodes. Correct the CC type for the CMOV used with RDSEED/RDRAND.
The flag result was MVT::Glue, but should be MVT::i32. The CC type was MVT::i8, but should be MVT::i32.

llvm-svn: 315145
2017-10-07 05:11:59 +00:00
Artur Pilipenko 7b15254c8f [X86] Fix chains update when lowering BUILD_VECTOR to a vector load
The code which lowers BUILD_VECTOR of consecutive loads into a single vector
load doesn't update chains properly. As a result the vector load can be
reordered with the store to the same location.

The current code in EltsFromConsecutiveLoads only updates the chain following
the first load. The fix is to update the chains following all the loads
comprising the vector.

This is a fix for PR10114.

Reviewed By: niravd

Differential Revision: https://reviews.llvm.org/D38547

llvm-svn: 314988
2017-10-05 16:28:21 +00:00
Simon Pilgrim 9edbe110e8 [X86][AVX] Improve (i8 bitcast (v8i1 x)) handling for v8i64/v8f64 512-bit vector compare results.
AVX1/AVX2 targets were missing a chance to use vmovmskps for v8f32/v8i32 results for bool vector bitcasts

llvm-svn: 314921
2017-10-04 18:00:42 +00:00
Simon Pilgrim b47b3f2564 [X86][SSE] Add support for lowering v8i16 binary shuffles to PACKSS/PACKUS
Missed in D38472

llvm-svn: 314916
2017-10-04 17:31:28 +00:00
Simon Pilgrim 46a366ccb7 [X86][SSE] Early out from ComputeNumSignBitsForTargetNode. NFCI.
Early out from vector shift by immediates that will exceed eltsize - don't bother making an unnecessary ComputeNumSignBits recursive call.

llvm-svn: 314903
2017-10-04 13:41:26 +00:00
Simon Pilgrim bd5d2f0284 [X86][SSE] Add support for lowering unary shuffles to PACKSS/PACKUS
Extension to D38472

llvm-svn: 314901
2017-10-04 13:12:08 +00:00
Martin Storsjo e14145dcb0 [X86] Fix using the SJLJ jump table on x86_64
The previous version didn't work if the jump table base address didn't
fit in 32 bit, since it was encoded as an immediate offset. And in case
the jump table is encoded as 32 bit label differences, we need to
load and add them to the table base first.

This solves the first half of the issues mentioned in PR34720.

Also fix some of the errors pointed out by -verify-machineinstrs, by
using GR32_NOSPRegClass.

Differential Revision: https://reviews.llvm.org/D38333

llvm-svn: 314876
2017-10-04 05:12:10 +00:00
Simon Pilgrim cf99d069c3 [X86][SSE] Add support for decoding PACKSS/PACKUS shuffles masks with UNDEF
llvm-svn: 314792
2017-10-03 12:41:39 +00:00
Simon Pilgrim f5f291d129 [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
If the upper bits of a truncation shuffle patterns have at least the minimum number of sign/zero bits on their inputs then we can safely use PACKSS/PACKUS as shuffles.

Partial fix for https://bugs.llvm.org/show_bug.cgi?id=34773

Differential Revision: https://reviews.llvm.org/D38472

llvm-svn: 314788
2017-10-03 12:01:31 +00:00
Simon Pilgrim d87af9a1c0 Remove unused variable. NFCI.
llvm-svn: 314778
2017-10-03 10:01:02 +00:00
Simon Pilgrim 640fbf5132 [X86][SSE] Add support for shuffle combining from PACKSS/PACKUS
Mentioned in D38472

llvm-svn: 314777
2017-10-03 09:54:03 +00:00
Simon Pilgrim 19d535e75b [X86][SSE] Add support for PACKSS/PACKUS constant folding
Pulled out of D38472

llvm-svn: 314776
2017-10-03 09:41:00 +00:00
Martin Storsjo 1e54738676 [X86] Provide the LSDA pointer with RIP relative addressing if necessary
This makes sure the LSDA pointer isn't truncated to 32 bit.

Make LowerINTRINSIC_WO_CHAIN a member function instead of a static
function, so that it can use the getGlobalWrapperKind method.

This solves the second half of the issues mentioned in PR34720.

Differential Revision: https://reviews.llvm.org/D38343

llvm-svn: 314767
2017-10-03 06:29:58 +00:00
Bjorn Pettersson 8e978c0151 [X86][SSE] Fix -Wsign-compare problems introduced in r314658
The refactoring in
"[X86][SSE] Add createPackShuffleMask helper function. NFCI."
resulted in warning when compiling the code (seen in build bots).

This patch restores some types from int to unsigned to avoid
those warnings.

llvm-svn: 314667
2017-10-02 12:46:38 +00:00
Simon Pilgrim e2e27aff9b [X86][SSE] Add createPackShuffleMask helper function. NFCI.
llvm-svn: 314658
2017-10-02 10:12:51 +00:00
Simon Pilgrim c04c7443ea [X86][SSE] matchBinaryVectorShuffle - add support for different src/dst value shuffle types
Preparation for support for combining to PACKSS/PACKUS

llvm-svn: 314656
2017-10-02 09:45:08 +00:00
Simon Pilgrim 3bbbf31590 Fix typo in comment. NFCI.
llvm-svn: 314653
2017-10-02 09:10:50 +00:00
Simon Pilgrim e575651370 [X86] Cleanup uses of computeKnownBits by using MaskedValueIsZero helper instead. NFCI.
llvm-svn: 314652
2017-10-02 09:08:45 +00:00
Craig Topper bb7866162c [X86] Use a bool flag instead of assigning an unsigned to two different values that we only use in an equality comparison.
llvm-svn: 314647
2017-10-02 05:46:52 +00:00