llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	fe3bbb251b	[X86] Add a bunch of test cases for storing a scalar bitcasted from a vXi1 type. Currently a store combine will absorb the bitcast before our combine that turns bitcasts into movmsk gets a chance to run. This results in a store being created with a vXi1 type. Type legalization then promotes the input type and makes this a truncating store. Then we badly scalarize this store. Currently we avoid this on v8i1->i8 bitcasts due to an incompletely qualified (per the original intention) check in isLoadBitCastBeneficial. An easy fix is to disable this for all vXi1->iX bitcasts on pre-avx512 targets. We'll still generate terrible code if the IR explicitly contains a store of vXi1 without a bitcast. We could probably solve that by just turning all stores of vXi1 into (store (iX (bitcast))) as an early DAG combine. llvm-svn: 347631	2018-11-27 02:57:23 +00:00
Sterling Augustine	9cc1ffadc5	Notify the linker when a TU compiled with split-stack has a function without a prologue. More context here: https://go-review.googlesource.com/c/go/+/148819/ llvm-svn: 347614	2018-11-26 23:26:31 +00:00
Mircea Trofin	183df14520	Add new passes to X86 pipeline tests Summary: Fixes test failures introduced by rL347596. Reviewers: davidxl Reviewed By: davidxl Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D54916 llvm-svn: 347607	2018-11-26 22:49:17 +00:00
Mircea Trofin	cfbc1788d6	Support for inserting profile-directed cache prefetches Summary: Support for profile-driven cache prefetching (X86) This change is part of a larger system, consisting of a cache prefetches recommender, create_llvm_prof (https://github.com/google/autofdo), and LLVM. A proof of concept recommender is DynamoRIO's cache miss analyzer. It processes memory access traces obtained from a running binary and identifies patterns in cache misses. Based on them, it produces a csv file with recommendations. The expectation is that, by leveraging such recommendations, we can reduce the amount of clock cycles spent waiting for data from memory. A microbenchmark based on the DynamoRIO analyzer is available as a proof of concept: https://goo.gl/6TM2Xp. The recommender makes prefetch recommendations in terms of: * the binary offset of an instruction with a memory operand; * a delta; * and a type (nta, t0, t1, t2) meaning: a prefetch of that type should be inserted right before the instruction at that binary offset, and the prefetch should be for an address delta away from the memory address the instruction will access. For example: 0x400ab2,64,nta and assuming the instruction at 0x400ab2 is: movzbl (%rbx,%rdx,1),%edx means that the recommender determined it would be beneficial for a prefetchnta instruction to be inserted right before this instruction, as such: prefetchnta 0x40(%rbx,%rdx,1) movzbl (%rbx, %rdx, 1), %edx The workflow for prefetch cache instrumentation is as follows (the proof of concept script details these steps as well): 1. build binary, making sure -gmlt -fdebug-info-for-profiling is passed. The latter option will enable the X86DiscriminateMemOps pass, which ensures instructions with memory operands are uniquely identifiable (this causes ~2% size increase in total binary size due to the additional debug information). 2. collect memory traces, run analysis to obtain recommendations (see above-referenced DynamoRIO demo as a proof of concept). 3. use create_llvm_prof to convert recommendations to reference insertion locations in terms of debug info locations. 4. rebuild binary, using the exact same set of arguments used initially, to which -mllvm -prefetch-hints-file=<file> needs to be added, using the afdo file obtained at step 3. Note that if sample profiling feedback-driven optimization is also desired, that happens before step 1 above. In this case, the sample profile afdo file that was used to produce the binary at step 1 must also be included in step 4. The data needed by the compiler in order to identify prefetch insertion points is very similar to what is needed for sample profiles. For this reason, and given that the overall approach (memory tracing-based cache recommendation mechanisms) is under active development, we use the afdo format as a syntax for capturing this information. We avoid confusing semantics with sample profile afdo data by feeding the two types of information to the compiler through separate files and compiler flags. Should the approach prove successful, we can investigate improvements to this encoding mechanism. Reviewers: davidxl, wmi, craig.topper Reviewed By: davidxl, wmi, craig.topper Subscribers: davide, danielcdh, mgorny, aprantl, eraman, JDevlieghere, llvm-commits Differential Revision: https://reviews.llvm.org/D54052 llvm-svn: 347596	2018-11-26 21:36:18 +00:00
Craig Topper	b955bf382c	[LegalizeVectorTypes][X86][ARM][AArch64][PowerPC] Don't use SplitVecOp_TruncateHelper for FP_TO_SINT/UINT. SplitVecOp_TruncateHelper tries to promote the result type while splitting FP_TO_SINT/UINT. It then concatenates the result and introduces a truncate to the original result type. But it does this without inserting the AssertZExt/AssertSExt that the regular result type promotion would insert. Nor does it turn FP_TO_UINT into FP_TO_SINT the way normal result type promotion for these operations does. This is bad on X86 which doesn't support FP_TO_UINT until AVX512. This patch disables the use of SplitVecOp_TruncateHelper for these operations and just lets normal promotion handle it. I've tweaked a couple things in X86ISelLowering to avoid a few obvious regressions there. I believe all the changes on X86 are improvements. The other targets look neutral. Differential Revision: https://reviews.llvm.org/D54906 llvm-svn: 347593	2018-11-26 21:12:39 +00:00
Craig Topper	923f463ef2	[SelectionDAG] Teach BaseIndexOffset::match to unwrap the base after looking through an add/or We might find a target specific node that needs to be unwrapped after we look through an add/or. Otherwise we get inconsistent results if one pointer is just X86WrapperRIP and the other is (add X86WrapperRIP, C) Differential Revision: https://reviews.llvm.org/D54818 llvm-svn: 347591	2018-11-26 20:16:33 +00:00
Craig Topper	2754d1dca4	[X86] Add test case for D54818 llvm-svn: 347590	2018-11-26 20:16:31 +00:00
Than McIntosh	b9e4852c92	[CodeGen] Take SPAdj into account for STATEPOINT liveness args Summary: STATEPOINT records its args' locations on stack relative to SP. If the SP is changed, take that into account. This patch authored by Cherry Zhang <cherryyz@google.com>. Reviewers: thanm, reames Reviewed By: reames Subscribers: reames, llvm-commits Differential Revision: https://reviews.llvm.org/D53603 llvm-svn: 347569	2018-11-26 16:16:09 +00:00
Sanjay Patel	d31220e0de	[x86] promote all multiply i8 by constant to i32 We have these 2 "isDesirable" promotion hooks (I'm not sure why we need both of them, but that's independent of this patch), and we can adjust them to promote "mul i8 X, C" to i32. Then, all of our existing LEA and other multiply expansion magic happens as it would for i32 ops. Some of the test diffs show that we could end up with an actual 32-bit mul instruction here because we choose not to expand to simpler ops. That instruction could be slower depending on the subtarget. On the plus side, this means we don't need a separate instruction to load the constant operand and possibly an extra instruction to move the result. If we need to tune mul i32 further, we could add a later transform that tries to shrink it back to i8 based on subtarget timing. I did not bother to duplicate all of the 32-bit test file RUNs and target settings that exist to test whether LEA expansion is cheap or not. The diffs here assume a default target, so that means LEA is generally cheap. Differential Revision: https://reviews.llvm.org/D54803 llvm-svn: 347557	2018-11-26 15:22:30 +00:00
Craig Topper	b7a50e5796	[X86] Add test cases to show bad type legalization of fptosi/fptoui v16f32->v16i8 and v8f64->v8i16 on pre-AVX512 targets. When splitting the v16f32/v8f64 result type, type legalization will try to promote the integer result type before a concat and an explicit truncate. But for the fptoui test case this is particularly bad since fptoui isn't supported on X86 until AVX512. We could use an fptosi since the result range would fit in a signed 32-bit value, but the generic type legalization doesn't do that transformation when splitting. It does do this when promoting. llvm-svn: 347533	2018-11-26 06:50:19 +00:00
Sanjay Patel	7336e7c67a	[x86] limit transform for select-of-fp-constants This should likely be adjusted to limit this transform further, but these diffs should be clear wins. If we have blendv/conditional move, then we should assume those are cheap ops. The loads become independent of the compare, so those can be speculated before we need to use the values in the blend/mov. llvm-svn: 347526	2018-11-25 17:27:02 +00:00
Sanjay Patel	2e5a25c170	[x86] add tests for select-of-fp-constants; NFC There are many options here depending on subtarget, but we are uniformly relying on a transform that was driven by performance for a 32-bit SSE2 target in 2009. Note: The same motivation was apparently used to do this transform for all targets, so non-x86 may want to look at this too. llvm-svn: 347525	2018-11-25 16:54:43 +00:00
Sanjay Patel	7e119c0400	[DAG] consolidate shift simplifications ...and use them to avoid creating obviously undef values as discussed in the post-commit thread for r347478. The diffs in vector div/rem show that we were missing real optimizations by creating bogus shift nodes. llvm-svn: 347502	2018-11-23 20:05:12 +00:00
Sanjay Patel	e0cc876363	[x86] make test immune to oversized shift simplification I'm not sure if this actually preserves the original intent of this test, but if we leave it as-is, the -1 (oversized) shift should be folded to undef and allow deleting half of the output. llvm-svn: 347501	2018-11-23 19:45:29 +00:00
Craig Topper	0ec17884de	[LegalizeVectorTypes] Don't use SplitVecOp_TruncateHelper if we're heading towards scalarizing the type. This code takes a truncate, fp_to_int, or int_to_fp with a legal result type and an input type that needs to be split and enlarges the elements in the result type before doing the split. Then inserts a follow up truncate or fp_round after concatenating the two halves back together. But if the input type of the original op is being split on its way to ultimately being scalarized we're just going to end up building a vector from scalars and then truncating or rounding it in the vector register. Seems kind of silly to enlarge the result element type of the operation only to end up with scalar code and then building a vector with large elements only to make the elements smaller again in the vector register. Seems better to just try to get away producing smaller result types in the scalarized code. The X86 test case that changes is a pretty contrived test case that exists because of a bug we used to have in our AVG matching code. I think the code is better now, but it's not realistic anyway. llvm-svn: 347482	2018-11-23 02:32:13 +00:00
Craig Topper	b239763384	[LegalizeVectorTypes] Have SplitVecOp_TruncateHelper fall back to SplitVecOp_UnaryOp if splitting the output type would be a legal type. SplitVecOp_TruncateHelper tries to introduce a multilevel truncate to avoid scalarization. But if splitting the result type would still be a legal type we don't need to do that. The comment block at the top of the function implied that this was already implemented. I looked back through the history and it doesn't look to have ever been checked. llvm-svn: 347479	2018-11-22 22:56:52 +00:00
Sanjay Patel	3e80019275	[DAGCombiner] form 'not' ops ahead of shifts (PR39657) We fail to canonicalize IR this way (prefer 'not' ops to arbitrary 'xor'), but that would not matter without this patch because DAGCombiner was reversing that transform. I think we need this transform in the backend regardless of what happens in IR to catch cases where the shift-xor is formed late from GEP or other ops. https://rise4fun.com/Alive/NC1 Name: shl Pre: (-1 << C2) == C1 %shl = shl i8 %x, C2 %r = xor i8 %shl, C1 => %not = xor i8 %x, -1 %r = shl i8 %not, C2 Name: shr Pre: (-1 u>> C2) == C1 %sh = lshr i8 %x, C2 %r = xor i8 %sh, C1 => %not = xor i8 %x, -1 %r = lshr i8 %not, C2 https://bugs.llvm.org/show_bug.cgi?id=39657 llvm-svn: 347478	2018-11-22 19:24:10 +00:00
Sanjay Patel	1afd38f008	[x86] use FileCheck to verify output; NFC llvm-svn: 347438	2018-11-21 23:39:19 +00:00
Reid Kleckner	86ada54e4c	[mingw] Use unmangled name after the $ in the section name GCC does it this way, and we have to be consistent. This includes stdcall and fastcall functions with suffixes. I confirmed that a fastcall function named "foo" ends up in ".text$foo", not ".text$@foo@8". Based on a patch by Andrew Yohn! Fixes PR39218. Differential Revision: https://reviews.llvm.org/D54762 llvm-svn: 347431	2018-11-21 22:01:10 +00:00
Sanjay Patel	78e2b901e5	[x86] add tests for select-of-FP-constants; NFC llvm-svn: 347406	2018-11-21 19:14:38 +00:00
Sanjay Patel	cadf62f360	[x86] fix predicate for avoiding vblendv It only makes sense to produce the logic ops when 1 of the constants is +0.0. Otherwise, go with vblendv to reduce code. llvm-svn: 347403	2018-11-21 18:02:50 +00:00
Sanjay Patel	5ba384347c	[x86] add test for FP select with constant; NFC llvm-svn: 347401	2018-11-21 17:47:18 +00:00
Sanjay Patel	2c513f5b4b	[x86] add checks for asm to test; NFC llvm-svn: 347394	2018-11-21 15:26:35 +00:00
Simon Pilgrim	66bae9aee8	[X86][AVX] Remove BROADCAST if we only need the 0'th element We don't catch this with target shuffle simplification if the src/dst types are different. llvm-svn: 347386	2018-11-21 11:00:09 +00:00
Craig Topper	e9b4001a82	[X86] In getScalarMaskingNode, replace scalar_to_vector with a bitcast to v8i1 and an extract_subvector to convert i8 to v1i1. The bitcast can be nicely merged with any i8 loads that exist for argument passing in 32 mode for example. llvm-svn: 347380	2018-11-21 07:01:22 +00:00
Craig Topper	27a5896fe8	[X86] Correct 256 vpmovzx/vpmovsx isel patterns to check HasAVX2 instead of HasAVX to prevent fast-isel from using them incorrectly. These are AVX2 instructions, but have been incorrectly marked in tablegen for a while. This wasn't a problem until r346784 switched the patterns to use target independent ISD opcodes. This made the patterns visible to fast isel. Fixes PR39733 llvm-svn: 347375	2018-11-21 01:39:38 +00:00
Craig Topper	8b48587f5b	[X86] Add a copy of avx512-trunc.ll with -x86-experimental-vector-widening-legalization enabled. llvm-svn: 347374	2018-11-21 01:39:35 +00:00
Craig Topper	aa52ee2770	[X86] Emit a PACKUS instead of a VECTOR_SHUFFLE from LowerTRUNCATE for v16i16->v16i8. We can't guarantee that demanded bits passing through the vector shuffle won't cause the AND in front of this to be removed. This would prevent the PACKUS from being matched during shuffle lowering. Unfortunately, this adds a packuswb to one of the vector-reduce-mul.ll tests since we were removing the shuffle via SimplifyDemandedVectorElts. We appear to have similar issues with vpmovwb on the same test case on other targets. llvm-svn: 347361	2018-11-20 22:57:48 +00:00
Sanjay Patel	357053f289	[DAGCombiner] look through bitcasts when trying to narrow vector binops This is another step in vector narrowing - a follow-up to D53784 (and hoping to eventually squash potential regressions seen in D51553). The x86 test diffs are wins, but the AArch64 diff is probably not. That problem already exists independent of this patch (see PR39722), but it went unnoticed in the previous patch because there were no regression tests that showed the possibility. The x86 diff in i64-mem-copy.ll is close. Given the frequency throttling concerns with using wider vector ops, an extra extract to reduce vector width is the right trade-off at this level of codegen. Differential Revision: https://reviews.llvm.org/D54392 llvm-svn: 347356	2018-11-20 22:26:35 +00:00
Craig Topper	24b346da42	[X86] Emit a single shuffle for the v16i8->v4i32 step of a SIGN_EXTEND_VECTOR_INREG lowering on pre-sse4.1 targets. Previously we emitted two separate shuffles, one for unpcklbw and one for unpcklwd. Instead emit a single shuffle equivalent to both of the original shuffles. Shuffle lowering seems able to handle it. This avoids a bitcast between the two shuffles which seems helpful to DAG combine. Remove the custom type legalization for v8i8->v8i32. I had put that in to avoid some almost duplicate punpcklbw instructions I was seeing, but this lowering change seems to fix that. It also fixes some duplicate shuffles seen in vector-sext.ll llvm-svn: 347348	2018-11-20 21:21:52 +00:00
Sanjay Patel	fa78c228a3	[x86] add tests for 8-bit multiply with constant; NFC This is based on the existing file for 16-bit. We also already have 32-bit and 64-bit variants. llvm-svn: 347341	2018-11-20 19:45:53 +00:00
Simon Pilgrim	368a199236	[X86] Remove -verify-machineinstrs=0 now that PR38391 is fixed. llvm-svn: 347335	2018-11-20 18:08:56 +00:00
Sanjay Patel	8aeffd8c57	[AArch64, x86] add tests for shift-not (PR39657); NFC llvm-svn: 347316	2018-11-20 15:49:42 +00:00
Simon Pilgrim	3735105961	[DAGCombine] Add calls to SimplifyDemandedVectorElts from visitINSERT_SUBVECTOR (PR37989) This uncovered an off-by-one typo in SimplifyDemandedVectorElts's INSERT_SUBVECTOR handling as its bounds check was bailing on safe indices. llvm-svn: 347313	2018-11-20 15:23:50 +00:00
Simon Pilgrim	ee8b96f253	[X86][SSE] Add computeKnownBits/ComputeNumSignBits support for PACKSS/PACKUS instructions. Pull out getPackDemandedElts demanded elts remapping helper from computeKnownBitsForTargetNode and use in computeKnownBits/ComputeNumSignBits. llvm-svn: 347303	2018-11-20 13:23:37 +00:00
Simon Pilgrim	b356d0463e	[TargetLowering] Improve SimplifyDemandedVectorElts/SimplifyDemandedBits support For bitcast nodes from larger element types, add the ability for SimplifyDemandedVectorElts to call SimplifyDemandedBits by merging the elts mask to a bits mask. I've raised https://bugs.llvm.org/show_bug.cgi?id=39689 to deal with the few places where SimplifyDemandedBits's lack of vector handling is a problem. Differential Revision: https://reviews.llvm.org/D54679 llvm-svn: 347301	2018-11-20 12:02:16 +00:00
Simon Pilgrim	a6fb85ffa7	[X86][SSE] Lower immediately to PACKUS instead of VECTOR_SHUFFLE. As discussed on rL347240, this avoids some regressions on D54679 and also helps some combines to kick in a bit earlier. llvm-svn: 347300	2018-11-20 11:46:37 +00:00
Simon Pilgrim	7198506ba8	[X86][SSE] Add SimplifyDemandedVectorElts support for PACKSS/PACKUS instructions. As discussed on rL347240. llvm-svn: 347299	2018-11-20 11:09:46 +00:00
Craig Topper	17fa42a69b	[X86] Preserve undef information when creating a punpckl/hbw from a v16i8 where all the even or odd elements are undef. Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output. As near as I can tell this makes v16i8 behavior consistent with every other VT now. This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types. llvm-svn: 347296	2018-11-20 09:04:01 +00:00
Craig Topper	c733c7bf94	[X86] Replace more calls to getZeroVector with regular getConstant. getZeroVector produces a specifically canonicalized zero vector, but we can just let DAG legalization take care of it. The test changes are because MULH lowering happens later than it should and this change gave us the opportunity to constant fold away a multiply during a DAG combine before the build_vector got legalized with a bitcast. llvm-svn: 347290	2018-11-20 06:54:01 +00:00
Craig Topper	4954c66430	[SelectionDAG] Compute known bits and num sign bits for live out vector registers. Use it to add AssertZExt/AssertSExt in the live in basic blocks Summary: We already support this for scalars, but it was explicitly disabled for vectors. In the updated test cases this allows us to see the upper bits are zero to use less multiply instructions to emulate a 64 bit multiply. This should help with this ispc issue that a coworker pointed me to https://github.com/ispc/ispc/issues/1362 Reviewers: spatel, efriedma, RKSimon, arsenm Reviewed By: spatel Subscribers: wdng, llvm-commits Differential Revision: https://reviews.llvm.org/D54725 llvm-svn: 347287	2018-11-20 04:30:26 +00:00
Craig Topper	dbe3473634	[X86] Add test case to show missed opportunity to use a single pmuludq to implement a multiply when a zext lives in another basic block. This can occur when one of the inputs to the multiply is loop invariant. Though my test cases just use two basic blocks with an unconditional jump which we won't merge until after isel in the codegen pipeline. For scalars, I believe SelectionDAGBuilder can add an AssertZExt to pass knowledge across basic blocks but it's explicitly disabled for vectors. llvm-svn: 347266	2018-11-19 22:04:12 +00:00
Simon Pilgrim	c4861ab170	[X86][SSE] Remove unnecessary bit-and in pshufb vector ctlz (PR39703) SSE PSHUFB vector ctlz lowering works at the i4 nibble level. As detailed in PR39703, we were masking the lower nibble off but we only actually use it in the case where the upper nibble is known to be zero, making it safe to remove the mask and save an instruction. Differential Revision: https://reviews.llvm.org/D54707 llvm-svn: 347242	2018-11-19 18:40:59 +00:00
Craig Topper	311bbcd535	[X86] Attempt to improve v32i8/v64i8 multiply lowering by applying the v16i8 non-avx2 algorithm to each 128-bit lane. Previously we split the vectors in half to allow the two halves to be any extended then concatenated the results back together. This patch instead extends the v16i8 sse algorithm to extend half of each 128-bit lane using punpcklbw/punpckhbw. Multiplies all the low half lanes and high half lanes together in separate operations. Then merges the half lane results back together using packuswb. Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split. Differential Revision: https://reviews.llvm.org/D54668 llvm-svn: 347240	2018-11-19 18:32:53 +00:00
Sanjay Patel	b25adf5edb	[SelectionDAG] simplify vector select with undef operand(s) llvm-svn: 347227	2018-11-19 17:06:05 +00:00
Sanjay Patel	60abc29b0a	[x86] add/make tests immune to improvements in undef simplification llvm-svn: 347217	2018-11-19 15:33:44 +00:00
Sanjay Patel	a1dca3553e	[SelectionDAG] simplify select FP with undef condition llvm-svn: 347212	2018-11-19 14:42:28 +00:00
Sanjay Patel	7a51bdcf3b	[x86] add test for select FP with undef condition; NFC llvm-svn: 347211	2018-11-19 14:39:57 +00:00
Simon Pilgrim	f6c2fbdd1a	[X86] Add codegen tests for slow-shld scalar funnel shifts llvm-svn: 347195	2018-11-19 12:29:41 +00:00
Craig Topper	8b22bcd39f	[X86] Use a pcmpgt with 0 instead of psrad 31, to fill elements with the sign bit in v4i32 MULH lowering. The shift requires a copy to avoid clobbering a register. Comparing with 0 uses an xor to produce 0 that will be overwritten with the compare results. So still requires 2 instructions, but should be one byte shorter since it doesn't need to encode an immediate. llvm-svn: 347185	2018-11-19 07:22:26 +00:00
Craig Topper	3616891046	[X86] Use compare with 0 to fill an element with sign bits when sign extending to v2i64 pre-sse4.1 Previously we used an arithmetic shift right by 31, but that requires a copy to preserve the input. So we might as well materialize a zero and compare to it since the comparison will overwrite the register that contains the zeros. This should be one byte shorter. llvm-svn: 347181	2018-11-19 04:33:20 +00:00
Craig Topper	053f1eea96	[X86] Remove most of the SEXTLOAD Custom setOperationAction calls under -x86-experimental-vector-widening-legalization. Leave just the v4i8->v4i64 and v8i8->v8i64, but only enable them on pre-sse4.1 targets when 64-bit mode is enabled. In those cases we end up creating sext loads that get scalarized to code that looks better than what we get from loading into a vector register and doing a multiple step sign extend using unpacks and shifts. llvm-svn: 347180	2018-11-19 00:33:16 +00:00
Simon Pilgrim	7f92efa5a9	[X86][SSE] Add SimplifyDemandedVectorElts support for SSE packed i2fp conversions. llvm-svn: 347177	2018-11-18 22:13:31 +00:00
Craig Topper	0468c860b7	[X86] Add custom type legalization for extending v4i8/v4i16->v4i64. Pre-SSE4.1 sext_invec for v2i64 is complicated because we don't have a v2i64 sra instruction. So instead we sign extend to i32 using unpack and sra, then copy the elements and do a v4i32 sra to fill with sign bits, then interleave the i32 sign extend and the sign bits. So really we're doing two sign extends but only using half of the v4i32 intermediate result. When the result is more than 128 bits, default type legalization would prefer to split the destination type all the way down to v2i64 with shuffles followed by v16i8/v8i16->v2i64 sext_inreg operations. This results in more instructions than necessary because we are only utilizing the lower 2 elements of the v4i32 intermediate result. Instead we can custom split a v4i8/v4i16->v4i64 sign_extend. Then we can sign extend v4i8/v4i16->v4i32 invec producing a full v4i32 result. Create the sign bit vector as a v4i32 then split and interleave with the sign bits using a punpckldq and punpckhdq. llvm-svn: 347176	2018-11-18 21:28:50 +00:00
Craig Topper	950f3842cc	[X86] Add a 32-bit command line with only sse2 to vector-sext.ll and vector-sext.ll to show some of the scalarized load sequences without 64-bit scalar support. Some of these sequences look pretty bad since we have to copy the sign bit from a 32 bit register to a 64 bit register to finish a sign extend. llvm-svn: 347175	2018-11-18 21:28:47 +00:00
Simon Pilgrim	b31bdbd2e9	[X86][SSE] Add SimplifyDemandedVectorElts support for SSE splat-vector-shifts. SSE vector shifts only use the bottom 64-bits of the shift amount vector. llvm-svn: 347173	2018-11-18 20:21:52 +00:00
Craig Topper	11d50948e2	[X86] Disable combineToExtendVectorInReg under -x86-experimental-vector-widening-legalization. Add custom type legalization for extends. If we widen illegal types instead of promoting, we should be able to rely on the type legalizer to create the vector_inreg operations for us with some caveats. This patch disables combineToExtendVectorInReg when we are using widening. I've enabled custom legalization for v8i8->v8i64 extends under avx512f since the type legalizer would want to create a vector_inreg with a v64i8 input type which isn't legal without avx512bw. So we go to v16i8 with custom code using the relaxation of rules we get from D54346. I've also enabled custom legalization of v8i64 and v16i32 operations with AVX. When the input type is 128 bits, the default splitting legalization would extend first 128->256, then do a split to two 128 pieces. Extend each half to 256 and then concat the result. The custom legalization I've added instead uses a 128->256 bit vector_inreg extend that only reads the lower 64-bits for the low half of the split. Then shuffles the high 64-bits to the low 64-bits and does another vector_inreg extend. llvm-svn: 347172	2018-11-18 18:11:25 +00:00
Craig Topper	bc8148f7b0	[X86] Lower v16i16->v16i8 truncate using an 'and' with 255, an extract_subvector, and a packuswb instruction. Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise. Reviewers: RKSimon, spatel Reviewed By: RKSimon Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D54671 llvm-svn: 347171	2018-11-18 17:59:28 +00:00
Sanjay Patel	8c0cd77bff	[DAG] add undef simplifications for select nodes Sadly, this duplicates (twice) the logic from InstSimplify. There might be some way to at least share the DAG versions of the code, but copying the folds seems to be the standard method to ensure that we don't miss these folds. Unlike in IR, we don't run DAGCombiner to fixpoint, so there's no way to ensure that we do these kinds of simplifications unless the code is repeated at node creation time and during combines. There were other tests that would become worthless with this improvement that I changed as pre-commits: rL347161 rL347164 rL347165 rL347166 rL347167 I'm not sure how to salvage the remaining tests (diffs in this patch). So the x86 tests verify that the new code is working as intended. The AMDGPU test is actually similar to my motivating case: we have some undef value that has survived to machine IR in an x86 test, and then it gets folded in some weird way, or we crash if we don't transfer the undef flag. But we would have been better off never getting to that point by doing these simplifications. This will lead back to PR32023 someday... https://bugs.llvm.org/show_bug.cgi?id=32023 llvm-svn: 347170	2018-11-18 17:36:23 +00:00
Sanjay Patel	bc23408fe5	[x86] regenerate full checks; NFC llvm-svn: 347167	2018-11-18 16:56:17 +00:00
Simon Pilgrim	fec9f8657b	[X86][SSE] Relax IsSplatValue - remove the 'variable shift' limit on subtracts. Means we don't use the per-lane-shifts as much when we can cheaply use the older splat-variable-shifts. llvm-svn: 347162	2018-11-18 15:52:08 +00:00
Sanjay Patel	40509997eb	[x86] make tests immune to improvements in undef handling llvm-svn: 347161	2018-11-18 15:27:19 +00:00
Simon Pilgrim	7fdbae3224	[X86][SSE] Add some generic masked gather codegen tests llvm-svn: 347159	2018-11-18 14:35:57 +00:00
Simon Pilgrim	cc1f5d2407	[X86][SSE] Use raw shuffle mask decode in SimplifyDemandedVectorEltsForTargetNode (PR39549) We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs. llvm-svn: 347158	2018-11-18 13:34:53 +00:00
Craig Topper	f56a57518d	[X86] Don't use a pmaddwd for vXi32 multiply if the inputs are zero extends from i8 or smaller without SSE4.1. Prefer to shrink the mul instead. The zero extend will require two stages of unpacks to implement. So it's better to shrink the multiply using pmullw and then extend that result back to v4i32 using a single unpack. llvm-svn: 347149	2018-11-18 05:53:21 +00:00
Craig Topper	0438d791fa	[X86] Add support for matching PACKUSWB from a v64i8 shuffle. llvm-svn: 347143	2018-11-17 18:54:43 +00:00
Craig Topper	c6c760f07f	[X86] Add test case to show missed opportunity to use PACKUSWB in v64i8 shuffle lowering. llvm-svn: 347142	2018-11-17 18:54:41 +00:00
Simon Pilgrim	0e1a9d5ee6	[X86][SSE] Add shuffle demanded elts test case for PR39549 llvm-svn: 347139	2018-11-17 14:06:03 +00:00
Craig Topper	dd61f11642	[X86] Don't extend v32i8 multiplies to v32i16 with avx512bw and prefer-vector-width=256. llvm-svn: 347131	2018-11-17 02:36:07 +00:00
Craig Topper	d8da95bbe3	[X86] Add test cases to show incorrect use of a 512 bit vector in v32i8 multiply lowering with prefer-vector-width=256. On the min-legal-vector-width test this actually causes some of the v32i16 operations we emitted to be scalarized. llvm-svn: 347130	2018-11-17 02:36:02 +00:00
Stanislav Mekhanoshin	0ff7c8309d	DAG combiner: fold (select, C, X, undef) -> X Differential Revision: https://reviews.llvm.org/D54646 llvm-svn: 347110	2018-11-16 23:13:38 +00:00
Craig Topper	ee0333b4a9	[X86] Add custom promotion of narrow fp_to_uint/fp_to_sint operations under -x86-experimental-vector-widening-legalization. This tries to force the result type to vXi32 followed by a truncate. This can help avoid scalarization that would otherwise occur. There's some annoying examples of an avx512 truncate instruction followed by a packus where we should really be able to just use one truncate. But overall this is still a net improvement. llvm-svn: 347105	2018-11-16 22:53:00 +00:00
Rong Xu	3a38175723	[X86] Disable Condbr_merge pass Disable Condbr_merge pass for now due to PR39658. Will reenable the pass once the bug is fixed. llvm-svn: 347079	2018-11-16 19:35:00 +00:00
Simon Pilgrim	96f7924fe2	[X86] Add codegen tests for scalar funnel shifts llvm-svn: 347066	2018-11-16 17:48:52 +00:00
Sanjay Patel	8da76a6581	[x86] regenerate complete checks for test; NFC llvm-svn: 347051	2018-11-16 14:44:20 +00:00
Roman Lebedev	90c5b3f78e	[X86] X86DAGToDAGISel::matchBitExtract(): extract 'lshr' from `X` Summary: As discussed in previous review, and noted in the FIXME, if `X` is actually an `lshr Y, Z` (logical!), we can fold the `Z` into `control`, and let the `BEXTR` do this too. We could just insert those 8 bits of shift amount into control, but it is better to instead zero-extend them, and 'or' them in place. We can only do this for `lshr`, not `ashr`, because we do not know that the mask covers only the bits of `Y`, and not any of the sign-extended bits. The obvious question is, is this actually legal to do? I believe it is. Relevant quotes, from `Intel® 64 and IA-32 Architectures Software Developer’s Manual`, `BEXTR — Bit Field Extract`: * `Bit 7:0 of the second source operand specifies the starting bit position of bit extraction.` * `A START value exceeding the operand size will not extract any bits from the second source operand.` * `Only bit positions up to (OperandSize -1) of the first source operand are extracted.` * `All higher order bits in the destination operand (starting at bit position LENGTH) are zeroed.` * `The destination register is cleared if no bits are extracted.` FIXME: if we can do this, I wonder if we should prefer `BEXTR` over `BZHI` in such cases. Reviewers: RKSimon, craig.topper, spatel, andreadb Reviewed By: RKSimon, craig.topper, andreadb Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D54095 llvm-svn: 347048	2018-11-16 13:04:54 +00:00
Craig Topper	079c37da58	[X86] Add custom type legalization for v2i8/v4i8/v8i8 mul under -x86-experimental-vector-widening. By early promoting the multiply to use an i16 element type we can avoid having op legalization emit a second multiply for the 8 upper elements of the v16i8 type we would otherwise get. llvm-svn: 347032	2018-11-16 06:15:21 +00:00
Craig Topper	dc957d49f9	[X86] Add some test cases for vector multiplies on vectors shorter than 128 bits with -x86-experimental-vector-widening-legalization. llvm-svn: 347031	2018-11-16 06:15:20 +00:00
Craig Topper	c93ae2b0a2	Revert r347014 "[X86] Add some test cases for vector multiplies on vectors shorter than 128 bits with -x86-experimental-vector-widening-legalization." Apparently I failed to update this after turning sign extend to any extend. llvm-svn: 347015	2018-11-16 01:57:55 +00:00
Craig Topper	36920b44f7	[X86] Add some test cases for vector multiplies on vectors shorter than 128 bits with -x86-experimental-vector-widening-legalization. llvm-svn: 347014	2018-11-16 01:52:32 +00:00
Craig Topper	5802b82b40	[X86] Use ANY_EXTEND instead of SIGN_EXTEND in the AVX2 and later path for legalizing vXi8 multiply. We aren't going to use the upper bits of the multiply result that the extend would affect. So we don't need a specific type of extend. This makes some reduction test cases shorter because we were previously trying to sign_extend a truncate which we can't eliminate. llvm-svn: 347011	2018-11-16 01:16:59 +00:00
Craig Topper	73bb04ab6f	[X86] Add -x86-experimental-vector-widening support to reduceVMULWidth and combineMulToPMADDWD In reduceVMULWidth, we no longer need to worry about extending the vector to 128 bits first. Regular widening of extends, muls and shuffles will take care of that for us. In combineMulToPMADDWD, we can handle v2i32 multiplies and allow the VPMADDWD to be widened to v4i32 during type legalization by adding custom widening like we have for AVG/ADDUS/SUBUS. I had to modify that code a little to allow different input and output VTs. Differential Revision: https://reviews.llvm.org/D54512 llvm-svn: 346980	2018-11-15 18:59:31 +00:00
Simon Pilgrim	0db8cb0147	[X86] Fix MCNullStreamer support for modules with a CodeView flag This fixes -filetype=null support when compiling for a Win32 target and the module has a CodeView flag. The only places changed are the uses of getTargetStreamer function - this patch guards both of them with null checks. Committed on behalf of @eush (Eugene Sharygin) Differential Revision: https://reviews.llvm.org/D54008 llvm-svn: 346962	2018-11-15 15:17:15 +00:00
Craig Topper	553ac560aa	[X86] Add some custom type legalization rules for truncate with -x86-experimental-vector-widening-legalization. This avoids some nasty shuffles when we have avx512. It will also prevent using zmm truncate instructions when a ymm instruction that zeroes part of an xmm register will do. Also avoid using avx512 truncate instructions when the input is 128 bits or less. These instructions are 2 uops on skx so we can probably find a better single uop shuffle like pshufb. llvm-svn: 346936	2018-11-15 08:23:40 +00:00
Craig Topper	926dbdd601	[X86] Add -x86-experimental-vector-widening-legalization versions of shuffle-vs-trunc tests. llvm-svn: 346935	2018-11-15 08:23:37 +00:00
Craig Topper	ea6ced9d1a	[X86] Don't mark SEXTLOADS with narrow types as Custom with -x86-experimental-vector-widening-legalization. The narrow types end up requesting widening, but generic legalization will end up scalarizing and using a build_vector to do the widening. llvm-svn: 346916	2018-11-15 00:21:41 +00:00
Craig Topper	0b2089da4b	[X86] Support v2i32/v4i16/v8i8 load/store using f64 on 32-bit targets under -x86-experimental-vector-widening-legalization. On 64-bit targets the type legalizer will use i64 to legalize these. But when i64 isn't legal, the type legalizer won't try an FP type. So do it manually instead. There are a few regressions in here due to some v2i32 operations like mul and div now being reassembled into a full vector just to store instead of storing the pieces. But this was already occurring in 64-bit mode so it's not a new issue. llvm-svn: 346908	2018-11-14 23:02:09 +00:00
Simon Pilgrim	e8cc5e4e03	[X86] Update masked expandload/compressstore test names llvm-svn: 346903	2018-11-14 22:44:08 +00:00
Simon Pilgrim	9d9353aef5	[X86][SSE] Add SSE2/SSE42 masked load/store tests Now that the load/store tests are split, running the tests on multiple (illegal) targets has much less impact llvm-svn: 346896	2018-11-14 21:31:50 +00:00
Nirav Dave	1241dcb3cf	Bias physical register immediate assignments The machine scheduler currently biases register copies to/from physical registers to be closer to their point of use / def to minimize their live ranges. This change extends this to physical register assignments from immediate values as well. This causes a reduction in overall register pressure and a minor reduction in spills, and indirectly fixes an out-of-registers assertion (PR39391). Most test changes are from minor instruction reorderings and register name selection changes and direct consequences of that. Reviewers: MatzeB, qcolombet, myatsina, pcc Subscribers: nemanjai, jvesely, nhaehnle, eraman, hiraditya, javed.absar, arphaman, jfb, jsji, llvm-commits Differential Revision: https://reviews.llvm.org/D54218 llvm-svn: 346894	2018-11-14 21:11:53 +00:00
Simon Pilgrim	be527b545f	[X86] Split masked load/store test files llvm-svn: 346889	2018-11-14 20:44:59 +00:00
Simon Pilgrim	7f15568c40	[X86] Update masked load/store test names llvm-svn: 346887	2018-11-14 20:25:50 +00:00
Craig Topper	6c94264b1f	[X86] Allow pmulh to be formed from narrow vXi16 vectors under -x86-experimental-vector-widening-legalization Narrower vectors will be widened to 128 bits without changing the element size. And generic type legalization can already handle widening mulhu/mulhs. Differential Revision: https://reviews.llvm.org/D54513 llvm-svn: 346879	2018-11-14 18:16:21 +00:00
Simon Pilgrim	7501780ec6	[X86][AVX512] Remove constant pool shuffle decoding from SelectionDAG This patch removes the last use of the constant pool shuffle decode helper and consistently uses the 'getTargetShuffleMaskIndices' versions instead. The constant pool versions are now purely used for assembly comments. The avx512vbmi intrinsic upgrades had to be altered as they were being decoded as broadcasts, similar to what I fixed in rL346032. I don't think the change is critical, although it's annoying that we lose the {k}{z} instruction test coverage as they are tricky to generate. Differential Revision: https://reviews.llvm.org/D54083 llvm-svn: 346850	2018-11-14 11:26:35 +00:00
Craig Topper	789cc8170d	[X86] Add -x86-experimental-vector-widening command lines to pmulh.ll I've only added sse2 and sse4.1 variants as I'm only interested in the two v4i16 tests and I don't expect that to differ with AVX other than a v prefix. llvm-svn: 346834	2018-11-14 07:51:26 +00:00
Cameron McInally	cbde0d9c7b	[IR] Add a dedicated FNeg IR Instruction The IEEE-754 Standard makes it clear that fneg(x) and fsub(-0.0, x) are two different operations. The former is a bitwise operation, while the latter is an arithmetic operation. This patch creates a dedicated FNeg IR Instruction to model that behavior. Differential Revision: https://reviews.llvm.org/D53877 llvm-svn: 346774	2018-11-13 18:15:47 +00:00
Craig Topper	333ab7d08b	[X86] Add more tests for -x86-experimental-vector-widening-legalization I'm looking into whether we can make this the default legalization strategy. Adding these tests to help cover the changes that will be necessary. This patch adds copies of some tests with the command line switch enabled. By making copies it's easier to compare the two legalization strategies. I've also removed RUN lines from some of these tests that already had -x86-experimental-vector-widening-legalization llvm-svn: 346745	2018-11-13 07:47:52 +00:00
Simon Pilgrim	e565e5a962	[X86][SSE] Add lowerVectorShuffleAsByteRotateAndPermute (PR39387) This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single input permute to put them in the right location. Differential Revision: https://reviews.llvm.org/D54267 llvm-svn: 346706	2018-11-12 21:12:38 +00:00
Craig Topper	c48712b341	[X86] In LowerMULH, use generic truncate and vector shuffle nodes instead of directly emitting PACKUS. Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis. This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not. llvm-svn: 346697	2018-11-12 19:37:29 +00:00
Paul Robinson	5b302bfc8e	[DWARFv5] Emit split type units in .debug_info.dwo. Differential Revision: https://reviews.llvm.org/D54350 llvm-svn: 346674	2018-11-12 16:55:11 +00:00


12919 Commits