llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	fb7aa57bf1	[X86][SSE] Introduce Float/Vector WriteMove, WriteLoad and WriteStore scheduler classes As discussed on D44428 and PR36726, this patch splits off WriteFMove/WriteVecMove, WriteFLoad/WriteVecLoad and WriteFStore/WriteVecStore scheduler classes to permit vectors to be handled separately from gpr/scalar types. I've minimised the diff here by only moving various basic SSE/AVX vector instructions across - we can fix the rest when called for. This does fix the MOVDQA vs MOVAPS/MOVAPD discrepancies mentioned on D44428. Differential Revision: https://reviews.llvm.org/D44471 llvm-svn: 327630	2018-03-15 14:45:30 +00:00
Simon Pilgrim	69a4132f63	[X86] Regenerate schedule tests with zero latency comments llvm-svn: 327628	2018-03-15 14:30:59 +00:00
Craig Topper	ff6e82c9d0	[X86] Add test cases for 512-bit addsub from build_vector. There is no 512 bit addsub instruction, but we partially match it to handle fmaddsub matching. We explicitly bail out for 512 bit vectors after failing the fmaddsub match, but we had no test coverage for that bail out. We might want to consider splitting and using 256 bit instructions instead of the long sequence seen here. llvm-svn: 327605	2018-03-15 06:49:01 +00:00
Craig Topper	26a3a80c87	[X86] Add support for matching FMSUBADD from build_vector. llvm-svn: 327604	2018-03-15 06:14:55 +00:00
Reid Kleckner	3a7a2e4a0a	[FastISel] Sink local value materializations to first use Summary: Local values are constants, global addresses, and stack addresses that can't be folded into the instruction that uses them. For example, when storing the address of a global variable into memory, we need to materialize that address into a register. FastISel doesn't want to materialize any given local value more than once, so it generates all local value materialization code at EmitStartPt, which always dominates the current insertion point. This allows it to maintain a map of local value registers, and it knows that the local value area will always dominate the current insertion point. The downside is that local value instructions are always emitted without a source location. This is done to prevent jumpy line tables, but it means that the local value area will be considered part of the previous statement. Consider this C code: call1(); // line 1 ++global; // line 2 ++global; // line 3 call2(&global, &local); // line 4 Today we end up with assembly and line tables like this: .loc 1 1 callq call1 leaq global(%rip), %rdi leaq local(%rsp), %rsi .loc 1 2 addq $1, global(%rip) .loc 1 3 addq $1, global(%rip) .loc 1 4 callq call2 The LEA instructions in the local value area have no source location and are treated as being on line 1. Stepping through the code in a debugger and correlating it with the assembly won't make much sense, because these materializations are only required for line 4. This is actually problematic for the VS debugger "set next statement" feature, which effectively assumes that there are no registers live across statement boundaries. By sinking the local value code into the statement and fixing up the source location, we can make that feature work. This was filed as https://bugs.llvm.org/show_bug.cgi?id=35975 and https://crbug.com/793819. 
This change is obviously not enough to make this feature work reliably in all cases, but I felt that it was worth doing anyway because it usually generates smaller, more comprehensible -O0 code. I measured a 0.12% regression in code generation time with LLC on the sqlite3 amalgamation, so I think this is worth doing. There are some special cases worth calling out in the commit message: 1. local values materialized for phis 2. local values used by no-op casts 3. dead local value code Local values can be materialized for phis, and this does not show up as a vreg use in MachineRegisterInfo. In this case, if there are no other uses, this patch sinks the value to the first terminator, EH label, or the end of the BB if nothing else exists. Local values may also be used by no-op casts, which adds the register to the RegFixups table. Without reversing the RegFixups map direction, we don't have enough information to sink these instructions. Lastly, if the local value register has no other uses, we can delete it. This comes up when fastisel tries two instruction selection approaches and the first materializes the value but fails and the second succeeds without using the local value. Reviewers: aprantl, dblaikie, qcolombet, MatzeB, vsk, echristo Subscribers: dotdash, chandlerc, hans, sdardis, amccarth, javed.absar, zturner, llvm-commits, hiraditya Differential Revision: https://reviews.llvm.org/D43093 llvm-svn: 327581	2018-03-14 21:54:21 +00:00
Francis Visoiu Mistrih	e85b06d65f	[CodeGen] Use MIR syntax for MachineMemOperand printing Get rid of the "; mem:" suffix and use the one we use in MIR: ":: (load 2)". rdar://38163529 Differential Revision: https://reviews.llvm.org/D42377 llvm-svn: 327580	2018-03-14 21:52:13 +00:00
Simon Pilgrim	adf72e8549	[X86] Add haswell testing for PR35635 as well. To improve complete model testing for schedulers for instructions with multiple results. llvm-svn: 327572	2018-03-14 21:03:09 +00:00
Craig Topper	9c098ed819	[X86] Add back fast-isel code for handling i8 shifts. I removed this in r316797 because the coverage report showed no coverage and I thought it should have been handled by the auto generated table. I now see that there is code that bypasses the table if the shift amount is out of bounds. This adds back the code. We'll codegen out of bounds i8 shifts to effectively (amount & 0x1f). The 0x1f is a strange quirk of x86 that shift amounts are always masked to 5 bits (except 64-bit). So if the masked value is still out of bounds the result will be 0. Fixes PR36731. llvm-svn: 327540	2018-03-14 17:57:19 +00:00
Craig Topper	b36cb20ef9	[X86] Teach X86TargetLowering::targetShrinkDemandedConstant to set non-demanded bits if it helps create an AND mask that can be matched as a zero extend. I had to modify the bswap recognition to allow unshrunk masks to make this work. Fixes PR36689. Differential Revision: https://reviews.llvm.org/D44442 llvm-svn: 327530	2018-03-14 16:55:15 +00:00
Simon Pilgrim	d1c3c995c0	[X86][AVX] Use WriteFShuffleLd for broadcast reg-mem instructions They shouldn't be treated as pure loads. Found while investigating D44428 llvm-svn: 327524	2018-03-14 15:47:08 +00:00
Alexander Ivchenko	86ef9ab28f	[GlobalIsel][X86] Support for G_SDIV instruction Reviewed By: igorb Differential Revision: https://reviews.llvm.org/D44430 llvm-svn: 327520	2018-03-14 15:41:11 +00:00
Simon Pilgrim	d594942928	[X86][Btver2] Fix YMM shuffle, permute and permutevar scheduler costs Account for ymm double pumping and add proper pshufb/permutevar support llvm-svn: 327510	2018-03-14 14:05:19 +00:00
Simon Pilgrim	de995e6e37	[X86][SSE] Use WriteFShuffleLd for MOVDDUP/MOVSHDUP/MOVSLDUP reg-mem instructions They shouldn't be treated as pure loads. Found while investigating D44428 llvm-svn: 327505	2018-03-14 13:22:56 +00:00
Alexander Ivchenko	0bd4d8c901	[GlobalISel][X86] Support G_LSHR/G_ASHR/G_SHL Support G_LSHR/G_ASHR/G_SHL. We have 3 variants of shift instructions: shift gpr, shift imm, shift 1. Currently GlobalISel TableGen generates patterns for shift imm and shift 1, but with shiftCount i8. In G_LSHR/G_ASHR/G_SHL, as in LLVM-IR, both arguments have the same type, so for now only shift i8 can use auto generated TableGen patterns. The support of G_SHL/G_ASHR enables tryCombineSExt from LegalizationArtifactCombiner.h to hit, which results in different legalization for the following tests: LLVM :: CodeGen/X86/GlobalISel/ext-x86-64.ll LLVM :: CodeGen/X86/GlobalISel/gep.ll LLVM :: CodeGen/X86/GlobalISel/legalize-ext-x86-64.mir -; X64-NEXT: movsbl %dil, %eax +; X64-NEXT: movl $24, %ecx +; X64-NEXT: # kill: def $cl killed $ecx +; X64-NEXT: shll %cl, %edi +; X64-NEXT: movl $24, %ecx +; X64-NEXT: # kill: def $cl killed $ecx +; X64-NEXT: sarl %cl, %edi +; X64-NEXT: movl %edi, %eax ..which is not optimal and should be addressed later. Rework of the patch by igorb Reviewed By: igorb Differential Revision: https://reviews.llvm.org/D44395 llvm-svn: 327499	2018-03-14 11:23:57 +00:00
Alexander Ivchenko	327de80529	[GlobalIsel][X86] Support for G_ZEXT instruction Reviewed By: igorb Differential Revision: https://reviews.llvm.org/D44378 llvm-svn: 327482	2018-03-14 09:11:23 +00:00
Craig Topper	9ca7e67c4c	[X86] Re-generate test to get proper capitalization of its CHECK lines. NFC llvm-svn: 327462	2018-03-13 23:31:48 +00:00
Craig Topper	cc060e921b	[X86] Rewrite LowerAVXCONCAT_VECTORS similar to how we handle vXi1 concats. This is better able to detect undef and zero pieces in the concat, or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs. This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best. We probably want to merge this with the vXi1 concat code since they are very similar. llvm-svn: 327454	2018-03-13 22:05:25 +00:00
Craig Topper	4aeec51986	[DAGCombiner] Allow visitEXTRACT_SUBVECTOR to combine with BUILD_VECTORS between LegalizeVectorOps and LegalizeDAG. BUILD_VECTORs aren't themselves legalized until LegalizeDAG so we should still be able to create an "illegal" one before that. This helps combine with BUILD_VECTORS that are introduced during LegalizeVectorOps due to unrolling. llvm-svn: 327446	2018-03-13 20:36:28 +00:00
Sanjay Patel	bb45cc126d	[x86] add test for WriteZero sched class instructions; NFC Nops should have zero latency because there is no result. Idioms like 'xorps xmm0, xmm0' may have zero latency because they are handled without using an execution unit. llvm-svn: 327435	2018-03-13 19:20:01 +00:00
Simon Pilgrim	9855b39380	[DAGCombine] visitREM - Don't assume that one divrem isn't driving another Under some circumstances the divrems won't have been combined together before getting to this code. So replace the assertion with an if() guard to not expand to X-((X/C)*C) to give the other combine a chance to happen. Reduced from OSS-Fuzz #6883 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=6883 llvm-svn: 327424	2018-03-13 17:17:15 +00:00
Simon Pilgrim	3d4c86d399	[X86][Btver2] Split i8/i16/i32/i64 div/idiv costs We were assuming a mixture of 32/64 division costs. llvm-svn: 327407	2018-03-13 15:22:24 +00:00
Simon Pilgrim	93bd7187f4	[X86][SSE41] createVariablePermute v2X64 - PCMPEQQ can test for index 0/1 and select between them. llvm-svn: 327385	2018-03-13 12:22:58 +00:00
Craig Topper	80058e30cc	[LegalizeTypes] In SplitVecOp_TruncateHelper, use GetSplitVector on the input instead of creating new extract_subvectors. llvm-svn: 327355	2018-03-13 01:17:40 +00:00
Simon Pilgrim	6618e2a09c	[X86][SSE] createVariablePermute - PSHUFB requires SSSE3 not just SSE3 llvm-svn: 327259	2018-03-12 12:30:04 +00:00
Simon Pilgrim	d09cc9c62c	[X86][MMX] Support MMX build vectors to avoid SSE usage (PR29222) 64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type. This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics. We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover. Differential Revision: https://reviews.llvm.org/D43618 llvm-svn: 327247	2018-03-11 19:22:13 +00:00
Simon Pilgrim	55ed3dc676	[X86][AVX512] Added more non-VLX test cases Cleaned up check prefixes so that they actually share a bit more llvm-svn: 327246	2018-03-11 18:28:37 +00:00
Simon Pilgrim	30f74c14ff	[X86][AVX] createVariablePermute - scale v16i16 variable permutes to use v32i8 codegen XOP was already doing this, and now AVX performs v32i8 variable permutes as well. llvm-svn: 327245	2018-03-11 17:23:54 +00:00
Simon Pilgrim	b306501796	[X86][AVX] createVariablePermute - widen permutes for cases where the source vector is wider than the destination type llvm-svn: 327244	2018-03-11 17:00:46 +00:00
Simon Pilgrim	9a5d0c7540	[X86][AVX] createVariablePermute - use PSHUFB+PCMPGT+SELECT for v32i8 variable permutes Same as the VPERMILPS/VPERMILPD approach for v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. Then select between the lo/hi permute results based on the index size. llvm-svn: 327242	2018-03-11 16:28:11 +00:00
Simon Pilgrim	f9cc80d218	[X86][AVX] createVariablePermute - use 2xVPERMIL+PCMPGT+SELECT for v8i32/v8f32 and v4i64/v4f64 variable permutes As VPERMILPS/VPERMILPD only selects elements based on the bits[1:0]/bit[1] then we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for lo/hi. For v4i64/v4f64 this avoids some rather nasty v4i64 multiples on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends. llvm-svn: 327239	2018-03-11 11:52:26 +00:00
Simon Pilgrim	2565bd421e	[X86][AVX512] createVariablePermute - Non-VLX targets can widen v4i64/v8f64 variable permutes to v8i64/v8f64 Permutes in the upper elements will be undefined, but they will be discarded anyway. llvm-svn: 327238	2018-03-11 11:19:19 +00:00
Craig Topper	d88204fe1b	[X86] Add comments to the end of FMA3 instructions to make the operation clear Summary: There are 3 different operand orders for FMA instructions so figuring out the exact operation being performed requires a lot of thought. This patch adds a comment to the end of the assembly line to print the exact operation. I think I've got all the instructions in here except the ones with builtin rounding. I didn't update all tests, but I assume we can get them as we regenerate tests in the future. Reviewers: spatel, v_klochkov, RKSimon Reviewed By: spatel Subscribers: llvm-commits Differential Revision: https://reviews.llvm.org/D44345 llvm-svn: 327225	2018-03-10 21:30:46 +00:00
Simon Pilgrim	de7f3f0f91	[X86][XOP] createVariablePermute - use VPERMIL2 for v8i32/v4i64 variable permutes llvm-svn: 327222	2018-03-10 19:49:59 +00:00
Simon Pilgrim	ff1248f82f	[X86][XOP] createVariablePermute - use VPPERM for v16i16 variable permutes llvm-svn: 327218	2018-03-10 18:33:29 +00:00
Simon Pilgrim	8224241f75	[X86][XOP] createVariablePermute - use VPPERM for v32i8 variable permutes llvm-svn: 327213	2018-03-10 16:51:45 +00:00
Craig Topper	9804c67d21	[X86] Rewrite printMasking code in X86InstComments to use TSFlags to determine whether the instruction is masked. This should have been NFC, but it looks like we were missing PUNPCKLHQDQ/PUNPCKLQDQ instructions in there. llvm-svn: 327200	2018-03-10 03:12:00 +00:00
Rafael Espindola	63c378d343	Go back to sometimes assuming intrinsics are local. This fixes pr36674. While it is valid for shouldAssumeDSOLocal to return false anytime, always returning false for intrinsics is not optimal on i386 and also hits a bug in the backend. To use a plt, the caller must first setup ebx to handle the case of that file being linked into a PIE executable or shared library. In those cases the generated PLT uses ebx. Currently we can produce "calll expf@plt" without setting ebx. We could fix that by correctly setting ebx, but this would produce worse code for the case where the runtime library is statically linked. It would also require other tools to handle R_386_PLT32. llvm-svn: 327198	2018-03-10 02:42:14 +00:00
Nirav Dave	042678bd55	Revert: r327172 "Correct load-op-store cycle detection analysis" r327171 "Improve Dependency analysis when doing multi-node Instruction Selection" r327170 "[DAG] Enforce stricter NodeId invariant during Instruction selection" Reverting patch as NodeId invariant change is causing pathological increases in compile time on PPC llvm-svn: 327197	2018-03-10 02:16:15 +00:00
Craig Topper	f6ff51fc62	[TwoAddressInstructionPass] Improve tryInstructionCommute of X86 FMA and vpternlog instructions These instructions have 3 operands that can be commuted. The first commute we find may not be the best. So we should keep searching if we performed an aggressive commute. There may still be an operand that is killed or a physical register constraint that might be better. Differential Revision: https://reviews.llvm.org/D44324 llvm-svn: 327188	2018-03-09 23:36:58 +00:00
Nirav Dave	0fab41782d	Correct load-op-store cycle detection analysis Add missing cycle dependency checks in load-op-store fusion. Fixes PR36274. Reviewers: craig.topper, bogner Subscribers: hiraditya, llvm-commits Differential Revision: https://reviews.llvm.org/D43154 llvm-svn: 327172	2018-03-09 20:58:07 +00:00
Nirav Dave	d668f69ee7	Improve Dependency analysis when doing multi-node Instruction Selection Relanding after fixing NodeId Invariant. Cleanup cycle/validity checks in ISel (IsLegalToFold, HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full search for cycles / dependencies pruning the search when topological property of NodeId allows. As part of this propagate the NodeId-based cutoffs to narrow hasPredecessorHelper searches. Reviewers: craig.topper, bogner Subscribers: llvm-commits, hiraditya Differential Revision: https://reviews.llvm.org/D41293 llvm-svn: 327171	2018-03-09 20:57:42 +00:00
Nirav Dave	071699bf82	[DAG] Enforce stricter NodeId invariant during Instruction selection Instruction Selection makes use of the topological ordering of nodes by node id (a node's operands have smaller node id than it) when doing cycle detection. During selection we may violate this property as a selection of multiple nodes may induce a use dependence (and thus a node id restriction) between two unrelated nodes. If a selected node has an unselected successor this may allow us to miss a cycle when detecting an invalid selection. This patch fixes this by marking all unselected successors of a selected node as having a negated node id. We avoid pruning on such negative ids but still can reconstruct the original id for pruning. In-tree targets have been updated to replace DAG-level replacements with ISel-level ones which enforce this property. This preemptively fixes PR36312 before triggering commit r324359 relands Reviewers: craig.topper, bogner, jyknight Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya Differential Revision: https://reviews.llvm.org/D43198 llvm-svn: 327170	2018-03-09 20:57:15 +00:00
Peter Collingbourne	2974856ad4	Use branch funnels for virtual calls when retpoline mitigation is enabled. The retpoline mitigation for variant 2 of CVE-2017-5715 inhibits the branch predictor, and as a result it can lead to a measurable loss of performance. We can reduce the performance impact of retpolined virtual calls by replacing them with a special construct known as a branch funnel, which is an instruction sequence that implements virtual calls to a set of known targets using a binary tree of direct branches. This allows the processor to speculatively execute valid implementations of the virtual function without allowing for speculative execution of calls to arbitrary addresses. This patch extends the whole-program devirtualization pass to replace certain virtual calls with calls to branch funnels, which are represented using a new llvm.icall.jumptable intrinsic. It also extends the LowerTypeTests pass to recognize the new intrinsic, generate code for the branch funnels (x86_64 only for now) and lay out virtual tables as required for each branch funnel. The implementation supports full LTO as well as ThinLTO, and extends the ThinLTO summary format used for whole-program devirtualization to support branch funnels. For more details see RFC: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120672.html Differential Revision: https://reviews.llvm.org/D42453 llvm-svn: 327163	2018-03-09 19:11:44 +00:00
Simon Pilgrim	2cd489feb2	[X86][AVX] createVariablePermute - fix v2i64/v2f64 VPERMILPD index creation. The input indices vector will put the index in bit0, but VPERMILPD actually selects off bit1 - so we need to scale accordingly. llvm-svn: 327159	2018-03-09 18:37:56 +00:00
Craig Topper	784f1bbf5e	[X86] Remove SRAs from v16i8 multiply lowering on sse2 targets Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did a v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The same was done for all the odd bytes. The results are then packed together with packuswb. Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does. Differential Revision: https://reviews.llvm.org/D44267 llvm-svn: 327093	2018-03-09 01:22:31 +00:00
Sanjay Patel	0cdccf5f37	[x86] fix test to be independent of FP undef llvm-svn: 327030	2018-03-08 17:24:30 +00:00
Sanjay Patel	af2c4185a2	[x86] regenerate checks; NFC This test will fail if we fix FP undef constant folding. llvm-svn: 327026	2018-03-08 16:56:49 +00:00
Craig Topper	a406796f5f	[X86] Change X86::PMULDQ/PMULUDQ opcodes to take vXi64 type as input instead of vXi32. This instruction can be thought of as reading either the even elements of a vXi32 input or the lower half of each element of a vXi64 input. We currently use the vXi32 interpretation, but vXi64 matches better with its broadcast behavior in EVEX. I'm looking at moving MULDQ/MULUDQ creation to a DAG combine so we can do it when AVX512DQ is enabled without having to go through Custom lowering. But in some of the test cases we failed to use a broadcast load due to the size difference. This should help with that. I'm also wondering if we can model these instructions in native IR and remove the intrinsics and I think using a vXi64 type will work better with that. llvm-svn: 326991	2018-03-08 08:02:52 +00:00
Craig Topper	7ff9779768	[X86] Fix some isel patterns that used aligned vector load instructions with unaligned predicates. These patterns weren't checking the alignment of the load, but were using the aligned instructions. This will cause a GP fault if the data isn't aligned. I believe these were introduced in r312450. llvm-svn: 326967	2018-03-08 00:21:17 +00:00
Simon Pilgrim	dc1a0385ee	[X86][SSE] Regenerate float maxnum/minnum tests llvm-svn: 326930	2018-03-07 19:14:05 +00:00
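
The reasoning in commit 784f1bbf5e (v16i8 multiply lowering) can be checked with a small scalar model. The following is only an illustrative sketch in Python, not the LLVM lowering itself: the helper names are hypothetical, and each "lane" is modelled as a plain 16-bit integer. It compares the low 8 bits of a 16-bit product when the upper byte of each input is filled with sign bits against the case where the upper byte holds arbitrary garbage.

```python
def low_byte_product(a16: int, b16: int) -> int:
    """Multiply two 16-bit lanes and keep only the low 8 bits of the result."""
    return ((a16 & 0xFFFF) * (b16 & 0xFFFF)) & 0xFF


def sign_extend_byte(x: int) -> int:
    """Model the old lowering: byte in the low half, sign bits in the upper half."""
    x &= 0xFF
    return (x | 0xFF00) if (x & 0x80) else x


def with_garbage_upper(x: int, garbage: int) -> int:
    """Model the new lowering: byte in the low half, arbitrary upper byte."""
    return ((garbage & 0xFF) << 8) | (x & 0xFF)


def check_all(garbage_values=(0x00, 0x5A, 0xFF)) -> bool:
    """Exhaustively compare both models for every pair of byte inputs."""
    for a in range(256):
        for b in range(256):
            expected = low_byte_product(sign_extend_byte(a), sign_extend_byte(b))
            for g in garbage_values:
                got = low_byte_product(
                    with_garbage_upper(a, g), with_garbage_upper(b, g ^ 0xA5)
                )
                if got != expected:
                    return False
    return True


if __name__ == "__main__":
    print(check_all())
```

The check passes because the low 8 bits of a product depend only on the low 8 bits of each factor, which is why the sign-filling shifts can be dropped.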


11458 Commits