llvm-project

Author	SHA1	Message	Date
Chandler Carruth	2c0390ca4b	[x86] Remove the final fallback in the v8i16 lowering that isn't really needed, and significantly improve the SSSE3 path. This makes the new strategy much more clear. If we can blend, we just go with that. If we can't blend, we try to permute into an unpack so that we handle cases where the unpack doing the blend also simplifies the shuffle. If that fails and we've got SSSE3, we now call into factored-out pshufb lowering code so that we leverage the fact that pshufb can set up a blend for us while shuffling. This generates great code, especially because we know we don't have a fast blend at this point. Finally, we fall back on decomposing into permutes and blends because we do at least have a bit-math-based blend if we need to use that. This pretty significantly improves some of the v8i16 code paths. We never need to form pshufb for the single-input shuffles because we have effective target-specific combines to form it there, but we were missing its effectiveness in the blends. llvm-svn: 229851	2015-02-19 13:56:49 +00:00
Chandler Carruth	f0f0d27391	[x86] Simplify the pre-SSSE3 v16i8 lowering significantly by decomposing them into permutes and a blend with the generic decomposition logic. This works really well in almost every case and lets the code only manage the expansion of a single input into two v8i16 vectors to perform the actual shuffle. The blend-based merging is often much nicer than the pack based merging that this replaces. The only place where it isn't we end up blending between two packs when we could do a single pack. To handle that case, just teach the v2i64 lowering to handle these blends by digging out the operands. With this we're down to only really random permutations that cause an explosion of instructions. llvm-svn: 229849	2015-02-19 13:15:12 +00:00
Chandler Carruth	bcb6c5f62d	[x86] Add support for bit-wise blending and use it in the v8 and v16 lowering paths. I'm going to be leveraging this to simplify a lot of the overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes. Sadly, this isn't profitable on v4i32 and v2i64. There, the float and double blending instructions for pre-SSE4.1 are actually pretty good, and we can't beat them with bit math. And once SSE4.1 comes around we have direct blending support and this ceases to be relevant. Also, some of the test cases look odd because the domain fixer canonicalizes these to floating point domain. That's OK, it'll use the integer domain when it matters and some day I may be able to update enough of LLVM to canonicalize the other way. This restores almost all of the regressions from teaching x86's vselect lowering to always use vector shuffle lowering for blends. The remaining problems are because the v16 lowering path is still doing crazy things. I'll be re-arranging that strategy in more detail in subsequent commits to finish recovering the performance here. llvm-svn: 229836	2015-02-19 10:46:52 +00:00
Simon Pilgrim	3ac3b251a9	[X86][SSE] pslldq/psrldq byte shifts/rotation for SSE2 This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - pre-SSSE3 is only enabled on i8 and i16 vector targets where it is a more definite performance gain. I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr. Differential Revision: http://reviews.llvm.org/D5699 llvm-svn: 222340	2014-11-19 10:06:49 +00:00
Chandler Carruth	99627bfbff	[x86] Enable the new vector shuffle lowering by default. Update the entire regression test suite for the new shuffles. Remove most of the old testing which was devoted to the old shuffle lowering path and is no longer relevant really. Also remove a few other random tests that only really exercised shuffles and only incidentally or without any interesting aspects to them. Benchmarking that I have done shows a few small regressions with this on LNT, zero measurable regressions on real, large applications, and for several benchmarks where the loop vectorizer fires in the hot path it shows 5% to 40% improvements for SSE2 and SSE3 code running on Sandy Bridge machines. Running on AMD machines shows even more dramatic improvements. When using newer ISA vector extensions the gains are much more modest, but the code is still better on the whole. There are a few regressions being tracked (PR21137, PR21138, PR21139) but by and large this is expected to be a win for x86 generated code performance. It is also more correct than the code it replaces. I have fuzz tested this extensively with ISA extensions up through AVX2 and found no crashes or miscompiles (yet...). The old lowering had a few miscompiles and crashers after a somewhat smaller amount of fuzz testing. There is one significant area where the new code path lags behind and that is in AVX-512 support. However, there was extremely little support for that already and so this isn't a significant step backwards and the new framework will probably make it easier to implement lowering that uses the full power of AVX-512's table-based shuffle+blend (IMO). Many thanks to Quentin, Andrea, Robert, and others for benchmarking assistance. Thanks to Adam and others for help with AVX-512. Thanks to Hal, Eric, and many others for answering my incessant questions about how the backend actually works. =] I will leave the old code path in the tree until the 3 PRs above are at least resolved to folks' satisfaction. Then I will rip it (and 1000s of lines of code) out. =] I don't expect this flag to stay around for very long. It may not survive next week. llvm-svn: 219046	2014-10-04 03:52:55 +00:00
Chandler Carruth	c75abc162c	[x86] Cleanup and generate precise FileCheck assertions for a bunch of SSE tests. llvm-svn: 218947	2014-10-03 01:37:58 +00:00
Chandler Carruth	c85473143c	[x86] Fix a bad predicate I spotted by inspection -- pshufhw and pshuflw were added in SSE2, not SSSE3. Found this while auditing all uses of SSSE3 in the X86 target. I don't actually expect this to make a significant difference on anything and I don't have any detailed test cases but I updated the existing test cases that already covered some of this code path. llvm-svn: 209056	2014-05-17 03:29:20 +00:00
Andrew Trick	8485257d6d	Allocate local registers in order for optimal coloring. Also avoid locals evicting locals just because they want a cheaper register. Problem: MI Sched knows exactly how many registers we have and assumes they can be colored. In cases where we have large blocks, usually from unrolled loops, greedy coloring fails. This is a source of "regressions" from the MI Scheduler on x86. I noticed this issue on x86 where we have long chains of two-address defs in the same live range. It's easy to see this in matrix multiplication benchmarks like IRSmk and even the unit test misched-matmul.ll. A fundamental difference between the LLVM register allocator and conventional graph coloring is that in our model a live range can't discover its neighbors, it can only verify its neighbors. That's why we initially went for greedy coloring and added eviction to deal with the hard cases. However, for singly defined and two-address live ranges, we can optimally color without visiting neighbors simply by processing the live ranges in instruction order. Other beneficial side effects: It is much easier to understand and debug regalloc for large blocks when the live ranges are allocated in order. Yes, global allocation is still very confusing, but it's nice to be able to comprehend what happened locally. Heuristics could be added to bias register assignment based on instruction locality (think late register pairing, banks...). Intuitively this will make some test cases that are on the threshold of register pressure more stable. llvm-svn: 187139	2013-07-25 18:35:14 +00:00
Stephen Lin	d24ab20e9b	Mass update to CodeGen tests to use CHECK-LABEL for labels corresponding to function definitions for more informative error messages. No functionality change and all updated tests passed locally. This update was done with the following bash script: find test/CodeGen -name "*.ll" | \ while read NAME; do echo "$NAME" if ! grep -q "^; *RUN: *llc.*debug" $NAME; then TEMP=`mktemp -t temp` cp $NAME $TEMP sed -n "s/^define [^@]*@\([A-Za-z0-9_]*\)(.*$/\1/p" < $NAME | \ while read FUNC; do sed -i '' "s/;\(.*\)\([A-Za-z0-9_-]*\):\( *\)$FUNC: *\$/;\1\2-LABEL:\3$FUNC:/g" $TEMP done sed -i '' "s/;\(.*\)-LABEL-LABEL:/;\1-LABEL:/" $TEMP sed -i '' "s/;\(.*\)-NEXT-LABEL:/;\1-NEXT:/" $TEMP sed -i '' "s/;\(.*\)-NOT-LABEL:/;\1-NOT:/" $TEMP sed -i '' "s/;\(.*\)-DAG-LABEL:/;\1-DAG:/" $TEMP mv $TEMP $NAME fi done llvm-svn: 186280	2013-07-14 06:24:09 +00:00
Craig Topper	92db928ee9	Simplify handling of v16i8 shuffles and fix a missed optimization. llvm-svn: 157043	2012-05-18 06:42:06 +00:00
Evan Cheng	30f44ad785	Teach two-address pass to re-schedule two-address instructions (or the kill instructions of the two-address operands) in order to avoid inserting copies. This fixes the few regressions introduced when the two-address hack was disabled (without regressing the improvements). rdar://10422688 llvm-svn: 144559	2011-11-14 19:48:55 +00:00
Evan Cheng	d33b2d6b7a	Use a bigger hammer to fix PR11314 by disabling the "forcing two-address instruction lower optimization" in the pre-RA scheduler. The optimization, or rather the hack, was done before the MI use-list was available. Now we should be able to implement it in a better way, perhaps in the two-address pass until a MI scheduler is available. Now that the scheduler has to backtrack to handle call sequences, adding artificial scheduling constraints is just not safe. Furthermore, the hack is not taking all the other scheduling decisions into consideration so it's just as likely to pessimize code. So I view disabling this optimization as goodness regardless of PR11314. llvm-svn: 144267	2011-11-10 07:43:16 +00:00
Dan Gohman	198b7ffc11	Reapply r143206, with fixes. Disallow physical register lifetimes across calls, and only check for nested dependences on the special call-sequence-resource register. llvm-svn: 143660	2011-11-03 21:49:52 +00:00
Dan Gohman	9b9c970148	Revert r143206, as there are still some failing tests. llvm-svn: 143262	2011-10-29 00:41:52 +00:00
Dan Gohman	73057ad24f	Reapply r143177 and r143179 (reverting r143188), with scheduler fixes: Use a separate register, instead of SP, as the calling-convention resource, to avoid spurious conflicts with actual uses of SP. Also, fix unscheduling of calling sequences, which can be triggered by pseudo-two-address dependencies. llvm-svn: 143206	2011-10-28 17:55:38 +00:00
Duncan Sands	225a7037d6	Speculatively disable Dan's commits 143177 and 143179 to see if it fixes the dragonegg self-host (it looks like gcc is miscompiled). Original commit messages: Eliminate LegalizeOps' LegalizedNodes map and have it just call RAUW on every node as it legalizes them. This makes it easier to use hasOneUse() heuristics, since unneeded nodes can be removed from the DAG earlier. Make LegalizeOps visit the DAG in an operands-last order. It previously used operands-first, because LegalizeTypes has to go operands-first, and LegalizeTypes used to be part of LegalizeOps, but they're now split. The operands-last order is more natural for several legalization tasks. For example, it allows lowering code for nodes with floating-point or vector constants to see those constants directly instead of seeing the lowered form (often constant-pool loads). This makes some things somewhat more complicated today, though it ought to allow things to be simpler in the future. It also fixes some bugs exposed by Legalizing using RAUW aggressively. Remove the part of LegalizeOps that attempted to patch up invalid chain operands on libcalls generated by LegalizeTypes, since it doesn't work with the new LegalizeOps traversal order. Instead, define what LegalizeTypes is doing to be correct, and transfer the responsibility of keeping calls from having overlapping calling sequences into the scheduler. Teach the scheduler to model callseq_begin/end pairs as having a physical register definition/use to prevent calls from having overlapping calling sequences. This is also somewhat complicated, though there are ways it might be simplified in the future. This addresses rdar://9816668, rdar://10043614, rdar://8434668, and others. Please direct high-level questions about this patch to management. Delete #if 0 code accidentally left in. llvm-svn: 143188	2011-10-28 09:55:57 +00:00
Dan Gohman	4db3f7dd83	Eliminate LegalizeOps' LegalizedNodes map and have it just call RAUW on every node as it legalizes them. This makes it easier to use hasOneUse() heuristics, since unneeded nodes can be removed from the DAG earlier. Make LegalizeOps visit the DAG in an operands-last order. It previously used operands-first, because LegalizeTypes has to go operands-first, and LegalizeTypes used to be part of LegalizeOps, but they're now split. The operands-last order is more natural for several legalization tasks. For example, it allows lowering code for nodes with floating-point or vector constants to see those constants directly instead of seeing the lowered form (often constant-pool loads). This makes some things somewhat more complicated today, though it ought to allow things to be simpler in the future. It also fixes some bugs exposed by Legalizing using RAUW aggressively. Remove the part of LegalizeOps that attempted to patch up invalid chain operands on libcalls generated by LegalizeTypes, since it doesn't work with the new LegalizeOps traversal order. Instead, define what LegalizeTypes is doing to be correct, and transfer the responsibility of keeping calls from having overlapping calling sequences into the scheduler. Teach the scheduler to model callseq_begin/end pairs as having a physical register definition/use to prevent calls from having overlapping calling sequences. This is also somewhat complicated, though there are ways it might be simplified in the future. This addresses rdar://9816668, rdar://10043614, rdar://8434668, and others. Please direct high-level questions about this patch to management. llvm-svn: 143177	2011-10-28 01:29:32 +00:00
Evan Cheng	fd7e3fcad3	Fix broken x86_64 tests which specify non-64-bit cpu's. llvm-svn: 134756	2011-07-08 22:29:33 +00:00
Jakob Stoklund Olesen	4931bbc671	Be more aggressive about following hints. RAGreedy::tryAssign will now evict interference from the preferred register even when another register is free. To support this, add the EvictionCost struct that counts how many hints are broken by an eviction. We don't want to break one hint just to satisfy another. Rename canEvict to shouldEvict, and add the first bit of eviction policy that doesn't depend on spill weights: Always make room in the preferred register as long as the evictees can be split and aren't already assigned to their preferred register. Also make the CSR avoidance more accurate. When looking for a cheaper register it is OK to use a new volatile register. Only CSR aliases that have never been used before should be avoided. llvm-svn: 134735	2011-07-08 20:46:18 +00:00
Jakob Stoklund Olesen	369bddf5ad	Fix a batch of x86 tests to be coalescer independent. Most of these tests require a single mov instruction that can come either before or after a 2-addr instruction. -join-physregs changes the behavior, but the results are equivalent. llvm-svn: 130891	2011-05-04 23:54:51 +00:00
Jakob Stoklund Olesen	bd09d45489	Fix register-dependent X86 tests. llvm-svn: 128867	2011-04-05 00:32:44 +00:00
Evan Cheng	5c31bf0619	Canonicalize X86ISD::MOVDDUP nodes to v2f64 to make sure all cases match. Also eliminate unneeded isel patterns. rdar://8520311 llvm-svn: 115977	2010-10-07 20:50:20 +00:00
Dan Gohman	9a2f0473b2	Teach EmitLiveInCopies to omit copies for unused virtual registers, and to clean up unused incoming physregs from the live-in list. llvm-svn: 106805	2010-06-24 22:23:02 +00:00
Jakob Stoklund Olesen	6f6ebb663c	Enable -sse-domain-fix by default. Now with tests! llvm-svn: 99954	2010-03-30 22:47:00 +00:00
Evan Cheng	bf724b9ee0	Turning off post-ra scheduling for x86. It isn't a consistent win. llvm-svn: 98810	2010-03-18 06:55:42 +00:00
Chris Lattner	dd030701bd	Fix some issues in WalkChainUsers dealing with CopyToReg/CopyFromReg/INLINEASM. These are annoying because they have the same opcode before and after isel. Fix this by setting their NodeID to -1 to indicate that they are selected, just like what automatically happens when selecting things that end up being machine nodes. With that done, give IsLegalToFold a new flag that causes it to ignore chains. This lets the HandleMergeInputChains routine be the one place that validates chains after a match is successful, enabling the new hotness in chain processing. This smarter chain processing eliminates the need for "PreprocessRMW" in the X86 and MSP430 backends and enables MSP to start matching its multiple mem operand instructions more aggressively. I currently #if out the dead code in the X86 backend and MSP backend, I'll remove it for real in a follow-on patch. The testcase changes are: test/CodeGen/X86/sse3.ll: we generate better code test/CodeGen/X86/store_op_load_fold2.ll: PreprocessRMW was miscompiling this before, we now generate correct code Convert it to filecheck while I'm at it. test/CodeGen/MSP430/Inst16mm.ll: Add a testcase for mem/mem folding to make anton happy. :) llvm-svn: 97596	2010-03-02 22:20:06 +00:00
Evan Cheng	ea5c6be766	Run codegen dce pass for all targets at all optimization levels. Previously it was only run for x86 with fastisel. I've found it to be very effective in eliminating some obvious dead code as a result of formal parameter lowering, especially when tail call optimization eliminated the need for some of the loads from fixed frame objects. It also shrinks a number of the tests. A couple of tests no longer make sense and are now eliminated. llvm-svn: 95493	2010-02-06 09:07:11 +00:00
Dan Gohman	9528ccdd77	Don't enable the post-RA scheduler on x86 except at -O3. In its current form, it is too expensive in compile time. llvm-svn: 90781	2009-12-07 19:04:31 +00:00
Eric Christopher	bd05185ef1	Fix a couple of shuffle patterns to use movhlps instead of movhps as the constraint. Changes optimizations so update testcases as appropriate as well. llvm-svn: 86360	2009-11-07 08:45:53 +00:00
Evan Cheng	36f4bd0b62	Update tests for 84931. llvm-svn: 84932	2009-10-23 05:58:34 +00:00
David Goodwin	02ad4cb32e	Allow the target to select the level of anti-dependence breaking that should be performed by the post-RA scheduler. The default is none. llvm-svn: 84911	2009-10-22 23:19:17 +00:00
Dan Gohman	682a2d154a	Revert r84658 and r84691. They were causing llvm-gcc bootstrap to fail. llvm-svn: 84727	2009-10-21 01:44:44 +00:00
David Goodwin	baf6dd26ea	Checkpoint more aggressive anti-dependency breaking for post-ra scheduler. llvm-svn: 84658	2009-10-20 19:54:44 +00:00
Evan Cheng	c436631a9c	Turn on post-alloc scheduling for x86. llvm-svn: 84431	2009-10-18 19:57:27 +00:00
Dan Gohman	40503396da	Eliminate more uses of llvm-as and llvm-dis. llvm-svn: 81290	2009-09-08 23:54:48 +00:00
Chris Lattner	e819cfbc71	change selectiondag to add the sign extended versions of immediate operands to instructions instead of zero extended ones. This makes the asmprinter print signed values more consistently. This apparently only really affects the X86 backend. llvm-svn: 81265	2009-09-08 23:05:44 +00:00
Chris Lattner	c6a803be7c	specify a target triple so global variable manglings are consistent etc. llvm-svn: 79118	2009-08-15 17:35:05 +00:00
Chris Lattner	d3954e2790	merge a bunch more sse3 tests into sse3.ll llvm-svn: 79115	2009-08-15 17:21:44 +00:00
Chris Lattner	9bae01ec47	convert test to filecheck format. llvm-svn: 79114	2009-08-15 17:05:03 +00:00
Chris Lattner	912aa19c25	rename test llvm-svn: 79113	2009-08-15 17:01:44 +00:00

40 Commits
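
Commit bcb6c5f62d above mentions "bit-wise blending" as a fallback when no direct blend instruction is available. A minimal sketch of the general idea follows, in Python rather than LLVM's C++: each output lane is picked from one of two inputs using only AND, AND-NOT and OR. The function name `bitwise_blend`, its parameters, and the 16-bit default lane width are illustrative assumptions, not names taken from the LLVM source, and this is not a transcription of the actual lowering code.

```python
def bitwise_blend(a, b, take_from_a, lane_bits=16):
    """Blend two equal-length lane lists with bit math only.

    Hypothetical helper for illustration: for each lane, build a mask that
    is all ones when the lane should come from `a` and all zeros when it
    should come from `b`, then compute (a AND mask) OR (b AND NOT mask).
    """
    lane_mask = (1 << lane_bits) - 1  # all-ones value for one lane
    result = []
    for x, y, from_a in zip(a, b, take_from_a):
        m = lane_mask if from_a else 0
        # ~m is negative in Python, so clamp it back to the lane width.
        result.append((x & m) | (y & ~m & lane_mask))
    return result


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [10, 20, 30, 40, 50, 60, 70, 80]
    pick_a = [True, False, True, False, True, False, True, False]
    print(bitwise_blend(a, b, pick_a))  # [1, 20, 3, 40, 5, 60, 7, 80]
```

On real hardware the same three operations would be applied to whole vector registers at once, with the mask supplied as a constant vector; the per-lane loop here only makes the arithmetic visible.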