llvm-project

Commit Graph

Author	SHA1	Message	Date
Sanjay Patel	957efc23bb	Use rsqrt (X86) to speed up reciprocal square root calcs This is a first step for generating SSE rsqrt instructions for reciprocal square root calcs when fast-math is allowed. For now, be conservative and only enable this for AMD btver2 where performance improves significantly - for example, 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c (if we convert the data type to single-precision float). This patch adds a two constant version of the Newton-Raphson refinement algorithm to DAGCombiner that can be selected by any target via a parameter returned by getRsqrtEstimate().. See PR20900 for more details: http://llvm.org/bugs/show_bug.cgi?id=20900 Differential Revision: http://reviews.llvm.org/D5658 llvm-svn: 220570	2014-10-24 17:02:16 +00:00
Ahmed Bougacha	5175bcf43a	[X86] Improve mul w/ overflow codegen, to MUL8+SETO. Currently, @llvm.smul.with.overflow.i8 expands to 9 instructions, where 3 are really needed. This adds X86ISD::UMUL8/SMUL8 SD nodes, and custom lowers them to MUL8/IMUL8 + SETO. i8 is a special case because there is no two/three operand variants of (I)MUL8, so the first operand and return value need to go in AL/AX. Also, we can't write patterns for these instructions: TableGen refuses patterns where output operands don't match SDNode results. In this case, instructions where the output operand is an implicitly defined register. A related special case (and FIXME) exists for MUL8 (X86InstrArith.td): // FIXME: Used for 8-bit mul, ignore result upper 8 bits. // This probably ought to be moved to a def : Pat<> if the // syntax can be accepted. [(set AL, (mul AL, GR8:$src)), (implicit EFLAGS)] Ideally, these go away with UMUL8, but we still need to improve TableGen support of implicit operands in patterns. Before this change: movsbl %sil, %eax movsbl %dil, %ecx imull %eax, %ecx movb %cl, %al sarb $7, %al movzbl %al, %eax movzbl %ch, %esi cmpl %eax, %esi setne %al After: movb %dil, %al imulb %sil seto %al Also, remove a made-redundant testcase for PR19858, and enable more FastISel ALU-overflow tests for SelectionDAG too. Differential Revision: http://reviews.llvm.org/D5809 llvm-svn: 220516	2014-10-23 21:55:31 +00:00
Matt Arsenault	7c93690be0	Add minnum / maxnum codegen llvm-svn: 220342	2014-10-21 23:01:01 +00:00
Quentin Colombet	06355199f1	[X86] Fix a bug in the lowering of the mask of VSELECT. X86 code to lower VSELECT messed a bit with the bits set in the mask of VSELECT when it knows it can be lowered into BLEND. Indeed, only the high bits need to be set for those and it optimizes those accordingly. However, when the mask is a compile time constant, the lowering will be handled by the generic optimizer and those modifications will generate bad code in the generic optimizer. This patch fixes that by preventing the optimization if the VSELECT will be handled by the generic optimizer. <rdar://problem/18675020> llvm-svn: 220242	2014-10-20 23:13:30 +00:00
Filipe Cabecinhas	9d7bd78ffa	Fix a broadcast related regression on the vector shuffle lowering. Summary: Test by Robert Lougher! Reviewers: chandlerc Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5745 llvm-svn: 219617	2014-10-13 16:16:16 +00:00
Simon Pilgrim	3c1e1e9498	Test commit access (email fix) Indentation tidyup. llvm-svn: 219577	2014-10-11 20:28:56 +00:00
Simon Pilgrim	d89591e0a1	Test commit access Fix comment typo + spelling. llvm-svn: 219572	2014-10-11 14:23:36 +00:00
Robert Khasanov	b51bb22611	[AVX512] Added intrinsics for 128-, 256- and 512-bit versions of VPCMP/VPCMPU{BWDQ} Added CMP_MASK_CC intrinsic type. Added tests for intrinsics. Patch by Sergey Lisitsyn <sergey.lisitsyn@intel.com> llvm-svn: 219316	2014-10-08 15:49:26 +00:00
Elena Demikhovsky	44bf0637d5	AVX-512-SKX: Added instruction VPMOVM2B/W/D/Q. This instruction allows to broadacst mask vector to data vector. llvm-svn: 219083	2014-10-05 14:11:08 +00:00
Chandler Carruth	acecdc0211	[x86] Fix PR21139, one of the last remaining regressions found in the new vector shuffle lowering. This is loosely based on a patch by Marius Wachtler to the PR (thanks!). I refactored it a bi to use std::count_if and a mutable array ref but the core idea was exactly right. I also added some direct testing of this case. I believe PR21137 is now the only remaining regression. llvm-svn: 219081	2014-10-05 12:07:34 +00:00
Chandler Carruth	9f4d9fa54e	[x86] Teach the new vector shuffle lowering how to lower 128-bit shuffles using AVX and AVX2 instructions. This fixes PR21138, one of the few remaining regressions impacting benchmarks from the new vector shuffle lowering. You may note that it "regresses" many of the vperm2x128 test cases -- these were actually "improved" by the naive lowering that the new shuffle lowering previously did. This regression gave me fits. I had this patch ready-to-go about an hour after flipping the switch but wasn't sure how to have the best of both worlds here and thought the correct solution might be a completely different approach to lowering these vector shuffles. I'm now convinced this is the correct lowering and the missed optimizations shown in vperm2x128 are actually due to missing target-independent DAG combines. I've even written most of the needed DAG combine and will submit it shortly, but this part is ready and should help some real-world benchmarks out. llvm-svn: 219079	2014-10-05 11:41:36 +00:00
Chandler Carruth	99627bfbff	[x86] Enable the new vector shuffle lowering by default. Update the entire regression test suite for the new shuffles. Remove most of the old testing which was devoted to the old shuffle lowering path and is no longer relevant really. Also remove a few other random tests that only really exercised shuffles and only incidently or without any interesting aspects to them. Benchmarking that I have done shows a few small regressions with this on LNT, zero measurable regressions on real, large applications, and for several benchmarks where the loop vectorizer fires in the hot path it shows 5% to 40% improvements for SSE2 and SSE3 code running on Sandy Bridge machines. Running on AMD machines shows even more dramatic improvements. When using newer ISA vector extensions the gains are much more modest, but the code is still better on the whole. There are a few regressions being tracked (PR21137, PR21138, PR21139) but by and large this is expected to be a win for x86 generated code performance. It is also more correct than the code it replaces. I have fuzz tested this extensively with ISA extensions up through AVX2 and found no crashes or miscompiles (yet...). The old lowering had a few miscompiles and crashers after a somewhat smaller amount of fuzz testing. There is one significant area where the new code path lags behind and that is in AVX-512 support. However, there was extremely little support for that already and so this isn't a significant step backwards and the new framework will probably make it easier to implement lowering that uses the full power of AVX-512's table-based shuffle+blend (IMO). Many thanks to Quentin, Andrea, Robert, and others for benchmarking assistance. Thanks to Adam and others for help with AVX-512. Thanks to Hal, Eric, and many others for answering my incessant questions about how the backend actually works. =] I will leave the old code path in the tree until the 3 PRs above are at least resolved to folks' satisfaction. Then I will rip it (and 1000s of lines of code) out. =] I don't expect this flag to stay around for very long. It may not survive next week. llvm-svn: 219046	2014-10-04 03:52:55 +00:00
Chandler Carruth	200e87c0c5	[x86] Fix a bug in the VZEXT DAG combine that I just made more powerful. It turns out this combine was always somewhat flawed -- there are cases where nested VZEXT nodes can't be combined: if their types have a mismatch that can be observed in the result. While none of these show up in currently, once I switch to the new vector shuffle lowering a few test cases actually form such nested VZEXT nodes. I've not come up with any IR pattern that I can sensible write to exercise this, but it will be covered by tests once I flip the switch. llvm-svn: 219044	2014-10-04 02:51:03 +00:00
Chandler Carruth	7e26a67ffa	[x86] Sink a generic combine of VZEXT nodes from the lowering to VZEXT nodes to the DAG combining of them. This will allow the combine to fire on both old vector shuffle lowering and the new vector shuffle lowering and generally seems like a cleaner design. I've trimmed down the code a bit and tried to make it and the surrounding combine fairly clean while moving it around. llvm-svn: 219042	2014-10-04 01:05:48 +00:00
Chandler Carruth	f3e880697a	[x86] Add a really preposterous number of patterns for matching all of the various ways in which blends can be used to do vector element insertion for lowering with the scalar math instruction forms that effectively re-blend with the high elements after performing the operation. This then allows me to bail on the element insertion lowering path when we have SSE4.1 and are going to be doing a normal blend, which in turn restores the last of the blends lost from the new vector shuffle lowering when I got it to prioritize insertion in other cases (for example when we don't have a blend instruction). Without the patterns, using blends here would have regressed sse-scalar-fp-arith.ll completely with the new vector shuffle lowering. For completeness, I've added RUN-lines with the new lowering here. This is somewhat superfluous as I'm about to flip the default, but hey, it shows that this actually significantly changed behavior. The patterns I've added are just ridiculously repetative. Suggestions on making them better very much welcome. In particular, handling the commuted form of the v2f64 patterns is somewhat obnoxious. llvm-svn: 219033	2014-10-03 22:43:17 +00:00
Chandler Carruth	1964078936	[x86] Teach the new vector shuffle lowering to aggressively form MOVSS and MOVSD nodes for single element vector inserts. This is particularly important because a number of patterns in the backend detect these patterns and leverage them to simplify things. It also fixes quite a few of the insertion bad code examples. However, it regresses a specific area: when available, blendps and blendpd are dramatically faster than movss and movsd respectively. But it doesn't really work to form the blend logic first because the blends aren't as crazy efficient when the data is coming from memory anyways, and thus will have a movss or movsd regardless. Also, doing that would block a bunch of the patterns that this is designed to hit. So my plan is to go into the patterns for lowering MOVSS and MOVSD and lower them via blends when available. However that's a pretty invasive restructuring so it will need to be a follow-up patch. I have already gone into the patterns to lower MOVSS and MOVSD from memory using MOVLPD, etc. Without that, several of the test cases I already have regress. llvm-svn: 218985	2014-10-03 13:11:13 +00:00
Chandler Carruth	4bf341de3c	[x86] Refactor the element insertion logic in the new vector shuffle lowering to handle the potential mirroring of 2-element vectors (because we can't reliably sort them one way) in the caller rather than in the insertion logic. This will simplify things considerably as more ways to fail to match the insertion are added because now we have a nice try and retry point. llvm-svn: 218980	2014-10-03 12:01:55 +00:00
Chandler Carruth	971a560cb8	[x86] Significantly improve the ability of the new vector shuffle lowering to match VZEXT_MOVL patterns. I hadn't realized that these had sufficient pattern smarts in the backend to lower zext-ing from the low element of a vector without it being a scalar_to_vector node. They do, and this is how to match a bunch of patterns for movq, movss, etc. There is a weird propensity to end up using pshufd to place the element afterward even though it means domain crossing (or rather, to use xorps+movss to zext the element rather than movq) but that's an orthogonal problem with VZEXT_MOVL that someone should probably look at. llvm-svn: 218977	2014-10-03 11:25:58 +00:00
Chandler Carruth	e91b316266	[x86] Unbreak SSE1 with the new vector shuffle lowering. We can't widen element types to form illegal vector types. I've added a special SSE1 test case here that makes sure we don't break this going forward. llvm-svn: 218974	2014-10-03 10:11:39 +00:00
Chandler Carruth	75e182b414	[x86] Teach the new vector shuffle lowering to widen floating point elements as well as integer elements in order to form simpler shuffle patterns. This is the primary reason why we were failing to match some of the 2-and-2 floating point shuffles such as PR21140. Even after fixing this we need to support some extra patterns in the backend in order to match the resulting X86ISD::UNPCKL nodes into the correct instructions. This commit should fix PR21140 and includes more comprehensive testing of insertion patterns in v4 shuffles. Not all of the added tests are beautiful. For example, we don't have clever instructions to insert-via-load in the integer domain. There are also some places where we aren't sufficiently cunning with our use of movq and movd, but that's future work. llvm-svn: 218911	2014-10-02 21:37:14 +00:00
Chandler Carruth	8a16802d46	[x86] Improve and correct how the new vector shuffle lowering was matching and lowering 64-bit insertions. The first problem was that we weren't looking through bitcasts to discover that we could lower as insertions. Once fixed, we in turn weren't looking through bitcasts to discover that we could fold a load into the lowering. Once fixed, we weren't forming a SCALAR_TO_VECTOR node around the inserted element and instead were passing a scalar to a DAG node that expected a vector. It turns out there are some patterns that will "lower" this into the correct asm, but the rest of the X86 backend is very unhappy with such antics. This should fix a few more edge case regressions I've spotted going through the regression test suite to enable the new vector shuffle lowering. llvm-svn: 218839	2014-10-01 23:14:28 +00:00
Sanjay Patel	9ebfbb969d	Lower FNEG ( FABS (x) ) -> FNABS (x) [X86 codegen] PR20578 Negative FABS of either a scalar or vector should be handled the same way on x86 with SSE/AVX: a single OR instruction of the FP operand with a constant to light up the sign bit(s). http://llvm.org/bugs/show_bug.cgi?id=20578 Differential Revision: http://reviews.llvm.org/D5201 llvm-svn: 218822	2014-10-01 21:20:06 +00:00
Eric Christopher	12f4a78581	constify TargetMachine parameter for X86TargetLowering. llvm-svn: 218804	2014-10-01 20:38:22 +00:00
Sanjay Patel	0e4a83e89c	Don't repeat function/variable name in comment. NFC. llvm-svn: 218791	2014-10-01 19:39:32 +00:00
Chandler Carruth	6c02c031b8	[x86] Fix a few more tiny patterns with the new vector shuffle lowering that keep cropping up in the regression test suite. This also addresses one of the issues raised on the mailing list with failing to form 'movsd' in as many cases as we realistically should. There will be corresponding patches forthcoming for v4f32 at least. This was a lot of fuss for a relatively small gain, but all the fuss was on my end trying different ways of holding the pieces of the x86 fragment patterns just right. Now that it works, the code is reasonably simple. In the new test cases I'm adding here, v2i64 sticks out as just plain horrible. I've not come up with any great ideas here other than that it would be nice to recognize when we're going to take a domain crossing hit and cross earlier to get the decent instructions. At least with AVX it is slightly less silly.... llvm-svn: 218756	2014-10-01 11:14:02 +00:00
Chandler Carruth	048486109b	[x86] Delete some extraneous logic from the new vector shuffle lowering. Nothing was relying on this and there are potentially some edge cases that it would not be correct under. Removing it seems better than trying to "fix" it as nothing was relying on it. llvm-svn: 218755	2014-10-01 11:13:57 +00:00
Nick Lewycky	5f75f4ddb9	Fix typo in comment from r218733 llvm-svn: 218739	2014-10-01 03:37:34 +00:00
Chandler Carruth	26cb9b8d2d	[x86] Teach the new vector shuffle lowering to be even more aggressive in exposing the scalar value to the broadcast DAG fragment so that we can catch even reloads and fold them into the broadcast. This is somewhat magical I'm afraid but seems to work. It is also what the old lowering did, and I've switched an old test to run both lowerings demonstrating that we get the same result. Unlike the old code, I'm not lowering f32 or f64 scalars through this path when we only have AVX1. The target patterns include pretty heinous code to re-cast those as shuffles when the scalar happens to not be spilled because AVX1 provides no broadcast mechanism from registers what-so-ever. This is terribly brittle. I'd much rather go through our generic lowering code to get this. If needed, we can add a peephole to get even more opportunities to broadcast-from-spill-slots that are exposed post-RA, but my suspicion is this just doesn't matter that much. llvm-svn: 218734	2014-10-01 03:19:43 +00:00
Chandler Carruth	846baf2ca1	[x86] Hoist the zext-lowering up in the v4i32 lowering routine -- it is the same speed as pshufd but we can fold loads into the pmovzx instructions. This fixes some regressions that came up in the regression test suite for the new vector shuffle lowering. llvm-svn: 218733	2014-10-01 02:25:54 +00:00
Chandler Carruth	b9d3fa1e65	[x86] Teach the new vector shuffle lowering about VBROADCAST and VPBROADCAST. This has the somewhat expected pervasive impact. I don't know why I forgot about this. Everything seems good with lots of significant improvements in the tests. llvm-svn: 218724	2014-10-01 00:41:21 +00:00
Robert Khasanov	5aa4445bde	[AVX512] Added intrinsics for 128- and 256-bit versions of VCMPEQ{BWDQ} Fixed lowering of this intrinsics in case when mask is v2i1 and v4i1. Now cmp intrinsics lower in the following way: (i8 (int_x86_avx512_mask_pcmpeq_q_128 (v2i64 %a), (v2i64 %b), (i8 %mask))) -> (i8 (bitcast (v8i1 (insert_subvector undef, (v2i1 (and (PCMPEQM %a, %b), (extract_subvector (v8i1 (bitcast %mask)), 0))), 0)))) llvm-svn: 218669	2014-09-30 11:41:54 +00:00
Robert Khasanov	a27c8e0fd9	[AVX512] Enabled intrinsics for VPCMPEQD and VPCMPEQQ. Added CMP_MASK intrinsic type llvm-svn: 218667	2014-09-30 11:19:50 +00:00
Chandler Carruth	aaf8e03d92	[x86] Revert r218588, r218589, and r218600. These patches were pursuing a flawed direction and causing miscompiles. Read on for details. Fundamentally, the premise of this patch series was to map VECTOR_SHUFFLE DAG nodes into VSELECT DAG nodes for all blends because we are going to have to lower to VSELECT nodes for some blends to trigger the instruction selection patterns of variable blend instructions. This doesn't actually work out so well. In order to match performance with the existing VECTOR_SHUFFLE lowering code, we would need to re-slice the blend in order to fit it into either the integer or floating point blends available on the ISA. When coming from VECTOR_SHUFFLE (or other vNi1 style VSELECT sources) this works well because the X86 backend ensures that these types of operands to VSELECT get sign extended into '-1' and '0' for true and false, allowing us to re-slice the bits in whatever granularity without changing semantics. However, if the VSELECT condition comes from some other source, for example code lowering vector comparisons, it will likely only have the required bit set -- the high bit. We can't blindly slice up this style of VSELECT. Reid found some code using Halide that triggers this and I'm hopeful to eventually get a test case, but I don't need it to understand why this is A Bad Idea. There is another aspect that makes this approach flawed. When in VECTOR_SHUFFLE form, we have very distilled information that represents the constant blend mask. Converting back to a VSELECT form actually can lose this information, and so I think now that it is better to treat this as VECTOR_SHUFFLE until the very last moment and only use VSELECT nodes for instruction selection purposes. My plan is to: 1) Clean up and formalize the target pre-legalization DAG combine that converts a VSELECT with a constant condition operand into a VECTOR_SHUFFLE. 2) Remove any fancy lowering from VSELECT during legalization relying entirely on the DAG combine to catch cases where we can match to an immediate-controlled blend instruction. One additional step that I'm not planning on but would be interested in others' opinions on: we could add an X86ISD::VSELECT or X86ISD::BLENDV which encodes a fully legalized VSELECT node. Then it would be easy to write isel patterns only in terms of this to ensure VECTOR_SHUFFLE legalization only ever forms the fully legalized construct and we can't cycle between it and VSELECT combining. llvm-svn: 218658	2014-09-30 02:52:28 +00:00
Chandler Carruth	6cbf43167b	[x86] Make the new vector shuffle lowering lower blends as VSELECT nodes, and rely exclusively on its logic. This removes a ton of duplication from the blend lowering and centralizes it in one place. One downside is that it requires a bunch of hacks to make this work with the current legalization framework. We have to manually speculate one aspect of legalizing VSELECT nodes to get everything to work nicely because the existing legalization framework isn't actually bottom-up. The other grossness is that we somewhat duplicate the analysis of constant blends. I'm on the fence here. If reviewers thing this would look better with VSELECT when it has constant operands dumping over tho VECTOR_SHUFFLE, we could go that way. But it would be a substantial change because currently all of the actual blend instructions are matched via patterns in the TD files based around VSELECT nodes (despite them not being perfect fits for that). Suggestions welcome, but at least this removes the rampant duplication in the backend. llvm-svn: 218600	2014-09-29 09:57:07 +00:00
Chandler Carruth	b1cc7a8542	[x86] Delete a bunch of really bad and totally unnecessary code in the X86 target-specific DAG combining that tried to convert VSELECT nodes into VECTOR_SHUFFLE nodes that it "knew" would lower into immediate-controlled blend nodes. Turns out, we have perfectly good lowering of all these VSELECT nodes, and indeed that lowering already knows how to handle lowering through BLENDI to immediate-controlled blend nodes. The code just wasn't getting used much because this thing forced the world to go through the vector shuffle lowering. Yuck. This also exposes that I was too aggressive in avoiding domain crossing in v218588 with that lowering -- when the other option is to expand into two 128-bit vectors, it is worth domain crossing. Restore that behavior now that we have nice tests covering it. The test updates here fall into two camps. One is where previously we ended up with an unsigned encoding of the blend operand and now we get a signed encoding. In most of those places there were elaborate comments explaining exactly what these operands really mean. Rather than that, just switch these tests to use the nicely decoded comments that make it obvious that the final shuffle matches. The other updates are just removing pointless domain crossing by blending integers with PBLENDW rather than BLENDPS. llvm-svn: 218589	2014-09-29 02:01:20 +00:00
Chandler Carruth	d639c7a829	[x86] Refactor all of the VSELECT-as-blend lowering code to avoid domain crossing and generally work more like the blend emission code in the new vector shuffle lowering. My goal is to have the new vector shuffle lowering just produce VSELECT nodes that are either matched here to BLENDI or are legal and matched in the .td files to specific blend instructions. That seems much cleaner as there are other ways to produce a VSELECT anyways. =] No observable functionality changed yet, mostly because this code appears to be near-dead. The behavior of this lowering routine did change though. This code being mostly dead and untestable will change with my next commit which will also point some new tests at it. llvm-svn: 218588	2014-09-29 01:32:54 +00:00
Chandler Carruth	2f9e56e527	[x86] Improve naming and comments for VSELECT lowering. No functionality changed. llvm-svn: 218586	2014-09-29 00:51:58 +00:00
Chandler Carruth	c7129276cd	[x86] Add the dispatch skeleton to the new vector shuffle lowering for AVX-512. There is no interesting logic yet. Everything ends up eventually delegating to the generic code to split the vector and shuffle the halves. Interestingly, that logic does a significantly better job of lowering all of these types than the generic vector expansion code does. Mostly, it lets most of the cases fall back to nice AVX2 code rather than all the way back to SSE code paths. Step 2 of basic AVX-512 support in the new vector shuffle lowering. Next up will be to incrementally add direct support for the basic instruction set to each type (adding tests first). llvm-svn: 218585	2014-09-29 00:37:27 +00:00
Chandler Carruth	32a3ebda14	[x86] Make the split-and-lower routine fully generic by relaxing the assertion, making the name generic, and improving the documentation. Step 1 in adding very primitive support for AVX-512. No functionality changed yet. llvm-svn: 218584	2014-09-29 00:21:49 +00:00
Chandler Carruth	24e3b69cbd	[x86] Teach the new vector shuffle lowering to fall back on AVX-512 vectors. Someone will need to build the AVX512 lowering, which should follow AVX1 and AVX2 very closely for AVX512F and AVX512BW resp. I've added a dummy test which is a port of the v8f32 and v8i32 tests from AVX and AVX2 to v8f64 and v8i64 tests for AVX512F and AVX512BW. Hopefully this is enough information for someone to implement proper lowering here. If not, I'll be happy to help, but right now the AVX-512 support isn't a priority for me. llvm-svn: 218583	2014-09-28 23:53:10 +00:00
Chandler Carruth	abe742e8fb	[x86] Fix the new vector shuffle lowering's use of VSELECT for AVX2 lowerings. This was hopelessly broken. First, the x86 backend wants '-1' to be the element value representing true in a boolean vector, and second the operand order for VSELECT is backwards from the actual x86 instructions. To make matters worse, the backend is just using '-1' as the true value to get the high bit to be set. It doesn't actually symbolically map the '-1' to anything. But on x86 this isn't quite how it works: there only the high bit is relevant. As a consequence weird non-'-1' values like 0x80 actually "work" once you flip the operands to be backwards. Anyways, thanks to Hal for helping me sort out what these should be. llvm-svn: 218582	2014-09-28 23:23:55 +00:00
Chandler Carruth	6578f9208b	[x86] Fix a really silly bug that I introduced fixing another bug in the new vector shuffle target DAG combines -- it helps to actually test for the value you want rather than just using an integer in a boolean context. Have I mentioned that I loathe implicit conversions recently? :: sigh :: llvm-svn: 218576	2014-09-28 06:11:04 +00:00
Chandler Carruth	b10c6b8e9e	[x86] Fix yet another bug in the new vector shuffle lowering's handling of widening masks. We can't widen a zeroing mask unless both elements that would be merged are either zeroed or undef. This is the only way to widen a mask if it has a zeroed element. Also clean up the code here by ordering the checks in a more logical way and by using the symoblic values for undef and zero. I'm actually torn on using the symbolic values because the existing code is littered with the assumption that -1 is undef, and moreover that entries '< 0' are the special entries. While that works with the values given to these constants, using the symbolic constants actually makes it a bit more opaque why this is the case. llvm-svn: 218575	2014-09-28 03:30:25 +00:00
Chandler Carruth	f4b9e6b9d9	[x86] Fix yet another issue with widening vector shuffle elements. I spotted this by inspection when debugging something else, so I have no test case what-so-ever, and am not even sure it is possible to realistically trigger the bug. But this is what was intended here. llvm-svn: 218565	2014-09-27 08:40:33 +00:00
Chandler Carruth	4d03be1717	[x86] Fix terrible bugs everywhere in the new vector shuffle lowering and in the target shuffle combining when trying to widen vector elements. Previously only one of these was correct, and we didn't correctly propagate zeroing target shuffle masks (which have a different sentinel value from undef in non- target shuffle masks now). This isn't just a missed optimization, this caused us to drop zeroing shuffles on the floor and miscompile code. The added test case is one example of that. There are other fixes to the test suite as a consequence of this as well as restoring the undef elements in some of the masks that were lost when I brought sanity to the actual value of the undef and zero sentinels. I've also just cleaned up some of the PSHUFD and PSHUFLW and PSHUFHW combining code, but that code really needs to go. It was a nice initial attempt, but it isn't very principled and the recursive shuffle combiner is much more powerful. llvm-svn: 218562	2014-09-27 04:42:44 +00:00
Chandler Carruth	f572f3b2c0	[x86] Fix a moderately terrifying bug in the new 128-bit shuffle logic that managed to elude all of my fuzz testing historically. =/ Something changed to allow this code path to actually be exercised and it was doing bad things. It is especially heavily exercised by the patterns that emerge when doing AVX shuffles that end up lowered through the 128-bit code path. llvm-svn: 218540	2014-09-26 20:41:45 +00:00
Chandler Carruth	acd1906446	[x86] The mnemonic is SHUFPS not SHUPFS. =[ I'm very bad at spelling sadly. llvm-svn: 218524	2014-09-26 17:27:40 +00:00
Chandler Carruth	0c9ee10d01	[x86] In the new vector shuffle lowering, when trying to do another layer of tie-breaking sorting, it really helps to check that you're in a tie first. =] Otherwise the whole thing cycles infinitely. Test case added, another one found through fuzz testing. llvm-svn: 218523	2014-09-26 17:24:26 +00:00
Chandler Carruth	5afd4c2603	[x86] Fix a large collection of bugs that crept in as I fleshed out the AVX support. New test cases included. Note that none of the existing test cases covered these buggy code paths. =/ Also, it is clear from this that SHUFPS and SHUFPD are the most bug prone shuffle instructions in x86. =[ These were all detected by fuzz-testing. (I <3 fuzz testing.) llvm-svn: 218522	2014-09-26 17:11:02 +00:00
Robin Morisset	810739d174	Lower idempotent RMWs to fence+load Summary: I originally tried doing this specifically for X86 in the backend in D5091, but it was rather brittle and generally running too late to be general. Furthermore, other targets may want to implement similar optimizations. So I reimplemented it at the IR-level, fitting it into AtomicExpandPass as it interacts with that pass (which could not be cleanly done before at the backend level). This optimization relies on a new target hook, which is only used by X86 for now, as the correctness of the optimization on other targets remains an open question. If it is found correct on other targets, it should be trivial to enable for them. Details of the optimization are discussed in D5091. Test Plan: make check-all + a new test Reviewers: jfb Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5422 llvm-svn: 218455	2014-09-25 17:27:43 +00:00
Chandler Carruth	0a6e961efd	[x86] Teach the new vector shuffle lowering to use AVX2 instructions for v4f64 and v8f32 shuffles when they are lane-crossing. We have fully general lane-crossing permutation functions in AVX2 that make this easy. Part of this also changes exactly when and how these vectors are split up when we don't have AVX2. This isn't always a win but it usually is a win, so on the balance I think its better. The primary regressions are all things that just need to be fixed anyways such as modeling when a blend can be completely accomplished via VINSERTF128, etc. Also, this highlights one of the few remaining big features: we do a really poor job of inserting elements into AVX registers efficiently. This completes almost all of the big tricks I have in mind for AVX2. The only things left that I plan to add: 1) element insertion smarts 2) palignr and other fairly specialized lowerings when they happen to apply llvm-svn: 218449	2014-09-25 11:03:55 +00:00
Chandler Carruth	e91d68c475	[x86] Teach the new vector shuffle lowering a fancier way to lower 256-bit vectors with lane-crossing. Rather than immediately decomposing to 128-bit vectors, try flipping the 256-bit vector lanes, shuffling them and blending them together. This reduces our worst case shuffle by a pretty significant margin across the board. llvm-svn: 218446	2014-09-25 10:21:15 +00:00
Chandler Carruth	02387122e0	[x86] Fix an oversight in the v8i32 path of the new vector shuffle lowering where it only used the mask of the low 128-bit lane rather than the entire mask. This allows the new lowering to correctly match the unpack patterns for v8i32 vectors. For reference, the reason that we check for the the entire mask rather than checking the repeated mask is because the repeated masks don't abide by all of the invariants of normal masks. As a consequence, it is safer to use the full mask with functions like the generic equivalence test. llvm-svn: 218442	2014-09-25 04:10:27 +00:00
Chandler Carruth	8140158cb5	[x86] Rearrange the code for v16i16 lowering a bit for clarity and to reduce the amount of checking we do here. The first realization is that only non-crossing cases between 128-bit lanes are handled by almost the entire function. It makes more sense to handle the crossing cases first. THe second is that until we actually are going to generate fancy shared lowering strategies that use the repeated semantics of the v8i16 lowering, we should waste time checking for repeated masks. It is simplest to directly test for the entire unpck masks anyways, so we gained nothing from this. This also matches the structure of v32i8 more closely. No functionality changed here. llvm-svn: 218441	2014-09-25 04:03:22 +00:00
Chandler Carruth	d8f528adb8	[x86] Implement AVX2 support for v32i8 in the new vector shuffle lowering. This completes the basic AVX2 feature support, but there are still some improvements I'd like to do to really get the last mile of performance here. llvm-svn: 218440	2014-09-25 02:52:12 +00:00
Chandler Carruth	d355369dbb	[x86] Remove the defunct X86ISD::BLENDV entry -- we use vector selects for this now. Should prevent folks from running afoul of this and not knowing why their code won't instruction select the way I just did... llvm-svn: 218436	2014-09-25 01:16:01 +00:00
Chandler Carruth	a577bc26b6	[x86] Fix the v16i16 blend logic I added in the prior commit and add the missing test cases for it. Unsurprisingly, without test cases, there were bugs here. Surprisingly, this bug wasn't caught at compile time. Yep, there is an X86ISD::BLENDV. It isn't wired to anything. Oops. I'll fix than next. llvm-svn: 218434	2014-09-25 01:13:38 +00:00
Chandler Carruth	98443d89b9	[x86] Implement v16i16 support with AVX2 in the new vector shuffle lowering. This also implements the fancy blend lowering for v16i16 using AVX2 and teaches the X86 backend to print shuffle masks for 256-bit PSHUFB and PBLENDW instructions. It also makes the mask decoding correct for PBLENDW instructions. The yaks, they are legion. Tests are updated accordingly. There are some missing tests for the VBLENDVB lowering, but I'll add those in a follow-up as this commit has accumulated enough cruft already. llvm-svn: 218430	2014-09-25 00:24:19 +00:00
Chandler Carruth	edcba62b4a	[x86] Factor out the logic to generically decombose a vector shuffle into unblended shuffles and a blend. This is the consistent fallback for the lowering paths that have fast blend operations available, and its getting quite repetitive. No functionality changed. llvm-svn: 218399	2014-09-24 18:20:09 +00:00
Chandler Carruth	9bd10e7492	[x86] Teach the new vector shuffle lowering to lower v8i32 shuffles with the native AVX2 instructions. Note that the test case is really frustrating here because VPERMD requires the mask to be in the register input and we don't produce a comment looking through that to the constant pool. I'm going to attempt to improve this in a subsequent commit, but not sure if I will succeed. llvm-svn: 218347	2014-09-24 01:24:44 +00:00
Chandler Carruth	fd11815a7d	[x86] Fix a really terrible bug in the repeated 128-bin-lane shuffle detection. It was incorrectly handling undef lanes by actually treating an undef lane in the first 128-bit lane as a numeric shuffle value. Fortunately, this almost always DTRT and disabled detecting repeated patterns. But not always. =/ This patch introduces a much more principled approach and fixes the miscompiles I spotted by inspection previously. llvm-svn: 218346	2014-09-24 01:03:57 +00:00
Chandler Carruth	df2e421845	[x86] Teach the new vector shuffle lowering to lower v4i64 vector shuffles using the AVX2 instructions. This is the first step of cutting in real AVX2 support. Note that I have spotted at least one bug in the test cases already, but I suspect it was already present and just is getting surfaced. Will investigate next. llvm-svn: 218338	2014-09-23 22:39:02 +00:00
Chandler Carruth	9a94bd6fa4	[x86] Teach the rest of the 'target shuffle' machinery about blends and add VPBLENDD to the InstPrinter's comment generation so we get nice comments everywhere. Now that we have the nice comments, I can see the bug introduced by a silly typo in the commit that enabled VPBLENDD, and have fixed it. Yay tests that are easy to inspect. llvm-svn: 218335	2014-09-23 22:14:14 +00:00
Robin Morisset	6dbbbc28b0	[X86] Make wide loads be managed by AtomicExpand Summary: AtomicExpand already had logic for expanding wide loads and stores on LL/SC architectures, and for expanding wide stores on CmpXchg architectures, but not for wide loads on CmpXchg architectures. This patch fills this hole, and makes use of this new feature in the X86 backend. Only one functionnal change: we now lose the SynchScope attribute. It is regrettable, but I have another patch that I will submit soon that will solve this for all of AtomicExpand (it seemed better to split it apart as it is a different concern). Test Plan: make check-all (lots of tests for this functionality already exist) Reviewers: jfb Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D5404 llvm-svn: 218332	2014-09-23 20:59:25 +00:00
Chandler Carruth	adcfec995c	[x86] Teach the new shuffle lowering's blend functionality to use AVX2's VPBLENDD where appropriate even on 128-bit vectors. According to Agner's tables, this instruction is significantly higher throughput (can execute on any port) on Haswell chips so we should aggressively try to form it when available. Sadly, this loses our delightful shuffle comments. I'll add those back for VPBLENDD next. llvm-svn: 218322	2014-09-23 18:16:12 +00:00
Chandler Carruth	40592d2dec	[x86] Teach the vector comment parsing and printing to correctly handle undef in the shuffle mask. This shows up when we're printing comments during lowering and we still have an IR-level constant hanging around that models undef. A nice consequence of this is much prettier test cases where the undef lanes actually show up as undef rather than as a particular set of values. This also allows us to print shuffle comments in cases that use undef such as the recently added variable VPERMILPS lowering. Now those test cases have nice shuffle comments attached with their details. The shuffle lowering for PSHUFB has been augmented to use undef, and the shuffle combining has been augmented to comprehend it. llvm-svn: 218301	2014-09-23 11:15:19 +00:00
Chandler Carruth	6d5916a2d7	[x86] Teach the AVX1 path of the new vector shuffle lowering one more trick that I missed. VPERMILPS has a non-immediate memory operand mode that allows it to do asymetric shuffles in the two 128-bit lanes. Use this rather than two shuffles and a blend. However, it turns out the variable shuffle path to VPERMILPS (and VPERMILPD, although that one offers no functional differenc from the immediate operand other than variability) wasn't even plumbed through codegen. Do such plumbing so that we can reasonably emit a variable-masked VPERMILP instruction. Also plumb basic comment parsing and printing through so that the tests are reasonable. There are still a few tests which don't show the shuffle pattern. These are tests with undef lanes. I'll teach the shuffle decoding and printing to handle undef mask entries in a follow-up. I've looked at the masks and they seem reasonable. llvm-svn: 218300	2014-09-23 10:08:29 +00:00
Chandler Carruth	ed5dfff865	[x86] Rename X86ISD::VPERMILP to X86ISD::VPERMILPI (and the same for the td pattern). Currently we only model the immediate operand variation of VPERMILPS and VPERMILPD, we should make that clear in the pseudos used. Will be adding support for the variable mask variant in my next commit. llvm-svn: 218282	2014-09-22 22:29:42 +00:00
Kaelyn Takata	cecdff6512	Fix a "typo" from my previous commit. llvm-svn: 218281	2014-09-22 22:17:59 +00:00
Kaelyn Takata	ba0a1e0520	Silence unused variable warnings in the new stub functions that occur when assertions are disabled. llvm-svn: 218280	2014-09-22 22:14:13 +00:00
Chandler Carruth	252debeb0b	[x86] Stub out the integer lowering of 256-bit vectors with AVX2 support. No interesting functionality yet, but this will let me implement one vector type at a time. llvm-svn: 218277	2014-09-22 21:45:57 +00:00
Sanjay Patel	7939d7229d	Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2). We generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors. This patch should preserve all existing behavior with regular optimization levels, but also use splats whenever possible when optimizing for size on any CPU with AVX or AVX2. The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data. Differential Revision: http://reviews.llvm.org/D5347 llvm-svn: 218263	2014-09-22 18:54:01 +00:00
Pavel Chupin	be9f12102f	[x32] Fix segmented stacks support Summary: Update segmented-stacks*.ll tests with x32 target case and make corresponding changes to make them pass. Test Plan: tests updated with x32 target Reviewers: nadav, rafael, dschuff Subscribers: llvm-commits, zinovy.nis Differential Revision: http://reviews.llvm.org/D5245 llvm-svn: 218247	2014-09-22 13:11:35 +00:00
Chandler Carruth	12bbf7d922	[x86] Back out a bad choice about lowering v4i64 and pave the way for a more sane approach to AVX2 support. Fundamentally, there is no useful way to lower integer vectors in AVX. None. We always end up with a VINSERTF128 in the end, so we might as well eagerly switch to the floating point domain and do everything there. This cleans up lots of weird and unlikely to be correct differences between integer and floating point shuffles when we only have AVX1. The other nice consequence is that by doing things this way we will make it much easier to write the integer lowering routines as we won't need to duplicate the logic to check for AVX vs. AVX2 in each one -- if we actually try to lower a 256-bit vector as an integer vector, we have AVX2 and can rely on it. I think this will make the code much simpler and more comprehensible. Currently, I've disabled all support for AVX2 so that we always fall back to AVX. This keeps everything working rather than asserting. That will go away with the subsequent series of patches that provide a baseline AVX2 implementation. Please note, I'm going to implement AVX2 without access to hardware. That means I cannot correctness test this path. I will be relying on those with access to AVX2 hardware to do correctness testing and fix bugs here, but as a courtesy I'm trying to sketch out the framework for the new-style vector shuffle lowering in the context of the AVX2 ISA. llvm-svn: 218228	2014-09-22 00:32:15 +00:00
Chandler Carruth	5d45962b2c	[x86] Teach the new vector shuffle lowering how to cleverly lower single input v8f32 shuffles which are not 128-bit lane crossing but have different shuffle patterns in the low and high lanes. This removes most of the extract/insert traffic that was unnecessary and is particularly good at lowering cases where only one of the two lanes is shuffled at all. I've also added a collection of test cases with undef lanes because this lowering is somewhat more sensitive to undef lanes than others. llvm-svn: 218226	2014-09-21 23:46:13 +00:00
Chandler Carruth	215037e35d	[x86] With the stronger canonicalization of shuffles added in r218216, the new vector shuffle lowering no longer needs to check both symmetric forms of UNPCK patterns for v4f64. llvm-svn: 218217	2014-09-21 13:37:51 +00:00
Chandler Carruth	b3125c7522	[x86] Teach the new vector shuffle lowering to re-use the SHUFPS lowering when it can use a symmetric SHUFPS across both 128-bit lanes. This required making the SHUFPS lowering tolerant of other vector types, and adjusting our canonicalization to canonicalize harder. This is the last of the clever uses of symmetry I've thought of for v8f32. The rest of the tricks I'm aware of here are to work around assymetry in the mask. llvm-svn: 218216	2014-09-21 13:35:14 +00:00
Chandler Carruth	02f3554971	[x86] Refactor the logic to form SHUFPS instruction patterns to lower a generic vector shuffle mask into a helper that isn't specific to the other things that influence which choice is made or the specific types used with the instruction. No functionality changed. llvm-svn: 218215	2014-09-21 13:03:00 +00:00
Chandler Carruth	33eda72802	[x86] Teach the new vector shuffle lowering the basics about insertion of a single element into a zero vector for v4f64 and v4i64 in AVX. Ironically, there is less to see here because xor+blend is so crazy fast that we can't really beat that to zero the high 128-bit lane. llvm-svn: 218214	2014-09-21 12:49:46 +00:00
Chandler Carruth	43f5974ea0	[x86] Teach the new vector shuffle lowering how to lower to UNPCKLPS and UNPCKHPS with AVX vectors by recognizing those patterns when they are repeated for both 128-bit lanes. With this, we now generate the exact same (really nice) code for Quentin's avx_test_case.ll which was the most significant regression reported for the new shuffle lowering. In fact, I'm out of specific test cases for AVX lowering, the rest were AVX2 I think. However, there are a bunch of pretty obvious remaining things to improve with AVX... llvm-svn: 218213	2014-09-21 12:20:44 +00:00
Chandler Carruth	88404c4f9b	[x86] Begin teaching the new vector shuffle lowering among the most important bits of cleverness: to detect and lower repeated shuffle patterns between the two 128-bit lanes with a single instruction. This patch just teaches it how to lower single-input shuffles that fit this model using VPERMILPS. =] There is more that needs to happen here. llvm-svn: 218211	2014-09-21 12:01:19 +00:00
Chandler Carruth	3dccabaf35	[x86] Explicitly lower to a blend early if it is trivial to do so for v8f32 shuffles in the new vector shuffle lowering code. This is very cheap to do and makes it much more clear that anything more expensive but overlapping with this lowering should be selected afterward (for example using AVX2's VPERMPS). However, no functionality changed here as without this code we would fall through to create no-op shuffles of each input and a blend. =] llvm-svn: 218209	2014-09-21 11:40:39 +00:00
Chandler Carruth	e81bfbada9	[x86] Teach the new vector shuffle lowering of v4f64 to prefer a direct VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the reciprocal throughput on Ivy Bridge CPUs much like it does everywhere for 128-bits. There isn't a downside, so just eagerly use this instruction when it suffices. llvm-svn: 218208	2014-09-21 11:17:55 +00:00
Chandler Carruth	8d0a1b209b	[x86] Switch the blend implementation to use a MVT switch rather than awkward conditions. The readability improvement of this will be even more important as I generalize it to handle more types. No functionality changed. llvm-svn: 218205	2014-09-21 10:36:12 +00:00
Chandler Carruth	f098cee2e3	[x86] Remove some essentially lying comments from the v4f64 path of the new vector shuffle lowering. llvm-svn: 218204	2014-09-21 10:27:14 +00:00
Chandler Carruth	a746d776eb	[x86] Fix a helper to reflect that what we actually care about is 128-bit lane crossings, not 'half' crossings. This came up in code review ages ago, but I hadn't really addresesd it. Also added some documentation for the helper. No functionality changed. llvm-svn: 218203	2014-09-21 09:35:25 +00:00
Chandler Carruth	293327ddcd	[x86] Teach the new vector shuffle lowering the first step toward more actual support for complex AVX shuffling tricks. We can do independent blends of the low and high 128-bit lanes of an avx vector, so shuffle the inputs into place and then do the blend at 256 bits. This will in many cases remove one blend instruction. The next step is to permute the low and high halves in-place rather than extracting them and re-inserting them. llvm-svn: 218202	2014-09-21 09:35:22 +00:00
Chandler Carruth	a454812ac8	[x86] Teach the new vector shuffle lowering to use VPERMILPD for single-input shuffles with doubles. This allows them to fold memory operands into the shuffle, etc. This is just the analog to the v4f32 case in my prior commit. llvm-svn: 218193	2014-09-20 22:09:27 +00:00
Chandler Carruth	6f80abac4e	[x86] Teach the new vector shuffle lowering to use the AVX VPERMILPS instruction for single-vector floating point shuffles. This in turn allows the shuffles to fold a load into the instruction which is one of the common regressions hit with the new shuffle lowering. llvm-svn: 218190	2014-09-20 20:52:07 +00:00
Chandler Carruth	8c4cccd4aa	[x86] Teach the v4f32 path of the new shuffle lowering to handle the tricky case of single-element insertion into the zero lane of a zero vector. We can't just use the same pattern here as we do in every other vector type because the general insertion logic can handle insertion into the non-zero lane of the vector. However, in SSE4.1 with v4f32 vectors we have INSERTPS that is a much better choice than the generic one for such lowerings. But INSERTPS can do lots of other lowerings as well so factoring its logic into the general insertion logic doesn't work very well. We also can't just extract the core common part of the general insertion logic that is faster (forming VZEXT_MOVL synthetic nodes that lower to MOVSS when they can) because VZEXT_MOVL is often faster than a blend while INSERTPS is slower! So instead we do a restrictive condition on attempting to use the generic insertion logic to narrow it to those cases where VZEXT_MOVL won't need a shuffle afterward and thus will do better than INSERTPS. Then we try blending. Then we go back to INSERTPS. This still doesn't generate perfect code for some silly reasons that can be fixed by tweaking the td files for lowering VZEXT_MOVL to use XORPS+BLENDPS when available rather than XORPS+MOVSS when the input ends up in a register rather than a load from memory -- BLENDPSrr has twice the reciprocal throughput of MOVSSrr. Don't you love this ISA? llvm-svn: 218177	2014-09-20 04:15:22 +00:00
Chandler Carruth	87dcf09367	[x86] Refactor the code for emitting INSERTPS to reuse the zeroable mask analysis used elsewhere. This removes the last duplicate of this logic. Also simplify the code here quite a bit. No functionality changed. llvm-svn: 218176	2014-09-20 03:57:01 +00:00
Chandler Carruth	00389f3ed9	[x86] Generalize the single-element insertion lowering to work with floating point types and use it for both v2f64 and v2i64 single-element insertion lowering. This fixes the last non-AVX performance regression test case I've gotten of for the new vector shuffle lowering. There is obvious analogous lowering for v4f32 that I'll add in a follow-up patch (because with INSERTPS, v4f32 requires special treatment). After that, its AVX stuff. llvm-svn: 218175	2014-09-20 03:32:25 +00:00
Chandler Carruth	dba8444c2a	[x86] Replace some duplicated logic reasoning about whether particular vector lanes can be modeled as zero with a call to the new function that computes a bit-vector representing that information. No functionality changed here, but will allow doing more clever things with the zero-test. llvm-svn: 218174	2014-09-20 02:44:21 +00:00
Chandler Carruth	a6b7178b9d	[x86] Hoist a function up to the rest of the non-type-specific lowering helpers, and re-flow the logic to use early exit and be a bit more readable. No functionality changed. llvm-svn: 218155	2014-09-19 21:52:10 +00:00
Chandler Carruth	f85c6dfa45	[x86] Hoist the actual lowering logic into a helper function to separate it from the shuffle pattern matching logic. Also cleaned up variable names, comments, etc. No functionality changed. llvm-svn: 218152	2014-09-19 21:20:08 +00:00
Chandler Carruth	0fc0c22fa9	[x86] Fully generalize the zext lowering in the new vector shuffle lowering to support both anyext and zext and to custom lower for many different microarchitectures. Using this allows us to get exactly the right code for zext and anyext shuffles in all the vector sizes. For v16i8, the improvement is huge. The new SSE2 test case added I refused to add before this because it was sooooo muny instructions. llvm-svn: 218143	2014-09-19 20:00:32 +00:00
Chandler Carruth	8a6536d4b2	[x86] Recognize that we can use duplication to widen v16i8 shuffles due to undef lanes as well as defined widenable lanes. This dramatically improves the lowering we use for undef-shuffles in a zext-ish pattern for SSE2. llvm-svn: 218115	2014-09-19 09:45:21 +00:00
Chandler Carruth	2e275142cd	[x86] Teach the new vector shuffle lowering to also use pmovzx for v4i32 shuffles that are zext-ing. Not a lot to see here; the undef lane variant is better handled with pshufd, but this improves the actual zext pattern. llvm-svn: 218112	2014-09-19 08:37:44 +00:00
Chandler Carruth	398ba9a018	[x86] Add a dedicated lowering path for zext-compatible vector shuffles to the new vector shuffle lowering code. This allows us to emit PMOVZX variants consistently for patterns where it is a viable lowering. This instruction is both fast and allows us to fold loads into it. This only hooks the new lowering up for i16 and i8 element widths, mostly so I could manage the change to the tests. I'll add the i32 one next, although it is significantly less interesting. One thing to note is that we already had some tests for these patterns but those tests had far less horrible instructions. The problem is that those tests weren't checking the strict start and end of the instruction sequence. =[ As a consequence something changed in the lowering making us generate TERRIBLE code for these patterns in SSE2 through SSSE3. I've consolidated all of the tests and spelled out the madness that we currently emit for these shuffles. I'm going to try to figure out what has gone wrong here. llvm-svn: 218102	2014-09-19 06:07:49 +00:00
Chandler Carruth	9057fcaf82	[x86] Use PALIGNR for v4i32 and v2i64 blends when appropriate. There is no purpose in using it for single-input shuffles as pshufd is just as fast and doesn't tie the two operands. This removes a substantial amount of wrong-domain blend operations in SSSE3 mode. It also completes the usage of PALIGNR for integer shuffles and addresses one of the test cases Quentin hit with the new vector shuffle lowering. There is still the question of whether and when to use this for floating point shuffles. It is faster than shufps or shufpd but in the integer domain. I don't yet really have a good heuristic here for when to use this instruction for floating point vectors. llvm-svn: 218038	2014-09-18 09:00:25 +00:00

1 2 3 4 5 ...

2942 Commits