llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	058f2f6d72	[AVX-512] Fix accidental uses of AH/BH/CH/DH after copies to/from mask registers We've had several bugs (PR32256, PR32241) recently that resulted from usages of AH/BH/CH/DH either before or after a copy to/from a mask register. This ultimately occurs because we create COPY_TO_REGCLASS with VK1 and GR8. Then in CopyToFromAsymmetricReg in X86InstrInfo we find a 32-bit super register for the GR8 to emit the KMOV with. But as these tests are demonstrating, it's possible for the GR8 register to be a high register and we end up doing an accidental extract or insert from bits 15:8. I think the best way forward is to stop making copies directly between mask registers and GR8/GR16. Instead I think we should restrict to only copies between mask registers and GR32/GR64 and use EXTRACT_SUBREG/INSERT_SUBREG to handle the conversion from GR32 to GR16/8 or vice versa. Unfortunately, this complicates fastisel a bit more now to create the subreg extracts where we used to create GR8 copies. We can probably make a helper function to bring down the repetition. This does result in KMOVD being used for copies when BWI is available because we don't know the original mask register size. This caused a lot of deltas on tests because we have to split the checks for KMOVD vs KMOVW based on BWI. Differential Revision: https://reviews.llvm.org/D30968 llvm-svn: 298928	2017-03-28 16:35:29 +00:00
Igor Breger	293dfb9768	[X86] Add vector zext tests. llvm-svn: 297581	2017-03-12 13:20:10 +00:00
Simon Pilgrim	0f0e5bd3c6	[X86][SSE] Allow matchVectorShuffleWithUNPCK to recognise ZERO inputs Add support for specifying an UNPCK input as ZERO, particularly improves ZEXT cases with non-zero offsets llvm-svn: 295169	2017-02-15 11:46:15 +00:00
Craig Topper	680c73e7ab	[X86] Genericize the handling of INSERT_SUBVECTOR from an EXTRACT_SUBVECTOR to support 512-bit vectors with 128-bit or 256-bit subvectors. We now detect that both the extract and insert indices are non-zero and convert to a shuffle. This will be lowered as a blend for 256-bit vectors or as a vshuf operation for 512-bit vectors. llvm-svn: 294931	2017-02-13 04:53:29 +00:00
Craig Topper	978fdb75a4	[X86] Add support for folding (insert_subvector vec1, (extract_subvector vec2, idx1), idx1) -> (blendi vec2, vec1). llvm-svn: 294112	2017-02-04 23:26:46 +00:00
Michael Zuckerman	6baa3838e9	Fix blend mask by switching the sides of the operands, since the Blend node uses the opposite mask from the Select node. llvm-svn: 292066	2017-01-15 16:43:14 +00:00
Craig Topper	63e2cd6caa	[AVX-512] Teach two address instruction pass to replace masked move instructions with blendm instructions when it's beneficial. Isel now selects masked move instructions for vselect instead of blendm. But sometimes it is beneficial to register allocation to remove the tied register constraint by using blendm instructions. This also picks up cases where the masked move was created due to a masked load intrinsic. Differential Revision: https://reviews.llvm.org/D28454 llvm-svn: 292005	2017-01-14 07:50:52 +00:00
Michael Zuckerman	558a4d8419	[X86][AVX512] Adding missing shuffle lowering to blend mask instructions Some shuffles can be lowered to blend mask instructions (VPBLENDMB/VPBLENDMW/VPBLENDMD/VPBLENDMQ). In this patch, I added a new pattern match for this case. Reviewers: 1. craig.topper 2. guyblank 3. RKSimon 4. igorb Differential Revision: https://reviews.llvm.org/D28483 llvm-svn: 291888	2017-01-13 09:06:00 +00:00
Craig Topper	fa875a1d3d	[AVX-512] Teach EVEX to VEX conversion pass to handle VINSERT and VEXTRACT instructions. llvm-svn: 290869	2017-01-03 05:46:18 +00:00
Craig Topper	15d116ab41	[AVX-512] Re-generate tests that were updated for r290663 without using update_llc_test_checks.py so duplicate check lines weren't merged. llvm-svn: 290868	2017-01-03 05:46:10 +00:00
Gadi Haber	19c4fc5e62	This is a large patch for X86 AVX-512 of an optimization for reducing code size by encoding EVEX AVX-512 instructions using the shorter VEX encoding when possible. There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers. The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled. Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky Differential Revision: https://reviews.llvm.org/D27901 llvm-svn: 290663	2016-12-28 10:12:48 +00:00
Simon Pilgrim	1638d49f20	[X86][SSE] Add support for combining target shuffles to binary BLEND We already had support for 1-input BLEND with zero - this adds support for 2-input BLEND as well. llvm-svn: 283040	2016-10-01 16:04:28 +00:00
Craig Topper	8aca90507f	[AVX-512] Add VLX command lines to 128 and 256-bit shuffle tests. llvm-svn: 283014	2016-10-01 06:01:18 +00:00
Simon Pilgrim	687d71e877	[X86][SSE] Add support for combining target shuffles to PSLLDQ/PSRLDQ byte shifts llvm-svn: 278502	2016-08-12 11:24:34 +00:00
Simon Pilgrim	898f030f70	[X86][SSE] Enable target shuffle combining to combine multiple shuffle inputs. We currently only support combining target shuffles that consist of a single source input (plus elements known to be undef/zero). This patch generalizes the recursive combining of the target shuffle to collect all the inputs, merging any duplicates along the way, into a full set of src ops and its shuffle mask. We uncover a number of cases where we have failed to combine a unary shuffle because the input has been duplicated and separated during lowering. This will allow us to combine to 2-input shuffles in a future patch. Differential Revision: https://reviews.llvm.org/D22859 llvm-svn: 277631	2016-08-03 19:08:24 +00:00
Simon Pilgrim	2683ad54ad	[X86][AVX2] Improve lowerShuffleAsRepeatedMaskAndLanePermute permutation of 64-bit sub-lanes As discussed on PR28136, lowerShuffleAsRepeatedMaskAndLanePermute was attempting to match repeated masks at the 128-bit level and then permute the resultant lanes at the 128-bit (AVX1) or 64-bit (AVX2) sub-lane level. This change allows us to create the repeated masks at the sub-lane level (and then concat them together to create a 128-bit repeated mask) and then select which sub-lane to permute. This has no effect on the AVX1 codegen. Fixes PR28136. llvm-svn: 275543	2016-07-15 09:49:12 +00:00
Simon Pilgrim	420b266d0a	[X86][AVX2] Allow VPERMPD/VPERMQ shuffles to call combineShuffle (reapplied) This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better. This was incorrectly reverted in rL275421 during triage of PR28552. llvm-svn: 275497	2016-07-14 23:05:09 +00:00
Nico Weber	3afaf16abc	Revert r275411, it caused PR28552. llvm-svn: 275421	2016-07-14 14:49:35 +00:00
Simon Pilgrim	3ecb6bdd5f	[X86][AVX2] Allow VPERMPD/VPERMQ shuffles to call combineShuffle This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better. llvm-svn: 275411	2016-07-14 13:28:43 +00:00
Simon Pilgrim	9a09652a3a	[X86][AVX] Added test case for PR28136 llvm-svn: 273098	2016-06-18 22:59:08 +00:00
Ahmed Bougacha	a3dc1ba142	[X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB. Otherwise we fall back to a blend of PSHUFBs later on. Differential Revision: http://reviews.llvm.org/D19661 llvm-svn: 271113	2016-05-28 14:38:04 +00:00
Simon Pilgrim	32b1c9fe7f	[X86][AVX2] Prefer VPERMQ/VPERMPD over VINSERTI128/VINSERTF128 for unary shuffles Using VPERMQ/VPERMPD allows memory folding of the (repeated) input where VINSERTI128/VINSERTF128 can not. Differential Revision: http://reviews.llvm.org/D19228 llvm-svn: 266728	2016-04-19 12:26:40 +00:00
Simon Pilgrim	c5b5dcb985	[X86][AVX] Support bit-blend integer shuffles for 256-bit integer vectors AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back. This patch adds the ability to lower using the bit-blend patterns before defaulting to the splitting behaviour. Part 2 of 2 Differential Revision: http://reviews.llvm.org/D17292 llvm-svn: 261082	2016-02-17 10:50:06 +00:00
Simon Pilgrim	a50e8d3627	[X86][AVX] Support bit-mask integer shuffles for 256-bit integer vectors AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back. This patch adds the ability to lower using the bit-mask patterns before defaulting to the splitting behaviour. In some cases this ends up matching what AVX2 would do anyhow or what AVX1 does on the split vectors. Part 1 of 2 Differential Revision: http://reviews.llvm.org/D17292 llvm-svn: 261081	2016-02-17 10:37:49 +00:00
Simon Pilgrim	08ba012973	[X86][AVX] Lower shuffles as repeated lane shuffles then lane-crossing shuffles This patch attempts to represent a shuffle as a repeating shuffle (recognisable by is128BitLaneRepeatedShuffleMask) with the source input(s) in their original lanes, followed by a single permutation of the 128-bit lanes to their final destinations. On AVX2 we can additionally attempt to match using 64-bit sub-lane permutation. AVX2 can also now match a similar 'broadcasted' repeating shuffle. This patch has several benefits: * Avoids prematurely matching with lowerVectorShuffleByMerging128BitLanes which can require both inputs to have their input lanes permuted before shuffling. * Can replace PERMPS/PERMD instructions - although these are useful for cross-lane unary shuffling, they require their shuffle mask to be pre-loaded (and increase register pressure). * Matching the repeating shuffle makes use of a lot of existing shuffle lowering. There is an outstanding minor AVX1 regression (combine_unneeded_subvector1 in vector-shuffle-combining.ll) of a previously 128-bit shuffle + subvector splat being converted to a subvector splat + (2 instruction) 256-bit shuffle, I intend to fix this in a followup patch for review. Differential Revision: http://reviews.llvm.org/D16537 llvm-svn: 260834	2016-02-13 21:54:04 +00:00
Simon Pilgrim	5ba1c127fc	[X86][SSE] Improve i16 splatting shuffles Better handling of the annoying pshuflw/pshufhw ops which only shuffle lower/upper halves of a vector. Added vXi16 unary shuffle support for cases where i16 elements (from the same half of the source) are being splatted to the whole of one of the halves. This avoids the general lowering case which must shuffle the 32-bit elements first - meaning that we used to end up with unnecessary duplicate pshuflw/pshufhw shuffles. Note this has the side effect of a lot of SSSE3 test cases no longer needing to use PSHUFB, as it falls below the 3 op combine threshold for when PSHUFB is typically worth it. I've raised PR26183 to discuss if the threshold should be changed and whether we need to make it more specific to the target CPU. Differential Revision: http://reviews.llvm.org/D14901 llvm-svn: 258440	2016-01-21 22:07:41 +00:00
Simon Pilgrim	3e5fb61978	[X86][AVX2] Broadcast subvectors AVX2 can only broadcast from the zero'th element of a vector, but if the broadcastable element is the zero'th element of a 128-bit subvector it's advantageous to extract the subvector, broadcast from that and avoid the loading of shuffle mask data that would be needed for VPERMPS/VPERMD. The only exception being when the source type is 4f64 or 4i64 which can use the immediate shuffle VPERMPD/VPERMQ directly. Differential Revision: http://reviews.llvm.org/D16050 llvm-svn: 258081	2016-01-18 20:59:04 +00:00
Simon Pilgrim	17377bdd45	[X86][AVX] Only shuffle the lower half of vectors if the upper half is undefined First step towards making better use of AVX's implicit zeroing of the upper half of a 256-bit vector by instructions that only act on the lower 128-bit vector - discussed on D14151. As well as the fact that 128-bit shuffle instructions are generally more capable, this can be performant for older CPUs with 128-bit ALUs (e.g. Jaguar, Sandy Bridge) that must treat 256-bit vectors as multiple micro-ops. Moved the similar subvector extraction shuffle combines from PerformShuffleCombine256 to lowerVectorShuffle as well. Note: I've avoided combining shuffles that reference elements from the upper halves of the input vectors - this may be reviewed in future work as well (AVX1 would probably always gain, but AVX2 does have some cross-lane shuffle instructions). Differential Revision: http://reviews.llvm.org/D15477 llvm-svn: 256332	2015-12-23 13:10:07 +00:00
James Y Knight	7c905063c5	Make utils/update_llc_test_checks.py note that the assertions are autogenerated. Also update existing test cases which appear to be generated by it and weren't modified (other than addition of the header) by rerunning it. llvm-svn: 253917	2015-11-23 21:33:58 +00:00
Simon Pilgrim	e896f9f8c3	[X86][AVX] Added 256-bit shuffle splat tests. llvm-svn: 253449	2015-11-18 09:39:38 +00:00
Ahmed Bougacha	05a0514b12	[X86] SRL non-LSB extracts when folding to truncating broadcasts. Now that we recognize this, we can support it instead of bailing out. That is, we can fold: (v8i16 (shufflevector (v8i16 (bitcast (v4i32 (build_vector X, Y, ...)))), <1,1,...,1>)) into: (v8i16 (vbroadcast (i16 (trunc (srl Y, 16))))) llvm-svn: 252362	2015-11-06 23:16:43 +00:00
Ahmed Bougacha	68614a36d1	[X86] Don't fold non-LSB extracts into truncating broadcasts. We used to incorrectly assume that the offset we're extracting from was a multiple of the element size. So, we'd fold: (v8i16 (shufflevector (v8i16 (bitcast (v4i32 (build_vector X, Y, ...)))), <1,1,...,1>)) into: (v8i16 (vbroadcast (i16 (trunc Y)))) whereas we should have extracted the higher bits from X. Instead, bail out if the assumption doesn't hold. llvm-svn: 252361	2015-11-06 23:16:38 +00:00
Simon Pilgrim	1cad0cd3ce	[X86][SSE] Match zero/any extension shuffles that don't start from the first element This patch generalizes the lowering of shuffles as zero extensions to allow extensions that don't start from the first element. It now recognises extensions starting anywhere in the lower 128-bits or at the start of any higher 128-bit lane. The motivation was to reduce the number of high cost pshufb calls, but it also improves the SSE2 case as well. Differential Revision: http://reviews.llvm.org/D12561 llvm-svn: 248250	2015-09-22 08:16:08 +00:00
Ahmed Bougacha	0cdc7719f0	[X86] Look for scalar through one bitcast when lowering to VBROADCAST. Fixes PR23464: one way to use the broadcast intrinsics is: _mm256_broadcastw_epi16(_mm_cvtsi32_si128((int)src)); We don't currently fold this, but now that we use native IR for the intrinsics (r245605), we can look through one bitcast to find the broadcast scalar. Differential Revision: http://reviews.llvm.org/D10557 llvm-svn: 245613	2015-08-20 21:02:39 +00:00
Ahmed Bougacha	69a17acb74	[X86] Add some broadcast-from-memory tests. llvm-svn: 245612	2015-08-20 20:59:41 +00:00
Simon Pilgrim	df984f58ad	[X86][SSE] Use bitmasks instead of shuffles where possible. VPAND is a lot faster than VPSHUFB and VPBLENDVB - this patch ensures we attempt to lower to a basic bitmask before lowering to the slower byte shuffle/blend instructions. Split off from D11518. Differential Revision: http://reviews.llvm.org/D11541 llvm-svn: 243395	2015-07-28 08:54:41 +00:00
Simon Pilgrim	c363e7d8e0	[X86][SSE] Added shuffle tests to demonstrate missed bitmask. llvm-svn: 243324	2015-07-27 20:41:57 +00:00
Sanjay Patel	eca590ffb3	[AVX] Improve insertion of i8 or i16 into low element of 256-bit zero vector Without this patch, we split the 256-bit vector into halves and produced something like: movzwl (%rdi), %eax vmovd %eax, %xmm0 vxorps %xmm1, %xmm1, %xmm1 vblendps $15, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7] Now, we eliminate the xor and blend because those zeros are free with the vmovd: movzwl (%rdi), %eax vmovd %eax, %xmm0 This should be the final fix needed to resolve PR22685: https://llvm.org/bugs/show_bug.cgi?id=22685 llvm-svn: 233941	2015-04-02 20:21:52 +00:00
Sanjay Patel	d5c2d287f9	[X86, AVX] use blends instead of insert128 with index 0 Another case of x86-specific shuffle strength reduction: avoid generating insert*128 instructions with index 0 because they are slower than their non-lane-changing blend equivalents. Shuffle lowering already catches most of these cases, but the zero vector case and some other paths such as in the modified test in vector-shuffle-256-v32.ll were getting through. Differential Revision: http://reviews.llvm.org/D8366 llvm-svn: 232773	2015-03-19 22:29:40 +00:00
Chandler Carruth	9ad2ffac23	[x86] Run most of the rest of the shuffle combining over non-128-bit vectors. This lets us fix the rest of the v16 lowering problems when pshufb is clearly better. We might still be able to improve some of the lowerings by enabling the other combine-based rewriting to fire for non-128-bit vectors, but this at least should remove any regressions from using the fancy v16i16 lowering strategy. llvm-svn: 230753	2015-02-27 12:13:14 +00:00
Chandler Carruth	97f3260f57	[x86] Make the v8i16 clever single-input shuffle lowering usable for repeated 128-bit lane shuffles of wider vector types and use it to lower 256-bit v16i16 vector shuffles where applicable. This should let us perfectly lower the pattern of pshuflw and pshufhw even for AVX2 256-bit patterns. I've not added AVX-512 support, but it should be trivial for someone working on that to wire up. Note that currently this generates bad, long shuffle chains because we don't combine 256-bit target shuffles. The subsequent patches will fix that. llvm-svn: 230751	2015-02-27 11:33:46 +00:00
Chandler Carruth	eb206aa1ea	[x86] Now that the new vector shuffle legality is enabled and everything is going well, remove the flag and the code for the old legality tests. This is the first step toward removing the entire old vector shuffle lowering. Much more code to delete coming up next. llvm-svn: 229963	2015-02-20 03:59:35 +00:00
Chandler Carruth	0b39536390	[x86] Teach the unpack lowering how to lower with an initial unpack in addition to lowering to trees rooted in an unpack. This saves shuffles and/or registers in various ways, lets us handle another class of v4i32 shuffles pre SSE4.1 without domain crosses, etc. llvm-svn: 229856	2015-02-19 15:06:13 +00:00
Chandler Carruth	8817e5e01b	[x86] Remove the insanely over-aggressive unpack lowering strategy for v16i8 shuffles, and replace it with new facilities. This uses precise patterns to match exact unpacks, and the new generalized unpack lowering only when we detect a case where we will have to shuffle both inputs anyways and they terminate in exactly a blend. This fixes all of the blend horrors that I uncovered by always lowering blends through the vector shuffle lowering. It also removes sooooo much of the crazy instruction sequences required for v16i8 lowering previously. Much cleaner now. The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when it would be marginally nicer to use pshufb+pshufb+por. However, the difference there is tiny. In many cases it's a win because we re-use the pshufb mask. In others, we get to avoid the pshufb entirely. I've left a FIXME, but I'm dubious we can really do better than this. I'm actually pretty happy with this lowering now. For SSE2 this exposes some horrors that were really already there. Those will have to be fixed by changing a different path through the v16i8 lowering. llvm-svn: 229846	2015-02-19 12:10:37 +00:00
Chandler Carruth	c802085b3a	[x86] Add initial basic support for forming blends of v16i8 vectors. This blend instruction is ... really lame. The register usage is insane. As a consequence this is probably only barely better than 2 pshufbs followed by a por, and that mostly because it only has to read from a single memory location. However, this doesn't fix as much as I kind of expected, so more to go. Pretty sure that the ordering and delegation of v16i8 is just really, really bad. llvm-svn: 229373	2015-02-16 10:58:23 +00:00
Craig Topper	7e8dcef094	[X86] Add support for lowering shuffles to 256-bit PALIGNR instruction. llvm-svn: 229359	2015-02-16 06:29:06 +00:00
Craig Topper	b2b4f8a721	[X86] Remove some hard tab characters from tests. llvm-svn: 229358	2015-02-16 06:29:02 +00:00
Simon Pilgrim	00bd79d794	[X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2 This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets. It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions. Differential Revision: http://reviews.llvm.org/D7596 llvm-svn: 229311	2015-02-15 13:19:52 +00:00
Chandler Carruth	bf0fb06e0d	[x86] Teach the decomposed shuffle/blend lowering to use an early blend when that will allow it to lower with a single permute instead of multiple permutes. It tries to detect when it will only have to do a single permute in either case to maximize folding of loads and such. This cuts a lot of the avx2 shuffle permute counts in half. =] llvm-svn: 229309	2015-02-15 12:42:15 +00:00
Chandler Carruth	1b5285dd57	[SDAG] Teach the SelectionDAG to canonicalize vector shuffles of splats directly into blends of the splats. These patterns show up even very late in the vector shuffle lowering where we don't have any chance for DAG combining to kick in, and blending is a tremendously simpler operation to model. By coercing the shuffle into a blend we can much more easily match and lower shuffles of splats. Immediately with this change there are significantly more blends being matched in the x86 vector shuffle lowering. llvm-svn: 229308	2015-02-15 12:18:12 +00:00

76 Commits