llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	3a622a14f9	[AVX512] Don't switch unmasked subvector insert/extract instructions when AVX512DQI is enabled. There's no reason to switch instructions with and without DQI. It just creates extra isel patterns and test divergences. There is however value in enabling the masked version of the instructions with DQI. This required introducing some new multiclasses to enabling this splitting. Differential Revision: https://reviews.llvm.org/D36661 llvm-svn: 311091	2017-08-17 15:40:25 +00:00
Dinar Temirbulatov	a0beedef1c	[X86] SET0 to use XMM registers where possible PR26018 PR32862 Differential Revision: https://reviews.llvm.org/D35965 llvm-svn: 309926	2017-08-03 08:50:18 +00:00
Dinar Temirbulatov	aead31a36f	[X86] SET0 to use XMM registers where possible PR26018 PR32862 Differential Revision: https://reviews.llvm.org/D35839 llvm-svn: 309298	2017-07-27 17:47:01 +00:00
Ayman Musa	721d97f7b8	Recommitting rL305465 after fixing bug in TableGen in rL306251 & rL306371 [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value are less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes. The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class. When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in instructions selection phase and transform it into one avx512 cmp instruction. Differential Revision: https://reviews.llvm.org/D33188 llvm-svn: 306402	2017-06-27 12:08:37 +00:00
Simon Pilgrim	b98cb3808c	Revert r305465: [X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). This is causing windows buildbot failures llvm-svn: 305470	2017-06-15 14:39:34 +00:00
Ayman Musa	56912cda71	[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions). AVX512 compare instructions return v*i1 types. In cases where the number of elements in the returned value are less than 8, clang adds zeroes to get a mask of v8i1 type. Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes. The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class. When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in instructions selection phase and transform it into one avx512 cmp instruction. Differential Revision: https://reviews.llvm.org/D33188 llvm-svn: 305465	2017-06-15 13:02:37 +00:00
Craig Topper	058f2f6d72	[AVX-512] Fix accidental uses of AH/BH/CH/DH after copies to/from mask registers We've had several bugs(PR32256, PR32241) recently that resulted from usages of AH/BH/CH/DH either before or after a copy to/from a mask register. This ultimately occurs because we create COPY_TO_REGCLASS with VK1 and GR8. Then in CopyToFromAsymmetricReg in X86InstrInfo we find a 32-bit super register for the GR8 to emit the KMOV with. But as these tests are demonstrating, its possible for the GR8 register to be a high register and we end up doing an accidental extra or insert from bits 15:8. I think the best way forward is to stop making copies directly between mask registers and GR8/GR16. Instead I think we should restrict to only copies between mask registers and GR32/GR64 and use EXTRACT_SUBREG/INSERT_SUBREG to handle the conversion from GR32 to GR16/8 or vice versa. Unfortunately, this complicates fastisel a bit more now to create the subreg extracts where we used to create GR8 copies. We can probably make a helper function to bring down the repitition. This does result in KMOVD being used for copies when BWI is available because we don't know the original mask register size. This caused a lot of deltas on tests because we have to split the checks for KMOVD vs KMOVW based on BWI. Differential Revision: https://reviews.llvm.org/D30968 llvm-svn: 298928	2017-03-28 16:35:29 +00:00
Amjad Aboud	4f97751798	[X86] Generate VZEROUPPER for Skylake-avx512. VZEROUPPER should not be issued on Knights Landing (KNL), but on Skylake-avx512 it should be. Differential Revision: https://reviews.llvm.org/D29874 llvm-svn: 296859	2017-03-03 09:03:24 +00:00
Simon Pilgrim	5910ebe720	[X86][AVX512] Add support for ASHR v2i64/v4i64 support without VLX Use v8i64 ASHR instructions if we don't have VLX. Differential Revision: https://reviews.llvm.org/D28537 llvm-svn: 295656	2017-02-20 12:16:38 +00:00
Gadi Haber	19c4fc5e62	This is a large patch for X86 AVX-512 of an optimization for reducing code size by encoding EVEX AVX-512 instructions using the shorter VEX encoding when possible. There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers. The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled. Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky Differential Revision: https://reviews.llvm.org/D27901 llvm-svn: 290663	2016-12-28 10:12:48 +00:00
Sanjay Patel	a0d8a278a7	[x86] use a single shufps when it can save instructions This is a tiny patch with a big pile of test changes. This partially fixes PR27885: https://llvm.org/bugs/show_bug.cgi?id=27885 My motivating case looks like this: - vpshufd {{.#+}} xmm1 = xmm1[0,1,0,2] - vpshufd {{.#+}} xmm0 = xmm0[0,2,2,3] - vpblendw {{.#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7] + vshufps {{.#+}} xmm0 = xmm0[0,2],xmm1[0,2] And this happens several times in the diffs. For chips with domain-crossing penalties, the instruction count and size reduction should usually overcome any potential domain-crossing penalty due to using an FP op in a sequence of int ops. For chips such as recent Intel big cores and Atom, there is no domain-crossing penalty for shufps, so using shufps is a pure win. So the test case diffs all appear to be improvements except one test in vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate zero elements and one test in combine-sra.ll where multiple uses prevent the expected shuffle combining. Differential Revision: https://reviews.llvm.org/D27692 llvm-svn: 289837	2016-12-15 18:03:38 +00:00
Elena Demikhovsky	09375d98b8	Type legalization for compressstore and expandload intrinsics. Implemented widening (v2f32) and splitting (v16f64). On splitting, I use "popcnt" to calculate memory increment. More type legalization work will come in the next patches. llvm-svn: 287761	2016-11-23 13:58:24 +00:00
Elena Demikhovsky	caaceef4b3	Expandload and Compressstore intrinsics 2 new intrinsics covering AVX-512 compress/expand functionality. This implementation includes syntax, DAG builder, operation lowering and tests. Does not include: handling of illegal data types, codegen prepare pass and the cost model. llvm-svn: 285876	2016-11-03 03:23:55 +00:00

13 Commits