llvm-project

Commit Graph

Author	SHA1	Message	Date
Andrea Di Biagio	c48cb86f05	[X86] Fix missed selection of non-temporal store of zero vector. When the input to a store instruction was a zero vector, the backend always selected a normal vector store regardless of the non-temporal hint. This is fixed by this patch. This fixes PR19370. llvm-svn: 220054	2014-10-17 17:27:06 +00:00
Benjamin Kramer	4ba642a2f7	X86: Drop the isConvertibleTo3Addr bit from shufps/shufpd now that we don't convert them anymore. llvm-svn: 219112	2014-10-06 09:56:40 +00:00
Chandler Carruth	f3e880697a	[x86] Add a really preposterous number of patterns for matching all of the various ways in which blends can be used to do vector element insertion for lowering with the scalar math instruction forms that effectively re-blend with the high elements after performing the operation. This then allows me to bail on the element insertion lowering path when we have SSE4.1 and are going to be doing a normal blend, which in turn restores the last of the blends lost from the new vector shuffle lowering when I got it to prioritize insertion in other cases (for example when we don't have a blend instruction). Without the patterns, using blends here would have regressed sse-scalar-fp-arith.ll completely with the new vector shuffle lowering. For completeness, I've added RUN-lines with the new lowering here. This is somewhat superfluous as I'm about to flip the default, but hey, it shows that this actually significantly changed behavior. The patterns I've added are just ridiculously repetitive. Suggestions on making them better very much welcome. In particular, handling the commuted form of the v2f64 patterns is somewhat obnoxious. llvm-svn: 219033	2014-10-03 22:43:17 +00:00
Chandler Carruth	0adda1e4d4	[x86] Adjust the patterns for lowering X86vzmovl nodes which don't perform a load to use blendps rather than movss when it is available. For non-loads, blendps is much faster. It can execute on two ports in Sandy Bridge and Ivy Bridge, and three ports on Haswell. This fixes one of the "regressions" from aggressively taking the "insertion" path in the new vector shuffle lowering. This does highlight one problem with blendps -- it isn't commuted as heavily as it should be. That's future work though. llvm-svn: 219022	2014-10-03 21:38:49 +00:00
Chandler Carruth	1964078936	[x86] Teach the new vector shuffle lowering to aggressively form MOVSS and MOVSD nodes for single element vector inserts. This is particularly important because a number of patterns in the backend detect these patterns and leverage them to simplify things. It also fixes quite a few of the insertion bad code examples. However, it regresses a specific area: when available, blendps and blendpd are dramatically faster than movss and movsd respectively. But it doesn't really work to form the blend logic first because the blends aren't as crazy efficient when the data is coming from memory anyways, and thus will have a movss or movsd regardless. Also, doing that would block a bunch of the patterns that this is designed to hit. So my plan is to go into the patterns for lowering MOVSS and MOVSD and lower them via blends when available. However that's a pretty invasive restructuring so it will need to be a follow-up patch. I have already gone into the patterns to lower MOVSS and MOVSD from memory using MOVLPD, etc. Without that, several of the test cases I already have regress. llvm-svn: 218985	2014-10-03 13:11:13 +00:00
Chandler Carruth	75e182b414	[x86] Teach the new vector shuffle lowering to widen floating point elements as well as integer elements in order to form simpler shuffle patterns. This is the primary reason why we were failing to match some of the 2-and-2 floating point shuffles such as PR21140. Even after fixing this we need to support some extra patterns in the backend in order to match the resulting X86ISD::UNPCKL nodes into the correct instructions. This commit should fix PR21140 and includes more comprehensive testing of insertion patterns in v4 shuffles. Not all of the added tests are beautiful. For example, we don't have clever instructions to insert-via-load in the integer domain. There are also some places where we aren't sufficiently cunning with our use of movq and movd, but that's future work. llvm-svn: 218911	2014-10-02 21:37:14 +00:00
Chandler Carruth	b9d3fa1e65	[x86] Teach the new vector shuffle lowering about VBROADCAST and VPBROADCAST. This has the somewhat expected pervasive impact. I don't know why I forgot about this. Everything seems good with lots of significant improvements in the tests. llvm-svn: 218724	2014-10-01 00:41:21 +00:00
Andrea Di Biagio	196e873cdc	[X86][SchedModel] SSE reciprocal square root instruction latencies. The SSE rsqrt instruction (a fast reciprocal square root estimate) was grouped in the same scheduling IIC_SSE_SQRT* class as the accurate (but very slow) SSE sqrt instruction. For code which uses rsqrt (possibly with Newton-Raphson iterations) this poor scheduling was affecting performance. This patch splits off the rsqrt instruction from the sqrt instruction scheduling classes and creates new IIC_SSE_RSQER* classes with latency values based on Agner's table. Differential Revision: http://reviews.llvm.org/D5370 Patch by Simon Pilgrim. llvm-svn: 218517	2014-09-26 12:56:44 +00:00
Robert Khasanov	6d62c0202b	[AVX512] Added load/store from BW/VL subsets to Register2Memory opcode tables. Added lowering tests for these instructions. llvm-svn: 218508	2014-09-26 09:48:50 +00:00
Akira Hatanaka	8cc48bd159	[X86,AVX] Add an isel pattern for X86VBroadcast. This fixes PR21050 and rdar://problem/18434607. llvm-svn: 218431	2014-09-25 00:26:15 +00:00
Chandler Carruth	6d5916a2d7	[x86] Teach the AVX1 path of the new vector shuffle lowering one more trick that I missed. VPERMILPS has a non-immediate memory operand mode that allows it to do asymmetric shuffles in the two 128-bit lanes. Use this rather than two shuffles and a blend. However, it turns out the variable shuffle path to VPERMILPS (and VPERMILPD, although that one offers no functional difference from the immediate operand other than variability) wasn't even plumbed through codegen. Do such plumbing so that we can reasonably emit a variable-masked VPERMILP instruction. Also plumb basic comment parsing and printing through so that the tests are reasonable. There are still a few tests which don't show the shuffle pattern. These are tests with undef lanes. I'll teach the shuffle decoding and printing to handle undef mask entries in a follow-up. I've looked at the masks and they seem reasonable. llvm-svn: 218300	2014-09-23 10:08:29 +00:00
Chandler Carruth	ed5dfff865	[x86] Rename X86ISD::VPERMILP to X86ISD::VPERMILPI (and the same for the td pattern). Currently we only model the immediate operand variation of VPERMILPS and VPERMILPD, we should make that clear in the pseudos used. Will be adding support for the variable mask variant in my next commit. llvm-svn: 218282	2014-09-22 22:29:42 +00:00
Sanjay Patel	7939d7229d	Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2). We generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors. This patch should preserve all existing behavior with regular optimization levels, but also use splats whenever possible when optimizing for size on any CPU with AVX or AVX2. The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data. Differential Revision: http://reviews.llvm.org/D5347 llvm-svn: 218263	2014-09-22 18:54:01 +00:00
Chandler Carruth	74acb46d26	[x86] Remove the last vestiges of the BLENDI-based ADDSUB pattern matching. This design just fundamentally didn't work because ADDSUB is available prior to any legal lowerings of BLENDI nodes. Instead, we have a dedicated ADDSUB synthetic ISD node which is pattern matched trivially into the instructions. These nodes are then recognized by both the existing and a trivial new lowering combine in the backend. Removing these patterns required adding 2 missing shuffle masks to the DAG combine, without which tests would have failed. Added the masks and a helpful assert as well to catch if anything ever goes wrong here. llvm-svn: 217851	2014-09-16 00:39:08 +00:00
Chandler Carruth	204ad4c613	[x86] Start fixing our emission of ADDSUBPS and ADDSUBPD instructions by introducing a synthetic X86 ISD node representing this generic operation. The relevant patterns for mapping these nodes into the concrete instructions are also added, and a gnarly bit of C++ code in the target-specific DAG combiner is replaced with simple code emitting this primitive. The next step is to generically combine blends of adds and subs into this node so that we can drop the reliance on an SSE4.1 ISD node (BLENDI) when matching an SSE3 feature (ADDSUB). llvm-svn: 217819	2014-09-15 20:09:47 +00:00
Chandler Carruth	373b2b1728	[x86] Fix a pretty horrible bug and inconsistency in the x86 asm parsing (and latent bug in the instruction definitions). This is effectively a revert of r136287 which tried to address a specific and narrow case of immediate operands failing to be accepted by x86 instructions with a pretty heavy hammer: it introduced a new kind of operand that behaved differently. All of that is removed with this commit, but the test cases are both preserved and enhanced. The core problem that r136287 and this commit are trying to handle is that gas accepts both of the following instructions: insertps $192, %xmm0, %xmm1 insertps $-64, %xmm0, %xmm1 These will encode to the same byte sequence, with the immediate occupying an 8-bit entry. The first form was fixed by r136287 but that broke the prior handling of the second form! =[ Ironically, we would still emit the second form in some cases and then be unable to re-assemble the output. The reason why the first instruction failed to be handled is because prior to r136287 the operands were marked 'i32i8imm' which forces them to be sign-extendable. Clearly, that won't work for 192 in a single byte. However, making them zero-extended or "unsigned" doesn't really address the core issue either because it breaks negative immediates. The correct fix is to make these operands 'i8imm' reflecting that they can be either signed or unsigned but must be 8-bit immediates. This patch backs out r136287 and then changes those places as well as some others to use 'i8imm' rather than one of the extended variants. Naturally, this broke something else. The custom DAG nodes had to be updated to have a much more accurate type constraint of an i8 node, and a bunch of Pat immediates needed to be specified as i8 values. The fallout didn't end there though. We also then ceased to be able to match the instruction-specific intrinsics to the instructions so modified. Digging, this is because they too used i32 rather than i8 in their signature. So I've also switched those intrinsics to i8 arguments in line with the instructions. In order to make the intrinsic adjustments of course, I also had to add auto upgrading for the intrinsics. I suspect that the intrinsic argument types may have led everything down this rabbit hole. Pretty happy with the result. llvm-svn: 217310	2014-09-06 10:00:01 +00:00
Robert Khasanov	ed8829703f	[SKX] Extended non-temporal load/store instructions for AVX512VL subsets. Added avx512_movnt_vl multiclass for handling 256/128-bit forms of instruction. Added encoding and lowering tests. Reviewed by Elena Demikhovsky <elena.demikhovsky@intel.com> llvm-svn: 215536	2014-08-13 10:46:00 +00:00
Quentin Colombet	0233d49574	[X86][SchedModel] Fixed missing/wrong scheduling model found by code inspection. Source: Agner Fog's Instruction tables. Related to <rdar://problem/15607571> llvm-svn: 215045	2014-08-07 00:20:44 +00:00
Quentin Colombet	33ea1681ce	[X86][SchedModel] Fixed some wrong scheduling model found by code inspection. Source: Agner Fog's Instruction tables. Related to <rdar://problem/15607571> llvm-svn: 214940	2014-08-06 00:22:39 +00:00
Tim Northover	fd7e424935	CodeGen: extend f16 conversions to permit types > float. This makes the two intrinsics @llvm.convert.from.f16 and @llvm.convert.to.f16 accept types other than simple "float". This is only strictly needed for the truncate operation, since otherwise double rounding occurs and there's no way to represent the strict IEEE conversion. However, for symmetry we allow larger types in the extend too. During legalization, we can expand an "fp16_to_double" operation into two extends for convenience, but abort when the truncate isn't legal. A new libcall is probably needed here. Even after this commit, various target tweaks are needed to actually use the extended intrinsics. I've put these into separate commits for clarity, so there are no actual tests of f64 conversion here. llvm-svn: 213248	2014-07-17 10:51:23 +00:00
Andrea Di Biagio	a37a2fc81f	[X86] Add ISel patterns to select 'f32_to_f16' and 'f16_to_f32' dag nodes. This patch adds tablegen patterns to select F16C float-to-half-float conversion instructions from 'f32_to_f16' and 'f16_to_f32' dag nodes. If the target doesn't have F16C, then 'f32_to_f16' and 'f16_to_f32' are expanded into library calls. llvm-svn: 212293	2014-07-03 21:51:06 +00:00
Andrea Di Biagio	07cdffc324	[X86] Always prefer to lower a VECTOR_SHUFFLE into a BLENDI instead of SHUFP (or VPERM2X128). This patch teaches method 'LowerVECTOR_SHUFFLE' to give higher precedence to the check for 'isBlendMask'; the idea is that, when possible, we should firstly check if a shuffle performs a blend, and in case, try to lower it into a BLENDI instead of selecting a SHUFP or (worse) a VPERM2X128. In general: - AVX VBLENDPS/D always have better latency and throughput than VPERM2F128; - BLENDPS/D instructions tend to always have better 'reciprocal throughput' than the equivalent SHUFPS/D; - Both BLENDPS/D and SHUFPS/D are often decoded into the same number of m-ops; however, a m-op obtained from a BLENDPS/D can be scheduled to more than one execution port. This patch: - Moves the check for 'isBlendMask' immediately before the check for 'isSHUFPMask' within method 'LowerVECTOR_SHUFFLE'; - Updates existing tests for sse/avx shuffle/blend instructions to verify that we select (v)blendps/d when possible (instead of (v)shufps/d or vperm2f128). llvm-svn: 211720	2014-06-25 17:41:58 +00:00
Andrea Di Biagio	6d9b9e125d	[X86] Add target combine rule to select ADDSUB instructions from a build_vector This patch teaches the backend how to combine a build_vector that implements an 'addsub' between packed float vectors into a sequence of vector add and vector sub followed by a VSELECT. The new VSELECT is expected to be lowered into a BLENDI. At ISel stage, the sequence 'vector add + vector sub + BLENDI' is pattern-matched against ISel patterns added at r211427 to select 'addsub' instructions. Added three more ISel patterns for ADDSUB. Added test sse3-avx-addsub-2.ll to verify that we correctly emit 'addsub' instructions. llvm-svn: 211679	2014-06-25 10:02:21 +00:00
Andrea Di Biagio	e5015d8aba	[X86] Add ISel patterns to select SSE3/AVX ADDSUB instructions. This patch adds ISel patterns to select SSE3/AVX ADDSUB instructions from a sequence of "vadd + vsub + blend". Example: /// typedef float float4 __attribute__((ext_vector_type(4))); float4 foo(float4 A, float4 B) { float4 X = A - B; float4 Y = A + B; return (float4){X[0], Y[1], X[2], Y[3]}; } /// Before this patch, (with flag -mcpu=corei7) llc produced the following assembly sequence: movaps %xmm0, %xmm2 addps %xmm1, %xmm2 subps %xmm1, %xmm0 blendps $10, %xmm2, %xmm0 With this patch, we now get a single addsubps %xmm1, %xmm0 llvm-svn: 211427	2014-06-21 01:31:15 +00:00
Chandler Carruth	8366cebeb5	[x86] Make the x86 PACKSSWB, PACKSSDW, PACKUSWB, and PACKUSDW instructions available as synthetic SDNodes PACKSS and PACKUS that will select to the correct instruction variants based on the return type. This allows us to use these rather important instructions when lowering vector shuffles. Also moves the relevant instruction definitions to be split out from the fully generic multiclasses to allow them to match these new SDNodes in the same way that the UNPCK instructions do. No functionality should actually be changed here. llvm-svn: 211332	2014-06-20 01:05:28 +00:00
Adam Nemet	35b80eaef1	[X86] Remove AVX1 vbroadcast intrinsics The corresponding CFE patch replaces these intrinsics with vector initializers in avxintrin.h. This patch removes the LLVM intrinsics from the backend. We now stop lowering at X86ISD::VBROADCAST custom node rather than lowering that further to the intrinsics. The patch only changes VBROADCASTS* and leaves VBROADCAST[FI]128 to continue to use intrinsics. As explained in the CFE patch, the reason is that we currently don't generate as good code for them without the intrinsics. CodeGen/X86/avx-vbroadcast.ll already provides coverage for this change. It checks that for a series of insertelements we generate the appropriate vbroadcast instruction. Also verified that there was no assembly change in the test-suite before and after this patch. llvm-svn: 209864	2014-05-29 23:35:36 +00:00
Filipe Cabecinhas	dc92102766	Added more insertps optimizations Summary: When inserting an element that's coming from a vector load or a broadcast of a vector (or scalar) load, combine the load into the insertps instruction. Added PerformINSERTPSCombine for the case where we need to fix the load (load of a vector + insertps with a non-zero CountS). Added patterns for the broadcasts. Also added tests for SSE4.1, AVX, and AVX2. Reviewers: delena, nadav, craig.topper Subscribers: llvm-commits Differential Revision: http://reviews.llvm.org/D3581 llvm-svn: 209156	2014-05-19 19:45:57 +00:00
Tim Northover	60091cfeb9	TableGen: use correct MIOperand when printing aliases Previously, TableGen assumed that every aliased operand consumed precisely 1 MachineInstr slot (this was reasonable because until a couple of days ago, nothing more complicated was eligible for printing). This allows a couple more ARM64 aliases to print so we can remove the special code. On the X86 side, I've gone for explicit AT&T size specifiers as the default, so turned off a few of the aliases that would have just started printing. llvm-svn: 208880	2014-05-15 13:36:01 +00:00
Tim Northover	d8d65a69cf	TableGen/ARM64: print aliases even if they have syntax variants. To get at least one use of the change (and some actual tests) in with its commit, I've enabled the AArch64 & ARM64 NEON mov aliases. llvm-svn: 208867	2014-05-15 11:16:32 +00:00
Benjamin Kramer	6d2dff61f9	X86: Lower SMUL_LOHI of v4i32 to pmuldq when SSE4.1 is available. llvm-svn: 207318	2014-04-26 14:12:19 +00:00
Benjamin Kramer	c9827ab103	X86: Add patterns for MULHU/MULHS of v8i16 and v16i16. This gets us pretty code for divs of i16 vectors. Turn the existing intrinsics into the corresponding nodes. llvm-svn: 207317	2014-04-26 13:01:03 +00:00
Quentin Colombet	04f7b74c39	[X86] Fix missing/wrong scheduling model found by code inspection. llvm-svn: 207014	2014-04-23 19:30:26 +00:00
Filipe Cabecinhas	20352216fb	Rename X86insrtps to the proper instruction name. Summary: The INSERTPS pattern fragment was called insrtps (missing 'e'), which would make it harder to grep for the patterns related to this instruction. Renaming it to use the proper instruction name. Reviewers: nadav CC: llvm-commits Differential Revision: http://reviews.llvm.org/D3443 llvm-svn: 206779	2014-04-21 20:07:29 +00:00
Benjamin Kramer	e6c821ef4c	X86: Pattern match scalar loads + vcvtph2ps into just vcvtph2ps. vcvtph2ps only reads the lower 64 bits of the address passed to the intrinsic. llvm-svn: 206579	2014-04-18 10:45:33 +00:00
Jim Grosbach	e4fef71981	Add support for load folding of avx1 logical instructions AVX supports logical operations using an operand from memory. Unfortunately because integer operations were not added until AVX2 the AVX1 logical operation's types were preventing the isel from folding the loads. In a limited number of cases the peephole optimizer would fold the loads, but most were missed. This patch adds explicit patterns with appropriate casts in order for these loads to be folded. The included test cases run on reduced examples and disable the peephole optimizer to ensure the folds are being pattern matched. Patch by Louis Gerbarg <lgg@apple.com> rdar://16355124 llvm-svn: 205938	2014-04-09 23:39:25 +00:00
Quentin Colombet	9c816f39ad	Revert r205599, the commit was not intended to have so many changes llvm-svn: 205600	2014-04-04 02:02:49 +00:00
Quentin Colombet	7ee4e79dec	[RegAllocGreedy][Last Chance Recoloring] Emit diagnostics when last chance recoloring cut-offs are hit. This is related to PR18747. Patch by MAYUR PANDEY <mayur.p@samsung.com> llvm-svn: 205599	2014-04-04 01:58:57 +00:00
Cameron McInally	45dc489403	Fix AVX2 Gather execution domains. llvm-svn: 204713	2014-03-25 12:36:38 +00:00
Quentin Colombet	2d5c156b96	[X86][ISelDAG] Add missing fallback patterns for avx2 broadcast instructions. Those patterns are used when the load cannot be folded into the related broadcast during the select phase. This happens when the load gets additional uses that were not anticipated during the previous lowering phases (constant vector to constant load, then constant load reused) or when selection DAG is not able to prove that folding the load will not create a cycle in the DAG. <rdar://problem/16074331> llvm-svn: 204631	2014-03-24 17:54:19 +00:00
Quentin Colombet	ca49851833	[X86][SchedModel] Add missing scheduling model for SSE related instructions. The patch defines new or refines existing generic scheduling classes to match the behavior of the SSE instructions. It also maps those scheduling classes on the related SSE instructions. <rdar://problem/15607571> llvm-svn: 202065	2014-02-24 19:33:51 +00:00
Craig Topper	e2347df24d	[x86] Switch PAUSE instruction to use XS prefix instead of HasREPPrefix. Remove HasREPPrefix support from disassembler table generator since it's now only used by CodeGenOnly instructions. llvm-svn: 201767	2014-02-20 07:59:43 +00:00
Craig Topper	6872fd3ad9	Add a bunch of OpSize32 tags to 64-bit mode only instructions to match their 32-bit mode counterparts for cases where there is also a OpSize16 instruction. llvm-svn: 201550	2014-02-18 08:18:29 +00:00
Craig Topper	5ccb61781f	Add an x86 prefix encoding for instructions that would decode to a different instruction if 0xf2/f3/66 were in front of them, but don't themselves have a prefix. For now this doesn't change any behavior, but plan to use it to fix some bugs in the disassembler. llvm-svn: 201538	2014-02-18 00:21:49 +00:00
Craig Topper	a0869dceea	Recommit r201059 and r201060 with hopefully a fix for its original failure. Original commit messages: Add MRMXr/MRMXm form to X86 for use by instructions which treat the 'reg' field of modrm byte as a don't care value. Will allow for simplification of disassembler code. Simplify a bunch of code by removing the need for the x86 disassembler table builder to know about extended opcodes. The modrm forms are sufficient to convey the information. llvm-svn: 201065	2014-02-10 06:55:41 +00:00
Bob Wilson	ebdae7c2ff	Revert r201059 and r201060. r201059 appears to cause a crash in a bootstrapped build of clang. Craig isn't available to look at it right now, so I'm reverting it while he investigates. llvm-svn: 201064	2014-02-10 05:28:30 +00:00
Craig Topper	0d88de8c56	Add MRMXr/MRMXm form to X86 for use by instructions which treat the 'reg' field of modrm byte as a don't care value. Will allow for simplification of disassembler code. llvm-svn: 201059	2014-02-10 00:50:34 +00:00
Jim Grosbach	e9008de652	X86: Resolve a long standing FIXME and properly isel pextr[bw]. Generalize the AArch64 .td nodes for AssertZext and AssertSext. Use them to match the relevant pextr store instructions. The test widen_load-2.ll requires a slight change because with the stores gone, the remaining instructions are scheduled in a different order. Add test cases for SSE4 and AVX variants. Resolves rdar://13414672. Patch by Adam Nemet <anemet@apple.com>. llvm-svn: 200957	2014-02-07 00:16:33 +00:00
Tim Northover	546b57b011	X86: deduplicate V[SZ]EXT_MOVL and V[SZ]EXT nodes I believe VZEXT_MOVL means "zero all vector elements except the first" (and should have identical input & output types) whereas VZEXT means "zero extend each element of a vector (discarding higher elements if necessary)". For example: (v4i32 (vzext (v16i8 ...))) should zero extend the low 4 bytes of the incoming vector to 32-bits, discarding higher bytes. However, somewhere in the past, these two concepts had become confused, even leading to a nonsensical VSEXT_MOVL. This re-merges the nodes where appropriate (all VSEXT_MOVL -> VSEXT, VZEXT_MOVL -> VZEXT when it's an actual extension). rdar://problem/15981990 llvm-svn: 200918	2014-02-06 09:54:51 +00:00
Craig Topper	fa6298a162	Merge x86 HasOpSizePrefix/HasOpSize16Prefix into a 2-bit OpSize field with 0 meaning no 0x66 prefix in any mode. Rename Opsize16->OpSize32 and OpSize->OpSize16. The classes now refer to their operand size rather than the mode in which they need a 0x66 prefix. Hopefully can merge REX_W into this as OpSize64. llvm-svn: 200626	2014-02-02 09:25:09 +00:00
Craig Topper	348cbdacda	Remove duplicate patterns llvm-svn: 200461	2014-01-30 07:19:10 +00:00


1132 Commits