LTO objects might be compiled with different `mbranch-protection` flags,
which currently causes an error at link time. Such a setup is allowed in
a normal (non-LTO) build, so with this change it becomes possible under
LTO as well.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D123493
Some MVE instructions have qr variants that take a Q and R register,
splatting the R register for each lane. This is usually handled fine for
standard splats as we sink the splat into the loop and combine the
resulting dup into the qr instruction. It does not work for constant
splats though, as we generate a vmovimm or constant pool load instead.
This intercepts that, generating a vdup of the constant instead where we
can turn the result into a qr instruction variant.
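For example (a hand-written sketch, not taken from the patch's tests), a
loop body adding the constant 3 to each lane can now end up as:
movs     r1, #3
vadd.i32 q0, q0, r1  // qr variant: the scalar in r1 is splatted per lane
rather than a vmov.i32/constant pool load feeding a vector-vector vadd.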
Differential Revision: https://reviews.llvm.org/D115242
Previously we were using UADDO to generate a two-result value with
the unsigned addition and the overflow mask. We then combined the
overflow mask with the trip count comparison to get a result.
However, we don't need to do this - we can simply use a UADDSAT
saturating add node to add the vector index splat and the stepvector
together. Then we can just compare this to a splat of the trip count.
This results in overall better code quality for both Thumb2 and AArch64.
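As a rough IR-level sketch of the new form (value names are illustrative):
%vec = call <4 x i32> @llvm.uadd.sat.v4i32(<4 x i32> %splat.index, <4 x i32> %stepvector)
%mask = icmp ult <4 x i32> %vec, %splat.tripcount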
Differential Revision: https://reviews.llvm.org/D115354
There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing, and thus, depending on which you analyzed, you could get different results. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.
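For illustration (hand-written IR), the two equivalent forms are:
%rem = urem i64 %x, %y
%sub = sub i64 %x, %rem          ; the X - urem X, Y form
%div = udiv i64 %x, %y
%mul = mul i64 %div, %y          ; the (X /u Y) * Y form we now canonicalize to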
The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert to the andn form, which SCEV recognizes, so essentially, this is just fixing a nasty pass ordering dependency.
The ARM hardware loop interaction in the test diff is opaque to me, but comments in the review from others with knowledge of the infrastructure appear to indicate these are improvements in loop recognition, not regressions.
Differential Revision: https://reviews.llvm.org/D114018
Currently when tail predicating loops, vpt blocks need to be created
with the vctp predicate in case we need to revert to non-tail predicated
form. This has the unfortunate side effect of severely hampering post-ra
scheduling at times as the instructions are already stuck in vpt blocks,
not allowed to be independently ordered.
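For reference, a (simplified, hand-written) VPT block of this kind looks
something like:
vctp.32    r2          // set the tail predicate from the element count
vpst                   // open a VPT block; the next instruction is predicated
vldrwt.u32 q0, [r0]    // cannot be reordered independently once in the block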
This patch addresses that by just moving the creation of VPT blocks
later in the pipeline, after post-ra scheduling has been performed. This
allows more optimal scheduling post-ra before the vpt blocks are
created, leading to more optimal tail predicated loops.
Differential Revision: https://reviews.llvm.org/D113094
Currently when creating tail predicated loops, we need to validate that
all the live-outs of a loop will be equivalent with and without tail
predication, and if they are not we cannot legally create a
tail-predicated loop, leaving expensive vctp and vpst instructions in
the loop. These can notably include instructions inserted by register
allocation, like stack loads and stores, and copies lowered from COPYs to
MVE_VORRs.
Instead of trying to prove this is valid late in the pipeline, this
patch introduces an MQPRCopy pseudo instruction that COPY is lowered to.
This can then either be converted to an MVE_VORR where possible, or to a
couple of VMOVD instructions if not. This way they do not behave
differently within and outside of tail-predicated regions, and we can
know by construction that they are always valid. The idea is that we can
do the same with stack loads and stores, converting them to VLDR/VSTR or
VLDM/VSTM where required to prove tail predication is always valid.
This does unfortunately mean inserting multiple VMOVD instructions,
instead of a single MVE_VORR, but my experiments show it to be an
improvement in general.
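A rough sketch of the idea (operand details such as predicates omitted):
$q1 = MQPRCopy $q0
is later rewritten either to a single vector move:
$q1 = MVE_VORR $q0, $q0, ...
or, where that is not possible, to a pair of D-register moves:
$d2 = VMOVD $d0, ...
$d3 = VMOVD $d1, ...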
Differential Revision: https://reviews.llvm.org/D111048
As we have to split blocks, we may be left in an invalid loop state
after a WLS is reverted to a DLS. Instead, remember the WLSs that could
not be fixed and revert them after all other loops have been processed.
Differential Revision: https://reviews.llvm.org/D110567
The ARMLowOverheadLoops pass recalculates VPT block masks when it
converts VCMPs inside VPT blocks into VPTs. The function to do so
doesn't seem to handle debug info though, leading to invalid block
creation or asserts at compile time. Make sure the function skips any
debug info between the MVE instructions it inspects.
Differential Revision: https://reviews.llvm.org/D110564
The semantics of tail predicated loops mean that the value of LR at the
point an instruction is executed determines the predicate. In other words:
mov r3, #3
DLSTP lr, r3 // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.
This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.
This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg, meaning not predicated, and is set to the LR
phi value in the MVETPAndVPTOptimisationsPass. This will prevent LR from
being spilled anywhere that it needs to be used as the predicate.
A lot of tests needed updating.
Differential Revision: https://reviews.llvm.org/D107638
Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that it is not a trivial rematerialization and that we do not want to
extend live ranges.
It appears that the LiveRangeEdit logic does not attempt to extend the
live range of a source register for rematerialization, so that is not an
issue; this is checked in LiveRangeEdit::allUsesAvailableAt().
The only non-trivial aspect of it is accounting for tied defs, which
normally represent a read-modify-write operation and are not
rematerializable.
The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.
The change affects ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm, it seems to reduce spilling
(as expected) or be neutral. However, it needs review by each target's
specialists.
Differential Revision: https://reviews.llvm.org/D106408
This enables subreg liveness in the arm backend when MVE is present,
which allows the register allocator to detect when subregisters are
alive/dead, compared to only acting on full registers. This can help
produce better code on MVE with the way MQPR registers are made up of
SPR registers, but is especially helpful for MQQPR and MQQQQPR
registers, where there are very few "registers" available and being able
to split them up into subregs can help produce much better code.
Differential Revision: https://reviews.llvm.org/D107642
This attempts to make more of RDA aware of potentially overlapping
subregisters. Some of this was already in place, with it iterating
through MCRegUnitIterators. This also replaces calls to
LiveRegs.contains(..) with !LiveRegs.available(..), and updates the
isValidRegUseOf and isValidRegDefOf to search subregs.
Differential Revision: https://reviews.llvm.org/D107351
This patch makes vector spills valid for tail predication when all loads
from the same stack slot are within the loop.
Differential Revision: https://reviews.llvm.org/D105443
This will currently accept the old number of bytes syntax, and convert
it to a scalar. This should be removed in the near future (I think I
converted all of the tests already, but likely missed a few).
Not sure what the exact syntax and policy should be. We can continue
printing the number of bytes for non-generic instructions to avoid
test churn and only allow non-scalar types for generic instructions.
This will currently print the LLT in parentheses, but accept parsing
the existing integers and implicitly converting to scalar. The
parentheses are a bit ugly, but the parser logic seems unable to deal
without either parentheses or some keyword to indicate the start of a
type.
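For example (a hypothetical MIR fragment), a 4-byte load's memory operand
goes from:
%1:_(s32) = G_LOAD %0(p0) :: (load 4)
to:
%1:_(s32) = G_LOAD %0(p0) :: (load (s32))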
This adds t2WhileLoopStartTP, similar to the t2DoLoopStartTP added in
D90591. It keeps a reference to both the tripcount register and the
element count register, so that the ARMLowOverheadLoops pass in the
backend can pick the correct one without having to search for it from
the operand of a VCTP.
Differential Revision: https://reviews.llvm.org/D103236
Surprisingly, not all instructions are always simplified after unrolling
and before MVE gather/scatter lowering. Notably, dead gather operations
can be left around, which causes the gather/scatter lowering pass to
crash if there are multiple gathers, some of which are dead.
This patch ensures they are simplified before we modify anything, which
can change some of the existing tests, including making them no longer
test what they originally tested. A combination of disabling the
gather/scatter lowering pass and adjusting the tests is used to keep them
testing what they did before.
Differential Revision: https://reviews.llvm.org/D103150
If we cannot otherwise use a VMOVimm/VMOVFPimm/VMVNimm, fall back to
producing a VDUP(const) as opposed to a constant pool load. This will at
least be smaller codesize and can allow the VDUP to be folded into other
instructions.
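For example (a hand-written sketch), for a constant that no vector
immediate form can encode:
movw    r1, #1234
vdup.32 q0, r1       // rather than an adr + vldrw.u32 from a constant pool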
Differential Revision: https://reviews.llvm.org/D103808
If the operand of the WhileLoopStart is flagged as killed, that
currently gets propagated both to the t2CMPri created as the instruction
is reverted and to the newly created t2DoLoopStart. Only the second
should remain as killing the operand, with the first dropping the flag.
The loop start instructions handled by ARMLowOverheadLoops are:
$lr = t2DoLoopStart $r0
$lr = t2DoLoopStartTP $r1, $r0
$lr = t2WhileLoopStartLR $r0, %bb, implicit-def dead $cpsr
All three of these have LR as operand 0 and the trip count as operand 1.
This patch updates a few places in ARMLowOverheadLoops where operand 0 of
t2WhileLoopStartLR instructions was being used as the trip count.
One place was removed entirely as it no longer seems valid; the case the
code was trying to protect against should not be able to occur with our
correct-by-construction low overhead loops.
Differential Revision: https://reviews.llvm.org/D102620
We were previously only searching a single preheader for call
instructions when reverting WhileLoopStarts to DoLoopStarts. This
extends that to multiple blocks that can come up when, for example, a
loop is expanded from a memcpy. It also extends the set of instructions
searched for from just calls to also include other LoopStarts, to catch
other low overhead loops in the preheader.
Differential Revision: https://reviews.llvm.org/D102269
We currently prefer t2CMPrs over t2CMPri when the node contains a shift.
This can introduce more nodes if the shift has multiple uses though, as
the value from the shift will be needed anyway, and a t2CMPri compared
with zero can more readily be removed entirely.
Differential Revision: https://reviews.llvm.org/D101688
This patch converts the llvm.memset intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
The llvm.memset is converted to a TP loop for both
constant and non-constant input sizes (of llvm.memset).
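A rough sketch of the resulting loop (hand-written; register choices are
illustrative):
vdup.8   q0, r1            // splat the memset value
wlstp.8  lr, r2, .Lend     // tail-predicated while loop, r2 = remaining bytes
.Lloop:
vstrb.8  q0, [r0], #16
letp     lr, .Lloop
.Lend: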
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100435
This patch converts the llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM-specific SDAG node (to which the llvm.memcpy intrinsic is lowered during the first phase of ISel),
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter, on matching the above node, and
- adds a custom inserter function that expands the pseudo instruction into MIR suitable to be turned (by later passes) into a WLSTP loop.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
This patch converts the llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM-specific SDAG node (to which the llvm.memcpy intrinsic is lowered during the first phase of ISel),
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter, on matching the above node, and
- adds a custom inserter function that expands the pseudo instruction into MIR suitable to be turned (by later passes) into a WLSTP loop.
Note: a CLI option is used to control the conversion of memcpy to a TP
loop, and this option is currently disabled by default. It may be enabled
in the future after further downstream testing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
When passingValueIsAlwaysUndefined scans for an instruction between an
inst with a null or undef argument and its first use, it was checking
for instructions that may have side effects, which is a superset of the
instructions it intended to find (as per the comments, control flow
changing instructions that would prevent reaching the uses). Switch
to using isGuaranteedToTransferExecutionToSuccessor() instead.
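For instance (a hand-written sketch), an llvm.assume sitting between the
null-valued phi and its use may have side effects, but it is guaranteed
to transfer execution to its successor, so the scan can now look past it:
%p = phi i8* [ null, %bb0 ], [ %q, %bb1 ]
call void @llvm.assume(i1 %cond)
%v = load i8, i8* %p    ; passing null here is still recognised as undefined behaviour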
Without this change, when enabling -fwhole-program-vtables, which causes
assumes to be inserted by clang, we can get different simplification
decisions. In particular, when building with instrumentation FDO, it can
affect the optimization decisions before FDO matching, leading to some
mismatches.
I had to modify d83507-knowledge-retention-bug.ll since this fix enables
more aggressive optimization of that code such that it no longer tested
the original bug it was meant to test. I removed the undef which still
provokes the original failure (confirmed by temporarily reverting the
fix) and also changed it to just invoke the passes of interest to narrow
the testing.
Similarly I needed to adjust code for UnreachableEliminate.ll to avoid
an undef which was causing the function body to get optimized away with
this fix.
Differential Revision: https://reviews.llvm.org/D101507
This adds a combine for extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction and, thanks to the other VMOVRRD folds
we have added recently, can help reduce the number of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.
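For example (hand-written), extracting lanes 0 and 1 of a v4i32 in q0:
vmov     r0, r1, d0         // a single VMOVRRD moves both lanes
instead of
vmov.32  r0, q0[0]
vmov.32  r1, q0[1]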
This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break as not all DPR regs can be extracted from
directly. Otherwise the machine verifier is unhappy.
Differential Revision: https://reviews.llvm.org/D100244
If a WhileLoopStartLR is reverted due to calls in the preheader, we may
still be able to instead create a DoLoopStart, preserving the low
overhead loop. This adds code for that, only reverting the
WhileLoopStartLR to a Br/Cmp, leaving the rest of the low overhead loop
in place.
Differential Revision: https://reviews.llvm.org/D98413
In function ConvertVPTBlocks(), it is assumed that every instruction
within a vector-predicated block is predicated. This does not hold for
the debug instructions used by LLVM.
Because of this, an assertion failure is reached when an input contains
debug instructions inside VPT blocks. In non-assert builds, an out of
bounds memory access took place.
The present patch properly covers the case of debug instructions.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99075
This adjusts the place that the t2DoLoopStart reg allocation hint is
inserted, adding it in the MVETPAndVPTOptimisationsPass in a similar place
as other tail predicated loop optimizations. This removes the need for
doing so in a custom inserter, and should make the hint more accurate,
only adding it where we expect to create a DLS (not DLSTP or WLS).
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.
Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:
%wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
%wls0 = extractvalue { i32, i1 } %wls, 0
%wls1 = extractvalue { i32, i1 } %wls, 1
br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
%lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
..
%iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
%cmp = icmp ne i32 %iv.next, 0
br i1 %cmp, label %loop, label %loop.exit
The llvm.test.start.loop.iterations intrinsic needs to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), and t2WhileLoopStart consumes that lr value (acting as the Bcc).
These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.
%1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
t2B %bb.1
...
bb.2.loop:
%2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
...
%3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
t2B %bb.3
The t2WhileLoopStartLR can then be treated similarly to the other low
overhead loop pseudos, eventually being lowered to a WLS provided the
branches are within range.
Differential Revision: https://reviews.llvm.org/D97729
These intrinsics, not the icmp+select sequence, are the canonical form
nowadays, so we might as well directly emit them.
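That is, for example for an unsigned minimum, emit (illustrative IR)
%m = call i64 @llvm.umin.i64(i64 %a, i64 %b)
rather than
%c = icmp ult i64 %a, %b
%m = select i1 %c, i64 %a, i64 %b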
This should not cause any regressions, but if it does,
then they would need to be fixed regardless.
Note that this doesn't deal with `SCEVExpander::isHighCostExpansion()`,
but that is a pessimization, not a correctness issue.
Additionally, the non-intrinsic form has issues with undef,
see https://reviews.llvm.org/D88287#2587863
This adds some simple known bits handling for the three CSINC/CSNEG/CSINV
instructions. From the operands' known bits we can compute the common
bits of the first operand and the incremented/negated/inverted second
operand. The first, especially CSINC ZR, ZR, comes up a fair amount in
the tests. The others are rarer, so a unit test for them is added.
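For instance (hand-written AArch64):
csinc w0, wzr, wzr, eq      // result is 0 or 1, so bits [31:1] are known zero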
Differential Revision: https://reviews.llvm.org/D97788
In some rare circumstances we can be using an undef register for a
compare. When folded into a CBZ/CBNZ the undef flags are lost, leading
to machine verifier problems. This propagates the existing flags to the
new instruction.
This removes the existing patterns for inserting two lanes into an
f16/i16 vector register using VINS, instead using a DAG combine to
pattern match the same code sequences. The tablegen patterns were
already on the large side (foreach LANE = [0, 2, 4, 6]) and were not
handling all the cases they could. Moving that to a DAG combine, whilst
not less code, allows us to better control and expand the selection of
VINSs. Additionally this allows us to remove the AddedComplexity on
VCVTT.
The extra trick that this has learned in the process is to move two
adjacent lanes using a single f32 vmov, allowing some extra
inefficiencies to be removed.
Differential Revision: https://reviews.llvm.org/D96876
From what I can tell, a writeback is unpredictable with LR for both
loads and stores. This changes the operand from a gprnopc to an rGPR in
both cases (which I believe is essentially an NFC, due to the tied-def
already being an rGPR).
Differential Revision: https://reviews.llvm.org/D96723
Currently, findIncDecAfter will only look at the next instruction for
post-inc candidates in the load/store optimizer. This extends that to a
search through the current BB, until an instruction that modifies or
uses the increment reg is found. This allows more post-inc load/stores
and ldm/stm's to be created, especially in cases where a schedule might
move instructions further apart.
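For example (a hand-written sketch), with an unrelated instruction
scheduled between the load and the increment:
ldr  r0, [r1]
add  r2, r2, r0             // neither uses nor modifies r1, so keep searching
add  r1, r1, #4
can now be merged into:
ldr  r0, [r1], #4           // post-increment load
add  r2, r2, r0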
We make sure not to look any further for an SP, as that might invalidate
stack slots that are still in use.
Differential Revision: https://reviews.llvm.org/D95881
With t2DoLoopDec we can be left with some extra MOVs in the preheaders
of tail predicated loops. This removes them, in the same way we remove
other dead variables.
Differential Revision: https://reviews.llvm.org/D91857
This patch adds tablegen patterns for pairs of i16/f16 insert/extracts.
If we are inserting into two adjacent vector lanes (0 and 1 for
example), we can use either a vmov;vins or vmovx;vins to insert the pair
together, avoiding a round-trip through GPR registers. These are quite
large patterns with a number of EXTRACT_SUBREG/INSERT_SUBREG/
COPY_TO_REGCLASS nodes, but hopefully, as most of those become copies,
all that will be cleaned up by further optimizations.
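For instance (hand-written; register choices are illustrative), inserting
two f16 values held in the bottom halves of s4 and s5 into lanes 0 and 1:
vmov.f32  s0, s4            // lane 0 <- bottom half of s4
vins.f16  s0, s5            // lane 1 <- bottom half of s5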
The VINS pattern was also adjusted to allow it to represent that it is
inserting into the top half of an existing register.
Differential Revision: https://reviews.llvm.org/D95381
A DLS lr, lr instruction only moves lr to itself. It need not be emitted
on its own, saving an instruction in the loop preheader.
Differential Revision: https://reviews.llvm.org/D78916