Commit Graph

250 Commits

Author SHA1 Message Date
David Green 64421988e3 [ARM] Regenerate LowOverheadLoops mir tests. NFC 2021-02-02 10:28:58 +00:00
Nikita Popov 835104a114 [LSR] Drop potentially invalid nowrap flags when switching to post-inc IV (PR46943)
When LSR converts a branch on the pre-inc IV into a branch on the
post-inc IV, the nowrap flags on the addition may no longer be valid.
Previously, a poison result of the addition might have been ignored,
in which case the program was well defined. After branching on the
post-inc IV, we might be branching on poison, which is undefined behavior.

Fix this by discarding nowrap flags which are not present on the SCEV
expression. Nowrap flags on the SCEV expression are proven by SCEV
to always hold, independently of how the expression will be used.
This is essentially the same fix we applied to IndVars LFTR, which
also performs this kind of pre-inc to post-inc conversion.

I believe a similar problem can also exist for getelementptr inbounds,
but I was not able to come up with a problematic test case. The
inbounds case would have to be addressed in a differently anyway
(as SCEV does not track this property).

Fixes https://bugs.llvm.org/show_bug.cgi?id=46943.

Differential Revision: https://reviews.llvm.org/D95286
2021-01-25 23:13:48 +01:00
David Green e7dc083a41 [ARM] Don't handle low overhead branches in AnalyzeBranch
It turns our that the BranchFolder and IfCvt does not like unanalyzable
branches that fall-through. This means that removing the unconditional
branches from the end of tail predicated instruction can run into
asserts and verifier issues.

This effectively reverts 372eb2bbb6, but
adds handling to t2DoLoopEndDec which are not branches, so can be safely
skipped.
2021-01-18 17:16:07 +00:00
David Green 372eb2bbb6 [ARM] Add low overhead loops terminators to AnalyzeBranch
This treats low overhead loop branches the same as jump tables and
indirect branches in analyzeBranch - they cannot be analyzed but the
direct branches on the end of the block may be removed. This helps
remove the unnecessary branches earlier, which can help produce better
codegen (and change block layout in a number of cases).

Differential Revision: https://reviews.llvm.org/D94392
2021-01-16 18:30:21 +00:00
David Green f5abf0bd48 [ARM] Tail predication with constant loop bounds
The TripCount for a predicated vector loop body will be
ceil(ElementCount/Width). This alters the conversion of an
active.lane.mask to a VCPT intrinsics to match.

Differential Revision: https://reviews.llvm.org/D94608
2021-01-15 18:17:31 +00:00
David Green a0770f9e4e [ARM] Constant tripcount tail predication loop tests. NFC 2021-01-15 18:02:07 +00:00
David Green 1de3e7fd62 [ARM] Improve handling of empty VPT blocks in tail predicated loops
A vpt block that just contains either VPST;VCTP or VPT;VCTP, once the
VCTP is removed will become invalid. This fixed the first by removing
the now empty block and bails out for the second, as we have no simple
way of converting a VPT to a VCMP.

Differential Revision: https://reviews.llvm.org/D92369
2020-12-14 11:17:01 +00:00
David Green 3f571be1c0 [ARM] Make t2DoLoopStartTP a terminator
Although this was something that I was hoping we would not have to do,
this patch makes t2DoLoopStartTP a terminator in order to keep it at the
end of it's block, so not allowing extra MVE instruction between it and
the end. With t2DoLoopStartTP's also starting tail predication regions,
it also marks them as having side effects. The t2DoLoopStart is still
not a terminator, giving it the extra scheduling freedom that can be
helpful, but now that we have a TP version they can be treated
differently.

Differential Revision: https://reviews.llvm.org/D91887
2020-12-11 09:23:57 +00:00
David Green 0447f3508f [ARM][RegAlloc] Add t2LoopEndDec
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.

Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.

There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.

This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.

Differential Revision: https://reviews.llvm.org/D91358
2020-12-10 12:14:23 +00:00
David Green 5abbf20f0f [ARM] Additional test for Min loop. NFC 2020-12-10 10:49:00 +00:00
David Green b0ce615b2d [ARM] Remove copies from low overhead phi inductions.
The phi created in a low overhead loop gets created with a default
register class it seems. There are then copied inserted between the low
overhead loop pseudo instructions (which produce/consume GPRlr
instructions) and the phi holding the induction. This patch removes
those as a step towards attempting to make t2LoopDec and t2LoopEnd a
single instruction, and appears useful in it's own right as shown in the
tests.

Differential Revision: https://reviews.llvm.org/D91267
2020-12-10 10:30:31 +00:00
David Green d9bf6245bf [ARM] Revert low overhead loops with calls before registry allocation.
This adds code to revert low overhead loops with calls in them before
register allocation. Ideally we would not create low overhead loops with
calls in them to begin with, but that can be difficult to always get
correct. If we want to try and glue together t2LoopDec and t2LoopEnd
into a single instruction, we need to ensure that no instructions use LR
in the loop. (Technically the final code can be better too, as it
doesn't need to use the same registers but that has not been optimized
for here, as reverting loops with calls is expected to be very rare).

It also adds a MVETailPredUtils.h header to share the revert code
between different passes, and provides a place to expand upon, with
RevertLoopWithCall becoming a place to perform other low overhead loop
alterations like removing copies or combining LoopDec and End into a
single instruction.

Differential Revision: https://reviews.llvm.org/D91273
2020-12-07 15:44:40 +00:00
Simon Pilgrim 1f2353734d [Thumb2] Regenerate predicated-liveout-unknown-lanes.ll test
Helps to reduce diff in D90113
2020-12-02 18:00:42 +00:00
David Green 0e49a40d75 [ARM] Cleanup for the MVETailPrediction pass
This strips out a lot of the code that should no longer be needed from
the MVETailPredictionPass, leaving the important part - find active lane
mask instructions and convert them to VCTP operations.

Differential Revision: https://reviews.llvm.org/D91866
2020-11-26 15:10:44 +00:00
David Green f08c37da7b [ARM] Disable WLSTP loops
This checks to see if the loop will likely become a tail predicated loop
and disables wls loop generation if so, as the likelihood for reverting
is currently too high. These should be fairly rare situations anyway due
to the way iterations and element counts are used during lowering. Just
not trying can alter how SCEV's are materialized however, leading to
different codegen.

It also adds a option to disable all while low overhead loops, for
debugging.

Differential Revision: https://reviews.llvm.org/D91663
2020-11-20 13:30:44 +00:00
Sam Tebbs 8ecb015ed5 [ARM][LowOverheadLoops] Convert intermediate vpr use assertion to condition
This converts the intermediate VPR use assertion to a condition in the if-statement to protect against assertion failures in case behaviuour is changed.

This is a follow-up to https://reviews.llvm.org/D90935 and implements the post-approval comments.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D91790
2020-11-19 17:15:45 +00:00
David Green 0e4cdfc56a [ARM] Add a WLS tail predication test. NFC 2020-11-19 14:52:46 +00:00
David Green 006b3bdedd [ARM] Deliberately prevent inline asm in low overhead loops. NFC
This was already something that was handled by one of the "else"
branches in maybeLoweredToCall, so this patch is an NFC but makes it
explicit and adds a test. We may in the future want to support this
under certain situations but for the moment just don't try and create
low overhead loops with inline asm in them.

Differential Revision: https://reviews.llvm.org/D91257
2020-11-19 13:28:21 +00:00
Sam Tebbs da2e4728c7 [ARM][LowOverheadLoops] Merge VCMP and VPST across VPT blocks
This patch adds support for combining a VPST with a dangling VCMP from a
previous VPT block.

Differential Revision: https://reviews.llvm.org/D90935
2020-11-18 12:54:16 +00:00
David Green 11dee2eae2 [ARM] Ensure CountReg definition dominates InsertPt when creating t2DoLoopStartTP
Of course there was something missing, in this case a check that the def
of the count register we are adding to a t2DoLoopStartTP would dominate
the insertion point.

In the future, when we remove some of these COPY's in between, the
t2DoLoopStartTP will always become the last instruction in the block,
preventing this from happening. In the meantime we need to check they
are created in a sensible order.

Differential Revision: https://reviews.llvm.org/D91287
2020-11-12 13:47:46 +00:00
David Green 08d1c2d470 [ARM] Introduce t2DoLoopStartTP
This introduces a new pseudo instruction, almost identical to a
t2DoLoopStart but taking 2 parameters - the original loop iteration
count needed for a low overhead loop, plus the VCTP element count needed
for a DLSTP instruction setting up a tail predicated loop. The idea is
that the instruction holds both values and the backend
ARMLowOverheadLoops pass can pick between the two, depending on whether
it creates a tail predicated loop or falls back to a low overhead loop.

To do that there needs to be something that converts a t2DoLoopStart to
a t2DoLoopStartTP, for which this patch repurposes the
MVEVPTOptimisationsPass as a "tail predication and vpt optimisation"
pass. The extra operand for the t2DoLoopStartTP is chosen based on the
operands of VCTP's in the loop, and the instruction is moved as late in
the block as possible to attempt to increase the likelihood of making
tail predicated loops.

Differential Revision: https://reviews.llvm.org/D90591
2020-11-10 18:08:12 +00:00
David Green 73a6cd4b6b [ARM] Add a RegAllocHint for hinting t2DoLoopStart towards LR
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.

The hint is added after others (from COPY's etc) which still take
precedence. It needed to find a place to add the hint, which currently
uses the post isel custom inserter.

Differential Revision: https://reviews.llvm.org/D89883
2020-11-10 16:28:57 +00:00
David Green b2ac9681a7 [ARM] Alter t2DoLoopStart to define lr
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR

This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.

This is a fairly simple change in itself, but leads to a number of other
required alterations.

 - The hardware loop pass, if UsePhi is set, now generates loops of the
   form:
       %start = llvm.start.loop.iterations(%N)
     loop:
       %p = phi [%start], [%dec]
       %dec = llvm.loop.decrement.reg(%p, 1)
       %c = icmp ne %dec, 0
       br %c, loop, exit
 - For this a new llvm.start.loop.iterations intrinsic was added, identical
   to llvm.set.loop.iterations but produces a value as seen above, gluing
   the loop together more through def-use chains.
 - This new instrinsic conceptually produces the same output as input,
   which is taught to SCEV so that the checks in MVETailPredication are not
   affected.
 - Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
   been left mostly as before. We should now more reliably be able to tell
   that the t2DoLoopStart is correct without having to prove it, but
   t2WhileLoopStart and tail-predicated loops will remain the same.
 - And all the tests have been updated. There are a lot of them!

This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.

Differential Revision: https://reviews.llvm.org/D89881
2020-11-10 15:57:58 +00:00
Sam Tebbs 40a3f7e48d [ARM][LowOverheadLoops] Merge a VCMP and the new VPST into a VPT
There were cases where a VCMP and a VPST were merged even if the VCMP
didn't have the same defs of its operands as the VPST. This is fixed by
adding RDA checks for the defs. This however gave rise to cases where
the new VPST created would precede the un-merged VCMP and so would fail
a predicate mask assertion since the VCMP wasn't predicated. This was
solved by converting the VCMP to a VPT instead of inserting the new
VPST.

Differential Revision: https://reviews.llvm.org/D90461
2020-11-09 15:03:48 +00:00
Sanjay Patel c40126e740 [ARM] remove cost-kind predicate for most math op costs
This is based on the same idea that I am using for the basic model implementation
and what I have partly already done for x86: throughput cost is number of
instructions/uops, so size/blended costs are identical except in special cases
(for example, fdiv or other known-expensive machine instructions or things like
MVE that may require cracking into >1 uop)).

Differential Revision: https://reviews.llvm.org/D90692
2020-11-03 17:23:46 -05:00
David Green e474499402 [ARM] Treat memcpy/memset/memmove as call instructions for low overhead loops
If an instruction will be lowered to a call there is no advantage of
using a low overhead loop as the LR register will need to be spilled and
reloaded around the call, and the low overhead will end up being
reverted. This teaches our hardware loop lowering that these memory
intrinsics will be calls under certain situations.

Differential Revision: https://reviews.llvm.org/D90439
2020-11-03 11:53:09 +00:00
David Green 785080e3fa [ARM] Low overhead loop memcpy lowering test. NFC 2020-11-03 11:44:50 +00:00
David Green 6dcbc323fd Revert "[ARM][LowOverheadLoops] Adjust Start insertion."
This reverts commit 38f625d0d1.

This commit contains some holes in its logic and has been causing
issues since it was commited. The idea sounds OK but some cases were not
handled correctly. Instead of trying to fix that up later it is probably
simpler to revert it and work to reimplement it in a more reliable way.
2020-10-20 08:55:21 +01:00
David Green cb27006a94 [ARM] Attempt to make Tail predication / RDA more resilient to empty blocks
There are a number of places in RDA where we assume the block will not
be empty. This isn't necessarily true for tail predicated loops where we
have removed instructions. This attempt to make the pass more resilient
to empty blocks, not casting pointers to machine instructions where they
would be invalid.

The test contains a case that was previously failing, but recently been
hidden on trunk. It contains an empty block to begin with to show a
similar error.

Differential Revision: https://reviews.llvm.org/D88926
2020-10-10 14:50:25 +01:00
Amara Emerson 322d0afd87 [llvm][mlir] Promote the experimental reduction intrinsics to be first class intrinsics.
This change renames the intrinsics to not have "experimental" in the name.

The autoupgrader will handle legacy intrinsics.

Relevant ML thread: http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html

Differential Revision: https://reviews.llvm.org/D88787
2020-10-07 10:36:44 -07:00
Sam Parker 38f625d0d1 [ARM][LowOverheadLoops] Adjust Start insertion.
Try to move the insertion point to become the terminator of the
block, usually the preheader.

Differential Revision: https://reviews.llvm.org/D88638
2020-10-01 10:49:19 +01:00
Sam Parker 6ec5f32497 [ARM][LowOverheadLoops] Iteration count liveness
Before deciding to insert a [W|D]LSTP, check that defining LR with
the element count won't affect any other instructions that should be
taking the iteration count.

Differential Revision: https://reviews.llvm.org/D88549
2020-10-01 10:11:10 +01:00
Sam Parker 7b90516d47 [ARM][LowOverheadLoops] Start insertion point
If possible, try not to move the start position earlier than it
already is.

Differential Revision: https://reviews.llvm.org/D88542
2020-10-01 10:05:25 +01:00
Sam Parker dfa2c14b8f [ARM][LowOverheadLoops] Use iterator for InsertPt.
Use a MachineBasicBlock::iterator instead of a MachineInstr* for the
position of our LoopStart instruction. NFCish, as it change debug
info.
2020-10-01 08:32:35 +01:00
Matt Arsenault 89baeaef2f Reapply "RegAllocFast: Rewrite and improve"
This reverts commit 73a6a164b8.
2020-09-30 10:35:25 -04:00
Sam Parker 3f88c10a6b [RDA] isSafeToDefRegAt: Look at global uses
We weren't looking at global uses of a value, so we could happily
overwrite the register incorrectly.

Differential Revision: https://reviews.llvm.org/D88554
2020-09-30 14:06:45 +01:00
Sam Parker 3cbd01ddb9 [NFC][ARM] Add more LowOverheadLoop tests. 2020-09-30 12:20:07 +01:00
Sam Parker 834b6470d9 [NFC][ARM] Add LowOverheadLoop test 2020-09-30 08:36:48 +01:00
Sam Parker 700f93e92b [RDA] Switch isSafeToMove iterators
So forwards is forwards and backwards is reverse. Also add a check
so that we know the instructions are in the expected order.

Differential Revision: https://reviews.llvm.org/D88419
2020-09-30 08:10:48 +01:00
Sam Parker 195c22f273 [ARM] Change VPT state assertion
Just because we haven't encountered an instruction setting the VPR,
it doesn't mean we can't create a VPT block - the VPR maybe a
live-in.

Differential Revision: https://reviews.llvm.org/D88224
2020-09-30 08:01:10 +01:00
Sjoerd Meijer f39f92c1f6 [ARM][MVE] tail-predication: overflow checks for elementcount, cont'd
This is a reimplementation of the overflow checks for the elementcount,
i.e. the 2nd argument of intrinsic get.active.lane.mask. The element
count is lowered in each iteration of the tail-predicated loop, and
we must prove that this expression doesn't overflow.

Many thanks to Eli Friedman and Sam Parker for all their help with
this work.

Differential Revision: https://reviews.llvm.org/D88086
2020-09-28 09:20:51 +01:00
David Green e4b9867cb6 [ARM] Expand cannotInsertWDLSTPBetween to the last instruction
9d9a11c7be added this check for predicatable instructions between the
D/WLSTP and the loop's start, but it was missing the last instruction in
the block. Change it to use some iterators instead.

Differential Revision: https://reviews.llvm.org/D88354
2020-09-28 09:14:40 +01:00
Sam Parker a399d1880b [ARM] Find VPT implicitly predicated by VCTP
On failing to find a VCTP in the list of instructions that explicitly
predicate the entry of a VPT block, inspect whether the block is
controlled via VPT which is implicitly predicated due to it's
predicated operand(s).

Differential Revision: https://reviews.llvm.org/D87819
2020-09-25 08:50:53 +01:00
Sjoerd Meijer 2fc690ac90 [ARM] LowoverheadLoops: add an option to disable tail-predication
This might be useful for testing. We already have an option -tail-predication
but that controls the MVETailPredication pass.  This
-arm-loloops-disable-tail-pred is just for disabling it in the LowoverheadLoops
pass.

Differential Revision: https://reviews.llvm.org/D88212
2020-09-24 13:30:48 +01:00
Sam Parker 9d9a11c7be [ARM] Check for LSTP side-effects.
If the LSTP instruction is inserted with an element count low enough
to immediately predicate some lanes as false, this can have some
unintended effects on any proceeding MVE instructions in the
preheader.

Differential Revision: https://reviews.llvm.org/D88209
2020-09-24 13:28:35 +01:00
Sam Parker 00c34f72fb [NFC][ARM] Pre-commit tail predication test 2020-09-23 14:29:26 +01:00
Sam Parker b4fa884a73 [ARM] Improve VPT predicate tracking
The VPTBlock has been modified to track the 'global' state of the
VPR, as well as the state for each block. Each object now just holds
a list of instructions that makeup the block, while static structures
hold the predicate information. This enables global access for
querying how both a VPT block and individual instructions are
predicated. These changes now allow us, again, to handle more
complicated cases where multiple instructions build a predicate
and/or where the same predicate in used in multiple blocks.

It doesn't, however, get us back to before the tracking was 'fixed'
as some extra logic will be required to properly handle VPT
instructions. Currently a VPT could be effectively predicated because
of it's inputs, but the existing logic will not detect that and so
will refuse to perform the transformation. This can be seen in
remat-vctp.ll test where we still don't perform the transform.

Differential Revision: https://reviews.llvm.org/D87681
2020-09-22 10:40:27 +01:00
Muhammad Omair Javaid 73a6a164b8 Revert "Reapply Revert "RegAllocFast: Rewrite and improve""
This reverts commit 55f9f87da2.

Breaks following buildbots:
http://lab.llvm.org:8011/builders/lldb-arm-ubuntu/builds/4306
http://lab.llvm.org:8011/builders/lldb-aarch64-ubuntu/builds/9154
2020-09-22 14:40:06 +05:00
Sam Parker a0c1dcc318 [ARM] Remove MVEDomain from VLDR/STR of P0
Remove the domain from the instructions and create a shouldInspect
helper for LowOverheadLoops which queries it or a vpr operand.

Differential Revision: https://reviews.llvm.org/D87900
2020-09-22 09:05:50 +01:00
Matt Arsenault 55f9f87da2 Reapply Revert "RegAllocFast: Rewrite and improve"
This reverts commit dbd53a1f0c.

Needed lldb test updates
2020-09-21 15:45:27 -04:00
Sam Parker 13c73632c7 [NFC][ARM] More tail predication tests.
Add mir tests for use/def of P0.
2020-09-21 10:58:05 +01:00
Eric Christopher dbd53a1f0c Temporarily Revert "RegAllocFast: Rewrite and improve"
as it's breaking a few tests in the lldb test suite.

Bot: http://lab.llvm.org:8011/builders/lldb-arm-ubuntu/builds/4226/steps/test/logs/stdio

This reverts commit c8757ff3aa.
2020-09-18 18:11:21 -07:00
Matt Arsenault c8757ff3aa RegAllocFast: Rewrite and improve
This rewrites big parts of the fast register allocator. The basic
strategy of doing block-local allocation hasn't changed but I tweaked
several details:

Track register state on register units instead of physical
registers. This simplifies and speeds up handling of register aliases.
Process basic blocks in reverse order: Definitions are known to end
register livetimes when walking backwards (contrary when walking
forward then uses may or may not be a kill so we need heuristics).

Check register mask operands (calls) instead of conservatively
assuming everything is clobbered.  Enhance heuristics to detect
killing uses: In case of a small number of defs/uses check if they are
all in the same basic block and if so the last one is a killing use.
Enhance heuristic for copy-coalescing through hinting: We check the
first k defs of a register for COPYs rather than relying on there just
being a single definition.  When testing this on the full llvm
test-suite including SPEC externals I measured:

average 5.1% reduction in code size for X86, 4.9% reduction in code on
aarch64. (ranging between 0% and 20% depending on the test) 0.5%
faster compiletime (some analysis suggests the pass is slightly slower
than before, but we more than make up for it because later passes are
faster with the reduced instruction count)

Also adds a few testcases that were broken without this patch, in
particular bug 47278.

Patch mostly by Matthias Braun
2020-09-18 14:05:18 -04:00
David Green 34b27b9441 [ARM] Sink splats to MVE intrinsics
The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y). p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.

Differential Revision: https://reviews.llvm.org/D87693
2020-09-17 16:00:51 +01:00
Sam Parker 1c421046d7 [RDA] Fix getUniqueReachingDef for self loops
We've fixed the case where this could return an instruction after the
given instruction, but also means that we can falsely return a
'unique' def when they could be one coming from the backedge of a
loop.

Differential Revision: https://reviews.llvm.org/D87751
2020-09-16 12:44:23 +01:00
Sam Parker a63b2a4614 [ARM] Fix tail predication predicate tracking
Clear the CurrentPredicate when we find an instruction which would
completely overwrite the VPR. This fix essentially means we're back
to not really being able to handle VPT instructions when tail
predicating.

Differential Revision: https://reviews.llvm.org/D87610
2020-09-16 11:59:29 +01:00
Sam Tebbs ac2717bfdd [ARM][LowOverheadLoops] Fix tests after ef0b9f3
ef0b9f3 didn't update the tests that it affected.
2020-09-16 11:01:21 +01:00
Sjoerd Meijer cb1ef0eaff Follow up rG635b87511ec3: forgot to add/commit the new test file. NFC. 2020-09-16 09:38:37 +01:00
Sam Tebbs ef0b9f3307 [ARM][LowOverheadLoops] Combine a VCMP and VPST into a VPT
This patch combines a VCMP followed by a VPST into a VPT, which has the
same semantics as the combination of the former two.
2020-09-16 09:27:10 +01:00
Sjoerd Meijer 487412988c [MVE] Rename of tests making them consistent with tail-predication tests. NFC. 2020-09-15 09:24:22 +01:00
Sjoerd Meijer 676febc044 [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value
This adds additional checks for the original scalar loop tripcount value, i.e.
get.active.lane.mask second argument, and perform several sanity checks to see
if it is of the form that we expect similarly like we already do for the IV
which is the first argument of get.active.lane.

Differential Revision: https://reviews.llvm.org/D86074
2020-09-14 11:32:15 +01:00
Sam Tebbs b81c57d646 [ARM][LowOverheadLoops] Allow tail predication on predicated instructions with unknown lane
values

The effects of unpredicated vector instruction with unknown
lanes cannot be predicted and therefore cannot be tail predicated. This
does not apply to predicated vector instructions and so this patch
allows tail predication on them.

Differential Revision: https://reviews.llvm.org/D87376
2020-09-10 10:34:32 +01:00
Sam Parker 1919b65052 [ARM] Tail predicate VQDMULH and VQRDMULH
Mark the family of instructions as valid for tail predication.

Differential Revision: https://reviews.llvm.org/D87348
2020-09-10 08:20:07 +01:00
Sjoerd Meijer 8cb8cea1bd [ARM] Fixup of a few test cases. NFC.
After changing the semantics of get.active.lane.mask, I missed a few tests
that should use now the tripcount instead of the backedge taken count.
2020-09-09 11:14:44 +01:00
Sam Parker 3ebc755227 [ARM] Try to rematerialize VCTP instructions
We really want to try and avoid spilling P0, which can be difficult
since there's only one register, so try to rematerialize any VCTP
instructions.

Differential Revision: https://reviews.llvm.org/D87280
2020-09-09 07:41:22 +01:00
Sam Parker 94cfbef0a7 [NFC][ARM] Precommit test 2020-09-08 14:44:28 +01:00
Sam Tebbs 7aabb6ad77 [ARM][LowOverheadLoops] Remove modifications to the correct element
count register

After my patch at D86087, code that now uses the mov operand rather than
the vctp operand will no longer remove modifications to the vctp operand
as they should. This patch fixes that by explicitly removing
modifications to the vctp operand rather than the register used as the
element count.
2020-09-08 10:30:05 +01:00
Sam Parker b30adfb529 [ARM][LowOverheadLoops] Liveouts and reductions
Remove the code that tried to look for reduction patterns, since the
vectorizer and isel can now produce predicated arithmetic instructios
within the loop body. This has required some reorganisation and fixes
around live-out and predication checks, as well as looking for cases
where an input/output is initialised to zero.

Differential Revision: https://reviews.llvm.org/D86613
2020-08-28 13:56:16 +01:00
Sam Parker 3c8be94f3d [NFC][ARM] Add tail predication test 2020-08-28 13:46:10 +01:00
Sam Parker a3e41d4581 [ARM] Make MachineVerifier more strict about terminators
Fix the ARM backend's analyzeBranch so it doesn't ignore predicated
return instructions, and make the MachineVerifier rule more strict.

Differential Revision: https://reviews.llvm.org/D40061
2020-08-27 07:10:20 +01:00
David Green 5b7a889a67 [ARM] Additional test for tailpred reductions. NFC 2020-08-25 18:29:15 +01:00
Sjoerd Meijer 39522b1e10 [SelectionDAG] Legalize intrinsic get.active.lane.mask
This adapts legalization of intrinsic get.active.lane.mask to the new semantics
as described in D86147. Because the second argument is now the loop tripcount,
we legalize this intrinsic to an 'icmp ULT' instead of an ULE when it was the
backedge-taken count.

Differential Revision: https://reviews.llvm.org/D86302
2020-08-25 15:00:10 +01:00
Sjoerd Meijer c352e7fbda [ARM][MVE] Tail-predication: remove the BTC + 1 overflow checks
This adapts tail-predication to the new semantics of get.active.lane.mask as
defined in D86147. This means that:
- we can remove the BTC + 1 overflow checks because now the loop tripcount is
  passed in to the intrinsic,
- we can immediately use that value to setup a counter for the number of
  elements processed by the loop and don't need to materialize BTC + 1.

Differential Revision: https://reviews.llvm.org/D86303
2020-08-25 14:38:03 +01:00
David Green 41495dd57a [ARM] Change target triple to arm-none-none-eabi. NFC 2020-08-19 11:58:50 +01:00
David Green 3471520b1f [ARM] Allow tail predication of VLDn
VLD2/4 instructions cannot be predicated, so we cannot tail predicate
them from autovec. From intrinsics though, they should be valid as they
will just end up loading extra values into off vector lanes, not
effecting the on lanes. The same is true for loads in general where so
long as we are not using the other vector lanes, an unpredicated load
can be converted to a predicated one.

This marks VLD2 and VLD4 instructions as validForTailPredication and
allows any unpredicated load in tail predication loop, which seems to be
valid given the other checks we have.

Differential Revision: https://reviews.llvm.org/D86022
2020-08-18 17:15:45 +01:00
Sam Tebbs 31f02ac60a [ARM] Use mov operand if the mov cannot be moved while tail predicating
There are some cases where the instruction that sets up the iteration
count for a tail predicated loop cannot be moved before the dlstp,
stopping tail predication entirely. This patch checks if the mov operand
can be used and if so, uses that instead.

Differential Revision: https://reviews.llvm.org/D86087
2020-08-18 17:10:29 +01:00
Dávid Bolvanský 0f14b2e6cb Revert "[BPI] Improve static heuristics for integer comparisons"
This reverts commit 50c743fa71. Patch will be split to smaller ones.
2020-08-17 20:44:33 +02:00
David Green 5f45f91de4 [ARM] Tests for tail predicated loads. NFC 2020-08-16 19:46:37 +01:00
Dávid Bolvanský 50c743fa71 [BPI] Improve static heuristics for integer comparisons
Similarly as for pointers, even for integers a == b is usually false.

GCC also uses this heuristic.

Reviewed By: ebrevnov

Differential Revision: https://reviews.llvm.org/D85781
2020-08-13 19:54:27 +02:00
Dávid Bolvanský f9264995a6 Revert "[BPI] Improve static heuristics for integer comparisons"
This reverts commit 44587e2f7e. Sanitizer tests need to be updated.
2020-08-13 14:37:40 +02:00
Dávid Bolvanský 44587e2f7e [BPI] Improve static heuristics for integer comparisons
Similarly as for pointers, even for integers a == b is usually false.

GCC also uses this heuristic.

Reviewed By: ebrevnov

Differential Revision: https://reviews.llvm.org/D85781
2020-08-13 14:23:58 +02:00
Dávid Bolvanský a0485421d2 Revert "[BPI] Improve static heuristics for integer comparisons"
This reverts commit 385c9d673f.
2020-08-13 12:59:15 +02:00
Dávid Bolvanský 385c9d673f [BPI] Improve static heuristics for integer comparisons
Similarly as for pointers, even for integers a == b is usually false.

GCC also uses this heuristic.

Reviewed By: ebrevnov

Differential Revision: https://reviews.llvm.org/D85781
2020-08-13 12:45:40 +02:00
Sjoerd Meijer 6716e7868e [ARM][MVE] tail-predication: overflow checks for backedge taken count.
This pick ups the work on the overflow checks for get.active.lane.mask,
which ensure that it is safe to insert the VCTP intrinisc that enables
tail-predication. For a 2d auto-correlation kernel and its inner loop j:

  M = Size - i;
  for (j = 0; j < M; j++)
    Sum += Input[j] * Input[j+i];

For this inner loop, the SCEV backedge taken count (BTC) expression is:

  (-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>

and LoopUtil cannotBeMaxInLoop couldn't calculate a bound on this, thus "BTC
cannot be max" could not be determined. So overflow behaviour had to be assumed
in the loop tripcount expression that uses the BTC. As a result
tail-predication had to be forced (with an option) for this case.

This change solves that by using ScalarEvolution's helper
getConstantMaxBackedgeTakenCount which is able to determine the range of BTC,
thus can determine it is safe, so that we no longer need to force tail-predication
as reflected in the changed test cases.

Differential Revision: https://reviews.llvm.org/D85737
2020-08-12 09:32:26 +01:00
Sjoerd Meijer a680c797b9 [ARM][MVE] Added extra tail-predication runs for auto-correlation test case. NFC 2020-08-11 14:33:41 +01:00
David Green 22916481c1 [ARM] Convert VPSEL to VMOV in tail predicated loops
VPSEL has slightly different semantics under tail predication (it can
end up selecting from Qn, Qm and Qd). We do not model that at the moment
so they block tail predicated loops from being formed.

This just converts them into a predicated VMOV instead (via a VORR),
allowing tail predication to happen whilst still modelling the original
behaviour of the input.

Differential Revision: https://reviews.llvm.org/D85110
2020-08-03 22:03:14 +01:00
David Green fd69df62ed [ARM] Distribute post-inc for Thumb2 sign/zero extending loads/stores
This adds sign/zero extending scalar loads/stores to the MVE
instructions added in D77813, allowing us to create up more post-inc
instructions. These are comparatively simple, compared to LDR/STR (which
may be better turned into an LDRD/LDM), but still require some additions
over MVE instructions. Because there are i12 and i8 variants of the
offset loads/stores dealing with different signs, we may need to convert
an i12 address to a i8 negative instruction. t2LDRBi12 can also be
shrunk to a tLDRi under the right conditions, so we need to be careful
with codesize too.

Differential Revision: https://reviews.llvm.org/D78625
2020-08-01 14:01:18 +01:00
David Green 3533e0a08d [ARM] Add patterns for select(p, BinOp(x, y), z) -> BinOpT(x, y,p z)
Most MVE instructions can be predicated to fold a select into the
instruction, using the predicate and the selects else as a passthough.
This adds tablegen patterns for most two operand instructions using the
newly added TwoOpPattern from 1030e82598.

Differential Revision: https://reviews.llvm.org/D83222
2020-07-22 13:24:01 +01:00
Sam Tebbs 6c348e4067 [HWLoops] Stop converting to a while loop when it would be unsafe to
There were cases where a do-while loop would be converted to a while
loop before finding out that it would be unsafe to expand the SCEV in
this situation and then bailing out of hardware loop conversion.

This patch checks if it would be unsafe to expand the SCEV and if so stops converting the do-while into a while, allowing conversion to a hardware loop.

Differential Revision: https://reviews.llvm.org/D83953
2020-07-17 11:47:08 +01:00
Sjoerd Meijer 595270ae39 [ARM][MVE] Refactor option -disable-mve-tail-predication
This refactors option -disable-mve-tail-predication to take different arguments
so that we have 1 option to control tail-predication rather than several
different ones.

This is also a prep step for D82953, in which we want to reject reductions
unless that is requested with this option.

Differential Revision: https://reviews.llvm.org/D83133
2020-07-13 13:40:33 +01:00
David Green 74ca67c109 [ARM] Remove hasSideEffects from FP converts
Whether an instruction is deemed to have side effects in determined by
whether it has a tblgen pattern that emits a single instruction.
Because of the way a lot of the the vcvt instructions are specified
either in dagtodag code or with patterns that emit multiple
instructions, they don't get marked as not having side effects.

This just marks them as not having side effects manually. It can help
especially with instruction scheduling, to not create artificial
barriers, but one of these tests also managed to produce fewer
instructions.

Differential Revision: https://reviews.llvm.org/D81639
2020-07-05 16:23:24 +01:00
David Green 9e03547cab [ARM][HWLoops] Create hardware loops for sibling loops
Given a loop with two subloops, it should be possible for both to be
converted to hardware loops. That's what this patch does, simply enough.
It slightly alters the loop iterating order to try and convert all
subloops. If one (or more) succeeds, it stops as before.

Differential Revision: https://reviews.llvm.org/D78502
2020-07-03 17:20:02 +01:00
Sam Parker f0ecfb789b [NFC][ARM] Add test. 2020-07-01 09:28:56 +01:00
Sam Parker 3ee580d017 [ARM][LowOverheadLoops] Handle reductions
While validating live-out values, record instructions that look like
a reduction. This will comprise of a vector op (for now only vadd),
a vorr (vmov) which store the previous value of vadd and then a vpsel
in the exit block which is predicated upon a vctp. This vctp will
combine the last two iterations using the vmov and vadd into a vector
which can then be consumed by a vaddv.

Once we have determined that it's safe to perform tail-predication,
we need to change this sequence of instructions so that the
predication doesn't produce incorrect code. This involves changing
the register allocation of the vadd so it updates itself and the
predication on the final iteration will not update the falsely
predicated lanes. This mimics what the vmov, vctp and vpsel do and
so we then don't need any of those instructions.

Differential Revision: https://reviews.llvm.org/D75533
2020-07-01 08:31:49 +01:00
Samuel Tebbs 3324e3a6ee [ARM] Allow the fabs intrinsic to be tail predicated
This patch stops the fabs intrinsic from blocking tail predication.

Differential Revision: https://reviews.llvm.org/D82570
2020-06-30 17:27:28 +01:00
Samuel Tebbs 66fa313999 [ARM] Allow the usub_sat and ssub_sat intrinsics to be tail predicated
This patch stops the usub_sat and ssub_sat intrinsics from blocking tail predication.

Differential Revision: https://reviews.llvm.org/D82571
2020-06-30 17:16:58 +01:00
Sjoerd Meijer af45907653 [ARM][MVE] Tail-predication: clean-up of unused code
After the rewrite of this pass (D79175) I missed one thing: the inserted VCTP
intrinsic can be cloned to exit blocks if there are instructions present in it
that perform the same operation, but this wasn't triggering anymore. However,
it turns out that for handling reductions, see D75533, it's actually easier not
not to have the VCTP in exit blocks, so this removes that code.

This was possible because it turned out that some other code that depended on
this, rematerialization of the trip count enabling more dead code removal
later, wasn't doing much anymore due to more aggressive dead code removal that
was added to the low-overhead loops pass.

Differential Revision: https://reviews.llvm.org/D82773
2020-06-30 17:09:36 +01:00
Samuel Tebbs d9cb811cbf [ARM] Allow rounding intrinsics to be tail predicated
This patch stops the trunc, rint, round, floor and ceil intrinsics from blocking tail predication.

Differential Revision: https://reviews.llvm.org/D82553
2020-06-30 16:52:25 +01:00
Sam Parker 2723a9dd6d [NFC][ARM] Tail predication reduction tests 2020-06-30 13:27:22 +01:00
David Green 76e0e1a55d [ARM] VCVTT instruction selection
We current extract and convert from a top lane of a f16 vector using a
VMOVX;VCVTB pair. We can simplify that to use a single VCVTT. The
pattern is mostly copied from a vector extract pattern, but produces a
VCVTTHS f32 directly.

This had to move some code around so that ARMInstrVFP had access to the
required pattern frags that were previously part of ARMInstrNEON.

Differential Revision: https://reviews.llvm.org/D81556
2020-06-26 08:58:55 +01:00