Summary:
A recent commit accidentally defined names like `MVE_VMAXAs8` as
instances of the multiclass `MVE_VMINA`, and vice versa. This has no
effect on the test suite, because nothing directly refers to those
instruction names (the isel patterns are generated in Tablegen using
`!cast<Instruction>(NAME)` inside a lower-level multiclass). But it
means that `llvm-mc -show-inst` was listing VMAXA as VMINA, and it
would also affect any further draft code gen patches that use those
instruction ids.
Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73034
This patch uses helper function rewriteLoopExitValues that is refactored in
D72602 to rematerialise the iteration count in exit blocks, so that we can
clean-up loop update expressions inside the hardware-loops later in
ARMLowOverheadLoops, which is necessary to get actual performance gains for
tail-predicated loops.
Differential Revision: https://reviews.llvm.org/D72714
This adds Post inc variants of the VLD2/4 and VST2/4 instructions in
MVE. It uses the same mechanism/nodes as Neon, transforming the
intrinsic+add pair into a ARMISD::VLD2_UPD, which gets selected to a
post-inc instruction. The code to do that is mostly taken from the
existing Neon code, but simplified as less variants are needed.
It also fills in some getTgtMemIntrinsic for the arm.mve.vld2/4
instrinsics, which allow the nodes to have MMO's, calculated as the full
length to the memory being loaded/stored.
Differential Revision: https://reviews.llvm.org/D71194
We were previously not necessarily favouring postinc for the MVE loads
and stores, leading to extra code prior to the loop to set up the
preinc. MVE in general can benefit from postinc (as we don't have
unrolled loops), and certain instructions like the VLD2's only post-inc
versions are available.
Differential Revision: https://reviews.llvm.org/D70790
Except AMDGPU/R600RegisterInfo (a bunch of MIR tests seem to have
problems), every target overrides it with true. PostMachineScheduler
requires livein information. Not providing it can cause assertion
failures in ScheduleDAGInstrs::addSchedBarrierDeps().
Introduce a method to walk through use-def chains to decide whether
it's possible to remove a given instruction and its users. These
instructions are then stored in a set until the end of the transform
when they're erased. This is now used to perform checks on the
iteration count (LoopDec chain), element count (VCTP chain) and the
possibly redundant iteration count.
As well as being able to remove chains of instructions, we know also
check that the sub feeding the vctp is producing the expected value.
Differential Revision: https://reviews.llvm.org/D71837
Recommitting e93e0d413f after reverting due to test failures, which
will hopefully now be fixed. Original commit message:
After expanding the pseudo instructions, update the liveness info.
We do this in a post-order traversal of the loop, including its
exit blocks and preheader(s).
Differential Revision: https://reviews.llvm.org/D72131
Summary:
For builds with LLVM_BUILD_LLVM_DYLIB=ON and BUILD_SHARED_LIBS=OFF
this change makes all symbols in the target specific libraries hidden
by default.
A new macro called LLVM_EXTERNAL_VISIBILITY has been added to mark symbols in these
libraries public, which is mainly needed for the definitions of the
LLVMInitialize* functions.
This patch reduces the number of public symbols in libLLVM.so by about
25%. This should improve load times for the dynamic library and also
make abi checker tools, like abidiff require less memory when analyzing
libLLVM.so
One side-effect of this change is that for builds with
LLVM_BUILD_LLVM_DYLIB=ON and LLVM_LINK_LLVM_DYLIB=ON some unittests that
access symbols that are no longer public will need to be statically linked.
Before and after public symbol counts (using gcc 8.2.1, ld.bfd 2.31.1):
nm before/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
36221
nm after/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
26278
Reviewers: chandlerc, beanz, mgorny, rnk, hans
Reviewed By: rnk, hans
Subscribers: merge_guards_bot, luismarques, smeenai, ldionne, lenary, s.egerton, pzheng, sameer.abuasal, MaskRay, wuzish, echristo, Jim, hiraditya, michaelplatings, chapuni, jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, mgrang, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, kristina, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D54439
Fix a missing and broken test: 2 VPT blocks predicated on the same VCMP
instruction that can be folded. The problem was that for each VPT block, we
record the predicate statements with a list, but the same instruction was added
twice. Thus, we were running in an assert trying to remove the same instruction
twice. To avoid this the instructions are now recorded with a set.
Differential Revision: https://reviews.llvm.org/D72699
Summary:
This cleans up a lot of ugly `foreach` bodges that I've been using to
work around the lack of those two language features. Now they both
exist, I can make then all into something more legible!
In particular, in the common pattern in `ARMInstrMVE.td` where a
multiclass defines an `Instruction` instance plus one or more `Pat` that
select it, I've used a `defvar` to wrap `!cast<Instruction>(NAME)` so
that the patterns themselves become a little more legible.
Replacing a `foreach` with a `defvar` removes a level of block
structure, so several pieces of code have their indentation changed by
this patch. Best viewed with whitespace ignored.
NFC: the output of `llvm-tblgen -print-records` on the two affected
Tablegen sources is exactly identical before and after this change, so
there should be no effect at all on any of the other generated files.
Reviewers: MarkMurrayARM, miyuki
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, dmgreen, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D72690
We have a whitelist of instructions that we allow when tail
predicating, since these are trivial ones that we've deemed need no
special handling. Now change ARMLowOverheadLoops to allow the
non-trivial instructions if they're contained within a valid VPT
block. Since a valid block is one that is predicated upon the VCTP so
we know that these non-trivial instructions will still behave as
expected once the implicit predication is used instead.
This also fixes a previous test failure.
Differential Revision: https://reviews.llvm.org/D72509
Use the already provided helper function to get the operand type so
that we can detect whether the vpr is being used as a predicate or
not. Also use existing helpers to get the predicate indices when we
converting the vpt blocks. This enables us to support both types of
vpr predicate operand.
Differential Revision: https://reviews.llvm.org/D72504
Summary:
This patch fixes pr23772 [ARM] r226200 can emit illegal thumb2 instruction: "sub sp, r12, #80".
The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP. So the above instructions was unpredictable.
To enforce that the instruction t2(ADD|SUB)ri does not write to SP we now enforce the destination register to be rGPR (That exclude PC and SP).
Different than the ARM specification, that defines one instruction that can read from SP, and one that can't, here we inserted one that can't write to SP, and other that can only write to SP as to reuse most of the hard-coded size optimizations.
When performing this change, it uncovered that emitting Thumb2 Reg plus Immediate could not emit all variants of ADD SP, SP #imm instructions before so it was refactored to be able to. (see test/CodeGen/Thumb2/mve-stacksplot.mir where we use a subw sp, sp, Imm12 variant )
It also uncovered a disassembly issue of adr.w instructions, that were only written as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt).
Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma, andreadb
Reviewed By: efriedma
Subscribers: gbedwell, john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70680
Due to the current way that we collect predicated instructions, we
can't easily handle vpsel in tail predicated loops. There are a
couple of issues:
1) It will use the VPR as a predicate operand, but doesn't have to be
instead a VPT block, which means we can assert while building up
the VPT block because we don't find another VPST to being a new
one.
2) VPSEL still requires a VPR operand even after tail predicating,
which means we can't remove it unless there is another
instruction, such as vcmp, that can provide the VPR def.
The first issue should be a relatively simple fix in the logic of the
LowOverheadLoops pass, whereas the second will require us to
represent the 'implicit' tail predication with an explicit value.
Differential Revision: https://reviews.llvm.org/D72629
Enables the masked gather pass to create a masked
gather loading from a base and vector of offsets.
This also enables v8i16 and v16i8 gather loads.
Differential Revision: https://reviews.llvm.org/D72330
The argument is llvm::null() everywhere except llvm::errs() in
llvm-objdump in -DLLVM_ENABLE_ASSERTIONS=On builds. It is used by no
target but X86 in -DLLVM_ENABLE_ASSERTIONS=On builds.
If we ever have the needs to add verbose log to disassemblers, we can
record log with a member function, instead of passing it around as an
argument.
Summary:
This always just used the same libcall as unordered, but the comparison predicate was different. This change appears to have been made when targets were given the ability to override the predicates. Before that they were hardcoded into the type legalizer. At that time we never inverted predicates and we handled ugt/ult/uge/ule compares by emitting an unordered check ORed with a ogt/olt/oge/ole checks. So only ordered needed an inverted predicate. Later ugt/ult/uge/ule were optimized to only call a single libcall and invert the compare.
This patch removes the ordered entries and just uses the inverting logic that is now present. This removes some odd things in both the Mips and WebAssembly code.
Reviewers: efriedma, ABataev, uweigand, cameron.mcinally, kpn
Reviewed By: efriedma
Subscribers: dschuff, sdardis, sbc100, arichardson, jgravelle-google, kristof.beyls, hiraditya, aheejin, sunfish, atanasyan, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72536
Summary:
This patch fixes pr23772 [ARM] r226200 can emit illegal thumb2 instruction: "sub sp, r12, #80".
The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP. So the above instructions was unpredictable.
To enforce that the instruction t2(ADD|SUB)ri does not write to SP we now enforce the destination register to be rGPR (That exclude PC and SP).
Different than the ARM specification, that defines one instruction that can read from SP, and one that can't, here we inserted one that can't write to SP, and other that can only write to SP as to reuse most of the hard-coded size optimizations.
When performing this change, it uncovered that emitting Thumb2 Reg plus Immediate could not emit all variants of ADD SP, SP #imm instructions before so it was refactored to be able to. (see test/CodeGen/Thumb2/mve-stacksplot.mir where we use a subw sp, sp, Imm12 variant )
It also uncovered a disassembly issue of adr.w instructions, that were only written as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt).
Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma
Reviewed By: efriedma
Subscribers: john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70680
The current implementation assumes there is an instruction associated
with the transform, but this is not the case for
timm/TargetConstant/immarg values. These transforms should directly
operate on a specific MachineOperand in the source
instruction. TableGen would assert if you attempted to define an
equivalent GISDNodeXFormEquiv using timm when it failed to find the
instruction matcher.
Specially recognize SDNodeXForms on timm, and pass the operand index
to the render function.
Ideally this would be a separate render function type that looks like
void renderFoo(MachineInstrBuilder, const MachineOperand&), but this
proved to be somewhat mechanically painful. Add an optional operand
index which will only be passed if the transform should only look at
the one source operand.
Theoretically it would also be possible to only ever pass the
MachineOperand, and the existing renderers would check the parent. I
think that would be somewhat ugly for the standard usage which may
want to inspect other operands, and I also think MachineOperand should
eventually not carry a pointer to the parent instruction.
Use it in one sample pattern. This isn't a great example, since the
transform exists to satisfy DAG type constraints. This could also be
avoided by just changing the MachineInstr's arbitrary choice of
operand type from i16 to i32. Other patterns have nontrivial uses, but
this serves as the simplest example.
One flaw this still has is if you try to use an SDNodeXForm defined
for imm, but the source pattern uses timm, you still see the "Failed
to lookup instruction" assert. However, there is now a way to avoid
it.
Only PPC seems to be using it, and only checks some simple cases and
doesn't distinguish between FP. Just switch to using LLT to simplify
use from GlobalISel.
We don't unroll vector loops for MVE targets, but we miss the case
when loops only contain intrinsic calls. So just move the logic a
bit to catch this case.
Differential Revision: https://reviews.llvm.org/D72440
After expanding the pseudo instructions, update the liveness info.
We do this in a post-order traversal of the loop, including its
exit blocks and preheader(s).
Differential Revision: https://reviews.llvm.org/D72131
This batch of intrinsics fills in all the shift instructions that take
a variable shift distance in a register, instead of an immediate. Some
of these instructions take a single shift distance in a scalar
register and apply it to all lanes; others take a vector of per-lane
distances.
These instructions are all basically one family, varying in whether
they saturate out-of-range values, and whether they round when bits
are shifted off the bottom. I've implemented them at the IR level by a
much smaller family of IR intrinsics, which take flag parameters to
indicate saturating and/or rounding (along with the usual one to
specify signed/unsigned integers).
An oddity is that all of them are //left// shift instructions – but if
you pass a negative shift count, they'll shift right. So the vector
shift distances are always vectors of //signed// integers, regardless
of whether you're considering the other input vector to be of signed
or unsigned. Also, even the simplest `vshlq` instruction in this
family (neither saturating nor rounding) has to be implemented as an
IR intrinsic, because the ordinary LLVM IR `shl` operation would
consider an out-of-range shift count to be undefined behavior.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D72329
This batch of intrinsics covers two sets of immediate shift
instructions, which have in common that they only overwrite part of
their output register and so they need an extra input giving its
previous value.
The VSLI and VSRI instructions shift each lane of the input vector
left or right just as if they were normal immediate VSHL/VSHR, but
then they only overwrite the output bits that correspond to actual
shifted bits of the input. So VSLI will leave the low n bits of each
output lane unchanged, and VSRI the same with the top n bits.
The V[Q][R]SHR[U]N family are all narrowing shifts: they take an input
vector of 2n-bit integers, shift each lane right by a constant, and
then narrowing the shifted result to only n bits. So they only
overwrite half of the n-bit lanes in the output register, and the B/T
suffix indicates whether it's the bottom or top half of each 2n-bit
lane.
I've implemented the whole of the latter family using a single IR
intrinsic `vshrn`, which takes a lot of i32 parameters indicating
which instruction it expands to (by specifying signedness of the input
and output types, whether it saturates and/or rounds, etc).
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D72328
Adds a pass to the ARM backend that takes a v4i32
gather and transforms it into a call to MVE's
masked gather intrinsics.
Differential Revision: https://reviews.llvm.org/D71743
This is a recommit of D71330, but with a few things fixed and changed:
1) ReachingDefAnalysis: this was not running with optnone as it was checking
skipFunction(), which other analysis passes don't do. I guess this is a
copy-paste from a codegen pass.
2) VPTBlockPass: here I've added skipFunction(), because like most/all
optimisations, we don't want to run this with optnone.
This fixes the issues with the initial/previous commit: the VPTBlockPass was
running with optnone, but ReachingDefAnalysis wasn't, and so VPTBlockPass was
crashing querying ReachingDefAnalysis.
I've added test case mve-vpt-block-optnone.mir to check that we don't run
VPTBlock with optnone.
Differential Revision: https://reviews.llvm.org/D71470
Summary:
Instead of generating two i32 instructions for each load or store of a volatile
i64 value (two LDRs or STRs), now emit LDRD/STRD.
These improvements cover architectures implementing ARMv5TE or Thumb-2.
Reviewers: dmgreen, efriedma, john.brawn, nickdesaulniers
Reviewed By: efriedma, nickdesaulniers
Subscribers: nickdesaulniers, vvereschaka, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70072
printInst prints a branch/call instruction as `b offset` (there are many
variants on various targets) instead of `b address`.
It is a convention to use address instead of offset in most external
symbolizers/disassemblers. This difference makes `llvm-objdump -d`
output unsatisfactory.
Add `uint64_t Address` to printInst(), so that it can pass the argument to
printInstruction(). `raw_ostream &OS` is moved to the last to be
consistent with other print* methods.
The next step is to pass `Address` to printInstruction() (generated by
tablegen from the instruction set description). We can gradually migrate
targets to print addresses instead of offsets.
In any case, downstream projects which don't know `Address` can pass 0 as
the argument.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D72172
The segmented stack lowering code appears to be using ARM opcodes under
Thumb2. The MRC opcode will be the same for Thumb and ARM, but t2LDR
seems wrong. Either way, using the correct thumb vs arm opcodes is more
correct.
Differential Revision: https://reviews.llvm.org/D72074
We were previously unconditionally using the ARM::TRAP opcode, even
under Thumb. My understanding is that these are essentially the same
thing (they both result in a trap under Thumb), but the ARM::TRAP opcode
is marked as requiring IsARM, so it is more correct to use ARM::tTRAP.
Differential Revision: https://reviews.llvm.org/D72075
Summary:
Running an end-to-end test last week I noticed that a lot of the ACLE
intrinsics that operate differently on vectors of signed and unsigned
integers were ending up generating the signed version of the
instruction unconditionally. This is because the IR intrinsics had no
way to distinguish signed from unsigned: the LLVM type system just
calls them both `v8i16` (or whatever), so you need either separate
intrinsics for signed and unsigned, or a flag parameter that tells
ISel which one to choose.
This patch fixes all the problems of that kind that I've noticed, by
adding an i32 flag parameter to many of the IR intrinsics which is set
to 1 for unsigned (matching the existing practice in cases where we
got it right), and conditioning all the isel patterns on that flag. So
the fundamental change is in `IntrinsicsARM.td`, changing the
low-level IR intrinsics API; there are knock-on changes in
`arm_mve.td` (adjusting code gen for the ACLE intrinsics to use the
modified API) and in `ARMInstrMVE.td` (adjusting isel to expect the
new unsigned flags). The rest of this patch is boringly updating tests.
Reviewers: dmgreen, miyuki, MarkMurrayARM
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D72270
Summary:
Due to a copy-paste error in the isel patterns, the predicated version
of this intrinsic was expanding to the `VMAXNMT.F32` instruction
instead of `VMAXNMT.F16`. Similarly for vminnm.
Reviewers: dmgreen, miyuki, MarkMurrayARM
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72269
I've added a few more debug messages to MVETailPredication because I wanted to
trace better which instructions are added/removed. And while I was at it, I
factored out one function which I thought was clearer, and have added some
comments to describe better the flow between MVETailPredication and
ARMLowOverheadLoops.
Differential Revision: https://reviews.llvm.org/D71549
This simplifies the generic interface and also makes SHF_ARM_PURECODE
more robust (fixes a TODO). Inspecting MCDataFragment contents covers
more cases than MCObjectStreamer::EmitBytes.
This adds extra scalar handling to isFMAFasterThanFMulAndFAdd, allowing
the target independent code to handle more folds in more situations (for
example if the fast math flags are present, but the global
AllowFPOpFusion option isnt). It also splits apart the HasSlowFPVMLx
into HasSlowFPVFMx, to allow VFMA and VMLA to be controlled separately
if needed.
Differential Revision: https://reviews.llvm.org/D72139
SCEVExpander modifies the underlying function so it is more suitable in
Transforms/Utils, rather than Analysis. This allows using other
transform utils in SCEVExpander.
Reviewers: sanjoy.google, efriedma, reames
Reviewed By: sanjoy.google
Differential Revision: https://reviews.llvm.org/D71537