Commit Graph

181 Commits

Author SHA1 Message Date
David Green b3499f572d [ARM] Change VDUP type to i32 for MVE
The MVE VDUP instruction takes a GPR and splats it into every lane of a
vector register. Unlike NEON we do not have a VDUPLANE equivalent
instruction that does the same splat from an FP register. Previously a VDUP
to a v4f32/v8f16 would be represented as a (v4f32 VDUP f32), which
would mean the instruction pattern needs to add a COPY_TO_REGCLASS to
the GPR.

Instead this now converts that earlier during an ISel DAG combine,
converting (VDUP x) to (VDUP (bitcast x)). This can allow instruction
selection to tell that the input needs to be an i32, which in one of the
testcases allows it to use ldr (or specifically ldm) over (vldr;vmov).

Whilst this is simple enough for floats, as the type sizes are the same,
there is no BITCAST equivalent for getting a half into an i32. This uses
a VMOVrh ARMISD node, which doesn't know the same tricks yet.
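For illustration, here is a hedged sketch of ACLE-level source that
exercises the f32 VDUP path (the function name is made up and this
example is not part of the patch):

    #include <arm_mve.h>

    // With the combine above, the splat input can be seen as an i32,
    // so the value can be loaded with an integer ldr instead of going
    // via vldr + vmov.
    float32x4_t splat_from_mem(const float *p) {
        return vdupq_n_f32(*p);
    }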

Differential Revision: https://reviews.llvm.org/D76292
2020-03-20 09:48:45 +00:00
Simon Tatham e13d153c1b [ARM,MVE] Add intrinsics for the VQDMLAD family.
Summary:
This is another set of instructions too complicated to be sensibly
expressed in IR by anything short of a target-specific intrinsic.
Given input vectors a,b, the instruction generates intermediate values
2*(a[0]*b[0]+a[1]*b[1]), 2*(a[2]*b[2]+a[3]*b[3]), etc; takes the high
half of each double-width value, and overwrites half the lanes in the
output vector c, which you therefore have to provide the input value
of. Optionally you can swap the elements of b so that the products are
things like a[0]*b[1]+a[1]*b[0]; optionally you can round to nearest when
taking the high half; and optionally you can take the difference
rather than sum of the two products. Finally, saturation is applied
when converting back to a single-width vector lane.
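As a rough guide, here is a scalar model of one output lane: my own
sketch of the description above, assuming 16-bit lanes, with the
saturation and rounding corner cases unverified against the spec:

    #include <stdint.h>

    static int16_t sat16(int64_t x) {
        if (x > INT16_MAX) return INT16_MAX;
        if (x < INT16_MIN) return INT16_MIN;
        return (int16_t)x;
    }

    // One lane pair of a,b -> one output lane, with the optional
    // exchange already applied and the subtract/round options as flags.
    static int16_t qdmladh_lane(int16_t a0, int16_t a1,
                                int16_t b0, int16_t b1,
                                int subtract, int rounding) {
        int64_t prod = subtract ? (int64_t)a0 * b0 - (int64_t)a1 * b1
                                : (int64_t)a0 * b0 + (int64_t)a1 * b1;
        int64_t doubled = 2 * prod;        // the doubling in 'VQDML...'
        if (rounding)
            doubled += 1 << 15;            // round when taking the high half
        return sat16(doubled >> 16);       // high half, saturated to 16 bits
    }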

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: miyuki

Subscribers: kristof.beyls, hiraditya, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D76359
2020-03-18 17:11:22 +00:00
Simon Tatham 928776de92 [ARM,MVE] Add intrinsics for the VQDMLAH family.
Summary:
These are complicated integer multiply+add instructions with extra
saturation, taking the high half of a double-width product, and
optional rounding. There's no sensible way to represent that in
standard IR, so I've converted the clang builtins directly to
target-specific intrinsics.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: miyuki

Subscribers: kristof.beyls, hiraditya, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D76123
2020-03-18 10:55:04 +00:00
Simon Tatham 28c5d97bee [ARM,MVE] Add intrinsics and isel for MVE integer VMLA.
Summary:
These instructions compute multiply+add in integers, with one of the
operands being a splat of a scalar. (VMLA and VMLAS differ in whether
the splat operand is a multiplier or the addend.)
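A hedged usage sketch of the corresponding ACLE intrinsics follows;
the argument order is my reading of the ACLE naming scheme rather than
something taken from this patch:

    #include <arm_mve.h>

    // VMLA: acc + m * s (the splat operand is a multiplier)
    int32x4_t mla(int32x4_t acc, int32x4_t m, int32_t s) {
        return vmlaq_n_s32(acc, m, s);
    }

    // VMLAS: m1 * m2 + s (the splat operand is the addend)
    int32x4_t mlas(int32x4_t m1, int32x4_t m2, int32_t s) {
        return vmlasq_n_s32(m1, m2, s);
    }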

I've represented these in IR using existing standard IR operations for
the unpredicated forms. The predicated forms are done with target-
specific intrinsics, as usual.

When operating on n-bit vector lanes, only the bottom n bits of the
i32 scalar operand are used. So we have to tell that to isel lowering,
to allow it to remove a pointless sign- or zero-extension instruction
on that input register. That's done in `PerformIntrinsicCombine`, but
first I had to enable `PerformIntrinsicCombine` for MVE targets
(previously all the intrinsics it handled were for NEON), and make it
a method of `ARMTargetLowering` so that it can get at
`SimplifyDemandedBits`.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D76122
2020-03-18 10:55:04 +00:00
Simon Tatham 3f8e714e2f [ARM,MVE] Add intrinsics and isel for MVE fused multiply-add.
Summary:
This adds the ACLE intrinsic family for the VFMA and VFMS
instructions, which perform fused multiply-add on vectors of floats.

I've represented the unpredicated versions in IR using the cross-
platform `@llvm.fma` IR intrinsic. We already had isel rules to
convert one of those into a vector VFMA in the simplest possible way;
but we didn't have rules to detect a negated argument and turn it into
VFMS, or rules to detect a splat argument and turn it into one of the
two vector/scalar forms of the instruction. Now we have all of those.
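For reference, a hedged sketch of the kinds of source whose IR these
rules should now match (function names are mine, not from the patch):

    #include <arm_mve.h>

    float32x4_t fma_vec(float32x4_t add, float32x4_t m1, float32x4_t m2) {
        return vfmaq_f32(add, m1, m2);    // add + m1*m2  -> vector VFMA
    }

    float32x4_t fms_vec(float32x4_t add, float32x4_t m1, float32x4_t m2) {
        return vfmsq_f32(add, m1, m2);    // add - m1*m2  -> VFMS (negated operand)
    }

    float32x4_t fma_scalar(float32x4_t add, float32x4_t m1, float m2) {
        return vfmaq_n_f32(add, m1, m2);  // splat operand -> vector/scalar VFMA
    }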

The predicated form uses a target-specific intrinsic as usual, but
I've stuck to just one, for a predicated FMA. The subtraction and
splat versions are code-generated by passing an fneg or a splat as one
of its operands, the same way as the unpredicated version.

In arm_mve_defs.h, I've had to introduce a tiny extra piece of
infrastructure: a record `id` for use in codegen dags which implements
the identity function. (Just because you can't declare a Tablegen
value of type dag which is //only// a `$varname`: you have to wrap it
in something. Now I can write `(id $varname)` to get the same effect.)

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75998
2020-03-12 11:13:50 +00:00
Sam Parker a314050065 [ARM][MVE] VFMA and VFMS validForTailPredication
Add four instructions to the whitelist.

Differential Revision: https://reviews.llvm.org/D75902
2020-03-10 14:58:29 +00:00
Sam Parker 4cf0dddcc6 [ARM][MVE] Enable VMOVN for tail predication
These instructions also don't exchange lanes, so make them legal.

Differential Revision: https://reviews.llvm.org/D75669
2020-03-06 08:59:22 +00:00
Sam Parker 77e30758dd [ARM][MVE] Enable *SHRN* for tail predication
These instructions don't swap lanes so make them valid.

Differential Revision: https://reviews.llvm.org/D75667
2020-03-05 11:00:45 +00:00
Simon Tatham 1a8cbfa514 [ARM,MVE] Add ACLE intrinsics for VCVT[ANPM] family.
Summary:
These instructions convert a vector of floats to a vector of integers
of the same size, with assorted non-default rounding modes.
Implemented in IR as target-specific intrinsics, because as far as I
can see there are no matches for that functionality in the standard IR
intrinsics list.
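A hedged usage sketch (intrinsic names follow the ACLE convention;
this is not code from the patch):

    #include <arm_mve.h>

    int32x4_t cvt_away(float32x4_t x)    { return vcvtaq_s32_f32(x); } // ties away from zero
    int32x4_t cvt_nearest(float32x4_t x) { return vcvtnq_s32_f32(x); } // ties to even
    int32x4_t cvt_up(float32x4_t x)      { return vcvtpq_s32_f32(x); } // towards +inf
    int32x4_t cvt_down(float32x4_t x)    { return vcvtmq_s32_f32(x); } // towards -inf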

Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75255
2020-03-02 10:33:30 +00:00
Simon Tatham b08d2ddd69 [ARM,MVE] Add ACLE intrinsics for VCVT.F32.F16 family.
Summary:
These instructions make a vector of `<4 x float>` by widening every
other lane of a vector of `<8 x half>`.
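A hedged usage sketch (not from the patch; names follow the ACLE
convention):

    #include <arm_mve.h>

    // Widen the even-indexed (bottom) f16 lanes to f32.
    float32x4_t widen_bottom(float16x8_t x) { return vcvtbq_f32_f16(x); }

    // Widen the odd-indexed (top) f16 lanes to f32.
    float32x4_t widen_top(float16x8_t x)    { return vcvttq_f32_f16(x); }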

I wondered about representing these using standard IR, along the lines
of a shufflevector to extract elements of the input into a `<4 x half>`
followed by an `fpext` to turn that into `<4 x float>`. But it looks as
if that would take a lot of work in isel lowering to make it match any
pattern I could sensibly write in Tablegen, and also I haven't been
able to think of any other case where that pattern might be generated
in IR, so there wouldn't be any extra code generation win from doing
it that way.

Therefore, I've just used another target-specific intrinsic. We can
always change it to the other way later if anyone thinks of a good
reason.

(In order to put the intrinsic definition near similar things in
`IntrinsicsARM.td`, I've also lifted the definition of the
`MVEMXPredicated` multiclass higher up the file, without changing it.)

Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard

Reviewed By: miyuki

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75254
2020-03-02 10:33:30 +00:00
Simon Tatham 69441e53c9 [ARM,MVE] Correct MC operands in VCVT.F32.F16. (NFC)
Summary:
The two MVE instructions that convert between v4f32 and v8f16 were
implemented as instances of the same class, with the same MC operand
list.

But that's not really appropriate, because the narrowing conversion
only partially overwrites its output register (it only has 4 f16
values to write into a vector of 8), so even when unpredicated, it
needs a $Qd_src input, a constraint tying that to the $Qd output, and
a vpred_n.

The widening conversion is better represented like any other
instruction that completely replaces its output when unpredicated: it
should have no $Qd_src operand, and instead, a vpred_r containing a
$inactive parameter. That's a better match to other similar
instructions, such as its integer analogue, the VMOVL instruction that
makes a v4i32 by sign- or zero-extending every other lane of a v8i16.

This commit brings the widening VCVT.F32.F16 into line with the other
instructions that behave like it. That means you can write isel
patterns that use it unpredicated, without having to add a pointless
undefined $QdSrc operand.

No existing code generation uses that instruction yet, so there should
be no functional change from this fix.

Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75253
2020-03-02 10:33:30 +00:00
Simon Tatham a41ecf0eb0 [ARM,MVE] Add ACLE intrinsics for VQMOV[U]N family.
Summary:
These instructions work like VMOVN (narrowing a vector of wide values
to half size, and overwriting every other lane of an output register
with the result), except that the narrowing conversion is saturating.
They come in three signedness flavours: signed to signed, unsigned to
unsigned, and signed to unsigned. All are represented in IR by a
target-specific intrinsic that takes two separate 'unsigned' flags.
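A hedged usage sketch of the three flavours; the argument order, with
the partially-preserved vector first, is my assumption from the usual
ACLE pattern:

    #include <arm_mve.h>

    // signed -> signed, writing the even (bottom) lanes of 'a'
    int16x8_t narrow_ss(int16x8_t a, int32x4_t b)    { return vqmovnbq_s32(a, b); }

    // unsigned -> unsigned
    uint16x8_t narrow_uu(uint16x8_t a, uint32x4_t b) { return vqmovnbq_u32(a, b); }

    // signed -> unsigned
    uint16x8_t narrow_su(uint16x8_t a, int32x4_t b)  { return vqmovunbq_s32(a, b); }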

Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75252
2020-03-02 10:33:30 +00:00
David Green e2a2f3f7fc [ARM] MVE VMLAS
This adds extra patterns for the VMLAS MVE instruction, which performs
Qda = Qda * Qn + Rm, a similar pattern to the existing VMLA. The sinking
of splat(Rm) into the loop is already performed, meaning we just need
extra Pat's in tablegen.

Differential Revision: https://reviews.llvm.org/D75115
2020-02-28 14:27:21 +00:00
Simon Tatham 9eb3cc10b2 [ARM,MVE] Add predicated intrinsics for many unary functions.
Summary:
This commit adds the predicated MVE intrinsics for the same set of
unary operations that I added in their unpredicated forms in

* D74333 (vrint)
* D74334 (vrev)
* D74335 (vclz, vcls)
* D74336 (vmovl)
* D74337 (vmovn)

but since the predicated versions are a lot more similar to each
other, I've kept them all together in a single big patch. Everything
here is done in the standard way we've been doing other predicated
operations: an IR intrinsic called `@llvm.arm.mve.foo.predicated` and
some isel rules that match that alongside whatever they accept for the
unpredicated version of the same instruction.
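For example (a hedged sketch, not taken from the patch), a predicated
vclz at the source level looks like this:

    #include <arm_mve.h>

    // Lanes whose predicate bit is 0 take their value from 'inactive';
    // clang lowers this to the corresponding .predicated IR intrinsic
    // plus the predicate conversion.
    int32x4_t clz_m(int32x4_t inactive, int32x4_t x, mve_pred16_t p) {
        return vclzq_m_s32(inactive, x, p);
    }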

In order to write the isel rules conveniently, I've refactored the
existing isel rules for the affected instructions into multiclasses
parametrised by a vector-type class, in the usual way. All those
refactorings are intended to leave the existing isel rules unchanged:
the only difference should be that new ones for the predicated
intrinsics are introduced.

The only tiny infrastructure change I needed in this commit was to
change the implementation of `IntrinsicMX` in `arm_mve_defs.td` so
that the records it defines are anonymous rather than named (and use
`NameOverride` to set the output intrinsic name), which allows me to
call it twice in two multiclasses with the same `NAME` without a
tablegen-time error.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D75165
2020-02-26 15:12:07 +00:00
Mikhail Maltsev f4fd7dbf85 [ARM,MVE] Add vqdmull[b,t]q intrinsic families
Summary:
This patch adds two families of ACLE intrinsics: vqdmullbq and
vqdmulltq (including vector-vector and vector-scalar variants) and the
corresponding LLVM IR intrinsics llvm.arm.mve.vqdmull and
llvm.arm.mve.vqdmull.predicated.
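A hedged usage sketch of the two variants (not from the patch):

    #include <arm_mve.h>

    // bottom (even) lanes: saturating doubling multiply, long result
    int32x4_t dmull_vv(int16x8_t a, int16x8_t b) { return vqdmullbq_s16(a, b); }

    // the same, with the second operand splatted from a scalar
    int32x4_t dmull_vs(int16x8_t a, int16_t b)   { return vqdmullbq_n_s16(a, b); }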

Reviewers: simon_tatham, MarkMurrayARM, dmgreen, ostannard

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74845
2020-02-20 10:51:19 +00:00
Mikhail Maltsev 461fd94f00 [ARM,MVE] Fix predicate types of some intrinsics
Summary:
Some predicated MVE intrinsics return a vector with element size
different from the input vector element size. In this case the
predicate type must correspond to the output vector type.

The following intrinsics use the incorrect predicate type:
* llvm.arm.mve.mull.int.predicated
* llvm.arm.mve.mull.poly.predicated
* llvm.arm.mve.vshll.imm.predicated

This patch fixes the issue.

Reviewers: simon_tatham, dmgreen, ostannard, MarkMurrayARM

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74838
2020-02-19 16:24:54 +00:00
David Green 33aa5dfe9c [ARM] VMLAVA reduction patterns
Similar to VADDV and VADDLV that have been added recently, this adds
lowering and patterns for VMLAV, VMLAVA, VMLALV and VMLALVA. They
perform the same roles as the adds, just folding a mul into the same
instruction (and so taking two inputs). As such, they need to be lowered
in the same way as the types are often not legal.

Differential Revision: https://reviews.llvm.org/D74390
2020-02-19 12:39:58 +00:00
David Green fceb3e3b4a [ARM] MVE VADDLV lowering
Following on from the extra VADDV lowering, this extends things to
handle VADDLV which allows summing values into a pair of i32 registers,
together treated as a i64. This needs to be done in DAGCombine too as
the types are otherwise illegal, which is a fairly simple addition on
top of the existing code.

There is also a VADDLVA instruction handled here, that adds the incoming
values from the two general purpose registers. As opposed to the
non-long version where we could just add patterns for add(x, VADDV), the
long version needs to handle this early before the i64 has been split
into too many pieces.

Differential Revision: https://reviews.llvm.org/D74224
2020-02-19 11:07:20 +00:00
David Green 51c6e9445c [ARM] Extra MVE VADDV reduction patterns
We already make use of the VADDV vector reduction instruction for cases
where the input and the output start out at the same type. The MVE
instruction however will sum into an i32, so if we are summing a v16i8
into an i32, we can still use the same instructions. In terms of IR,
this looks like a sext of a legal type (v16i8) into a very illegal type
(v16i32) and a vecreduce.add of that into the result. This means we have
to catch the pattern early in a DAG combine, producing a target VADDVs/u
node, where the signedness is now important.
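A hedged example of the kind of source loop this is aimed at, once
the vectorizer produces the zext + vecreduce.add IR described above:

    #include <stdint.h>

    uint32_t sum_bytes(const uint8_t *p, int n) {
        uint32_t s = 0;
        for (int i = 0; i < n; i++)
            s += p[i];   // v16i8 lanes summed into an i32 -> VADDV/VADDVA.u8
        return s;
    }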

This is the first part, handling VADDV and VADDVA. There are also
VADDVL/VADDVLA instructions, which are interesting because they sum into
a 64bit value. And VMLAV and VMLALV, which are interesting because they
also do a multiply of two values. It may look a little odd in places as
a result.

On its own this will probably not do very much, as the vectorizer will
not produce this IR yet.

Differential Revision: https://reviews.llvm.org/D74218
2020-02-19 09:45:35 +00:00
Mikhail Maltsev 63809d365e [ARM,MVE] Add vbrsrq intrinsics family
Summary:
This patch adds a new MVE intrinsics family, `vbrsrq`: vector bit
reverse and shift right. The intrinsics are compiled into the VBRSR
instruction. Two new LLVM IR intrinsics were also added: arm.mve.vbrsr
and arm.mve.vbrsr.predicated.

Reviewers: simon_tatham, dmgreen, ostannard, MarkMurrayARM

Reviewed By: simon_tatham

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74721
2020-02-18 17:31:21 +00:00
Simon Tatham c32af4447f [ARM,MVE] Add the vmovnbq,vmovntq intrinsic family.
Summary:
These are in some sense the inverse of vmovl[bt]q: they take a vector
of n wide elements and truncate each to half its width. So they only
write half a vector's worth of output data, and therefore they also
take an 'inactive' parameter to provide the other half of the data in
the output vector. So vmovnb overwrites the even lanes of 'inactive'
with the narrowed values from the main input, and vmovnt overwrites
the odd lanes.
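A hedged usage sketch; the argument order, with 'inactive' first, is
my assumption from the usual ACLE pattern:

    #include <arm_mve.h>

    // Narrow the i32 lanes of 'b' to i16 and write them into the even
    // lanes of 'inactive'; the odd lanes keep their original values.
    int16x8_t narrow_bottom(int16x8_t inactive, int32x4_t b) {
        return vmovnbq_s32(inactive, b);
    }

    // The same, but writing the odd lanes.
    int16x8_t narrow_top(int16x8_t inactive, int32x4_t b) {
        return vmovntq_s32(inactive, b);
    }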

LLVM had existing codegen which generates these MVE instructions in
response to IR that takes two vectors of wide elements, or two vectors
of narrow ones. But in this case, we have one vector of each. So my
clang codegen strategy is to narrow the input vector of wide elements
by simply reinterpreting it as the output type, and then we have two
narrow vectors and can represent the operation as a vector shuffle
that interleaves lanes from both of them.

Even so, not all the cases I needed ended up being selected as a
single MVE instruction, so I've added a couple more patterns that spot
combinations of the 'MVEvmovn' and 'ARMvrev32' SDNodes which can be
generated as a VMOVN instruction with operands swapped.

This commit adds the unpredicated forms only.

Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74337
2020-02-18 09:34:50 +00:00
Simon Tatham 5e97940cd2 [ARM,MVE] Add the vmovlbq,vmovltq intrinsic family.
Summary:
These intrinsics take a vector of 2n elements, and return a vector of
n wider elements obtained by sign- or zero-extending every other
element of the input vector. They're represented in IR as a
shufflevector that extracts the odd or even elements of the input,
followed by a sext or zext.
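A hedged usage sketch (not from the patch):

    #include <arm_mve.h>

    // Sign-extend the even (bottom) i16 lanes to i32 -> VMOVLB.
    int32x4_t widen_bottom(int16x8_t x) { return vmovlbq_s16(x); }

    // Sign-extend the odd (top) i16 lanes to i32 -> the new VMOVLT rules.
    int32x4_t widen_top(int16x8_t x)    { return vmovltq_s16(x); }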

Existing LLVM codegen already matches this pattern and generates the
VMOVLB instruction (which widens the even-index input lanes). But no
existing isel rule was generating VMOVLT, so I've added some. However,
the new rules currently only work in little-endian MVE, because the
pattern they expect from isel lowering includes a bitconvert which
doesn't have the right semantics in big-endian.

The output of one existing codegen test is improved by those new
rules.

This commit adds the unpredicated forms only.

Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74336
2020-02-18 09:34:50 +00:00
Simon Tatham 9dcc1667ab [ARM] Allow `ARMVectorRegCast` to match bitconverts too. (NFC)
Summary:
When we start putting instances of `ARMVectorRegCast` in complex isel
patterns, it will be awkward that they're often turned into the more
standard `bitconvert` in little-endian mode. We'd rather not have to
write separate isel patterns for the two endiannesses, matching
different but equivalent cast operations.

This change aims to fix that awkwardness in advance, by turning the
Tablegen record `ARMVectorRegCast` from a simple `SDNode` instance
into a `PatFrags` that can match either kind of cast – with a
predicate that prevents it matching a bitconvert in the big-endian
case, where bitconvert isn't semantically identical.

No existing code generation should be affected by this change, but it
will enable the patterns introduced by D74336 to work in both
endiannesses.

Reviewers: dmgreen

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74716
2020-02-18 09:34:50 +00:00
Simon Tatham 68b49f7ef4 [ARM,MVE] Add intrinsics vclzq and vclsq.
Summary:
vclzq maps nicely to the existing target-independent @llvm.ctlz IR
intrinsic. But vclsq ('count leading sign bits') has no corresponding
target-independent intrinsic, so I've made up @llvm.arm.mve.vcls.

This commit adds the unpredicated forms only.

Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard

Reviewed By: miyuki

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74335
2020-02-18 09:34:50 +00:00
Simon Tatham c8b3196e54 [ARM,MVE] Add intrinsics for FP rounding operations.
Summary:
This adds the unpredicated forms of six different MVE intrinsics which
all round a vector of floating-point numbers to integer values,
leaving them still in FP format, differing only in rounding mode and
exception settings.

Five of them map to existing target-independent intrinsics in LLVM IR,
such as @llvm.trunc and @llvm.rint. The sixth, mapping to the `vrintn`
instruction, is done by inventing a target-specific intrinsic.
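A hedged sketch of a few of the user-level intrinsics involved; the
IR mapping in the comments is my assumption of the correspondence:

    #include <arm_mve.h>

    float32x4_t to_zero(float32x4_t x)   { return vrndq_f32(x);  } // expected @llvm.trunc
    float32x4_t current(float32x4_t x)   { return vrndxq_f32(x); } // expected @llvm.rint
    float32x4_t ties_even(float32x4_t x) { return vrndnq_f32(x); } // vrintn: target-specific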

(`vrintn` behaves the same as `vrintx` in terms of the output value:
the side effects on the FPSCR flags are the only difference between
the two. But ACLE specifies separate user-callable intrinsics for the
two, so the side effects matter enough to make sure we generate the
right one of the two instructions in each case.)

Reviewers: dmgreen, miyuki, MarkMurrayARM, ostannard

Reviewed By: miyuki

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74333
2020-02-18 09:34:50 +00:00
Mikhail Maltsev 489f62e801 [ARM,MVE] Add vector-scalar intrinsics
Summary:
This patch adds vector-scalar variants to the following families of
MVE intrinsics:
* vaddq
* vsubq
* vmulq
* vqaddq
* vqsubq
* vhaddq
* vhsubq
* vqdmulhq
* vqrdmulhq

The vector-scalar variants perform a splat operation on the scalar
operand and then perform the same operations as their vector-vector
counterparts. Code generation is done accordingly (using LLVM IR 'insert'
and 'shuffle' operations which are later converted into an ARMvdup
SDNode).
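A hedged example of what one of these looks like at the source level;
the clang codegen described above turns the scalar into an insert +
shuffle splat in IR:

    #include <arm_mve.h>

    // vector + splat(scalar), represented with standard IR operations
    int32x4_t add_scalar(int32x4_t a, int32_t b) {
        return vaddq_n_s32(a, b);
    }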

Reviewers: simon_tatham, dmgreen, MarkMurrayARM, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74620
2020-02-17 17:47:05 +00:00
David Green 9d4c597541 [ARM] Fix ReconstructShuffle for bigendian
Simon pointed out that this function is doing a bitcast, which can be
incorrect for big endian. That makes the lowering of VMOVN in MVE
wrong, but the function is shared between Neon and MVE so both can
be incorrect.

This attempts to fix things by using the newly added VECTOR_REG_CAST
instead of the BITCAST. As it may now be used on Neon, I've added the
relevant patterns for it there too. I've also added a quick dag combine
for it to remove them where possible.

Differential Revision: https://reviews.llvm.org/D74485
2020-02-13 09:56:46 +00:00
Mikhail Maltsev 2694cc3dca [ARM][MVE] Add fixed point vector conversion intrinsics
Summary:
This patch implements the following Arm ACLE MVE intrinsics:
* vcvtq_n_*
* vcvtq_m_n_*
* vcvtq_x_n_*

and two corresponding LLVM IR intrinsics:
* int_arm_mve_vcvt_fix (vcvtq_n_*)
* int_arm_mve_vcvt_fix_predicated (vcvtq_m_n_*, vcvtq_x_n_*)

Reviewers: simon_tatham, ostannard, MarkMurrayARM, dmgreen

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74134
2020-02-06 16:49:45 +00:00
Simon Tatham 961530fdc9 [ARM,MVE] Fix vreinterpretq in big-endian mode.
Summary:
In big-endian MVE, the simple vector load/store instructions (i.e.
both contiguous and non-widening) don't all store the bytes of a
register to memory in the same order: it matters whether you did a
VSTRB.8, VSTRH.16 or VSTRW.32. Put another way, the in-register
formats of different vector types relate to each other in a different
way from the in-memory formats.

So, if you want to 'bitcast' or 'reinterpret' one vector type as
another, you have to carefully specify which you mean: did you want to
reinterpret the //register// format of one type as that of the other,
or the //memory// format?

The ACLE `vreinterpretq` intrinsics are specified to reinterpret the
register format. But I had implemented them as LLVM IR bitcast, which
is specified for all types as a reinterpretation of the memory format.
So a `vreinterpretq` intrinsic, applied to values already in registers,
would code-generate incorrectly if compiled big-endian: instead of
emitting no code, it would emit a `vrev`.

To fix this, I've introduced a new IR intrinsic to perform a
register-format reinterpretation: `@llvm.arm.mve.vreinterpretq`. It's
implemented by a trivial isel pattern that expects the input in an
MQPR register, and just returns it unchanged.

In the clang codegen, I only emit this new intrinsic where it's
actually needed: I prefer a bitcast wherever it will have the right
effect, because LLVM understands bitcasts better. So we still generate
bitcasts in little-endian mode, and even in big-endian when you're
casting between two vector types with the same lane size.
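A hedged example of a case that needs the new intrinsic (different
lane sizes, so a plain bitcast would be wrong in big-endian):

    #include <arm_mve.h>

    // Reinterprets the register format: this should emit no code in
    // either endianness.
    int8x16_t reinterpret(int32x4_t x) {
        return vreinterpretq_s8_s32(x);
    }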

For testing, I've moved all the codegen tests of vreinterpretq out
into their own file, so that they can have a different set of RUN
lines to check both big- and little-endian.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73786
2020-02-03 11:20:06 +00:00
Simon Tatham cf7e98e6f7 [ARM,MVE] Add intrinsics for vdupq.
Summary:
The unpredicated case of this is trivial: the clang codegen just makes
a vector splat of the input, and LLVM isel is already prepared to
handle that. For the predicated version, I've generated a `select`
between the same vector splat and the `inactive` input parameter, and
added new Tablegen isel rules to match that pattern into a predicated
`MVE_VDUP` instruction.
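A hedged usage sketch of the two forms (not from the patch):

    #include <arm_mve.h>

    int32x4_t dup(int32_t x) {
        return vdupq_n_s32(x);                 // plain vector splat in IR
    }

    int32x4_t dup_m(int32x4_t inactive, int32_t x, mve_pred16_t p) {
        return vdupq_m_n_s32(inactive, x, p);  // select(p, splat(x), inactive)
    }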

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73356
2020-02-03 11:20:06 +00:00
Simon Tatham 4321c6af28 [ARM,MVE] Support immediate vbicq,vorrq,vmvnq intrinsics.
Summary:
Immediate vmvnq is code-generated as a simple vector constant in IR,
and left to the backend to recognize that it can be created with an
MVE VMVN instruction. The predicated version is represented as a
select between the input and the same constant, and I've added a
Tablegen isel rule to turn that into a predicated VMVN. (That should
be better than the previous VMVN + VPSEL: it's the same number of
instructions but now it can fold into an adjacent VPT block.)

The unpredicated forms of VBIC and VORR are done by enabling the same
isel lowering as for NEON, recognizing appropriate immediates and
rewriting them as ARMISD::VBICIMM / ARMISD::VORRIMM SDNodes, which I
then instruction-select into the right MVE instructions (now that I've
also reworked those instructions to use the same MC operand encoding).
In order to do that, I had to promote the Tablegen SDNode instance
`NEONvorrImm` to a general `ARMvorrImm` available in MVE as well, and
similarly for `NEONvbicImm`.

The predicated forms of VBIC and VORR are represented as a vector
select between the original input vector and the output of the
unpredicated operation. The main convenience of this is that it still
lets me use the existing isel lowering for VBICIMM/VORRIMM, and not
have to write another copy of the operand encoding translation code.

This intrinsic family is the first to use the `imm_simd` system I put
into the MveEmitter tablegen backend. So, naturally, it showed up a
bug or two (emitting bogus range checks and the like). Fixed those,
and added a full set of tests for the permissible immediates in the
existing Sema test.

Also adjusted the isel pattern for `vmovlb.u8`, which stopped matching
because lowering started turning its input into a VBICIMM. Now it
recognizes the VBICIMM instead.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72934
2020-01-23 11:53:52 +00:00
Simon Tatham 772e493193 [ARM,MVE] Revise immediate VBIC/VORR to look more like NEON.
Summary:
In NEON, the immediate forms of VBIC and VORR are each represented as
a single MC instruction, which takes its immediate operand already
encoded in a NEON-friendly format: 8 data bits, plus some control bits
indicating how to expand them into a full vector.

In MVE, we represented immediate VBIC and VORR as four separate MC
instructions each, for an 8-bit immediate shifted left by 0, 8, 16 or
24 bits. For each one, the value of the immediate operand is in the
'natural' form, i.e. the numerical value that would actually be BICed
or ORRed into each vector lane (and also the same value shown in
assembly). For example, MVE_VBICIZ16v4i32 takes an operand such as
0xab0000, which NEON would represent as 0xab | (control bits << 8).

The MVE approach is superficially nice (it makes assembly input and
output easy, and it's also nice if you're manually constructing
immediate VBICs). But it turns out that it's better for isel if we
make the NEON and MVE instructions work the same, because the
ARMISD::VBICIMM and VORRIMM node types already encode their immediate
into the NEON format, so it's easier if we can just use it.

Also, this commit reduces the total amount of code rather than
increasing it, which is surely an indication that it really is simpler
to do it this way!

Reviewers: dmgreen, ostannard, miyuki, MarkMurrayARM

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73205
2020-01-23 11:53:52 +00:00
David Green 58991ba773 [ARM] Mark MVE loads/store as not having side effects
The hasSideEffect parameter is usually automatically inferred from
instruction patterns. For some of our MVE instructions, we do not have
patterns though, such as for the pre/post inc loads and stores. This
instead specifies the flag manually on the base MVE_VLDRSTR_base
tablegen class, making sure we get this correct.

This can help with scheduling multiple loads more optimally. Here I've
added a unittest as a more direct form of testing.

Differential Revision: https://reviews.llvm.org/D73117
2020-01-22 17:56:55 +00:00
Mark Murray b10a0eb04a [ARM][MVE][Intrinsics] Take abs() of VMINNMAQ, VMAXNMAQ intrinsics' first arguments.
Summary: Fix VMINNMAQ, VMAXNMAQ intrinsics; BOTH arguments have the absolute values taken.

Reviewers: dmgreen, simon_tatham

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72830
2020-01-20 14:33:26 +00:00
Simon Tatham f3e73e88fd [ARM,MVE] Fix confusing MC names for MVE VMINA/VMAXA insns.
Summary:
A recent commit accidentally defined names like `MVE_VMAXAs8` as
instances of the multiclass `MVE_VMINA`, and vice versa. This has no
effect on the test suite, because nothing directly refers to those
instruction names (the isel patterns are generated in Tablegen using
`!cast<Instruction>(NAME)` inside a lower-level multiclass). But it
means that `llvm-mc -show-inst` was listing VMAXA as VMINA, and it
would also affect any further draft code gen patches that use those
instruction ids.

Reviewers: MarkMurrayARM, dmgreen, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73034
2020-01-20 13:25:52 +00:00
David Green ff2e67a4f7 [ARM] MVE VLDn postinc
This adds Post inc variants of the VLD2/4 and VST2/4 instructions in
MVE. It uses the same mechanism/nodes as Neon, transforming the
intrinsic+add pair into a ARMISD::VLD2_UPD, which gets selected to a
post-inc instruction. The code to do that is mostly taken from the
existing Neon code, but simplified as less variants are needed.

It also fills in some getTgtMemIntrinsic for the arm.mve.vld2/4
intrinsics, which allow the nodes to have MMOs, calculated as the full
length of the memory being loaded/stored.

Differential Revision: https://reviews.llvm.org/D71194
2020-01-20 06:57:07 +00:00
Mark Murray da9d57d2c2 [ARM][MVE][Intrinsics] Add VMINAQ, VMINNMAQ, VMAXAQ, VMAXNMAQ intrinsics.
Summary: Add VMINAQ, VMINNMAQ, VMAXAQ, VMAXNMAQ intrinsics and unit tests.

Reviewers: simon_tatham, miyuki, dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72761
2020-01-15 17:20:15 +00:00
Simon Tatham 71d5454b37 [ARM,MVE] Use the new Tablegen `defvar` and `if` statements.
Summary:
This cleans up a lot of ugly `foreach` bodges that I've been using to
work around the lack of those two language features. Now they both
exist, I can make then all into something more legible!

In particular, in the common pattern in `ARMInstrMVE.td` where a
multiclass defines an `Instruction` instance plus one or more `Pat` that
select it, I've used a `defvar` to wrap `!cast<Instruction>(NAME)` so
that the patterns themselves become a little more legible.

Replacing a `foreach` with a `defvar` removes a level of block
structure, so several pieces of code have their indentation changed by
this patch. Best viewed with whitespace ignored.

NFC: the output of `llvm-tblgen -print-records` on the two affected
Tablegen sources is exactly identical before and after this change, so
there should be no effect at all on any of the other generated files.

Reviewers: MarkMurrayARM, miyuki

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, dmgreen, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72690
2020-01-14 12:08:03 +00:00
Sam Parker e73b20c57d [ARM][MVE] Disallow VPSEL for tail predication
Due to the current way that we collect predicated instructions, we
can't easily handle vpsel in tail predicated loops. There are a
couple of issues:
1) It will use the VPR as a predicate operand, but doesn't have to be
   inside a VPT block, which means we can assert while building up
   the VPT block because we don't find another VPST to begin a new
   one.
2) VPSEL still requires a VPR operand even after tail predicating,
   which means we can't remove it unless there is another
   instruction, such as vcmp, that can provide the VPR def.

The first issue should be a relatively simple fix in the logic of the
LowOverheadLoops pass, whereas the second will require us to
represent the 'implicit' tail predication with an explicit value.

Differential Revision: https://reviews.llvm.org/D72629
2020-01-14 11:41:17 +00:00
Sam Parker 3772ea9dd9 [ARM][MVE] Tail predicate VMAX,VMAXA,VMIN,VMINA
Add the MVE min and max instructions to our tail predication
whitelist.

Differential Revision: https://reviews.llvm.org/D72502
2020-01-10 14:24:25 +00:00
Simon Tatham dac7b23cc3 [ARM,MVE] Intrinsics for variable shift instructions.
This batch of intrinsics fills in all the shift instructions that take
a variable shift distance in a register, instead of an immediate. Some
of these instructions take a single shift distance in a scalar
register and apply it to all lanes; others take a vector of per-lane
distances.

These instructions are all basically one family, varying in whether
they saturate out-of-range values, and whether they round when bits
are shifted off the bottom. I've implemented them at the IR level by a
much smaller family of IR intrinsics, which take flag parameters to
indicate saturating and/or rounding (along with the usual one to
specify signed/unsigned integers).

An oddity is that all of them are //left// shift instructions – but if
you pass a negative shift count, they'll shift right. So the vector
shift distances are always vectors of //signed// integers, regardless
of whether you're considering the other input vector to be signed
or unsigned. Also, even the simplest `vshlq` instruction in this
family (neither saturating nor rounding) has to be implemented as an
IR intrinsic, because the ordinary LLVM IR `shl` operation would
consider an out-of-range shift count to be undefined behavior.
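A hedged usage sketch of the simplest member of the family (not from
the patch):

    #include <arm_mve.h>

    // Per-lane shift by a vector of signed counts: positive counts
    // shift left, negative counts shift right.
    uint32x4_t shift(uint32x4_t x, int32x4_t counts) {
        return vshlq_u32(x, counts);
    }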

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72329
2020-01-08 14:42:24 +00:00
Simon Tatham 3100480925 [ARM,MVE] Intrinsics for partial-overwrite imm shifts.
This batch of intrinsics covers two sets of immediate shift
instructions, which have in common that they only overwrite part of
their output register and so they need an extra input giving its
previous value.

The VSLI and VSRI instructions shift each lane of the input vector
left or right just as if they were normal immediate VSHL/VSHR, but
then they only overwrite the output bits that correspond to actual
shifted bits of the input. So VSLI will leave the low n bits of each
output lane unchanged, and VSRI the same with the top n bits.

The V[Q][R]SHR[U]N family are all narrowing shifts: they take an input
vector of 2n-bit integers, shift each lane right by a constant, and
then narrow the shifted result to only n bits. So they only
overwrite half of the n-bit lanes in the output register, and the B/T
suffix indicates whether it's the bottom or top half of each 2n-bit
lane.
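A hedged usage sketch of one instruction from each set; the argument
order, with the partially-preserved value first, is my assumption from
the usual ACLE pattern:

    #include <arm_mve.h>

    // VSLI: shift 'b' left by 4 and insert, keeping the low 4 bits of
    // each lane of 'a'.
    uint32x4_t sli(uint32x4_t a, uint32x4_t b) { return vsliq_n_u32(a, b, 4); }

    // VSHRNB: shift each i32 lane of 'b' right by 8, narrow to i16, and
    // write the even lanes of 'a'.
    int16x8_t shrnb(int16x8_t a, int32x4_t b)  { return vshrnbq_n_s32(a, b, 8); }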

I've implemented the whole of the latter family using a single IR
intrinsic `vshrn`, which takes a lot of i32 parameters indicating
which instruction it expands to (by specifying signedness of the input
and output types, whether it saturates and/or rounds, etc).

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72328
2020-01-08 14:42:24 +00:00
Simon Tatham 34817e04fe [ARM,MVE] Fix many signedness errors in MVE intrinsics.
Summary:
Running an end-to-end test last week I noticed that a lot of the ACLE
intrinsics that operate differently on vectors of signed and unsigned
integers were ending up generating the signed version of the
instruction unconditionally. This is because the IR intrinsics had no
way to distinguish signed from unsigned: the LLVM type system just
calls them both `v8i16` (or whatever), so you need either separate
intrinsics for signed and unsigned, or a flag parameter that tells
ISel which one to choose.

This patch fixes all the problems of that kind that I've noticed, by
adding an i32 flag parameter to many of the IR intrinsics which is set
to 1 for unsigned (matching the existing practice in cases where we
got it right), and conditioning all the isel patterns on that flag. So
the fundamental change is in `IntrinsicsARM.td`, changing the
low-level IR intrinsics API; there are knock-on changes in
`arm_mve.td` (adjusting code gen for the ACLE intrinsics to use the
modified API) and in `ARMInstrMVE.td` (adjusting isel to expect the
new unsigned flags). The rest of this patch is boringly updating tests.

Reviewers: dmgreen, miyuki, MarkMurrayARM

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D72270
2020-01-06 16:33:16 +00:00
Simon Tatham b99ef32d04 [ARM,MVE] Generate the right instruction for vmaxnmq_m_f16.
Summary:
Due to a copy-paste error in the isel patterns, the predicated version
of this intrinsic was expanding to the `VMAXNMT.F32` instruction
instead of `VMAXNMT.F16`. Similarly for vminnm.

Reviewers: dmgreen, miyuki, MarkMurrayARM

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72269
2020-01-06 16:28:20 +00:00
Sam Parker acbc9aed72 [ARM][MVE] Fixes for tail predication.
1) Fix an issue with the incorrect value being used for the number of
   elements being passed to [d|w]lstp. We were trying to check that
   the value was available at LoopStart, but this doesn't consider
   that the last instruction in the block could also define the
   register. Two helpers have been added to RDA for this.
2) Insert some code to now try to move the element count def or the
   insertion point so that we can perform more tail predication.
3) Related to (1), the same off-by-one could prevent us from
   generating a low-overhead loop when a mov lr could have been
   the last instruction in the block.
4) Fix up some instruction attributes so that not all the
   low-overhead loop instructions are labelled as branches and
   terminators - as this is not true for dls/dlstp.

Differential Revision: https://reviews.llvm.org/D71609
2019-12-20 09:34:18 +00:00
Sam Parker 4f0fe6b97e [ARM][MVE] Tail predicate bottom/top muls.
Add VMULL and VQDMULL variants to our tail predication white list.

Differential Revision: https://reviews.llvm.org/D71465
2019-12-20 08:33:01 +00:00
Sam Parker 84593f058b [ARM][MVE] Make VPT invalid for tail predication
We've been marking VPT incompatible instructions as invalid for tail
predication too, though this may not strictly be true. VPT are
incompatible and, unless it's the first predicate def in a loop,
they shouldn't be compatible for tail predication either.

Differential Revision: https://reviews.llvm.org/D71410
2019-12-13 15:01:08 +00:00
Mikhail Maltsev 99581fd4c8 [ARM][MVE] Add vector reduction intrinsics with two vector operands
Summary:
This patch adds intrinsics for the following MVE instructions:
* VABAV
* VMLADAV, VMLSDAV
* VMLALDAV, VMLSLDAV
* VRMLALDAVH, VRMLSLDAVH

Each of the above 4 groups has a corresponding new LLVM IR intrinsic,
since the instructions cannot be easily represented using
general-purpose IR operations.

Reviewers: simon_tatham, ostannard, dmgreen, MarkMurrayARM

Reviewed By: MarkMurrayARM

Subscribers: merge_guards_bot, kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D71062
2019-12-13 13:17:29 +00:00
Simon Tatham 25305a9311 [ARM][MVE] Add intrinsics for more immediate shifts.
Summary:
This fills in the remaining shift operations that take a single vector
input and an immediate shift count: the `vqshl`, `vqshlu`, `vrshr` and
`vshll[bt]` families.

`vshll[bt]` (which shifts each input lane left into a double-width
output lane) is the most interesting one. There are separate MC
instruction ids for shifting by exactly the input lane width and
shifting by less than that, because the instruction encoding is so
completely different for the lane-width special case. So I had to
write two sets of patterns to match based on the immediate shift
count, which involved adding a ComplexPattern matcher to avoid the
general-case pattern accidentally matching the special case too. For
that family I've made sure to add an llc codegen test for both
versions of each instruction.

I'm experimenting with a new strategy for parametrising the isel
patterns for all these instructions: adding extra fields to the
relevant `Instruction` subclass itself, which are ignored by the
Tablegen backends that generate the MC data, but can be retrieved from
each instance of that instruction subclass when it's passed as a
template parameter to the multiclass that generates its isel patterns.
A nice effect of that is that I can fill in those informational fields
using `let` blocks, rather than having to type them out once per
instruction at `defm` time.

(As a result, quite a lot of existing instruction `def`s are
reindented by this patch, so it's clearer to read with whitespace
changes ignored.)

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: MarkMurrayARM

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D71458
2019-12-13 13:07:39 +00:00
Mark Murray 228c74076d [ARM][MVE][Intrinsics] Add *_x() variants of my *_m() intrinsics.
Summary:
Better use of multiclass is used, and this helped find some existing
bugs in the predicated VMULL* intrinsics, which are now fixed.

The refactored VMULL[TB]Q_(INT|POLY)_M() intrinsics were discovered
to have an argument ("inactive") with incorrect type, and this required
a fix that is included in this whole patch. The argument "inactive"
should have been the same width (per vector element) as the return
type of the intrinsic, but was not in the case where the return type
was double the element width of the input types.

To assist in testing the multiclassing, and to thwart further gremlins,
the unit tests are improved in scope.

The *.ll tests are all generated by a small bit of throw-away scripting
from the corresponding *.c tests, and as such the diffs are large and
nasty. Look at the file rather than the diff.

Reviewers: dmgreen, miyuki, ostannard, simon_tatham

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D71421
2019-12-13 11:51:23 +00:00