This is the groundwork for adding the Armv8.2-A FP16 vector intrinsics, which
use v4f16 and v8f16 vector operands and return values. All the moving parts
are tested with two intrinsics, a 1-operand v8f16 and a 2-operand v4f16
intrinsic. In a follow-up patch the rest of the intrinsics and tests will be
added.
Differential Revision: https://reviews.llvm.org/D44538
llvm-svn: 327839
r327171 "Improve Dependency analysis when doing multi-node Instruction Selection"
r328170 "[DAG] Enforce stricter NodeId invariant during Instruction selection"
Reverting these patches, as the NodeId invariant change is causing pathological
increases in compile time on PPC.
llvm-svn: 327197
Instruction Selection makes use of the topological ordering of nodes
by node id (a node's operands have smaller node id than it) when doing
cycle detection. During selection we may violate this property as a
selection of multiple nodes may induce a use dependence (and thus a
node id restriction) between two unrelated nodes. If a selected node
has an unselected successor, this may cause us to miss a cycle when
detecting an invalid selection.
This patch fixes this by marking all unselected successors of a
selected node with a negated node id. We avoid pruning on such negative
ids, but can still reconstruct the original id for pruning.
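A minimal sketch of the negated-id bookkeeping described above (the helper
names and the exact encoding here are illustrative, not the SelectionDAG API):

  // Encode "id temporarily restricted by a pending selection" by negating it:
  // ordinary pruning compares non-negative ids and skips such nodes, while
  // the original id stays recoverable.
  struct NodeIds {
    // Hypothetical helper: applied to each unselected successor of a node
    // that has just been selected.
    static void invalidate(int &Id) {
      if (Id >= 0)
        Id = -Id - 1; // "-Id - 1" keeps id 0 representable as a negative value
    }
    // Hypothetical helper: recover the original topological id when a check
    // really needs it.
    static int original(int Id) { return Id < 0 ? -Id - 1 : Id; }
  };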
In-tree targets have been updated to replace DAG-level replacements
with ISel-level ones which enforce this property.
This preemptively fixes PR36312 before the triggering commit (r324359) relands.
Reviewers: craig.topper, bogner, jyknight
Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D43198
llvm-svn: 327170
Code generation of VLD3, VLD4, VST3 and VST4 with register writeback is
broken due to 2 separate bugs:
1) VLD1d64TPseudoWB_register and VLD1d64QPseudoWB_register are missing
rules to expand them to non-pseudo MIR. These are selected for
ARMISD::VLD3_UPD/VLD4_UPD with v1i64 vectors in SelectVLD.
2) Selection of the right VLD/VST instruction is broken for loads and
stores of 3 and 4 v1i64 vectors. SelectVLD and SelectVST are called
with the MIR opcode for fixed writeback (i.e. the increment equals the
access size) and call getVLDSTRegisterUpdateOpcode() to select an opcode
with register writeback if the base register update is of a different size.
Since getVLDSTRegisterUpdateOpcode() only knows about
VLD1/VLD2/VST1/VST2, the call is currently conditional on the number
of elements in the vector.
However, VLD1/VST1 is selected by SelectVLD/SelectVST's caller for
loads and stores of 3 or 4 v1i64 vectors. Therefore the opcode is not
updated, which later leads to a fixed-writeback instruction being
constructed with an extra operand for the register writeback.
This patch addresses the two issues as follows:
- it adds the necessary mapping from VLD1d64TPseudoWB_register and
VLD1d64QPseudoWB_register to VLD1d64Twb_register and
VLD1d64Qwb_register respectively. Like for the existing _fixed
variants, the cost of these is bumped for unaligned access.
- it changes the logic in SelectVLD and SelectVST to call isVLDfixed
and isVSTfixed respectively to decide whether the opcode should be
updated. It also reworks the logic and comments for pushing the
writeback offset operand and the r0 operand to clarify the logic (see
the sketch after this list): the writeback offset needs to be pushed
for a register writeback, and r0 needs to be pushed if it is not a
register writeback and the instruction is a VLD1/VLD2/VST1/VST2.
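A rough, stand-alone sketch of the decision in the last bullet (the stub
opcodes and helpers below only mirror the names used above; they are not the
real ARMISelDAGToDAG code):

  #include <vector>

  enum Opc { VLD1d64Twb_fixed, VLD1d64Twb_register, SomeOtherOpc };

  static bool isVLDfixed(Opc O) { return O == VLD1d64Twb_fixed; }
  static bool isVSTfixed(Opc O) { return false; } // no VST opcodes in this toy
  static Opc getVLDSTRegisterUpdateOpcode(Opc O) {
    return O == VLD1d64Twb_fixed ? VLD1d64Twb_register : O;
  }

  // IsRegisterUpdate: the base-register increment differs from the access size.
  Opc selectWriteback(Opc O, bool IsRegisterUpdate,
                      std::vector<const char *> &Ops) {
    // Only VLD1/VLD2/VST1/VST2 fixed-writeback opcodes have a register twin.
    if (IsRegisterUpdate && (isVLDfixed(O) || isVSTfixed(O)))
      O = getVLDSTRegisterUpdateOpcode(O);

    if (IsRegisterUpdate)
      Ops.push_back("Rm"); // push the writeback offset register
    else if (isVLDfixed(O) || isVSTfixed(O))
      Ops.push_back("r0"); // fixed writeback on VLD1/VLD2/VST1/VST2 pushes r0
    return O;
  }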
Reviewers: rengolin, t.p.northover, samparker
Reviewed By: samparker
Patch by Thomas Preud'homme <thomas.preudhomme@arm.com>
Differential Revision: https://reviews.llvm.org/D42970
llvm-svn: 326570
This is the groundwork for Armv8.2-A FP16 code generation.
Clang passes and returns _Float16 values as floats, together with the required
bitconverts and truncs etc. to implement correct AAPCS behaviour, see D42318.
We will implement half-precision argument passing/returning lowering in the ARM
backend soon, but for now this means that this:
  _Float16 sub(_Float16 a, _Float16 b) {
    return a + b;
  }
gets lowered to this:
  define float @sub(float %a.coerce, float %b.coerce) {
  entry:
    %0 = bitcast float %a.coerce to i32
    %tmp.0.extract.trunc = trunc i32 %0 to i16
    %1 = bitcast i16 %tmp.0.extract.trunc to half
    <SNIP>
    %add = fadd half %1, %3
    <SNIP>
  }
When FullFP16 is *not* supported, we don't make f16 a legal type, and we get
legalization for "free", i.e. nothing changes and everything works as before.
This also covers f16 argument passing and returning.
When FullFP16 is supported, we do make f16 a legal type, and have 2 places that
we need to patch up: f16 argument passing and returning, which involves minor
tweaks to avoid unnecessary code generation for some bitcasts.
As a "demonstrator" that this works for the different FP16, FullFP16, softfp
modes, etc., I've added match rules to the VSUB instruction description showing
that we can codegen this instruction from IR, but more importantly, also to
some conversion instructions. These conversions were causing issues before in
the FP16 and FullFP16 cases.
I've also added match rules to the VLDRH and VSTRH descriptions, so that we can
actually compile the entire half-precision sub code example above. This showed
that these loads and stores had the wrong addressing mode specified: AddrMode5
instead of AddrMode5FP16, which turned out not to be implemented at all, so that
has also been added.
This is the minimal patch that shows all the different moving parts. In patch
2/3 I will add some efficient lowering of bitcasts, and in patch 3/3 I will add
the remaining Armv8.2-A FP16 instruction descriptions.
Thanks to Sam Parker and Oliver Stannard for their help and reviews!
Differential Revision: https://reviews.llvm.org/D38315
llvm-svn: 323512
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
This is a refactoring/cleanup of the Arm `addrmode2` operand class. The patch
removes it completely.
Differential Revision: https://reviews.llvm.org/D39832
llvm-svn: 318291
Moves encoding (SYSm) information of banked registers to ARMSystemRegister.td,
where it rightly belongs and forms a single point of reference in the code.
Reviewed by: @fhahn, @rovka, @olista01
Differential Revision: https://reviews.llvm.org/D36219
llvm-svn: 309910
This patch cleans up and fixes issues in the M-Class system register handling:
1. It defines the system registers and the encoding (SYSm values) in one place:
a new ARMSystemRegister.td using SearchableTable, thereby removing the
hand-coded values which existed in multiple places.
2. Some system registers that do not exist, e.g. BASEPRI_MAX_NS, were being allowed!
Ref: ARMv6/7/8M architecture reference manual.
Reviewed by: @t.p.northover, @olista01, @john.brawn
Differential Revision: https://reviews.llvm.org/D35209
llvm-svn: 308456
Constants are crucial for code size in the ARM Thumb-1 instruction
set. The 16 bit instruction size often does not offer enough space
for immediate arguments. This means that additional instructions are
frequently used to load constants into registers. Since constants are
hoisted, this can lead to significant register spillage if they are
used multiple times in a single function. This can be avoided by
rematerialization, i.e. recomputing a constant instead of reloading
it from the stack. This patch fixes the rematerialization of literal
pool loads in the ARM Thumb instruction set.
Patch by Philip Ginsbach
Differential Revision: https://reviews.llvm.org/D33936
llvm-svn: 308004
When we replace the multiplicand, the destination node might already exist.
When that happens, the original gets CSEd and deleted. However, it's actually
used as the offset, so nonsense is produced.
Should fix PR32726.
llvm-svn: 301983
Before, we assumed that any ConstantInt offset was precisely the access width,
so we could use the "[rN]!" form. ISelLowering only ever created that kind, but
further simplification during combining could lead to unexpected constants and
incorrect codegen.
Should fix PR32658.
llvm-svn: 300878
These will become ASan errors once the patch that poisons the memory after
free lands. The x86 change is a hack, but I don't see how to
solve this properly at the moment.
llvm-svn: 300867
Create nodes for smulwb and smulwt and move their selection from
DAGToDAG to DAG combine. smlawb and smlawt can then be selected
using tablegen. Added some helper functions to detect shift patterns
as well as a wrapper around SimplifyDemandedBits. Added a couple of
extra tests.
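For reference, a plain C++ model of what these instructions compute (an
editorial illustration, not code from the patch):

  #include <cstdint>

  // SMULWB: multiply a 32-bit value by the sign-extended bottom halfword of b
  // and keep the top 32 bits of the 48-bit product.
  int32_t smulwb(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int16_t)((uint32_t)b & 0xFFFF)) >> 16);
  }

  // SMULWT: the same, but using the top halfword of b.
  int32_t smulwt(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int16_t)((uint32_t)b >> 16)) >> 16);
  }

  // SMLAWB/SMLAWT additionally add a 32-bit accumulator.
  int32_t smlawb(int32_t a, int32_t b, int32_t acc) { return smulwb(a, b) + acc; }
  int32_t smlawt(int32_t a, int32_t b, int32_t acc) { return smulwt(a, b) + acc; }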
Differential Revision: https://reviews.llvm.org/D30708
llvm-svn: 297716
Perform the combines on ARMISD::ADD[CE] nodes, instead of the generic ISD::ADD[CE].
Summary:
This allows for some simplification because the combines
are no longer limited to just one go at the node before
it gets legalized into an ARM target-specific one.
Reviewers: jmolloy, rogfer01
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D30401
llvm-svn: 297453
In the encoding of system registers in the M-class MSR instruction, the mask bits
should be 2 for registers that don't take a _<bits> qualifier (the instruction
is unpredictable otherwise), and should also be 2 if the register takes a
_<bits> qualifier but it is not present, since omitting _<bits> is an alias for _nzcvq.
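Restated as a small sketch (parameter names are illustrative; only the cases
named above are encoded):

  // Mask bits for the M-class MSR encoding, following the rule above.
  unsigned msrMaskBits(bool RegTakesBitsQualifier, bool QualifierPresent,
                       unsigned QualifierMask) {
    if (!RegTakesBitsQualifier)
      return 2;           // any other value is unpredictable
    if (!QualifierPresent)
      return 2;           // omitted _<bits> is an alias for _nzcvq
    return QualifierMask; // otherwise encode the qualifier that was written
  }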
Differential Revision: https://reviews.llvm.org/D29828
llvm-svn: 294762
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.
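For context, this is the kind of source pattern involved, written with the
standard NEON intrinsic (an illustrative example, not a test case from the
patch):

  #include <arm_neon.h>

  // Load one byte and duplicate it across all lanes; together with a pointer
  // increment this should ideally become a post-incremented VLD1 (all lanes).
  uint8x8_t load_splat_u8(const uint8_t *p) {
    return vld1_dup_u8(p);
  }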
Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).
Recommiting with fix to avoid forming vld1dup if the type of the load
doesn't match the type of the vdup (see
https://llvm.org/bugs/show_bug.cgi?id=31404).
Differential Revision: https://reviews.llvm.org/D27694
llvm-svn: 289972
This is essentially a recommit of r285893, but with a correctness fix. The
problem of the original commit was that this:
  bic r5, r7, #31
  cbz r5, .LBB2_10
got rewritten into:
  lsrs r5, r7, #5
  beq  .LBB2_10
The result in destination register r5 is not the same, and this is incorrect
when r5 is not dead. This fix therefore checks the uses of the AND's
destination register. With this extra check, some regression tests from the
original commit no longer needed changing.
For completeness, this was the original commit message:
For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more
efficient instruction selection if the bitmask is one consecutive sequence of
set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).
1) If the bitmask touches the LSB, then we can remove all the upper bits and
set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and
set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit
into the sign bit with one LSLS and change the condition query from NE/EQ to
MI/PL (we could also implement this by shifting into the carry bit and
branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower
zero bits of the mask.
1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two
16-bit instructions but can elide the CMP and doesn't require materializing a
complex immediate, so is also a win.
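To make the condition concrete, here is a small stand-alone sketch (C++20) of
the contiguous-mask test and the case split above; it is an editorial
illustration, not the backend code:

  #include <bit>
  #include <cstdint>

  // True when the set bits of BM form one unbroken run, i.e.
  // 32 - clz(bm) - ctz(bm) == popcount(bm).
  bool isContiguousMask(uint32_t BM) {
    return BM != 0 &&
           32 - std::countl_zero(BM) - std::countr_zero(BM) == std::popcount(BM);
  }

  // Which of the cases 1-4 above applies (0 means the combine does not fire).
  int cmpzAndCase(uint32_t BM) {
    if (!isContiguousMask(BM))
      return 0;
    if (BM & 1u)                // touches the LSB: one LSLS
      return 1;
    if (BM & 0x80000000u)       // touches the MSB: one LSRS
      return 2;
    if (std::popcount(BM) == 1) // single bit: LSLS into the sign bit, MI/PL
      return 3;
    return 4;                   // interior run: LSLS + LSRS pair
  }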
Differential Revision: https://reviews.llvm.org/D27761
llvm-svn: 289794
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.
Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).
Differential Revision: https://reviews.llvm.org/D27694
llvm-svn: 289703
This recommits r281323, which was backed out for two reasons. One, a selfhost failure, and two, it apparently caused Chromium failures. Actually, the latter was a red herring. The log has expired from the former, but I suspect that was a red herring too (actually caused by another problematic patch of mine). Therefore reapplying, and will watch the bots like a hawk.
For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more efficient instruction selection if the bitmask is one consecutive sequence of set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).
1) If the bitmask touches the LSB, then we can remove all the upper bits and set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit into the sign bit with one LSLS and change the condition query from NE/EQ to MI/PL (we could also implement this by shifting into the carry bit and branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower zero bits of the mask.
1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two 16-bit instructions but can elide the CMP and doesn't require materializing a complex immediate, so is also a win.
llvm-svn: 285893
UMAAL is a DSP instruction and it is not available on thumbv7m
(Cortex-M3) and thumbv6m (Cortex-M0+1) targets. Also fix a wrong
CHECK prefix in the longMAC.ll test.
Patch by Vadzim Dambrouski.
Differential Revision: https://reviews.llvm.org/D25890
llvm-svn: 285278
This is not a valid encoding - these instructions cannot do PC-relative addressing.
The underlying problem here is a whitelist in ARMISelDAGToDAG that unwraps ARMISD::Wrappers during addressing-mode selection. It didn't realise TargetConstantPool was actually possible, so didn't handle it.
llvm-svn: 283323
ldm and stm instructions always require 4-byte alignment on the pointer, but we
weren't checking this before trying to reduce code-size by replacing a
post-indexed load/store with them. Unfortunately, we were also dropping this
information in DAG ISel, but that's easy enough to fix.
llvm-svn: 281893
For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more efficient instruction selection if the bitmask is one consecutive sequence of set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).
1) If the bitmask touches the LSB, then we can remove all the upper bits and set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit into the sign bit with one LSLS and change the condition query from NE/EQ to MI/PL (we could also implement this by shifting into the carry bit and branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower zero bits of the mask.
1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two 16-bit instructions but can elide the CMP and doesn't require materializing a complex immediate, so is also a win.
llvm-svn: 281323
For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more efficient instruction selection if the bitmask is one consecutive sequence of set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).
1) If the bitmask touches the LSB, then we can remove all the upper bits and set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit into the sign bit with one LSLS and change the condition query from NE/EQ to MI/PL (we could also implement this by shifting into the carry bit and branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower zero bits of the mask.
1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two 16-bit instructions but can elide the CMP and doesn't require materializing a complex immediate, so is also a win.
llvm-svn: 281215
The CMPZ #0 disappears during peepholing, leaving just a tADDi3, tADDi8 or t2ADDri. This avoids having to materialize the expensive negative constant in Thumb-1, and allows a shrinking from a 32-bit CMN to a 16-bit ADDS in Thumb-2.
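An illustrative source pattern (an assumption about the shape this targets, not
a test from the commit):

  // x == -42 is x + 42 == 0, so an ADDS (which sets flags) plus a conditional
  // branch can replace materializing -42 for a CMP, or a 32-bit CMN.
  bool isMinus42(int x) {
    return x == -42;
  }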
llvm-svn: 281040
This is a mechanical change of comments in switches like fallthrough,
fall-through, or fall-thru to use the LLVM_FALLTHROUGH macro instead.
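A representative before/after as a stand-alone snippet (LLVM_FALLTHROUGH is
provided by llvm/Support/Compiler.h):

  #include "llvm/Support/Compiler.h"

  int classify(int Kind) {
    int Score = 0;
    switch (Kind) {
    case 0:
      Score += 1;
      LLVM_FALLTHROUGH; // was a plain "// fall-through" style comment
    case 1:
      Score += 2;
      break;
    default:
      break;
    }
    return Score;
  }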
llvm-svn: 278902
Summary: Thumb2 supports encoding immediates with specific patterns into mov.w by splatting the low 8 bits into other bytes.
I'm resubmitting this patch. The test case in the original commit
r277610 does not specify a triple, so builds with a different default triple
will have different output.
This patch fixes the triple as thumb-darwin-apple.
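For reference, a small sketch of the splat forms of the Thumb-2
modified-immediate encoding referred to in the summary (an illustration; the
rotated-byte form is left out):

  #include <cstdint>

  // A 32-bit value encodable by replicating one byte: 0x00XY00XY, 0xXY00XY00,
  // or 0xXYXYXYXY.
  bool isThumb2SplatImm(uint32_t V) {
    uint32_t Lo = V & 0xFF;
    uint32_t B1 = (V >> 8) & 0xFF;
    return V == Lo * 0x00010001u ||  // 0x00XY00XY
           V == B1 * 0x01000100u ||  // 0xXY00XY00
           V == Lo * 0x01010101u;    // 0xXYXYXYXY
  }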
Reviewers: john.brawn, jmolloy, bruno
Subscribers: jmolloy, aemerson, rengolin, samparker, llvm-commits
Differential Revision: https://reviews.llvm.org/D23090
llvm-svn: 277865
Summary: Thumb2 supports encoding immediates with specific patterns into mov.w by splatting the low 8 bits into other bytes.
Reviewers: john.brawn, jmolloy
Subscribers: jmolloy, aemerson, rengolin, samparker, llvm-commits
Differential Revision: https://reviews.llvm.org/D23090
llvm-svn: 277610
In this particular example we wouldn't want the smmls anyway (the value is
actually unused), but in general smmls does not provide the required flags
register, so if that SUBE result is used we can't replace it.
llvm-svn: 277541
Retry r275776 (no changes, we suspect the issue was with another commit).
The current logic for handling inline asm operands in DAGToDAGISel interprets
the operands by looking for constants, which should represent the flags
describing the kind of operand we're dealing with (immediate, memory, register
def etc). The operands representing actual data are skipped only if they are
non-const, with the exception of immediate operands which are skipped explicitly
when a flag describing an immediate is found.
The oversight is that memory operands may be const too (e.g. for device drivers
reading a fixed address), so we should explicitly skip the operand following a
flag describing a memory operand. If we don't, we risk interpreting that
constant as a flag, which is definitely not intended.
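A small example of the kind of code affected (the address and names are
hypothetical, not the PR26038 reproducer):

  // The "m" operand below describes memory whose address is a constant, so the
  // DAG operand that follows the "memory" flag can itself be a constant and
  // must be skipped rather than interpreted as another flag.
  #define DEVICE_STATUS (*(volatile unsigned *)0x40001000) // hypothetical MMIO register

  unsigned read_device_status() {
    unsigned Value;
    asm volatile("ldr %0, %1" : "=r"(Value) : "m"(DEVICE_STATUS));
    return Value;
  }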
Fixes PR26038
Differential Revision: https://reviews.llvm.org/D22103
llvm-svn: 276101