This patch adds basic support for BFloat in the Arm backend.
For now the code generation relies on fullfp16 being present.
Briefly:
* adds the bfloat scalar and vector types in the necessary register classes,
* adjusts the calling convention to cope with bfloat argument passing and return,
* adds codegen patterns for moves, loads and stores.
It's tested mostly by the intrinsic patches that depend on it (load/store, convert/copy).
The following people contributed to this patch:
* Alexandros Lamprineas
* Ties Stuij
Differential Revision: https://reviews.llvm.org/D81373
Code like the following:
define i32 @foo(i32 %a, i1 zeroext %b) addrspace(1) {
entry:
%conv = zext i1 %b to i32
%add = add nsw i32 %conv, %a
ret i32 %add
}
Would compile to the following (incorrect) code:
foo:
mov r18, r20
clr r19
add r22, r18
adc r23, r19
sbci r24, 0
sbci r25, 0
ret
Those sbci instructions are clearly wrong, they should have been adc
instructions.
This commit improves codegen to use adc instead:
foo:
mov r18, r20
clr r19
ldi r20, 0
ldi r21, 0
add r22, r18
adc r23, r19
adc r24, r20
adc r25, r21
ret
This code is not optimal (it could be just 5 instructions instead of the
current 9) but at least it doesn't miscompile.
Differential Revision: https://reviews.llvm.org/D78439
Since i32 is not legal in riscv64,
it always promoted to i64 before emitting lib call and
for conversions like float/double to int and float/double to unsigned int
wrong lib call was emitted. This commit fix it using custom lowering.
Differential Revision: https://reviews.llvm.org/D80526
There are now quite a few SVE tests in LLVM and Clang that do not
emit warnings related to invalid use of EVT::getVectorNumElements()
and VectorType::getNumElements(). For these tests I have added
additional checks that there are no warnings in order to prevent
any future regressions.
Differential Revision: https://reviews.llvm.org/D80712
Summary:
As half-precision floating point arguments and returns were previously
coerced to either float or int32 by clang's codegen, the CMSE handling
of those was also performed in clang's side by zeroing the unused MSBs
of the coercer values.
This patch moves this handling to the backend's calling convention
lowering, making sure the high bits of the registers used by
half-precision arguments and returns are zeroed.
Reviewers: chill, rjmccall, ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D81428
Summary:
Half-precision floating point arguments and returns are currently
promoted to either float or int32 in clang's CodeGen and there's
no existing support for the lowering of `half` arguments and returns
from IR in AArch32's backend.
Such frontend coercions, implemented as coercion through memory
in clang, can cause a series of issues in argument lowering, as causing
arguments to be stored on the wrong bits on big-endian architectures
and incurring in missing overflow detections in the return of certain
functions.
This patch introduces the handling of half-precision arguments and returns in
the backend using the actual "half" type on the IR. Using the "half"
type the backend is able to properly enforce the AAPCS' directions for
those arguments, making sure they are stored on the proper bits of the
registers and performing the necessary floating point convertions.
Reviewers: rjmccall, olista01, asl, efriedma, ostannard, SjoerdMeijer
Reviewed By: ostannard
Subscribers: stuij, hiraditya, dmgreen, llvm-commits, chill, dnsampaio, danielkiss, kristof.beyls, cfe-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75169
There are now quite a few SVE tests in LLVM and Clang that do not
emit warnings related to invalid use of EVT::getVectorNumElements()
and VectorType::getNumElements(). For these tests I have added
additional checks that there are no warnings in order to prevent
any future regressions.
Differential Revision: https://reviews.llvm.org/D80712
This also enables running the AArch64 SLSHardening pass with GlobalISel,
so add a test for that.
Differential Revision: https://reviews.llvm.org/D81403
The enum values for AArch64 registers are not all consecutive.
Therefore, the computation
"__llvm_slsblr_thunk_x" + utostr(Reg - AArch64::X0)
is not always correct. utostr(Reg - AArch64::X0) will not generate the
expected string for the registers that do not have consecutive values in
the enum.
This happened to work for most registers, but does not for AArch64::FP
(i.e. register X29).
This can get triggered when the X29 is not used as a frame pointer.
Differential Revision: https://reviews.llvm.org/D81997
Summary:
For PPC BinaryOperator of fp128 will become libcall, we shouldn't
convert loop to CTR loop if the loop contain libCall.
But currently, in the PPCTTIImpl::mightUseCTR() function, we only deal
with BinaryOperator for ppc_fp128, don't deal with the fp128.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D81353
Summary: A bug is reported in bugzilla-45628, where the swap_with_shift case can’t be matched to a single HW instruction xxswapd as expected.
In fact the case matches the idiom of rotate. We have MatchRotate to handle an ‘or’ of two operands and generate a rot[lr] if the case matches the idiom of rotate. While PPC doesn’t support ROTL v1i128. We can custom lower ROTL v1i128 to the vector_shuffle. The vector_shuffle will be matched to a single HW instruction during the phase of instruction selection.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D81076
It seems to be a hardware defect that the half inline constants do not
work as expected for the 16-bit integer operations (the inverse does
work correctly). Experimentation seems to show these are really
reading the 32-bit inline constants, which can be observed by writing
inline asm using op_sel to see what's in the high half of the
constant. Theoretically we could fold the high halves of the 32-bit
constants using op_sel.
The *_asm_all.s MC tests are broken, and I don't know where the script
to autogenerate these are. I started manually fixing it, but there's
just too many cases to fix. This also does break the
assembler/disassembler support for these values, and I'm not sure what
to do about it. These are still valid encodings, so it seems like you
should be able to use them in some way. If you wrote assembly using
them, you could have really meant it (perhaps to read the high bits
with op_sel?). The disassembler will print the invalid literal
constant which will fail to re-assemble. The behavior is also
different depending on the use context. Consider this example, which
was previously accepted and encoded using the inline constant:
v_mad_i16 v5, v1, -4.0, v3
; encoding: [0x05,0x00,0xec,0xd1,0x01,0xef,0x0d,0x04]
In contexts where an inline immediate is required (such as on gfx8/9),
this will now be rejected. For gfx10, this will produce the literal
encoding and change the printed format:
v_mad_i16 v5, v1, 0xc400, v3
; encoding: [0x05,0x00,0x5e,0xd7,0x01,0xff,0x0d,0x04,0x00,0xc4,0x00,0x00]
This is just another variation of the issue that we don't perfectly
handle round trip assembly/disassembly due to not tracking how
immediates were encoded. This doesn't matter much in practice, since
compilers don't emit the suboptimal encoding. I doubt any users are
relying on this behavior (although I did make use of the old behavior
to figure out what was wrong).
Fixes bug 46302.
In BTF, pointee type pruning is used to reduce cluttering
too many unused types into prog BTF. For example,
struct task_struct {
...
struct mm_struct *mm;
...
}
If bpf program does not access members of "struct mm_struct",
there is no need to bring types for "struct mm_struct" to BTF.
This patch fixed a bug where an incorrect pruning happened.
The test case like below:
struct t;
typedef struct t _t;
struct s1 { _t *c; };
int test1(struct s1 *arg) { ... }
struct t { int a; int b; };
struct s2 { _t c; }
int test2(struct s2 *arg) { ... }
After processing test1(), among others, BPF backend generates BTF types for
"struct s1", "_t" and a placeholder for "struct t".
Note that "struct t" is not really generated. If later a direct access
to "struct t" member happened, "struct t" BTF type will be generated
properly.
During processing test2(), when processing member type "_t c",
BPF backend sees type "_t" already generated, so returned.
This caused the problem that "struct t" BTF type is never generated and
eventually causing incorrect type definition for "struct s2".
To fix the issue, during DebugInfo type traversal, even if a
typedef/const/volatile/restrict derived type has been recorded in BTF,
if it is not a type pruning candidate, type traversal of its base type continues.
Differential Revision: https://reviews.llvm.org/D82041
Summary:
This commit fixes a bug in the FixBrTables pass in which an
unconditional branch from the switch header block to the jump table
block was not removed before the blocks were combined. The result was
an invalid CFG in the MachineFunction. This commit also switches from
using bespoke branch analysis and deletion code to using the standard
utilities for the same.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81909
Summary:
Add a flag to omit the xray_fn_idx to cut size overhead and relocations
roughly in half at the cost of reduced performance for single function
patching. Minor additions to compiler-rt support per-function patching
without the index.
Reviewers: dberris, MaskRay, johnislarry
Subscribers: hiraditya, arphaman, cfe-commits, #sanitizers, llvm-commits
Tags: #clang, #sanitizers, #llvm
Differential Revision: https://reviews.llvm.org/D81995
In order to support hot-patching, we need to make sure the first emitted instruction in a function is a two-byte+ op. This is already the case on x86_64, which seems to always emit two-byte+ ops. However on 32-bit targets this wasn't the case.
PATCHABLE_OP now lowers to a XCHG AX, AX, (66 90) like MSVC does. However when targetting pentium3 (/arch:SSE) or i386 (/arch:IA32) targets, we generate MOV EDI,EDI (8B FF) like MSVC does. This is for compatiblity reasons with older tools that rely on this two byte pattern.
Differential Revision: https://reviews.llvm.org/D81301
Summary:
CFI emitted during PEI at the beginning of the prologue needs to apply
to any inserted waitcnts on function entry.
Reviewers: arsenm, t-tye, RamNalamothu
Reviewed By: arsenm
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm, #debug-info
Differential Revision: https://reviews.llvm.org/D76881
To set up a tail-predicated loop, we need to to calculate the number of
elements processed by the loop. We can now use intrinsic
@llvm.get.active.lane.mask() to do this, which is emitted by the vectoriser in
D79100. This intrinsic generates a predicate for the masked loads/stores, and
consumes the Backedge Taken Count (BTC) as its second argument. We can now use
that to reconstruct the loop tripcount, instead of the IR pattern match
approach we were using before.
Many thanks to Eli Friedman and Sam Parker for all their help with this work.
This also adds overflow checks for the different, new expressions that we
create: the loop tripcount, and the sub expression that calculates the
remaining elements to be processed. For the latter, SCEV is not able to
calculate precise enough bounds, so we work around that at the moment, but is
not entirely correct yet, it's conservative. The overflow checks can be
overruled with a force flag, which is thus potentially unsafe (but not really
because the vectoriser is the only place where this intrinsic is emitted at the
moment). It's also good to mention that the tail-predication pass is not yet
enabled by default. We will follow up to see if we can implement these
overflow checks better, either by a change in SCEV or we may want revise the
definition of llvm.get.active.lane.mask.
Differential Revision: https://reviews.llvm.org/D79175
In more complicated loops we can easily hit the complexity limits of
loop strength reduction. If we do and filtering occurs, it's all too
easy to remove the wrong formulae for post-inc preferring accesses due
to it attempting to maximise register re-use. The patch adds an
alternative filtering step when the target is preferring postinc to pick
postinc formulae instead, hopefully lowering the complexity to below the
limit so that aggressive filtering is not needed.
There is also a change in here to stop considering existing addrecs as
free under postinc. We should already be modelling them as a reg so
don't want it to cause us to get the cost wrong. (I'm not sure that code
makes sense in general, but there are X86 tests specifically for it
where it seems to be helping so have left it around for the standard
non-post-inc case).
Differential Revision: https://reviews.llvm.org/D80273
Spills of VCC (SGPR64) will fail with new SGPR spill code,
because super register is not correctly resolved.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D81224
Summary:
In the patch D62907 the PPC CTRLoops pass has been replaced by Generic
Hardware Loop pass, and it has imported some new intrinsic for Generic
Hardware Loop.
The old intrinsic used in PPC CTRLoops int_ppc_mtctr and
int_ppc_is_decremented_ctr_nonzero is been replaced by
int_set_loop_iterations and loop_decrement.
This patch is to remove above unused two instrinsic.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D81539
Summary:
Adds two features to the generated rule disable option:
- '*' - Disable all rules
- '!<foo>' - Re-enable rule(s)
- '!foo' - Enable rule named 'foo'
- '!5' - Enable rule five
- '!4-9' - Enable rule four to nine
- '!foo-bar' - Enable rules from 'foo' to (and including) 'bar'
(the '!' is available to the generated disable option but is not part of the underlying and determines whether to call setRuleDisabled() or setRuleEnabled())
This is intended to support unit testing of combine rules so
that you can do:
GeneratedCfg.setRuleDisabled("*")
GeneratedCfg.setRuleEnabled("foo")
to ensure only a specific rule is in effect. The rule is still
required to be included in a combiner though
Also added --...-only-enable-rule=X,Y which is effectively an
alias for --...-disable-rule=*,!X,!Y and as such interacts
properly with disable-rule.
Reviewers: aditya_nandakumar, bogner, volkan, aemerson, paquette, arsenm
Subscribers: wdng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81889
When selecting 32 b -> 64 b G_ZEXTs, we don't have to always emit the extend.
If the instruction feeding into the G_ZEXT implicitly zero extends the high
half of the register, we can just emit a SUBREG_TO_REG instead.
Differential Revision: https://reviews.llvm.org/D81897
This patch upstreams support for BFloat Matrix Multiplication Intrinsics
and Code Generation from __bf16 to AArch64. This includes IR intrinsics. Unittests are
provided as needed. AArch32 Intrinsics + CodeGen will come after this
patch.
This patch is part of a series implementing the Bfloat16 extension of
the
Armv8.6-a architecture, as detailed here:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
The bfloat type, and its properties are specified in the Arm
Architecture
Reference Manual:
https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
The following people contributed to this patch:
Luke Geeson
- Momchil Velikov
- Mikhail Maltsev
- Luke Cheeseman
Reviewers: SjoerdMeijer, t.p.northover, sdesmalen, labrinea, miyuki,
stuij
Reviewed By: miyuki, stuij
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits,
llvm-commits, miyuki, chill, pbarrio, stuij
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D80752
Change-Id: I174f0fd0f600d04e3799b06a7da88973c6c0703f
An instruction like this will need to allocate some stack space for the
last parameter:
%x = call addrspace(1) i16 @bar(i64 undef, i64 undef, i16 undef, i16 0)
This worked fine when passing an actual value (in this case 0). However,
when passing undef, no value was pushed to the stack and therefore no
push instructions were created. This caused an unbalanced stack leading
to interesting results.
This commit fixes that by replacing the push logic with a regular stack
adjustment and stack-relative load/stores. This is less efficient but at
least it correctly compiles the code.
I can think of a few improvements in the future:
* The stack should have been adjusted in the function prologue when
there are no allocas in the function.
* Many (if not most) stack adjustments can be replaced by
pushing/popping the values directly. Exactly like the previous code
attempted but didn't do correctly.
* Small stack adjustments can be done more efficiently with a few
push/pop instructions (pushing/popping bogus values), both for code
size and for speed.
All in all, as long as there are no allocas in the function I think that
it is almost always more efficient to emit regular push/pop
instructions. This is however left for future optimizations.
Differential Revision: https://reviews.llvm.org/D78581
This patch fixes a bug in stack save/restore code. Because the frame
pointer was saved/restored manually (not by marking it as clobbered) the
StackSize variable was not updated accordingly. Most code still worked,
but code that tried to load a parameter passed on the stack did not.
This commit fixes this by marking the frame pointer as a
callee-clobbered register. This will let it be saved without any effort
in prolog/epilog code and will make sure the correct address is
calculated for loading parameters that are passed on the stack.
This approach is used by most other targets (such as X86, AArch64 and
RISC-V).
Differential Revision: https://reviews.llvm.org/D78579
These code patterns attempt to call isVMOVModifiedImm on a splat of i1
values, leading to an unreachable being hit. I've guarded the call on a
more specific set of sizes, as i1 vectors are legal under MVE.
Differential Revision: https://reviews.llvm.org/D81860
This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero.
Fixes PR45378
Differential Revision: https://reviews.llvm.org/D81547
Only functions with floating-point return type accepts fast-math flags.
When adding such flags to function returning integer, we'll see a crash,
because there's still an undeleted value referencing the argument. This
patch manually removes the temporary instruction when error occurs.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D78355
It's possible to end up with a zext or something in the way of a G_CONSTANT,
even pre-legalization. This can happen with memsets.
e.g.
https://godbolt.org/z/Bjc8cw
To make sure we can catch these cases, use `getConstantVRegValWithLookThrough`
instead of `mi_match`.
Differential Revision: https://reviews.llvm.org/D81875
Summary:
L is meant to support the second word used by 32b calling conventions for 64b arguments.
This is required for build 32b PowerPC Linux kernels after upstream
commit 334710b1496a ("powerpc/uaccess: Implement unsafe_put_user() using 'asm goto'")
Thanks for the report from @nathanchance, and reference to GCC's
implementation from @segher.
Fixes: pr/46186
Fixes: https://github.com/ClangBuiltLinux/linux/issues/1044
Reviewers: echristo, hfinkel, MaskRay
Reviewed By: MaskRay
Subscribers: MaskRay, wuzish, nemanjai, hiraditya, kbarton, steven.zhang, llvm-commits, segher, nathanchance, srhines
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81767
Add selection support for ext via a new opcode, G_EXT and a post-legalizer
combine which matches it.
Add an `applyEXT` function, because the AArch64ext patterns require a register
for the immediate. So, we have to create a G_CONSTANT to get these without
writing new patterns or modifying the existing ones.
Tests are the same as arm64-ext.ll.
Also prevent ext from firing on the zip test. It has higher priority, so we
don't want it potentially getting in the way of mask tests.
Also fix up the shuffle-splat test, because ext is now selected there. The
test was incorrectly regbank selected before, which could cause a verifier
failure when you emit copies.
Differential Revision: https://reviews.llvm.org/D81436