Given a min(max(fptosi, INT_MIN), INT_MAX) with the correct constants,
we can now generate a fptosi.sat. But in the arm backend, the constant
can be treated as high cost, pulling it out of the basic block in a way
that the DAG combine can no longer see it. This teaches it again that it
is a low cost constant, not worth hoisting out.
Recommitted from 0e98659ea1 with a fix for APInt comparison.
Differential Revision: https://reviews.llvm.org/D114380
This patch implements a new MachineFunction in the ARM backend for
placing BTI instructions. It is similar to the existing AArch64
aarch64-branch-targets pass.
BTI instructions are inserted into basic blocks that:
- Have their address taken
- Are the entry block of a function, if the function has external
linkage or has its address taken
- Are mentioned in jump tables
- Are exception/cleanup landing pads
Each BTI instructions is placed in the beginning of a BB after the
so-called meta instructions (e.g. exception handler labels).
Each outlining candidate and the outlined function need to be in agreement about
whether BTI placement is enabled or not. If branch target enforcement is
disabled for a function, the outliner should not covertly enable it by emitting
a call to an outlined function, which begins with BTI.
The cost mode of the outliner is adjusted to account for the extra BTI
instructions in the outlined function.
The ARM Constant Islands pass will maintain the count of the jump tables, which
reference a block. A `BTI` instruction is removed from a block only if the
reference count reaches zero.
PAC instructions in entry blocks are replaced with PACBTI instructions (tests
for this case will be added in a later patch because the compiler currently does
not generate PAC instructions).
The ARM Constant Island pass is adjusted to handle BTI
instructions correctly.
Functions with static linkage that don't have their address taken can
still be called indirectly by linker-generated veneers and thus their
entry points need be marked with BTI or PACBTI.
The changes are tested using "LLVM IR -> assembly" tests, jump tables
also have a MIR test. Unfortunately it is not possible add MIR tests
for exception handling and computed gotos because of MIR parser
limitations.
This patch is part of a series that adds support for the PACBTI-M extension of
the Armv8.1-M architecture, as detailed here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/armv8-1-m-pointer-authentication-and-branch-target-identification-extension
The PACBTI-M specification can be found in the Armv8-M Architecture Reference
Manual:
https://developer.arm.com/documentation/ddi0553/latest
The following people contributed to this patch:
- Mikhail Maltsev
- Momchil Velikov
- Ties Stuij
Reviewed By: ostannard
Differential Revision: https://reviews.llvm.org/D112426
Given a min(max(fptosi, INT_MIN), INT_MAX) with the correct constants,
we can now generate a fptosi.sat. But in the arm backend, the constant
can be treated as high cost, pulling it out of the basic block in a way
that the DAG combine can no longer see it. This teaches it again that it
is a low cost constant, not worth hoisting out.
Differential Revision: https://reviews.llvm.org/D114380
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have a less strict semantics
than the fptosisat, with less values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
We can't use the existing pseudo ARM::tLDRLIT_ga_pcrel for loading the
stack guard for PIC code that references the GOT, since arm-pseudo may
expand this to the narrow tLDRpci rather than the wider t2LDRpci.
Create a new pseudo, t2LDRLIT_ga_pcrel, and expand it to t2LDRpci.
Fixes: https://bugs.chromium.org/p/chromium/issues/detail?id=1270361
Reviewed By: ardb
Differential Revision: https://reviews.llvm.org/D114762
Currently we create register mappings for registers used only once in current
MBB. For registers with multiple uses, when all the uses are in the current MBB,
we can also create mappings for them similarly according to the last use.
For example
%reg101 = ...
= ... reg101
%reg103 = ADD %reg101, %reg102
We can create mapping between %reg101 and %reg103.
Differential Revision: https://reviews.llvm.org/D113193
[NFC] As part of using inclusive language within the llvm project, this patch
replaces master with main in `2007-04-02-RegScavengerAssert.ll`.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D114276
Patch to fix some of the regressions in D77804.
By folding to rotate/funnel-shift by constant amounts for illegal types, we prevent SimplifyDemandedBits from destroying the patterns prematurely, allowing us to use the rotate/funnel-shift legalization that was added in D112443.
Differential Revision: https://reviews.llvm.org/D113192
Implement support for loading the stack canary from a memory location held in
the TLS register, with an optional offset applied. This is used by the Linux
kernel to implement per-task stack canaries, which is impossible on SMP systems
when using a global variable for the stack canary.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112768
Currently, LOAD_STACK_GUARD on ARM is only implemented for Mach-O targets, and
other targets rely on the generic support which may result in spilling of the
stack canary value or address, or may cause it to be kept in a callee save
register across function calls, which means they essentially get spilled as
well, only by the callee when it wants to free up this register.
So let's implement LOAD_STACK GUARD for other targets as well. This ensures
that the load of the stack canary is rematerialized fully in the epilogue.
This code was split off from
D112768: [ARM] implement support for TLS register based stack protector
for which it is a prerequisite.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112811
In TwoAddressInstructionPass::processTiedPairs with
-early-live-intervals, update any preexisting physreg live intervals,
as well as virtreg live intervals. By default (without
-precompute-phys-liveness) physreg live intervals only exist for
registers that are live-in to some basic block.
Differential Revision: https://reviews.llvm.org/D113191
In TwoAddressInstructionPass::processTiedPairs with
-early-live-intervals, update any preexisting physreg live intervals,
as well as virtreg live intervals. By default (without
-precompute-phys-liveness) physreg live intervals only exist for
registers that are live-in to some basic block.
Differential Revision: https://reviews.llvm.org/D113191
Currently when tail predicating loops, vpt blocks need to be created
with the vctp predicate in case we need to revert to non-tail predicated
form. This has the unfortunate side effect of severely hampering post-ra
scheduling at times as the instructions are already stuck in vpt blocks,
not allowed to be independently ordered.
This patch addresses that by just moving the creation of VPT blocks
later in the pipeline, after post-ra scheduling has been performed. This
allows more optimal scheduling post-ra before the vpt blocks are
created, leading to more optimal tail predicated loops.
Differential Revision: https://reviews.llvm.org/D113094
If the type of a funnel shift needs to be expanded, expand it to two funnel shifts instead of regular shifts. For constant shifts, this doesn't make much difference, but for variable shifts it allows a more optimal lowering.
Also use the optimized funnel shift lowering for rotates.
Alive2: https://alive2.llvm.org/ce/z/TvHDB- / https://alive2.llvm.org/ce/z/yzPept
(Branched from D108058 as getting this completed should help unlock some other WIP patches).
Original Patch: @efriedma (Eli Friedman)
Differential Revision: https://reviews.llvm.org/D112443
This is relanding commit da1d1a0869 .
This patch additionally addresses failures found in buildbots & post review comments.
ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup.
It will call the UnwindResume.
__cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called.
This will trigger a termination when a foreign exception is processed while UnwindResume is called
because the global state will be wrong due to the missing __cxa_end_cleanup call.
Additional test here: D109856
[1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions
Reviewed By: logan
Differential Revision: https://reviews.llvm.org/D111703
This is relanding commit da1d1a0869 .
This patch additionally addresses failures found in buildbots & post review comments.
ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup.
It will call the UnwindResume.
__cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called.
This will trigger a termination when a foreign exception is processed while UnwindResume is called
because the global state will be wrong due to the missing __cxa_end_cleanup call.
Additional test here: D109856
[1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions
Reviewed By: logan
Differential Revision: https://reviews.llvm.org/D111703
In ARM mode, passing -mtp=cp15 forces the use of an inline MRC system register read to move the thread pointer value into a register.
Currently, in Thumb2 mode, -mtp=cp15 is ignored, and a call to the __aeabi_read_tp helper is emitted instead.
This is inconsistent, and breaks the Linux/ARM build for Thumb2 targets, as the Linux kernel does not provide an implementation of __aeabi_read_tp,.
Reviewed By: nickdesaulniers, peter.smith
Differential Revision: https://reviews.llvm.org/D112600
ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup.
It will call the UnwindResume.
__cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called.
This will trigger a termination when a foreign exception is processed while UnwindResume is called
because the global state will be wrong due to the missing __cxa_end_cleanup call.
Additional test here: D109856
[1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions
Reviewed By: logan
Differential Revision: https://reviews.llvm.org/D111703
The intrinsic is called llvm.ceil not llvm.fceil. The checks weren't
strong enough to notice that a call to llvm.fceil was emitted in
the final assembly.
The MOVCC peephole eliminates a MOVCC by making one of its inputs a
conditional instruction, but when doing this it should be using both
inputs of the MOVCC to decide on the register class to use as
otherwise we can get an error when using -verify-machineinstrs.
Differential Revision: https://reviews.llvm.org/D111714
The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:
%0 = gep float*, float* base, i32 4
%1 = bitcast float* %0 to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1
...
%n1 = gep float*, float* base, i32 N
%n2 = bitcast float* %n1 to <4 x float>*
%n3 = load <4 x float>, <4 x float>* %n2
For AArch64 the compiler generates a sequence of LDR Qt, [Xn, #16].
However, 32-bit NEON VLD1/VST1 lack the [Wn, #imm] addressing mode, so
the address is computed before every ld/st instruction:
add r2, r0, #32
add r0, r0, #16
vld1.32 {d18, d19}, [r2]
vld1.32 {d22, d23}, [r0]
This can be improved by computing address for the first load, and then
using a post-indexed form of VLD1/VST1 to load the rest:
add r0, r0, #16
vld1.32 {d18, d19}, [r0]!
vld1.32 {d22, d23}, [r0]
In order to do that, the patch adds more patterns to DAGCombine:
- (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1
and inc2 are constants.
- (or ptr inc) is now recognized as a pointer increment if ptr is
sufficiently aligned.
In addition to that, we now search for all possible base updates and
then pick the best one.
Differential Revision: https://reviews.llvm.org/D108988
This patch contains following enhancements to SrcRegMap and DstRegMap:
1 In findOnlyInterestingUse not only check if the Reg is two address usage,
but also check after commutation can it be two address usage.
2 If a physical register is clobbered, remove SrcRegMap entries that are
mapped to it.
3 In processTiedPairs, when create a new COPY instruction, add a SrcRegMap
entry only when the COPY instruction is coalescable. (The COPY src is
killed)
With these enhancements isProfitableToCommute can do better commute decision,
and finally more register copies are removed.
Differential Revision: https://reviews.llvm.org/D108731
D100244 missed a check on the ResNo of the extract's operand 0 when finding a
pair of extracts to combine into a VMOVRRD (extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2)). As a result, it can incorrectly pair an extract(x, n)
with another extract(x:3, n+1) for example. This patch fixes the bug by adding
the proper check on ResNo.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D111188
The commit e497b12a69 went and regenerated
all the checks lines in the Arm speculation-hardening-sls.ll test in a
way that removed most of the important checks. This just resets them
back to how they were before, with the single character fix to change:
; NOHARDENARM: {{bxge lr$}}
to
; NOHARDENARM: {{bxgt lr$}}
Differential Revision: https://reviews.llvm.org/D111074
The delayed stack protector feature which is currently used for SDAG (and thus
allows for more commonly generating tail calls) depends on being able to extract
the tail call into a separate return block. To do this it also has to extract
the vreg->physreg copies that set up the call's arguments, since if it doesn't
then the call inst ends up using undefined physregs in it's new spliced block.
SelectionDAG implementations can do this because they delay emitting register
copies until *after* the stack arguments are set up. GISel however just
processes and emits the arguments in IR order, so stack arguments always end up
last, and thus this breaks the code that looks for any register arg copies that
precede the call instruction.
This patch adds a thunk argument to the assignValueToReg() and custom assignment
hooks. For outgoing arguments, register assignments use this return param to
return a thunk that does the actual generating of the copies. We collect these
until all the outgoing stack assignments have been done and then execute them,
so that the copies (and perhaps some artifacts like G_SEXTs) are placed after
any stores.
Differential Revision: https://reviews.llvm.org/D110610
A <= -1 constant on a compare can be converted to a < 0 operation, which
is usually cheap. If we mark the constant as cheap, preventing hoisting,
we allow that fold to happen even across different blocks.
Differential Revision: https://reviews.llvm.org/D109360
The fix applied in D23303 "LiveIntervalAnalysis: fix a crash in repairOldRegInRange"
was over-zealous. It would bail out when the end of the range to be
repaired was in the middle of the first segment of the live range of
Reg, which was always the case when the range contained a single def of
Reg.
This patch fixes it as suggested by Matthias Braun in post-commit review
on the original patch, and tests it by adding -early-live-intervals to
a selection of existing lit tests that now pass.
(Note that D23303 was originally applied to fix a crash in
SILoadStoreOptimizer, but that is now moot since D23814 updated
SILoadStoreOptimizer to run before scheduling so it no longer has to
update live intervals.)
Differential Revision: https://reviews.llvm.org/D110238
Unrevert with some changes to the tests:
- Add -verify-machineinstrs to check for remaining problems in live
interval support in TwoAddressInstructionPass.
- Drop test/CodeGen/AMDGPU/extract-load-i1.ll since it suffers from
some of those remaining problems.