This adds a simple fold into codegenprepare that converts comparison of
branches towards comparison with zero if possible. For example:
%c = icmp ult %x, 8
br %c, bla, blb
%tc = lshr %x, 3
becomes
%tc = lshr %x, 3
%c = icmp eq %tc, 0
br %c, bla, blb
As a first order approximation, this can reduce the number of
instructions needed to perform the branch as the shift is (often) needed
anyway. At the moment this does not effect very much, as llvm tends to
prefer the opposite form. But it can protect against regressions from
commits like rG9423f78240a2.
Simple cases of Add and Sub are added along with Shift, equally as the
comparison to zero can often be folded with cpsr flags.
Differential Revision: https://reviews.llvm.org/D101778
Based on the same for AArch64: 4751cadcca
At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.
Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support. However, the 4-byte encodings
are not present for ARM v8-M Baseline. To enable this, two new pseudos are
added for Thumb which are only valid for v8mbase, tCMP_SWAP_8 and
tCMP_SWAP_16.
The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (adding another implementation of atomic operations and more
divergence between backends) I have chosen to follow the approach taken in
D101163.
Differential Revision: https://reviews.llvm.org/D101898
Depends on D101912
Analogously to https://reviews.llvm.org/D98794 this patch uses the
`alignstack` attribute to fix incorrect passing of homogeneous
aggregate (HA) arguments on AArch32. The EABI/AAPCS was recently
updated to clarify how VFP co-processor candidates are aligned:
4488e34998
Differential Revision: https://reviews.llvm.org/D100853
Unfortunately the current call lowering code is built on top of the
legacy MVT/DAG based code. However, GlobalISel was not using it the
same way. In short, the DAG passes legalized types to the assignment
function, and GlobalISel was passing the original raw type if it was
simple.
I do believe the DAG lowering is conceptually broken since it requires
picking a type up front before knowing how/where the value will be
passed. This ends up being a problem for AArch64, which wants to pass
i1/i8/i16 values as a different size if passed on the stack or in
registers.
The argument type decision is split across 3 different places which is
hard to follow. SelectionDAG builder uses
getRegisterTypeForCallingConv to pick a legal type, tablegen gives the
illusion of controlling the type, and the target may have additional
hacks in the C++ part of the call lowering. AArch64 hacks around this
by not using the standard AnalyzeFormalArguments and special casing
i1/i8/i16 by looking at the underlying type of the original IR
argument.
I believe people have generally assumed the calling convention code is
processing the original types, and I've discovered a number of dead
paths in several targets.
x86 actually relies on the opposite behavior from AArch64, and relies
on x86_32 and x86_64 sharing calling convention code where the 64-bit
cases implicitly do not work on x86_32 due to using the pre-legalized
types.
AMDGPU targets without legal i16/f16 have always used a broken ABI
that promotes to i32/f32. GlobalISel accidentally fixed this to be the
ABI we should have, but this fixes it so we're using the worse ABI
that is compatible with the DAG. Ideally we would fix the DAG to match
the old GlobalISel behavior, but I don't wish to fight that battle.
A new native GlobalISel call lowering framework should let the target
process the incoming types directly.
CCValAssigns select a "ValVT" and "LocVT" but the meanings of these
aren't entirely clear. Different targets don't use them consistently,
even within their own call lowering code. My current belief is the
intent was "ValVT" is supposed to be the legalized value type to use
in the end, and and LocVT was supposed to be the ABI passed type
(which is also legalized).
With the default CCState::Analyze functions always passing the same
type for these arguments, these only differ when the TableGen part of
the lowering decide to promote the type from one legal type to
another. AArch64's i1/i8/i16 hack ends up inverting the meanings of
these values, so I had to add an additional hack to let the target
interpret how large the argument memory is.
Since targets don't consistently interpret ValVT and LocVT, this
doesn't produce quite equivalent code to the initial DAG
lowerings. I've opted to consistently interpret LocVT as the in-memory
size for stack passed values, and ValVT as the register type to assign
from that memory. We therefore produce extending loads directly out of
the IRTranslator, whereas the DAG would emit regular loads of smaller
values. This will also produce loads/stores that are wider than the
argument value if the allocated stack slot is larger (and there will
be undef padding bytes). If we had the optimizations to reduce
load/stores based on truncated values, this wouldn't produce a
different end result.
Since ValVT/LocVT are more consistently interpreted, we now will emit
more G_BITCASTS as requested by the CCAssignFn. For example AArch64
was directly assigning types to some physical vector registers which
according to the tablegen spec should have been casted to a vector
with a different element type.
This also moves the responsibility for inserting
G_ASSERT_SEXT/G_ASSERT_ZEXT from the target ValueHandlers into the
generic code, which is closer to how SelectionDAGBuilder works.
I had to xfail an x86 test since I don't see a quick way to fix it
right now (I filed bug 50035 for this). It's broken independently of
this change, and only triggers since now we end up with more ands
which hit the improperly handled selection pattern.
I also observed that FP arguments that need promotion (e.g. f16 passed
as f32) are broken, and use regular G_TRUNC and G_ANYEXT.
TLDR; the current call lowering infrastructure is bad and nobody has
ever understood how it chooses types.
This reverts the revert 02c5ba8679
Fix:
Pass was registered as DUMMY_FUNCTION_PASS causing the newpm-pass
functions to be doubly defined. Triggered in -DLLVM_ENABLE_MODULE=1
builds.
Original commit:
This patch implements expansion of llvm.vp.* intrinsics
(https://llvm.org/docs/LangRef.html#vector-predication-intrinsics).
VP expansion is required for targets that do not implement VP code
generation. Since expansion is controllable with TTI, targets can switch
on the VP intrinsics they do support in their backend offering a smooth
transition strategy for VP code generation (VE, RISC-V V, ARM SVE,
AVX512, ..).
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D78203
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Originally submitted as 3338290c18.
Reverted in c7df6b1223.
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
This patch implements expansion of llvm.vp.* intrinsics
(https://llvm.org/docs/LangRef.html#vector-predication-intrinsics).
VP expansion is required for targets that do not implement VP code
generation. Since expansion is controllable with TTI, targets can switch
on the VP intrinsics they do support in their backend offering a smooth
transition strategy for VP code generation (VE, RISC-V V, ARM SVE,
AVX512, ..).
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D78203
This expands the VMOVRRD(extract(..(build_vector(a, b, c, d)))) pattern,
to also handle insert_vectors. Providing we can find the correct insert,
this helps further simplify patterns by removing the redundant VMOVRRD.
Differential Revision: https://reviews.llvm.org/D100245
The generic SoftFloatVectorExtract.ll test was failing when run on arm
machines, as it tries to create a f64 under soft float. Limit the
transform to when f64 is legal.
Also add a missing override, as reported in D100244.
This adds a combine for extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction, and thanks to the other VMOVRRD folds
we have added recently can help reduce the amount of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.
This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break as not all DPR regs can be extracted from
directly. Otherwise the machine verifier is unhappy.
Differential Revision: https://reviews.llvm.org/D100244
When the ProcResGroup has BufferSize=0,
1. if there is a subunit in the list of write resources for the
scheduling class, do not attempt to schedule the ProcResGroup.
2. if there is not a subunit in the list of write resources for the
scheduling class, choose a subunit to use instead of the ProcResGroup.
3. having both the ProcResGroup and any of its subunits in the resources
implied by a InstRW is not supported.
Used to model parallel uses from a pool of resources.
Differential Revision: https://reviews.llvm.org/D98976
Used to model structural hazards on FP issue, where some
instructions take up 2 issue slots and others one as well
as similar structural hazards on load issue, where some
instructions take up two load lanes and others one.
Differential Revision: https://reviews.llvm.org/D98977
The IR stack protector pass must insert stack checks before the call instead of
between it and the return.
Similarly, SDAG one should recognize that ADJCALLFRAME instructions could be
part of the terminal sequence of a tail call. In this case because such call
frames cannot be nested in LLVM the stack protection code must skip over the
whole sequence (or risk clobbering argument registers).
It breaks up the function pass manager in the codegen pipeline.
With empty parameters, it looks at the -mllvm flag -rewrite-map-file.
This is likely not in use.
Add a check that we only have one function pass manager in the codegen
pipeline.
This required reverting commit 9583a3f2625818b78c0cf6d473cdedb9f23ad82c:
"[AsmPrinter] Delete dead takeDeletedSymbsForFunction()".
This was not NFC as initially thought. By coalescing two function
psas managers, this exposed the reverted code as necessary.
addr-label.ll was crashing due to an emitted blockaddress's block being
removed but the label not emitted.
Some tests relied on the fact that we had a module pass somewhere in the
codegen pipeline.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D99707
This allows these optimisations to apply to e.g. `urem i16` directly
before `urem` is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as on Aarch64). The legalization then
later can happen more directly and generated code gets a chance to avoid
wasting time on computing results in types wider than necessary, in the end.
Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D88785
MVE does not have a single sext/zext or trunc instruction that takes the
bottom half of a vector and extends to a full width, like NEON has with
MOVL. Instead it is expected that this happens through top/bottom
instructions. So the MVE equivalent VMOVLT/B instructions take either
the even or odd elements of the input and extend them to the larger
type, producing a vector with half the number of elements each of double
the bitwidth. As there is no simple instruction for a normal extend, we
often have to expand sext/zext/trunc into a series of lane moves (or
stack loads/stores, which we do not do yet).
This pass takes vector code that starts at truncs, looks for
interconnected blobs of operations that end with sext/zext and
transforms them by adding shuffles so that the lanes are interleaved and
the MVE VMOVL/VMOVN instructions can be used. This is done pre-ISel so
that it can work across basic blocks.
This initial version of the pass just handles a limited set of
instructions, not handling constants or splats or FP, which can all come
as extensions to this base.
Differential Revision: https://reviews.llvm.org/D95804
Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops.
Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement.
Differential Revision: https://reviews.llvm.org/D98857
Given a sextload i16, we can usually generate "ldrsh [rn. rm]". If we
don't naturally have a rn, rm addressing mode, we can either generate
"ldrh [rn, #0]; sxth" or "mov rm, #0; ldrsh [rn. rm]".
We currently generate the first, always creating a sxth. They are both
the same number of instructions, but if we generate the second then the
mov #0 will likely be CSE'd or pulled out of a loop, etc.
This adjusts the ISel patterns to do that, creating a mov instead of a
sxth.
Differential Revision: https://reviews.llvm.org/D98693
This also briefly tests a larger set of architectures than the more
exhaustive functionality tests for AArch64 and x86.
As requested in D88785
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98339
The result of ISD::USUBSAT will never be larger than the LHS. We
can use this to put a bound on the number of leading zeros.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98133
:: (store 1 + 4, addrspace 1)
->
:: (store 1 into undef + 4, addrspace 1)
An offset without a base isn't terribly useful but it's convenient to update
the offset without checking the value. For example, when breaking apart
stores into smaller units
Differential Revision: https://reviews.llvm.org/D97812
This seems to be more of a Clang thing rather than a generic LLVM thing,
so this moves it out of LLVM pipelines and as Clang extension hooks into
LLVM pipelines.
Move the post-inline EEInstrumentation out of the backend pipeline and
into a late pass, similar to other sanitizer passes. It doesn't fit
into the codegen pipeline.
Also fix up EntryExitInstrumentation not running at -O0 under the new
PM. PR49143
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D97608
Rather than converting 3 signbits to bools and comparing them,
we can do bitwise logic on the whole vector and convert the
resulting sign bit to a bool at the end.
This is still a different algorithm than what we do in LegalizeDAG
through expandSADDOSSUBO. That algorithm needs to know that the
RHS of SSUBO is > 0, but that's costly when the type is split.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97325
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
This is a recommit of 3b34b06fc5 with correctness fixes and extra
tests.
Differential Revision: https://reviews.llvm.org/D95885
This code creates 3 setccs that need to be expanded. It was
creating a sign bit test as setge X, 0 which is non-canonical.
Canonical would be setgt X, -1. This misses the special case in
IntegerExpandSetCCOperands for sign bit tests that assumes
canonical form. If we don't hit this special case we end up
with a multipart setcc instead of just checking the sign of
the high part.
To fix this I've reversed the polarity of all of the setccs to
setlt X, 0 which is canonical. The rest of the logic should
still work. This seems to produce better code on RISCV which
lacks a setgt instruction.
This probably still isn't the best code sequence we could use here.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97181
Early versions of the ARMv7 reference manuals considered the sp register
as a deprecated register for ldm/stm familiy of instructions. However,
later versions such as ARM DDI 0406C.d added a note to the Appendix:
D9.3 Use of the SP as a general-purpose register
Most ARM instructions, unlike Thumb instructions, provide exactly the
same access to the SP as to R0-R12. This means that it is possible to
use the SP as a general-purpose register. Earlier issues of this manual
deprecated the use of SP in an ARM instruction, in any way that is
deprecated, not permitted, or not possible in the corresponding
Thumb instruction. However, user feedback indicates a number of cases
where these instructions are useful. Therefore, ARM no longer deprecates
these instruction uses.
Also Armv8 manuals no longer consider SP as deprecated register for ldm/
stm A32 instructions.
Furthermore, GNU as also does not print a deprecated warning when using
SP with those instructions.
Drop deprecation warning for pop/ldm/push/stm instructions.
Patch by: Stefan Agner.
Differential Revision: https://reviews.llvm.org/D82692
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
Differential Revision: https://reviews.llvm.org/D95885
A v8i32 compare will produce a v8i1 predicate, but during codegen the
v8i32 will be split into two v4i32, potentially requiring two v4i1
predicates to be merged into a single v8i1. Because this merging of two
v4i1's into a v8i1 is very expensive, we need to make the cost of the
compare equally high.
This patch adds the cost of that to ARMTTIImpl::getCmpSelInstrCost.
Because we don't know whether the user of the predicate can be split,
and the cost model is mostly pre-instruction, we may be pessimistic but
that should only be for larger and legal types. This also adds min/max
detection to the costmodel where it can be detected, to keep those in
line with the cost of simple min/max instructions. Otherwise for the
most part, costs that were already expensive have become more expensive.
Differential Revision: https://reviews.llvm.org/D96692
This adds a new flag -lsr-preferred-addressing-mode to override the target's
preferred addressing mode. It replaces flag -lsr-backedge-indexing, which is
equivalent to preindexed addressing that is one of the options that
-lsr-preferred-addressing-mode accepts.
Differential Revision: https://reviews.llvm.org/D96855
Currently the findIncDecAfter will only look at the next instruction for
post-inc candidates in the load/store optimizer. This extends that to a
search through the current BB, until an instruction that modifies or
uses the increment reg is found. This allows more post-inc load/stores
and ldm/stm's to be created, especially in cases where a schedule might
move instructions further apart.
We make sure not to look any further for an SP, as that might invalidate
stack slots that are still in use.
Differential Revision: https://reviews.llvm.org/D95881
As discussed on D96413, as long as the promoted bits of the args are zero we can use the basic ISD::USUBSAT pattern directly, without the shifting like we do for other ops.
I think something similar should be possible for ISD::UADDSAT as well, which I'll look at later.
Also, create a ISD::USUBSAT node directly - this will be expanded back by the legalizer later on if necessary.
Differential Revision: https://reviews.llvm.org/D96622
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.
This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.
We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.
The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.
Differential Revision: https://reviews.llvm.org/D96413
This patch adds a pass to replace calls to vector intrinsics (i.e., LLVM
intrinsics operating on vector operands) with calls to a vector library.
Currently, calls to LLVM intrinsics are only replaced with calls to vector
libraries when scalar calls to intrinsics are vectorized by the Loop- or
SLP-Vectorizer.
With this pass, it is now possible to replace calls to LLVM intrinsics
already operating on vector operands, e.g., if such code was generated
by MLIR. For the replacement, information from the TargetLibraryInfo,
e.g., as specified via -vector-library is used.
This is a re-try of the original commit 2303e93e66 that was reverted
due to pass manager problems. Other minor changes have also been made.
Differential Revision: https://reviews.llvm.org/D95373
This patch adds a pass to replace calls to vector intrinsics
(i.e., LLVM intrinsics operating on vector operands) with
calls to a vector library.
Currently, calls to LLVM intrinsics are only replaced with
calls to vector libraries when scalar calls to intrinsics are
vectorized by the Loop- or SLP-Vectorizer.
With this pass, it is now possible to replace calls to LLVM
intrinsics already operating on vector operands, e.g., if
such code was generated by MLIR. For the replacement,
information from the TargetLibraryInfo, e.g., as specified
via -vector-library is used.
Differential Revision: https://reviews.llvm.org/D95373
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat a f16 vector as an i8.
Add f16 handling, treating them like an i16.
Differential Revision: https://reviews.llvm.org/D95446
Under the softfp calling convention, we are often left with
VMOVRRD(extract(bitcast(build_vector(a, b, c, d)))) for the return value
of the function. These can be simplified to a,b or c,d directly,
depending on the value of the extract.
Big endian is a little different because the bitcast switches the lanes
around, meaning we end up with b,a or d,c.
Differential Revision: https://reviews.llvm.org/D94989
This adds a DAG combine for converting sext_inreg of VGetLaneu into
VGetLanes, providing the types match correctly.
Differential Revision: https://reviews.llvm.org/D95073
This de-pessimizes the arguably more usual case of no masked mem intrinsics,
and gets rid of one more Dominator Tree recalculation.
As per llvm/test/CodeGen/X86/opt-pipeline.ll,
there's one more Dominator Tree recalculation left, we could get rid of.
While this is mostly NFC right now, because only ARM happens
to run this pass with DomTree available before it,
and required after it, more backends will be affected once
the SimplifyCFG's switch for domtree preservation is flipped,
and DwarfEHPrepare also preserves the domtree.
This adds some simple fp16 scalar_to_vector patterns, preventing a
selection failure if this came up.
Differential Revision: https://reviews.llvm.org/D95427
Recent shouldAssumeDSOLocal changes (introduced by 961f31d8ad)
do not take in consideration the relocation model anymore. The ARM
fast-isel pass uses the function return to set whether a global symbol
is loaded indirectly or not, and without the expected information
llvm now generates an extra load for following code:
```
$ cat test.ll
@__asan_option_detect_stack_use_after_return = external global i32
define dso_local i32 @main(i32 %argc, i8** %argv) #0 {
entry:
%0 = load i32, i32* @__asan_option_detect_stack_use_after_return,
align 4
%1 = icmp ne i32 %0, 0
br i1 %1, label %2, label %3
2:
ret i32 0
3:
ret i32 1
}
attributes #0 = { noinline optnone }
$ lcc test.ll -o -
[...]
main:
.fnstart
[...]
movw r0, :lower16:__asan_option_detect_stack_use_after_return
movt r0, :upper16:__asan_option_detect_stack_use_after_return
ldr r0, [r0]
ldr r0, [r0]
cmp r0, #0
[...]
```
And without 'optnone' it produces:
```
[...]
main:
.fnstart
[...]
movw r0, :lower16:__asan_option_detect_stack_use_after_return
movt r0, :upper16:__asan_option_detect_stack_use_after_return
ldr r0, [r0]
clz r0, r0
lsr r0, r0, #5
bx lr
[...]
```
This triggered a lot of invalid memory access in sanitizers for
arm-linux-gnueabihf. I checked this patch both a stage1 built with
gcc and a stage2 bootstrap and it fixes all the Linux sanitizers
issues.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D95379
I may have given bad advice, and skipping sext_inreg when matching SSAT
patterns is not valid on it's own. It at least needs to sext_inreg the
input again, but as far as I can tell is still only valid based on
demanded bits. For the moment disable that part of the combine,
hopefully reimplementing it in the future more correctly.
Add DemandedElts support inside the TRUNCATE analysis.
REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d
Differential Revision: https://reviews.llvm.org/D56387
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.
> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387
This reverts commit cad4275d69.
This patch handles cases where we have to save/restore the link register
into the stack and and load/store instruction which use the stack are
part of the outlined region. It checks that there will be no overflow
introduced by the new offset and fixup these instructions accordingly.
Differential Revision: https://reviews.llvm.org/D92934
If the previous block in a function does not fallthough, adding nop's to
align it will never be executed. This means we can freely (except for
codesize) align more branches. This happens in constantislandspass (as
it cannot happen later) and only happens at aggressive optimization
levels as it does increase codesize.
Differential Revision: https://reviews.llvm.org/D94394
For the ARM hard-float calling convention, calls to variadic functions
need to be treated diffrently, even if only the fixed arguments are
provided.
This fixes GCC-C-execute-pr68390 in the test-suite, which is failing on
the ARM GlobaISel bot.
Blocks can be laid out such that a t2WhileLoopStart branches backwards. This is forbidden by the architecture and so it fails to be converted into a low-overhead loop. This new pass checks for these cases and moves the target block, fixing any fall-through that would then be broken.
Differential Revision: https://reviews.llvm.org/D92385
This patch finishes addressing unused prefixes under CodeGen: 2
remaining tests fixed, and then undo-ing the lit.local.cfg changes under
various subdirs and moving the policy under CodeGen.
Differential Revision: https://reviews.llvm.org/D94430
Local values are constants or addresses that can't be folded into
the instruction that uses them. FastISel materializes these in a
"local value" area that always dominates the current insertion
point, to try to avoid materializing these values more than once
(per block).
https://reviews.llvm.org/D43093 added code to sink these local
value instructions to their first use, which has two beneficial
effects. One, it is likely to avoid some unnecessary spills and
reloads; two, it allows us to attach the debug location of the
user to the local value instruction. The latter effect can
improve the debugging experience for debuggers with a "set next
statement" feature, such as the Visual Studio debugger and PS4
debugger, because instructions to set up constants for a given
statement will be associated with the appropriate source line.
There are also some constants (primarily addresses) that could be
produced by no-op casts or GEP instructions; the main difference
from "local value" instructions is that these are values from
separate IR instructions, and therefore could have multiple users
across multiple basic blocks. D43093 avoided sinking these, even
though they were emitted to the same "local value" area as the
other instructions. The patch comment for D43093 states:
Local values may also be used by no-op casts, which adds the
register to the RegFixups table. Without reversing the RegFixups
map direction, we don't have enough information to sink these
instructions.
This patch undoes most of D43093, and instead flushes the local
value map after(*) every IR instruction, using that instruction's
debug location. This avoids sometimes incorrect locations used
previously, and emits instructions in a more natural order.
In addition, constants materialized due to PHI instructions are
not assigned a debug location immediately; instead, when the
local value map is flushed, if the first local value instruction
has no debug location, it is given the same location as the
first non-local-value-map instruction. This prevents PHIs
from introducing unattributed instructions, which would either
be implicitly attributed to the location for the preceding IR
instruction, or given line 0 if they are at the beginning of
a machine basic block. Neither of those consequences is good
for debugging.
This does mean materialized values are not re-used across IR
instruction boundaries; however, only about 5% of those values
were reused in an experimental self-build of clang.
(*) Actually, just prior to the next instruction. It seems like
it would be cleaner the other way, but I was having trouble
getting that to work.
This reapplies commits cf1c774d and dc35368c, and adds the
modification to PHI handling, which should avoid problems
with debugging under gdb.
Differential Revision: https://reviews.llvm.org/D91734
There were a number of tests needing updates for D91734, and I added a
bunch of LABEL directives to help track down where those had to go.
These directives are an improvement independent of the functional
patch, so I'm committing them as their own separate patch.
Same as a9b6440edd, use zanyext to treat any_extends as zero extends
during lowering to create addw/addl/subw/subl nodes.
Differential Revision: https://reviews.llvm.org/D93835
Similar to 78d8a821e2 but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.
Differential Revision: https://reviews.llvm.org/D93834
Once the default for SimplifyCFG flips, we can no longer pass nullptr
instead of DomTree to SimplifyCFG, so we need to propagate it here.
We don't strictly need to actually preserve DomTree in DwarfEHPrepare,
but we might as well do it, since it's trivial.
TargetMachine::shouldAssumeDSOLocal currently implies dso_local for such definitions.
Since clang -fno-pic add the dso_local specifier, we don't need to special case.
TargetMachine::shouldAssumeDSOLocal currently implies dso_local for such definitions.
Adding explicit dso_local makes these tests align with the clang -fno-pic behavior
and allow the removal of the TargetMachine::shouldAssumeDSOLocal special case.
Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.
New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.
One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.
In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.
Reviewed By: yrouban
Differential Revision: https://reviews.llvm.org/D79485
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so. Similarly, the transformation is not correct when
register lr is used.
This patch makes sure that r12 or lr are not used for indirect calls
when harden-sls-blr is enabled.
Differential Revision: https://reviews.llvm.org/D92469
To make sure that no barrier gets placed on the architectural execution
path, each indirect call calling the function in register rN, it gets
transformed to a direct call to __llvm_slsblr_thunk_mode_rN. mode is
either arm or thumb, depending on the mode of where the indirect call
happens.
The llvm_slsblr_thunk_mode_rN thunk contains:
bx rN
<speculation barrier>
Therefore, the indirect call gets split into 2; one direct call and one
indirect jump.
This transformation results in not inserting a speculation barrier on
the architectural execution path.
The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.
As a linker is allowed to clobber r12 on function calls, the
above code transformation is not correct in case a linker does so.
Similarly, the transformation is not correct when register lr is used.
Avoiding r12/lr being used is done in a follow-on patch to make
reviewing this code easier.
Differential Revision: https://reviews.llvm.org/D92468
The only non-trivial consideration in this patch is that the formation
of TBB/TBH instructions, which is done in the constant island pass, does
not understand the speculation barriers inserted by the SLSHardening
pass. As such, when harden-sls-retbr is enabled for a function, the
formation of TBB/TBH instructions in the constant island pass is
disabled.
Differential Revision: https://reviews.llvm.org/D92396
Some processors may speculatively execute the instructions immediately
following indirect control flow, such as returns, indirect jumps and
indirect function calls.
To avoid a potential miss-speculatively executed gadget after these
instructions leaking secrets through side channels, this pass places a
speculation barrier immediately after every indirect control flow where
control flow doesn't return to the next instruction, such as returns and
indirect jumps, but not indirect function calls.
Hardening of indirect function calls will be done in a later,
independent patch.
This patch is implementing the same functionality as the AArch64 counter
part implemented in https://reviews.llvm.org/D81400.
For AArch64, returns and indirect jumps only occur on RET and BR
instructions and hence the function attribute to control the hardening
is called "harden-sls-retbr" there. On AArch32, there is a much wider
variety of instructions that can trigger an indirect unconditional
control flow change. I've decided to stick with the name
"harden-sls-retbr" as introduced for the corresponding AArch64
mitigation.
This patch implements this for ARM mode. A future patch will extend this
to also support Thumb mode.
The inserted barriers are never on the correct, architectural execution
path, and therefore performance overhead of this is expected to be low.
To ensure these barriers are never on an architecturally executed path,
when the harden-sls-retbr function attribute is present, indirect
control flow is never conditionalized/predicated.
On targets that implement that Armv8.0-SB Speculation Barrier extension,
a single SB instruction is emitted that acts as a speculation barrier.
On other targets, a DSB SYS followed by a ISB is emitted to act as a
speculation barrier.
These speculation barriers are implemented as pseudo instructions to
avoid later passes to analyze them and potentially remove them.
The mitigation is off by default and can be enabled by the
harden-sls-retbr subtarget feature.
Differential Revision: https://reviews.llvm.org/D92395
This patch adds support for the fptoui.sat and fptosi.sat intrinsics,
which provide basically the same functionality as the existing fptoui
and fptosi instructions, but will saturate (or return 0 for NaN) on
values unrepresentable in the target type, instead of returning
poison. Related mailing list discussion can be found at:
https://groups.google.com/d/msg/llvm-dev/cgDFaBmCnDQ/CZAIMj4IBAAJ
The intrinsics have overloaded source and result type and support
vector operands:
i32 @llvm.fptoui.sat.i32.f32(float %f)
i100 @llvm.fptoui.sat.i100.f64(double %f)
<4 x i32> @llvm.fptoui.sat.v4i32.v4f16(half %f)
// etc
On the SelectionDAG layer two new ISD opcodes are added,
FP_TO_UINT_SAT and FP_TO_SINT_SAT. These opcodes have two operands
and one result. The second operand is an integer constant specifying
the scalar saturation width. The idea here is that initially the
second operand and the scalar width of the result type are the same,
but they may change during type legalization. For example:
i19 @llvm.fptsi.sat.i19.f32(float %f)
// builds
i19 fp_to_sint_sat f, 19
// type legalizes (through integer result promotion)
i32 fp_to_sint_sat f, 19
I went for this approach, because saturated conversion does not
compose well. There is no good way of "adjusting" a saturating
conversion to i32 into one to i19 short of saturating twice.
Specifying the saturation width separately allows directly saturating
to the correct width.
There are two baseline expansions for the fp_to_xint_sat opcodes. If
the integer bounds can be exactly represented in the float type and
fminnum/fmaxnum are legal, we can expand to something like:
f = fmaxnum f, FP(MIN)
f = fminnum f, FP(MAX)
i = fptoxi f
i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN
If the bounds cannot be exactly represented, we expand to something
like this instead:
i = fptoxi f
i = select f ult FP(MIN), MIN, i
i = select f ogt FP(MAX), MAX, i
i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN
It should be noted that this expansion assumes a non-trapping fptoxi.
Initial tests are for AArch64, x86_64 and ARM. This exercises all of
the scalar and vector legalization. ARM is included to test float
softening.
Original patch by @nikic and @ebevhan (based on D54696).
Differential Revision: https://reviews.llvm.org/D54749
Add tests for this particular detail for x86 and arm (similar tests
already existed for x86_64 and aarch64).
The libssp implementation may be located in a separate DLL, and in
those cases, the references need to be in a .refptr stub, to avoid
needing to touch up code in the text section at runtime (which is
supported but inefficient for x86, and unsupported for arm).
Differential Revision: https://reviews.llvm.org/D92738
Move fold of (sext (not i1 x)) -> (add (zext i1 x), -1) from X86 to DAGCombiner to improve codegen on other targets.
Differential Revision: https://reviews.llvm.org/D91589
This does not deserve special handling. The code should be added to Clang
instead if deemed useful. With this simplification, we can additionally delete
the PIC extern_weak special case.
clang/lib/CodeGen/CodeGenModule sets dso_local on applicable function declarations,
we don't need to duplicate the work in TargetMachine:shouldAssumeDSOLocal.
(Actually the long-term goal (started by r324535) is to drop TargetMachine::shouldAssumeDSOLocal.)
By not implying dso_local, we will respect dso_local/dso_preemptable specifiers
set by the frontend. This allows the proposed -fno-direct-access-external-data
option to work with -fno-pic and prevent a canonical PLT entry (SHN_UNDEF with non-zero st_value)
when taking the address of a function symbol.
This patch should be NFC in terms of the Clang emitted assembly because the case
we don't set dso_local is a case Clang sets dso_local. However, some tests don't
set dso_local on some function declarations and expose some differences. Most
tests have been fixed to be more robust in the previous commit.