Commit Graph

4359 Commits

Author SHA1 Message Date
David Green dd5c52029d [CPG][ARM] Optimize towards branch on zero in codegenprepare
This adds a simple fold into codegenprepare that converts comparisons
feeding branches into comparisons with zero if possible. For example:
  %c = icmp ult %x, 8
  br %c, bla, blb
  %tc = lshr %x, 3
becomes
  %tc = lshr %x, 3
  %c = icmp eq %tc, 0
  br %c, bla, blb

As a first order approximation, this can reduce the number of
instructions needed to perform the branch as the shift is (often) needed
anyway. At the moment this does not affect very much, as llvm tends to
prefer the opposite form. But it can protect against regressions from
commits like rG9423f78240a2.

Simple cases of Add and Sub are handled along with Shift, as the
comparison to zero can likewise often be folded into the CPSR flags.
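
For illustration, the kind of IR this fold targets could look like the sketch below (function and label names are made up):

```llvm
define i32 @count_blocks(i32 %x) {
entry:
  ; Before the fold the branch uses a compare against 8, even though the
  ; shifted value is computed anyway. After the fold, %c becomes
  ; "icmp eq i32 %tc, 0", reusing %tc.
  %c = icmp ult i32 %x, 8
  %tc = lshr i32 %x, 3
  br i1 %c, label %none, label %some

none:
  ret i32 0

some:
  ret i32 %tc
}
```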

Differential Revision: https://reviews.llvm.org/D101778
2021-05-16 17:54:06 +01:00
David Green d539357e1b [ARM] Extra branch on zero tests. NFC 2021-05-16 17:22:52 +01:00
Tomas Matheson 34c098b780 [ARM] Prevent spilling between ldrex/strex pairs
Based on the same for AArch64: 4751cadcca

At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.

Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support. However, the 4-byte encodings
are not present for ARM v8-M Baseline. To enable this, two new pseudos are
added for Thumb which are only valid for v8mbase, tCMP_SWAP_8 and
tCMP_SWAP_16.

The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (adding another implementation of atomic operations and more
divergence between backends) I have chosen to follow the approach taken in
D101163.
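
A minimal kind of input that hits this path at -O0 is a plain atomicrmw (sketch only):

```llvm
define i32 @fetch_add(i32* %p, i32 %v) {
  ; At -O0 this was expanded by AtomicExpandPass into a ldrex/strex loop,
  ; and the fast register allocator could insert a spill between the pair.
  %old = atomicrmw add i32* %p, i32 %v seq_cst
  ret i32 %old
}
```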

Differential Revision: https://reviews.llvm.org/D101898

Depends on D101912
2021-05-12 09:43:21 +01:00
Tomas Matheson edf9d88266 [ARM] Precommit test for D101898
Differential Revision: https://reviews.llvm.org/D101912
2021-05-12 09:43:21 +01:00
Arthur Eubanks 16748bd2fb [TargetLowering] Only inspect attributes in the arguments for ArgListEntry
Parameter attributes are considered part of the function [1], and like
mismatched calling conventions [2], we can't have the verifier check for
mismatched parameter attributes.

[1] https://llvm.org/docs/LangRef.html#parameter-attributes
[2] https://llvm.org/docs/FAQ.html#why-does-instcombine-simplifycfg-turn-a-call-to-a-function-with-a-mismatched-calling-convention-into-unreachable-why-not-make-the-verifier-reject-it

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D101806
2021-05-10 12:35:11 -07:00
Momchil Velikov 5c7b43aa82 [clang][AArch32] Correctly align HA arguments when passed on the stack
Analogously to https://reviews.llvm.org/D98794 this patch uses the
`alignstack` attribute to fix incorrect passing of homogeneous
aggregate (HA) arguments on AArch32. The EABI/AAPCS was recently
updated to clarify how VFP co-processor candidates are aligned:
4488e34998

Differential Revision: https://reviews.llvm.org/D100853
2021-05-10 16:28:46 +01:00
David Green 76786037c6 [ARM] Fix postinc of vst1xN
These nodes are not handled correctly by CombineBaseUpdate. For the
moment, similar to 5f1cad4d29 mark them as unsupported.
2021-05-09 21:57:55 +01:00
Matt Arsenault fa0b93b5a0 GlobalISel: Use DAG call lowering infrastructure in a more compatible way
Unfortunately the current call lowering code is built on top of the
legacy MVT/DAG based code. However, GlobalISel was not using it the
same way. In short, the DAG passes legalized types to the assignment
function, and GlobalISel was passing the original raw type if it was
simple.

I do believe the DAG lowering is conceptually broken since it requires
picking a type up front before knowing how/where the value will be
passed. This ends up being a problem for AArch64, which wants to pass
i1/i8/i16 values as a different size if passed on the stack or in
registers.

The argument type decision is split across 3 different places which is
hard to follow. SelectionDAG builder uses
getRegisterTypeForCallingConv to pick a legal type, tablegen gives the
illusion of controlling the type, and the target may have additional
hacks in the C++ part of the call lowering. AArch64 hacks around this
by not using the standard AnalyzeFormalArguments and special casing
i1/i8/i16 by looking at the underlying type of the original IR
argument.

I believe people have generally assumed the calling convention code is
processing the original types, and I've discovered a number of dead
paths in several targets.

x86 actually relies on the opposite behavior from AArch64, and relies
on x86_32 and x86_64 sharing calling convention code where the 64-bit
cases implicitly do not work on x86_32 due to using the pre-legalized
types.

AMDGPU targets without legal i16/f16 have always used a broken ABI
that promotes to i32/f32. GlobalISel accidentally fixed this to be the
ABI we should have, but this fixes it so we're using the worse ABI
that is compatible with the DAG. Ideally we would fix the DAG to match
the old GlobalISel behavior, but I don't wish to fight that battle.

A new native GlobalISel call lowering framework should let the target
process the incoming types directly.

CCValAssigns select a "ValVT" and "LocVT" but the meanings of these
aren't entirely clear. Different targets don't use them consistently,
even within their own call lowering code. My current belief is the
intent was "ValVT" is supposed to be the legalized value type to use
in the end, and LocVT was supposed to be the ABI passed type
(which is also legalized).

With the default CCState::Analyze functions always passing the same
type for these arguments, these only differ when the TableGen part of
the lowering decides to promote the type from one legal type to
another. AArch64's i1/i8/i16 hack ends up inverting the meanings of
these values, so I had to add an additional hack to let the target
interpret how large the argument memory is.

Since targets don't consistently interpret ValVT and LocVT, this
doesn't produce quite equivalent code to the initial DAG
lowerings. I've opted to consistently interpret LocVT as the in-memory
size for stack passed values, and ValVT as the register type to assign
from that memory. We therefore produce extending loads directly out of
the IRTranslator, whereas the DAG would emit regular loads of smaller
values. This will also produce loads/stores that are wider than the
argument value if the allocated stack slot is larger (and there will
be undef padding bytes). If we had the optimizations to reduce
load/stores based on truncated values, this wouldn't produce a
different end result.
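
As a rough sketch of the case described above, an i8 argument that lands on the stack once the integer registers are exhausted (hypothetical example):

```llvm
define i8 @stack_arg(i64 %a, i64 %b, i64 %c, i64 %d,
                     i64 %e, i64 %f, i64 %g, i64 %h, i8 %x) {
  ; With the integer argument registers exhausted, %x is passed on the stack.
  ; The stack slot may be wider than one byte, so the IRTranslator now emits
  ; an extending load for it (with undef padding in the unused bytes).
  ret i8 %x
}
```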

Since ValVT/LocVT are more consistently interpreted, we now will emit
more G_BITCASTS as requested by the CCAssignFn. For example AArch64
was directly assigning types to some physical vector registers which
according to the tablegen spec should have been cast to a vector
with a different element type.

This also moves the responsibility for inserting
G_ASSERT_SEXT/G_ASSERT_ZEXT from the target ValueHandlers into the
generic code, which is closer to how SelectionDAGBuilder works.

I had to xfail an x86 test since I don't see a quick way to fix it
right now (I filed bug 50035 for this). It's broken independently of
this change, and only triggers since now we end up with more ands
which hit the improperly handled selection pattern.

I also observed that FP arguments that need promotion (e.g. f16 passed
as f32) are broken, and use regular G_TRUNC and G_ANYEXT.

TLDR; the current call lowering infrastructure is bad and nobody has
ever understood how it chooses types.
2021-05-05 17:35:02 -04:00
Simon Moll 1db4dbba24 Recommit "[VP,Integer,#2] ExpandVectorPredication pass"
This reverts the revert 02c5ba8679

Fix:

Pass was registered as DUMMY_FUNCTION_PASS causing the newpm-pass
functions to be doubly defined. Triggered in -DLLVM_ENABLE_MODULES=1
builds.

Original commit:

This patch implements expansion of llvm.vp.* intrinsics
(https://llvm.org/docs/LangRef.html#vector-predication-intrinsics).

VP expansion is required for targets that do not implement VP code
generation. Since expansion is controllable with TTI, targets can switch
on the VP intrinsics they do support in their backend offering a smooth
transition strategy for VP code generation (VE, RISC-V V, ARM SVE,
AVX512, ..).
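
For reference, a minimal example of a VP intrinsic that this pass can expand (sketch; names made up):

```llvm
declare <4 x i32> @llvm.vp.add.v4i32(<4 x i32>, <4 x i32>, <4 x i1>, i32)

define <4 x i32> @vp_add(<4 x i32> %a, <4 x i32> %b, <4 x i1> %m, i32 %evl) {
  ; On targets without native VP support, the expansion pass lowers this to an
  ; ordinary add, folding the %evl operand into the mask where needed.
  %r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i1> %m, i32 %evl)
  ret <4 x i32> %r
}
```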

Reviewed By: rogfer01

Differential Revision: https://reviews.llvm.org/D78203
2021-05-04 11:47:52 +02:00
Tomas Matheson 9d86095ff8 Revert "[CodeGen][ARM] Implement atomicrmw as pseudo operations at -O0"
This reverts commit 753185031d.
2021-05-03 21:48:20 +01:00
Tomas Matheson 753185031d [CodeGen][ARM] Implement atomicrmw as pseudo operations at -O0
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.

To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.

Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
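
A small IR input that exercises the f16 legalisation path described above might be (sketch):

```llvm
define half @fadd_f16(half* %p, half %v) {
  ; The f16 ATOMIC_LOAD_FADD produced here is legalised via f32 as above.
  %old = atomicrmw fadd half* %p, half %v seq_cst
  ret half %old
}
```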

Differential Revision: https://reviews.llvm.org/D101164

Originally submitted as 3338290c18.
Reverted in c7df6b1223.
2021-05-03 20:25:15 +01:00
Adrian Prantl 02c5ba8679 Revert "[VP,Integer,#2] ExpandVectorPredication pass"
This reverts commit 43bc584dc0.

The commit broke the -DLLVM_ENABLE_MODULES=1 builds.

http://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/31603/consoleFull#2136199809a1ca8a51-895e-46c6-af87-ce24fa4cd561
2021-04-30 17:02:28 -07:00
Tomas Matheson c7df6b1223 Revert "[CodeGen][ARM] Implement atomicrmw as pseudo operations at -O0"
This reverts commit 3338290c18.

Broke expensive checks on debian.
2021-04-30 16:53:14 +01:00
Tomas Matheson 3338290c18 [CodeGen][ARM] Implement atomicrmw as pseudo operations at -O0
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.

To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.

Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)

Differential Revision: https://reviews.llvm.org/D101164
2021-04-30 16:40:33 +01:00
Simon Moll 43bc584dc0 [VP,Integer,#2] ExpandVectorPredication pass
This patch implements expansion of llvm.vp.* intrinsics
(https://llvm.org/docs/LangRef.html#vector-predication-intrinsics).

VP expansion is required for targets that do not implement VP code
generation. Since expansion is controllable with TTI, targets can switch
on the VP intrinsics they do support in their backend offering a smooth
transition strategy for VP code generation (VE, RISC-V V, ARM SVE,
AVX512, ..).

Reviewed By: rogfer01

Differential Revision: https://reviews.llvm.org/D78203
2021-04-30 15:47:28 +02:00
David Green 94c7bd7eb2 [ARM] Expand VMOVRRD simplification pattern
This expands the VMOVRRD(extract(..(build_vector(a, b, c, d)))) pattern,
to also handle insert_vectors. Providing we can find the correct insert,
this helps further simplify patterns by removing the redundant VMOVRRD.

Differential Revision: https://reviews.llvm.org/D100245
2021-04-26 12:27:38 +01:00
Dávid Bolvanský ef2dc7ed9f [Analysis] Attribute alignment should not prevent tail call optimization
Fixes tail folding issue mentioned in D100879.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D101230
2021-04-24 19:57:42 +02:00
David Green 21a8b9d9e9 [ARM] Limit PerformExtractEltToVMOVRRD to when f64 is legal.
The generic SoftFloatVectorExtract.ll test was failing when run on arm
machines, as it tries to create a f64 under soft float. Limit the
transform to when f64 is legal.

Also add a missing override, as reported in D100244.
2021-04-20 16:24:36 +01:00
David Green 48cef1fa8e [ARM] Create VMOVRRD from adjacent vector extracts
This adds a combine for extract(x, n); extract(x, n+1)  ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction, and thanks to the other VMOVRRD folds
we have added recently can help reduce the amount of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.

This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break as not all DPR regs can be extracted from
directly.  Otherwise the machine verifier is unhappy.
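
A sketch of the kind of adjacent-extract pattern this combine targets (hypothetical function):

```llvm
define i64 @extract_pair(<4 x i32> %v) {
  ; Lanes 2 and 3 are adjacent, so they can be moved together with a single
  ; VMOVRRD of the second d-register instead of two separate lane moves.
  %a = extractelement <4 x i32> %v, i32 2
  %b = extractelement <4 x i32> %v, i32 3
  %az = zext i32 %a to i64
  %bz = zext i32 %b to i64
  %bs = shl i64 %bz, 32
  %r = or i64 %az, %bs
  ret i64 %r
}
```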

Differential Revision: https://reviews.llvm.org/D100244
2021-04-20 15:15:43 +01:00
David Green 806b47ade3 [ARM] Regenerate a couple of tests. NFC 2021-04-20 10:54:41 +01:00
David Penry ca8eef7e3d [CodeGen] Use ProcResGroup information in SchedBoundary
When the ProcResGroup has BufferSize=0,

1. if there is a subunit in the list of write resources for the
   scheduling class, do not attempt to schedule the ProcResGroup.
2. if there is not a subunit in the list of write resources for the
   scheduling class, choose a subunit to use instead of the ProcResGroup.
3. having both the ProcResGroup and any of its subunits in the resources
   implied by an InstRW is not supported.

Used to model parallel uses from a pool of resources.

Differential Revision: https://reviews.llvm.org/D98976
2021-04-19 21:27:45 +01:00
David Penry 78a871abf7 [ARM] Use ProcResGroup in Cortex-M7 scheduling model
Used to model structural hazards on FP issue, where some
instructions take up two issue slots and others one, as well
as similar structural hazards on load issue, where some
instructions take up two load lanes and others one.

Differential Revision: https://reviews.llvm.org/D98977
2021-04-19 21:23:05 +01:00
Tim Northover 5e3d9fcc3a StackProtector: ensure protection does not interfere with tail call frame.
The IR stack protector pass must insert stack checks before the call instead of
between it and the return.

Similarly, the SDAG pass should recognize that ADJCALLFRAME instructions could be
part of the terminal sequence of a tail call. In this case, because such call
frames cannot be nested in LLVM, the stack protection code must skip over the
whole sequence (or risk clobbering argument registers).
2021-04-13 15:14:57 +01:00
Arthur Eubanks c88b87f9ce Revert "Remove "Rewrite Symbols" from codegen pipeline"
This reverts commit 6210261ecb.

addr-label.ll crashes on armv7.
2021-04-10 23:28:16 -07:00
Arthur Eubanks 6210261ecb Remove "Rewrite Symbols" from codegen pipeline
It breaks up the function pass manager in the codegen pipeline.

With empty parameters, it looks at the -mllvm flag -rewrite-map-file.
This is likely not in use.

Add a check that we only have one function pass manager in the codegen
pipeline.

This required reverting commit 9583a3f2625818b78c0cf6d473cdedb9f23ad82c:
"[AsmPrinter] Delete dead takeDeletedSymbsForFunction()".
This was not NFC as initially thought. By coalescing two function
pass managers, this exposed the reverted code as necessary.
addr-label.ll was crashing due to an emitted blockaddress's block being
removed but the label not emitted.

Some tests relied on the fact that we had a module pass somewhere in the
codegen pipeline.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D99707
2021-04-10 22:38:44 -07:00
Simonas Kazlauskas 777a58e05b Support {S,U}REMEqFold before legalization
This allows these optimisations to apply to e.g. `urem i16` directly
before `urem` is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as AArch64). Legalization can then
happen more directly, and the generated code gets a chance to avoid
wasting time computing results in types wider than necessary.

Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}.
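
A minimal sketch of the kind of IR this now applies to before promotion (names made up):

```llvm
define i1 @is_multiple_of_5(i16 %x) {
  ; On targets where i16 is not legal, the fold can now fire on the i16 urem
  ; itself rather than on the form left after promotion to i32.
  %r = urem i16 %x, 5
  %c = icmp eq i16 %r, 0
  ret i1 %c
}
```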

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D88785
2021-04-01 01:35:41 +03:00
David Green 7b6f760fcd [ARM] MVE vector lane interleaving
MVE does not have a single sext/zext or trunc instruction that takes the
bottom half of a vector and extends to a full width, like NEON has with
MOVL. Instead it is expected that this happens through top/bottom
instructions. So the MVE equivalent VMOVLT/B instructions take either
the even or odd elements of the input and extend them to the larger
type, producing a vector with half the number of elements each of double
the bitwidth. As there is no simple instruction for a normal extend, we
often have to expand sext/zext/trunc into a series of lane moves (or
stack loads/stores, which we do not do yet).

This pass takes vector code that starts at truncs, looks for
interconnected blobs of operations that end with sext/zext and
transforms them by adding shuffles so that the lanes are interleaved and
the MVE VMOVL/VMOVN instructions can be used. This is done pre-ISel so
that it can work across basic blocks.

This initial version of the pass just handles a limited set of
instructions, not handling constants or splats or FP, which can all come
as extensions to this base.
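
A small sketch of the sort of trunc/sext chain the pass looks for (hypothetical example):

```llvm
define <8 x i16> @widen_add_narrow(<8 x i16> %a, <8 x i16> %b) {
  ; A trunc fed by operations on sign-extended values; the pass interleaves the
  ; lanes so MVE VMOVLB/VMOVLT and VMOVNB/VMOVNT can be used instead of a long
  ; series of lane moves.
  %sa = sext <8 x i16> %a to <8 x i32>
  %sb = sext <8 x i16> %b to <8 x i32>
  %sum = add <8 x i32> %sa, %sb
  %r = trunc <8 x i32> %sum to <8 x i16>
  ret <8 x i16> %r
}
```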

Differential Revision: https://reviews.llvm.org/D95804
2021-03-28 19:34:58 +01:00
Simon Pilgrim 9d2df96407 [DAG] computeKnownBits - add ISD::MULHS/MULHU/SMUL_LOHI/UMUL_LOHI handling
Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops.

Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement.
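
The 'extend + multiply + extract high bits' pattern, written out in IR as a sketch:

```llvm
define i32 @mulhu32(i32 %a, i32 %b) {
  ; Extend, multiply, then extract the high 32 bits: this is the form that maps
  ; to ISD::MULHU, and known bits of the result can now be computed.
  %ae = zext i32 %a to i64
  %be = zext i32 %b to i64
  %m = mul i64 %ae, %be
  %hi = lshr i64 %m, 32
  %r = trunc i64 %hi to i32
  ret i32 %r
}
```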

Differential Revision: https://reviews.llvm.org/D98857
2021-03-19 16:02:31 +00:00
Simon Pilgrim d9b5338cfb [ARM] Regenerate select-imm.ll tests 2021-03-18 11:07:16 +00:00
David Green 402f2cae7d [ARM] Use ldrsh for more thumb1 loads.
Given a sextload i16, we can usually generate "ldrsh [rn, rm]". If we
don't naturally have a rn, rm addressing mode, we can either generate
"ldrh [rn, #0]; sxth" or "mov rm, #0; ldrsh [rn, rm]".

We currently generate the first, always creating a sxth. They are both
the same number of instructions, but if we generate the second then the
mov #0 will likely be CSE'd or pulled out of a loop, etc.

This adjusts the ISel patterns to do that, creating a mov instead of a
sxth.
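
A minimal sketch of the sextload i16 case in question (hypothetical function):

```llvm
define i32 @load_sext_i16(i16* %p) {
  ; A sextload i16 with only a base register available; the change prefers
  ; "mov rm, #0; ldrsh [rn, rm]" over "ldrh [rn, #0]; sxth" for thumb1.
  %l = load i16, i16* %p
  %s = sext i16 %l to i32
  ret i32 %s
}
```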

Differential Revision: https://reviews.llvm.org/D98693
2021-03-17 15:29:02 +00:00
Simonas Kazlauskas a2eca31da2 Test cases for rem-seteq fold with illegal types
This also briefly tests a larger set of architectures than the more
exhaustive functionality tests for AArch64 and x86.

As requested in D88785

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98339
2021-03-12 16:28:04 +02:00
Daniel Sanders 134a179dee [mir] Change 'undef' for MMO base addresses to 'unknown-address'
Differential Revision: https://reviews.llvm.org/D98100
2021-03-10 16:46:44 -08:00
Craig Topper 0eb405c3b8 [SelectionDAG] Add computeKnownBits support for ISD::USUBSAT.
The result of ISD::USUBSAT will never be larger than the LHS. We
can use this to put a bound on the number of leading zeros.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98133
2021-03-07 09:48:42 -08:00
Daniel Sanders 9fc2be6f28 [mir] Fix confusing MIR when MMO's value is nullptr but offset is non-zero
:: (store 1 + 4, addrspace 1)
->
:: (store 1 into undef + 4, addrspace 1)

An offset without a base isn't terribly useful but it's convenient to update
the offset without checking the value. For example, when breaking apart
stores into smaller units

Differential Revision: https://reviews.llvm.org/D97812
2021-03-04 10:34:30 -08:00
Arthur Eubanks 040c1b49d7 Move EntryExitInstrumentation pass location
This seems to be more of a Clang thing than a generic LLVM thing,
so this moves it out of the LLVM pipelines and instead adds it as a
Clang extension hook into the LLVM pipelines.

Move the post-inline EEInstrumentation out of the backend pipeline and
into a late pass, similar to other sanitizer passes. It doesn't fit
into the codegen pipeline.

Also fix up EntryExitInstrumentation not running at -O0 under the new
PM. PR49143

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D97608
2021-03-01 10:08:10 -08:00
Craig Topper fe50be12c8 [LegalizeIntegerTypes] Further improve ExpandIntRes_SADDSUBO for targets where SADDO/SSUBO aren't supported.
Rather than converting 3 signbits to bools and comparing them,
we can do bitwise logic on the whole vector and convert the
resulting sign bit to a bool at the end.

This is still a different algorithm than what we do in LegalizeDAG
through expandSADDOSSUBO. That algorithm needs to know that the
RHS of SSUBO is > 0, but that's costly when the type is split.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97325
2021-02-24 10:05:38 -08:00
David Green 03892a27d6 [ARM] Expand the range of allowed post-incs in load/store optimizer
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.

This is a recommit of 3b34b06fc5 with correctness fixes and extra
tests.

Differential Revision: https://reviews.llvm.org/D95885
2021-02-24 08:46:15 +00:00
David Green 8fa2bbaed9 [ARM] Mir test for pre/postinc ldstopt combines. NFC 2021-02-23 22:27:06 +00:00
Craig Topper eb165090bb [LegalizeIntegerTypes] Improve ExpandIntRes_SADDSUBO codegen on targets without SADDO/SSUBO.
This code creates 3 setccs that need to be expanded. It was
creating a sign bit test as setge X, 0 which is non-canonical.
Canonical would be setgt X, -1. This misses the special case in
IntegerExpandSetCCOperands for sign bit tests that assumes
canonical form. If we don't hit this special case we end up
with a multipart setcc instead of just checking the sign of
the high part.

To fix this I've reversed the polarity of all of the setccs to
setlt X, 0 which is canonical. The rest of the logic should
still work. This seems to produce better code on RISCV which
lacks a setgt instruction.

This probably still isn't the best code sequence we could use here.
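
A sketch of an input that reaches this expansion when the type is split on a 32-bit target:

```llvm
declare { i64, i1 } @llvm.sadd.with.overflow.i64(i64, i64)

define i1 @saddo_i64(i64 %a, i64 %b) {
  ; On a 32-bit target without SADDO, the i64 is split and the overflow bit is
  ; rebuilt from sign-bit tests, which this patch keeps in canonical form.
  %res = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %a, i64 %b)
  %ov = extractvalue { i64, i1 } %res, 1
  ret i1 %ov
}
```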

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97181
2021-02-23 09:40:32 -08:00
Sjoerd Meijer e1c3bf6afe [ARM] do not consider sp as deprecated for ldm/stm
Early versions of the ARMv7 reference manuals considered the sp register
as a deprecated register for the ldm/stm family of instructions. However,
later versions such as ARM DDI 0406C.d added a note to the Appendix:

D9.3 Use of the SP as a general-purpose register
Most ARM instructions, unlike Thumb instructions, provide exactly the
same access to the SP as to R0-R12. This means that it is possible to
use the SP as a general-purpose register.  Earlier issues of this manual
deprecated the use of SP in an ARM instruction, in any way that is
deprecated, not permitted, or not possible in the corresponding
Thumb instruction. However, user feedback indicates a number of cases
where these instructions are useful. Therefore, ARM no longer deprecates
these instruction uses.
Also Armv8 manuals no longer consider SP as a deprecated register for
ldm/stm A32 instructions.

Furthermore, GNU as also does not print a deprecated warning when using
SP with those instructions.

Drop deprecation warning for pop/ldm/push/stm instructions.

Patch by: Stefan Agner.

Differential Revision: https://reviews.llvm.org/D82692
2021-02-23 13:26:18 +00:00
David Green ebb6583e02 [ARM] Add pre/post inc tests of various sizes. NFC 2021-02-23 10:53:22 +00:00
David Green 7a5c26e99a Revert "[ARM] Expand the range of allowed post-incs in load/store optimizer"
This reverts commit 3b34b06fc5 as runtime
errors were reported.
2021-02-19 13:15:10 +00:00
David Green 3b34b06fc5 [ARM] Expand the range of allowed post-incs in load/store optimizer
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.

Differential Revision: https://reviews.llvm.org/D95885
2021-02-18 14:59:02 +00:00
David Green 1a6744e3dc [ARM] Add larger than legal ICmp costs
A v8i32 compare will produce a v8i1 predicate, but during codegen the
v8i32 will be split into two v4i32, potentially requiring two v4i1
predicates to be merged into a single v8i1. Because this merging of two
v4i1's into a v8i1 is very expensive, we need to make the cost of the
compare equally high.

This patch adds the cost of that to ARMTTIImpl::getCmpSelInstrCost.
Because we don't know whether the user of the predicate can be split,
and the cost model is mostly pre-instruction, we may be pessimistic but
that should only be for larger and legal types. This also adds min/max
detection to the costmodel where it can be detected, to keep those in
line with the cost of simple min/max instructions. Otherwise for the
most part, costs that were already expensive have become more expensive.
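
A sketch of a larger-than-legal compare of the kind being costed here:

```llvm
define <8 x i1> @cmp_v8i32(<8 x i32> %a, <8 x i32> %b) {
  ; Under MVE the v8i32 operands are split into two v4i32 halves, and merging
  ; the two v4i1 results into a single v8i1 predicate is expensive.
  %c = icmp ult <8 x i32> %a, %b
  ret <8 x i1> %c
}
```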

Differential Revision: https://reviews.llvm.org/D96692
2021-02-18 11:42:17 +00:00
David Green 1fbb3287fc [ARM] MVE ICmp costing tests. NFC 2021-02-18 10:50:34 +00:00
Sjoerd Meijer f78aa8b2c2 [LSR] Add a flag that overrides the target's preferred addressing mode
This adds a new flag -lsr-preferred-addressing-mode to override the target's
preferred addressing mode. It replaces flag -lsr-backedge-indexing, which is
equivalent to preindexed addressing that is one of the options that
-lsr-preferred-addressing-mode accepts.

Differential Revision: https://reviews.llvm.org/D96855
2021-02-17 16:50:21 +00:00
David Green a838a4f69f [ARM] Extend search for increment in load/store optimizer
Currently the findIncDecAfter will only look at the next instruction for
post-inc candidates in the load/store optimizer. This extends that to a
search through the current BB, until an instruction that modifies or
uses the increment reg is found. This allows more post-inc load/stores
and ldm/stm's to be created, especially in cases where a schedule might
move instructions further apart.

We make sure not to look any further for an SP, as that might invalidate
stack slots that are still in use.

Differential Revision: https://reviews.llvm.org/D95881
2021-02-15 13:17:21 +00:00
Simon Pilgrim 60ba5397df [DAG] PromoteIntRes_ADDSUBSHLSAT - use promoted ISD::USUBSAT directly
As discussed on D96413, as long as the promoted bits of the args are zero we can use the basic ISD::USUBSAT pattern directly, without the shifting like we do for other ops.

I think something similar should be possible for ISD::UADDSAT as well, which I'll look at later.

Also, create an ISD::USUBSAT node directly - this will be expanded back by the legalizer later on if necessary.
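
A minimal sketch of a saturating subtract whose result type needs promotion:

```llvm
declare i8 @llvm.usub.sat.i8(i8, i8)

define i8 @usubsat_i8(i8 %a, i8 %b) {
  ; When i8 is promoted to a wider legal type, the zero-extended operands can
  ; feed ISD::USUBSAT directly, without the shift adjustment other ops need.
  %r = call i8 @llvm.usub.sat.i8(i8 %a, i8 %b)
  ret i8 %r
}
```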

Differential Revision: https://reviews.llvm.org/D96622
2021-02-13 12:35:10 +00:00
Serge Pavlov 816053bc71 [FPEnv][ARM] Implement lowering of llvm.set.rounding
Differential Revision: https://reviews.llvm.org/D96501
2021-02-13 11:16:29 +07:00
Simon Pilgrim 4841a225b7 [DAG] Move basic USUBSAT pattern matches from X86 to DAGCombine
Begin transitioning the X86 vector code to recognise sub(umax(a,b), b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.
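
One of the patterns in question, written as an IR sketch (the umin form):

```llvm
declare <8 x i16> @llvm.umin.v8i16(<8 x i16>, <8 x i16>)

define <8 x i16> @usubsat_pattern(<8 x i16> %a, <8 x i16> %b) {
  ; sub(a, umin(a, b)) is an unsigned saturating subtract; it is now matched
  ; generically in DAGCombine rather than only in the X86 backend.
  %min = call <8 x i16> @llvm.umin.v8i16(<8 x i16> %a, <8 x i16> %b)
  %r = sub <8 x i16> %a, %min
  ret <8 x i16> %r
}
```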

This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.

We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.

The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.

Differential Revision: https://reviews.llvm.org/D96413
2021-02-12 18:22:57 +00:00
Lukas Sommer 6577cef9b0 [CodeGen] New pass: Replace vector intrinsics with call to vector library
This patch adds a pass to replace calls to vector intrinsics (i.e., LLVM
intrinsics operating on vector operands) with calls to a vector library.

Currently, calls to LLVM intrinsics are only replaced with calls to vector
libraries when scalar calls to intrinsics are vectorized by the Loop- or
SLP-Vectorizer.

With this pass, it is now possible to replace calls to LLVM intrinsics
already operating on vector operands, e.g., if such code was generated
by MLIR. For the replacement, information from the TargetLibraryInfo,
e.g., as specified via -vector-library is used.
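
A minimal sketch of the kind of call this pass can replace (hypothetical function; the actual mapping depends on the chosen vector library):

```llvm
declare <4 x float> @llvm.sin.v4f32(<4 x float>)

define <4 x float> @vec_sin(<4 x float> %x) {
  ; With a vector library selected via -vector-library, this intrinsic call can
  ; be rewritten into a call to the library's vector sine routine instead of
  ; being scalarised later.
  %r = call <4 x float> @llvm.sin.v4f32(<4 x float> %x)
  ret <4 x float> %r
}
```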

This is a re-try of the original commit 2303e93e66 that was reverted
due to pass manager problems. Other minor changes have also been made.

Differential Revision: https://reviews.llvm.org/D95373
2021-02-12 12:53:27 -05:00
Sanjay Patel c981f6f8e1 Revert "[Codegen][ReplaceWithVecLib] add pass to replace vector intrinsics with calls to vector library"
This reverts commit 2303e93e66.
Investigating bot failures.
2021-02-05 15:10:11 -05:00
Lukas Sommer 2303e93e66 [Codegen][ReplaceWithVecLib] add pass to replace vector intrinsics with calls to vector library
This patch adds a pass to replace calls to vector intrinsics
(i.e., LLVM intrinsics operating on vector operands) with
calls to a vector library.

Currently, calls to LLVM intrinsics are only replaced with
calls to vector libraries when scalar calls to intrinsics are
vectorized by the Loop- or SLP-Vectorizer.

With this pass, it is now possible to replace calls to LLVM
intrinsics already operating on vector operands, e.g., if
such code was generated by MLIR. For the replacement,
information from the TargetLibraryInfo, e.g., as specified
via -vector-library is used.

Differential Revision: https://reviews.llvm.org/D95373
2021-02-05 14:25:19 -05:00
Ayke van Laethem aecdf15cc7
[ARM] Do not emit ldrexd/strexd on Cortex-M chips
The ldrexd/strexd instructions are not supported on M-class chips, see
for example
https://developer.arm.com/documentation/dui0489/e/arm-and-thumb-instructions/memory-access-instructions/ldrex-and-strex
which says:

> All these 32-bit Thumb instructions are available in ARMv6T2 and
> above, except that LDREXD and STREXD are not available in the ARMv7-M
> architecture.

Looking at the ARMv8-M architecture, it appears that these instructions
aren't supported either. The Architecture Reference Manual lists
ldrex/strex but not ldrexd/strexd:
https://developer.arm.com/documentation/ddi0553/bn/

Godbolt example on LLVM 11.0.0, which incorrectly emits ldrexd/strexd
instructions: https://llvm.godbolt.org/z/5qqPnE
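
One kind of input that could previously pick ldrexd/strexd on these cores (sketch; the exact lowering depends on the subtarget):

```llvm
define i64 @atomic_load_i64(i64* %p) {
  ; A 64-bit atomic load; M-profile cores have no ldrexd/strexd, so a different
  ; lowering (for example an atomic libcall) has to be used instead.
  %v = load atomic i64, i64* %p seq_cst, align 8
  ret i64 %v
}
```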

Differential Revision: https://reviews.llvm.org/D95891
2021-02-04 21:55:34 +01:00
David Green 649a3d00df [ARM] Handle f16 in GeneratePerfectShuffle
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat a f16 vector as an i8.
Add f16 handling, treating them like an i16.

Differential Revision: https://reviews.llvm.org/D95446
2021-02-04 11:14:52 +00:00
David Green 5805521207 [ARM] Simplify VMOVRRD from extracts of buildvectors
Under the softfp calling convention, we are often left with
VMOVRRD(extract(bitcast(build_vector(a, b, c, d)))) for the return value
of the function. These can be simplified to a,b or c,d directly,
depending on the value of the extract.

Big endian is a little different because the bitcast switches the lanes
around, meaning we end up with b,a or d,c.

Differential Revision: https://reviews.llvm.org/D94989
2021-02-01 16:09:25 +00:00
David Green ad12e6ee95 [ARM] Turn sext_inreg(VGetLaneu) into VGetLaneu
This adds a DAG combine for converting sext_inreg of VGetLaneu into
VGetLanes, providing the types match correctly.

Differential Revision: https://reviews.llvm.org/D95073
2021-02-01 11:10:35 +00:00
Roman Lebedev ddc4b56eef
[ExpandMemCmpPass] Preserve Dominator Tree, if available
This finishes getting rid of all the avoidable Dominator Tree recalculations
in X86 optimized codegen pipeline.
2021-01-30 01:14:51 +03:00
Roman Lebedev 056385921d
[ScalarizeMaskedMemIntrin] Preserve Dominator Tree, if available
This de-pessimizes the arguably more usual case of no masked mem intrinsics,
and gets rid of one more Dominator Tree recalculation.

As per llvm/test/CodeGen/X86/opt-pipeline.ll,
there's one more Dominator Tree recalculation left that we could get rid of.
2021-01-29 01:11:36 +03:00
Roman Lebedev 6617529a1d
[CodeGen][DwarfEHPrepare] Preserve Dominator Tree
Now that D94827 has flipped the switch, and SimplifyCFG is officially marked
as production-ready regarding Dominator Tree preservation,
we can update this user pass to also preserve Dominator Tree.

This is a geomean compile-time win of `-0.05%`..`-0.08%`.
https://llvm-compile-time-tracker.com/compare.php?from=51a25846c198cff00abad0936f975167357afa6f&to=082499aac236a5c141e50a9e77870d5be2de5f0b&stat=instructions

Differential Revision: https://reviews.llvm.org/D95548
2021-01-28 14:11:34 +03:00
David Green c1c1944e69 [ARM] Regenerate constant hoisting test. NFC 2021-01-28 10:37:16 +00:00
Roman Lebedev 51a25846c1
[CodeGen] SafeStack: preserve DominatorTree if it is available
While this is mostly NFC right now, because only ARM happens
to run this pass with DomTree available before it,
and required after it, more backends will be affected once
the SimplifyCFG's switch for domtree preservation is flipped,
and DwarfEHPrepare also preserves the domtree.
2021-01-27 18:32:35 +03:00
David Green 9e2768a3d9 [ARM] Add neon FP16 scalar_to_vector patterns.
This adds some simple fp16 scalar_to_vector patterns, preventing a
selection failure if this came up.

Differential Revision: https://reviews.llvm.org/D95427
2021-01-27 09:59:15 +00:00
Adhemerval Zanella dad55c2218 [ARM] [ELF] Fix ARMMaterializeGV for Indirect calls
Recent shouldAssumeDSOLocal changes (introduced by 961f31d8ad)
no longer take the relocation model into consideration.  The ARM
fast-isel pass uses the function's return value to decide whether a global
symbol is loaded indirectly or not, and without the expected information
llvm now generates an extra load for the following code:

```
$ cat test.ll
@__asan_option_detect_stack_use_after_return = external global i32
define dso_local i32 @main(i32 %argc, i8** %argv) #0 {
entry:
  %0 = load i32, i32* @__asan_option_detect_stack_use_after_return,
align 4
  %1 = icmp ne i32 %0, 0
  br i1 %1, label %2, label %3

2:
  ret i32 0

3:
  ret i32 1
}

attributes #0 = { noinline optnone }

$ llc test.ll -o -
[...]
main:
        .fnstart
[...]
        movw    r0, :lower16:__asan_option_detect_stack_use_after_return
        movt    r0, :upper16:__asan_option_detect_stack_use_after_return
        ldr     r0, [r0]
        ldr     r0, [r0]
        cmp     r0, #0
[...]
```

And without 'optnone' it produces:
```
[...]
main:
        .fnstart
[...]
        movw    r0, :lower16:__asan_option_detect_stack_use_after_return
        movt    r0, :upper16:__asan_option_detect_stack_use_after_return
        ldr     r0, [r0]
        clz     r0, r0
        lsr     r0, r0, #5
        bx      lr

[...]
```

This triggered a lot of invalid memory accesses in sanitizers for
arm-linux-gnueabihf.  I checked this patch with both a stage1 built with
gcc and a stage2 bootstrap, and it fixes all the Linux sanitizer
issues.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D95379
2021-01-26 15:57:55 -03:00
David Green af03324984 [ARM] Disable sign extended SSAT pattern recognition.
I may have given bad advice, and skipping sext_inreg when matching SSAT
patterns is not valid on its own. It at least needs to sext_inreg the
input again, but as far as I can tell is still only valid based on
demanded bits. For the moment disable that part of the combine,
hopefully reimplementing it in the future more correctly.
2021-01-22 14:07:48 +00:00
David Green 476de8cea3 [ARM] Add new and regenerate SSAT tests. NFC
Some of these new tests should be creating SSAT. They will be fixed in a
followup.
2021-01-22 10:42:36 +00:00
Simon Pilgrim 69bc0990a9 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE (REAPPLIED).
Add DemandedElts support inside the TRUNCATE analysis.

REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d

Differential Revision: https://reviews.llvm.org/D56387
2021-01-21 13:01:34 +00:00
Hans Wennborg a51226057f Revert "[DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE"
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.

> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387

This reverts commit cad4275d69.
2021-01-20 20:06:55 +01:00
Simon Pilgrim cad4275d69 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE
Add DemandedElts support inside the TRUNCATE analysis.

Differential Revision: https://reviews.llvm.org/D56387
2021-01-20 15:39:58 +00:00
Yvan Roux 244ad228f3 [ARM][MachineOutliner] Add stack fixup feature
This patch handles cases where we have to save/restore the link register
to the stack and load/store instructions which use the stack are
part of the outlined region. It checks that there will be no overflow
introduced by the new offset and fixes up these instructions accordingly.

Differential Revision: https://reviews.llvm.org/D92934
2021-01-19 10:59:09 +01:00
David Green 69295815ed [ARM] Update test target triple. NFC 2021-01-18 16:36:00 +00:00
David Green 1454724215 [ARM] Align blocks that are not fallthrough targets
If the previous block in a function does not fall through, the nops added to
align the next block will never be executed. This means we can freely (except for
codesize) align more branches. This happens in constantislandspass (as
it cannot happen later) and only happens at aggressive optimization
levels as it does increase codesize.

Differential Revision: https://reviews.llvm.org/D94394
2021-01-16 22:19:35 +00:00
Oliver Stannard 3676ef1053 [ARM][GISel] Treat calls as variadic even if only fixed arguments provided
For the ARM hard-float calling convention, calls to variadic functions
need to be treated differently, even if only the fixed arguments are
provided.

This fixes GCC-C-execute-pr68390 in the test-suite, which is failing on
the ARM GlobaISel bot.
2021-01-15 09:37:01 +00:00
Sam Tebbs 60fda8ebb6 [ARM] Add a pass that re-arranges blocks when there is a backwards WLS branch
Blocks can be laid out such that a t2WhileLoopStart branches backwards. This is forbidden by the architecture and so it fails to be converted into a low-overhead loop. This new pass checks for these cases and moves the target block, fixing any fall-through that would then be broken.

Differential Revision: https://reviews.llvm.org/D92385
2021-01-13 17:23:00 +00:00
Mircea Trofin 05e90cefeb [NFC] Disallow unused prefixes under llvm/test/CodeGen
This patch finishes addressing unused prefixes under CodeGen: the 2
remaining tests are fixed, the lit.local.cfg changes under various
subdirs are undone, and the policy is moved under CodeGen.

Differential Revision: https://reviews.llvm.org/D94430
2021-01-11 12:32:18 -08:00
Paul Robinson be179b9946 [FastISel] NFC: Remove obsolete -fast-isel-sink-local-values option
This option is not used for anything after #c161665 (D91737).
This commit reapplies #a474657.
2021-01-11 09:32:49 -08:00
Paul Robinson c161775dec [FastISel] Flush local value map on every instruction
Local values are constants or addresses that can't be folded into
the instruction that uses them. FastISel materializes these in a
"local value" area that always dominates the current insertion
point, to try to avoid materializing these values more than once
(per block).

https://reviews.llvm.org/D43093 added code to sink these local
value instructions to their first use, which has two beneficial
effects. One, it is likely to avoid some unnecessary spills and
reloads; two, it allows us to attach the debug location of the
user to the local value instruction. The latter effect can
improve the debugging experience for debuggers with a "set next
statement" feature, such as the Visual Studio debugger and PS4
debugger, because instructions to set up constants for a given
statement will be associated with the appropriate source line.

There are also some constants (primarily addresses) that could be
produced by no-op casts or GEP instructions; the main difference
from "local value" instructions is that these are values from
separate IR instructions, and therefore could have multiple users
across multiple basic blocks. D43093 avoided sinking these, even
though they were emitted to the same "local value" area as the
other instructions. The patch comment for D43093 states:

  Local values may also be used by no-op casts, which adds the
  register to the RegFixups table. Without reversing the RegFixups
  map direction, we don't have enough information to sink these
  instructions.

This patch undoes most of D43093, and instead flushes the local
value map after(*) every IR instruction, using that instruction's
debug location. This avoids sometimes incorrect locations used
previously, and emits instructions in a more natural order.

In addition, constants materialized due to PHI instructions are
not assigned a debug location immediately; instead, when the
local value map is flushed, if the first local value instruction
has no debug location, it is given the same location as the
first non-local-value-map instruction.  This prevents PHIs
from introducing unattributed instructions, which would either
be implicitly attributed to the location for the preceding IR
instruction, or given line 0 if they are at the beginning of
a machine basic block.  Neither of those consequences is good
for debugging.

This does mean materialized values are not re-used across IR
instruction boundaries; however, only about 5% of those values
were reused in an experimental self-build of clang.

(*) Actually, just prior to the next instruction. It seems like
it would be cleaner the other way, but I was having trouble
getting that to work.

This reapplies commits cf1c774d and dc35368c, and adds the
modification to PHI handling, which should avoid problems
with debugging under gdb.

Differential Revision: https://reviews.llvm.org/D91734
2021-01-11 08:32:36 -08:00
Paul Robinson e5eb5c8a7f NFC: Use -LABEL more
There were a number of tests needing updates for D91734, and I added a
bunch of LABEL directives to help track down where those had to go.
These directives are an improvement independent of the functional
patch, so I'm committing them as their own separate patch.
2021-01-11 08:14:58 -08:00
David Green 1ae762469f [ARM] Update and regenerate test checks. NFC 2021-01-08 14:54:16 +00:00
David Green 63dce70b79 [ARM] Handle any extend whilst lowering addw/addl/subw/subl
Same as a9b6440edd, use zanyext to treat any_extends as zero extends
during lowering to create addw/addl/subw/subl nodes.

Differential Revision: https://reviews.llvm.org/D93835
2021-01-06 11:26:39 +00:00
David Green ddb82fc76c [ARM] Handle any extend whilst lowering mull
Similar to 78d8a821e2 but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.

Differential Revision: https://reviews.llvm.org/D93834
2021-01-06 10:51:12 +00:00
David Green 0c59a4da59 [ARM][AArch64] Some extra test to show anyextend lowering. NFC 2021-01-05 17:34:23 +00:00
Roman Lebedev 2461cdb417
[CodeGen][SimplifyCFG] Teach DwarfEHPrepare to preserve DomTree
Once the default for SimplifyCFG flips, we can no longer pass nullptr
instead of DomTree to SimplifyCFG, so we need to propagate it here.

We don't strictly need to actually preserve DomTree in DwarfEHPrepare,
but we might as well do it, since it's trivial.
2021-01-02 01:01:19 +03:00
Roman Lebedev b23b1bcc26
[NFC][CodeGen][Tests] Mark all tests that fail to preserve DomTree for SimplifyCFG as such
These tests start to fail when the SimplifyCFG's default regarding DomTree
updating is switched on, so mark them as needing changes.
2021-01-02 01:01:19 +03:00
Fangrui Song 2047c10c22 [TargetMachine] Drop implied dso_local for definitions in ELF static relocation model/PIE
TargetMachine::shouldAssumeDSOLocal currently implies dso_local for such definitions.

Since clang -fno-pic adds the dso_local specifier, we don't need to special-case this.
2020-12-30 16:57:50 -08:00
Fangrui Song bf1160c1d6 [test] Add explicit dso_local to definitions in ELF static relocation model tests 2020-12-30 16:52:23 -08:00
Fangrui Song a64b89e69e [ARM][test] Add explicit dso_local to definitions in ELF static relocation model tests
TargetMachine::shouldAssumeDSOLocal currently implies dso_local for such definitions.

Adding explicit dso_local makes these tests align with the clang -fno-pic behavior
and allow the removal of the TargetMachine::shouldAssumeDSOLocal special case.
2020-12-30 15:23:21 -08:00
David Green 7a3e11fe96 [ARM] Add some NEON anyextend testing. NFC
This cleans up and regenerates the NEON addw/addl/subw/subl/mlal etc
tests, adding some tests that turn the zext into anyextend using an and
mask.
2020-12-27 13:18:10 +00:00
Evgeniy Brevnov 9fb074e7bb [BPI] Improve static heuristics for "cold" paths.
The current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing an "unreachable" instruction, a call marked with the 'cold' attribute, or the 'unwind' handler of an 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignore the relative hotness/coldness of other paths.

The new approach unifies processing of "cold" paths by assigning a predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down the IR similarly to the existing approach. And finally set up edge probabilities based on the estimated block weights.

One important difference is how we propagate weight up. The existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives a 50/50 distribution, which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting a more accurate probability. The new algorithm propagates the weight up only to the blocks that dominate and are post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.

In addition the new approach processes loops in a uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.

Reviewed By: yrouban

Differential Revision: https://reviews.llvm.org/D79485
2020-12-23 22:47:36 +07:00
Kristof Beyls df8ed39283 [ARM] harden-sls-blr: avoid r12 and lr in indirect calls.
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so.  Similarly, the transformation is not correct when
register lr is used.

This patch makes sure that neither r12 nor lr is used for indirect calls
when harden-sls-blr is enabled.

Differential Revision: https://reviews.llvm.org/D92469
2020-12-19 12:39:59 +00:00
Kristof Beyls a4c1f5160e [ARM] Harden indirect calls against SLS
To make sure that no barrier gets placed on the architectural execution
path, each indirect call calling the function in register rN gets
transformed to a direct call to __llvm_slsblr_thunk_mode_rN.  mode is
either arm or thumb, depending on the mode of where the indirect call
happens.

The __llvm_slsblr_thunk_mode_rN thunk contains:

bx rN
<speculation barrier>

Therefore, the indirect call gets split into 2; one direct call and one
indirect jump.
This transformation results in not inserting a speculation barrier on
the architectural execution path.

The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.

As a linker is allowed to clobber r12 on function calls, the
above code transformation is not correct in case a linker does so.
Similarly, the transformation is not correct when register lr is used.
Avoiding r12/lr being used is done in a follow-on patch to make
reviewing this code easier.

Differential Revision: https://reviews.llvm.org/D92468
2020-12-19 12:33:42 +00:00
Kristof Beyls 320fd3314e [ARM] Implement harden-sls-retbr for Thumb mode
The only non-trivial consideration in this patch is that the formation
of TBB/TBH instructions, which is done in the constant island pass, does
not understand the speculation barriers inserted by the SLSHardening
pass. As such, when harden-sls-retbr is enabled for a function, the
formation of TBB/TBH instructions in the constant island pass is
disabled.

Differential Revision: https://reviews.llvm.org/D92396
2020-12-19 12:32:47 +00:00
Kristof Beyls 195f44278c [ARM] Implement harden-sls-retbr for ARM mode
Some processors may speculatively execute the instructions immediately
following indirect control flow, such as returns, indirect jumps and
indirect function calls.

To avoid a potential mis-speculatively executed gadget after these
instructions leaking secrets through side channels, this pass places a
speculation barrier immediately after every indirect control flow where
control flow doesn't return to the next instruction, such as returns and
indirect jumps, but not indirect function calls.

Hardening of indirect function calls will be done in a later,
independent patch.

This patch is implementing the same functionality as the AArch64 counter
part implemented in https://reviews.llvm.org/D81400.
For AArch64, returns and indirect jumps only occur on RET and BR
instructions and hence the function attribute to control the hardening
is called "harden-sls-retbr" there. On AArch32, there is a much wider
variety of instructions that can trigger an indirect unconditional
control flow change.  I've decided to stick with the name
"harden-sls-retbr" as introduced for the corresponding AArch64
mitigation.

This patch implements this for ARM mode. A future patch will extend this
to also support Thumb mode.

The inserted barriers are never on the correct, architectural execution
path, and therefore performance overhead of this is expected to be low.
To ensure these barriers are never on an architecturally executed path,
when the harden-sls-retbr function attribute is present, indirect
control flow is never conditionalized/predicated.

On targets that implement the Armv8.0-SB Speculation Barrier extension,
a single SB instruction is emitted that acts as a speculation barrier.
On other targets, a DSB SYS followed by an ISB is emitted to act as a
speculation barrier.

These speculation barriers are implemented as pseudo instructions to
prevent later passes from analyzing and potentially removing them.

The mitigation is off by default and can be enabled by the
harden-sls-retbr subtarget feature.

Differential Revision: https://reviews.llvm.org/D92395
2020-12-19 11:42:39 +00:00
Bjorn Pettersson a89d751fb4 Add intrinsics for saturating float to int casts
This patch adds support for the fptoui.sat and fptosi.sat intrinsics,
which provide basically the same functionality as the existing fptoui
and fptosi instructions, but will saturate (or return 0 for NaN) on
values unrepresentable in the target type, instead of returning
poison. Related mailing list discussion can be found at:
https://groups.google.com/d/msg/llvm-dev/cgDFaBmCnDQ/CZAIMj4IBAAJ

The intrinsics have overloaded source and result type and support
vector operands:

    i32 @llvm.fptoui.sat.i32.f32(float %f)
    i100 @llvm.fptoui.sat.i100.f64(double %f)
    <4 x i32> @llvm.fptoui.sat.v4i32.v4f16(<4 x half> %f)
    // etc

On the SelectionDAG layer two new ISD opcodes are added,
FP_TO_UINT_SAT and FP_TO_SINT_SAT. These opcodes have two operands
and one result. The second operand is an integer constant specifying
the scalar saturation width. The idea here is that initially the
second operand and the scalar width of the result type are the same,
but they may change during type legalization. For example:

    i19 @llvm.fptosi.sat.i19.f32(float %f)
    // builds
    i19 fp_to_sint_sat f, 19
    // type legalizes (through integer result promotion)
    i32 fp_to_sint_sat f, 19

I went for this approach, because saturated conversion does not
compose well. There is no good way of "adjusting" a saturating
conversion to i32 into one to i19 short of saturating twice.
Specifying the saturation width separately allows directly saturating
to the correct width.

There are two baseline expansions for the fp_to_xint_sat opcodes. If
the integer bounds can be exactly represented in the float type and
fminnum/fmaxnum are legal, we can expand to something like:

    f = fmaxnum f, FP(MIN)
    f = fminnum f, FP(MAX)
    i = fptoxi f
    i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN

If the bounds cannot be exactly represented, we expand to something
like this instead:

    i = fptoxi f
    i = select f ult FP(MIN), MIN, i
    i = select f ogt FP(MAX), MAX, i
    i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN

It should be noted that this expansion assumes a non-trapping fptoxi.

Initial tests are for AArch64, x86_64 and ARM. This exercises all of
the scalar and vector legalization. ARM is included to test float
softening.

Original patch by @nikic and @ebevhan (based on D54696).

Differential Revision: https://reviews.llvm.org/D54749
2020-12-18 11:09:41 +01:00
Yvan Roux 923ca0b411 [ARM][MachineOutliner] Fix costs model.
Fix the allocation of candidate call cost models and prepare for stack
fixup handling.

Differential Revision: https://reviews.llvm.org/D92933
2020-12-17 16:08:23 +01:00
David Green c100d7ba36 [NFC] Chec[^k] -> Check
Some test updates all appearing to use the wrong spelling of CHECK.
2020-12-08 11:54:39 +00:00
Martin Storsjö 78a57069b5 [CodeGen] Restore accessing __stack_chk_guard via a .refptr stub on mingw after 2518433f86
Add tests for this particular detail for x86 and arm (similar tests
already existed for x86_64 and aarch64).

The libssp implementation may be located in a separate DLL, and in
those cases, the references need to be in a .refptr stub, to avoid
needing to touch up code in the text section at runtime (which is
supported but inefficient for x86, and unsupported for arm).

Differential Revision: https://reviews.llvm.org/D92738
2020-12-07 09:35:12 +02:00
Layton Kifer ac522f8700 [DAGCombiner] Fold (sext (not i1 x)) -> (add (zext i1 x), -1)
Move fold of (sext (not i1 x)) -> (add (zext i1 x), -1) from X86 to DAGCombiner to improve codegen on other targets.
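
A minimal sketch of the folded pattern:

```llvm
define i32 @sext_not(i1 %x) {
  ; (sext (not i1 x)) is rewritten to (add (zext i1 x), -1) by this fold.
  %n = xor i1 %x, true
  %s = sext i1 %n to i32
  ret i32 %s
}
```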

Differential Revision: https://reviews.llvm.org/D91589
2020-12-06 11:52:10 -05:00
Fangrui Song 109e70d357 [TargetMachine] Drop implied dso_local for an edge case (extern_weak + non-pic + hidden)
This does not deserve special handling. The code should be added to Clang
instead if deemed useful. With this simplification, we can additionally delete
the PIC extern_weak special case.
2020-12-05 15:52:33 -08:00
Fangrui Song a084c0388e [TargetMachine] Don't imply dso_local on function declarations in Reloc::Static model for ELF/wasm
clang/lib/CodeGen/CodeGenModule sets dso_local on applicable function declarations,
we don't need to duplicate the work in TargetMachine::shouldAssumeDSOLocal.
(Actually the long-term goal (started by r324535) is to drop TargetMachine::shouldAssumeDSOLocal.)

By not implying dso_local, we will respect dso_local/dso_preemptable specifiers
set by the frontend. This allows the proposed -fno-direct-access-external-data
option to work with -fno-pic and prevent a canonical PLT entry (SHN_UNDEF with non-zero st_value)
when taking the address of a function symbol.

This patch should be NFC in terms of the Clang emitted assembly because the case
we don't set dso_local is a case Clang sets dso_local. However, some tests don't
set dso_local on some function declarations and expose some differences. Most
tests have been fixed to be more robust in the previous commit.
2020-12-05 14:54:37 -08:00