Updated the affected scalable_of_scalable tests in sve-gep.ll, as isConstantSplatValue now returns true in DAGCombiner::visitMUL and folds `(mul x, 1) -> x`.
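A minimal IR sketch of the kind of GEP affected (modelled on the
sve-gep.ll tests; names hypothetical): indexing i8 elements scales the
vector index by a splat of 1, which now folds away.
```
define <vscale x 2 x ptr> @gep_i8(ptr %base, <vscale x 2 x i64> %idx) {
  ; Byte-sized elements give (mul %idx, splat(1)); with splat_vector
  ; recognised as a constant splat, the multiply folds to %idx.
  %d = getelementptr i8, ptr %base, <vscale x 2 x i64> %idx
  ret <vscale x 2 x ptr> %d
}
```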
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D91363
Local values are constants or addresses that can't be folded into
the instruction that uses them. FastISel materializes these in a
"local value" area that always dominates the current insertion
point, to try to avoid materializing these values more than once
(per block).
https://reviews.llvm.org/D43093 added code to sink these local
value instructions to their first use, which has two beneficial
effects. One, it is likely to avoid some unnecessary spills and
reloads; two, it allows us to attach the debug location of the
user to the local value instruction. The latter effect can
improve the debugging experience for debuggers with a "set next
statement" feature, such as the Visual Studio debugger and PS4
debugger, because instructions to set up constants for a given
statement will be associated with the appropriate source line.
There are also some constants (primarily addresses) that could be
produced by no-op casts or GEP instructions; the main difference
from "local value" instructions is that these are values from
separate IR instructions, and therefore could have multiple users
across multiple basic blocks. D43093 avoided sinking these, even
though they were emitted to the same "local value" area as the
other instructions. The patch comment for D43093 states:
    Local values may also be used by no-op casts, which adds the
    register to the RegFixups table. Without reversing the RegFixups
    map direction, we don't have enough information to sink these
    instructions.
This patch undoes most of D43093, and instead flushes the local
value map after(*) every IR instruction, using that instruction's
debug location. This avoids the sometimes-incorrect locations used
previously, and emits instructions in a more natural order.
This does mean materialized values are not re-used across IR
instruction boundaries; however, only about 5% of those values
were reused in an experimental self-build of clang.
(*) Actually, just prior to the next instruction. It seems like
it would be cleaner the other way, but I was having trouble
getting that to work.
Differential Revision: https://reviews.llvm.org/D91734
This patch adds a target-specific DAG combine for mscatter to promote indices
with element types i8 or i16 before legalisation, plus various tests with illegal types.
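A hedged IR sketch of the sort of illegal-typed test added (names
hypothetical): the i8 offset vector must be promoted before the
scatter can be legalised.
```
declare void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32>, <vscale x 4 x ptr>, i32, <vscale x 4 x i1>)

define void @scatter_i8_offsets(<vscale x 4 x i32> %data, ptr %base, <vscale x 4 x i8> %offs, <vscale x 4 x i1> %pg) {
  ; <vscale x 4 x i8> is an illegal index type; the new combine extends
  ; it to a legal element type before legalisation.
  %ptrs = getelementptr i32, ptr %base, <vscale x 4 x i8> %offs
  call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> %data, <vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %pg)
  ret void
}
```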
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90945
This uses the same reasoning as other similar conversions just before
selection; without it we miss out on selection because the importer
considers s64 and p0 distinct types.
This reapplies 36c64af9d7 in updated
form.
Emit the xdata for each function at .seh_endproc. This keeps the
exact same output section order for most code generated by the LLVM
CodeGen layer. (Sections still change order for code built from
assembly where functions lack an explicit .seh_handlerdata
directive, and functions with chained unwind info.)
The practical effect should be that assembly output lacks
superfluous ".seh_handlerdata; .text" pairs at the end of functions
that don't handle exceptions, which allows such functions to use
the AArch64 packed unwind format again.
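A before/after assembly sketch of that effect (function name
hypothetical):
```
        .seh_proc f             // f handles no exceptions
f:
        ret
        // previously emitted here: .seh_handlerdata
        //                          .text
        .seh_endproc            // xdata is now emitted at this point
```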
Differential Revision: https://reviews.llvm.org/D87448
X86 was already specially marking fma as commutable, which allowed
tablegen to autogenerate commuted patterns. This moves it to the
target-independent definition and fixes up the targets to remove
now-unneeded patterns.
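A sketch of the target-independent change, assuming the fma SDNode
definition in TargetSelectionDAG.td:
```
// Marking the node commutative in its multiplicand operands lets
// TableGen autogenerate the commuted patterns for every target.
def fma : SDNode<"ISD::FMA", SDTFPTernaryOp, [SDNPCommutative]>;
```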
Unfortunately, the tests change because the commuted versions of the
patterns generate operands in a different order than the explicit
patterns.
Differential Revision: https://reviews.llvm.org/D91842
This patch implements the out-of-line atomics deployment mechanism for
LSE. Details of how it works can be found in llvm/docs/Atomics.rst.
The options -moutline-atomics and -mno-outline-atomics were added to
the clang driver to enable and disable it. This is the clang and LLVM
part of the out-of-line atomics interface; the library part is already
supported by libgcc. Compiler-rt support is provided in a separate
patch.
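A small IR sketch of the effect (function name hypothetical): with
-moutline-atomics, an atomic RMW like the one below becomes a call to
a runtime helper such as __aarch64_ldadd4_relax, which dispatches
between LSE and LL/SC instructions at run time.
```
define i32 @fetch_add(ptr %p) {
  ; With -moutline-atomics this lowers to "bl __aarch64_ldadd4_relax".
  %old = atomicrmw add ptr %p, i32 1 monotonic
  ret i32 %old
}
```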
Differential Revision: https://reviews.llvm.org/D91157
In some cases, the values passed to `asm sideeffect` calls cannot be
mapped directly to simple MVTs. Currently, we crash in the backend if
that happens. An example can be found in the @test_vector_too_large_r_m
test case, where we pass <9 x float> vectors. In practice, this can
happen in cases like the simple C example below.
```
using vec = float __attribute__((ext_vector_type(9)));
void f1(vec m) {
  asm volatile("" : "+r,m"(m) : : "memory");
}
```
One case that uses "+r,m" constraints for arbitrary data types in
practice is google-benchmark's DoNotOptimize.
This patch updates visitInlineAsm so that it uses MVT::Other for
constraints with complex VTs. It looks like the rest of the backend
correctly deals with that and properly legalizes the type.
And we still report an error if there are no registers to satisfy the
constraint.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D91710
This patch uses the new `getMnemonic` helper from D90039
to display mnemonics instead of the internal opcodes.
The main motivation behind using the mnemonics is that they are more
user-friendly and more directly related to the assembly the users
will be presented with.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D90040
When we see
```
xor = G_XOR xor_lhs, -1
select = G_SELECT cc, tval, xor
```
Fold this into
```
select = CSINV tval, xor_lhs, cc
```
Update select-select.mir to reflect the changes.
For now, only handle the case where the G_XOR is the false-value for the
G_SELECT. It may make more sense to handle the true-value case in post-legalizer
lowering.
Differential Revision: https://reviews.llvm.org/D90774
The G_ZEXT in these cases seems to actually come from a combine that we do but
SelectionDAG doesn't. Looking through it allows us to match "uxtw #2" addressing
modes.
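A hedged gMIR sketch of the shape being matched (register names
hypothetical):
```
%ext:gpr(s64) = G_ZEXT %idx(s32)
%cst:gpr(s64) = G_CONSTANT i64 2
%off:gpr(s64) = G_SHL %ext, %cst(s64)
%addr:gpr(p0) = G_PTR_ADD %base, %off(s64)
%val:gpr(s32) = G_LOAD %addr(p0) :: (load 4)
```
Looking through the G_ZEXT here lets the load use the
"[x_base, w_idx, uxtw #2]" register-offset form.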
Differential Revision: https://reviews.llvm.org/D91475
When we see
```
%sub = G_SUB 0, %x
%select = G_SELECT %cc, %t, %sub
```
Fold away the G_SUB by producing
```
%select = CSNEG %t, %x, cc
```
Simple IR example: https://godbolt.org/z/K8TEnh
This is valid on both sides of the select, but for now, just handle one side.
It may make more sense to handle swapping sides during post-legalizer lowering.
Differential Revision: https://reviews.llvm.org/D90723
If the scatter store is able to perform the sign/zero extend of
its index, this is folded into the instruction with refineIndexType().
Additionally, refineUniformBase() will return the base pointer and index
from an add + splat_vector.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90942
Select the following:
- G_SELECT cc, 0, 1 -> CSINC zreg, zreg, cc
- G_SELECT cc, 0, -1 -> CSINV zreg, zreg, cc
- G_SELECT cc, 1, f -> CSINC f, zreg, inv_cc
- G_SELECT cc, -1, f -> CSINV f, zreg, inv_cc
- G_SELECT cc, t, 1 -> CSINC t, zreg, cc
- G_SELECT cc, t, -1 -> CSINV t, zreg, cc
(IR example: https://godbolt.org/z/YfPna9)
These correspond to a bunch of the AArch64csel patterns in AArch64InstrInfo.td.
Unfortunately, it doesn't seem like we can import patterns that use NZCV like
those ones do. E.g.
```
def : Pat<(AArch64csel GPR32:$tval, (i32 1), (i32 imm:$cc), NZCV),
(CSINCWr GPR32:$tval, WZR, (i32 imm:$cc))>;
```
So we have to manually select these for now.
This replaces `selectSelectOpc` with an `emitSelect` function, which performs
these optimizations.
Differential Revision: https://reviews.llvm.org/D90701
When passing SVE types as arguments to function calls we can run
out of hardware SVE registers. This is normally fine, since we
switch to an indirect mode where we pass a pointer to an SVE stack
object in a GPR. However, if we switch over part-way through
processing an SVE tuple then part of it will be in registers and
the other part will be on the stack.
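For illustration, a hedged IR sketch of the problem case (signature
hypothetical): the single vectors %v0-%v6 occupy z0-z6, leaving only
one Z register for the two-register tuple %t.
```
define void @f(<vscale x 4 x i32> %v0, <vscale x 4 x i32> %v1,
               <vscale x 4 x i32> %v2, <vscale x 4 x i32> %v3,
               <vscale x 4 x i32> %v4, <vscale x 4 x i32> %v5,
               <vscale x 4 x i32> %v6, <vscale x 8 x i32> %t) {
  ; Without the fix, %t would be split between z7 and the stack.
  ret void
}
```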
I've fixed this by ensuring that:
1. When we don't have enough registers to allocate the whole block
we mark any remaining SVE registers temporarily as allocated.
2. We temporarily remove the InConsecutiveRegs flags from the last
tuple part argument and reinvoke the autogenerated calling
convention handler. Doing this prevents the code from entering
an infinite recursion and, in combination with 1), ensures we
switch over to the Indirect mode.
3. After allocating a GPR register for the pointer to the tuple we
then deallocate any SVE registers we marked as allocated in 1).
We also set the InConsecutiveRegs flags back to how they were before.
4. I've changed the AArch64ISelLowering LowerCALL and
LowerFormalArguments functions to detect the start of a tuple,
which involves allocating a single stack object and doing the
correct number of legal loads and stores.
Differential Revision: https://reviews.llvm.org/D90219
When there is full fp16 support, there is no reason to widen 16-bit
G_FCONSTANTs to 32 bits. Mark them as legal in this case.
Also, we currently import a pattern for materializing a 16-bit 0.0.
Add a testcase showing we select it.
(All other 16-bit G_FCONSTANTs are not yet selected.)
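A gMIR sketch of the constant that now stays 16-bit (0xH0000 is the
IR spelling of half 0.0):
```
%cst:fpr(s16) = G_FCONSTANT half 0xH0000
```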
Differential Revision: https://reviews.llvm.org/D89164
The manual selection code for add/sub was not checking if it was possible to
fold in shifts + extends (the *rx opcode variants).
As a result, we could never select things like
```
cmp x1, w0, uxtw #2
```
This is because we don't import any patterns for compares.
This adds support for the arithmetic shifted register forms and updates tests
for instructions selected using `emitADD`, `emitADDS`, and `emitSUBS`.
This is a 0.1% geomean code size improvement on SPECINT2000 at -Os.
Differential Revision: https://reviews.llvm.org/D91207
Previously, we only handled negative arithmetic immediates in the imported
selector code.
Since we don't import code for, say, compares, we were missing opportunities
for things like
```
%cst:gpr(s64) = G_CONSTANT i64 -10
%cmp:gpr(s32) = G_ICMP intpred(eq), %reg0(s64), %cst
->
%adds = ADDSXri %reg0, 10, 0, implicit-def $nzcv
%cmp = CSINCWr $wzr, $wzr, 1, implicit $nzcv
```
Instead, we would have to materialize the constant and emit a SUBS.
This adds support for selection like above for SUB, SUBS, ADD, and ADDS.
This is a 0.1% geomean code size improvement on SPECINT2000 at -Os.
Differential Revision: https://reviews.llvm.org/D91108
Lowers the llvm.masked.scatter intrinsics (scalar plus vector addressing mode only)
Changes included in this patch:
- Custom lowering for MSCATTER, which chooses the appropriate scatter store opcode to use.
Floating-point scatters are cast to integer, with patterns added to match FP reinterpret_casts.
- Added the getCanonicalIndexType function to convert redundant addressing
modes (e.g. scaling is redundant when accessing bytes)
- Tests with 32 & 64-bit scaled & unscaled offsets
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90941
These do things like turn a multiply of a pow-2+1 into a shift and an add,
which is a common pattern that pops up, and is universally better than expensive
madd instructions with a constant.
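For example, a gMIR sketch of the combine for a multiply by
9 = 2^3 + 1:
```
%cst:_(s64) = G_CONSTANT i64 9
%mul:_(s64) = G_MUL %x, %cst
; ==>
%shift:_(s64) = G_CONSTANT i64 3
%shl:_(s64) = G_SHL %x, %shift(s64)
%mul:_(s64) = G_ADD %shl, %x
```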
I've added check lines to an existing codegen test since the code
being ported is almost identical; however, the mul-by-negative-pow2
constant tests don't generate the same code because we're still
missing some generic G_MUL combines.
Differential Revision: https://reviews.llvm.org/D91125
For example, if the sign extension is only used by a TBZ and the value is used elsewhere with a zero extension, this can eliminate the sign extension.
Reviewed By: samparker
Differential Revision: https://reviews.llvm.org/D90606
FastISel generates instructions to materialize "local values" at the
top of a block, in the hope that these values could be reused within
the block. To reduce spills and restores, FastISel treats calls as
sub-block boundaries, flushing the "local value map" at each call.
This patch treats the mem* intrinsics as if they were calls, because
at O0 generally they are calls. Eliminating these spills/restores is
actually better for debugging (especially a "continue at this line"
command), code size, stack frame size, and maybe even performance.
Differential Revision: https://reviews.llvm.org/D90877
This is to help review the impact of https://reviews.llvm.org/D89952,
which allows targets to fine-tune what SelectionDAG does when vector
CTPOP is not legal.
Add support for the Neoverse V1 CPU to the ARM and AArch64 backends.
This is based on patches from Mark Murray and Victor Campos.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D90765
These were previously handled by pattern matching shuffles in the selector, but
adding a new opcode and making it equivalent to the AArch64duplane SDAG node
allows us to select more patterns, like lane-indexed FMLAs (a patch
adding a test for that will be committed later).
The pattern matching code has been simply moved to postlegalize lowering.
Differential Revision: https://reviews.llvm.org/D90820
Hook up legalizations for VECREDUCE_SEQ_FMUL. This is following up on the VECREDUCE_SEQ_FADD work from D90247.
Differential Revision: https://reviews.llvm.org/D90644
This patch uses the existing LowerFixedLengthReductionToSVE function to also lower
scalable vector reductions. A separate function has been added to lower VECREDUCE_AND
& VECREDUCE_OR operations with predicate types using ptest.
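A hedged IR sketch (function name hypothetical) of a predicate
reduction that now lowers via ptest:
```
declare i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1>)

define i1 @reduce_or(<vscale x 16 x i1> %p) {
  ; VECREDUCE_OR on a predicate type lowers to a ptest plus a cset.
  %r = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %p)
  ret i1 %r
}
```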
Lowering scalable floating-point reductions will be addressed in a
follow-up patch; for now these will hit the assertion added to
expandVecReduce() in TargetLowering.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D89382
- Basically, iterate over each pair of memory operands from the two
instructions and return true if any pair may alias.
- The exception is memory instructions without any memory operands:
they may touch everything and could alias with any memory instruction.
Differential Revision: https://reviews.llvm.org/D89447
For the <2 x float> case, instead of adding another combine or legalization to
get it into a <4 x float> form, I'm just adding a GISel specific selection
pattern to cover it.
Differential Revision: https://reviews.llvm.org/D90699
This caused an explosion in ICF times during linking on Windows when
libfuzzer instrumentation is enabled. For a small binary we see ICF
time go from ~0 to ~10 s. For a large binary it goes from ~1 s to
forever (I gave up after 30 minutes).
See comment on the code review.
> If we are going to write handler data (that is written as variable
> length data following after the unwind info in .xdata), we need to
> emit the handler data immediately, but for cases where no such
> info is going to be written, skip emitting it right away. (Unwind
> info for all remaining functions that hasn't gotten it emitted
> directly is emitted at the end.)
>
> This does slightly change the ordering of sections (triggering a
> bunch of updates to DebugInfo/COFF tests), but the change should be
> benign.
>
> This also matches GCC's assembly output, which doesn't output
> .seh_handlerdata unless it actually is needed.
>
> For ARM64, the unwind info can be packed into the runtime function
> entry itself (leaving no data in the .xdata section at all), but
> that can only be done if there's no follow-on data in the .xdata
> section. If emission of the unwind info is triggered via
> EmitWinEHHandlerData (or the .seh_handlerdata directive), which
> implicitly switches to the .xdata section, there's a chance of the
> caller wanting to pass further data there, so the packed format
> can't be used in that case.
>
> Differential Revision: https://reviews.llvm.org/D87448
This reverts commit 36c64af9d7.
This patch adds a bunch of CHECK lines to guard against implicit
conversions of TypeSize -> uint64_t occurring in code-paths that previously
were safe for scalable vectors.
Adds patterns to catch masks preceding a long multiply and generate a
single umull/smull instruction instead.
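An IR sketch of the pattern (hedged example): both multiply operands
are masked to their low 32 bits, so a single umull suffices.
```
define i64 @umull(i64 %a, i64 %b) {
  ; The masks prove the operands fit in 32 bits, so this can select
  ; "umull x0, w0, w1" instead of two ands plus a madd/mul.
  %al = and i64 %a, 4294967295
  %bl = and i64 %b, 4294967295
  %m = mul i64 %al, %bl
  ret i64 %m
}
```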
Differential Revision: https://reviews.llvm.org/D89956
We added a new load/store cluster algorithm in D85517. However,
AArch64 sees some compile-time degradation with the new algorithm, as
IsReachable() is not cheap if the DAG is complex; it is O(M+N). See
https://bugs.llvm.org/show_bug.cgi?id=47966.
So this patch adds a heuristic to switch to the old cluster algorithm
if the DAG is too complex.
Reviewed By: Owen Anderson
Differential Revision: https://reviews.llvm.org/D90144