(Cond0 s> -1) ? N1 : 0 --> ~(Cond0 s>> BW-1) & N1
https://alive2.llvm.org/ce/z/mGCBrd
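A minimal IR sketch (invented function and value names) of the kind of vector select this fold targets:
define <4 x i32> @sel_sgt_m1(<4 x i32> %cond0, <4 x i32> %n1) {
  ; after the fold this lowers to an arithmetic shift of the sign bit, a not and an and
  %c = icmp sgt <4 x i32> %cond0, <i32 -1, i32 -1, i32 -1, i32 -1>
  %r = select <4 x i1> %c, <4 x i32> %n1, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}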
This was suggested as a potential enhancement in D113212 (also 7e30404c3b).
There's an improvement for AArch64 that could be generalized (X > -1 --> X >= 0).
For x86, we have a counteracting fold for most cases that turns the shift+not
back into a setcc, so that needs a workaround to get more cases to use "pandn":
D113603
Note that this pattern (like the previous one) is not currently a canonical form
in IR:
https://alive2.llvm.org/ce/z/e4o96b
Adding swapped variants is left as a TODO item here, but is planned as
a near-term follow-up patch.
Differential Revision: https://reviews.llvm.org/D113426
With fullfp16, it is cheaper to cast the {U,S}INT_TO_FP operand to i16
first, rather than promoting it to i32. The custom lowering for
{U,S}INT_TO_FP already supports that, it just needs to be used.
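As an illustrative sketch (hypothetical function; the exact operand types seen by the lowering may differ), a conversion such as:
define half @stofp_i8(i8 %x) {
  ; with +fullfp16 the i8 operand only needs extending to i16 before the convert
  %r = sitofp i8 %x to half
  ret half %r
}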
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D113601
This extends performFpToIntCombine to work on FP16 vectors as well as
the f32 and f64 vectors it already supported.
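Assuming this is the fmul-by-power-of-two plus fptosi combine, a small fp16 vector case it can now handle might look like (invented test):
define <4 x i16> @fcvtzs_fixed_f16(<4 x half> %x) {
  ; 0xH4C00 is 16.0 (2^4), so this can become a fixed-point convert with #4
  %m = fmul <4 x half> %x, <half 0xH4C00, half 0xH4C00, half 0xH4C00, half 0xH4C00>
  %r = fptosi <4 x half> %m to <4 x i16>
  ret <4 x i16> %r
}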
Differential Revision: https://reviews.llvm.org/D113297
Similar to D113199 but dealing with the vector size, this extends the
fptosi+fmul to fixed point fold to handle fptosi.sat nodes that are
equally viable, so long as the saturation width matches the output
width.
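For example (an illustrative sketch, not the exact test): a saturating vector convert of an fmul by a power of two, where the i32 saturation width matches the i32 result elements:
declare <4 x i32> @llvm.fptosi.sat.v4i32.v4f32(<4 x float>)
define <4 x i32> @fcvtzs_sat_fixed_v4(<4 x float> %x) {
  ; 65536.0 is 2^16, so this can use the fixed-point form with #16
  %m = fmul <4 x float> %x, <float 65536.0, float 65536.0, float 65536.0, float 65536.0>
  %r = call <4 x i32> @llvm.fptosi.sat.v4i32.v4f32(<4 x float> %m)
  ret <4 x i32> %r
}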
Differential Revision: https://reviews.llvm.org/D113200
The existing code didn't add all necessary successors, which resulted in
disjoint basic blocks. These would end up not being legalized which, in the
best case, caused a fallback only in assert builds.
Here's an example:
https://godbolt.org/z/ndx15Enfj
We also end up getting weird codegen here as well.
Refactoring the code here allows us to correctly attach all successors. With
this patch, the above example gives correct codegen at -O0 with and without
asserts.
Also autogen the testcase to show that we add all the successors now.
Differential Revision: https://reviews.llvm.org/D113437
This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.
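Roughly the shape being targeted (function and block names invented): the lane splat and the multiply live in different blocks, so per-block instruction selection cannot combine them:
define <4 x float> @fmul_lane(<4 x float> %a, <4 x float> %b, i1 %cc) {
entry:
  ; splat of lane 1 of %b (a DUP) in one block ...
  %dup = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  br i1 %cc, label %use, label %exit
use:
  ; ... multiplied in another block, which can now become FMUL (by element)
  %m = fmul <4 x float> %a, %dup
  ret <4 x float> %m
exit:
  ret <4 x float> %a
}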
Differential Revision: https://reviews.llvm.org/D99662
[NFC] As part of using inclusive language within the LLVM project, this patch
replaces master with main in file/directory paths in LLVM lit tests.
Reviewed By: bjope
Differential Revision: https://reviews.llvm.org/D113190
When inserting an unpacked FP subvector into a packed vector we
can simply cast the unpacked value into a packed value, since
both types are legal for SVE. We can then use this as the input
for the UZP instruction. This avoids us expanding the operation
by going through the stack.
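A rough example of the case in question (illustrative types, using the experimental.vector.insert intrinsic):
declare <vscale x 4 x float> @llvm.experimental.vector.insert.nxv4f32.nxv2f32(<vscale x 4 x float>, <vscale x 2 x float>, i64)
define <vscale x 4 x float> @insert_unpacked(<vscale x 4 x float> %vec, <vscale x 2 x float> %sub) {
  ; the unpacked nxv2f32 subvector can be reinterpreted as a packed value
  ; and fed to UZP1 instead of round-tripping through the stack
  %r = call <vscale x 4 x float> @llvm.experimental.vector.insert.nxv4f32.nxv2f32(<vscale x 4 x float> %vec, <vscale x 2 x float> %sub, i64 0)
  ret <vscale x 4 x float> %r
}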
Differential Revision: https://reviews.llvm.org/D113270
We already have patterns for fptosi and fptoui plus fmul to fixed point
convert, this adds equivalent patterns for fptosi.sat and fptoui.sat,
which should apply equally well for the legal saturating variants.
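A scalar sketch of the new pattern (hypothetical function):
declare i32 @llvm.fptosi.sat.i32.f32(float)
define i32 @fcvtzs_sat_fixed(float %x) {
  ; 65536.0 is 2^16, so this should now select a fixed-point fcvtzs with #16
  %m = fmul float %x, 65536.0
  %r = call i32 @llvm.fptosi.sat.i32.f32(float %m)
  ret i32 %r
}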
Differential Revision: https://reviews.llvm.org/D113199
Perform the rearrangement of add/sub and mul instructions to match the madd/msub pattern.
Reviewed By: dmgreen, sdesmalen, david-arm
Differential Revision: https://reviews.llvm.org/D111862
This rewrites the fcvt-fixed.ll test case to be separate functions, not
one large function with volatile global stores. It also adds fp16 and
fptoi.sat testing at the same time.
This is the 'or' sibling for the fold added with:
D113212
https://alive2.llvm.org/ce/z/tgnp7K
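As a sketch (invented function), the 'or' form uses an all-ones true operand instead of a zero false operand, so the sign-bit splat is OR'd with the other value:
define <4 x i32> @sel_or(<4 x i32> %x, <4 x i32> %y) {
  %c = icmp slt <4 x i32> %x, zeroinitializer
  %r = select <4 x i1> %c, <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>, <4 x i32> %y
  ret <4 x i32> %r
}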
Note that neither of these transforms is poison-safe,
but it does not seem to matter at this level. We have
had the scalar version of D113212 for a long time, so
this is just making optimizer behavior consistent.
We do not have the scalar version of *this* fold,
however, so that is another follow-up.
Try to simplify a G_UADDO with 8- or 16-bit operands to a wide G_ADD plus a TBNZ
if the result is only used in the no-overflow case. This is restricted to cases
where we know that the high bits of the operands are 0. If there's an
overflow, then the 9th or 17th bit must be set, which can be checked
using TBNZ.
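For illustration only (hypothetical IR; at the gMIR level this is a narrow G_UADDO), one shape this targets:
declare { i8, i1 } @llvm.uadd.with.overflow.i8(i8, i8)
define i8 @uaddo8(i8 %a, i8 %b) {
  %res = call { i8, i1 } @llvm.uadd.with.overflow.i8(i8 %a, i8 %b)
  %ov = extractvalue { i8, i1 } %res, 1
  br i1 %ov, label %overflow, label %ok
ok:
  ; only the no-overflow path uses the add result, so the whole thing can
  ; become a wide add plus a TBNZ on bit 8
  %val = extractvalue { i8, i1 } %res, 0
  ret i8 %val
overflow:
  ret i8 0
}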
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D111888
(X s< 0) ? Y : 0 --> (X s>> BW-1) & Y
We canonicalize to the icmp+select form in IR, and we already have this fold
for scalar select in SDAG, so I think it's an oversight that we don't have
the fold for vectors. It seems neutral for AArch64 and saves some instructions
on x86.
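A small example of the vector form now handled (illustrative types):
define <4 x i32> @sel_and(<4 x i32> %x, <4 x i32> %y) {
  ; (x s< 0) ? y : 0 becomes (x s>> 31) & y after the fold
  %c = icmp slt <4 x i32> %x, zeroinitializer
  %r = select <4 x i1> %c, <4 x i32> %y, <4 x i32> zeroinitializer
  ret <4 x i32> %r
}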
Whether we should also have the sibling folds for the inverse condition or
all-ones true value may depend on target-specific factors such as whether
there's an "and-not" instruction.
Differential Revision: https://reviews.llvm.org/D113212
When creating a UUNPKLO/HI node with an undef input, the
output should also be undef. I've added a target DAG combine
to ensure we avoid creating an unnecessary uunpklo/hi
instruction.
Differential Revision: https://reviews.llvm.org/D113266
A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)
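One way such a DAG can arise from IR (an assumption on my part; the actual test may use different intrinsics):
declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)
define i16 @uaddlv_v8i8(<8 x i8> %a) {
  ; a pairwise widening add followed by an add reduction, which should map
  ; to a single uaddlv h0, v0.8b
  %p = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %a)
  %r = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %p)
  ret i16 %r
}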
Differential Revision: https://reviews.llvm.org/D113263
If the type of a funnel shift needs to be expanded, expand it to two funnel shifts instead of regular shifts. For constant shifts, this doesn't make much difference, but for variable shifts it allows a more optimal lowering.
Also use the optimized funnel shift lowering for rotates.
Alive2: https://alive2.llvm.org/ce/z/TvHDB- / https://alive2.llvm.org/ce/z/yzPept
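For example (illustrative function), a funnel shift whose type is not legal and therefore needs expansion:
declare i128 @llvm.fshl.i128(i128, i128, i128)
define i128 @fshl_i128(i128 %a, i128 %b, i128 %s) {
  ; i128 is expanded; with this change the expansion is built from two
  ; narrower funnel shifts rather than plain shifts
  %r = call i128 @llvm.fshl.i128(i128 %a, i128 %b, i128 %s)
  ret i128 %r
}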
(Branched from D108058 as getting this completed should help unlock some other WIP patches).
Original Patch: @efriedma (Eli Friedman)
Differential Revision: https://reviews.llvm.org/D112443
For NEON, FMA matching is done in the MachineCombiner, and not the
DAGCombiner. That causes problems with VLS lowering, since the
vectors are fixed width at the DAGCombiner, but are scalable in
the MachineCombiner. This patch corrects it by matching FMAs for
VLS vectors in the DAGCombiner.
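A sketch of the kind of VLS code affected (hypothetical function, assuming contraction is allowed via fast-math flags):
define <8 x float> @fma_vls(<8 x float> %a, <8 x float> %b, <8 x float> %c) {
  ; a fixed-width fmul+fadd pair that the DAGCombiner can now fuse into an
  ; FMA when compiling vector-length-specific SVE code
  %m = fmul contract <8 x float> %a, %b
  %r = fadd contract <8 x float> %m, %c
  ret <8 x float> %r
}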
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D112557
Add support for generating TargetFrameIndex in complex patterns for
indexed addressing modes in SVE. Additionally, add missing load/stores
to getMemOpInfo and getLoadStoreImmIdx.
Differential Revision: https://reviews.llvm.org/D112617
The support for neoverse-512tvb mirrors the same option available in GCC[1].
There is no functional effect for this option yet.
This patch ensures the driver accepts "-mcpu=neoverse-512tvb", and enough
plumbing is in place to allow the new option to be used in the future.
[1]https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html
Differential Revision: https://reviews.llvm.org/D112406
PromoteIntRes_MLOAD always sets the extension type to `EXTLOAD`, which
results in a sign-extended load. If the type returned by getExtensionType()
for the load being promoted is something other than `NON_EXTLOAD`, we
should instead pass this to getMaskedLoad() as the extension type.
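One shape where this matters, as I understand it (illustrative only, written with opaque pointers):
declare <4 x i8> @llvm.masked.load.v4i8.p0(ptr, i32, <4 x i1>, <4 x i8>)
define <4 x i32> @zext_masked_load(ptr %p, <4 x i1> %m) {
  ; the v4i8 result needs promotion; if the masked load has already been
  ; turned into a zero-extending load, that extension type must be preserved
  %l = call <4 x i8> @llvm.masked.load.v4i8.p0(ptr %p, i32 1, <4 x i1> %m, <4 x i8> zeroinitializer)
  %e = zext <4 x i8> %l to <4 x i32>
  ret <4 x i32> %e
}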
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D112320
Widen the result and the first input vector, because they have the same size.
The subvector to be inserted is widened in the operand widening function.
Differential Revision: https://reviews.llvm.org/D112187
This change introduces subtarget features to predicate certain
instructions and system registers that are available only on
'A' profile targets. Those features are not present when
targeting a generic CPU, which is the default processor.
In other words, the generic CPU now means the intersection of
'A' and 'R' profiles. To maintain backwards compatibility we
enable the features that correspond to -march=armv8-a when the
architecture is not explicitly specified on the command line.
References: https://developer.arm.com/documentation/ddi0600/latest
Differential Revision: https://reviews.llvm.org/D110065
Before this patch, the code would crash with "unhandled opcode in
isAArch64FrameOffsetLegal" when there was a spill from extractelement.
Fixes PR52249.
Differential Revision: https://reviews.llvm.org/D112311
This patch enables the use of reciprocal estimates for SVE
when both the -Ofast and -mrecip flags are used.
Reviewed By: david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D111657
%3:gpr32 = ORRWrs $wzr, %2, 0
%4:gpr64 = SUBREG_TO_REG 0, %3, %subreg.sub_32
If the source operand of the ORRWrs is defined by the 32-bit form of an AArch64
instruction, we can remove the ORRWrs because the upper 32 bits of the source
operand are already set to zero.
Differential Revision: https://reviews.llvm.org/D110841
This will allow us to reuse existing interleaved load logic in
lowerInterleavedLoad that exists for neon types, but for SVE fixed
types.
The goal eventually will be to replace the existing ld<n> intrinsics
with these, once a migration path has been sorted out.
Differential Revision: https://reviews.llvm.org/D112078
G_ICMP is selected to an arithmetic overflow op (ADDS/SUBS/etc) with a dead
destination + a CSINC instruction.
We have a fold which allows us to combine 32-bit adds with G_ICMP.
The problem with G_ICMP is that we model it as always having a 32-bit
destination even though it can be a 64-bit operation. So, we were missing some
opportunities for 64-bit folds.
This patch teaches the fold to recognize 64-bit G_ICMPs + refactors some of
the code surrounding CSINC accordingly.
(Later down the line, I think we should probably change the way we handle G_ICMP
in general.)
Differential Revision: https://reviews.llvm.org/D111088
When splitting a masked load, `GetDependentSplitDestVTs` is used to get the
MemVTs of the high and low parts. If the masked load is extended, this
may return VTs with different element types which are used to create the
high & low masked load instructions.
This patch changes `GetDependentSplitDestVTs` to ensure we return VTs with
the same element type.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D111996
When inserting a scalable subvector into a scalable vector through
the stack, the index to store to needs to be scaled by vscale.
Before this patch, that didn't yet happen, so it would generate the
wrong offset, thus storing a subvector to the incorrect address
and overwriting the wrong lanes.
For some insert:
nxv8f16 insert_subvector(nxv8f16 %vec, nxv2f16 %subvec, i64 2)
The offset was not scaled by vscale:
orr x8, x8, #0x4
st1h { z0.h }, p0, [sp]
st1h { z1.d }, p1, [x8]
ld1h { z0.h }, p0/z, [sp]
And is changed to:
mov x8, sp
st1h { z0.h }, p0, [sp]
st1h { z1.d }, p1, [x8, #1, mul vl]
ld1h { z0.h }, p0/z, [sp]
Differential Revision: https://reviews.llvm.org/D111633
The autiasp and autibsp instructions are the counterparts of the paciasp/pacibsp
instructions, so let's emit .cfi_negate_ra_state for these too.
With the Armv8.3 instruction set, retaa/retab perform the return and authentication
in one step; in that case we can't emit the .cfi_negate_ra_state because it would
point after the ret* instruction.
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D111780
Following on from an earlier patch that introduced support for -mtune
for AArch64 backends, this patch splits out the tuning features
from the processor features. This gives us the ability to enable
architectural feature set A for a given processor with "-mcpu=A"
and define the set of tuning features B with "-mtune=B".
It's quite difficult to write a test that proves we select the
right features according to the tuning attribute because most
of these relate to scheduling. I have created a test here:
CodeGen/AArch64/misched-fusion-addr-tune.ll
that demonstrates the different scheduling choices based upon
the tuning.
Differential Revision: https://reviews.llvm.org/D111551
Try to widen the element type to get a new mask value for a better permutation
sequence, so that we can use NEON shuffle instructions such as ZIP1/2,
UZP1/2, TRN1/2, REV, INS, etc.
For example:
shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
is equivalent to:
shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
Finally, we can get:
mov v0.d[0], v1.d[1]
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D111619
SVE has predicated literal forms of some instructions for specific
literals, which currently are generated correctly when using ACLE
but not when those instructions are generated directly.
This adds the patterns to generate those instructions when
generating from standard LLVM IR instructions.
Differential Revision: https://reviews.llvm.org/D99074