On some architectures, such as Arm and X86, the encoding for a nop may
change depending on the subtarget in operation at the time of
encoding. This change replaces the per-module MCSubtargetInfo retained
by the target's AsmBackend in favour of passing through the local
MCSubtargetInfo in operation at the time.
On Arm using the architectural NOP instruction can have a performance
benefit on some implementations.
For Arm I've deleted the copy of the AsmBackend's MCSubtargetInfo to
limit the chances of this causing problems in the future. I've not
done this for other targets such as X86 as there is more frequent use
of the MCSubtargetInfo and it looks to be for stable properties that
we would not expect to vary per function.
This change required threading STI through MCNopsFragment and
MCBoundaryAlignFragment.
I've attempted to take into account the in tree experimental backends.
Differential Revision: https://reviews.llvm.org/D45962
In preparation for passing the MCSubtargetInfo (STI) through to writeNops
so that it can use the STI in operation at the time, we need to record the
STI in operation when an MCAlignFragment may write nops as padding. The
STI is currently unused, a further patch will pass it through to
writeNops.
There are many places that can create an MCAlignFragment; in most cases
we can find out the STI in operation at the time. In a few places this
isn't possible as we are in initialisation or finalisation, or are
emitting constant pools. When possible I've tried to find the most
appropriate existing fragment to obtain the STI from; when none is
available I use the per-module STI.
For constant pools we don't actually need to use EmitCodeAlign as the
constant pools are data anyway so falling through into it via an
executable NOP is no better than falling through into data padding.
This is a prerequisite for D45962, which uses the STI to emit the
appropriate NOP for the STI; the appropriate NOP can differ per fragment.
Note that this involves an interface change to InitSections: it is now
called initSections and requires a SubtargetInfo as a parameter.
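A sketch of the renamed entry point, assuming the streamer simply gains the subtarget parameter (the precise signature may differ):
```
// MCStreamer: the subtarget is needed because setting up the initial
// sections may emit alignment padding, which may in turn emit nops.
void initSections(bool NoExecStack, const MCSubtargetInfo &STI);
```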
Differential Revision: https://reviews.llvm.org/D45961
Legalizing G_MUL for non-standard types (like i33) generated an error. Use
minScalar and maxScalar instead of clampScalar. Also use a new rule: instead
of widening to the next power of 2, widen to the next multiple of the passed
argument (32 in this case), so instead of widening i65 to i128, we widen it to
i96.
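A minimal sketch of the rule change, assuming the LegalizerInfo builder methods named here (the actual AMDGPU action list has more clauses, including the maxScalar bound):
```
const LLT S16 = LLT::scalar(16), S32 = LLT::scalar(32);
getActionDefinitionsBuilder(G_MUL)
    .legalFor({S32, S16})
    .minScalar(0, S32)                   // replaces the old clampScalar lower bound
    .widenScalarToNextMultipleOf(0, 32); // i33 -> i64, i65 -> i96 (not i128)
```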
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D109228
Add implementation for the legalization of G_ROTL and G_ROTR machine
instructions. They are very similar to funnel shift instructions; the only
difference is that funnel shifts have 3 operands, whereas rotate instructions have
two operands, the first being the register that is being rotated and the second
being the number of shifts. The legalization of G_ROTL/G_ROTR is just lowering
them into funnel shift instructions if they are legal.
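A minimal sketch of that lowering (helper name and the surrounding LegalizerHelper plumbing are assumed): a rotate is a funnel shift with both value operands set to the rotated register.
```
void lowerRotateToFunnelShift(MachineInstr &MI, MachineIRBuilder &B) {
  Register Dst = MI.getOperand(0).getReg();
  Register Src = MI.getOperand(1).getReg();
  Register Amt = MI.getOperand(2).getReg();
  // G_ROTL x, amt -> G_FSHL x, x, amt ; G_ROTR x, amt -> G_FSHR x, x, amt
  unsigned FShOpc = MI.getOpcode() == TargetOpcode::G_ROTL
                        ? TargetOpcode::G_FSHL
                        : TargetOpcode::G_FSHR;
  B.buildInstr(FShOpc, {Dst}, {Src, Src, Amt});
  MI.eraseFromParent();
}
```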
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D105347
Legalize G_MEMCPY, G_MEMMOVE, G_MEMSET and G_MEMCPY_INLINE.
Corresponding intrinsics are replaced by a loop that uses loads/stores in
AMDGPULowerIntrinsics pass unless their length is a constant lower than
MemIntrinsicExpandSizeThresholdOpt (default 1024). Any G_MEM* instruction that
reaches the legalizer should have a constant length argument and should be
expanded into an appropriate number of loads + stores.
Differential Revision: https://reviews.llvm.org/D108357
This patch adds support for the vector-predicated `VP_STORE` and
`VP_LOAD` nodes. We do this in the same way we lower `MSTORE` and
`MLOAD`: to regular load/store instructions via intrinsics.
One necessary change was made to `SelectionDAGLegalize` so that
`VP_STORE` nodes' operation actions are taken from the stored "value"
operands, in the same vein as `STORE` or `MSTORE`.
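Roughly, the SelectionDAGLegalize change looks like the following (a sketch, assuming the stored value is operand 1, as for `STORE`):
```
case ISD::VP_STORE:
  // Query the action with the type of the stored value, not the node's
  // own result type (which is just the chain).
  Action = TLI.getOperationAction(Node->getOpcode(),
                                  Node->getOperand(1).getValueType());
  break;
```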
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108999
This patch adds support for the `VP_SCATTER` and `VP_GATHER` nodes by
lowering them to RVV's `vsox`/`vlux` instructions, respectively. This
process is almost identical to the existing `MSCATTER`/`MGATHER` support.
One extra change was made to `SelectionDAGLegalize` so that
`VP_SCATTER`'s operation action is derived from its stored "value"
operand rather than its return type (which is always the chain).
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108987
When expanding pseudo insts, in order to create a new machine instr, we use BuildMI,
which will add implicit operands by default. And transferImpOps will also copy implicit
operands from old ones. Finally, duplicate implicit operands are added to the same inst.
Sometimes this can cause correctness issues, as with the instruction below:
renamable $w18 = nsw SUBSWrr renamable $w30, renamable $w14, implicit-def dead $nzcv
After expanding, it will become
$w18 = SUBSWrs renamable $w13, renamable $w14, 0, implicit-def $nzcv, implicit-def dead $nzcv
A redundant implicit-def $nzcv is added, but the dead flag is missing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D109069
This is a more general version of D109273, though it doesn't
peek through bitcasts or rearrange broadcasts.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D109295
FeaturePMU was created in AArch64 to accommodate one missing system
register, PMMIR_EL1, in commit ffcd7698ae.
However, the Performance Monitors extension already had a target
feature, which is called FeaturePerfMon. Therefore, FeaturePMU is
redundant.
This patch removes FeaturePMU and merges its contents into
FeaturePerfMon.
Reviewed By: dnsampaio
Differential Revision: https://reviews.llvm.org/D109246
The isEssentiallyExtractHighSubvector function currently calls
getVectorNumElements on a type that in specific cases might be scalable.
Since this function only has correct behaviour at the moment on scalable
types anyway, the function can just return false when given a fixed type.
Differential Revision: https://reviews.llvm.org/D109163
d8faf03807 implemented general-regs-only for X86 by disabling all features
with vector instructions. But the CRC32 instruction in SSE4.2 ISA, which uses
only GPRs, also becomes unavailable. This patch adds a CRC32 feature for this
instruction and allows it to be used with general-regs-only.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D105462
Currently, we use SExtValue to decide whether to invert tbz or tbnz.
However, for the case zext (xor x, c), we should use ZExt rather
than SExt; otherwise we will generate totally opposite branches.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D108755
The current inconsistency confuses contributors about which coding guidelines to follow.
It would be better to have it consistent using clang-format tool.
Reviewed By: mhjacobson
Differential Revision: https://reviews.llvm.org/D109270
The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc). Matches what Intel AoM / Agner / llvm-exegesis reports.
For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:
"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."
Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis - RMW arithmetic is always 1uop - and matches what Agner / InstLatX64 report as well.
These were all set to the same best case mul i32 values (which seems to be the only version of MUL that SLM actually performs well with).
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AoM / Agner / InstLatX64.
The change here is basically the same as in D108880: Rather than
looking at bitcasts, look at calls and their function type. We
still need to look through bitcasts to find those calls.
The change in llvm/test/CodeGen/WebAssembly/add-prototypes-conflict.ll
is due to different visitation order. add-prototypes-opaque-ptrs.ll
is a copy of add-prototypes.ll with -force-opaque-pointers.
Differential Revision: https://reviews.llvm.org/D109256
As an extension to D107866, this adds store(fptosisat(..)) patterns,
similar to the existing fptosi patterns, to prevent unnecessarily moving
into gpr regs where we can use fp stores directly.
Differential Revision: https://reviews.llvm.org/D108378
This extends D107865 to the VFP instructions, lowering llvm.fptosi.sat
and llvm.fptoui.sat to VCVT instructions that inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107866
This patch changes the register class to avoid accidentally setting
the AVL operand to X0 through MachineIR optimizations.
There are cases where we really want to use X0, but we can't get that
past the MachineVerifier with the register class as GPRNoX0. So I've
used a 64-bit -1 as a sentinel for X0. All other immediate values should
be uimm5. I convert it to X0 at the earliest possible point in the VSETVLI
insertion pass to avoid touching the rest of the algorithm. In
SelectionDAG lowering I'm using a -1 TargetConstant to hide it from
instruction selection and treat it differently than if the user
used -1. A user -1 should be selected to a register since it doesn't
fit in uimm5.
This is the rest of the changes started in D109110. As mentioned there,
I don't have a failing test from MachineIR optimizations anymore.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109116
SLM PBLENDVB is just as bad as BLENDVPD/PS - so model it as such, fixing the rr vs rm uops diff as well. The Intel AoM appears to have a copy+paste typo with PBLENDW, it doesn't match Agner or InstLatX64.
Noticed while investigating some of the weird discrepancies reported by the D103695 helper script (SLM had much better vector shift throughputs than it should).
This is important as with exceptions enabled, non-POD allocas often have
two lifetime ends: the exception handler, and the normal one.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D108365
A small subset of the NEON instruction set is legal in streaming mode.
This patch adds support for the following vector to integer move
instructions:
0x00 1110 0000 0001 0010 11xx xxxx xxxx # SMOV W|Xd,Vn.B[0]
0x00 1110 0000 0010 0010 11xx xxxx xxxx # SMOV W|Xd,Vn.H[0]
0100 1110 0000 0100 0010 11xx xxxx xxxx # SMOV Xd,Vn.S[0]
0000 1110 0000 0001 0011 11xx xxxx xxxx # UMOV Wd,Vn.B[0]
0000 1110 0000 0010 0011 11xx xxxx xxxx # UMOV Wd,Vn.H[0]
0000 1110 0000 0100 0011 11xx xxxx xxxx # UMOV Wd,Vn.S[0]
0100 1110 0000 1000 0011 11xx xxxx xxxx # UMOV Xd,Vn.D[0]
Only the zero index variants are legal; all other indexes are illegal.
To support this, new instructions are defined specifically for index zero,
which is hardcoded, along with an implicit 'VectorIndex0' operand.
Since the index operand is implicit and takes no bits in the encoding,
custom decoding is required to add the operand.
I'm not sure if this is the best approach, but the predicate constraint
on a subset of an operand is unusual. I'd be interested to hear some
alternatives.
The instructions are predicated on 'HasNEONorStreamingSVE', i.e. they're
enabled by either +neon or +streaming-sve. This follows on from the work
in D106272 to support the subset of SVE(2) instructions that are legal
in streaming mode.
Depends on D107902.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D107903
This patch basically enables fast-isel for AIX 64-bit subtarget
(previously enabled only for ELF 64). The initial motivation is to
introduce branch folding to AIX generated code for correct debug
behavior. I also saw some compiling time improvement in a few LLVM
test-suite benchmarks. (toast, dbms, cjpeg, burg, etc.)
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D98844
This ISD node/wrapper represents an address which is relative to a base
address and therefore lowers to `i32.const` rather than `global.get`.
Use this wrapper type for TLS-relative addresses, paving the way for the
non-REL wrapper to be used for external TLS addresses once those are
supported.
Differential Revision: https://reviews.llvm.org/D109179
Fixing this link error:
ld: error: relocation R_X86_64_PC32 cannot be used against symbol __asan_report_load...; recompile with -fPIC
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D109183
This adds support for SjLj using Wasm exception handling instructions:
https://github.com/WebAssembly/exception-handling/blob/master/proposals/exception-handling/Exceptions.md
This does not yet support the mixed use of EH and SjLj within a
function. It will be added in a follow-up CL.
This currently passes all SjLj Emscripten tests for wasm0/1/2/3/s,
except for the below:
- `test_longjmp_standalone`: Uses Node
- `test_dlfcn_longjmp`: Uses NodeRAWFS
- `test_longjmp_throw`: Mixes EH and SjLj
- `test_exceptions_longjmp1`: Mixes EH and SjLj
- `test_exceptions_longjmp2`: Mixes EH and SjLj
- `test_exceptions_longjmp3`: Mixes EH and SjLj
Reviewed By: dschuff, tlively
Differential Revision: https://reviews.llvm.org/D108960
Similar to D108842 and D108844.
__has_builtin(__builtin_mul_overflow) returns true for 32b MIPS targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks MIPS malta_defconfig builds of the Linux kernel that are
using __builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support malta_defconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108926
https://reviews.llvm.org/D56686 was supposed to allow these to
work on Windows without needing to enable the xsave feature to
match MSVC. It seems this didn't work because the backend isel
patterns would still block it.
This patch removes the predicates from the isel patterns.
Fixes PR51706.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D109097
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair), this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
Noticed while investigating PR47437.
Differential Revision: https://reviews.llvm.org/D108522
When lowering a fixed length gather/scatter the index type is assumed to
be the same as the memory type; this is incorrect in cases where the
extension of the index has been folded into the addressing mode.
For now add a temporary workaround to fix the codegen faults caused by
this by preventing the removal of this extension. At a later date the
lowering for SVE gather/scatters will be redesigned to improve the way
addressing modes are handled.
As a short term side effect of this change, the addressing modes
generated for fixed length gather/scatters will not be optimal.
Differential Revision: https://reviews.llvm.org/D109145
If a sext_inreg is up for isel, and all its users are W instructions,
we can skip emitting the sext_inreg. This is helpful if the producing
instruction can't become a W instruction.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D108966
X0 has special meaning for vsetvli; we need to make sure we never
accidentally create a vsetvli that uses it. This could happen
if the register coalescer coalesces a copy from X0 into this
instruction.
This patch splits the instruction so that we can have GPRNoX0
register class to use for the cases where we don't want the source
to be X0. The verifier won't let us explicitly use X0 on a GPRNoX0
operand so we need a separate pseudo for those cases.
I don't currently have a failing example for this. There was a
failure in D107957, but the coalescable copy from that example
should have been optimized away much earlier so I've fixed that.
This is not a complete fix. We still need to prevent the same
possible issue on the AVL operand of all of the vector instruction
pseudos. I don't want to make two versions of all of those so we
need to find a different solution for those. I have an idea I'm
going to try.
Differential Revision: https://reviews.llvm.org/D109110
Extend SILoadStoreOptimizer to merge into DWORDX8 variant of S_BUFFER_LOAD.
Merging into DWORDX2 and DWORDX4 variants is handled already.
Differential Revision: https://reviews.llvm.org/D108909
The semantics of tail predication loops mean that the value of LR at the time
an instruction is executed determines the predicate. In other words:
mov r3, #3
DLSTP lr, r3 // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.
This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.
This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.
A lot of tests needed updating.
Differential Revision: https://reviews.llvm.org/D107638
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore I take it upon myself
to revert the tree back to the last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
Several FP instructions (fadd, fsub, etc.) were incorrectly assigned
a higher cost for SVE because they have custom lowering, however we
know they are legal. This patch explicitly assigns a cost of 2 to
these opcodes.
Tests added here:
Analysis/CostModel/AArch64/arith-fp-sve.ll
Differential Revision: https://reviews.llvm.org/D108993
This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intrinsics,
selecting a VCVT instruction which under MVE will inherently perform the
saturate.
Differential Revision: https://reviews.llvm.org/D107865
Currently, the truncate selection dag node is expanded as a bitwise AND plus compare to 1. This change enables scalar comparison in the pattern if the truncate node is uniform.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108925
With opaque pointers, no actual bitcasts will be present. Instead,
there will be a mismatch between the call FunctionType and the
function ValueType. Change the code to collect CallBases
specifically (rather than general Uses) and compare these types.
RAUW is no longer performed, as there would no longer be any
bitcasts that can be RAUWd.
Differential Revision: https://reviews.llvm.org/D108880
If the true and false values are the same, we don't need a SELECT_CC.
This would normally be folded before a select is legalized to
select_cc. The test case exploits the late legalization of vscale
to trigger a case where they become identical after legalization.
This works around an issue found on a test case in D107957. In that
case the true/false values were both eventually 0 and the select was
used by a vector AVL operand. The select_cc got expanded to control
flow and a phi, but the phi inputs were both copies from X0. MachineIR
optimizations simplified this to a single copy from X0 going into the
vector instruction. This became the input of a vsetvli after vsetvli
insertion. Then register coalescing folded the copy into the vsetvli.
X0 as the source of a vsetvli is a special encoding and should not be
created by coalesing. We need to fix our vsetvli handling to make sure
this can never happen any other way, but removing the unneeded select
is still a worthwhile optimization.
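A minimal sketch of the fold, assuming it sits where RISC-V lowers/combines SELECT_CC and that the true/false values are operands 2 and 3:
```
SDValue TrueV = Op.getOperand(2);
SDValue FalseV = Op.getOperand(3);
// If both arms are the same value, the condition is irrelevant.
if (TrueV == FalseV)
  return TrueV;
```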
This extends D108921 into a generic rule applied to constructing ExitLimits along all paths. The remaining paths (primarily howFarToZero) don't have the same reasoning about UB sensitivity as the howManyLessThan ones did. Instead, the remaining cause for max counts being more precise than exact counts is that we apply context sensitive loop guards on the max path, and not on the exact path. That choice is mildly suspect, but out of scope of this patch.
The MVETailPredication.cpp change deserves a bit of explanation. We were previously figuring out that two SCEVs happened to be equal because they happened to be identical. When we optimized one with context sensitive information, but not the other, we lost the ability to prove them equal. So, cover this case by subtracting and then applying loop guards again. Without this, we see changes in test/CodeGen/Thumb2/mve-blockplacement.ll
Differential Revision: https://reviews.llvm.org/D109015
Previously extra wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.) Instead, mark v2f32 to v2f64 extending loads as legal and explicitly
lower them using promote_low. This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade off for now.
Differential Revision: https://reviews.llvm.org/D108496
When we have an any-extending FPR bank load, none of the tablegen patterns
match and we fall back to the C++ selector. Like with the truncating stores
that were fixed recently, the C++ wasn't able to handle it and ended up
generating invalid copies between different size regclasses.
This change adds handling for this case, splitting the load into a regular
load and a SUBREG_TO_REG to extend it into the original wide destination reg.
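A rough sketch of that splitting, assuming a 32-bit load any-extended into a 64-bit FPR (register classes, subreg index and surrounding selector state are illustrative):
```
Register NarrowDst = MRI.createVirtualRegister(&AArch64::FPR32RegClass);
// ... select the ordinary 32-bit load into NarrowDst here ...
BuildMI(MBB, InsertPt, DL, TII.get(TargetOpcode::SUBREG_TO_REG), WideDst)
    .addImm(0)
    .addReg(NarrowDst)
    .addImm(AArch64::ssub); // widen instead of copying across size classes
```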
Configure and use the TSFlags in TargetRegisterClass to
have unique flags for VGPR and AGPR register classes.
The vector register class queries like `hasVGPRs` will
now become more efficient with just a bitwise operation.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108815
While initializing the LDS pointers within the entry basic block of kernel(s), make
sure that the entry basic block is split after alloca instructions.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108971
Similar to D108842, D108844, D108926, D108928, and D108936.
__has_builtin(__builtin_mul_overflow) returns true for 32b RISCV targets,
but Clang is deferring to compiler RT when encountering long long types.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108939
Similar to D108842, D108844, and D108926.
__has_builtin(__builtin_mul_overflow) returns true for 32b PPC targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ppc44x_defconfig + CONFIG_BLK_DEV_NBD=y builds of the Linux
kernel that are using __builtin_mul_overflow with these types for these
targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D108936
Icelake, Rocketlake and Tigerlake targets currently use the SkylakeServer scheduler model, despite being later microarchitectures, leading to both reported bugs (PR48110) and discrepancies when comparing llvm-mca reports to other profiling tools (OSACA, uops, uica, etc.). And tbh I'm getting sick of llvm-mca getting blamed for what are backend scheduler model issues :-(
This patch doesn't attempt to fix any of these discrepancies - there should be no changes in codegen - it's a setup patch that copies the skx model, renames all the resources, adds the additional ports (but doesn't reference them yet) and updates the llvm-exegesis pfm counter mappings (based off https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_icl_events.h).
This should make it trivial for anyone with hardware access to use llvm-exegesis reports to iteratively improve the model (my attempts to get hold of a cheap tiger lake box haven't been fruitful yet....).
I will copy the SkylakeServer llvm-mca resource tests as follow up commits - the diff should entirely be the resource renames.
Differential Revision: https://reviews.llvm.org/D108914
Add an assert to provoke the failure in object file output, not just in disassembly output.
Reviewed By: yroux
Differential Revision: https://reviews.llvm.org/D107259
This is an improvement over D107852. We don't need to enumerate specific
function names; we can just check for `noreturn` attribute. This also
requires us to make sure `__resumeException` and `emscripten_longjmp`
have `noreturn` attribute too; one of them is a JS function and the
other calls a JS function so Clang does not have a way to deduce they
don't return.
This is effectively NFC, because I'm not sure there is an additional
case this covers; if we add a custom function call that has the
`noreturn` attribute, it will be processed within the SjLj handling and
turned into an `__invoke` call. So this really applies to some special
functions like `emscripten_longjmp`.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108955
There are four kinds of "rethrowing" code paths in this pass:
1. In Emscripten SjLj, after a possibly longjmping function call, we
check if the thrown longjmp corresponds to one of setjmps within the
current function. If not, we rethrow the longjmp by calling
`emscripten_longjmp`.
2. In Emscripten EH, after a possibly throwing function call, we check
if the thrown exception corresponds to the current `catch` clauses.
If not, we rethrow the exception by calling `__resumeException`.
3. When both Emscripten EH and SjLj are used, when we check for an
exception after a possibly throwing function call, it is possible
that we get not an exception but a longjmp. In this case, we
shouldn't swallow it; we should rethrow the longjmp by calling
`emscripten_longjmp`.
4. When both Emscripten EH and SjLj are used, when we check for a
longjmp after a possibly longjmping function call, it is possible
that we get not a longjmp but an exception. In this case, we
shouldn't swallow it; we should rethrow the exception by calling
`__resumeException`.
Case 1 is in Emscripten SjLj, 2 is in Emscripten EH, and 3 and 4 are
relevant when both Emscripten EH and SjLj are used. 3 and 4 were first
implemented in D106525.
We create BBs for 1, 3, and 4 in this pass. We create those BBs for
every throwing/longjmping function call, along with other BBs that
contain condition checks. What this CL does is create a single BB
within a function for each of cases 1, 3, and 4. These BBs are exiting
BBs in the function and thus don't have successors, so they are easy to
share between calls.
The names of BBs created are:
Case 1: `call.em.longjmp`
Case 3: `rethrow.exn`
Case 4: `rethrow.longjmp`
For case 2 we don't currently create BBs; we only replace the
existing `resume` instruction with `call @__resumeException`. And Clang
already creates only a single `resume` BB per function and reuses it,
so we don't need to optimize this case.
Not sure what good benchmarks for EH/SjLj are, but this decreases the
size of the object file for `grfmt_jpeg.bc` (presumably from opencv) we
got from one of our users by 8.9%. Even after running `wasm-opt -O4` on
them, there is still 4.8% improvement.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108945
The min/max intrinsics are not yet canonical, but when they are the tail
predications analysis will change from treating them like icmp to
treating them like intrinsics. Unfortunately, they can currently produce
better code by not being tail predicated thanks to the vectorizer picking
higher VFs and the backend folding to better instructions (especially
for saturate patterns). In the long run we will need to improve the
vectorizer's cost modelling, recognizing the instruction directly, but in
the meantime this treats min/max as before to prevent performance
regressions.
The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInst() as well. This avoids truncation for
targets that support immediates larger than 32 bits. In particular, we
can avoid the bugprone value normalization hack in the AArch64
target.
This is a followup to D108076.
Differential Revision: https://reviews.llvm.org/D108875
The GLM/SLM special cases still get tested first but after the MUL/DIV/REM pattern detection - this will be necessary for when we make the SLM vXi32 MUL canonicalization generic to improve PMULLW/PMULHW/PMADDWD cost support etc.
PowerPC can model these instructions, so we don't need this flag set.
Reviewed By: shchenz, jsji
Differential Revision: https://reviews.llvm.org/D71983
Remove method X86LowerAMXType::getRowFromCol since it's not used, and
it's causing a warning.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D108862
A new kind BTF_KIND_TAG is added to .BTF to encode
btf_tag attributes. The format looks like
  CommonType.name : attribute string
  CommonType.type : attached to a struct/union/func/var.
  CommonType.info : encoding BTF_KIND_TAG;
                    kflag == 1 to indicate the attribute is
                    for CommonType.type, or kflag == 0
                    for a struct/union member or func argument.
  one uint32_t    : to encode which member/argument, starting from 0.
If one particular type or member/argument has more than one attribute,
multiple BTF_KIND_TAG will be generated.
Differential Revision: https://reviews.llvm.org/D106622
If the icmp is in a different block, then the register for the icmp
operand may not be initialized, as it nominally does not have
cross-block uses. Add a check that the icmp is in the same block
as the branch, which should be the common case.
This matches what X86 FastISel does:
5b6b090cf2/llvm/lib/Target/X86/X86FastISel.cpp (L1648)
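A sketch of the added guard, assuming the surrounding FastISel branch-selection code (names are illustrative):
```
// Only fold the icmp into a compare-and-branch when it is defined in the
// same block as the branch, so its operand registers are known to exist.
if (const auto *CI = dyn_cast<ICmpInst>(BI->getCondition()))
  if (CI->hasOneUse() && CI->getParent() == BI->getParent()) {
    // ... emit the fused compare and conditional branch ...
  }
```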
The "not" transform that could have a similar issue is dropped
entirely, because it is currently dead: The incoming value is
a branch or select condition of type i1, but this code requires
an i32 to trigger.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51651.
Differential Revision: https://reviews.llvm.org/D108840
__has_builtin(__builtin_mul_overflow) returns true for 32b MIPS targets,
but Clang is deferring to compiler RT when encountering `long long`
types. This breaks sanitizer builds of the Linux kernel that are using
__builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support malta_defconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108844
__has_builtin(__builtin_mul_overflow) returns true for 32b ARM targets,
but Clang is deferring to compiler RT when encountering `long long`
types. This breaks sanitizer builds of the Linux kernel that are using
__builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support allmodconfig builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: rengolin
Differential Revision: https://reviews.llvm.org/D108842
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390
In this case, we don't check at all that the "broadcast-from-mem" we create
can actually fold the load. Here, its operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
t7: i64 = extract_vector_elt t5, Constant:i64<0>
t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t4: i64 = undef
t3: i64 = Constant<0>
Combining: t15: v8i32 = undef
```
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108821
This makes sure that the text section will have a 2-byte alignment if
the +c extension is enabled.
Reviewed By: MaskRay, luismarques
Differential Revision: https://reviews.llvm.org/D102052
This adds an ELEN limit for fixed length vectors. This will scalarize
any elements larger than this. It will also disable some fractional
LMULs. For example, if ELEN=32 then mf8 becomes illegal, i32/f32
vectors can't use any fractional LMULs, i16/f16 can only use mf2,
and i8 can use mf2 and mf4.
We may also need something for the scalable vectors, but that has
interactions with the intrinsics and we can't scalarize a scalable
vector.
Longer term this should come from one of the Zve* features
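An illustrative helper for the element-size limit (name and exact placement in the RISC-V target are assumptions):
```
// Elements wider than ELEN cannot be placed in a fixed-length RVV vector
// and are scalarized instead; e.g. with ELEN=32, i64/f64 elements fail.
static bool fitsInELen(EVT ScalarVT, unsigned ELen) {
  return ScalarVT.getFixedSizeInBits() <= ELen;
}
```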
For vectors whose size is exactly equal to getMaxSVEVectorSizeInBits, just use
AArch64SVEPredPattern::all, which can enable the use of unpredicated ptrue when available.
TestPlan: check-llvm
Differential Revision: https://reviews.llvm.org/D108706
This patch solely moves the conversion between SVE predicate pattern
and element number into two small functions. It's a pre-commit patch for
optimizing ptrue with a known SVE register width.
Differential Revision: https://reviews.llvm.org/D108705
The x86_fp80 format allows values that do not fit any IEEE-754 category.
Previously they were recognized by the intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using the FXAM instruction, which
distinguishes between NaNs and unsupported values. It can make some
programs behave differently.
As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __builtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.
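A minimal sketch of the non-strictfp path, assuming an IRBuilder-level lowering (helper name is illustrative): an unordered self-comparison is true exactly when the operand is a NaN.
```
static Value *lowerIsNaNNonStrict(IRBuilderBase &B, Value *X) {
  // fcmp uno x, x  ==>  true iff x is NaN
  return B.CreateFCmpUNO(X, X, "isnan");
}
```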
Differential Revision: https://reviews.llvm.org/D108037
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.
Differential Revision: https://reviews.llvm.org/D108762
Fixes PR51626.
The M68k requires that all instruction, word and long word reads are
aligned to word boundaries. From the 68020 onwards, there is a
performance benefit from aligning long words to long word boundaries.
The M68k uses the same data layout for pointers and integers.
In line with this, this commit updates the pointer data layout to
match the layout already set for 32-bit integers: 32:16:32.
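For illustration, the pointer component of the datalayout string after this change would read as follows (the full M68k datalayout string contains more fields):
```
// p:<size>:<abi align>:<preferred align> -> 32-bit pointers, 16-bit ABI
// alignment, 32-bit preferred alignment.
static const char *PtrLayoutFragment = "p:32:16:32";
```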
Differential Revision: https://reviews.llvm.org/D108792
Exegesis is faulty: when measuring throughput^-1 it sometimes
produces snippets that have loop-carried dependencies,
which must be what caused me to measure this incorrectly originally.
After looking much more carefully, the inverse throughput should match
that of the MULX w/ reg op.
As per llvm-exegesis measurements.
In a future patch, a new set of amdgpu-no-* attributes will be
introduced to indicate when a function does not need an implicitly
passed input. This pass introduces new instances of these intrinsic
calls, and should remove the attributes if they were present before.
Switch to using BitIntegerState for each of the inputs, and invert
their meanings.
This now diverges more from the old AMDGPUAnnotateKernelFeatures, but
this isn't used yet anyway.
We only really want this to add the custom attributes. Theoretically
the regular transforms were already run at this point. Touching
undefined behavior breaks a lot of tests when this is enabled by
default, many of which are expecting to test handling of undef
operations.
amdgpu-calls and amdgpu-stack-objects don't really belong as
attributes, and are currently a hacky way of passing an analysis into
the DAG. These don't really belong in the IR, and don't really fit in
with the other attributes. Remove these to facilitate inverting the
pass.
I don't exactly understand the indirect call test changes. These tests
are using calls which are trivially replacable with a direct call, so
I'm not sure what the point is.
When doing Emscripten EH, if SjLj is also enabled and used and if the
thrown exception has a possibility of being a longjmp instead of an
exception, we shouldn't swallow it; we should rethrow, or relay it. It
was done in D106525 and the code is here:
8441a8eea8/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L858-L898)
Here is the pseudocode of that part: (copied from comments)
```
if (%__THREW__.val == 0 || %__THREW__.val == 1)
goto %tail
else
goto %longjmp.rethrow
longjmp.rethrow: ;; This is longjmp. Rethrow it
%__threwValue.val = __threwValue
emscripten_longjmp(%__THREW__.val, %__threwValue.val);
tail: ;; Nothing happened or an exception is thrown
... Continue exception handling ...
```
If the current BB (where the `invoke` is created) has successors that
have the current BB as a PHI incoming block, that incoming block now has
to change to `tail` in the pseudocode, because `tail` is the latest BB
that is connected with the next BB. This was missing.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108785
Looks like the NoRegister has some effect on the final code that is generated. My guess is that some optimization kicks in at the end?
When I use -S to dump the assembly I get the correct version with 'shrq $3, %r8':
movq %r9, %r8
shrq $3, %r8
movsbl 2147450880(%r8), %r8d
But when I disassemble the final binary, I get RAX instead of R8:
mov %r9,%r8
shr $0x3,%rax
movsbl 0x7fff8000(%r8),%r8d
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D108745
This pattern
```
%elt = ... something ...
%undef = G_IMPLICIT_DEF
%vec = G_BUILD_VECTOR %elt, %undef, %undef, ... %undef
```
can be selected to a SUBREG_TO_REG, assuming `%elt` and `%vec` have the same
register bank. We don't care about any of the bits in `%vec` aside from those
in `%elt`, which just happens to be the 0th element.
This is preferable to emitting `mov` instructions for every index.
This gives minor code size improvements on the test suite at -Os.
Differential Revision: https://reviews.llvm.org/D108773
Similar to what we do for add/sub/mul.
This can help remove some sext.w. There are some regressions on
some bswap tests, but I have an idea how to fix that for a follow up.
A new PACKW pattern is added to handle the new sext_inreg placement.
Differential Revision: https://reviews.llvm.org/D108663
On targets requiring VGPR alignment we may end up spilling an
unaligned register if we have partially spilled an odd number of
leading lanes. The remainder will start with an odd register.
This problem is solved by inverting the order of lanes to
be spilled so that we start from the end.
Differential Revision: https://reviews.llvm.org/D108732
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.
We cannot replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108757
SelectADDRParam was discovered as being dead 5 years ago and removed in
7b4ef068c6 but the unused ComplexPattern definition was left behind.
SelectADDRDWord has never existed as far as I can tell, even back when
AMDGPU was R600-only and called that.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108758
It turns out that SchedWrite WriteIMulH was always assigned to the low half of
the result of a MULX (rather than to the high half).
To avoid confusion, this patch swaps the two MULX writes in the tablegen
definition of MULX32/64. That way, write names better describe what they
actually refer to; this also avoids further complications if in future we decide
to reuse the same MulH writes to also model other scalar integer multiply
instructions. I also had to swap the latency values for the two MULX writes to
make sure that the change is effectively an NFC. In fact, none of the existing
x86 tests were affected by this small refactoring.
This patch also fixes a bug in MCA: a wrong latency value was propagated for
instructions that perform multiple writes to a same register. This last issue
was found by Roman while testing MULX on targets that define a different latency
for the Low/High part of the result.
Differential Revision: https://reviews.llvm.org/D108727
Tell the cost model to use the scalable calculation for non-NEON fixed vectors.
This results in a cheaper cost for fixed-length SVE masked gathers/scatters,
allowing the vectorizer to emit them more frequently.
Emscripten SjLj transformation is done in four steps. This will be
mostly the same for the soon-to-be-added Wasm SjLj; steps 1, 3, and 4
will be shared and there will be a separate way of doing step 2.
1. Initialize `setjmpTable` and `setjmpTableSize` in the entry BB
2. Handle `setjmp` callsites
3. Handle `longjmp` callsites
4. Cleanup and update SSA
We initialize `setjmpTable` and `setjmpTableSize` in the entry BB. But
if the entry BB contains a `setjmp` call, some `setjmp` handling
transformation will also happen in the entry BB, such as calling
`saveSetjmp`.
This is fine for Emscripten SjLj but not for Wasm SjLj, because in Wasm
SjLj we will add a dispatch BB that contains a `switch` right after the
entry BB, from which we jump to one of post-`setjmp` BBs. And this
dispatch BB should precede all `setjmp` calls.
Emscripten SjLj (current):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
...
call @saveSetjmp(...)
```
Wasm SjLj (follow-up):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
setjmp.dispatch:
...
; Jump to the right post-setjmp BB, if we are returning from a
; longjmp. If this is the first setjmp call, go to %entry.split.
switch i32 %no, label %entry.split [
i32 1, label %post.setjmp1
i32 2, label %post.setjmp2
...
i32 N, label %post.setjmpN
]
entry.split:
...
call @saveSetjmp(...)
```
So in Wasm SjLj we split the entry BB to make the entry block only for
`setjmpTable` and `setjmpTableSize` initialization and insert a
`setjmp.dispatch` BB. (This part is not in this CL. This will be a
follow-up.) But note that Emscripten SjLj and Wasm SjLj share all
steps except for step 2. If we split the entry BB only for Wasm
SjLj, there will be one more `if`-`else` and the code will be more
complicated.
So this CL splits the entry BB in Emscripten SjLj and put only
initialization stuff there as follows:
Emscripten SjLj (this CL):
```
entry:
%setjmpTable = ...
%setjmpTableSize = ...
br %entry.split
entry.split:
...
call @saveSetjmp(...)
```
This is just done to share code with Wasm SjLj. It adds an unnecessary
branch but this will be removed in later optimization passes anyway.
This is in effect NFC, meaning the program behavior will not change, but
existing ll test files have changed because the entry block was split.
The reason I upload this in a separate CL is to make the Wasm SjLj diff
tidier, because this changes many existing Emscripten SjLj tests, which
can be confusing for the follow-up Wasm SjLj CL.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108729
Emscripten SjLj and (soon-to-be-added) Wasm SjLj transformation share
many steps:
1. Initialize `setjmpTable` and `setjmpTableSize` in the entry BB
2. Handle `setjmp` callsites
3. Handle `longjmp` callsites
4. Cleanup and update SSA
1, 3, and 4 are identical for Emscripten SjLj and Wasm SjLj. Only
step 2 is different. This CL extracts the current Emscripten SjLj's
longjmp callsites handling into a function. The reason to make this a
separate CL is, without this, the diff tool cannot compare things well
in the presence of moved code and added code in the followup Wasm SjLj
CL, and it ends up mixing them together, making the diff unreadable.
Also fixes some typos and variable names. So far we've been calling the
buffer argument to `setjmp` and `longjmp` `jmpbuf`, but the name used in
the man page for those functions is `env`, so updated them to be
consistent.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108728
Rename the M68kOperand::Type enumeration to KindTy to avoid ambiguity
with the Kind field when referencing enumeration values e.g.
`Kind::Value`.
This works around a compilation error under GCC 5, where GCC won't
lookup enum class values if you have a similarly named field
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60994).
The error in question is:
`M68kAsmParser.cpp:857:8: error: 'Kind' is not a class, namespace, or enumeration`
Differential Revision: https://reviews.llvm.org/D108723
The plan was to use `wasm.catch.exn` intrinsic to catch exceptions and
add `wasm.catch.longjmp` intrinsic, that returns two values (setjmp
buffer and return value), later to catch longjmps. But because we
decided not to use multivalue support at the moment, we are going to use
one intrinsic that returns a single value for both exceptions and
longjmps. And even if it's not for that, I now think the naming of
`wasm.catch.exn` is a little weird, because the intrinsic can still take
a tag immediate, which means it can be used for anything, not only
exceptions, as long as that returns a single value.
This partially reverts D107405.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D108683
In-register structure returns are not special, and handled by lowering
to multiple-value tuples. We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.
Differential Revision: https://reviews.llvm.org/D105807
With spilling into AGPRs enabled we cannot reliably predict
if we need to save FP or not. We may finally spill everything
into AGPRs and never touch stack. In this case we still may
save FP. This is a deficiency but not an error, so avoid the
assert.
Differential Revision: https://reviews.llvm.org/D107404
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using Reg - X86::NoRegister formula.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
This change enables compare operations to be selected to SALU/VALU form
depending on the SDNode divergence flag.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D106079
This patch replaces the SpecialRegisters field with a unique_ptr instead of a raw pointer. This is better practice, and allows us to remove the definition of the dtor for the SystemZSubtarget class.
Reviewed By: uweigand, Kai
Differential Revision: https://reviews.llvm.org/D108639
Before this patch, WriteIMulH reported a latency value which is correct for the
RR variant of MULX, but not for the RM variant.
This patch fixes the issue by introducing a new WriteIMulHLd, which is meant to
be used only by the RM variant of MULX.
Differential Revision: https://reviews.llvm.org/D108701
A single smov instruction is capable of moving from a vector register while performing
the sign-extend during said move, rather than each step being performed by separate instructions.
Differential Revision: https://reviews.llvm.org/D108633
Makes patterns added for gfx90a usable with the gfx10 versions of the
insts.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108654
1. Split out some parts of the R600 target into separate modules/headers.
2. Reduced some include lists in headers.
3. Found and fixed an issue where the overrides `GCNTargetMachine::getSubtargetImpl()`
and `R600TargetMachine::getSubtargetImpl()` had a different return value type
than the base class.
4. Minor forward declarations cleanup.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D108596
On some AMDGPU subtargets, copying to and from AGPR registers using another
AGPR register is not possible. An intermediate VGPR register is needed for AGPR
to AGPR copy. This is an issue when machine copy propagation forwards a
COPY $agpr, replacing a COPY $vgpr which results in $agpr = COPY $agpr. It is
removing a cross class copy that may have been optimized by previous passes and
potentially creating an unoptimized cross class copy later on.
To avoid this issue, check CrossCopyRegClass if a different register class will
be needed for the copy. If so then avoid forwarding the copy when the
destination does not match the desired register class and if the original copy
already matches the desired register class.
Issue seen while attempting to optimize another AGPR to AGPR issue:
Live-ins: $agpr0
$vgpr0 = COPY $agpr0
$agpr1 = V_ACCVGPR_WRITE_B32 $vgpr0
$agpr2 = COPY $vgpr0
$agpr3 = COPY $vgpr0
$agpr4 = COPY $vgpr0
After machine-cp:
$vgpr0 = COPY $agpr0
$agpr1 = V_ACCVGPR_WRITE_B32 $vgpr0
$agpr2 = COPY $agpr0
$agpr3 = COPY $agpr0
$agpr4 = COPY $agpr0
Machine-cp propagated COPY $agpr0 to replace $vgpr0 creating 3 AGPR to AGPR
copies. Later this creates a cross-register copy from AGPR->VGPR->AGPR for each
copy when the prior VGPR->AGPR copy was already optimal.
Reviewed By: lkail, rampitec
Differential Revision: https://reviews.llvm.org/D108011
We update SSA in two steps in Emscripten SjLj:
1. Rewrite uses of `setjmpTable` and `setjmpTableSize` variables and
place `phi`s where necessary, which are updated where we call
`saveSetjmp`.
2. Do a whole function level SSA update for all variables, because we
split BBs where `setjmp` is called and there are possibly variable
uses that are not dominated by a def.
(See 955b91c19c/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L1314-L1324))
We have been using `SSAUpdater` to do this, but `SSAUpdaterBulk` class
was added after this pass was first created, and for step 2 it looks
like a better alternative with a possible performance benefit. Not sure
the author is aware of it, but `SSAUpdaterBulk` seems to have a
limitation: it cannot handle a use within the same BB as a def but
before it. For example:
```
... = %a + 1
%a = foo();
```
or
```
%a = %a + 1
```
The uses `%a` in RHS should be rewritten with another SSA variable of
`%a`, most likely one generated from a `phi`. But `SSAUpdaterBulk`
thinks all uses of `%a` are below the def of `%a` within the same BB.
(`SSAUpdater` has two different functions of rewriting because of this:
`RewriteUse` and `RewriteUseAfterInsertions`.) This doesn't affect our
usage in the step 2 because that deals with possibly non-dominated uses
by defs after block splitting. But it does in the step 1, which still
uses `SSAUpdater`.
But this CL also simplifies step 1 by using `make_early_inc_range`,
removing the need to advance the iterator before rewriting a use.
This is NFC; the test changes are just the order of PHI nodes.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D108583
This CL is small, but the description can be a little long because I'm
trying to sum up the status quo for Emscripten/Wasm EH/SjLj options.
First, this CL adds an option for Wasm SjLj (`-wasm-enable-sjlj`), which
handles SjLj using Wasm EH. The implementation for this will be added as
a followup CL, but this adds the option first to do error checking.
This also adds an option for Wasm EH (`-wasm-enable-eh`), which has been
already implemented. Before we used `-exception-model=wasm` as the same
meaning as enabling Wasm EH, but after we add Wasm SjLj, it will be
possible to use Wasm EH instructions for Wasm SjLj while not enabling
EH, so going forward, to use Wasm EH, `opt` and `llc` will need this
option. This only affects `opt` and `llc` command lines and does not
affect Emscripten user interface.
Now we have two modes of EH (Emscripten/Wasm) and also two modes of SjLj
(also Emscripten/Wasm). The options corresponding to each of are:
- Emscripten EH: `-enable-emscripten-cxx-exceptions`
- Emscripten SjLj: `-enable-emscripten-sjlj`
- Wasm EH: `-wasm-enable-eh -exception-model=wasm`
`-mattr=+exception-handling`
- Wasm SjLj: `-wasm-enable-sjlj -exception-model=wasm`
`-mattr=+exception-handling`
The reason Wasm EH/SjLj's options are a little complicated is that
`-exception-model` and `-mattr` are common LLVM options and not under
our control. (`-mattr` can be omitted if it is embedded within the
bitcode file.)
And we have the following rules of the option composition:
- Emscripten EH and Wasm EH cannot be turned on at the same time
- Emscripten SjLj and Wasm SjLj cannot be turned on at the same time
- Wasm SjLj should be used with Wasm EH
Which means we now allow these combinations:
- Emscripten EH + Emscripten SjLj: the current default in `emcc`
- Wasm EH + Emscripten SjLj:
This is allowed, but only as an interim step in which we are testing
Wasm EH but not yet have a working implementation of Wasm SjLj. This
will error out (D107687) in compile time if `setjmp` is called in a
function in which Wasm exception is used.
- Wasm EH + Wasm SjLj:
This will be the default mode later when using Wasm EH. Currently Wasm
SjLj implementation doesn't exist, so it doesn't work.
- Emscripten EH + Wasm SjLj will not work.
This CL moves these error checking routines to
`WebAssemblyPassConfig::addIRPasses`. Not sure if this is an ideal place
to do this, but I couldn't find elsewhere. Currently some checking is
done within LowerEmscriptenEHSjLj, but these checks only run if
LowerEmscriptenEHSjLj runs, so they may not run when Wasm EH is used. This
moves that to `addIRPasses` and adds some more checks.
Currently LowerEmscriptenEHSjLj pass is responsible for Emscripten EH
and Emscripten SjLj. Wasm EH transformations are done in multiple
places, including WasmEHPrepare, LateEHPrepare, and CFGStackify. But in
the followup CL, the LowerEmscriptenEHSjLj pass will also be responsible for
part of the Wasm SjLj transformation, because Wasm SjLj will also use
several Emscripten library functions, and more than half of that
transformation will be shared between Emscripten SjLj and Wasm SjLj.
Currently we have `-enable-emscripten-cxx-exceptions` and
`-enable-emscripten-sjlj`, but these only work with `llc`: for `llc` we
feed these options to the pass, whereas when we run the pass using
`opt` it is created with no arguments, so its default parameters are
used, which turns both Emscripten EH and Emscripten SjLj on.
Now that we have one more SjLj option to care for, the LowerEmscriptenEHSjLj
pass needs a finer way to control these options. This CL removes those
default parameters and makes the LowerEmscriptenEHSjLj pass read the
specified command line options directly. So if we only run
`opt -wasm-lower-em-ehsjlj`, currently both Emscripten EH and Emscripten
SjLj will run, but with this CL neither will run unless we additionally
pass `-enable-emscripten-cxx-exceptions` or `-enable-emscripten-sjlj`,
or both. This does not affect users; it only affects our `opt` tests,
because `emcc` will not call either `opt` or `llc`. As a result,
our existing Emscripten EH/SjLj tests gained one or both of those
options in their `RUN` lines.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D107685
Fixes PR51605 in which a DAG combine and legalization sequence generated
out-of-range constants in BUILD_VECTOR lanes. In the v16i8 case, the constants
were 255, which would be in range if DAG ISel used unsigned constants, but they
are out of range because DAG ISel uses signed constants.
Differential Revision: https://reviews.llvm.org/D108669
This reverts commit 67bf3ac744.
The reason is that this change is now superseded by 04fb9b729a which fixes the
underlying problem in the selector. Now it's fine to generate truncating FP stores
since the selector code will just generate subreg copies to handle them.
When the tablegen patterns failed to select a truncating scalar FPR store,
our manual selection code also failed to handle it, silently trying to
generate an invalid copy. Fix this by adding support in the manual code
to generate a proper subreg copy before selecting a non-truncating store.
Patch 9588b685c6 introduced a dependency on ASan, but it didn't
explicitly list LLVMInstrumentation as one of the library dependencies,
so the build fails if we're building LLVM as shared libraries
(i.e. -DBUILD_SHARED_LIBS=ON).
This patch explicitly links X86CodeGen against the Instrumentation
component.
Differential Revision: https://reviews.llvm.org/D108662
This implementation allows mca to model the desired behaviour of the s_waitcnt
instruction. This patch also adds the RetireOOO flag to the AMDGPU instructions
within the scheduling model. This flag is only used by mca and allows
instructions to finish out-of-order which helps mca's simulations more closely
model the actual device.
Differential Revision: https://reviews.llvm.org/D104730
The implementation uses the int_asan_check_memaccess intrinsic to instrument the code. The intrinsic is replaced by a call to a function which performs the access check. The generated function names encode the input register name as a number using the Reg - X86::NoRegister formula.
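As a rough illustration of the encoding (only the `Reg - X86::NoRegister`
arithmetic comes from this patch; the callee name format below is a
hypothetical placeholder, not the actual symbol):
```
// Hypothetical sketch: encode the register operand into the callee name.
// The "__asan_check_memaccess_rn" prefix is an assumption for illustration.
unsigned RegNum = Reg - X86::NoRegister;
std::string Callee = ("__asan_check_memaccess_rn" + Twine(RegNum)).str();
```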
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D107850
1) Just mark this case as legal because it can just be a copy.
2) Ensure the copy in the existing code actually gets selected. Without doing
this, we'll crash because the destination won't have a register class.
This fell back 35 times in a build of clang with GISel for AArch64.
Differential Revision: https://reviews.llvm.org/D108610
Reuse the selection code from the ld2 case. This is similar to how SDAG handles
things in AArch64ISelDAGToDAG. (See SelectLoad)
This fell back ~100 times while building clang with GISel enabled for AArch64.
Factoring out the gross subreg copy part ought to make selecting the rest of
this family fairly easy.
Differential Revision: https://reviews.llvm.org/D108600
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.
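Roughly, the new fold builds the add from a single frozen copy of the operand,
something like this sketch (generic SelectionDAG calls; not the exact X86 code):
```
// shl X, 1  -->  add freeze(X), freeze(X)
// Freezing X once and reusing it guarantees both add operands observe the
// same value even when X is undef/poison in some lanes (PR50468).
SDValue FrozenX = DAG.getFreeze(X);
SDValue Res = DAG.getNode(ISD::ADD, DL, VT, FrozenX, FrozenX);
```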
Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.
Differential Revision: https://reviews.llvm.org/D108139
Fixes PR51580.
Register masks will now be printed as 'movem.l (%sp), %a0-%a5/%d5'
for example and can now be parsed in the same format.
Previously the printed syntax was 'movem.l (%sp), %a0-%a5,%d', which
didn't match prior art and was too ambiguous to easily parse.
Differential Revision: https://reviews.llvm.org/D108597
Over in D105657, we started dropping instruction numbers (that become
variable locations) from call instructions, as we can't correctly represent
the x87 FP stack. Unfortunately, it turns out that the "special FP
instructions" that this pass transforms includes "every call instruction"
[0]. Thus, we've ended up dropping all return values from all calls. Ouch.
This patch adds a filter: only drop instruction numbers from calls if they
return something on the FP stack. Seeing how LLVM only allows a single
return value, this should drop instruction numbers on anything that returns
a float, and nothing else.
Rather than writing a new test, I've modified the original one to have a
positive and negative case: drop instruction number on a call with an
FP-stack modification, keep it on a plain call.
Differential Revision: https://reviews.llvm.org/D108580
This does the same as D96259, but for ARM, just like AArch64,
using the same comment char as for ELF and MinGW mode.
As the assembly input/output of LLVM is GAS style, trying to
match what MS armasm.exe does isn't needed (because the comment
char used is the least concern when it comes to that; all
directives differ too). If a separate armasm compatible mode
is implemented, it can use its own comment style (just like
llvm-ml implements MS ml.exe compatible assembly parsing).
This fixes building compiler-rt assembly files for ARM in MSVC
mode.
The updated testcase literals-comments.s was only intended to
make sure that '#' isn't interpreted as a comment char.
Differential Revision: https://reviews.llvm.org/D107251
This is pretty similar to the ST2 selection code in
`AArch64InstructionSelector::selectIntrinsicWithSideEffects`.
This is a GISel equivalent of the ld2 case in `AArch64DAGToDAGISel::Select`.
There's some weirdness there that appears here too (e.g. using ld1 for scalar
cases, which are 1-element vectors in SDAG.)
It's a little gross that we have to create the copy and then select it right
after, but I think we'd need to refactor the existing copy selection code
quite a bit to do better.
This was falling back while building llvm-project with GISel for AArch64.
Differential Revision: https://reviews.llvm.org/D108590
Intended to be NFC. ARM/AArch64 don't appear to need adjustment.
TargetMachine::shouldAssumeDSOLocal is expected to be very simple, ideally
matching isDSOLocal(). The IR producers are expected to set dso_local correctly.
(While some may think this function can make producers' work easier, the
function is really not in a good position to set dso_local. See the various
special cases we duplicate from clang CodeGenModule.cpp.)
Reviewed By: mstorsjo
Differential Revision: https://reviews.llvm.org/D108514
It appears that the Read operand for stores was being placed on the
first operand (the stored value) rather than on the address base. This adds a
ReadST for the stored value operand, allowing ReadAdrBase to
correctly act upon the address.
Differential Revision: https://reviews.llvm.org/D108287
After c063946476 usage of R31 doesn't necessarily mean
that alloca is used. The `TracebackTable::IsAllocaUsedMask` flag should be set only
when R31 is used as a frame pointer.
On AIX the 'function calls alloca' bit seems to be set whenever R31 is
set up as a frame pointer, even when there is no alloca call.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D108141
The following scalar FP instructions are legal in streaming mode:
0101 1110 xx1x xxxx 11x1 11xx xxxx xxxx # FMULX/FRECPS/FRSQRTS (scalar)
0101 1110 x10x xxxx 00x1 11xx xxxx xxxx # FMULX/FRECPS/FRSQRTS (scalar, FP16)
01x1 1110 1x10 0001 11x1 10xx xxxx xxxx # FRECPE/FRSQRTE/FRECPX (scalar)
01x1 1110 1111 1001 11x1 10xx xxxx xxxx # FRECPE/FRSQRTE/FRECPX (scalar, FP16)
Predicate them on `HasNEONorStreamingSVE`. Full list of affected
instructions:
FMULX16, FMULX32, FMULX64, FRECPS16, FRECPS32, FRECPS64, FRSQRTS16,
FRSQRTS32, FRSQRTS64, FRECPEv1f16, FRECPEv1i32, FRECPEv1i64, FRECPXv1f16,
FRECPXv1i32, FRECPXv1i64, FRSQRTEv1f16, FRSQRTEv1i32, FRSQRTEv1i64
Depends on D107902.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06/SIMD-FP-Instructions
Execution of NEON instructions that are illegal in streaming mode will
cause a trap or exception. Using FMULX [1] as an example, this check is
at the top of the pseudocode:
if elements == 1 then
    CheckFPEnabled64();
else
    CheckFPAdvSIMDEnabled64();
For the legal scalar variants it calls `CheckFPEnabled64`, whereas for the
illegal vector variants it calls `CheckFPAdvSIMDEnabled64` which traps.
This is useful for observing which instructions are/aren't legal
in streaming mode.
[1] https://developer.arm.com/documentation/ddi0602/2021-06/SIMD-FP-Instructions/FMULX--Floating-point-Multiply-extended-
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D108039
Split out from D107903 to remove dependency for D108039 and D108279.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108293
This is the first step toward enabling PPC64 to support huge frame sizes (>2G). It also fixes an assertion error for the frame size: with `int x;`, `!isInt<32>(x)` always evaluates to false, so the guard code for the frame size was impossible to hit.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D107435
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.
This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
Broadwell is mainly a die shrink of Haswell, but the model had many of the scheduling classes in different orders, making side-by-side comparisons very difficult.
The InstRW overrides are still quite different, but at least that part of the side-by-side diff is now in the same position.
This was noticed while I was trying to investigate diffs between llvm-mca and other perf analyzers in https://uica.uops.info/ - we used to be able to do diffs between most of the models very easily, but we seem to have lost that simplicity as classes have been altered, models have been refined and other models have rotted.
Add a variant of the mve-vqdmulh tests that uses min/max intrinsics
directly, including a scalar test that shows the combine misbehaving for
min intrinsics, and a fix for the combine to prevent that.
Truncating stores with GPR bank sources shouldn't be mutated into using FPR bank
sources, since those aren't supported.
Ideally this should be a selection failure in the tablegen patterns, but for now
avoid generating them.
SDAG lowers 32-bit and 64-bit G_SMULO + G_UMULO. We were missing the 32-bit
case.
For other sizes, make the 0th type a power of 2 and clamp it to either 32 bits
or 64 bits.
Right now, this will allow us to handle narrow types (e.g. s4, s24, etc.). The
LegalizerHelper doesn't support narrowing G_SMULO or G_UMULO right now. I think
we want clamping behaviour either way, so we might as well include it now to
be explicit.
Differential Revision: https://reviews.llvm.org/D108240
We had a rule for <n x s64> but not one for <n x p0>. As a result, we'd fall
back on types like <5 x p0>.
Differential Revision: https://reviews.llvm.org/D108484
Destination is always a GPR, since the result is always an integer.
Source is always a FPR, since the source is always floating point.
Differential Revision: https://reviews.llvm.org/D108419
This reverts commit f4122398e7 to
investigate a crash exposed by it.
The patch breaks building the code below with `clang -O2 --target=aarch64-linux`:
int a;
double b, c;
void d() {
  for (; a; a++) {
    b += c;
    c = a;
  }
}
Before this patch, instructions MULX32rm and MULX64rm were missing a ReadAdvance
for the implicit read of register EDX/RDX. This patch fixes the issue, and it
also introduces a new SchedWrite for the two variants of MULX. The general idea
behind this last change is to eventually decrease the number of InstRW in the
scheduling models.
This patch also adds a ReadAdvance for the implicit read of EFLAGS in ADCX/ADOX.
Differential Revision: https://reviews.llvm.org/D108372
Partially reverts 85157c0079, which had removed these builtins and intrinsics
in favor of normal codegen patterns. It turns out that it is possible for the
patterns to be split over multiple basic blocks, however, which means that DAG
ISel is not able to select them to the pmin/pmax instructions. To make sure the
SIMD intrinsics generate the correct instructions in these cases, reintroduce
the clang builtins and corresponding LLVM intrinsics, but also keep the normal
pattern matching as well.
Differential Revision: https://reviews.llvm.org/D108387
Optimize (add x, c) to (SH*ADD (c>>b), x) if c is not a simm12
while (c>>b) is a simm12 and c has b trailing zeros. For example, c = 4100
is not a simm12, but it has two trailing zeros and 4100>>2 = 1025 is, so the
constant can be materialized with a single ADDI and folded into an SH2ADD.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D108193
This patch fixes an issue where RISCV's `findCommutedOpIndices` would
incorrectly return the pseudo `CommuteAnyOperandIndex` as a commutable
operand index, rather than fixing a specific index.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D108206
Prevent SIFoldOperands from creating SALU instructions with a constant
and a frame index. Previously, only one operand was checked to be a
frame index, leading to too many constants when flat scratch is enabled
and stack offsets are large.
Differential Revision: https://reviews.llvm.org/D108368
Produce remarks when atomic instructions are expanded into hardware instructions
in SIISelLowering.cpp. Currently, these remarks are only emitted for atomic fadd
instructions.
Differential Revision: https://reviews.llvm.org/D108150
For some reductions like G_VECREDUCE_OR on AArch64, we need to scalarize
completely if the source is <= 64b. This change adds support for that in
the legalizer. If the source has a pow-2 num elements, then we can do
a tree reduction using the scalar operation in the individual elements.
Otherwise, we just create a sequential chain of operations.
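A sketch of the tree-reduction shape for a power-of-2 element count
(illustrative only; `Elts` holds the already-extracted scalar registers, and
`MIRBuilder`, `EltTy` and `ScalarOpc` are assumed to be the usual builder,
element type, and scalar form of the reduction opcode):
```
// Pairwise-combine the scalars level by level (8 -> 4 -> 2 -> 1), giving
// log2(N) dependency depth instead of a sequential chain of N-1 operations.
while (Elts.size() > 1) {
  SmallVector<Register, 8> Next;
  for (unsigned I = 0; I < Elts.size(); I += 2)
    Next.push_back(
        MIRBuilder.buildInstr(ScalarOpc, {EltTy}, {Elts[I], Elts[I + 1]})
            .getReg(0));
  Elts = std::move(Next);
}
Register Result = Elts[0];
```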
For AArch64, we only need to scalarize if the input is <64b. If it's greater
than 64b then we can first do a fewElements step down to 64b, taking advantage
of vector instructions until we reach the point of scalarization.
I also had to relax the verifier checks for reductions because the intrinsics
support <1 x EltTy> types, which we lower to scalars for GlobalISel.
Differential Revision: https://reviews.llvm.org/D108276
When support for copying vector s8 lanes was added recently, this also
had the side effect of fixing a fallback for <16 x s8> extracts since
both used the same helper. However, there was a bug in another helper
to get the regclass for a specific FPR-native type, which was assigning
FPR16 to s8 instead of FPR8.
The convert_low and promote_low instructions can widen the lower two lanes of a
four-lane vector, but we were previously scalarizing patterns that widened lanes
besides the low two lanes. The commit adds a shuffle to move the widened lanes
into the low lane positions so the convert_low and promote_low instructions can
be used instead of scalarizing.
Depends on D108266.
Differential Revision: https://reviews.llvm.org/D108341
Since the simplest DAG patterns for convert_low and promote_low instructions
involved v2i32, v2f32, v4i64, and v4f64 types, which are not legal in the
WebAssembly backend and would be eliminated by type legalization, we were
previously matching those patterns in a DAG combine before the type legalization
stage. However in cases where the vectors were wider than 128 bits, the patterns
we matched were not created until the type legalization stage when the wide
vectors were split up. Type legalization would continue to eliminate the illegal
types we were matching as well, so the code ended up scalarized.
To make the ISel for these instructions more robust, match the scalarized
patterns rather than the patterns containing illegal types. Add tests with
double-wide vectors to show that this works as intended.
Fixes PR51098.
Depends on D107502.
Differential Revision: https://reviews.llvm.org/D108266
The default legalization of unsupported vector types is to promote the integers
in each lane, which leads to extra sign or zero extending and masking when
moving data into and out of vectors. Switch our preferred type legalization from
the default to vector widening, which keeps the data in the low lanes of the
vector rather than in the low bits of each lane. The unused high lanes can be
ignored.
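The switch goes through the standard target hook; a minimal sketch of the kind
of override this implies (the exact WebAssembly condition is an assumption):
```
TargetLoweringBase::LegalizeTypeAction
WebAssemblyTargetLowering::getPreferredVectorAction(MVT VT) const {
  // Prefer widening illegal vectors into the low lanes of a wider legal
  // vector instead of promoting the integer type of every lane.
  if (VT.isVector())
    return TypeWidenVector;
  return TargetLoweringBase::getPreferredVectorAction(VT);
}
```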
Half-wide vectors are now loaded from memory into the low 64 bits of the v128
rather than spread out among the lanes. As a result, v128.load64_splat is a much
more common operation, so add new patterns to support it.
Differential Revision: https://reviews.llvm.org/D107502
Unlike accesses to distinct variables, alias analysis is unable to
disambiguate accesses to the structure fields without it. As a result we
cannot combine ds_read and ds_write operations when there is any store in
between, since such a store is always considered clobbering.
Differential Revision: https://reviews.llvm.org/D108315
This allows the instruction selector to realize that it can directly
broadcast the low byte of the memset value, rather than replicating
it to a 64-bit GPR before broadcasting.
This fixes PR50985.
Differential Revision: https://reviews.llvm.org/D108354
This changes the lowering of saddsat and ssubsat so that instead of
using:
r,o = saddo x, y
c = setcc r < 0
s = c ? INTMAX : INTMIN
ret o ? s : r
it uses asr and xor to materialize the INTMAX/INTMIN constants:
r,o = saddo x, y
s = ashr r, BW-1
x = xor s, INTMIN
ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD
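In SelectionDAG terms the new form is roughly the following (a sketch with
assumed variable names such as `Result`, `Overflow` and `BitWidth`, not the
exact TargetLowering code):
```
// s = ashr r, BW-1 ; x = xor s, INT_MIN ; ret o ? x : r
SDValue Sign = DAG.getNode(ISD::SRA, DL, VT, Result,
                           DAG.getConstant(BitWidth - 1, DL, VT));
SDValue SatVal = DAG.getNode(ISD::XOR, DL, VT, Sign,
                             DAG.getConstant(APInt::getSignedMinValue(BitWidth),
                                             DL, VT));
return DAG.getSelect(DL, VT, Overflow, SatVal, Result);
```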
This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.
Differential Revision: https://reviews.llvm.org/D105853
This encapsulates the APInt creation and worklist management into
a helper function.
To keep one common interface I've used Log2_32 in places that
previously created a mask by subtracting 1 from a power of 2.
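For example (illustrative values only), callers that used to build a
`Pow2 - 1` mask can pass the log2 instead and let the helper rebuild the mask:
```
#include "llvm/ADT/APInt.h"
#include "llvm/Support/MathExtras.h"
using namespace llvm;

uint32_t Pow2 = 16;
unsigned LogAmt = Log2_32(Pow2); // 4
// Same low-bits mask as the old "Pow2 - 1" style, derived from the log2.
APInt Mask = APInt::getLowBitsSet(/*numBits=*/32, /*loBitsSet=*/LogAmt);
```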
Differential Revision: https://reviews.llvm.org/D108324
I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:
-force-ordered-reductions=true|false
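A sketch of the new hook as described (signatures and class placement assumed;
the usual TTI forwarding boilerplate is omitted):
```
// Base TTI: ordered reductions are disabled by default.
bool TargetTransformInfo::enableOrderedReductions() const { return false; }

// AArch64 opts in and relies on the cost model to decide when it pays off.
bool AArch64TTIImpl::enableOrderedReductions() const { return true; }
```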
I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:
Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
Differential Revision: https://reviews.llvm.org/D106653
getAPFloatFromSize doesn't support s128, so we can't lower this without
asserting right now.
To fix the buildbots, don't allow any scalars other than s16, s32, and s64.
We need to ensure that these end up on FPR to allow imported patterns to
select them.
This will also ensure that we get good regbank selection when dealing with
instructions like G_PHI/G_LOAD/G_STORE which deduce their banks from their
uses/users.
Differential Revision: https://reviews.llvm.org/D108260
For subtargets with full FP16, this is legal for s16, s32, and s64. Without
full FP16, it's legal for s32 and s64.
For s128, this is a libcall.
We also support some vector types, but for now, let's just support scalars.
Differential Revision: https://reviews.llvm.org/D108259
Hooked up raw.buffer.atomic.fmin/max.f64
This instruction should be available on GFX6, GFX7, and GFX10.
It was implemented for GFX90a with a different name.
Added intrinsic def for image_atomic_fmin/fmax; the instruction
defs were already there.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D108208
Change-Id: I473f98d28b2afbeeb2c27822d9686b5e86634e2f
Let the sext_inreg be selected to sext.w. Remove unneeded sext.w
during PostProcessISelDAG.
This gives opportunities for some other isel patterns to match
like the ADDIPair or matching mul with immediate to shXadd.
This becomes possible after D107658 started selecting W instructions
based on users. The sext.w will be considered a W user so isel
will often select a W instruction for the sext.w input and we can
just remove the sext.w. Otherwise we can combine the sext.w with
an ADD/SUB/MUL/SLLI to create a new W instruction in parallel
to the original instruction.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107708
We already do this for a non-constant RHS. This just removes the
special case. I believe the special case may have been needed
because the ANY_EXTEND of a constant used to create zero extended
constants, but we recently changed that to produce sign extended
constants.
D107658 is needed to prevent some regressions.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107697
DAGCombiner::visitStore can clear the upper bits of constants
used by stores. This prevents them from being recognized as
sign-extended negative values, making them more expensive to
materialize (e.g. on RV64 a truncated 0xffffffff takes multiple
instructions to materialize, while -1 is a single ADDI).
This patch uses the hasAllNBitUsers method from D107658 to make
a negative constant if none of the users care about the upper bits.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D108052
We normally select these when the root node is a sext_inreg, but
SimplifyDemandedBits can sometimes bypass the sext_inreg for some
users. This can create situation where sext_inreg+add/sub/mul/shl
is selected to a W instruction, and then the add/sub/mul/shl is
separately selected to a non-W instruction with the same inputs.
This patch tries to detect when it would still be ok to use a W
instruction without the sext_inreg by checking the direct users.
This can allow the W instruction to CSE with one created for a
sext_inreg+add/sub/mul/shl. To minimize complexity and cost of
checking, we make no attempt to determine if the CSE will happen
and just always use a W instruction when we can.
Differential Revision: https://reviews.llvm.org/D107658
If we have these instructions, we don't need to hoist the immediate
for an AND that would match them.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107783
For tight loops like this:
float r = 0;
for (int i = 0; i < n; i++) {
  r += a[i];
}
it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the resulting number of instructions in the
generated code ends up being comparable to not vectorising at all, there
may be additional costs on some CPUs, for example perhaps the scheduling
is worse. It makes sense to deter vectorisation in tight loops.
Differential Revision: https://reviews.llvm.org/D108292
If Orig produces more than one value (rare) with different types (rarer), then
we need to make sure we check against the one that Orig actually represents,
not just the first type.
Unfortunately, because of the combination of things that need to happen, I
wasn't able to produce a test.