The runtime library has two family library implementation for ppc_fp128 and fp128.
For IBM Long double(ppc_fp128), it is suffixed with 'l', i.e(sqrtl). For
IEEE Long double(fp128), it is suffixed with "ieee128" or "f128".
We miss to map several libcall for IEEE Long double.
Reviewed By: qiucf
Differential Revision: https://reviews.llvm.org/D91675
add a new goal MustReduceRegisterPressure for machine combiner pass.
PowerPC will use this new goal to do some register pressure related optimization.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D92068
If a faux shuffle uses smaller shuffle inputs, try to recursively combine with those inputs directly instead of widening them immediately. Then widen all smaller inputs at the bottom of the recursion.
This will still mean we're generating nodes on the fly (PR45974) even if we don't combine to a new shuffle but it does help AVX2+ targets combine across xmm/ymm/zmm types, mainly as variable shuffles.
This recommits a87fccb3ff with a fix to mark the destination operand
of the marker instruction as def, to fix a machine verifier failure.
This reverts the revert commit c0f2cea7c0.
Default expansion leads to repeated extensions/truncations to/from vXi16 which shuffle combining and demanded elts can't completely unravel.
Better just to promote (any_extend) the input and perform a vXi16 reduction.
We'll be able to remove a lot of this if we ever get decent legalization support for reduction intrinsics in SelectionDAG.
Some of the pattern matching in PPCInstrVSX.td and node lowering involving vectors assumes 64bit mode. This patch disables some of the unsafe pattern matching and lowering of BUILD_VECTOR in 32bit mode.
Reviewed By: Xiangling_L
Differential Revision: https://reviews.llvm.org/D92789
The getPayload/getMask/getPassThrough functions should return values
that could be composed into a masked load/store without any additional
type casts. The previous fix violated that.
Instead, convert scalar mask to a vector right before rescaling.
AlignVectors treats all loaded/stored values as vectors of bytes,
and masks as corresponding vectors of booleans, so make getMask
produce a 1-element vector for scalars from the start.
The ABI demands a data16 prefix for lea in 64-bit LP64 mode, but not in
64-bit ILP32 mode. In both modes this prefix would ordinarily be
ignored, but the instructions may be changed by the linker to
instructions that are affected by the prefix.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D93157
This adds some basic MVE masked load/store costs, notably changing the
cost of legal loads/stores to the MVECostFactor and the cost of
scalarized instructions to 8*NumElts.
Differential Revision: https://reviews.llvm.org/D86538
The performance improvement on LBM previously achieved with improved software
prefetching (36d4421) have gone lost recently with e00f189. There now is one
memory access in the loop that LoopDataPrefetch cannot handle (while before
there was none) which the heuristic rejects.
This patch adds a small margin by allowing 1 non-prefetched memory access for
every 32 prefetched ones, so that the heuristic doesn't bail in this type of
case.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D92985
Summary:
"Speculative fix for link failure on bots" with a mention of "the clang-ppc64le-rhel bot fails on link: http://lab.llvm.org:8011/#/builders/57/builds/2307/steps/6/logs/stdio".
PPCAsmPrinter.cpp:(.text._ZN12_GLOBAL__N_116PPCAIXAsmPrinter19emitFunctionBodyEndEv+0x2f8): undefined reference to `llvm::XCOFF::getNameForTracebackTableLanguageId(llvm::XCOFF::TracebackTable::LanguageID)'
PPCAsmPrinter.cpp:(.text._ZN12_GLOBAL__N_116PPCAIXAsmPrinter19emitFunctionBodyEndEv+0x2170): undefined reference to `llvm::XCOFF::parseParmsType(unsigned int, unsigned int)'
SUMMARY:
1. added a new option -xcoff-traceback-table to control whether generate traceback table for function.
2. implement the functionality of emit traceback table of a function.
Reviewers: hubert.reinterpretcast, Jason Liu
Differential Revision: https://reviews.llvm.org/D92398
This transform was added at:
c63799fc52
From what I see, it's the first demanded elements transform that adds
a new instruction using the IRBuilder. There are similar folds in
the generic demanded bits chunk of instcombine that also use the
InsertPointGuard code pattern.
The tests here would assert/crash because the new instruction was
being added at the start of the demanded elements analysis rather
than at the instruction that is being replaced.
This patch adds support for lowering function calls with the
rv_marker attribute. The goal is to expand such calls to the
following sequence of instructions:
BL @fn
mov x29, x29
This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the BLR_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above. The sequence is then marked
as instruction bundle, to avoid anything being moved in between.
@ahatanak is working on using this attribute in the front- & middle-end.
Together with the front- & middle-end changes, this should address
PR31925 for AArch64.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D92569
Add simple pass for removing redundant vsetvli instructions within a basic block. This handles the case where the AVL register and VTYPE immediate are the same and no other instructions that change VTYPE or VL are between them.
There are going to be more opportunities for improvement in this space as we development more complex tests.
Differential Revision: https://reviews.llvm.org/D92679
Although this was something that I was hoping we would not have to do,
this patch makes t2DoLoopStartTP a terminator in order to keep it at the
end of it's block, so not allowing extra MVE instruction between it and
the end. With t2DoLoopStartTP's also starting tail predication regions,
it also marks them as having side effects. The t2DoLoopStart is still
not a terminator, giving it the extra scheduling freedom that can be
helpful, but now that we have a TP version they can be treated
differently.
Differential Revision: https://reviews.llvm.org/D91887
The compiler is making no effort to preserve upper elements. To do so would require another source operand tied with the destination and a different intrinsic interface to give control of this source to the programmer.
This patch changes the tail policy to agnostic so that the CPU doesn't need to make an effort to preserve them.
This is consistent with the RVV intrinsic spec here https://github.com/riscv/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#configuration-setting
Differential Revision: https://reviews.llvm.org/D93080
This CL changes the asm syntax for section flags, making them more like ELF
(previously "passive" was the only option). Now we also allow "G" to designate
COMDAT group sections. In these sections we set the appropriate comdat flag on
function symbols, and also avoid auto-creating a new section for them.
This also adds asm-based tests for the changes D92691 to go along with
the direct-to-object tests.
Differential Revision: https://reviews.llvm.org/D92952
This is a reland of rG4564553b8d8a with a fix to the lit pipeline in
llvm/test/MC/WebAssembly/comdat.ll
This CL changes the asm syntax for section flags, making them more like ELF
(previously "passive" was the only option). Now we also allow "G" to designate
COMDAT group sections. In these sections we set the appropriate comdat flag on
function symbols, and also avoid auto-creating a new section for them.
This also adds asm-based tests for the changes D92691 to go along with
the direct-to-object tests.
Differential Revision: https://reviews.llvm.org/D92952
Use RegisterClass::contains instead of going through getMinimalPhysRegClass
and hasSuperClassEq.
Remove the special case for NoRegister. It's identical to the
handling for any other regsiter that isn't VRM2/M4/M8.
The loop-based probing done for stack clash protection altered R1D which
corrupted the backchain value to be stored after the probing was done.
By using R0D instead for the loop exit value, R1D is not modified.
Review: Ulrich Weigand.
Differential Revision: https://reviews.llvm.org/D92803
Inline asm can contain constructs like .bytes which may have arbitrary size.
In some cases, this causes us to miscalculate the size of blocks and therefore
offsets, causing us to incorrectly compress a JT.
To be safe, just bail out of the whole thing if we find any inline asm.
Fixes PR48255
Differential Revision: https://reviews.llvm.org/D92865
There is an in-progress proposal for the following pseudo-instructions
in the assembler, to complement the existing `sext.w` rv64i instruction:
- sext.b
- sext.h
- zext.b
- zext.h
- zext.w
The `.b` and `.h` variants are available with rv32i and rv64i, and `zext.w` is
only available with `rv64i`.
These are implemented primarily as pseudo-instructions, as these instructions
expand to multiple real instructions. In the case of `zext.b`, this expands to a
single rv32/64i instruction, so it is implemented with an InstAlias (like
`sext.w` is on rv64i).
The proposal is available here: https://github.com/riscv/riscv-asm-manual/pull/61
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D92793
If SETUNE isn't legal, UO can use the NOT of the SETO expansion.
Removes some complex isel patterns. Most of the test changes are
from using XORI instead of SEQZ.
Differential Revision: https://reviews.llvm.org/D92008
This patch changes performMSCATTERCombine to also promote the indices of
masked gathers where the element type is i8 or i16, and adds various tests
for gathers with illegal types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D91433
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.
Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.
There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.
This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.
Differential Revision: https://reviews.llvm.org/D91358
Both ds_read_b128 and ds_read2_b64 are valid for 128bit 16-byte aligned
loads but the one that will be selected is determined either by the order in
tablegen or by the AddedComplexity attribute. Currently ds_read_b128 has
priority.
While ds_read2_b64 has lower alignment requirements, we cannot always
restrict ds_read_b128 to 16-byte alignment because of unaligned-access-mode
option. This was causing ds_read_b128 to be selected for 8-byte aligned
loads regardles of chosen access mode.
To resolve this we use two patterns for selecting ds_read_b128. One
requires alignment of 16-byte and the other requires
unaligned-access-mode option.
Same goes for ds_write2_b64 and ds_write_b128.
Differential Revision: https://reviews.llvm.org/D92767
The phi created in a low overhead loop gets created with a default
register class it seems. There are then copied inserted between the low
overhead loop pseudo instructions (which produce/consume GPRlr
instructions) and the phi holding the induction. This patch removes
those as a step towards attempting to make t2LoopDec and t2LoopEnd a
single instruction, and appears useful in it's own right as shown in the
tests.
Differential Revision: https://reviews.llvm.org/D91267
This patch implements amx programming model that discussed in llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
Thank Hal for the good suggestion in the RA. The fast RA is not in the patch yet.
This patch implemeted 7 components.
1. The c interface to end user.
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
type into two <128 x i32>.
4. The Lowering from AMX intrinsics to AMX pseudo instruction.
5. Insert psuedo ldtilecfg and build the def-use between ldtilecfg to amx
intruction.
6. The register allocation for tile register.
7. Morph AMX pseudo instruction to AMX real instruction.
Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0
Differential Revision: https://reviews.llvm.org/D87981