The model was committed in 4b8ade837e
but not yet enabled to allow for a few fix ups. This adds a few
of these fixes, and also an LLVM MCA test to check most instructions.
While I do have plans to look into some more tuning, it's time to
enable this as it is better than using the A53 schedule.
Differential Revision: https://reviews.llvm.org/D88017
For LP64 mode, this has no effect as pointers are already 64 bits.
For ILP32 mode (x32), this extension is specified by the ABI.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D91338
Pass through the demanded elts mask to the source operands.
The next step will be to add support for folding to add/sub if we only demand odd/even elements.
Continue the work started at D50989.
The code has long been dead since the triple was removed (D75494).
Reviewed By: nickdesaulniers, void
Differential Revision: https://reviews.llvm.org/D91836
Optimize prologue/epilogue instructions by eliminating FP if a given function
uses the GOT but does not call other functions. Previously, we had incorrect
implementations taken from other architectures. Also update regression tests.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D92313
Previously, these check routines accepted non-generatable instructions.
This patch cleans them up and adds asserts for those non-generatable
instructions.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D92254
This enables bswap/bitreverse to combine with other GREVI patterns or each other without needing to add more special cases to the DAG combine or new DAG combines.
I've also enabled the existing GREVI combine for GREVIW so that it can pick up the i32 bswap/bitreverse on RV64 after they've been type legalized to GREVIW.
Differential Revision: https://reviews.llvm.org/D92253
clang may produce `movl x@GOTPCREL+4(%rip), %eax` when loading the high
32 bits of the address of a global variable in -fpic/-fpie mode.
If assembled by GNU as, the fixup emits R_X86_64_GOTPCRELX with an addend != -4.
The instruction loads from the GOT entry with an offset and thus it is incorrect
to relax the instruction.
This patch does not emit a relaxable relocation for a GOT load with an offset
because R_X86_64_[REX_]GOTPCRELX do not make sense for instructions which cannot
be relaxed. The result is good enough for LLD to work. GNU ld relaxes
mov+GOTPCREL as well, but it suppresses the relaxation if addend != -4.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D92114
GORCI performs an OR between each stage. So we need to ensure only
one stage is active before doing this combine.
Initial attempts at finding a test case for this failed due to the order in
which things get combined. It's most likely that we'll form one stage of
GREVI, then combine to GORCI, before the two stages of GREVI can be formed
and combined with each other into a multi-stage GREVI.
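As a standalone illustration (based on the Bitmanip reference pseudocode, not
code from this patch), a GREV stage permutes bits while a GORC stage ORs the
permuted bits back in, which is why (or (grevi X, C), X) only matches gorci
when C has a single active stage:

  #include <cstdint>
  static uint32_t stage(uint32_t x, int sh, uint32_t mask) {
    return ((x & mask) << sh) | ((x >> sh) & mask);
  }
  uint32_t grev32(uint32_t x, unsigned c) {    // generalized reverse
    if (c & 1)  x = stage(x, 1,  0x55555555);
    if (c & 2)  x = stage(x, 2,  0x33333333);
    if (c & 4)  x = stage(x, 4,  0x0F0F0F0F);
    if (c & 8)  x = stage(x, 8,  0x00FF00FF);
    if (c & 16) x = stage(x, 16, 0x0000FFFF);
    return x;
  }
  uint32_t gorc32(uint32_t x, unsigned c) {    // generalized OR-combine
    if (c & 1)  x |= stage(x, 1,  0x55555555);
    if (c & 2)  x |= stage(x, 2,  0x33333333);
    if (c & 4)  x |= stage(x, 4,  0x0F0F0F0F);
    if (c & 8)  x |= stage(x, 8,  0x00FF00FF);
    if (c & 16) x |= stage(x, 16, 0x0000FFFF);
    return x;
  }

For c = 3 (two active stages) and x = 0x1, (grev32(x, 3) | x) is 0x9 but
gorc32(x, 3) is 0xF, so the fold is only valid for a single stage.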
Differential Revision: https://reviews.llvm.org/D92289
Optimize the FP elimination mechanism. This time, optimize functions which have
no calls but do have fixed stack objects; LLVM now eliminates FP on such
functions. Also, optimize the GOT/PLT register save/restore instructions if a
given function doesn't use them. In addition, remove the generation of `.cfi`
instructions, since those were taken from other architectures and have not been
inspected yet. Also update regression tests.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D92251
Change the way i64 is truncated to i32 in I64 registers. VE previously assumed
sign-extended (sext) values. Change it to zero-extended (zext) values to match
LLVM's behaviour.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D92226
This was modeled to have a cost of 1, but since we do not have a MUL.2d this is
scalarized into vector inserts/extracts and scalar muls.
Motivating precommitted test is test/Transforms/SLPVectorizer/AArch64/mul.ll,
which we don't want to SLP vectorize.
Test Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
unfortunately needed changing, but the reason is documented in
LoopVectorize.cpp:6855:
// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.
which I will address next as a follow-up to this.
Differential Revision: https://reviews.llvm.org/D92208
The build bots caught two additional pre-existing problems exposed by the test change part of my change https://reviews.llvm.org/D91339, when expensive checks are enabled. https://reviews.llvm.org/D91924 fixes one of them, this fixes the other.
FixupSetCC will change code in the form of
%setcc = SETCCr ...
%ext1 = MOVZX32rr8 %setcc
to
%zero = MOV32r0
%setcc = SETCCr ...
%ext2 = INSERT_SUBREG %zero, %setcc, %subreg.sub_8bit
and replace uses of %ext1 with %ext2.
The register class for %ext2 did not take into account any constraints on %ext1, which may have been required by its uses. This change ensures that the original constraints are honoured: instead of creating a new %ext2 register, it reuses %ext1 and further constrains it as needed. This requires a slight reorganisation to account for the fact that the constraining can fail, in which case no changes should be made.
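A rough sketch of the reuse-and-constrain idea (illustrative only; the function
and variable names here are hypothetical, not the actual patch):

  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetRegisterInfo.h"
  using namespace llvm;
  // Try to keep using ZExtReg (%ext1 above) as the INSERT_SUBREG result instead
  // of creating a new register, preserving constraints imposed by its uses.
  static bool reuseAndConstrain(MachineRegisterInfo &MRI, Register ZExtReg,
                                const TargetRegisterClass *RequiredRC) {
    // constrainRegClass returns null when no common subclass exists; in that
    // case bail out before making any changes.
    if (!MRI.constrainRegClass(ZExtReg, RequiredRC))
      return false;
    // ...only now rewrite the MOVZX32rr8 into MOV32r0 + INSERT_SUBREG...
    return true;
  }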
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D91933
The build bots caught two additional pre-existing problems exposed by the test change part of my change https://reviews.llvm.org/D91339, when expensive checks are enabled. This fixes one of them.
X86 has CALL64r and CALL32r opcodes, where CALL64r takes a 64-bit register and CALL32r takes a 32-bit register. CALL64r can only be used in 64-bit mode; CALL32r can only be used in 32-bit mode. LLVM would assume that after picking the appropriate CALLr opcode, a pointer-sized register would be a valid operand, but in x32 mode, which is a 64-bit mode, pointers are 32 bits. In this mode, it is invalid to pass a pointer directly to CALL64r; it needs to be extended to 64 bits first.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D91924
Not sure why bswap was treated specially. This also applies to bitreverse
or generic grevi. We can improve this in future patches.
For now, I just wanted to get consistency and test coverage in place,
as I plan to make some other changes around bswap.
Optimize the emitSPAdjustment function to generate the smallest possible
instruction sequence to adjust SP.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D92174
We had a zexti32 after a sign_extend_inreg. The AND X, 0xffffffff
part of the zexti32 should never occur since SimplifyDemandedBits
from the sign_extend_inreg would have removed it.
We also had sexti32 as the root node of a pattern, but SelectionDAGISel
matches assertsext early before the tablegen based patterns are
evaluated.
Followup to D92112 now that I've learnt about HVX type splitting.
This is some necessary cleanup work for min/max ops to eventually help us move the add/sub sat patterns into DAGCombine - D91876.
Differential Revision: https://reviews.llvm.org/D92169
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.
Allows us to move the x86-specific lowering to the generic expansion code.
Differential Revision: https://reviews.llvm.org/D92183
These patterns are using zexti32 which matches either assertzexti32
or (and X, 0xffffffff). But if we match (and X, 0xffffffff) it will
remove the AND and the inputs may no longer have the zero bits
needed to guarantee the result has enough zeros.
This commit changes the patterns to only match assertzexti32.
I'm not sure how to test the broken case since the DIVUW/REMUW nodes
are created during type legalization, but type legalization won't
create an (and X, 0xffffffff) directly on the inputs.
I've also changed the zexti32 on the root of the pattern to just
checking for AND. We were previously also matching assertzexti32,
but I doubt that pattern would ever occur.
For now, we will hardcode the result as 0.0 if the input is denormal or 0. That will
impact the precision. As the added fsqrt belongs to the cold path of the
cmp+branch, it won't impact the performance for normal inputs on PowerPC, but it
improves the precision if the input is denormal.
Reviewed By: Spatel
Differential Revision: https://reviews.llvm.org/D80974
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).
This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.
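A minimal sketch of the distinction at a use site (illustrative, not code from
the patch):

  #include "llvm/Analysis/MemoryLocation.h"
  using namespace llvm;
  static void buildLocations(const Value *Ptr) {
    // Access starts at Ptr and extends an unknown number of bytes forward.
    MemoryLocation AfterOnly(Ptr, LocationSize::afterPointer());
    // Access may touch bytes both before and after Ptr, anywhere within the
    // underlying object.
    MemoryLocation Anywhere(Ptr, LocationSize::beforeOrAfterPointer());
    (void)AfterOnly;
    (void)Anywhere;
  }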
Differential Revision: https://reviews.llvm.org/D91649
This patch enables passing non-variadic vector-type parameters, as well as vector returns, on the caller and callee side on AIX; these are passed in vector registers only.
So far, support is enabled only for the AIX extended Altivec ABI calling convention.
Reviewed By: sfertile, DiggerLin
Differential Revision: https://reviews.llvm.org/D86476
This strips out a lot of the code that should no longer be needed from
the MVETailPredictionPass, leaving the important part: finding active lane
mask instructions and converting them to VCTP operations.
Differential Revision: https://reviews.llvm.org/D91866
It's more future-proof to use isGFX10Plus from the start, on the
assumption that future architectures will be based on current
architectures.
Also make use of the existing isGFX9Plus in a few places.
Differential Revision: https://reviews.llvm.org/D92092
Start with the assumption that FMA is faster than Fmul+FAdd. If that's not true
on some particular implementation, we can add a tuning parameter in the future.
I've updated the fmuladd test cases and added new test cases for fast-math-flag
based contraction.
Differential Revision: https://reviews.llvm.org/D91987
This is the logically correct thing to do. But it generates worse
code for i32 umin/umax on RV64 due to the type legalizer requesting
zext even though the arguments are sext. Maybe we can teach the type
legalizer to use sext for umin/umax on RISCV.
It's also producing possibly worse code on i64 on RV32 since we
still end up with selects that become branches. But this seems
like something we could improve in type legalization or DAG combine.
Hopefully this makes D92095 work for RISCV with Zbb.
This should cover the basic integer min/max handling; the HVX ops are still TODO.
This is some necessary cleanup work for min/max ops to eventually help us move the add/sub sat patterns into DAGCombine - D91876.
Differential Revision: https://reviews.llvm.org/D92112
Update costs now that D92095 and D92102 have tweaked the SSE2 implementation
The SSE42 BLENDVPD cost can actually be used on SSE41 as we don't attempt to generate PCMPGT anymore
Add scalar i16/i32/i64 costs as we can do this cheaply with CMOV
This adds custom opcodes for FSLW/FSRW so we can type legalize
fshl/fshr without needing to match a sign_extend_inreg.
I've used the operand order from fshl/fshr to make the isel
pattern similar to the non-W form. It was also hard to decide
another order since the register instruction has the shift amount
as the second operand, but the immediate instruction has it as
the third operand.
Differential Revision: https://reviews.llvm.org/D91479
Add .shader_functions to pal metadata, which contains the stack frame
size for all non-entry-point functions.
Differential Revision: https://reviews.llvm.org/D90036
If smax() is legal, this is likely to result in smaller codegen expansion for abs(x) than the xor(add,ashr) method.
This is also what PowerPC has been doing for its abs implementation, so it lets us get rid of a load of custom lowering code there (which was never updated when they added smax lowering).
Alive2: https://alive2.llvm.org/ce/z/xRk3cD
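As a standalone sketch of the two scalar expansions for a 32-bit abs(x)
(illustrative, not code from the patch):

  #include <algorithm>
  #include <cstdint>
  int32_t abs_via_smax(int32_t x) {
    // smax(x, sub(0, x)); negate via unsigned math to avoid signed overflow on INT32_MIN
    int32_t neg = (int32_t)(0u - (uint32_t)x);
    return std::max(x, neg);
  }
  int32_t abs_via_xor_add_ashr(int32_t x) {
    // xor(add(x, ashr(x, 31)), ashr(x, 31))
    int32_t sign = x >> 31;                                 // 0 or -1
    return (int32_t)((uint32_t)x + (uint32_t)sign) ^ sign;
  }

Both return the same (wrapping) result for every input, including INT32_MIN.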
Differential Revision: https://reviews.llvm.org/D92095
This patch adds a target-specific DAG combine for mscatter to promote indices
with element types i8 or i16 before legalisation, plus various tests with illegal types.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90945