We are using countPopulation on a LaneBitmask to determine
the number of registers it covers. This is an assumption that
does not necessarily hold. The assumption is not changed here, but it is
factored into a single call, SIRegisterInfo::getNumCoveredRegs().
Some other places are cleaned up with respect to assumptions
about subreg index values and tablegen behavior.
Differential Revision: https://reviews.llvm.org/D74177
Summary:
The current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find a possible pattern from which to generate a v_swap instruction.
This approach leads to O(N^3) compile time in SIShrinkInstructions.
In reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, giving
linear compile time.
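Roughly, the kind of sequence matchSwap turns into a v_swap (register numbers are illustrative):
```
v_mov_b32 v2, v0
v_mov_b32 v0, v1    ; these three moves become: v_swap_b32 v0, v1
v_mov_b32 v1, v2    ; (the copy into v2 is only needed if v2 has other uses)
```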
Reviewers:
rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D74180
This reverts commit a373841407.
It looks like this broke set_shadow_test.c, so I'm reverting until I can fix it.
I also reverted the SGT change because it's probably also broken.
Update the lambda function argument "[this](const auto &TRI)" to
[this](const TargetRegisterInfo &TRI).
This looks like a bug in g++-6; there is no issue compiling with g++-9.
X86 uses i8 for shift amounts. This code can fail on a 32-bit target
if it runs after type legalization.
This code was copied from AArch64 and modified for X86, but the
shift amount wasn't changed to the correct type for X86.
Fixes PR44812
When we have a G_BRCOND fed by an sgt compare against -1, we can just emit a TBZ.
This is similar to the code in `AArch64TargetLowering::LowerBR_CC`.
Also while we're here, properly scope the commutative constant check in
`selectCompareBranch`, since it sometimes would call
`getConstantVRegValWithLookThrough` twice.
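A sketch of the fold, in the same style as the other TB(N)Z patterns (msb denotes the sign bit of x):
```
(brcond (icmp sgt x, -1)) -> (tbz x, msb)
```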
Differential Revision: https://reviews.llvm.org/D74149
If we don't have cmov, X87 compares write to FPSW and we need to
move the bits to EFLAGS to use as JCC/SETCC/CMOV conditions.
Previously this was done by calling ConvertCmpIfNecessary in
multiple places which would emit the extra code for the FNSTSW,
shift, truncate, and SAHF instructions. Isel would then
select trunc+X86ISD::CMP to a FUCOM instruction that produces FPSW.
This patch centralizes all of the handling into a single custom
isel handler. This allows us to remove ConvertCmpIfNecessary and
a couple target specific ISD opcodes.
Differential Revision: https://reviews.llvm.org/D73863
Only the 32 and 64 bit forms of SBB are dependency-breaking instructions on some
CPUs. The 8 and 16 bit forms have to preserve the upper bits of the GPR.
This patch removes the smaller forms and selects the wider form
instead. I had to do this with custom code as the tblgen generated
code glued the eflags copytoreg to the extract_subreg instead of
to the SETB pseudo.
Longer term I think we can remove X86ISD::SETCC_CARRY and use
(X86ISD::SBB zero, zero). We'll want to keep the pseudo and select
(X86ISD::SBB zero, zero) to either a MOV32r0+SBB for targets where
there is no dependency break and SETB_C32/SETB_C64 for targets
that have a dependency break. May want some way to avoid the MOV32r0
if the instruction that produced the carry flag happened to def a
register that we can use for the dependency.
I think the flag copy lowering should be using NEG instead of SUB to
handle SETB. That would avoid the MOV32r0 there. Or maybe it should
use an ADC with -1 to recreate the carry flag and keep the SETB?
That would avoid a MOVZX on the input of the SUB.
Differential Revision: https://reviews.llvm.org/D74024
When multiple instructions are moved into a waterfall loop, it's
possible some of them re-use the same operands. Avoid creating
multiple sequences of readfirstlanes for them. None of the current
uses will hit this, but it will be used in a future patch.
This patch implements the caller side of placing function call arguments
in stack memory. This removes the current limitation where LLVM on AIX
will report fatal error when arguments can't be contained in registers.
There is a particular oddity that a float argument that passes in a
register and also in stack memory requires that the caller initialize
both. From what AIX "ABI" documentation I have, it's not clear that this
needs to be done; however, it is necessary for compatibility with the
AIX XL compiler, so I think it's best to implement it the same way.
Note a later patch will follow to address the callee side.
Differential Revision: https://reviews.llvm.org/D73209
When we have a G_ICMP which checks SLT, and the comparison is against 0, we
can emit a TBNZ instead of a CBZ.
This lets us fold in things into the branch, which can provide some code size
savings.
This is similar to the case in `AArch64TargetLowering::LowerBR_CC`.
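Sketched in the same style as the other TB(N)Z folds (msb is the sign bit of x):
```
(brcond (icmp slt x, 0)) -> (tbnz x, msb)
```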
https://reviews.llvm.org/D74090
Factor it out into `emitTestBit` and add some asserts to the new function.
This will be useful for implementing TB(N)Z emission for SLT/SGT compares.
Differential Revision: https://reviews.llvm.org/D74080
Add support for walking through G_LSHR in `getTestBitReg`. Equivalent to the
code in `getTestBitOperand` in AArch64ISelLowering.
```
(tbz (lshr x, c), b) -> (tbz x, b+c) when b + c is < # bits in x
```
Differential Revision: https://reviews.llvm.org/D74077
The "{=v0}" constraint did not result in the expected error message in the
absence of the vector facility, because 'v0' matches as a string into the
AnyRegBitRegClass in common code.
This patch adds checks for vector support in case of "{v" and soft-float in
case of "{f" to remedy this.
Review: Ulrich Weigand.
The load ports need a cycle for each potentially loaded element just like Haswell and Skylake. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead the load uops run for multiple cycles.
I've taken the latency number from uops.info. The port binding for the non-load uops is taken from the original IACA data I have.
Differential Revision: https://reviews.llvm.org/D74000
Really the intrinsic definition is wrong, but work around this
here. The DAG lowering introduces an MMO. We have to introduce a new
operation to avoid the verifier complaining about the missing mayLoad.
Use cmp ord instead of cmp_class compared to the DAG version for the
nan check, but mostly try to match the existing pattern.
I think the sign doesn't matter for fract, so we could do a little
better with the source modifier matching.
I think this is also still broken as in D22898, but I'm leaving it
as-is for now while I don't have an SI system to test on.
contractCrossBankCopyIntoStore() finds the instruction that defines the
source register and uses its output to replace the register. There are,
however, instructions that have multiple outputs, e.g. G_UNMERGE_VALUES.
The current implementation hardcodes operand 0 and has no way of knowing
which output should be used.
This change adds another function that directly returns the register that
is the source of the copy, and uses that for folding.
This fixes https://bugs.llvm.org/show_bug.cgi?id=44783
Differential Revision: https://reviews.llvm.org/D74005
This implements walking over G_ASHR in the same way as `getTestBitOperand` in
AArch64ISelLowering.
```
(tbz (ashr x, c), b) -> (tbz x, b+c) or (tbz x, msb) if b+c is > # bits in x
```
Differential Revision: https://reviews.llvm.org/D73933
(1) The check needs to be on the 0th operand of whatever we're folding
(2) Checks for validity should happen before we change the bit
Fixes a bug which caused MultiSource/Applications/JM/lencod to fail at -O3.
Differential Revision: https://reviews.llvm.org/D74002
Rewrite the result register pair into the expected single register
format in the legalizer.
I'm also operating under the assumption that TFE doesn't apply to
stores or atomics, but don't know if this is true or not.
The mask results of these should be uniform. The trickier part is that the
dummy booleans used as IR glue need to be treated as divergent. This
should make the divergence analysis results correct for the IR the DAG
is constructed from.
This should allow us to eliminate requiresUniformRegister, which has
an expensive, recursive scan over all users looking for control flow
intrinsics. This should avoid recent compile time regressions.
The 96-bit results need to be widened.
I find the interaction between LegalizerHelper and MIRBuilder somewhat
awkward. The custom legalization is called by the LegalizerHelper, but
then does not have access to the helper. You have to construct a new
helper, which then does not own the MachineIRBuilder, but does modify
it. Maybe custom legalization should be passed the helper?
The adjusted iterator range included the last instruction we just inserted, which we
don't want to process. Figure out the new iterator range before
inserting phis. This was a harmless problem, but it added an unnecessary
complication for a future patch.
If we have s_pack_* instructions, legalize this to
G_BUILD_VECTOR_TRUNC from s32 elements. This is closer to how the
s_pack_* instructions really behave.
If we don't have s_pack_* instructions, expand this by creating a merge
to s32 and bitcasting. This expands to the expected bit operations. I
think this eventually should go in a new bitcast legalize action type
in LegalizerHelper.
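A rough sketch of the two legalization strategies (GISel notation, details elided):
```
<2 x s16> = G_BUILD_VECTOR s16:x, s16:y
  -> G_BUILD_VECTOR_TRUNC s32:x', s32:y'      ; targets with s_pack_*
  -> G_BITCAST (s32 G_MERGE_VALUES x, y)      ; otherwise, expands to the bit operations
```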
We already directly emit the shift operations in RegBankSelect for the
vector case. This could possibly be cleaned up, but I also may want to
defer doing this expansion to selection anyway. I'll see about that
when I try to actually match VOP3P instructions.
This breaks the selection of the build_vector since tablegen doesn't
know how to match G_BUILD_VECTOR_TRUNC yet, so just xfail it for now.
Once we have created a tail-predicated hardware-loop, and thus know the number
of elements that are processed, we want to clean-up the iteration count
expression of that loop. In D73682, we bailed the analysis on conditionally
executed instructions. This adds support for IT-blocks, so that we can handle
these cases again. The restriction is that we only support IT blocks containing
1 statement, but that seems to cover most cases and forms of the iteration
count expression.
Differential Revision: https://reviews.llvm.org/D73947
Checking that the use-def chain that performs the loop count
isSafeToRemove is not sufficient because it means that we can
remove register copies that we need to restore lr to its correct
value. This change now prevents the transform from kicking in for the
'remove-elem-moves' test, which needs to be addressed later on.
Differential Revision: https://reviews.llvm.org/D74037
While validating each MVE instruction, check that all instructions
that touch memory are somehow predicated upon the VCTP.
Differential Revision: https://reviews.llvm.org/D73616
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers
Differential Revision: https://reviews.llvm.org/D73482
This should lower the amount of used registers for gfx9.
I updated some of the changed tests with the update script because
changing them by hand is tedious.
Differential Revision: https://reviews.llvm.org/D73884
Same for any_extend though we don't have coverage for that.
The test changes are because isel didn't check one use of the
setcc_carry. So in isel we would end up with two different
sized setcc_carry instructions. And since it clobbers
the flags we would need to recreate the flags for the second
instruction.
This code handles additional uses by truncating the new wide
setcc_carry back to the original size for those uses.
The old version might be faster on EG (RECIP_IEEE is Trans only),
but it'd need extra corner case checks.
This gives correct corner case behaviour and saves a register.
Fixes OCL CTS sqrt test (1-thread, scalar) on Turks.
Reviewer: arsenm
Differential Revision: https://reviews.llvm.org/D74017
Summary:
This reverts commit 3ef169e586. The
purpose of this commit was to allow stack machines to perform
instruction selection for instructions with variadic defs. However,
MachineInstrs fundamentally cannot support variadic defs right now, so
this change does not turn out to be useful.
Depends on D73927.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73928
This was incorrectly rounding up to the next power of 2. v4f32 was
rounding up to v8f32, which was just wrong. There are also v3i16/v3f16
available in MVT, so we don't even need to round the f16 cases
anymore. Additionally, this field is really an EVT so we don't even
need to consider this.
Also switch some asserts to return invalid. We should have an IR
verifier for these intrinsic return types, but for now it's better to
not assert on IR that passes the verifier.
This should also probably be fixed to consider that dmask is really
eliminating some of the loaded components.
Summary:
This reverts commit 28857d14a8. This
commit worked toward a solution that did not turn out to be feasible
because MachineInstrs cannot contain an arbitrary number of defs.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73927
The compiler may transform the following code
ctx = ctx + reloc_offset
... (*(u32 *)ctx) & 0x8000 ...
to
ctx = ctx + reloc_offset
... (*(u8 *)(ctx + 1)) & 0x80 ...
where reloc_offset will be replaced with a constant during
AsmPrinter phase.
The above transformed code will be rejected by the kernel verifier
as it does not allow
*(type *)((ctx + non_zero_offset1) + non_zero_offset2)
style access pattern.
It is hard at the SelectionDAG phase to identify whether a load
is related to the context or not. Sometimes, interprocedural analysis
may be needed. So let us simply prevent such an optimization
from happening.
Differential Revision: https://reviews.llvm.org/D73997
Summary:
Moves a batch of instructions from unimplemented-simd128 to simd128
because they have recently become available in V8.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73926
lrint/llrint are defined as rounding using the current rounding
mode. Numbers that can't be converted raise FE_INVALID and an
implementation defined value is returned. They may also write to
errno.
I believe this means we can use cvtss2si/cvtsd2si or fist to
convert as long as -fno-math-errno is passed on the command line.
Clang will leave them as libcalls if errno is enabled so they
won't become ISD::LRINT/LLRINT in SelectionDAG.
For 64-bit results on a 32-bit target we can't use cvtss2si/cvtsd2si
but we can use fist since it can write to a 64-bit memory location.
Though maybe we could consider using vcvtps2qq/vcvtpd2qq on avx512dq
targets?
gcc also does this optimization.
I think we might be able to do this with STRICT_LRINT/LLRINT as
well, but I've left that for future work.
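A rough sketch of the expected lowering (instruction choices are illustrative and assume -fno-math-errno):
```
(i32 (lrint f32:x))   -> cvtss2si      ; SSE, uses the current rounding mode
(i32 (lrint f64:x))   -> cvtsd2si
(i64 (llrint f64:x))  -> fistp m64     ; x87, for 64-bit results on 32-bit targets
```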
Differential Revision: https://reviews.llvm.org/D73859
The CATCHPAD node mostly existed to be selected into the EH_RESTORE
instruction, which sets the frame back up when 32-bit Windows exceptions
return to the parent function. However, creating this MachineInstr early
increases the risk that other passes will come along and insert
instructions that use the stack before ESP and EBP are restored. That
happened in PR44697.
Instead of representing these in the instruction stream early, delay it
until PEI. Mark the blocks where this needs to happen as EHPads, but not
funclet entry blocks. Passes after PEI have to be careful not to hoist
instructions that can use stack across frame setup instructions, so this
should be relatively reliable.
Fixes PR44697
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D73752
https://reviews.llvm.org/D72312 introduced an infinite loop which involves
DAGCombiner::visitFMA and AMDGPUTargetLowering::performFNegCombine.
fma( a, fneg(b), fneg(c) ) => fneg( fma (a, b, c) ) => fma( a, fneg(b), fneg(c) ) ...
This only breaks with types where 'isFNegFree' returns false, e.g. v4f32.
Reproducing the issue also needs the attribute 'no-signed-zeros-fp-math',
and no source mods allowed on one of the users of the Op.
This fix makes changes to indicate that it is not free to negate a fma if it
has users with source mods.
Differential Revision: https://reviews.llvm.org/D73939
This time with correct types for the data result from the SUB.
Original commit message:
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.
This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled.
The usage of the Imm out argument from SelectSMRDOffset is pretty
confusing. Stop trying to reject CI immediates in the case where the
offset field can be used. It's not an illegal way to encode the
immediate, so just prefer the better encoding pattern with
AddedComplexity.
We probably don't even really need the different opcodes for the
different offset types anymore, but that will be more work to cleanup.
The SMRD non-buffer load patterns could also use a cleanup to be done
separately.
AMDGPU and x86 at least both have separate controls for whether
denormal results are flushed on output, and for whether denormals are
implicitly treated as 0 as an input. The current DAGCombiner use only
really cares about the input treatment of denormals.
Similar to D73680 (AArch64 BTI).
A local linkage function whose address is not taken does not need ENDBR32/ENDBR64. Placing the patch label after ENDBR32/ENDBR64 has the advantage that code does not need to differentiate whether the function has an initial ENDBR.
Also, add 32-bit tests and test that .cfi_startproc is at the function
entry. The line information has a general implementation and is tested
by AArch64/patchable-function-entry-empty.mir
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D73760
Linux commit
1cf5b23988 (diff-289313b9fec99c6f0acfea19d9cfd949)
uses "#pragma clang attribute push (__attribute__((preserve_access_index)),
apply_to = record)"
to apply CO-RE relocations to all records including the following pattern:
#pragma clang attribute push (__attribute__((preserve_access_index)), apply_to = record)
typedef struct {
int a;
} __t;
#pragma clang attribute pop
int test(__t *arg) { return arg->a; }
The current approach to use struct/union type in the relocation record will
result in an anonymous struct, which makes later type matching difficult
in the BPF loader. In fact, the current BPF backend will fail the above program
with an assertion:
clang: ../lib/Target/BPF/BPFAbstractMemberAccess.cpp:796: ...
Assertion `TypeName.size()' failed.
clang will change to use the type of the base of the member access
which will preserve the typedef modifier for the
preserve_{struct,union}_access_index intrinsics in the above example.
Here we adjust the BPF backend to accept that the debuginfo
type metadata may be a 'typedef' and handle it properly.
Differential Revision: https://reviews.llvm.org/D73902
Summary:
After following Simon's suggestion about additional testing posted at
https://reviews.llvm.org/D73906, I found several more places that
need to be updated.
Reviewers: simon_tatham, dmgreen, ostannard, eli.friedman
Reviewed By: simon_tatham
Subscribers: merge_guards_bot, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73963
Summary:
This patch changes the underlying type of the ARM::ArchExtKind
enumeration to uint64_t and adjusts the related code.
The goal of the patch is to prepare the code base for a new
architecture extension.
Reviewers: simon_tatham, eli.friedman, ostannard, dmgreen
Reviewed By: dmgreen
Subscribers: merge_guards_bot, kristof.beyls, hiraditya, cfe-commits, llvm-commits, pbarrio
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73906
Under MVE, we do not have any lowering for fminimum, which a
vector_reduce_fmin without NoNan will be expanded into. As with the
other recent patches, force this to expand in the pre-isel pass. Note
that Neon lowering would be OK because the scalar fminimum uses the
vector VMIN instruction, but it is probably better to just rely on the
scalar operations, which is what is done here.
Also fixes what appears to be the reversal of INF vs -INF in the
vector_reduce_fmin widening code.
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating to. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.
I believe this code is a leftover from when i1 was a legal type.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
We were creating two with different operand orders, and then only
using one of them.
Instead just swap the operands when needed and create a single node.
Broadwell was missing half the gather instructions. Both models
had some mixups in the resource costs and number of uops.
I've updated here based on what I think the original IACA source
says with some cross checking against the microcode.
I'm not sure about latency as the IACA source I have doesn't have
that information. So I'm using the latency from uops.info.
I plan to update Skylake models as well, but I'll do that in a
separate patch.
Differential Revision: https://reviews.llvm.org/D73844
This ports the existing case for G_XOR from `getTestBitOperand` in
AArch64ISelLowering into GlobalISel.
The idea is to flip between TBZ and TBNZ while walking through G_XORs.
Let's say we have
```
tbz (xor x, c), b
```
Let's say the `b`-th bit in `c` is 1. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 0.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 1.
So, then
```
tbz (xor x, c), b == tbnz x, b
```
Let's say the `b`-th bit in `c` is 0. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 1.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 0.
So, then
```
tbz (xor x, c), b == tbz x, b
```
Differential Revision: https://reviews.llvm.org/D73929
This implements the following optimization:
```
(tbz (shl x, c), b) -> (tbz x, b-c)
```
Which appears in `getTestBitOperand` in AArch64ISelLowering.cpp.
If we test bit `b` of `shl x, c`, we can fold away the `shl` by looking `c` bits
to the right of `b` in `x` when this fits in the type. So, we can just test the
`b-c`th bit.
Differential Revision: https://reviews.llvm.org/D73924
Summary:
The AIX assembler .space directive can't take a second non-zero argument to fill
with. But LLVM emitFill currently assumes it can. We add a flag to the AsmInfo
to check if non-zero fill is supported, and if we can't fill with non-zero values
we just splat the .byte directives.
Reviewers: stevewan, sfertile, DiggerLin, jasonliu, Xiangling_L
Reviewed By: jasonliu
Subscribers: Xiangling_L, wuzish, nemanjai, hiraditya, kbarton, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73554
Given
```
tb(n)z (and x, m), b
```
Where the `b`-th bit of `m` is 1,
```
tb(n)z (and x, m), b == tb(n)z x, b
```
So, we can walk past a `G_AND` in this case.
Also add test/CodeGen/AArch64/GlobalISel/opt-fold-and-tbz-tbnz.mir to test this.
Differential Revision: https://reviews.llvm.org/D73790
convertPtrAddToAdd improved overall code size and quality by a significant amount,
but at -O0 we generate some cross-class copies due to the fact that we emitted
G_PTRTOINT and G_INTTOPTR around the G_ADD. Unfortunately at -O0 we don't run any
register coalescing, so these cross-class copies end up escaping as moves, and
we end up regressing 3 benchmarks on CTMark (though still a winner overall).
This patch changes the lowering to instead directly emit the G_ADD into the
destination register, and then force changes the dest LLT to s64 from p0. This
should be ok, as all uses of the register should now be selected and therefore
the LLT doesn't matter for the users. It does however matter for the importer
patterns, which will fail to select a G_ADD if there's a p0 LLT.
I'm not able to get rid of the G_PTRTOINT on the source yet however. We can't
use the same trick of breaking the type system since that could break the
selection of the defining instruction. Thus with -O0 we still end up with a
cross class copy on source.
Code size improvements on -O0:
Program baseline new diff
test-suite :: CTMark/Bullet/bullet.test 965520 949164 -1.7%
test-suite...TMark/7zip/7zip-benchmark.test 1069456 1052600 -1.6%
test-suite...ark/tramp3d-v4/tramp3d-v4.test 1213692 1199804 -1.1%
test-suite...:: CTMark/sqlite3/sqlite3.test 421680 419736 -0.5%
test-suite...-typeset/consumer-typeset.test 837076 833380 -0.4%
test-suite :: CTMark/lencod/lencod.test 799712 796976 -0.3%
test-suite...:: CTMark/ClamAV/clamscan.test 688264 686132 -0.3%
test-suite :: CTMark/kimwitu++/kc.test 1002344 999648 -0.3%
test-suite...Mark/mafft/pairlocalalign.test 422296 421768 -0.1%
test-suite :: CTMark/SPASS/SPASS.test 656792 656532 -0.0%
Geomean difference -0.6%
Differential Revision: https://reviews.llvm.org/D73910
Start using a new strategy with a combination of merge and unmerges.
This allows scalarizing before lowering, which in cases like
<2 x s128> avoids producing giant illegal shifts.
Followup to D73135. If the target doesn't have hard float (default
for ARM), then we assert when trying to soften the result of vector
reduction intrinsics. This patch marks these for expansion as well.
(A bit odd to use vectors on a target without hard float ... but
that's where you end up if you expose target-independent vector types.)
Differential Revision: https://reviews.llvm.org/D73854
As detailed on PR43463, this fixes a static analyzer null dereference warning by sinking Changed = true into the if() blocks where the MIB is actually created.
I did a quick check that suggested that one of those if() blocks is always guaranteed to be hit (so we could change it to if-else), but this seems like a safer approach.
Differential Revision: https://reviews.llvm.org/D73883
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73885
These instructions can set the exception in FPSW. But I
don't think they can change FPCW. So this looks like a typo.
Differential Revision: https://reviews.llvm.org/D73864
These can be lowered to code sequences using CMPFP and CMPFPE which then get
selected to VCMP and VCMPE. The implementation isn't fully correct, as the chain
operand isn't handled correctly, but resolving that looks like it would involve
changes around FPSCR-handling instructions and how the FPSCR is modelled.
The fp-intrinsics test was already testing some of this but as the entire test
was being XFAILed it wasn't noticed. Un-XFAIL the test and instead leave the
cases where we aren't generating the right instruction sequences as FIXME.
Differential Revision: https://reviews.llvm.org/D73194
Summary:
In big-endian MVE, the simple vector load/store instructions (i.e.
both contiguous and non-widening) don't all store the bytes of a
register to memory in the same order: it matters whether you did a
VSTRB.8, VSTRH.16 or VSTRW.32. Put another way, the in-register
formats of different vector types relate to each other in a different
way from the in-memory formats.
So, if you want to 'bitcast' or 'reinterpret' one vector type as
another, you have to carefully specify which you mean: did you want to
reinterpret the //register// format of one type as that of the other,
or the //memory// format?
The ACLE `vreinterpretq` intrinsics are specified to reinterpret the
register format. But I had implemented them as LLVM IR bitcast, which
is specified for all types as a reinterpretation of the memory format.
So a `vreinterpretq` intrinsic, applied to values already in registers,
would code-generate incorrectly if compiled big-endian: instead of
emitting no code, it would emit a `vrev`.
To fix this, I've introduced a new IR intrinsic to perform a
register-format reinterpretation: `@llvm.arm.mve.vreinterpretq`. It's
implemented by a trivial isel pattern that expects the input in an
MQPR register, and just returns it unchanged.
In the clang codegen, I only emit this new intrinsic where it's
actually needed: I prefer a bitcast wherever it will have the right
effect, because LLVM understands bitcasts better. So we still generate
bitcasts in little-endian mode, and even in big-endian when you're
casting between two vector types with the same lane size.
For testing, I've moved all the codegen tests of vreinterpretq out
into their own file, so that they can have a different set of RUN
lines to check both big- and little-endian.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73786
Summary:
These instructions generate a vector of consecutive elements starting
from a given base value and incrementing by 1, 2, 4 or 8. The `wdup`
versions also wrap the values back to zero when they reach a given
limit value. The instruction updates the scalar base register so that
another use of the same instruction will continue the sequence from
where the previous one left off.
At the IR level, I've represented these instructions as a family of
target-specific intrinsics with two return values (the constructed
vector and the updated base). The user-facing ACLE API provides a set
of intrinsics that throw away the written-back base and another set
that receive it as a pointer so they can update it, plus the usual
predicated versions.
Because the intrinsics return two values (as do the underlying
instructions), the isel has to be done in C++.
This is the first family of MVE intrinsics that use the `imm_1248`
immediate type in the clang Tablegen framework, so naturally, I found
I'd given it the wrong C integer type. Also added some tests of the
check that the immediate has a legal value, because this is the first
time those particular checks have been exercised.
Finally, I also had to fix a bug in MveEmitter which failed an
assertion when I nested two `seq` nodes (the inner one used to extract
the two values from the pair returned by the IR intrinsic, and the
outer one put on by the predication multiclass).
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73357
Summary:
The unpredicated case of this is trivial: the clang codegen just makes
a vector splat of the input, and LLVM isel is already prepared to
handle that. For the predicated version, I've generated a `select`
between the same vector splat and the `inactive` input parameter, and
added new Tablegen isel rules to match that pattern into a predicated
`MVE_VDUP` instruction.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73356
Summary: There are no counters for individual ports, but this is already
enough to find a lot of issues in the current model (upcoming patch).
Reviewers: dblaikie, gchatelet
Subscribers: hiraditya, tschuett, RKSimon, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72032
Summary:
D68092 introduced a new SIRemoveShortExecBranches optimization pass and
broke some graphics shaders. The problem is that it was removing
branches over KILL pseudo instructions, and the fix is to explicitly
check for that in mustRetainExeczBranch.
Reviewers: critson, arsenm, nhaehnle, cdevadas, hakzsam
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73771
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
This reverts commit e34801c8e6 and the followup due to multiple
problems.
I've tried to keep the tests and RDA parts where possible, as those
still seem useful.
Summary:
Virtual registers that are undef have an empty LiveInterval at this
point, which means beginIndex() and endIndex() cannot be used. We
only need those indices to determine the range in which to scan for
affected other NSA instructions, and undef operands cannot contribute
to that range.
Reviewers: arsenm, rampitec, mareko
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73831
We were checking that the original Value * for the compare operands
were null. But that can never happen.
I believe we intended to check for 0 registers here instead.
Fixes PR44749.
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate that AM.Disp with the symbolic
displacement.
This is my second attempt at committing this after failing
build bots previously. One thing I realized about the previous
attempt is that it's possible that AM.Disp is already non-zero
and the new Offset changes it back to zero. In that case my
previous attempt failed to update AM.Disp to zero. So this patch
removes the early out for 0 and appropriately handles the 0 case
in each check so we still update AM.Disp at the end.
This is based on this llvm-dev thread http://lists.llvm.org/pipermail/llvm-dev/2019-December/137521.html
The current strategy for f16 is to promote the type to float everywhere except where the specific width is required, like loads, stores, and bitcasts. This results in rounding occurring in odd places instead of immediately after arithmetic operations. This interacts in weird ways with the __fp16 type in clang, which is a storage-only type where arithmetic is always promoted to float. InstCombine can remove some fpext/fptruncs around such arithmetic and turn it into arithmetic on half. This wouldn't be so bad if SelectionDAG were able to put those fpext/fpround back in when it promotes.
It is also not obvious how to make the existing strategy work with strict FP. We need to use STRICT versions of the conversions, which require chain operands. But if the conversions are created for a bitcast, there is no place to get an appropriate chain from.
This patch implements a different strategy where conversions are emitted directly around arithmetic operations, and otherwise the value is passed around as an i16, including in arguments and return values. This can result in more conversions between arithmetic operations, but is closer to matching the IR the frontend generates for __fp16. And it will allow us to use the chain from constrained arithmetic nodes to link the STRICT_FP_TO_FP16/STRICT_FP16_TO_FP that will need to be added. I've set it up so that each target can opt into the new behavior. Converting all the targets myself was more than I was able to handle.
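A rough sketch of the new strategy (node names abbreviated; not the exact sequence emitted):
```
(f16 (fadd a, b))  ->  (fp_to_fp16 (fadd (fp16_to_fp a), (fp16_to_fp b)))   ; value carried as i16 between operations
```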
Differential Revision: https://reviews.llvm.org/D73749
This fixes legalizations of global stores > 128-bits. It seems work is
needed on how this split actually occurs. For example, we get the
right code for s160, with an s128 and s32 load, but get 5 s32 loads
for <5 x s32>.
Summary:
Implements the jump pseudo-instruction, which is used in e.g. the Linux kernel.
Reviewers: asb, lenary
Reviewed By: lenary
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73178
When you encounter a G_TRUNC, you are moving from a larger type to a smaller
type.
Asking for the i-th bit on a larger value is the same as asking for the i-th
bit on a smaller value.
So, we should always be able to walk through G_TRUNC when computing the bit
for a TB(N)Z.
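Expressed in the style of the other TB(N)Z folds:
```
(tbz (trunc x), b) -> (tbz x, b)
```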
Differential Revision: https://reviews.llvm.org/D73748
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
Try out using combine definition rules.
This really should be a post-legalizer combine, but the combiner pass
is currently pre-legalize. Most of the target combines are really
post-legalize, so we should probably move the pass.
For an MC_GlobalAddress reference to a dso_local external GlobalValue with a definition, emit .Lfoo$local to avoid a relocation.
-fno-pic and -fpie can infer dso_local but -fpic cannot. In the future,
we can explore the possibility of inferring dso_local with -fpic. As the
description of D73228 says, LLVM's existing IPO optimization behaviors
(like -fno-semantic-interposition) and a previous assembly behavior give
us enough license to be aggressive here.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D73230
I believe this also fixes bugs with CI 32-bit handling, which was
incorrectly skipping offsets that look like signed 32-bit values. Also
validate the offsets are dword aligned before folding.
This is similar to the code in getTestBitOperand in AArch64ISelLowering. Instead
of implementing all of the TB(N)Z optimizations at once, this patch implements
the simplest case first. The way that this is set up should make it fairly easy
to add the rest as we go along.
The idea here is that after determining that we can use a TB(N)Z, we can
continue looking through instructions and perform further folding.
In this case, when we have a G_ZEXT or G_ANYEXT where the extended bits are not
used, we can fold it into the TB(N)Z.
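In the style of the other TB(N)Z folds (a sketch; b must stay within the width of x):
```
(tbz (zext x), b)   -> (tbz x, b) when b is < # bits in x
(tbz (anyext x), b) -> (tbz x, b) when b is < # bits in x
```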
Differential Revision: https://reviews.llvm.org/D73673
Found by inspection, but there's no test for this yet because G_PTR_ADD is
currently illegal for vectors. I'll add the test at a later time when the
legalizer support has landed.
There's not much value to this separate node from the intrinsic. Make
the operand structure the same as the intrinsic, so we can reuse the
same pattern for GlobalISel.
Summary:
Added file headers for files which implement iterative lightweight scheduling
strategies. This is basically an exercise which I undertook in order to get
used to the LLVM development process.
Reviewers: arsenm, vpykhtin, cdevadas
Reviewed By: vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73417
Summary:
For -fpatchable-function-entry=N,0 -mbranch-protection=bti, after
9a24488cb6, we place the NOP sled after
the initial BTI.
```
.Lfunc_begin0:
bti c
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lfunc_begin0
```
This patch adds a label after the initial BTI and changes the __patchable_function_entries entry to reference the label:
```
.Lfunc_begin0:
bti c
.Lpatch0:
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lpatch0
```
This placement is compatible with the resolution in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424 .
A local linkage function whose address is not taken does not need a BTI.
Placing the patch label after BTI has the advantage that code does not
need to differentiate whether the function has an initial BTI.
Reviewers: mrutland, nickdesaulniers, nsz, ostannard
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73680
fadd/fmul reductions without reassoc are lowered to
VECREDUCE_STRICT_FADD/FMUL nodes, which don't have legalization
support. Until that is in place, expand these intrinsics on
ARM and AArch64. Other targets always expand the vector reduction
intrinsics.
Additionally expand fmax/fmin reductions without nonan flag on
AArch64, as the backend asserts that the flag is present when
lowering VECREDUCE_FMIN/FMAX.
This fixes https://bugs.llvm.org/show_bug.cgi?id=44600.
Differential Revision: https://reviews.llvm.org/D73135
The recommended optimization level for BPF programs
is -O2, since (1) BPF programs run inside the kernel and
the Linux kernel won't work at the -O0 level, and (2) the verifier
is not able to handle -O0 code properly, e.g., potentially
large stack size and a lot of spills.
But we should keep -O0 at least compiling.
This patch fixes a bug in the BPFMISimplifyPatchable phase
where, with -O0, a segmentation fault will happen for a
simple program like:
int test(int a, int b) { return a + b; }
A test case is added to capture such a case.
Differential Revision: https://reviews.llvm.org/D73681
Summary:
This patch intends to support the three most common relocation types
on AIX: R_POS, R_TOC, R_RBR.
These three relocation types will be needed for object file generation
on AIX for small code model.
We will have follow up patches to bring relocation support for
large code model on AIX.
Reviewers: hubert.reinterpretcast, daltenty, DiggerLin
Differential Revision: https://reviews.llvm.org/D72027
By adding the prefixed instructions the branch distances are no longer
computed correctly. Since prefixed instructions cannot cross a 64 byte
boundary we have to assume that a prefixed instruction may have a nop
prepended to it. This patch tries to take that nop into consideration
when computing the size of basic blocks.
Differential Revision: https://reviews.llvm.org/D72572
The commit https://reviews.llvm.org/rG14fc20ca6 added some options to the X86
back end that cause the help text for opt/llc to become much harder to read.
The issue is that the cl::value_desc is part of the option name and is used to
compute the indentation of the description text (i.e. the maximum length option
name is what everything aligns to). Since the commit puts a large number of
characters into that text, everything is aligned to that width.
This patch just reformats the option so that the description is contained in the
description and the list of possible values is within the angle brackets.
Note: the readability issue of the help text was fixed in commit
70cbf8c71c, but the re-formatting wasn't
added on that commit so I am still committing this.
Differential revision: https://reviews.llvm.org/D73267
Strict fp-to-int and int-to-fp conversions can be handled in the same way that
the non-strict versions are (by using the appropriate instruction or converting
to a function call when we have no instruction).
Differential Revision: https://reviews.llvm.org/D73625
On targets that don't have the normal packed f16 layout, handle these
during legalization. Directly modify the register types. We can infer
this was a d16 load based on the mem operand size during selection.
A16 operands should possibly be handled here as well, but don't worry
about that yet.
This trivially avoids violating the constant bus restriction.
Previously this was allowing one SGPR in the first source
operand, which technically also avoided violating this for most
operations (but not for special cases reading vcc).
We do need to write some new, smarter operand folds to pick the
optimal SGPR to use in some kind of post-isel fold, but that's purely
an optimization.
I was originally thinking we would pick which operands should be SGPRs
in RegBankSelect, but I think this isn't really manageable. There
would be additional complexity to handle every G_* instruction, and
then any nontrivial instruction patterns would need to know when to
avoid violating it, which is likely to be very error prone.
I think having all inputs being canonically copies to VGPRs will
simplify the operand folding logic. The current folding we do is
backwards, and only considers one operand at a time, relative to
operands it already has. It therefore poorly handles the case where
there is already a constant bus operand user. If all operands are
copies, it's somewhat simpler to consider all input operands at once
to choose the optimal constant bus user.
Since the failure mode for constant bus violations is now a verifier
error and not a selection failure, this moves towards a place where
we can turn on the fallback mode. The SGPR copy folding optimizations
can be left for later.
A known limitation for Future CPU is that the new prefixed instructions may
not cross 64 Byte boundaries.
All instructions are already 4 byte aligned so the only situation where this
can occur is when the prefix is in one 64 byte block and the instruction that
is prefixed is at the top of the next 64 byte block. To fix this case
PPCELFStreamer was added to intercept EmitInstruction. When a prefixed
instruction is emitted we try to align it to 64 Bytes by adding a maximum of
4 bytes. If the prefixed instruction crosses the 64 Byte boundary then the
alignment would trigger and a 4 byte nop would be added to push the
instruction into the next 64 byte block.
Differential Revision: https://reviews.llvm.org/D72570
This gets selected to the appropriate fcvt instruction. Handling from there on
isn't fully correct yet, as we need to model fcvt reading and writing to fpsr
and fpcr.
Differential Revision: https://reviews.llvm.org/D73201
These become STRICT_FCMP and STRICT_FCMPE, which then get selected to the
corresponding FCMP and FCMPE instructions, though the handling from there on
isn't fully correct as we don't model reads and writes to FPCR and FPSR.
Differential Revision: https://reviews.llvm.org/D73368
Summary:
The code was assuming in a few places that if there was only one exit
from the function that it was a normal return, which is invalid. It
could be an infinite loop, in which case we still need to insert the
usual fake edge so that the null export happens. This fixes shaders that
end with an infinite loop that discards.
Reviewers: arsenm, nhaehnle, critson
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71192
Summary:
As determined with `llvm-exegesis`.
Some of these look like typos/misunderstandings of the sched model td
spec:
- latency defaults to `1` when not set => Maybe we can avoid
having a default ?
- problems with regexps not being anchored by default (XCHG matching
CMPXCHG)
Note that this is not complete, it fixes only the most obvious mistakes,
and only for latency (not uops).
Reviewers: RKSimon, GGanesh
Subscribers: hiraditya, jfb, mstojanovic, hfinkel, craig.topper, andreadb, lebedev.ri, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73172
This lowering tries to look for G_PTR_ADD instructions and then converts
them to a standard G_ADD with a COPY on the source, and G_INTTOPTR on the
result. This is ok for address space 0 on AArch64 as p0 can be treated as
s64.
The motivation behind this is to expose the add semantics to the imported
tablegen patterns. We shouldn't need to check for uses being loads/stores,
because the selector works bottom up, uses before defs. By the time we
end up trying to select a G_PTR_ADD, we should have already attempted to
fold it into addressing modes and been unsuccessful.
This gives some performance and code size improvements across the board.
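Roughly, the lowering is (MIR sketch, types abbreviated):
```
%p:p0 = G_PTR_ADD %base:p0, %off:s64
  =>
%b:s64   = COPY %base
%sum:s64 = G_ADD %b, %off
%p:p0    = G_INTTOPTR %sum
```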
Differential Revision: https://reviews.llvm.org/D73673
Currently some prefixes are emitted as instructions. To distinguish them
from real instructions, the function isPrefix() is added. The kinds of prefix
are consistent with X86GenInstrInfo.inc.
Differential Revision: https://reviews.llvm.org/D73013
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate that AM.Disp with the symbolic
displacement.
This passes fold-add-pcrel.ll.
Differential Revision: https://reviews.llvm.org/D73608
We seem to be inheriting the cost from sse4.1. But if we have 256-bit registers we should be able to do this with just one extract to split the v16i16 and two v8i16->v8i32 operations, so our cost should be 3 not 4.
Differential Revision: https://reviews.llvm.org/D73646
This is passed to legalizeCustom, but not intrinsic. Also remove the
MRI argument, since you can get that from the MachineIRBuilder.
I'm not sure why MachineIRBuilder has a private observer member, and
this is passed separately.
Rename the Destructive enumerator in preparation for a larger set of patches to
support prefixing destructive operations with MOVPRFX.
Differential Revision: https://reviews.llvm.org/D73212
When the bit is <= 32, we have to use the W register variant for TB(N)Z.
This is because of the way the instruction is encoded.
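For example (the target label is a placeholder):
```
tbz w0, #3,  .Ltarget    ; small bit index: W register form
tbz x0, #35, .Ltarget    ; bit index beyond the W register width: X register form
```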
Differential Revision: https://reviews.llvm.org/D73660
Some housekeeping for the DestructiveInstType enum before a larger set of patches to support prefixing destructive operations with MOVPRFX.
Differential Revision: https://reviews.llvm.org/D73141
A previous patch should have added pld and pstd and any support code in
the backend that is required for prefixed load and store type operations.
This patch adds a number of additional prefixed load and store type
instructions for the future CPU.
Differential Revision: https://reviews.llvm.org/D72577
Summary:
The initialization of RegisterBank needs to be done only once. The
logic of AlreadyInit has a data race; use llvm::call_once instead.
This is continuing work of D73587.
Reviewers: arsenm, rovka, dsanders, t.p.northover, efriedma, apazos
Reviewed By: arsenm
Subscribers: wdng, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73605
Summary:
The initialization of RegisterBank needs to be done only once. The
logic of AlreadyInit has a data race; use llvm::call_once instead.
This is continuing work of D73587.
Reviewers: arsenm, tstellar, ronlieb, efriedma, apazos, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73604
Summary:
The initialization of RegisterBank needs to be done only once. The
logic of AlreadyInit has a data race; use llvm::call_once instead.
This issue was identified through thread sanitizer.
Reviewers: efriedma, apazos, qcolombet, dsanders
Reviewed By: efriedma
Subscribers: arsenm, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73587
ISD::FROUND is defined to round to nearest with ties rounding
away from 0. This mode isn't supported in hardware on X86.
But as long as we aren't compiling with trapping math, we can
emulate this with floor(X + copysign(nextafter(0.5, 0.0), X)).
We have to use nextafter to avoid some corner cases that adding
0.5 would have. For example, if X is nextafter(0.5, 0.0) it should
round to 0.0, but adding 0.5 would need one extra bit of mantissa
than can be stored so it rounds to 1.0. Adding nextafter(0.5, 0.0)
instead will just increase the exponent by 1 and leave the mantissa
as all 1s. This would be nextafter(1.0, 0.0) which will floor to 0.0.
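In DAG terms, the emulation is roughly:
```
(fround x) -> (ffloor (fadd x, (fcopysign C, x)))    ; C = nextafter(0.5, 0.0)
```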
Technically this requires -fno-trapping-math, which isn't our default.
But if we care about exceptions we should be using constrained
intrinsics. Constrained intrinsics would use STRICT_FROUND which
won't go through this code.
Fixes PR42195.
Differential Revision: https://reviews.llvm.org/D73607