This is a hack to fix illegal 32-bit to 16-bit copies.
The problem is that making 16-bit subregs legal creates
a huge number of failures which cannot all be resolved
at once without a temporary hack like this.
The next step is to change operands, instruction definitions
and patterns until this hack is not needed.
Differential Revision: https://reviews.llvm.org/D79119
Summary: This change enables all kinds of carry-out ISD opcodes to be selected according to the node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
Summary:
This patch adds AArch64ISD nodes for [S|U]MIN_PRED
and [S|U]MAX_PRED, and lowers both SVE intrinsics and
IR operations for min and max to these nodes.
There are two forms of these instructions for SVE: a predicated
form and an immediate (unpredicated) form. The patterns
which existed for the latter have been updated to match a
predicated node with an immediate and map this
to the immediate instruction.
Reviewers: sdesmalen, efriedma, dancgr, rengolin
Reviewed By: efriedma
Subscribers: huihuiz, tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79087
We haven't promoted AND/OR/XOR to vXi64 types for a while, so
there's no reason to use isOperationLegalOrPromote; we can
just use isOperationLegal by merging with the ADD handling.
Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.
Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm, then concat the two halves using vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.
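A minimal intrinsics sketch of the new sequence (function name hypothetical; assumes an AVX512F target):
```
#include <immintrin.h>

__m128i trunc_two_v8i64_to_v16i8(__m512i lo, __m512i hi) {
  __m128i lo8 = _mm512_cvtepi64_epi8(lo);  // vpmovqb: v8i64 -> v8i8 in the low half of an xmm
  __m128i hi8 = _mm512_cvtepi64_epi8(hi);  // vpmovqb: v8i64 -> v8i8 in the low half of an xmm
  return _mm_unpacklo_epi64(lo8, hi8);     // vpunpcklqdq merges the halves at the end
}
```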
I had to add SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.
Differential Revision: https://reviews.llvm.org/D79231
The splitVector helper uses extractSubVector which splits build vectors like we do here, so avoid reimplementing it.
splitVector could easily be extended to peek through bitcasts as well but I'd prefer to keep this commit NFC.
Summary:
The current lowering of `select` on RISC-V uses a branch instruction to load a
register with one or the other value. This is inefficient, especially in the
case of small constants that can be computed easily.
By implementing the TargetLowering::convertSelectOfConstantsToMath hook, some of
the simpler cases are covered that let us avoid introducing a branch in these
cases.
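A minimal sketch of the override (the hook itself exists in TargetLowering; the unconditional `return true` is an assumption for illustration):
```
bool RISCVTargetLowering::convertSelectOfConstantsToMath(EVT VT) const {
  // Prefer branch-free arithmetic for selects between constants.
  return true;
}
```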
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D79260
Summary:
This patch addresses some weird assembly sequences we were seeing when
comparing floats. In particular, comparing a float to itself tells you whether
it is NaN or not, which we were doing correctly, but with an extra unneeded
`and` instruction.
This patch specialises the existing patterns to remove the `and` instructions
when both their operands are the same.
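A small illustration, assuming the redundant `and` came from an ordered comparison with identical operands:
```
#include <cmath>

// A NaN self-check like this previously emitted two identical feq.s
// results joined by a redundant 'and'; one feq.s is enough.
bool notNaN(float a) { return !std::isnan(a); }
```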
Reviewed By: luismarques, asb
Differential Revision: https://reviews.llvm.org/D78908
Summary:
As described in https://github.com/WebAssembly/simd/pull/209. This is
the final reorganization of the SIMD opcode space before
standardization. It has been landed in concert with corresponding
changes in other projects in the WebAssembly SIMD ecosystem.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79224
Over time, we have made many additions to this file and it has frankly become a
bit of a mess. This has led to at least one issue: we have a number of
instructions whose side-effects flag should be set to false, but we neglected
to do so. This patch suggests a refactoring that should make the file much
more maintainable. The file is split up into major sections and the nesting
level is reduced, predicate blocks merged, etc.
Sections:
- Custom PPCISD node definitions
- Predicate definitions
- Instruction formats
- Instruction definitions
- Helper DAG definitions
- Anonymous patterns
- Instruction aliases
Differential revision: https://reviews.llvm.org/D78132
Handle concat_vectors(extract_subvector(broadcast(x)), extract_subvector(broadcast(x))) -> broadcast(x)
To expose this we also need collectConcatOps to recognise the insert_subvector(x, extract_subvector(x, lo), hi) subvector splat pattern
VMEM loads of the same type (sampler vs no sampler) are guaranteed to
write their result registers in order, so there is no need for an
s_waitcnt even if they write to overlapping vgprs.
Differential Revision: https://reviews.llvm.org/D79176
While restoring latency, check whether any register of the
source instruction is a subregister of the successor instruction's
registers, in addition to checking for the same register.
The setting of `MCAsmInfo` properties for XCOFF got split between
`MCAsmInfoXCOFF` and `PPCXCOFFMCAsmInfo`. Except for the properties that
are dependent on the target information being passed via the
constructor, the properties being set in `PPCXCOFFMCAsmInfo` had no
fundamental reason for being treated as specific for XCOFF on PowerPC.
Indeed, the property that might be considered more specific to PowerPC,
`NeedsFunctionDescriptors`, was set in `MCAsmInfoXCOFF`.
XCOFF being specific to PowerPC anyway, this patch consolidates the
setting of the properties into `MCAsmInfoXCOFF` except for the cases
that are dependent on the information provided via the
`PPCXCOFFMCAsmInfo` constructor.
This patch also reorders the assignments to the fields to match the
declaration order in `MCAsmInfo`.
This patch adds the x, t and g modifiers for inline asm from GCC. These will print a vector register as xmm*, ymm* or zmm* respectively.
I also fixed register names with modifiers in the Intel dialect so they are no longer printed with a leading %.
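A hedged example of the new modifiers in GCC-style inline asm (requires an AVX build; the function is illustrative only):
```
#include <immintrin.h>

__m256i zero(__m256i v) {
  // '%x0' prints operand 0 as an xmm register ('%t0' would print ymm,
  // '%g0' zmm); vpxor on the xmm half zeroes the whole ymm register.
  asm("vpxor %x0, %x0, %x0" : "+x"(v));
  return v;
}
```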
Patch by Amanieu d'Antras
Differential Revision: https://reviews.llvm.org/D78977
Also fix some cost tables for vXi1 types to match the costs entries for the types they will be promoted to.
Differential Revision: https://reviews.llvm.org/D79045
vpermw is 2 uops. vpermt2b/vpermt2w are two shuffle uops and a port 015 uop. Weirdly vpermb is a single uop.
This patch bumps the cost to 2 for these operations. Maybe it should go to 3 for the vpermt2*, but I've started conservative.
I've also removed a few entries that were now the same as earlier subtargets, or that described sequences I don't think we really generate. For example, I don't think we extend v32i8 to v32i16, shuffle, and then truncate.
Differential Revision: https://reviews.llvm.org/D79148
This pushes the NOT pattern up the DAG to help expose it for further combines (AND->ANDN in particular).
The PSHUFD/MOVDDUP 'splat' cases are the only ones I've seen in the wild so far, we can further generalize if/when we need to.
X86 matches several 'shift+xor' funnel shift patterns:
fold (or (srl (srl x1, 1), (xor y, 31)), (shl x0, y)) -> (fshl x0, x1, y)
fold (or (shl (shl x0, 1), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
fold (or (shl (add x0, x0), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
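A minimal illustration of the first fold in plain C++ (assuming the shift amount is already reduced modulo 32): since (y ^ 31) == 31 - y for y in [0, 31], ((x1 >> 1) >> (y ^ 31)) equals x1 >> (32 - y), with the y == 0 case handled safely by the extra single-bit shift.
```
#include <cstdint>

uint32_t fshl_expanded(uint32_t x0, uint32_t x1, uint32_t y) {
  return ((x1 >> 1) >> (y ^ 31)) | (x0 << y);  // == fshl(x0, x1, y) for y in [0, 31]
}
```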
These patterns are also what we end up with from the proposed expansion changes in D77301.
This patch moves these to DAGCombine's generic MatchFunnelPosNeg.
All existing X86 test cases still pass, and we just have a small codegen change in pr32282.ll.
Reviewed By: @spatel
Differential Revision: https://reviews.llvm.org/D78935
Summary:
This patch implements custom floating-point reduction ISD nodes that
have vector results, which are used to lower the following intrinsics:
* llvm.aarch64.sve.fadda
* llvm.aarch64.sve.faddv
* llvm.aarch64.sve.fmaxv
* llvm.aarch64.sve.fmaxnmv
* llvm.aarch64.sve.fminv
* llvm.aarch64.sve.fminnmv
SVE reduction instructions keep their result within a vector register,
with all other bits set to zero.
Changes in this patch were implemented by Paul Walker and Sander de
Smalen.
Reviewers: sdesmalen, efriedma, rengolin
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D78723
Summary:
This patch tries to ensure that we do something sensible when
generating code for the ISD::INSERT_VECTOR_ELT DAG node when operating
on scalable vectors. Previously we always returned 'undef' when
inserting an element into an out-of-bounds lane index, whereas now
we only do this for fixed-length vectors. For scalable vectors it
is assumed that the backend will do the right thing, in the same way
that it already has to deal with variable lane indices.
In this patch I have permitted a few basic combinations for scalable
vector types where it makes sense, but in general avoided most cases
for now as they currently require the use of BUILD_VECTOR nodes.
This patch includes tests for all scalable vector types when inserting
into lane 0, but I've only included one or two vector types for other
cases such as variable lane inserts.
Differential Revision: https://reviews.llvm.org/D78992
The generic implementation is actually specific to x86. It assumes the
offset is relative to the end of the instruction and the immediate is
not scaled (which is false on most RISC).
This section is the remnant of how this code was structured before
we made v32i16/v64i8 legal types with avx512f when not restricting
to 256 bit vectors. Now that there are just a few items left,
merge them near similar things in the other section.
We generate much better code these days than we used to, and we use the same sequence for AVX1 and AVX2 for these cases.
For v4i64->v4i32 we generate:
vextractf128 xmm1, ymm0, 1
vshufps xmm0, xmm0, xmm1, 136 # xmm0 = xmm0[0,2],xmm1[0,2]
And for v8i64->v8i32 we generate:
vperm2f128 ymm2, ymm0, ymm1, 49 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128 ymm0, ymm0, xmm1, 1
vshufps ymm0, ymm0, ymm2, 136 # ymm0 = ymm0[0,2],ymm2[0,2],ymm0[4,6],ymm2[4,6]
Differential Revision: https://reviews.llvm.org/D79109
For compatibility with other assemblers on the platform, allow
using just plain integer register numbers in all places where a
register operand is expected.
Bug: llvm.org/PR45582
Remove redundant Group and Regs arguments from parseRegister
and eliminate one of its overloaded versions.
Remove redundant Regs argument from parseAddress.
NFC intended.
This is currently enabled for Intel big cores from Sandy Bridge onward, as well as Atom, Silvermont, and KNL, due to 64-bit division being so slow on these cores. AMD cores can do this in hardware (use 32-bit division based on input operand width), so it's not a win there. But since the majority of x86 CPUs benefit from this optimization, and since the potential upside is significantly greater than the downside, we should enable this for the generic x86-64 target.
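A sketch of the idea as a source-level equivalent (illustrative only; the real transform rewrites the IR late in CodeGen):
```
#include <cstdint>

uint64_t div64(uint64_t a, uint64_t b) {
  if (((a | b) >> 32) == 0)                      // both operands fit in 32 bits
    return uint64_t(uint32_t(a) / uint32_t(b));  // fast 32-bit divide
  return a / b;                                  // slow full-width 64-bit divide
}
```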
Patch By: @atdt
Reviewed By: @craig.topper, @RKSimon
Differential Revision: https://reviews.llvm.org/D75567
Summary:
AArch64's system register ERXTS_EL1 is present in the backend as a
component of the Arm Reliability, Availability and Serviceability (RAS)
extension. However, it has been removed from the specification before
its final release.
This patch removes the register.
Reviewers: SjoerdMeijer, DavidSpickett
Reviewed By: DavidSpickett
Subscribers: DavidSpickett, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79007
The improvements to the x86 vector insert/extract element costs in D74976 resulted in the estimated costs for vector initialization and scalarization increasing beyond what should be expected. This is particularly noticeable on pre-SSE4 targets where the availability of legal INSERT_VECTOR_ELT ops is more limited.
This patch does 2 things:
1 - it implements X86TTIImpl::getScalarizationOverhead to more accurately represent the typical costs of a ISD::BUILD_VECTOR pattern.
2 - it adds a DemandedElts mask to getScalarizationOverhead to permit the SLP's BoUpSLP::getGatherCost to be rewritten to use it directly instead of accumulating raw vector insertion costs.
This fixes PR45418 where a v4i8 (zext'd to v4i32) was no longer vectorizing.
A future patch should extend X86TTIImpl::getScalarizationOverhead to tweak the EXTRACT_VECTOR_ELT scalarization costs as well.
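A hedged usage sketch (the signature shown is assumed from this patch; TTI and VecTy stand for a TargetTransformInfo instance and the gathered vector type):
```
// Cost only the two lanes the gather actually inserts rather than all four.
APInt DemandedElts(4, 0);
DemandedElts.setBit(0);
DemandedElts.setBit(2);
unsigned Cost = TTI.getScalarizationOverhead(VecTy, DemandedElts,
                                             /*Insert=*/true,
                                             /*Extract=*/false);
```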
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D78216
This allows it to analyze 16-bit subregs without crashing if those
appear in the instructions. At the same time it does not attempt
to reassign them. It can still correctly identify register
banks to let larger registers be reassigned.
More work will be needed here when real instructions use
these registers, and more tests as well.
Differential Revision: https://reviews.llvm.org/D78772
We generate PACK instructions with an undef second source when we are truncating from a 128-bit vector to something narrower and we don't care about the upper bits of the vector register. The register allocation process will always assign untied undef uses to xmm0. This creates a false dependency on xmm0.
By adding these instructions to hasUndefRegUpdate, we can get the BreakFalseDeps pass to reassign the source to match the other input. Normally this interface is used for instructions that might need an xor inserted to break the dependency. But the pass also has a heuristic that tries to use the same register as other sources. That should always be possible for these instructions so we'll never trigger the xor dependency break.
Differential Revision: https://reviews.llvm.org/D79032
These are used in SReg_32, and when we start to use SGPR_LO16
there will be complaints that not all registers in the RC support
all subreg indexes. For now it is NFC.
Unused regunits are reserved so that the verifier does not complain
about missing phys reg live-ins.
Differential Revision: https://reviews.llvm.org/D78591
Generalize the 16-bit FPR to 32-bit GPR logic to work for all cases where
the destination size is bigger than the source size.
Also fixed CheckCopy() always returning true instead of the result of
isValidCopy().
Differential Revision: https://reviews.llvm.org/D77530
Patch by tambre (Raul Tambre)
Implement passing of ByVal formal arguments when the argument is passed
partly in the argument registers, with the remainder of the argument
passed on the stack.
Differential Revision: https://reviews.llvm.org/D78515
Similar to code in `getAArch64Cmp` in AArch64ISelLowering.
When we get a compare against a constant, sometimes that constant isn't valid
for selecting an immediate form.
However, sometimes you can get a valid constant by adding 1 or subtracting 1
and updating the condition code.
This implements the following transformations when valid:
- x slt c => x sle c - 1
- x sge c => x sgt c - 1
- x ult c => x ule c - 1
- x uge c => x ugt c - 1
- x sle c => x slt c + 1
- x sgt c => x sge c + 1
- x ule c => x ult c + 1
- x ugt c => x uge c + 1
Valid meaning the constant doesn't wrap around when we fudge it, and the result
gives us a compare which can be selected into an immediate form.
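A minimal self-contained sketch of the adjustment (the enum and helper name are hypothetical; the real code works on MachineOperands in the AArch64 GlobalISel selector):
```
#include <cstdint>
#include <optional>
#include <utility>

enum Pred { SLT, SLE, SGT, SGE };  // two of the eight cases shown

std::optional<std::pair<int64_t, Pred>> adjustCmpImm(Pred P, int64_t C) {
  switch (P) {
  case SLT:                              // x slt c => x sle c - 1
    if (C == INT64_MIN)
      return std::nullopt;               // c - 1 would wrap
    return std::make_pair(C - 1, SLE);
  case SGT:                              // x sgt c => x sge c + 1
    if (C == INT64_MAX)
      return std::nullopt;               // c + 1 would wrap
    return std::make_pair(C + 1, SGE);
  default:
    return std::nullopt;                 // remaining cases are analogous
  }
}
```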
This also moves `getImmedFromMO` higher up in the file so we can use it.
Differential Revision: https://reviews.llvm.org/D78769
I've modified isTruncateFree to get an accurate cost for types that need to be split. I'm planning to look into fixing it for all vectors, but need more cost cleanups first.
Differential Revision: https://reviews.llvm.org/D78973
Add support for reserving LR in:
* the driver through `-ffixed-x30`
* cc1 through `-target-feature +reserve-x30`
* the backend through `-mattr=+reserve-x30`
* a subtarget feature `reserve-x30`
the same way we're doing for the other registers.
Summary:
They all match the base implementation in
TargetInstrInfo::isUnpredicatedTerminator.
Follow up to D62749.
Reviewers: echristo, MaskRay, hfinkel
Reviewed By: echristo
Subscribers: wuzish, nemanjai, hiraditya, kbarton, llvm-commits, srhines
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78976
This changes the logic for lowering fp16 bitcasts to always produce
either a VMOVhr or a VMOVrh, instead of only trying to do it with
certain surrounding nodes. To keep performing the same optimisations, demanded
bits and known bits information has been added for them.
Differential Revision: https://reviews.llvm.org/D78587
getTargetStreamer() might return null (e.g. when running the inlined-strings.ll test),
so downcasting to a reference would be wrong. This is detectable with -fsanitize=null.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D78686
There are several different types of cost that TTI tries to provide
explicit information for: throughput, latency, code size along with
a vague 'intersection of code-size cost and execution cost'.
The vectorizer is a keen user of RecipThroughput, and there are at least
'getInstructionThroughput' and 'getArithmeticInstrCost' designed to
help with this cost. The latency cost has a single use and a single
implementation. The intersection cost appears to cover most of the
rest of the API.
getUserCost is explicitly called from within TTI when the user has
been explicit in wanting the code size (also only one use) as well
as a few passes which are concerned with a mixture of size and/or
a relative cost. In many cases these costs are closely related, such
as when multiple instructions are required, but one evident diverging
cost in this function is for div/rem.
This patch adds an argument so that the cost required is explicit,
so that we can make the important distinction when necessary.
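A hedged sketch of the new call sites (enum spellings assumed from this patch):
```
// Ask explicitly for the kind of cost required instead of an implicit mix.
int SizeCost = TTI.getUserCost(&I, TargetTransformInfo::TCK_CodeSize);
int ThruCost = TTI.getUserCost(&I, TargetTransformInfo::TCK_RecipThroughput);
```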
Differential Revision: https://reviews.llvm.org/D78635
Summary:
Changing all mnemonics to match assembly instructions to simplify mnemonic
naming rules. This time, update all branch instructions. This also changes
to use the %s10 register consistently.
Differential Revision: https://reviews.llvm.org/D78889
Summary:
Add simm7fp/mimmfp to represent floating-point immediate values.
Also clean up the multiclasses that define floating-point arithmetic
instructions to handle simm7fp/mimmfp operands, and add several
regression tests for the new operands.
Differential Revision: https://reviews.llvm.org/D78887
Currently, on the PowerPC target, we use the function-scope UnsafeFPMath
option to drive the Machine Combiner pass.
This is not accurate in two ways:
1: the scope is not accurate. The Machine Combiner pass only requires
instruction-level flags instead of the function scope.
2: the floating-point flag is not accurate. The Machine Combiner pass
only requires the floating-point flags reassoc and nsz.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D78183
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().
I also made a few cleanups in here. For example, removing uses
of getElementType on a pointer where we could just use getFunctionType
from the call.
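A hedged sketch of the replacement pattern, given a CallBase &CB:
```
// Before: Value *Callee = CB.getCalledValue();   // deprecated
Value *Callee = CB.getCalledOperand();            // direct equivalent
// ...and take the function type from the call itself instead of going
// through the pointer's element type:
FunctionType *FTy = CB.getFunctionType();
```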
Differential Revision: https://reviews.llvm.org/D78882
Summary:
In the ppc-expand-isel pass, we use stepForward() to update the
liveins. This function is not recommended because it needs
accurate kill info.
This patch uses the function computeAndAddLiveIns() to update the
liveins; it's the recommended method and can fix the liveins bug for
the ppc-expand-isel pass.
Reviewed By: efriedma, lkail
Differential Revision: https://reviews.llvm.org/D78657
Summary:
While looking into issues with IfConverter, I noticed that
X86InstrInfo::isUnpredicatedTerminator matched the implementation
it overrides in TargetInstrInfo::isUnpredicatedTerminator.
Reviewers: craig.topper, hfinkel, MaskRay, echristo
Reviewed By: MaskRay, echristo
Subscribers: hiraditya, llvm-commits, srhines
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62749
"unskipableSimplifyCode()" was added to handle unsafe BL8_NOTOC instruction
when TOC was not completely removed. The function is not needed after confirming
TOC pointer is not used in a function that uses PC-Relative addressing.
Differential Revision: https://reviews.llvm.org/D78517
Insert .cfi_offset/.cfi_register when IncomingCSRSaved of current block
is larger than OutgoingCSRSaved of its previous block.
Original commit message:
https://reviews.llvm.org/D42848 only handled CFA related cfi directives but
didn't handle CSR related cfi. The patch adds the CSR part. Basically it reuses
the framework created in D42848. For each basic block, the patch tracks which
CSR set has been saved at its CFG predecessors' exits, and compares that CSR
set with the set at its previous basic block's exit (the previous block is the
block laid out before the current block). If the saved CSR set at its previous
basic block's exit is larger, .cfi_restore will be inserted.
The patch also generates proper .cfi_restore in epilogue to make sure the
saved CSR set is consistent for the incoming edges of each block.
Differential Revision: https://reviews.llvm.org/D74303
All avx512 truncate instructions except vXi64->vXi32 are 2 uops
on port 5. So raise their costs to 2. Except when we have an
earlier faster sequence like pshufb for 128 bit input vectors.
Add a lower cost of 3 for v16i16->v16i8 with avx512f, where we can
extend to v16i32 then truncate. And a cost of 2 for avx512bw with
and without avx512vl; there we can use vpmovwb with either a ymm
or zmm input. Both of these beat masking, splitting, and using
packuswb, which is our avx/avx2 codegen.
Tail Calls were initially disabled for PC Relative code because it was not safe
to make certain assumptions about the tail calls (namely that all compiled
functions no longer used the TOC pointer in R2). However, once all of the
TOC pointer references have been removed it is safe to tail call everything
that was tail called prior to the PC relative additions as well as a number of
new cases.
For example, it is now possible to tail call indirect functions as there is no
need to save and restore the TOC pointer for indirect functions if the caller
is marked as may clobber R2 (st_other=1). For the same reason it is now also
possible to tail call functions that are external.
Differential Revision: https://reviews.llvm.org/D77788
D63847 added `MCInstrAnalysis::evaluateMemoryOperandAddress()`. This patch
leverages the feature to print the target addresses for evaluable instructions.
```
-400a: movl 4080(%rip), %eax
+400a: movl 4080(%rip), %eax # 5000 <data1>
```
This patch also deletes `MIA->isCall(Inst) || MIA->isUnconditionalBranch(Inst) || MIA->isConditionalBranch(Inst)`,
which was used to guard `MCInstrAnalysis::evaluateBranch()`.
Reviewed By: jhenderson, skan
Differential Revision: https://reviews.llvm.org/D78776
There are some intrinsics like this that currently block tail
predication but should be fine. This allows fma through, as that is
the one I ran into. There may be others that need the same treatment,
but I've only done this one here.
Differential Revision: https://reviews.llvm.org/D78385
The insert(truncate/extend(extract(vec0,c0)),vec1,c1) case in rGacbc5ede99 wasn't combining the 'mineltsize' with the src vector elt size which may be smaller due to implicit extension during extraction.
Reduced from test case provided by @mstorsjo
hasNoSchedulingInfo should be used for Pseudos and other instructions
that are never expected to be scheduled. This removes the flag from the
new ARM instructions, instead fixing the A57 schedule by marking the
related architecture features as unsupported.
When compiling for an armv5te CPU from clang, the +dsp attribute is set.
This meant we could try to generate qadd8 instructions where we would
end up having no pattern. I've changed the condition here to be hasV6Ops
&& hasDSP, which is what other parts of ARMISelLowering use for
similar instructions.
Fixed PR45677.
Differential Revision: https://reviews.llvm.org/D78877
This is an NFC patch for D77319. The idea is to hide getNegatibleCost inside getNegatedExpression()
by having it return null if the cost is expensive, and to add some helper functions for ease of use.
Also rename the old getNegatedExpression to negateExpression to avoid the semantic conflict.
Reviewed By: RKSimon
Differential revision: https://reviews.llvm.org/D78291
We're currently getting this from the default implementation. But
I don't like how the cost model came to this answer and I might
be making some changes there.
Followup to the PR45604 fix at rGe71dd7c011a3 where we disabled most of these cases.
By creating the shuffle at the byte level we can handle any extension/truncation as long as we track how small the scalar got and assume that the upper bytes will need to be zero.
This is another enhancement to D77895/D78362
to avoid a round-trip from XMM->GPR->XMM.
This time we handle the case of starting/ending with different FP types
but always with signed i32 as the intermediate value.
I think this covers all of the faux vector optimization possibilities
for pre-AVX512.
There is at least 1 other transform mentioned in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c19
...where we fold an 'fpext' into a preceding 'sitofp'. I think we will
want to handle that earlier (DAGCombiner or instcombine) because that's
a target-independent optimization.
Differential Revision: https://reviews.llvm.org/D78758
If one of caller/callee has disabled ZMM registers due to
prefer-vector-width=256, we were previously
disabling argument promotion as the ABI might be incompatible since
one side will split 512-bit vectors in this case.
But if we can see that the types are all scalar this shouldn't be
a problem.
This patch assumes that the pointer element type reflects the type that
the argument will be promoted to.
Differential Revision: https://reviews.llvm.org/D78770
A previous bug fix for varargs introduced a regression where we would
incorrectly widen some stores to memory when passing i8/i16 parameters on the
stack. This seemingly didn't show up because it only happens when there is
no signext/zeroext parameter attribute, which I think clang adds for Darwin.
Swift however seems to be a different story, and a plain anyext on the parameter
triggered the bug.
To fix this, I've added a new ValueHandler::assignValueToAddress type override
which lets us distinguish between varargs and fixed args (we still need this
widening behaviour for varargs to fix the original bug in 2018).
rdar://61353552
Summary:
Add a check to make sure that MachineInstr::mayAlias returns prematurely if at least one of its instruction parameters does not access memory. This prevents calls to TargetInstrInfo::areMemAccessesTriviallyDisjoint with incompatible instructions.
A side effect of this change is to render the mayAlias helper in the AArch64 load/store optimizer obsolete. We can now directly call the MachineInstr::mayAlias member function.
Reviewers: hfinkel, t.p.northover, mcrosier, eli.friedman, efriedma
Reviewed By: efriedma
Subscribers: efriedma, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78823
This is to fix performance regressions introduced by
86c944d790.
The old search would collect all potentially mergeable instructions in
the entire block. In this case, the same address is written in
multiple places in the block on the other side of a fence. When sorted
by offset, the two unmergeable, identical addresses would be next to
each other and the merge would give up.
Break the search space when we encounter an instruction we won't be
able to merge across. This will keep the identical addresses in
different merge attempts.
This may also improve compile time by reducing the merge list size.
When using reversedInstructionsWithoutDebug to construct a range from a
pair of MachineInstrBundleIterators, the range unexpectedly leaves out an
element. This results in mis-optimization as @mstorsjo points out in
https://reviews.llvm.org/D78157.
The problem is that when we convert a MachineInstrBundleIterator to a
reverse iterator, the result gets incremented:
MachineInstrBundleIterator(++I.getReverse())
The comment there explains that the "resulting iterator will dereference
... to the previous node, which is somewhat unexpected; but converting
the two endpoints in a range will give the same range in reverse". This
makes it hard to understand what reversedInstructionsWithoutDebug will
do: I've removed the helper to prevent similar mistakes in the future.
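For contrast, the standard-library convention that makes this subtle, as a self-contained example:
```
#include <cassert>
#include <iterator>
#include <vector>

int main() {
  std::vector<int> v{1, 2, 3};
  auto it = v.begin() + 1;                       // points at 2
  std::reverse_iterator<decltype(it)> rit(it);
  assert(*rit == 1);                             // dereferences the element *before* 'it'
}
```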
Summary:
It is important to emit HINT instructions instead of PAC ones when
PAC is disabled. This allows compatibility with other assemblers
(e.g. GAS). This was implemented in commit da33762de8.
Still, developers of assembly code will want to write code that is
compatible with both pre- and post-PAC CPUs. They could use HINT
mnemonics, but the new mnemonics are a lot more readable (e.g.
paciaz instead of hint #24), and they will result in the same
encodings. So, while LLVM should not *emit* the new mnemonics when
PAC is disabled, this patch will at least make LLVM *accept*
assembly code that uses them.
Reviewers: danielkiss, chill, olista01, LukeCheeseman, simon_tatham
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78372
Follow-up of D78082 (x86-64).
This change avoids dynamic relocations in `xray_instr_map` for ARM/AArch64/powerpc64le.
MIPS64 cannot use 64-bit PC-relative addresses because R_MIPS_PC64 is not defined.
Because MIPS32 shares the same code, for simplicity, we don't use PC-relative addresses for MIPS32 as well.
Tested on AArch64 Linux and ppc64le Linux.
Reviewed By: ianlevesque
Differential Revision: https://reviews.llvm.org/D78590
This patch upstreams support for the Armv8.6-a Matrix Multiplication
Extension. A summary of the features can be found here:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
This patch includes:
- Assembly support for AArch32 and Assembly Parsing
D77872 has already added the MC representations of the instructions so that
they can be used in code gen; this patch fills in the details needed to
make assembly parsing work, and adds tests for asm and disasm
This is part of a patch series, starting with BFloat16 support and
the other components in the armv8.6a extension (in previous patches
linked in phabricator)
Based on work by:
- Luke Geeson
- Oliver Stannard
- Luke Cheeseman
Reviewers: t.p.northover, simon_tatham
Reviewed By: simon_tatham
Subscribers: simon_tatham, ostannard, kristof.beyls, hiraditya,
danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77874
This patch upstreams support for the Armv8.6-a Matrix Multiplication
Extension. A summary of the features can be found here:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
This patch includes:
- Assembly support for AArch64 Scalable Vector Instructions (in line
with the Scalable Vector Extension - SVE)
This is part of a patch series, starting with BFloat16 support and
the other components in the armv8.6a extension (in previous patches
linked in phabricator)
Based on work by:
- Luke Geeson
- Oliver Stannard
- Luke Cheeseman
Reviewers: t.p.northover, rengolin, c-rhodes
Reviewed By: c-rhodes
Subscribers: c-rhodes, ostannard, tschuett, kristof.beyls, hiraditya,
danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77873
This patch upstreams support for the Armv8.6-a Matrix Multiplication
Extension. A summary of the features can be found here:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
This patch includes:
- Assembly support for AArch32
- Intrinsics Support for AArch32 Neon Intrinsics for Matrix
Multiplication
Note: these extensions are optional in the 8.6a architecture and so have
to be enabled explicitly.
No additional IR types or C Types are needed for this extension.
This is part of a patch series, starting with BFloat16 support and
the other components in the armv8.6a extension (in previous patches
linked in phabricator)
Based on work by:
- Luke Geeson
- Oliver Stannard
- Luke Cheeseman
Reviewers: t.p.northover, miyuki
Reviewed By: miyuki
Subscribers: miyuki, ostannard, kristof.beyls, hiraditya, danielkiss,
cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77872
This patch upstreams support for the Armv8.6-a Matrix Multiplication
Extension. A summary of the features can be found here:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
This patch includes:
- Assembly support for AArch64 only (no SVE or Neon)
- Intrinsics Support for AArch64 Armv8.6a Matrix Multiplication Instructions (No bfloat16 matrix multiplication)
No IR types or C Types are needed for this extension.
This is part of a patch series, starting with BFloat16 support and
the other components in the armv8.6a extension (in previous patches
linked in phabricator)
Based on work by:
- Luke Geeson
- Oliver Stannard
- Luke Cheeseman
Reviewers: ostannard, t.p.northover, rengolin, kmclaughlin
Reviewed By: kmclaughlin
Subscribers: kmclaughlin, kristof.beyls, hiraditya, danielkiss,
cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D77871
Summary:
Frontend guarantees that coherent accesses have
corresponding cache policy bits set (glc, dlc).
Therefore there is no need for extra instructions
that invalidate the cache.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78800
Summary:
This patch maps IR operations for sdiv & udiv to the
@llvm.aarch64.sve.[s|u]div intrinsics.
A ptrue must be created during lowering as the div instructions
have only a predicated form.
Patch contains changes by Andrzej Warzynski.
Reviewers: sdesmalen, c-rhodes, efriedma, cameron.mcinally, rengolin
Reviewed By: efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, andwar, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78569
The MCELFObjectWriter::setRType## methods are always used together to
build a complete MIPS N64 ABI "chain" of relocations. Using a single
function for this task makes the code less verbose.
Summary:
Changing all mnemonics to match assembly instructions to simplify mnemonic
naming rules. This time, update all floating-point arithmetic instructions.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D78768
This was backwards from what was intended, and missing a test. We perhaps
should just ignore the FP mode here, since it shouldn't be legal to mix code
with different default modes in the absence of strictfp.
If f32 denormals were enabled pre-gfx9, we would still try to
implement this with v_max_f32. Pre-gfx9, these instructions ignored
the denormal mode and did not flush. Switch to the multiply form for
f32 as a workaround which should always work in any case.
This fixes conformance failures when the library implementation of
fmin/fmax was accidentally not inlined, forcing the assumption of no
flushing on targets where denormals are not enabled by default. This
is a workaround, since really we should not be mixing code with
different FP mode expectations, but prefer the lowering that will work
in any mode.
Now this will always use max to implement canonicalize on gfx9+. This
is only really beneficial for f64. For f32/f16 it's a neutral choice
(and worse in terms of code size in 1 case), but possibly worse for
the compiler since it does add an extra register use operand. Leave
this change for later.
1. Use Subtarget.isUsingPCRelativeCalls() in LowerConstantPool to
check if using PCRelative addressing.
2. Change MO_GOT_FLAG = 32 to MO_GOT_FLAG = 8 in PPC.h to use
consecutive bits.
Differential Revision: https://reviews.llvm.org/D78406
This avoids more long lists of register classes that have to be updated
every time we add a new one. NFC.
Differential Revision: https://reviews.llvm.org/D78570
12994a70cf did this for 128-bit classes:
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
This patch extends it to all classes > 64 bits, for consistency.
Differential Revision: https://reviews.llvm.org/D78622
This patch changes the FP conversion intrinsics to take a predicate
that matches the number of lanes for the vector with the widest element
type as opposed to using <vscale x 16 x i1>.
For example:
```<vscale x 4 x float> @llvm.aarch64.sve.fcvt.f32f16(<vscale x 4 x float>, <vscale x 4 x i1>, <vscale x 8 x half>)```
now uses <vscale x 4 x i1> instead of <vscale x 16 x i1>
And similar for:
```<vscale x 4 x float> @llvm.aarch64.sve.fcvt.f32f64(<vscale x 4 x float>, <vscale x 2 x i1>, <vscale x 2 x double>)```
where the predicate now matches the wider type, so <vscale x 2 x i1>.
Reviewers: efriedma, SjoerdMeijer, paulwalker-arm, rengolin
Reviewed By: efriedma
Tags: #clang
Differential Revision: https://reviews.llvm.org/D78402
Do not count the presence of debug insts against the limit set by
LdStLimit, and allow the optimizer to find matching insts by skipping
over debug insts.
Differential Revision: https://reviews.llvm.org/D78411
Summary:
The findSuitableCompare method can fail if debug instructions are
present in the MBB -- fix this by using helpers to skip over debug
insts.
Reviewers: aemerson, paquette
Subscribers: kristof.beyls, hiraditya, danielkiss, aprantl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78265
Summary:
This fixes several instances in which condbr optimization was missed
due to a debug instruction appearing as a bogus NZCV clobber.
Reviewers: aemerson, paquette
Subscribers: kristof.beyls, hiraditya, jfb, danielkiss, aprantl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78264
Summary:
Use the variants of these APIs which skip over debug instructions. This
is mostly a cleanup, but it does fix a debug-variance issue which causes
addsub-shifted.ll and addsub_ext.ll to fail when debug info is inserted
by -mir-debugify.
Reviewers: aemerson, paquette
Subscribers: kristof.beyls, hiraditya, jfb, danielkiss, llvm-commits, aprantl
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78262
Summary:
Fix an issue where the presence of debug info could disable the ccmp
optimization due to findConvertibleCompare failing too early (the error
is "Can't create ccmp with multiple uses", where the "use" is a
DBG_VALUE inst).
Depends on D78151.
Reviewers: t.p.northover, paquette, aemerson
Subscribers: kristof.beyls, hiraditya, danielkiss, aprantl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78156
Summary:
Fix an issue where the presence of debug info could disable a peephole
optimization due to areCFlagsAccessedBetweenInstrs returning the wrong
result.
In test/CodeGen/AArch64/arm64-csel.ll, the issue was found in the
function @foo5, in which the first compare could successfully be
optimized but not the second.
Reviewers: t.p.northover, eastig, paquette
Subscribers: kristof.beyls, hiraditya, danielkiss, aprantl, dsanders, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78157
Summary:
Fix an issue where the presence of debug info could disable a peephole
optimization in optimizeCompareInstr due to canInstrSubstituteCmpInstr
returning the wrong result.
Depends on D78137.
Reviewers: t.p.northover, eastig, paquette
Subscribers: kristof.beyls, hiraditya, danielkiss, aprantl, llvm-commits, dsanders
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78151
Summary:
These helpers are exercised by follow-up commits in this patch series,
which is all about removing CodeGen differences with vs. without debug
info in the AArch64 backend.
Reviewers: fhahn, aprantl, jpaquette, paquette
Subscribers: kristof.beyls, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78260
Preserving liveness can be useful even late in the pipeline, if we're
doing substantial optimization work afterwards. (See, for example,
D76065.) Teach MachineOutliner how to correctly set live-ins on the
basic block in outlined functions.
Differential Revision: https://reviews.llvm.org/D78605
Currently an indirect call produces the following sequence in PCRelative mode:
extern void function( );
extern void (*ptrfunc) ( );
void g() {
ptrfunc=function;
}
void f() {
(*ptrfunc) ( );
}
Producing
paddi 3, 0, .LC0@PCREL, 1
ld 3, 0(3)
std 2, 24(1)
ld 12, 0(3)
mtctr 12
bctrl
ld 2, 24(1)
Though the caller does not use or preserve r2, it is still saved and restored
across a function call. This patch removes these redundant saves and
restores for indirect calls.
Differential Revision: https://reviews.llvm.org/D77749
llvm/lib/Target/Hexagon/HexagonTargetObjectFile.cpp:296:11: warning: enumeration value 'ScalableVectorTyID' not handled in switch [-Wswitch]
switch (Ty->getTypeID()) {
^
Summary:
This commit recommits the reversion of https://reviews.llvm.org/D75039.
Consensus appears to be in favour of assembly-time resolution of
these ADR and LDR relocations, in line with GNU. The previous
backout broke many lld tests, now fixed by Peter Smith in
61bccda9d9.
Reviewers: psmith
Subscribers: kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78301
Add initial support for PC Relative addressing to get jump table base
address instead of using TOC.
Differential Revision: https://reviews.llvm.org/D75931
If a 16-bit thumb STM with writeback stores the base register but it isn't the
first register in the list, then an unknown value is stored. The load/store
optimizer knows this and generates a 32-bit STM without writeback instead, but
thumb2 size reduction converts it into a 16-bit STM. Fix this by having thumb2
size reduction notice such STMs and leave them as they are.
Differential Revision: https://reviews.llvm.org/D78493
This adds some extra processing into the Pre-RA ARM load/store optimizer
to detect and merge MVE loads/stores and adds of the same base. We
don't always turn these into a post-inc during ISel, and because the
DAG is a graph we don't always have an order to use for the nodes, so
we don't know which nodes to make post-inc and which should use the new
post-inc value. After ISel, we have an order that we can use to post-inc
the following instructions.
So this looks for a load/store with a starting offset of 0, and an
add/sub from the same base, plus a number of other loads/stores. We then
do some checks and convert the zero-offset load/store into a postinc
variant. Any loads/stores after it have the offset subtracted from their
immediates. For example:
LDR #4 LDR #4
LDR #0 LDR_POSTINC #16
LDR #8 LDR #-8
LDR #12 LDR #-4
ADD #16
It only handles MVE loads/stores at the moment. Normal loads/stores will
be added in a followup patch; they just have some extra details to
ensure that we keep generating LDRD/LDM successfully.
Differential Revision: https://reviews.llvm.org/D77813
Add 96-bit, 160-bit and 256-bit AReg classes to match VReg and SReg.
NFC as far as I know, but it may avoid weird legalization problems.
Differential Revision: https://reviews.llvm.org/D78348
Finding the loop tripcount is the first crucial step in preparing a loop for
tail-predication, and this adds a debug message if a tripcount cannot be found.
And while I was at it, I added some more comments here and there.
Differential Revision: https://reviews.llvm.org/D78485
Summary:
Changing all mnemonics to match assembly instructions to simplify mnemonic
naming rules. This time, update all shift operation instructions. This also
corrects the instructions' operation kinds.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D78468
Summary:
VE uses the identical names "%s0-63" for all generic registers. Change to use
the alternative-name mechanism among all generic registers instead of
hard-coding them.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D78174
This patch exploits the rldimi instruction for patterns like
`or %a, 0b000011110000`, which saves instructions when the
operand has only one use, compared with `li-ori-sldi-or`.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D77850