We want to have more load/store clustering but we also want
to maintain low register pressure which are oposit targets.
Allow scheduler to reschedule regions without mutations
applied if we hit a register limit.
Differential Revision: https://reviews.llvm.org/D73386
Use intermediate instructions, unlike with buffer stores. This is
necessary because of the need to have an internal way to distinguish
between signed and unsigned extloads. This introduces some duplication
and near duplication with the buffer store selection path. The store
handling should maybe be moved into legalization to match and
eliminate the duplication.
Try to keep simple v2s16 cases as-is. This will more naturally map to
how the VOP3P op_sel modifiers work compared to the expansion
involving bitcasts and bitshifts.
This could maybe try harder with wider source vector types, although
that could be handled with a pre-legalize combine.
This reverts commit fe23ed2c68.
It was never really clear this was responsible for the performance
regressions that caused this to be reverted. It's been a long time,
and we need to have scalar patterns for this to get GlobalISel
working.
For some reason the flat/global atomics end up in the generated
matcher table in a different order from SelectionDAG. Use
AddedComplexity to prefer checking for global atomics first.
Summary:
This is in preparation for getMemOperandsWithOffset returning more base
operands.
Depends on D73455.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73456
Summary:
This is in preparation for getMemOperandsWithOffset returning more base
operands.
Depends on D73454.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73455
The MVE_VLDRWU32_qi_pre gather loads, like the other _pre/_post mve
loads returns the writeback as result 0, the value as result 1. The llvm
ir intrinsic seems to have this the other way around though, and so when
lowering from one to the other we need to switch the first two outputs.
I've also fixed up the types of _pre/_post on normal MVE loads. There we
were already getting the values the right way around, just not for the
types. I don't believe this was causing anything to go wrong, but it was
very confusing to read in the debug output.
Differential Revision: https://reviews.llvm.org/D73370
We had support for runtime trip count values, but not constants, and this adds
supports for that.
And added a minor optimisation while I was add it: don't invoke Cleanup when
there's nothing to clean up.
Differential Revision: https://reviews.llvm.org/D73198
When expanding the LoopStart, we try to remove the iteration count
calculation. However, if part of the calculation was also used to
calculate the number of elements we could end up deleting
instructions that were required to feed DLSTP/WLSTP.
Differential Revision: https://reviews.llvm.org/D73275
G_CTPOP is generated from llvm.ctpop.<type> intrinsics, clang generates
these intrinsics from __builtin_popcount and __builtin_popcountll.
Add lower and narrow scalar for G_CTPOP.
Lower G_CTPOP for MIPS32.
Differential Revision: https://reviews.llvm.org/D73216
llvm.cttz.<type> intrinsic has additional i1 argument is_zero_undef,
it tells whether zero as the first argument produces a defined result.
G_CTTZ is generated from llvm.cttz.<type> (<type> <src>, i1 false)
intrinsics, clang generates these intrinsics from __builtin_ctz and
__builtin_ctzll.
G_CTTZ_ZERO_UNDEF comes from llvm.cttz.<type> (<type> <src>, i1 true).
Clang generates such intrinsics as parts of expansion of builtin_ffs
and builtin_ffsll. It is also traditionally part of and many
algorithms that are now predicated on avoiding zero-value inputs.
Add narrow scalar (algorithm uses G_CTTZ_ZERO_UNDEF) for G_CTTZ.
Lower G_CTTZ and G_CTTZ_ZERO_UNDEF for MIPS32.
Differential Revision: https://reviews.llvm.org/D73215
llvm.ctlz.<type> intrinsic has additional i1 argument is_zero_undef,
it tells whether zero as the first argument produces a defined result.
MIPS clz instruction returns 32 for zero input.
G_CTLZ is generated from llvm.ctlz.<type> (<type> <src>, i1 false)
intrinsics, clang generates these intrinsics from __builtin_clz and
__builtin_clzll.
G_CTLZ_ZERO_UNDEF can also be generated from llvm.ctlz with true as
second argument. It is also traditionally part of and many algorithms
that are now predicated on avoiding zero-value inputs.
Add narrow scalar for G_CTLZ (algorithm uses G_CTLZ_ZERO_UNDEF).
Lower G_CTLZ_ZERO_UNDEF and select G_CTLZ for MIPS32.
Differential Revision: https://reviews.llvm.org/D73214
Based on exhaustive llvm-exegesis measurements.
There may still be some imperfections for LEA16r/LEA32r.
Much like was observed in D68646, i'm also measuring some outliers
with some specific registers.
Pull out combineTargetShuffle code added in rG3fd5d1c6e7db into a helper function and extend it to handle shufps(shufps(load(),x),y) and shufps(y,shufps(load(),x)) cases as well.
Every case in the switch had a string version of themselves. Two
of them had a typo that used : instead of ::
By using a macro we can automate the string creation and avoid
the possibility of typos like this.
This is similar to what is done on the AMDGPU target.
Summary:
This improves merging of sequences like:
store a, ptr + 4
store b, ptr + 8
store c, ptr + 12
store d, ptr + 16
store e, ptr + 20
store f, ptr
Prior to this patch the basic block was scanned in order to find instructions
to merge and the above sequence would be transformed to:
store4 <a, b, c, d>, ptr + 4
store e, ptr + 20
store r, ptr
With this change, we now sort all the candidate merge instructions by their offset,
so instructions are visited in offset order rather than in the order they appear
in the basic block. We now transform this sequnce into:
store4 <f, a, b, c>, ptr
store2 <d, e>, ptr + 16
Another benefit of this change is that since we have sorted the mergeable lists
by offset, we can easily check if an instruction is mergeable by checking the
offset of the instruction that becomes before or after it in the sorted list.
Once we determine an instruction is not mergeable we can remove it from the list
and avoid having to do the more expensive mergeablilty checks.
Reviewers: arsenm, pendingchaos, rampitec, nhaehnle, vpykhtin
Reviewed By: arsenm, nhaehnle
Subscribers: kerbowa, merge_guards_bot, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65966
Updated FoldConstantArithmetic method signature to match that of
FoldConstantVectorArithmetic in preparation for merging the two
functions together
https://bugs.llvm.org/show_bug.cgi?id=36544
This is the first step in combining the various
FoldConstantVectorArithmetic and FoldConstantVectorArithmetic
functions into one FoldConstantArithmetic function.
Differential Revision: https://reviews.llvm.org/D72870
I believe for STRICT_FP I need to use a STRICT_FP_EXTEND for the extending to f80 for returning f32/f64 in 32-bit mode when SSE is enabled. The STRICT_FP_EXTEND node requires a Chain. I need to get that node onto the chain before any CopyToRegs are emitted. This is because all the CopyToRegs are glued and chained together. So I can't put a STRICT_FP_EXTEND on the chain between the glued nodes without also glueing the STRICT_ FP_EXTEND.
This patch moves all the extend creation to a first pass and then creates the copytoregs and fills out RetOps in a second pass.
Differential Revision: https://reviews.llvm.org/D72665
Long `cl::value_desc()` is added right after the flag name,
before `cl::desc()` column. And thus the `cl::desc()` column,
for all flags, is padded to the right,
which makes the output unreadable.
Summary:
This adds the reference types target feature. This does not enable any
more functionality in LLVM/clang for now, but this is necessary to embed
the info in the target features section, which is used by Binaryen and
Emscripten. It turned out that after D69832 `-fwasm-exceptions` crashed
because we didn't have the reference types target feature.
Reviewers: tlively
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73320
Scheduler sends NumLoads argument into shouldClusterMemOps()
one less the actual cluster length. So for 2 instructions
it will pass just 1. Correct this number.
This is NFC for in tree targets.
Differential Revision: https://reviews.llvm.org/D73292
We are relying on atrificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time since it
allows to schedule a completely independent instruction in the
middle of the cluster.
Removed the DAG mutation and added pass to bundle already
clustered instructions. These bundles are unpacked before the
memory legalizer because it does not work with bundles but also
because it allows to insert waitcounts in the middle of a store
cluster.
Removing artificial edges also allows a more relaxed scheduling.
Differential Revision: https://reviews.llvm.org/D72737
Currently BE allows only a little load narrowing because
of the fear it will produce sub-dword ext loads. However,
we can always allow narrowing if we are shrinking one
multi-dword load to another multi-dword load.
In particular we were unable to reduce s_load_dwordx8 into
s_load_dwordx4 if identity shuffle was used to extract
low 4 dwords.
Differential Revision: https://reviews.llvm.org/D73133
Summary:
Enable the new diveregence analysis by default for AMDGPU.
Resubmit with test updates since GPUDA was causing failures on Windows.
Reviewers: rampitec, nhaehnle, arsenm, thakis
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73315
This adds basic support for the Swift calling convention with WebAssembly
targets.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D71823
The codegen for splitting a llvm.vector.reduction intrinsic into parts
will be better than the codegen for the generic reductions. This will
only directly effect when vectorization factors are specified by the
user.
Also added tests to make sure the codegen for larger reductions is OK.
Differential Revision: https://reviews.llvm.org/D72257