Summary:
This improves merging of sequences like:
```
store a, ptr + 4
store b, ptr + 8
store c, ptr + 12
store d, ptr + 16
store e, ptr + 20
store f, ptr
```
Prior to this patch, the basic block was scanned in order to find instructions
to merge, and the above sequence would be transformed to:
```
store4 <a, b, c, d>, ptr + 4
store e, ptr + 20
store f, ptr
```
With this change, we now sort all the candidate merge instructions by their
offset, so instructions are visited in offset order rather than in the order
they appear in the basic block. We now transform this sequence into:
```
store4 <f, a, b, c>, ptr
store2 <d, e>, ptr + 16
```
Another benefit of this change is that, since we have sorted the mergeable lists
by offset, we can easily check if an instruction is mergeable by checking the
offset of the instruction that comes before or after it in the sorted list.
Once we determine an instruction is not mergeable, we can remove it from the
list and avoid the more expensive mergeability checks.
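As a minimal standalone sketch of the idea (hypothetical names and data, not
the actual SILoadStoreOptimizer code), sorting by offset turns the mergeability
test into a cheap neighbor check:
```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical candidate record; the real pass tracks MachineInstrs.
struct Candidate {
  int Offset; // byte offset from the common base pointer
  int Width;  // access width in bytes
};

int main() {
  std::vector<Candidate> List = {{4, 4},  {8, 4},  {12, 4},
                                 {16, 4}, {20, 4}, {0, 4}};

  // Visit candidates in offset order, not basic-block order.
  std::sort(List.begin(), List.end(),
            [](const Candidate &A, const Candidate &B) {
              return A.Offset < B.Offset;
            });

  // After sorting, a candidate can only merge with a neighbor whose
  // offset is contiguous with it, so non-mergeable instructions can be
  // dropped without running the expensive pairwise checks.
  for (size_t I = 0; I < List.size(); ++I) {
    bool PrevAdjacent =
        I > 0 && List[I - 1].Offset + List[I - 1].Width == List[I].Offset;
    bool NextAdjacent = I + 1 < List.size() &&
                        List[I].Offset + List[I].Width == List[I + 1].Offset;
    if (!PrevAdjacent && !NextAdjacent)
      printf("offset %d is not mergeable; drop it\n", List[I].Offset);
  }
  return 0;
}
```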
Reviewers: arsenm, pendingchaos, rampitec, nhaehnle, vpykhtin
Reviewed By: arsenm, nhaehnle
Subscribers: kerbowa, merge_guards_bot, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65966
Summary:
This adds the reference types target feature. This does not enable any
more functionality in LLVM/clang for now, but this is necessary to embed
the info in the target features section, which is used by Binaryen and
Emscripten. It turned out that after D69832 `-fwasm-exceptions` crashed
because we didn't have the reference types target feature.
Reviewers: tlively
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73320
Unlike the existing code that I modified here, I only handle the
case where the strict_fsetcc has a single use. I'm not sure exactly
how to handle multiple uses.

Testing this on X86 is hard because we already have other
combines that get rid of the lowered version of the integer setcc that
this xor will eventually become. So this combine really just
saves a bunch of extra nodes from being created. I'm not sure about
other targets.
Differential Revision: https://reviews.llvm.org/D71816
This previously only handled EXTRACT_SUBREGs from leaves, such as
operands directly in the original output. Handle extracting from a
result instruction.
We were relying on artificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time, since it
allows a completely independent instruction to be scheduled into
the middle of the cluster.
Removed the DAG mutation and added a pass to bundle the already
clustered instructions. These bundles are unpacked before the
memory legalizer, both because it does not work with bundles and
because this allows waitcounts to be inserted in the middle of a
store cluster.

Removing the artificial edges also allows more relaxed scheduling.
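A rough sketch of the bundling step, assuming for illustration that a cluster
is simply a run of adjacent memory instructions (the actual pass is driven by
the scheduler's cluster information, and the names below are hypothetical):
```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstrBundle.h"

using namespace llvm;

// Glue each run of consecutive memory instructions into a bundle so the
// post-RA scheduler cannot place an unrelated instruction inside it.
// The bundles must be unpacked again before the memory legalizer runs.
static void bundleClusteredMemOps(MachineBasicBlock &MBB) {
  auto I = MBB.instr_begin(), E = MBB.instr_end();
  while (I != E) {
    if (!I->mayLoadOrStore()) {
      ++I;
      continue;
    }
    auto First = I;
    unsigned RunLength = 0;
    while (I != E && I->mayLoadOrStore()) {
      ++I;
      ++RunLength;
    }
    if (RunLength > 1)
      finalizeBundle(MBB, First, I); // [First, I) becomes one bundle
  }
}
```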
Differential Revision: https://reviews.llvm.org/D72737
Currently the backend allows only a little load narrowing, for fear
that it will produce sub-dword ext loads. However, we can always
allow narrowing if we are shrinking one multi-dword load into
another multi-dword load.
In particular, we were unable to reduce s_load_dwordx8 into
s_load_dwordx4 if an identity shuffle was used to extract the
low 4 dwords.
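A self-contained sketch of the relaxed check (the helper name is hypothetical
and the in-tree hook differs): shrinking a multi-dword load into another
multi-dword load can never introduce a sub-dword extending load, so it is
always allowed.
```cpp
#include <cassert>

// Hypothetical predicate: permit narrowing whenever both the old and
// new load widths are multi-dword, i.e. wider than one dword (32 bits)
// and a whole multiple of a dword.
static bool isMultiDwordNarrowingLegal(unsigned OldBits, unsigned NewBits) {
  auto IsMultiDword = [](unsigned Bits) {
    return Bits > 32 && Bits % 32 == 0;
  };
  return NewBits < OldBits && IsMultiDword(OldBits) && IsMultiDword(NewBits);
}

int main() {
  // s_load_dwordx8 (256 bits) -> s_load_dwordx4 (128 bits): allowed.
  assert(isMultiDwordNarrowingLegal(256, 128));
  // 64 bits -> 16 bits would need a sub-dword ext load: not allowed.
  assert(!isMultiDwordNarrowingLegal(64, 16));
  return 0;
}
```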
Differential Revision: https://reviews.llvm.org/D73133
The codegen for splitting an llvm.vector.reduction intrinsic into parts
will be better than the codegen for the generic reductions. This will
only have a direct effect when vectorization factors are specified by
the user.
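For intuition, here is an illustrative (not target-specific) example of
splitting an 8-lane add reduction into halves: the halves are added lane-wise
and a single narrower reduction finishes the job.
```cpp
#include <array>
#include <cstdio>

// Reduce 4 lanes with plain adds.
static int reduceAdd4(const std::array<int, 4> &V) {
  return V[0] + V[1] + V[2] + V[3];
}

// Split an 8-lane add reduction: add the low and high halves lane-wise,
// then reduce the narrower 4-lane vector.
static int reduceAdd8Split(const std::array<int, 8> &V) {
  std::array<int, 4> Sum;
  for (int I = 0; I < 4; ++I)
    Sum[I] = V[I] + V[I + 4];
  return reduceAdd4(Sum);
}

int main() {
  std::array<int, 8> V = {1, 2, 3, 4, 5, 6, 7, 8};
  printf("%d\n", reduceAdd8Split(V)); // prints 36
  return 0;
}
```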
Also added tests to make sure the codegen for larger reductions is OK.
Differential Revision: https://reviews.llvm.org/D72257
The Future CPU will include support for prefixed instructions.
These prefixed instructions are formed by a 4-byte prefix
immediately followed by a 4-byte instruction, effectively
making an 8-byte instruction. The new instruction paddi
is a prefixed form of addi.
This patch adds paddi and all of the support required
for that instruction. The majority of the patch deals with
supporting the new prefixed instructions. The addition of
paddi is mainly to allow for testing.
Differential Revision: https://reviews.llvm.org/D72569
As mentioned on D73023, lowerShuffleWithSHUFPS should be able to commute the shufps inputs in order to fold the second arg, as it will then permute the shufps result anyway.
Summary:
D72308 incorrectly assumed `resume` cannot exist without a `landingpad`,
which is not true. This sets `Changed` to true whenever we make changes
to a function, including creating a call to `__resumeException` within a
function without a landing pad.
Reviewers: tlively
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73308
Similar to the function attribute `prefix` (prefix data),
"patchable-function-prefix" inserts data (M NOPs) before the function
entry label.
-fpatchable-function-entry=2,1 (1 NOP before entry, 1 NOP after entry)
will look like:
```
.type foo,@function
.Ltmp0: # @foo
nop
foo:
.Lfunc_begin0:
# optional `bti c` (AArch64 Branch Target Identification) or
# `endbr64` (Intel Indirect Branch Tracking)
nop
.section __patchable_function_entries,"awo",@progbits,get,unique,0
.p2align 3
.quad .Ltmp0
```
-fpatchable-function-entry=N,0 + -mbranch-protection=bti/-fcf-protection=branch has two reasonable
placements (https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01185.html):
```
(a) (b)
func: func:
.Ltmp0: bti c
bti c .Ltmp0:
nop nop
```
(a) needs no additional code. If the consensus is to go for (b), we will
need more code in AArch64BranchTargets.cpp / X86IndirectBranchTracking.cpp.
Differential Revision: https://reviews.llvm.org/D73070
Summary:
RCP has an accuracy limit. If the FDIV's !fpmath requires high accuracy, rcp may
not meet the requirement. However, in DAG lowering the fpmath information gets
lost, and thus we may generate either inaccurate rcp-related computation or slow
code for fdiv.

This patch implements the fdiv optimizations in AMDGPUCodeGenPrepare, which
still has exact knowledge of !fpmath.
FastUnsafeRcpLegal: we determine whether it is legal to use rcp based on
unsafe-fp-math, fast math flags, denormals and the !fpmath accuracy request.

RCP optimizations:
  1/x -> rcp(x) when fast unsafe rcp is legal or fpmath >= 2.5ULP with
  denormals flushed.
  a/b -> a*rcp(b) when fast unsafe rcp is legal.

Use fdiv.fast:
  a/b -> fdiv.fast(a, b) when the RCP optimization is not performed and
  fpmath >= 2.5ULP with denormals flushed.
  1/x -> fdiv.fast(1, x) when the RCP optimization is not performed and
  fpmath >= 2.5ULP with denormals enabled.
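As a rough standalone sketch of the decision logic above (the function name and
boolean inputs are illustrative, not the actual AMDGPUCodeGenPrepare interface):
```cpp
#include <cstdio>

enum class FDivLowering { Rcp, MulByRcp, FDivFast, KeepFDiv };

// Illustrative decision function following the rules in the summary.
static FDivLowering pickFDivLowering(bool FastUnsafeRcpLegal, float FpMathULP,
                                     bool DenormsFlushed, bool NumeratorIsOne) {
  bool RelaxedAccuracy = FpMathULP >= 2.5f;

  // 1/x -> rcp(x) when fast unsafe rcp is legal, or when 2.5ULP accuracy
  // is acceptable and denormals are flushed.
  if (NumeratorIsOne &&
      (FastUnsafeRcpLegal || (RelaxedAccuracy && DenormsFlushed)))
    return FDivLowering::Rcp;

  // a/b -> a*rcp(b) when fast unsafe rcp is legal.
  if (FastUnsafeRcpLegal)
    return FDivLowering::MulByRcp;

  // Otherwise use fdiv.fast when 2.5ULP accuracy is acceptable: a/b
  // needs denormals flushed; 1/x is allowed even with denormals enabled.
  if (RelaxedAccuracy && (DenormsFlushed || NumeratorIsOne))
    return FDivLowering::FDivFast;

  return FDivLowering::KeepFDiv; // emit the full-precision expansion
}

int main() {
  // 1.0/x with !fpmath 2.5 and denormals flushed -> plain rcp.
  printf("%d\n", (int)pickFDivLowering(false, 2.5f, true, true)); // 0
  return 0;
}
```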
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D71293
The other 3-op patterns should in theory also be handled, but
currently there's a bug in the inferred pattern complexity.
I'm not sure what the error handling strategy should be for potential
constant bus violations. I think the correct strategy is to never
produce mixed SGPR and VGPR operands in a typical VOP instruction,
which will trivially avoid them. However, it's still possible to have
hand-written MIR (or erroneously transformed code) with these
operands. When these fold, the restriction will be violated. We
currently don't have any verifiers for reg bank legality. For now,
just ignore the restriction.
It might be worth triggering a DAG fallback on verifier error.
Summary:
Immediate vmvnq is code-generated as a simple vector constant in IR,
and left to the backend to recognize that it can be created with an
MVE VMVN instruction. The predicated version is represented as a
select between the input and the same constant, and I've added a
Tablegen isel rule to turn that into a predicated VMVN. (That should
be better than the previous VMVN + VPSEL: it's the same number of
instructions but now it can fold into an adjacent VPT block.)
The unpredicated forms of VBIC and VORR are done by enabling the same
isel lowering as for NEON, recognizing appropriate immediates and
rewriting them as ARMISD::VBICIMM / ARMISD::VORRIMM SDNodes, which I
then instruction-select into the right MVE instructions (now that I've
also reworked those instructions to use the same MC operand encoding).
In order to do that, I had to promote the Tablegen SDNode instance
`NEONvorrImm` to a general `ARMvorrImm` available in MVE as well, and
similarly for `NEONvbicImm`.
The predicated forms of VBIC and VORR are represented as a vector
select between the original input vector and the output of the
unpredicated operation. The main convenience of this is that it still
lets me use the existing isel lowering for VBICIMM/VORRIMM, and not
have to write another copy of the operand encoding translation code.
This intrinsic family is the first to use the `imm_simd` system I put
into the MveEmitter tablegen backend. So, naturally, it showed up a
bug or two (emitting bogus range checks and the like). Fixed those,
and added a full set of tests for the permissible immediates in the
existing Sema test.
Also adjusted the isel pattern for `vmovlb.u8`, which stopped matching
because lowering started turning its input into a VBICIMM. Now it
recognizes the VBICIMM instead.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D72934
vperm (ins ?, X, C), (ins ?, Y, C), 0x31 --> concat X, Y
This is another shuffle problem seen with PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
We have this small crack in legalization/lowering/combining/demanded
that allows forming a vperm2f128 of high halves with AVX1 when we
could do better by peeking through the insert_subvector nodes.
AFAICT, it requires IR as shown in the diffs - much larger than legal
vectors - to avoid all of the usual folds.
Another option would be to prevent forming the 256-bit vperm in lowering.
Differential Revision: https://reviews.llvm.org/D73197