This reverts commit e34801c8e6 and the followup due to multiple
problems.
I've tried to keep the tests and RDA parts where possible, as those
still seem useful.
This intention is to move patchable-function before aarch64-branch-targets
(configured in AArch64PassConfig::addPreEmitPass) so that we emit BTI before NOPs
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424).
This also allows addPreEmitPass() passes to know the precise instruction sizes if they want.
Tried x86-64 Debug/Release builds of ccls with -fxray-instrument -fxray-instruction-threshold=1.
No output difference with this commit and the previous commit.
Adds a pass to the ARM backend that takes a v4i32
gather and transforms it into a call to MVE's
masked gather intrinsics.
Differential Revision: https://reviews.llvm.org/D71743
This is a recommit of D71330, but with a few things fixed and changed:
1) ReachingDefAnalysis: this was not running with optnone as it was checking
skipFunction(), which other analysis passes don't do. I guess this is a
copy-paste from a codegen pass.
2) VPTBlockPass: here I've added skipFunction(), because like most/all
optimisations, we don't want to run this with optnone.
This fixes the issues with the initial/previous commit: the VPTBlockPass was
running with optnone, but ReachingDefAnalysis wasn't, and so VPTBlockPass was
crashing querying ReachingDefAnalysis.
I've added test case mve-vpt-block-optnone.mir to check that we don't run
VPTBlock with optnone.
Differential Revision: https://reviews.llvm.org/D71470
This reverts commit 9468e3334b.
There's a test that doesn't like this change. The RDA analysis
gets invalided by changes in the block, which is not taken into
account. Revert while I work on a fix for this.
This adds ReachingDefAnalysis (RDA) to the VPTBlock pass, so that we can
reimplement findVCMPToFoldIntoVPS with just a few calls to RDA.
Differential Revision: https://reviews.llvm.org/D71330
Summary:
Split off of D67120.
Add the profile guided size optimization instrumentation / queries in the code
gen or target passes. This doesn't enable the size optimizations in those passes
yet as they are currently disabled in shouldOptimizeForSize (for non-IR pass
queries).
A second try after reverted D71072.
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71149
Summary:
Split off of D67120.
Add the profile guided size optimization instrumentation / queries in the code
gen or target passes. This doesn't enable the size optimizations in those passes
yet as they are currently disabled in shouldOptimizeForSize (for non-IR pass
queries).
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71072
Convert ARMCodeGenPrepare into a generic type promotion pass by:
- Removing the insertion of arm specific intrinsics to handle narrow
types as we weren't using this.
- Removing ARMSubtarget references.
- Now query a generic TLI object to know which types should be
promoted and what they should be promoted to.
- Move all codegen tests into Transforms folder and testing using opt
and not llc, which is how they should have been written in the
first place...
The pass searches up from icmp operands in an attempt to safely
promote types so we can avoid generating unnecessary unsigned extends
during DAG ISel.
Differential Revision: https://reviews.llvm.org/D69556
Add several new methods to ReachingDefAnalysis:
- getReachingMIDef, instead of returning an integer, return the
MachineInstr that produces the def.
- getInstFromId, return a MachineInstr for which the given integer
corresponds to.
- hasSameReachingDef, return whether two MachineInstr use the same
def of a register.
- isRegUsedAfter, return whether a register is used after a given
MachineInstr.
These methods have been used in ARMLowOverhead to replace searching
for uses/defs.
Differential Revision: https://reviews.llvm.org/D70009
In the ARM backend, for historical reasons we have only some targets
using Machine Scheduling. The rest use the old list scheduler as they
are using itinaries and the list scheduler seems to produce better code
(and not crash running out of register on v6m codes). So whether to use
the MIScheduler or not is checked at runtime from the subtarget
features.
This is fine, except for post-ra scheduling. Whether to use the old
post-ra list scheduler or the post-ra machine schedule is decided as the
pass manager is set up, in arms case from a newly constructed subtarget.
Under some situations, like LTO, this won't include the correct cpu so
can pick the wrong option. This can have a surprising effect on
performance.
To fix that, this patch overrides targetSchedulesPostRAScheduling and
addPreSched2 in the ARM backend, adding _both_ post-ra schedulers and
picking at runtime which to execute. To pick between the two I've had to
add a enablePostRAMachineScheduler() method that normally returns
enableMachineScheduler() && enablePostRAScheduler(), which can be
overridden to enable just one of PostRAMachineScheduler vs
PostRAScheduler.
Thanks to David Penry for the identifying this problem.
Differential Revision: https://reviews.llvm.org/D69775
With a few things fixed:
- initialisaiton of the optimisation remark pass (this was causing the buildbot
failures on PPC),
- a test case.
Differential Revision: https://reviews.llvm.org/D69660
This adds the initial plumbing to support optimisation remarks in
the IR hardware-loop pass.
I have left a todo in a comment where we can improve the reporting,
and will iterate on that now that we have this initial support in.
Differential Revision: https://reviews.llvm.org/D68579
llvm-svn: 374980
Add a pass to lower is.constant and objectsize intrinsics
This pass lowers is.constant and objectsize intrinsics not simplified by
earlier constant folding, i.e. if the object given is not constant or if
not using the optimized pass chain. The result is recursively simplified
and constant conditionals are pruned, so that dead blocks are removed
even for -O0. This allows inline asm blocks with operand constraints to
work all the time.
The new pass replaces the existing lowering in the codegen-prepare pass
and fallbacks in SDAG/GlobalISEL and FastISel. The latter now assert
on the intrinsics.
Differential Revision: https://reviews.llvm.org/D65280
llvm-svn: 374784
This pass lowers is.constant and objectsize intrinsics not simplified by
earlier constant folding, i.e. if the object given is not constant or if
not using the optimized pass chain. The result is recursively simplified
and constant conditionals are pruned, so that dead blocks are removed
even for -O0. This allows inline asm blocks with operand constraints to
work all the time.
The new pass replaces the existing lowering in the codegen-prepare pass
and fallbacks in SDAG/GlobalISEL and FastISel. The latter now assert
on the intrinsics.
Differential Revision: https://reviews.llvm.org/D65280
llvm-svn: 374743
The low-overhead branch extension provides a loop-end 'LE' instruction
that performs no decrement nor compare, it just jumps backwards. This
patch modifies the constant islands pass to try to insert LE
instructions in place of a Thumb2 conditional branch, instead of
shrinking it. This only happens if a cmp can be converted to a cbn/z
and used to exit the loop.
Differential Revision: https://reviews.llvm.org/D67404
llvm-svn: 372085
rL369567 reverted a couple of recent changes made to ARMParallelDSP
because of a miscompilation error: PR43073.
The issue stemmed from an underlying bug that was caused by adding
muls into a reduction before it was proved that they could be executed
in parallel with another mul.
Most of the changes here are from the previously reverted commits.
The additional changes have been made area:
1) The Search function now doesn't insert any muls into the Reduction
object. That now happens once the search has successfully finished.
2) For any muls added into the reduction but that weren't paired, we
accumulate their values as an input into the smlad.
Differential Revision: https://reviews.llvm.org/D66660
llvm-svn: 370171
Summary:
Current PRE hoists common computations into
CMBB = DT->findNearestCommonDominator(MBB, MBB1).
However, if CMBB is in a hot loop body, we might get performance
degradation.
Differential Revision: https://reviews.llvm.org/D64394
llvm-svn: 366570
This allows later passes (in particular InstCombine) to optimize more
cases.
One that's important to us is `memcmp(p, q, constant) < 0` and memcmp(p, q, constant) > 0.
llvm-svn: 364412
Introduce three pseudo instructions to be used during DAG ISel to
represent v8.1-m low-overhead loops. One maps to set_loop_iterations
while loop_decrement_reg is lowered to two, so that we can separate
the decrement and branching operations. The pseudo instructions are
expanded pre-emission, where we can still decide whether we actually
want to generate a low-overhead loop, in a new pass:
ARMLowOverheadLoops. The pass currently bails, reverting to an sub,
icmp and br, in the cases where a call or stack spill/restore happens
between the decrement and branching instructions, or if the loop is
too large.
Differential Revision: https://reviews.llvm.org/D63476
llvm-svn: 364288
This allows targets to make more decisions about reserved registers
after isel. For example, now it should be certain there are calls or
stack objects in the frame or not, which could have been introduced by
legalization.
Patch by Matthias Braun
llvm-svn: 363757
Initial commit of a new pass to create vector predication blocks, called VPT
blocks, that are supported by the Armv8.1-M MVE architecture.
This is a first naive implementation. I.e., for 2 consecutive predicated
instructions I1 and I2, for example, it will generate 2 VPT blocks:
VPST
I1
VPST
I2
A more optimal implementation would obviously put instructions in the same VPT
block when they are predicated on the same condition and when it is allowed to
do this:
VPTT
I1
I2
We will address this optimisation with follow up patches when the groundwork is
in. Creating VPT Blocks is very similar to IT Blocks, which is the reason I
added this to Thumb2ITBlocks.cpp. This allows reuse of the def use analysis
that we need for the more optimal implementation.
VPT blocks cannot be nested in IT blocks, and vice versa, and so these 2 passes
cannot interact with each other. Instructions allowed in VPT blocks must
be MVE instructions that are marked as VPT compatible.
Differential Revision: https://reviews.llvm.org/D63247
llvm-svn: 363370