This patch detaches SampleProfileLoader from class
SampleCoverageTracker. We plan to move SampleProfileLoader
to a template class. This would remain SampleCoverageTracker
as a class.
Also make callsiteIsHot() as a file static function.
Differential Revision: https://reviews.llvm.org/D95823
These two cases have identical implementations other than an
unreachable part of `G_ADD` that checks if the scalar we're narrowing
is a vector. Combining them to avoid unnecessary divergence.
This was only adding undef to the use if the copy itself had a
subregister index. It did not consider the subrange liveness if the
use had a subreg index to begin with.
If we had a pair of copies inside a loop which introduced new liveness
to a subregister which was undef before the loop, we would have a
dummy phi-only segment remaining across the loop body. Later, this
false segment would confuse RenameIndependentSubregs causing it to
introduce IMPLICIT_DEFs with broken value numbering.
It seems always adding the lanes to ShrinkMask is OK, so any
conditions should be purely a compile time filter.
If sext_inreg is supported, we will turn this into sext_inreg. That
will then remove it if there are enough sign bits. But if sext_inreg
isn't supported, we can still remove the shift pair based on sign
bits.
Split from D95890.
This patch updates the induction value creation to use VPValues of
recipes to map the created values. This should bring is one step closer
to being able to optimize induction recipes directly in VPlan.
Currently widenIntOrFpInduction also generates vector values for a cast
of the induction, if it exists. Make this explicit by adding the cast
instruction to the values defined by the recipe.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D92284
Discussed in this thread:
https://lists.llvm.org/pipermail/llvm-dev/2021-January/148139.html
DwarfDebug::collectEntityInfo accidentally distinguishes between variable
locations that never have a location specified, and variable locations that
have an empty location specified. The latter leads to the creation of an
empty variable referring to the abstract origin.
Fix this by seeking a non-empty location before producing a concrete
entity, to guarantee a DW_AT_location will be produced. Other loops in
collectEntityInfo and endFunctionImpl take care of examining the
retainedNodes collection and ensuring optimised-out variables are created.
Differential Revision: https://reviews.llvm.org/D95617
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are. This will
allow moving the special input registers out of the normal argument
range.
Extends D95779 to permit insertion into float/doubles vectors while avoiding a lot of aliased memory traffic.
The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.
SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.
Differential Revision: https://reviews.llvm.org/D95866
This reverts commits 62af0305b7cc..677a3529d3e6 from D93708.
They cause failures in the sanitizer builds because of uninitialized
values.
A fix is in D95878, but it might take some time until this is pushed,
so reverting the changes for now.
This patch adds a cost model for SK_Broadcast in
AArch64TTIImpl::getShuffleCost with scalable vector.
Without this patch, the scalable vector type relies on BasicTTIImpl cost
implementation and assert.
Differential Revision: https://reviews.llvm.org/D95598
This patch adds constructors to VPIteration as a cleaner way of
initialising the struct and replaces existing constructions of
the form:
{Part, Lane}
with
VPIteration(Part, Lane)
I have also added a default constructor, which is used by VPlan.cpp
when deciding whether to replicate a block or not.
This refactoring will be required in a later patch that adds more
members and functions to VPIteration.
Differential Revision: https://reviews.llvm.org/D95676
C identifier name input sections such as __llvm_prf_* are GC roots so
they cannot be discarded. In LLD, the SHF_LINK_ORDER flag overrides the
C identifier name semantics.
The !associated metadata may be attached to a global object declaration
with a single argument that references another global object, and it
gets lowered to SHF_LINK_ORDER flag. When a function symbol is discarded
by the linker, setting up !associated metadata allows linker to discard
counters, data and values associated with that function symbol.
Note that !associated metadata is only supported by ELF, it does not have
any effect on non-ELF targets.
Differential Revision: https://reviews.llvm.org/D76802
FixupStatepoints pass does not take into account the undef use
it skips may have a tied def. So when defs are handled pass
considers that tied-use should be spilled and triggers an assert.
FixupStatepoints should skip undef def as well.
Reviewers: reames, dantrushin
Reviewed By: dantrushin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D95858
We don't register i128 as a legal type with addRegisterClass, but it
appears in the list of legal register types. This inconsistency
resulted in the asm constraint lowering trying to use 2 128-bit
registers for these operands. This would leave behind a dead def that
would waste registers.
Regresses GlobalISel tests for i128 load/store, but these aren't very
important right now. Ideally these would not depend on the list of
register types.
This should only consider whether the pressure impact of the bundle at
the given point in the program will decrease the occupancy. High VGPR
pressure was incorrectly blocking the formation of scalar bundles, and
vice versa. This was also blocking bundling from high pressure
situations at other points in the program.
If the G_BR + G_BRCOND in this combine use the same MBB, then it will infinite
loop. Don't allow that to happen.
Differential Revision: https://reviews.llvm.org/D95895
Second land attempt. MachineVerifier DefRegState expensive check errors fixed.
Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.
This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.
Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling X30/LR. To handle
the LR case, his patch custom-outlines prologs and epilogs in place. It does
this by doing the following:
* Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
Epilog Injection Pass.
* Lowers and optimizes said pseudos in a AArchLowerHomogneousPrologEpilog Pass.
* Outlined helpers are created on demand. Identical helpers are merged by the linker.
* An opt-in flag is introduced to enable this feature. Another threshold flag
is also introduced to control the aggressiveness of outlining for application's need.
This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.
Differential Revision: https://reviews.llvm.org/D76570
Previously we'd hit UB due to an invalid left shift operand.
Also fix the WASM emitter to properly use SLEB128 encoding instead of
ULEB128 encoding for signed fields so that negative numbers don't
result in overly-large values that we can't read back any more.
In passing, don't diagnose a non-canonical ULEB128 that fits in a uint64_t but
has redundant trailing zero bytes.
Reviewed By: dblaikie, aardappel
Differential Revision: https://reviews.llvm.org/D95510
Sample re-annotation is required in LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by preLTO code duplication that is done by passes such as loop unrolling, jump threading, indirect call promotion etc, where samples corresponding to a source location are aggregated multiple times due to the duplicates. In this change we are introducing a concept of distribution factor for pseudo probes so that samples can be distributed for duplicated probes scaled by a factor. We hope that optimizations duplicating code well-maintain the branch frequency information (BFI) based on which probe distribution factors are calculated. Distribution factors are updated at the end of preLTO pipeline to reflect an estimated portion of the real execution count.
This change also introduces a pseudo probe verifier that can be run after each IR passes to detect duplicated pseudo probes.
A saturated distribution factor stands for 1.0. A pesudo probe will carry a factor with the value ranged from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated to each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit Dwarf discriminator. A 7-bit distribution factor is used instead.
Changes are also needed to the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by PreLTO passes, when later on inlined in LTO time, should have the callees’s probe prorated based on the Prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, Indirect call promotion results in multiple callisites. The original samples should be distributed across them. This is fixed by adjusting the callisites' distribution factors.
Reviewed By: wmi
Differential Revision: https://reviews.llvm.org/D93264
Due to a clerical error, the sdiv operation was mapping to vdivu and
udiv to vdiv, when the opposite mapping is the correct one.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D95869
Without `-dwarf-version`, llvm-mc uses the default `MCContext::DwarfVersion` 4.
Without `-gdwarf-N`, Clang cc1as uses `clang::driver::ToolChain::GetDefaultDwarfVersion`
which is 4 on many toolchains. Note: `clang -c` can synthesize .debug_info without -g.
There is currently a MCParser warning upon `.file 0` and MCParser errors upon
`.loc 0` if the DWARF version is less than 5. This causes friction to the
following usage:
```
clang -S -g -gdwarf-5 a.c
// MC warning due to .file 0, MC error due to .loc 0
clang -c a.s
llvm-mc -filetype=obj a.s
```
My idea is that we can just upgrade `MCContext::DwarfVersion` to 5 upon
`.file 0` to make the above commands work.
The downside is that for an explicit version `clang -c -gdwarf-4 a.s`, it can be
argued that the new behavior drops the probably intended diagnostic. I think the
downside is small because in most cases DWARF version for an assembly action
should either match the original compile action or be omitted.
Ongoing discussion taking a similar action for GNU as: https://sourceware.org/pipermail/binutils/2021-January/114980.html
Differential Revision: https://reviews.llvm.org/D94882
On Linux target triples, GNU as sets EI_OSABI to ELFOSABI_GNU when SHF_GNU_RETAIN is used。
On `*-*-freebsd`, it usually sets EI_OSABI to ELFOSABI_FREEBSD.
GNU ld respects SHF_GNU_RETAIN only for ELFOSABI_FREEBSD/ELFOSABI_GNU.
https://sourceware.org/bugzilla/show_bug.cgi?id=27282
MC doesn't set ELFOSABI_GNU for SHF_GNU_RETAIN/STB_GNU_UNIQUE/STT_GNU_IFUNC.
MC assembled object files do not have special semantics in GNU ld.
Reviewed By: psmith
Differential Revision: https://reviews.llvm.org/D95730
In binutils, the flag is defined for ELFOSABI_GNU and ELFOSABI_FREEBSD.
It can be used to mark a section as a GC root.
In practice, the flag has generic semantics and can be applied to many
EI_OSABI values, so we consider it generic.
Differential Revision: https://reviews.llvm.org/D95728
Inlining sometimes maps different instructions to be inlined onto the same instruction.
We must ensure to only remap the noalias scopes once. Otherwise the scope might disappear (at best).
This patch ensures that we only replace scopes for which the mapping is known.
This approach is preferred over tracking which instructions we already handled in a SmallPtrSet,
as that one will need more memory.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D95862
This allows the peephole optimizer to know that a MVE_VMOV_to_lane_32 is
the same as an insert subreg, allowing it to optimize some redundant
lane moves.
Differential Revision: https://reviews.llvm.org/D95433
The temporary register is only used to compute the frame pointer.
The frame pointer is overwritten and not used in between, so we
can reuse the frame pointer for the computation, saving one register.
Differential Revision: https://reviews.llvm.org/D95865
Saving callee-save registers happens in whole wave mode. Exec is saved
to a free register, which can be reused to save the frame pointer.
Therefore, saving the fp needs to happen after saving csrs.
Differential Revision: https://reviews.llvm.org/D95861
Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.
This is resubmit of D95024, with build break and overtighten assertion fixed.
Test Plan:
NOTE: This patch was originally written by Anil Mahmud. His code has been
rebased but otherwise left mostly unchanged.
A new instructon on Power 10 allows for the materialization of 34 bit
immediate values. This patch allows the compiler to take advantage of
the new instruction in this situation.
Reviewed By: amyk
Differential Revision: https://reviews.llvm.org/D92879
A v4i32 insert of an extract can become a simple lane move, as opposed
to round-tripping via a GPR. This adds a patterns that turns an v4i32
insert-extract pair into a EXTRACT_SUBREG/INSERT_SUBREG, with the
required COPY_TO_REGCLASS. These get better optimized into a simple lane
move by the rest of the backend.
Differential Revision: https://reviews.llvm.org/D95428
This is a yet another hint that we will eventually need InstCombineInverter,
which would consistently sink inversions, but but for that we'll need
to consistently hoist inversions where possible, so let's do that here.
Example of a proof: https://alive2.llvm.org/ce/z/78SbDq
See https://bugs.llvm.org/show_bug.cgi?id=48995