Summary:
This is a simple patch that expands https://reviews.llvm.org/D75859 to pointer comparison and fcmp
Checked with Alive2
Reviewers: reames, jdoerfert
Reviewed By: jdoerfert
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76048
LLVM currently supports CSK_MD5 and CSK_SHA1 source file checksums in
debug info. This change adds support for CSK_SHA256 checksums.
The SHA256 checksums are supported by the CodeView debug format.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D75785
Summary:
Support ConstantInt::get() and Constant::getAllOnesValue() for scalable
vector type, this requires ConstantVector::getSplat() to take in 'ElementCount',
instead of 'unsigned' number of element count.
This change is needed for D73753.
Reviewers: sdesmalen, efriedma, apazos, spatel, huntergr, willlovett
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74386
Summary:
Using the default DAG.UnrollVectorOp on v16i8 and v8i16 vectors
results in i8 or i16 nodes being inserted into the SelectionDAG. Since
those are illegal types, this causes a legalization assertion failure
for some code patterns, as uncovered by PR45178. This change unrolls
shifts manually to avoid this issue by adding and using a new optional
EVT argument to DAG.ExtractVectorElements to control the type of the
extract_element nodes.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76043
Summary:
This patch adds the following intrinsics for non-temporal gather loads
and scatter stores:
* aarch64_sve_ldnt1_gather_index
* aarch64_sve_stnt1_scatter_index
These intrinsics implement the "scalar + vector of indices" addressing
mode.
As opposed to regular and first-faulting gathers/scatters, there's no
instruction that would take indices and then scale them. Instead, the
indices for non-temporal gathers/scatters are scaled before the
intrinsics are lowered to `ldnt1` instructions.
The new ISD nodes, GLDNT1_INDEX and SSTNT1_INDEX, are only used as
placeholders so that we can easily identify the cases implemented in
this patch in performGatherLoadCombine and performScatterStoreCombined.
Once encountered, they are replaced with:
* GLDNT1_INDEX -> SPLAT_VECTOR + SHL + GLDNT1
* SSTNT1_INDEX -> SPLAT_VECTOR + SHL + SSTNT1
The patterns for lowering ISD::SHL for scalable vectors (required by
this patch) were missing, so these are added too.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D75601
Summary: When narrowing a scalar G_EXTRACT where the destination lines up perfectly with a single result of the emitted G_UNMERGE_VALUES a COPY should be emitted instead of unconditionally trying to emit a G_MERGE_VALUES.
Reviewers: arsenm, dsanders
Reviewed By: arsenm
Subscribers: wdng, rovka, hiraditya, volkan, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75743
This is a reimplementation of the optimization removed in D75964. The actual spill/fill optimization is handled by D76013, this one just worries about reducing the stackmap section size itself by eliminating redundant entries. As noted in the comments, we could go a lot further here, but avoiding the degenerate invoke case as we did before is probably "enough" in practice.
Differential Revision: https://reviews.llvm.org/D76021
Summary:
callbr's indirect branches aren't expected to be taken, so reduce their
probabilities to 0 while increasing the default destination to 1. This
allows some code improvements through block placement.
Reviewers: nickdesaulniers
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72656
In order for dsymutil to collect .apinotes files (which capture
attributes such as nullability, Swift import names, and availability),
I want to propose adding an apinotes: field to DIModule that gets
translated into a DW_AT_LLVM_apinotes (path) nested inside
DW_TAG_module. This will be primarily used by LLDB to indirectly
extract the Swift names of Clang declarations that were deserialized
from DWARF.
<rdar://problem/59514626>
Differential Revision: https://reviews.llvm.org/D75585
This is part of PR44213 https://bugs.llvm.org/show_bug.cgi?id=44213
When importing (system) Clang modules, LLDB needs to know which SDK
(e.g., MacOSX, iPhoneSimulator, ...) they came from. While the sysroot
attribute contains the absolute path to the SDK, this doesn't work
well when the debugger is run on a different machine than the
compiler, and the SDKs are installed in different directories. It thus
makes sense to just store the name of the SDK instead of the absolute
path, so it can be found relative to LLDB.
rdar://problem/51645582
Differential Revision: https://reviews.llvm.org/D75646
Summary:
The change is to fix conflict value for metadata "Objective-C Garbage Collection" in the mix of swift and Objective-C bitcode.
The purpose is to provide the support of LTO for swift and Objective-C mixed project.
Reviewers: rjmccall, ahatanak, steven_wu
Reviewed By: rjmccall, steven_wu
Subscribers: manmanren, mehdi_amini, hiraditya, dexonsmith, llvm-commits, jinlin
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71219
We just removed a broken duplicate elimination algorithm in D75964, and after landed that it occurred to me that duplicate elimination is simply CSE. SelectionDAG has a build in CSE, so why wasn't that triggering? Well, it turns out we were overly conservative in the memory states for our reloads and CSE (rightly) considers the incoming memory state for a load part of the identity of the load.
By loosening the chain and allowing reordering, we also allow CSE. As shown in the test case, doing iterative CSE as we go is enough to eliminate duplicate stores in later statepoints as well. We key our (block local) slot map by SDValue, so commoning a previous pair of loads at construction time means we also common following stores.
Differential Revision: https://reviews.llvm.org/D76013
This patch reuses the existing MatchRotate ROTL/ROTR rotation pattern code to also recognize the more general FSHL/FSHR funnel shift patterns when we have constant shift amounts.
Differential Revision: https://reviews.llvm.org/D75114
Summary:
This patch helps CodeGenPrepare move freeze into the icmp when it is used by branch.
It reenables generation of efficient conditional jumps.
This is only done when at least one of icmp's operands is constant to prevent the transformation from increasing # of freeze instructions.
Performance degradation of MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test is resolved with this patch.
Checked with Alive2
Reviewers: reames, fhahn, nlopes
Reviewed By: reames
Subscribers: jdoerfert, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75859
A downstream test case (see included reduced test) revealed that we have a bug in how we handle duplicate relocations. If we have the same SDValue relocated twice, and that value happens to be a constant (such as null), we only export one of the two llvm::Values. Exporting on a per llvm::Value basis is required to allow lowering of gc.relocates in following basic blocks (e.g. invokes). Without it, we end up with a use of an undefined vreg and bad things happen.
Rather than fixing the optimization - which appears to be hard - I propose we simply remove it. There are no tests in tree that change with this code removed. If we find out later that this did matter for something, we can reimplement a variation of this in CodeGenPrepare to catch the easy cases without complicating the lowering code.
Thanks to Denis and Serguei who did all the hard work of figuring out what went wrong here. The patch is by far the easy part. :)
Differential Revision: https://reviews.llvm.org/D75964
If the loaded memory size was smaller than the result size, this would
produce out of bounds memory accesses. I'm wondering if we need a
distinct narrow memory legalize action type, since a case I care about
is decomposing a 4-byte unaligned access into 4 extending loads, which
would leave the original result register type. I'm currently awkwardly
using narrowScalar to handle unaligned accesses that need to be split.
Summary:
This patch extends the TargetMachine to let targets specify the integer size
used by the sjljehprepare pass. This is 64bit for the VE target and otherwise
defaults to 32bit for all targets, which was hard-wired before.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D71337
When hashing on MachineOperand::MO_ConstantPoolIndex, now MIR-Canon and
MIRVRegNamer will no longer result in a hash collision.
Differential Revision: https://reviews.llvm.org/D74449
This allows Spiller.h to be used and included outside of
the lib/CodeGen directory. For example to be used in the
lib/Target directory or other places.
This reverts commit 5583c2f2fb.
The lldb bot failure was a test that was fragile and sensitive to irrelevant
changes in instruction ordering. Re-committing this as the test should have
been skipped for AArch64 now.
Differential Revision: https://reviews.llvm.org/D75555
Handle LinkOnceODRLinkage;
Handle AppendingLinkage type for llvm.global_ctors/dtors static init global arrays;
Differential Revision: https://reviews.llvm.org/D75305
As noted on D75114, if both arguments of a funnel shift are consecutive loads we are missing the opportunity to combine them into a single load.
Differential Revision: https://reviews.llvm.org/D75624
Summary:
This performs better for sample PGO.
NFC as PGSOColdCodeOnlyForSamplePGO is still true.
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75550
PowerPC hits an assertion due to somewhat the same reason as https://reviews.llvm.org/D70975.
Though there are already some hack, it still failed with some case, when the operand 0 is NOT
a const fp, it is another fma that with const fp. And that const fp is negated which result in multi-uses.
A better fix is to check the uses of the negated const fp. If there are already use of its negated
value, we will have benefit as no extra Node is added.
Differential revision: https://reviews.llvm.org/D75501
https://reviews.llvm.org/D42848 only handled CFA related cfi directives but
didn't handle CSR related cfi. The patch adds the CSR part. Basically it reuses
the framework created in D42848. For each basicblock, the patch tracks which
CSR set have been saved at its CFG predecessors's exits, and compare the CSR
set with the set at its previous basicblock's exit (The previous block is the
block laid before the current block). If the saved CSR set at its previous
basicblock's exit is larger, .cfi_restore will be inserted.
The patch also generates proper .cfi_restore in epilogue to make sure the
saved CSR set is consistent for the incoming edges of each block.
Differential Revision: https://reviews.llvm.org/D74303
As the test case shows if there is an ExtractValueInst in the Ret block, function dupRetToEnableTailCallOpts can't duplicate it into the block containing call. So later no tail call is generated in CodeGen.
This patch adds the ExtractValueInst handling code in function dupRetToEnableTailCallOpts and FoldReturnIntoUncondBranch, and later tail call can be generated for this case.
Differential Revision: https://reviews.llvm.org/D74242
Spin-off from D75407. As described there, ConstantFoldConstant()
currently returns null for non-ConstantExpr/ConstantVector inputs,
but otherwise always returns non-null, independently of whether
any folding has happened or not.
This is confusing and makes consumer code more complicated.
I would expect either that ConstantFoldConstant() returns only if
it actually folded something, or that it always returns non-null.
I'm going to the latter possibility here, which appears to be more
useful considering existing usage.
Differential Revision: https://reviews.llvm.org/D75543
As discussed in the commit thread for rGa253a2a and D73978, we can do more undef folding for FP ops.
The nnan and ninf fast-math-flags specify that if an operand is the disallowed value, the result is
poison, so we can produce an undef result.
But this doesn't work as expected (the undef operand cases remain) because of a Flags propagation
problem in SelectionDAGBuilder.
I've added DAGCombiner calls to enable these for the other cases because we've shown in other
patches that (because of the limited way that SDAG iterates), it is possible to miss simplifications
like this if they are done only at node creation time.
Several potential follow-ups to expand on this patch are possible.
Differential Revision: https://reviews.llvm.org/D75576
This changes the localizer to attempt intra-block localizer of instructions
that have local uses. This is useful because sometimes the entry block itself
has many uses of constant-like instructions, which would benefit from shortening
live ranges. Previously if an inst had no non-local uses, we wouldn't add it to
the list of instructions to attempt further intra-block localization.
This gives a 0.7% geomean code size improvement on CTMark.
Differential Revision: https://reviews.llvm.org/D75555
Summary:
This change checks for the return type in the frontend and adds a flag
to the DISubroutineType to indicate that the option should be added in
CodeViewDebug.
Previously function types sometimes appeared twice in the PDB: once with
"returns cxx udt" and once without.
See https://bugs.llvm.org/show_bug.cgi?id=44785.
Reviewers: rnk, asmith
Subscribers: hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75215
This fixes a miscompile that happened because a DBG_VALUE interfered
with the MachineOutliner's liveness analysis.
Inserting a DBG_VALUE after a terminator breaks predicates on MBB such
as isReturnBlock(). And the resulting DBG_VALUE cannot be "live".
I plan to introduce a MachineVerifier check for this situation in a
follow up.
rdar://59859175
Testing: check-llvm, LNT build with a stage2 compiler & entry values
enabled
Differential Revision: https://reviews.llvm.org/D75548
```
// clang -c -gdwarf-5 a.s -o a.o
.section .init; ret
.text; ret
```
.debug_info contains DW_AT_ranges and llvm-dwarfdump will report
a verification error because .debug_rnglists does not exist (not
implemented).
This patch generates .debug_rnglists for assembly files.
emitListsTableHeaderStart() in DwarfDebug.cpp can be shared with
MCDwarf.cpp. Because CodeGen depends on MC, I move the function to
MCDwarf.cpp
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D75375
Summary:
Follow up from D75377. If the subvector is byte sized and the
index is aligned to the subvector size, we can shrink the load.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: dbabokin, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75434
Use MIOperand in collectLocalKilledOperands to make the search
global, as we already have to search for global uses too. This
allows us to delete more dead code when tail predicating.
Differential Revision: https://reviews.llvm.org/D75167
In RDA, check against the already decided dead instructions when
looking at users. This allows an instruction to be removed if it
has multiple users, but they're all dead.
This means that IT instructions can be considered killed once all
the itstate using instructions are dead.
Differential Revision: https://reviews.llvm.org/D75245
This patch adds support for dwarf emission/dumping part of debuginfo
generation for defaulted parameters.
Reviewers: probinson, aprantl, dblaikie
Reviewed By: aprantl, dblaikie
Differential Revision: https://reviews.llvm.org/D73462
Make it a compile-time error to pass an int/unsigned/etc to
fromRawInteger.
Hopefully this prevents errors of the form:
```
for (unsigned ID : getVarLocs()) {
auto VL = LocMap[LocIndex::fromRawInteger(ID)];
...
```
I expect that the isCondCodeLegal checks should match that CC of
the node that we're going to create.
Rewriting to a switch to minimize repeated mentions of the same
constants.
This is needed for D74873, AMDGPU going to have 16 bit subregs
and the largest tuple is 32 VGPRs, which results in 64 lanes.
Differential Revision: https://reviews.llvm.org/D75378
We should be careful to allow count of re-materialization of operands to be less
then number of physical registers.
STATEPOINT instruction has a variable number of operands and potentially very big.
So re-materialization for all operands is disabled at the moment if restrict-statepoint-remat is true.
The patch relaxes the re-materialization restriction for STATEPOINT instruction allowing it for
fixed operands. Specifically it is about call target.
Reviewers: reames
Reviewed By: reames
Subscribers: llvm-commits, qcolombet, hiraditya
Differential Revision: https://reviews.llvm.org/D75335
The address calculation for the offset assumes that you can calculate the offset by multiplying the index by the store size of the element. But that only works if the element's store size is exactly its real size since we store vectors tightly packed in memory. There are improvements we could make to this like special casing extracting element 0. I think we could also handle cases where the extracted VT is byte sized and the index is aligned with the extract element count.
Differential Revision: https://reviews.llvm.org/D75377
Select_cc isn't used by all targets. X86 doesn't have optimizations
for it.
Since we already know the input to the sint_to_fp/uint_to_fp is
a setcc we can just emit a plain select using that setcc as the
condition. Other DAG combines can turn that into a select_cc on
targets that support it.
Differential Revision: https://reviews.llvm.org/D75415
We get the simple cases of this via demanded elements and other folds,
but that doesn't work if the values have >1 use, so add a dedicated
match for the pattern.
We already have this transform in IR, but it doesn't help the
motivating x86 tests (based on PR42024) because the shuffles don't
exist until after legalization and other combines have happened.
The AArch64 test shows a minimal IR example of the problem.
Differential Revision: https://reviews.llvm.org/D75348
Try again with an up-to-date version of D69471 (99317124 was a stale
revision).
---
Revise the coverage mapping format to reduce binary size by:
1. Naming function records and marking them `linkonce_odr`, and
2. Compressing filenames.
This shrinks the size of llc's coverage segment by 82% (334MB -> 62MB)
and speeds up end-to-end single-threaded report generation by 10%. For
reference the compressed name data in llc is 81MB (__llvm_prf_names).
Rationale for changes to the format:
- With the current format, most coverage function records are discarded.
E.g., more than 97% of the records in llc are *duplicate* placeholders
for functions visible-but-not-used in TUs. Placeholders *are* used to
show under-covered functions, but duplicate placeholders waste space.
- We reached general consensus about giving (1) a try at the 2017 code
coverage BoF [1]. The thinking was that using `linkonce_odr` to merge
duplicates is simpler than alternatives like teaching build systems
about a coverage-aware database/module/etc on the side.
- Revising the format is expensive due to the backwards compatibility
requirement, so we might as well compress filenames while we're at it.
This shrinks the encoded filenames in llc by 86% (12MB -> 1.6MB).
See CoverageMappingFormat.rst for the details on what exactly has
changed.
Fixes PR34533 [2], hopefully.
[1] http://lists.llvm.org/pipermail/llvm-dev/2017-October/118428.html
[2] https://bugs.llvm.org/show_bug.cgi?id=34533
Differential Revision: https://reviews.llvm.org/D69471
Revise the coverage mapping format to reduce binary size by:
1. Naming function records and marking them `linkonce_odr`, and
2. Compressing filenames.
This shrinks the size of llc's coverage segment by 82% (334MB -> 62MB)
and speeds up end-to-end single-threaded report generation by 10%. For
reference the compressed name data in llc is 81MB (__llvm_prf_names).
Rationale for changes to the format:
- With the current format, most coverage function records are discarded.
E.g., more than 97% of the records in llc are *duplicate* placeholders
for functions visible-but-not-used in TUs. Placeholders *are* used to
show under-covered functions, but duplicate placeholders waste space.
- We reached general consensus about giving (1) a try at the 2017 code
coverage BoF [1]. The thinking was that using `linkonce_odr` to merge
duplicates is simpler than alternatives like teaching build systems
about a coverage-aware database/module/etc on the side.
- Revising the format is expensive due to the backwards compatibility
requirement, so we might as well compress filenames while we're at it.
This shrinks the encoded filenames in llc by 86% (12MB -> 1.6MB).
See CoverageMappingFormat.rst for the details on what exactly has
changed.
Fixes PR34533 [2], hopefully.
[1] http://lists.llvm.org/pipermail/llvm-dev/2017-October/118428.html
[2] https://bugs.llvm.org/show_bug.cgi?id=34533
Differential Revision: https://reviews.llvm.org/D69471
As a narrow stopgap for the assertion failure described in PR45025, add
a describeLoadedValue override to ARMBaseInstrInfo and use it to detect
copies in which the forwarding reg is a super/sub reg of the copy
destination. For the moment this is unsupported.
Several follow ups are possible:
1) Handle VORRq. At the moment, we do not, because isCopyInstrImpl
returns early when !MI.isMoveReg().
2) In the case where forwarding reg is a super-reg of the copy
destination, we should be able to describe the forwarding reg as a
subreg within the copy destination. I'm not 100% sure about this, but
it looks like that's what's done in AArch64InstrInfo.
3) In the case where the forwarding reg is a sub-reg of the copy
destination, maybe we could describe the forwarding reg using the
copy destinaion and a DW_OP_LLVM_fragment (I guess this should be
possible after D75036).
https://bugs.llvm.org/show_bug.cgi?id=45025
rdar://59772698
Differential Revision: https://reviews.llvm.org/D75273
The alias analysis in DAG Combine looks at the BaseAlign, the Offset and
the Size of two accesses, and determines if they are known to access
different parts of memory by the fact that they are different offsets
from inside that "alignment window". It does not seem to account for
accesses that are not a multiple of the size, and may overflow from one
alignment window into another.
For example in the test case we have a 19byte memset that is splits into
a 16 byte neon store and an unaligned 4 byte store with a 15 byte
offset. This 15byte offset (with a base align of 8) wraps around to the
next alignment windows. When compared to an access that is a 16byte
offset (of the same 4byte size and 8byte basealign), the two accesses
are said not to alias.
I've fixed this here by just ensuring that the offsets are a multiple of
the size, ensuring that they don't overlap by wrapping. Fixes PR45035,
which was exposed by the UseAA changes in the arm backend.
Differential Revision: https://reviews.llvm.org/D75238
We can only report the knownbits for a SCALAR_TO_VECTOR node if we only demand the 0'th element - the upper elements are undefined and shouldn't be trusted.
This is causing a number of regressions that need addressing but we need to get the bugfix in first.
Way back in D24994, the combination of LexicalScopes::dominates and
LiveDebugValues was identified as having worst-case quadratic complexity,
but it wasn't triggered by any code path at the time. I've since run into a
scenario where this occurs, in a very large basic block where large numbers
of inlined DBG_VALUEs are present.
The quadratic-ness comes from LiveDebugValues::join calling "dominates" on
every variable location, and LexicalScopes::dominates potentially touching
every instruction in a block to test for the presence of a scope. We have,
however, already computed the presence of scopes in blocks, in the
"InstrRanges" of each scope. This patch switches the dominates method to
examine whether a block is present in a scope's InsnRanges, avoiding
walking through the whole block.
At the same time, fix getMachineBasicBlocks to account for the fact that
InsnRanges can cover multiple blocks, and add some unit tests, as Lexical
Scopes didn't have any.
Differential revision: https://reviews.llvm.org/D73725
Ensure that we're recording implicit defs, as well as visiting implicit
uses and implicit defs when we're walking through operands.
Differential Revision: https://reviews.llvm.org/D75185
According to Joseph Myers, a libm maintainer
> They were only ever an ABI (selected by use of -ffinite-math-only or
> options implying it, which resulted in the headers using "asm" to redirect
> calls to some libm functions), not an API. The change means that ABI has
> turned into compat symbols (only available for existing binaries, not for
> anything newly linked, not included in static libm at all, not included in
> shared libm for future glibc ports such as RV32), so, yes, in any case
> where tools generate direct calls to those functions (rather than just
> following the "asm" annotations on function declarations in the headers),
> they need to stop doing so.
As a consequence, we should no longer assume these symbols are available on the
target system.
Still keep the TargetLibraryInfo for constant folding.
Differential Revision: https://reviews.llvm.org/D74712
This is part 3 of a 3-part series to address a compile-time explosion
issue in LiveDebugValues.
---
Start encoding register locations within VarLoc IDs, and take advantage
of this encoding to speed up transferRegisterDef.
There is no fundamental algorithmic change: this patch simply swaps out
SparseBitVector in favor of CoalescingBitVector. That changes iteration
order (hence the test updates), but otherwise this patch is NFCI.
The only interesting change is in transferRegisterDef. Instead of doing:
```
KillSet = {}
for (ID : OpenRanges.getVarLocs())
if (DeadRegs.count(ID))
KillSet.add(ID)
```
We now do:
```
KillSet = {}
for (Reg : DeadRegs)
for (ID : intervalsReservedForReg(Reg, OpenRanges.getVarLocs()))
KillSet.add(ID)
```
By not visiting each open location every time we visit an instruction,
this eliminates some potentially quadratic behavior. The new
implementation basically does a constant amount of work per instruction
because the interval map lookups are very fast.
For a file in WebKit, this brings the time spent in LiveDebugValues down
from ~2.5 minutes to 4 seconds, reducing compile time spent in that pass
from 28% of the total to just over 1%.
Before:
```
2.49 min 27.8% 0 s LiveDebugValues::process
2.41 min 27.0% 5.40 s LiveDebugValues::transferRegisterDef
1.51 min 16.9% 1.51 min LiveDebugValues::VarLoc::isDescribedByReg() const
32.73 s 6.1% 8.70 s llvm::SparseBitVector<128u>::SparseBitVectorIterator::operator++()
```
After:
```
4.53 s 1.1% 0 s LiveDebugValues::process
3.00 s 0.7% 107.00 ms LiveDebugValues::transferRegisterCopy
892.00 ms 0.2% 406.00 ms LiveDebugValues::transferSpillOrRestoreInst
404.00 ms 0.1% 32.00 ms LiveDebugValues::transferRegisterDef
110.00 ms 0.0% 2.00 ms LiveDebugValues::getUsedRegs
57.00 ms 0.0% 1.00 ms std::__1::vector<>::push_back
40.00 ms 0.0% 1.00 ms llvm::CoalescingBitVector<>::find(unsigned long long)
```
FWIW, I tried the same approach using SparseBitVector, but got bad
results. To do that, I had to extend SparseBitVector to support 64-bit
indices and expose its lower bound operation. The problem with this is
that the performance is very hard to predict: SparseBitVector's lower
bound operation falls back to O(n) linear scans in a std::list if you're
not /very/ careful about managing iteration order. When I profiled this
the performance looked worse than the baseline.
You can see the full CoalescingBitVector-based implementation here:
https://github.com/vedantk/llvm-project/commits/try-coalescing
You can see the full SparseBitVector-based implementation here:
https://github.com/vedantk/llvm-project/commits/try-sparsebitvec-find
Depends on D74984 and D74985.
Differential Revision: https://reviews.llvm.org/D74986
This is part 2 of a 3-part series to address a compile-time explosion
issue in LiveDebugValues.
---
Each VarLoc has a unique ID: this ID is used to look up a VarLoc in the
VarLocMap, and to virtually insert a VarLoc into a VarLocSet. Instead of
inserting the VarLoc /itself/ into the VarLocSet, we insert just the ID,
because this can be represented efficiently with a SparseBitVector.
This change introduces LocIndex, a layer of abstraction on top of VarLoc
IDs. Prior to this change, an ID was just an index into a vector. With
this change, an ID encodes both an index /and/ a register location. The
type-checker ensures that conversions to and from LocIndex are correct.
For the moment the register location is always 0 (undef). We have plenty
of bits left over to encode physregs, stack slots, and other locations
in the future.
Differential Revision: https://reviews.llvm.org/D74985
The code changes here are hopefully straightforward:
1. Use MachineInstruction flags to decide if FP ops can be reassociated
(use both "reassoc" and "nsz" to be consistent with IR transforms;
we probably don't need "nsz", but that's a safer interpretation of
the FMF).
2. Check that both nodes allow reassociation to change instructions.
This is a stronger requirement than we've usually implemented in
IR/DAG, but this is needed to solve the motivating bug (see below),
and it seems unlikely to impede optimization at this late stage.
3. Intersect/propagate MachineIR flags to enable further reassociation
in MachineCombiner.
We managed to make MachineCombiner flexible enough that no changes are
needed to that pass itself. So this patch should only affect x86
(assuming no other targets have implemented the hooks using MachineIR
flags yet).
The motivating example in PR43609 is another case of fast-math transforms
interacting badly with special FP ops created during lowering:
https://bugs.llvm.org/show_bug.cgi?id=43609
The special fadd ops used for converting int to FP assume that they will
not be altered, so those are created without FMF.
However, the MachineCombiner pass was being enabled for FP ops using the
global/function-level TargetOption for "UnsafeFPMath". We managed to run
instruction/node-level FMF all the way down to MachineIR sometime in the
last 1-2 years though, so we can do better now.
The test diffs require some explanation:
1. llvm/test/CodeGen/X86/fmf-flags.ll - no target option for unsafe math was
specified here, so MachineCombiner kicks in where it did not previously;
to make it behave consistently, we need to specify a CPU schedule model,
so use the default model, and there are no code diffs.
2. llvm/test/CodeGen/X86/machine-combiner.ll - replace the target option for
unsafe math with the equivalent IR-level flags, and there are no code diffs;
we can't remove the NaN/nsz options because those are still used to drive
x86 fmin/fmax codegen (special SDAG opcodes).
3. llvm/test/CodeGen/X86/pow.ll - similar to #1
4. llvm/test/CodeGen/X86/sqrt-fastmath.ll - similar to #1, but MachineCombiner
does some reassociation of the estimate sequence ops; presumably these are
perf wins based on latency/throughput (and we get some reduction of move
instructions too); I'm not sure how it affects numerical accuracy, but the
test reflects reality better now because we would expect MachineCombiner to
be enabled if the IR was generated via something like "-ffast-math" with clang.
5. llvm/test/CodeGen/X86/vec_int_to_fp.ll - this is the test added to model PR43609;
the fadds are not reassociated now, so we should get the expected results.
6. llvm/test/CodeGen/X86/vector-reduce-fadd-fast.ll - similar to #1
7. llvm/test/CodeGen/X86/vector-reduce-fmul-fast.ll - similar to #1
Differential Revision: https://reviews.llvm.org/D74851
This will address the issue: P8198 and P8199 (from D73534).
The methods was not handle bundles properly.
Differential Revision: https://reviews.llvm.org/D74904
Summary:
If the describeLoadedValue() hook produced a DIExpression when
describing a instruction, and it was not possible to emit a call site
entry directly (the value operand was not an immediate nor a preserved
register), then that described value could not be inserted into the
worklist, and would instead be dropped, meaning that the parameter's
call site value couldn't be described.
This patch extends the worklist so that each entry has an DIExpression
that is built up when iterating through the instructions.
This allows us to describe instruction chains like this:
$reg0 = mv $fp
$reg0 = add $reg0, offset
call @call_with_offseted_fp
Since DW_OP_LLVM_entry_value operations can't be combined with any other
expression, such call site entries will not be emitted. I have added a
test, dbgcall-site-expr-entry-value.mir, which verifies that we don't
assert or emit broken DWARF in such cases.
Reviewers: djtodoro, aprantl, vsk
Reviewed By: djtodoro, vsk
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D75036
Summary:
This is a preparatory patch for D75036, in which a debug expression is
associated with each parameter register in the worklist. In that patch
the two lambda functions addToWorklist() and finishCallSiteParams() grow
a bit, so move those out to separate functions. This patch also prepares
for each parameter register having their own expression moving the
creation of the DbgValueLoc into finishCallSiteParams().
Reviewers: djtodoro, vsk
Reviewed By: djtodoro, vsk
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D75050
This may inhibit vector narrowing in general, but there's
already an inconsistency in the way that we deal with this
pattern as shown by the test diff.
We may want to add a dedicated function for narrowing fneg.
It's often folded into some other op, so moving it away from
other math ops may cause regressions that we would not see
for normal binops.
See D73978 for more details.
Add getUniqueReachingMIDef to RDA which performs a global search for
a machine instruction that produces a unique definition of a given
register at a given point. Also add two helper functions
(getMIOperand) that wrap around this functionality to get the
incoming definition uses of a given instruction. These now replace
the uses of getReachingMIDef in ARMLowOverheadLoops. getReachingMIDef
has been renamed to getReachingLocalMIDef and has been made private
along with getInstFromId.
Differential Revision: https://reviews.llvm.org/D74605
Only MCAsmStreamer (assembly output) needs to keep names of temporary labels created by
MCContext::createTempSymbol().
This change made the rL236642 optimization available for cc2as and
probably some other users.
This eliminates a behavior difference between llvm-mc -filetype=obj and cc1as, which caused
https://reviews.llvm.org/D74006#1890487
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75097
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.
This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've update all in-tree targets to connect their chain through their lowering code.
Differential Revision: https://reviews.llvm.org/D75132
Unlike what I claimed in my previous commit. The caching is
actually not NFC on PHIs.
When we put a big enough max depth, we end up simulating loops.
The cache is effectively cutting the simulation short and we
get less information as a result.
E.g.,
```
v0 = G_CONSTANT i8 0xC0
jump
v1 = G_PHI i8 v0, v2
v2 = G_LSHR i8 v1, 1
```
Let say we want the known bits of v1.
- With cache:
Set v1 cache to we know nothing
v1 is v0 & v2
v0 gives us 0xC0
v2 gives us known bits of v1 >> 1
v1 is in the cache
=> v1 is 0, thus v2 is 0x80
Finally v1 is v0 & v2 => 0x80
- Without cache and enough depth to do two iteration of the loop:
v1 is v0 & v2
v0 gives us 0xC0
v2 gives us known bits of v1 >> 1
v1 is v0 & v2
v0 is 0xC0
v2 is v1 >> 1
Reach the max depth for v1...
unwinding
v1 is know nothing
v2 is 0x80
v0 is 0xC0
v1 is 0x80
v2 is 0xC0
v0 is 0xC0
v1 is 0xC0
Thus now v1 is 0xC0 instead of 0x80.
I've added a unittest demonstrating that.
NFC
Add a dump method that recursively prints an instruction and all
the instructions defining its operands and so on.
This is helpful when looking at combiner issue.
NFC
Differential Revision: https://reviews.llvm.org/D75094
MachineVerifier still takes 45-50% of total compile time with
-verify-machineinstrs, with calcRegsPassed dataflow taking ~50-60% of
MachineVerifier.
The majority of that time is spent in BBInfo::addPassed, mostly within
DenseSet implementing the sets the dataflow is operating over.
In particular, 1/4 of that DenseSet time is spent just iterating over it
(operator++), 40-50% on insertions, and most of the rest in ::count.
Given that, we're implementing custom sets just for this analysis here,
focusing on cheap insertions and O(n) iteration time (as opposed to
O(U), where U is the universe).
As it's based _mostly_ on BitVector for sparse and SmallVector for
dense, it may remotely resemble SparseSet. The difference is, our
solution is a lot less clever, doesn't have constant time `clear` that
we won't use anyway as reusing these sets across analyses is cumbersome,
and thus more space efficient and safer (got a resizable Universe and a
fallback to DenseSet for sparse if it gets too big).
With this patch MachineVerifier gets ~15-20% faster, its contribution to
total compile time drops from 45-50% to ~35%, while contribution of
calcRegsPassed to MachineVerifier drops from 50-60% to ~35% as well.
calcRegsPassed itself gets another 2x faster here.
All measured on a large suite of shaders targeting a number of GPUs.
Reviewers: bogner, stoklund, rudkx, qcolombet
Reviewed By: rudkx
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75033
Summary:
Terminators in LLVM aren't prohibited from returning values. This means that
the "callbr" instruction, which is used for "asm goto", can support "asm goto
with outputs."
This patch removes all restrictions against "callbr" returning values. The
heavy lifting is done by the code generator. The "INLINEASM_BR" instruction's
a terminator, and the code generator doesn't allow non-terminator instructions
after a terminator. In order to correctly model the feature, we need to copy
outputs from "INLINEASM_BR" into virtual registers. Of course, those copies
aren't terminators.
To get around this issue, we split the block containing the "INLINEASM_BR"
right before the "COPY" instructions. This results in two cheats:
- Any physical registers defined by "INLINEASM_BR" need to be marked as
live-in into the block with the "COPY" instructions. This violates an
assumption that physical registers aren't marked as "live-in" until after
register allocation. But it seems as if the live-in information only
needs to be correct after register allocation. So we're able to get away
with this.
- The indirect branches from the "INLINEASM_BR" are moved to the "COPY"
block. This is to satisfy PHI nodes.
I've been told that MLIR can support this handily, but until we're able to
use it, we'll have to stick with the above.
Reviewers: jyknight, nickdesaulniers, hfinkel, MaskRay, lattner
Reviewed By: nickdesaulniers, MaskRay, lattner
Subscribers: rriddle, qcolombet, jdoerfert, MatzeB, echristo, MaskRay, xbolva00, aaron.ballman, cfe-commits, JonChesterfield, hiraditya, llvm-commits, rnk, craig.topper
Tags: #llvm, #clang
Differential Revision: https://reviews.llvm.org/D69868
Changes the handling of odd breakdowns, and avoids using
G_EXTRACT/G_INSERT. Pad with undef to a wider size, and unmerge. Also
avoid introducing instructions for the fully undef components.
Depending on the target, test suite, pipeline config and perhaps other
factors machine verifier when forced on with -verify-machineinstrs can
increase compile time 2-2.5 times over (Release, Asserts On), taking up
~60% of the time. An invaluable tool, it significantly slows down
machine verifier-enabled testing.
Nearly 75% of its time MachineVerifier spends in the calcRegsPassed
method. It's a classic forward dataflow analysis executed over sets, but
visiting MBBs in arbitrary order. We switch that to RPO here.
This speeds up MachineVerifier by about 35%, decreasing the overall
compile time with -verify-machineinstrs by 20-25% or so.
calcRegsPassed itself gets 2x faster here.
All measured on a large suite of shaders targeting a number of GPUs.
Reviewers: bogner, stoklund, rudkx, qcolombet
Reviewed By: bogner
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75032
This is the second patch as part of https://bugs.llvm.org/show_bug.cgi?id=36544
Merging in the ConstantSDNode variant of FoldConstantArithmetic. After this, I will begin merging in FoldConstantVectorArithmetic
I've ensured this patch can build & pass all lit tests in Windows and Linux environments.
Patch by @justice_adams (Justice Adams)
Differential Revision: https://reviews.llvm.org/D74881
This adds infrastructure to print and parse MIR MachineOperand comments.
The motivation for the ARM backend is to print condition code names instead of
magic constants that are difficult to read (for human beings). For example,
instead of this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14, $noreg
t2Bcc %bb.4, 0, killed $cpsr
we now print this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14 /* CC::always */, $noreg
t2Bcc %bb.4, 0 /* CC:eq */, killed $cpsr
This shows that MachineOperand comments are enclosed between /* and */. In this
example, the EOR instruction is not conditionally executed (i.e. it is "always
executed"), which is encoded by the 14 immediate machine operand. Thus, now
this machine operand has /* CC::always */ as a comment. The 0 on the next
conditional branch instruction represents the equal condition code, thus now
this operand has /* CC:eq */ as a comment.
As it is a comment, the MI lexer/parser completely ignores it. The benefit is
that this keeps the change in the lexer extremely minimal and no target
specific parsing needs to be done. The changes on the MIPrinter side are also
minimal, as there is only one target hooks that is used to create the machine
operand comments.
Differential Revision: https://reviews.llvm.org/D74306
Change the way that we remove the redundant iteration count code in
the presence of IT blocks. collectLocalKilledOperands has been
introduced to scan an instructions operands, collecting the killed
instructions and then visiting them too. This is used to delete the
code in the preheader which calculates the iteration count. We also
track any IT blocks within the preheader and, if we remove all the
instructions from the IT block, we also remove the IT instruction.
isSafeToRemove is used to remove any redundant uses of the iteration
count within the loop body.
Differential Revision: https://reviews.llvm.org/D74975
Summary:
This patch adds intrinsics and ISelDAG nodes for signed
and unsigned fixed-point division:
```
llvm.sdiv.fix.sat.*
llvm.udiv.fix.sat.*
```
These intrinsics perform scaled, saturating division
on two integers or vectors of integers. They are
required for the implementation of the Embedded-C
fixed-point arithmetic in Clang.
Reviewers: bjope, leonardchan, craig.topper
Subscribers: hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71550
Summary:
The type used to represent functional units in MC is
'unsigned', which is 32 bits wide. This is currently
not a problem in any upstream target as no one seems
to have hit the limit on this yet, but in our
downstream one, we need to define more than 32
functional units.
Increasing the size does not seem to cause a huge
size increase in the binary (an llc debug build went
from 1366497672 to 1366523984, a difference of 26k),
so perhaps it would be acceptable to have this patch
applied upstream as well.
Subscribers: hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71210
This version fixes a buildbot failure cause by picking the wrong insert
point for XORs. We cannot pick the XOR binary operator as insert point,
as it is not guaranteed that both input operands for the overflow
intrinsic are defined before it.
This reverts the revert commit
c7fc0e5da6.
A question about this behavior came up on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-February/139003.html
...and as part of backend improvements in D73978.
We decided not to implement a more general change that would have
folded any FP binop with nearly arbitrary constant + undef operand
to undef because that is not theoretically correct (even if it is
practically correct).
This is the SDAG-equivalent to the IR change in D74713.
This patch adds a cache that is valid only for the duration of a call
to getKnownBits. With such short lived cache we avoid all the problems
of cache invalidation while still getting the benefits of reusing
the information we already computed.
This cache is useful whenever an instruction occurs more than once
in a chain of computation.
E.g.,
v0 = G_ADD v1, v2
v3 = G_ADD v0, v1
Previously we would compute the known bits for:
v1, v2, v0, then v1 again and finally v3.
With the patch, now we won't have to recompute v1 again.
NFC
Summary:
This prevents BFI queries on new blocks (from
MachineSinking::GetAllSortedSuccessors) and fixes a bunch of assert failures
under -check-bfi-unknown-block-queries=true.
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74511
This changes the SimplifyLibCalls utility to accept an IRBuilderBase,
which allows us to pass through the IRBuilder used by InstCombine.
This will ensure that new instructions get added to the worklist.
The annotated test-case drops from 4 to 2 InstCombine iterations thanks
to this.
To achieve this, I'm adding an IRBuilderBase::OperandBundlesGuard,
which is basically the same as the existing InsertPointGuard and
FastMathFlagsGuard, but for operand bundles. Also add a
setDefaultOperandBundles() method so these can be set outside the
constructor.
Differential Revision: https://reviews.llvm.org/D74792
Use the SelectionDAG::getValidShiftAmountConstant helper to get const/constsplat shift amounts, which allows us to drop the out of range shift amount early-out.
First step towards better non-uniform shift amount support in SimplifyDemandedBits.
I believe this was carried over from getELFKindForNamedSection since
the wasm backend originally used ELF object writing as a template.
Differential Revision: https://reviews.llvm.org/D74565
This includes both GEPs where the indexed type is a scalable vector, and
GEPs where the result type is a scalable vector.
Differential Revision: https://reviews.llvm.org/D73602
When analyzing PHIs, we gather the known bits for every operand and
merge them together to get the known bits of the result of the PHI.
It is not unusual that merging the information leads to know nothing
on the result (e.g., phi a: i8 3, b: i8 unknown, ..., after looking at the
second argument we know we will know nothing on the result), thus, as
soon as we reach that state, stop analyzing the following operand (i.e.,
on the previous example, we won't process anything after looking at `b`).
This improves compile time in particular with PHIs with a large number
of operands.
NFC.
Similar to what we already do with SimplifyDemandedVectorElts, call SimplifyDemandedBits across all the extracted elements of the source vector, treating it as single use.
There's a minor regression in store-weird-sizes.ll which will be addressed in an upcoming SimplifyDemandedBits patch.
Summary:
If the programmer adds static profile data to a branch---i.e. uses
"__builtin_expect()" or similar---then we should honor it. Otherwise,
"__builtin_expect()" is ignored in crucial situations. So we trust that
the programmer knows what they're doing until proven wrong.
Subscribers: hiraditya, JDevlieghere, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74809
Instcombine folds (a + b <u a) to (a ^ -1 <u b) and that does not match
the expected pattern in CodeGenPerpare via UAddWithOverflow.
This causes a regression over Clang 7 on both X86 and AArch64:
https://gcc.godbolt.org/z/juhXYV
This patch extends UAddWithOverflow to also catch the XOR case, if the
XOR is only used in the ICMP. This covers just a single case, but I'd
like to make sure I am not missing anything before tackling the other
cases.
Reviewers: nikic, RKSimon, lebedev.ri, spatel
Reviewed By: nikic, lebedev.ri
Differential Revision: https://reviews.llvm.org/D74228
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.
This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.
Reviewers: nikic, RKSimon, lebedev.ri, spatel
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D74722
https://reviews.llvm.org/D67133
While investigating some non determinism (CSE doesn't produce wrong
code, it just doesn't CSE some times) in GISel CSE on an out of tree
target, I realized that the core issue was that there were lots of code
that mutates (setReg, setRegClass etc), but doesn't notify observers
(CSE in this case but this could be any other observer). In order to
make the Observer be available in various parts of code and to avoid
having to thread it through various API, the MachineFunction now has the
observer as field. This allows it to be easily used in helper functions
such as constrainOperandRegClass.
Also added some invariant verification method in CSEInfo which can
catch these issues (when CSE is enabled).
Summary:
Unlike normal calls, call_indirects have immediate arguments that
caused a MachineVerifier failure without a small tweak to loosen the
verifier's requirements for variadicOpsAreDefs instructions.
One nice thing about the new call_indirects is that they do not need
to participate in the PCALL_INDIRECT mechanism because their post-isel
hook handles moving the function pointer argument and adding the flags
and typeindex arguments itself.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74191
This reverts commit 649aba93a2, now that
the approach started there has been shown to be workable in the patch
series culminating in https://reviews.llvm.org/D74192.
Summary:
Making `Scale` a `TypeSize` in AArch64InstrInfo::getMemOpInfo,
has the effect that all places where this information is used
(notably, TargetInstrInfo::getMemOperandWithOffset) will need
to consider Scale - and derived, Offset - possibly being scalable.
This patch adds a new operand `bool &OffsetIsScalable` to
TargetInstrInfo::getMemOperandWithOffset and fixes up all
the places where this function is used, to consider the
offset possibly being scalable.
In most cases, this means bailing out because the algorithm does not
(or cannot) support scalable offsets in places where it does some
form of alias checking for example.
Reviewers: rovka, efriedma, kristof.beyls
Reviewed By: efriedma
Subscribers: wuzish, kerbowa, MatzeB, arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, javed.absar, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72758
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
Summary:
Backends should fold the subtraction into the comparison, but not all
seem to. Moreover, on targets where pointers are not integers, such as
CHERI, an integer subtraction is not appropriate. Instead we should just
compare the two pointers directly, as this should work everywhere and
potentially generate more efficient code.
Reviewers: bogner, lebedev.ri, efriedma, t.p.northover, uweigand, sunfish
Reviewed By: lebedev.ri
Subscribers: dschuff, sbc100, arichardson, jgravelle-google, hiraditya, aheejin, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74454
For a file in WebKit, this brings the time spent in LiveDebugValues down
from 16 minutes to 2 minutes. The reduction comes from iterating the set
of open variable locations just once in transferRegisterDef. Post-patch,
the most expensive item inside of transferRegisterDef is a call to
VarLoc::isDescribedByReg, which we have to do.
Testing: I built LNT using the Os-g cmake cache with & without this
patch, then diffed the object files to verify there was no binary diff.
rdar://59446577
Differential Revision: https://reviews.llvm.org/D74633
These are going to be useful in TargetLowering::SimplifyDemandedBits, so expose these helpers outside of SelectionDAG.cpp
Also add an getValidShiftAmountConstant early-out to getValidMinimumShiftAmountConstant/getValidMaximumShiftAmountConstant so we can use them for scalar cases as well.
Produce an unmerge to a narrower type and introduce a narrower shift
if needed. I wasn't sure if there was a better way to parameterize the
target's preferred shift type for the GICombineRule, so manually call
the combine helper.
Summary:
This patch implements the part of the calling convention
where SVE Vectors are passed by reference. This means the
caller must allocate stack space for these objects and
pass the address to the callee.
Reviewers: efriedma, rovka, cameron.mcinally, c-rhodes, rengolin
Reviewed By: efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71216
This adds another pattern to the combiner for a case that we were not handling
to generate the REV16 instruction for ARM/Thumb2 and a bswap+ror on X86.
Differential Revision: https://reviews.llvm.org/D74032
Summary:
When using strict fp, it is required to update the
chain when performing integer type promotion of a
operand to a integer to floating point conversion.
Reviewers: craig.topper, john.brawn
Reviewed By: craig.topper
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74597
Follow-up for D74006.
When the integrated assembler is used, we use SHF_LINK_ORDER. The
linked-to symbol is part of ELFSectionKey, thus we can omit the unique
ID.
In https://reviews.llvm.org/rG8b737688c21a9755cae14cb9343930e0882164ab I
switched the condition gating the creation of the descriptor symbol from
checking the MCAsmInfo if we need to support descriptors, to if the OS
was AIX. Technically the 2 should be interchangeable: if we are
targeting AIX then we need to emit XCOFF object files, and the MCAsmInfo
must return true for needing function descriptors.
This doesn't account for lit test with runsteps that only set the arch.
Eg: test/CodeGen/XCore/section-name.ll
which when run natively on AIX we end up with a target xcore-ibm-aix and
needFunctionDescriptors is false.
This patch reverts to using the MCAsmInfo and adds an assert that the
target OS must be AIX since that is the only target using the descriptor
hook.
Differential Revision: https://reviews.llvm.org/D74622
This is more or less directly ported from the AMDGPU custom lowering
for FP_TO_FP16. I made a few minor fixups (using G_UNMERGE_VALUES
instead of creating shift/trunc to extract the two halves, and zexting
an inverted compare instead of select_cc).
This also does not include the fast math expansion the DAG which
converts to f32 and then to f16. I think that belongs in a
pre-legalize combine instead.
Like COPY instructions explained in D70616, we don't check the constraints
when combining G_UNMERGE_VALUES. Use the same logic used in D70616 to check
if registers can be replaced, or a COPY instruction needs to be built.
https://reviews.llvm.org/D70564
The goal of this patch is to maximize CPU utilization on multi-socket or high core count systems, so that parallel computations such as LLD/ThinLTO can use all hardware threads in the system. Before this patch, on Windows, a maximum of 64 hardware threads could be used at most, in some cases dispatched only on one CPU socket.
== Background ==
Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU, that is, an hyper-thread if SMT is active; a core otherwise. There's a limit of 32-bit processors on older 32-bit versions of Windows, which later was raised to 64-processors with 64-bit versions of Windows. This limit comes from the affinity mask, which historically is represented by the sizeof(void*). Consequently, the concept of "processor groups" was introduced for dealing with systems with more than 64 hyper-threads.
By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to programmatically enable it, by assigning threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries; one can only specify a "preferred" group on start-up, but the application is free to allocate more groups if it wants to.
This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P (64-cores, 128-hyperthreads) are projected by the OS as two (2) "processor groups". This means that by default, an application can only use half of the cores. This situation could only get worse in the years to come, as dies with more cores will appear on the market.
== The problem ==
The heavyweight_hardware_concurrency() API was introduced so that only *one hardware thread per core* was used. Once that API returns, that original intention is lost, only the number of threads is retained. Consider a situation, on Windows, where the system has 2 CPU sockets, 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std:🧵:hardware_concurrency() -- which can only return processors from the current "processor group".
== The changes in this patch ==
To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The number of threads to use is deferred as late as possible, until the moment where the std::threads are created (ThreadPool in the case of ThinLTO).
When using hardware_concurrency(), setting ThreadCount to 0 now means to use all the possible hardware CPU (SMT) threads. Providing a ThreadCount above to the maximum number of threads will have no effect, the maximum will be used instead.
The heavyweight_hardware_concurrency() is similar to hardware_concurrency(), except that only one thread per hardware *core* will be used.
When LLVM_ENABLE_THREADS is OFF, the threading APIs will always return 1, to ensure any caller loops will be exercised at least once.
Differential Revision: https://reviews.llvm.org/D71775
replaceDbgDeclare is used to update the descriptions of stack variables
when they are moved (e.g. by ASan or SafeStack). A side effect of
replaceDbgDeclare is that it moves dbg.declares around in the
instruction stream (typically by hoisting them into the entry block).
This behavior was introduced in llvm/r227544 to fix an assertion failure
(llvm.org/PR22386), but no longer appears to be necessary.
Hoisting a dbg.declare generally does not create problems. Usually,
dbg.declare either describes an argument or an alloca in the entry
block, and backends have special handling to emit locations for these.
In optimized builds, LowerDbgDeclare places dbg.values in the right
spots regardless of where the dbg.declare is. And no one uses
replaceDbgDeclare to handle things like VLAs.
However, there doesn't seem to be a positive case for moving
dbg.declares around anymore, and this reordering can get in the way of
understanding other bugs. I propose getting rid of it.
Testing: stage2 RelWithDebInfo sanitized build, check-llvm
rdar://59397340
Differential Revision: https://reviews.llvm.org/D74517
Patchable statepoint is lowered into sequence of nops, so zeroed call target
should not be on register. It is better to use getTargetConstant instead
of getConstant to select zero constant for call target.
Reviewers: reames
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D74465
Current tail duplication embedded in MBP duplicates a BB into all or none of its predecessors without too much cost analysis. So sometimes it is duplicated into cold predecessors, and in other cases it may miss the duplication into hot predecessors.
This patch improves tail duplication in 3 aspects:
A successor can be duplicated into part of its predecessors.
A more fine-grained benefit analysis, combined with 1, now a successor is duplicated into hot predecessors only.
If a successor can't be duplicated into one predecessor, it doesn't impact the duplication into other predecessors.
Differential Revision: https://reviews.llvm.org/D73387
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.
I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.
NFC.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74482
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.
This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.
It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.
Differential Revision: https://reviews.llvm.org/D74221
Summary:
Right now the alignment of the lower half of a store is computed as
align/2, which fails for unaligned stores (align = 1), and is overly
pessimitic for, e.g. a 8 byte store aligned to 4 bytes.
Fixes PR44851
Fixes PR44877
Reviewers: gchatelet, spatel, lebedev.ri
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74311
This patch enables the debug entry values feature.
- Remove the (CC1) experimental -femit-debug-entry-values option
- Enable it for x86, arm and aarch64 targets
- Resolve the test failures
- Leave the llc experimental option for targets that do not
support the CallSiteInfo yet
Differential Revision: https://reviews.llvm.org/D73534
Summary:
The method attempts to find loads that can be legally clustered by
looking for loads consuming the same chain glue token.
However, the old code looks at _all_ users of values produced by the
chain node -- including uses of the loaded/returned value of volatile
loads or atomics. This could lead to circular dependencies which then
failed during scheduling.
With this change, we filter out users by getResNo, i.e. by which
SDValue value they use, to ensure that we only look at users of the
chain glue token.
This appears to be a rather old bug, which is perhaps surprising.
However, the test case is actually quite fragile (i.e., it is hidden
by fairly small changes), and the test _must_ use volatile loads for
the bug to manifest.
Reviewers: arsenm, bogner, craig.topper, foad
Subscribers: MatzeB, jvesely, wdng, hiraditya, javed.absar, jfb, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74253
This adds a strict version of FP16_TO_FP and FP_TO_FP16 and uses
them to implement soft promotion for the half type. This is
enough to provide basic support for __fp16 with strictfp.
Add the necessary X86 support to use VCVTPS2PH/VCVTPH2PS when F16C
is enabled.
Instructions marked as FrameSetup do not cause requestLabelAfterInsn to
be called and so no such label is generated. Call instructions which
require call site entries to be generated require this label to be
present in order to calculate the return PC offset/address, but the
check for whether the call instruction is marked as FrameSetup was not
present.
Therefore in the case where a call instruction is marked as FrameSetup,
an assertion failure occurs if a call site entry is to be generated.
This is the case with RISC-V's implementation of save/restore via
library calls.
Differential Revision: https://reviews.llvm.org/D71593
Fixup the UserValue methods to use FragmentInfo instead of DIExpression because
the DIExpression is only ever used to get the to get the FragmentInfo. The
DIExpression is meaningless in the UserValue class because each definition point
added to a UserValue may have a unique DIExpression.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D74057
Rename the class DbgValueLocation to DbgVariableValue and instances from Loc to
DbgValue. These names better express the new semantics introduced in D74053.
The class previously represented a { Location } only. It now represents a
{ Location, DIExpression } pair which together describe a value.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D74055
LiveDebugVariables uses interval maps to explicitly represent DBG_VALUE
intervals. DBG_VALUEs are filtered into an interval map based on their {
Variable, DIExpression }. The interval map will coalesce adjacent entries that
use the same { Location }. Under this model, DBG_VALUEs which refer to the same
bits of the same variable will be filtered into different interval maps if they
have different DIExpressions which means the original intervals will not be
properly preserved.
This patch fixes the problem by using { Variable, Fragment } to filter the
DBG_VALUEs into maps, and coalesces adjacent entries iff they have the same
{ Location, DIExpression } pair.
The solution is not perfect because we see the similar issues appear when
partially overlapping fragments are encountered, but is far simpler than a
complete solution (i.e. D70121).
Fixes: pr41992, pr43957
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D74053
A downstream test exposed a simple logic bug with the manual pointer
stripping code, fix that by just using stripPointerCasts() on the value.
I don't think there's a way to expose this issue upstream.
SUMMARY:
The patch is enable to support Mergeable2ByteCString and Mergeable4ByteCString
Reviewers: daltenty
Subscribers: wuzish, nemanjai, hiraditya
Differential Revision: https://reviews.llvm.org/D74164
Add a simplification to fuse a manual vector extract with shifts and
truncate into a bitcast.
Unpacking and packing values into vectors is only optimized with
extractelement instructions, not when manually unpacked using shifts
and truncates.
This patch simplifies shifts and truncates into a bitcast if possible.
Simplify (build_vec (trunc $1)
(trunc (srl $1 width))
(trunc (srl $1 (2 * width))) ...)
to (bitcast $1)
Differential Revision: https://reviews.llvm.org/D73892
Use the isCandidateForCallSiteEntry().
This should mostly be an NFC, but there are some parts ensuring
the moveCallSiteInfo() and copyCallSiteInfo() operate with call site
entry candidates (both Src and Dest should be the call site entry
candidates).
Differential Revision: https://reviews.llvm.org/D74122
Remove code from LegalizeTypes that allowed this to work.
We were already using BUILD_PAIR for this in some places so this
standardizes on a single way to do this.