Adds the global (cl::opt) GVNOption enable-load-in-loop-pre in order
to control whether the optimization will be performed if the load
is part of a loop.
Patch by Hendrik Greving!
Differential Revision: https://reviews.llvm.org/D73804
This code matches (zext (trunc (setcc_carry))) -> (and (setcc_carry), 1)
but the code never checks what type we're truncating to. An and
mask of 1 would only make sense if the trunc was to MVT::i1, but
we didn't check for that.
I believe this code is a leftover from when i1 was a legal type.
Our normal lowering for ISD::SETCC uses X86ISD::SUB to enable
CSE unless the RHS is 0. optimizeCompareInstr called by the peephole
pass can turn subs with unused results into cmps to clean this up.
This commit makes other places that create X86ISD::CMP have the
same behavior.
We were creating two with different operand orders, and then only
using one of them.
Instead just swap the operands when needed and create a single node.
This code was incorrectly emitting extra bytes into arbitrary parts of
the object file when it was meant to be hashing them to compute the DWO
ID.
Follow-up patches will refactor this API somewhat to hopefully make such
bugs harder to introduce.
Broadwell was missing half the gather instructions. Both models
had some mixups in the resource costs and number of uops.
I've updated here based on what I think the original IACA source
says, with some cross-checking against the microcode.
I'm not sure about latency as the IACA source I have doesn't have
that information. So I'm using the latency from uops.info.
I plan to update Skylake models as well, but I'll do that in a
separate patch.
Differential Revision: https://reviews.llvm.org/D73844
This ports the existing case for G_XOR from `getTestBitOperand` in
AArch64ISelLowering into GlobalISel.
The idea is to flip between TBZ and TBNZ while walking through G_XORs.
Let's say we have
```
tbz (xor x, c), b
```
Let's say the `b`-th bit in `c` is 1. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 0.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 1.
So, then
```
tbz (xor x, c), b == tbnz x, b
```
Let's say the `b`-th bit in `c` is 0. Then
- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 1.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 0.
So, then
```
tbz (xor x, c), b == tbz x, b
```
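As a quick sanity check (not part of the patch), the identity can be brute-forced for 8-bit values with a small standalone C++ snippet; `tbz`/`tbnz` here are just helpers modelling 'branch if bit is (non-)zero':
```cpp
#include <cassert>
#include <cstdint>

// tbz: bit b of V is zero; tbnz: bit b of V is non-zero.
static bool tbz(uint8_t V, unsigned B) { return ((V >> B) & 1) == 0; }
static bool tbnz(uint8_t V, unsigned B) { return ((V >> B) & 1) != 0; }

int main() {
  for (unsigned X = 0; X < 256; ++X)
    for (unsigned C = 0; C < 256; ++C)
      for (unsigned B = 0; B < 8; ++B) {
        uint8_t Xor = uint8_t(X) ^ uint8_t(C);
        if ((C >> B) & 1)
          assert(tbz(Xor, B) == tbnz(uint8_t(X), B)); // c's bit set: sense flips
        else
          assert(tbz(Xor, B) == tbz(uint8_t(X), B));  // c's bit clear: unchanged
      }
}
```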
Differential Revision: https://reviews.llvm.org/D73929
This implements the following optimization:
```
(tbz (shl x, c), b) -> (tbz x, b-c)
```
This appears in `getTestBitOperand` in AArch64ISelLowering.cpp.
If we test bit `b` of `shl x, c`, we can fold away the `shl` by looking `c` bits
to the right of `b` in `x` when this fits in the type. So, we can just test the
`b-c`th bit.
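A minimal standalone check of that identity (illustrative only, not from the patch); note that it only holds when `b >= c`, so that `b-c` is still a valid bit index:
```cpp
#include <cassert>
#include <cstdint>

static bool bit(uint8_t V, unsigned B) { return (V >> B) & 1; }

int main() {
  for (unsigned X = 0; X < 256; ++X)
    for (unsigned C = 0; C < 8; ++C)
      for (unsigned B = C; B < 8; ++B) // the fold only applies when b >= c
        assert(bit(uint8_t(X << C), B) == bit(uint8_t(X), B - C));
}
```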
Differential Revision: https://reviews.llvm.org/D73924
Summary:
The AIX assembler .space directive can't take a second non-zero argument to fill
with. But LLVM's emitFill currently assumes it can. We add a flag to the AsmInfo
to check whether non-zero fill is supported, and if it isn't we just splat the
fill value with .byte directives.
Reviewers: stevewan, sfertile, DiggerLin, jasonliu, Xiangling_L
Reviewed By: jasonliu
Subscribers: Xiangling_L, wuzish, nemanjai, hiraditya, kbarton, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73554
Given
```
tb(n)z (and x, m), b
```
Where the `b`-th bit of `m` is 1,
```
tb(n)z (and x, m), b == tb(n)z x, b
```
So, we can walk past a `G_AND` in this case.
Also add test/CodeGen/AArch64/GlobalISel/opt-fold-and-tbz-tbnz.mir to test this.
Differential Revision: https://reviews.llvm.org/D73790
convertPtrAddToAdd improved overall code size and quality by a significant amount,
but on -O0 we generate some cross-class copies due to the fact that we emitted
G_PTRTOINT and G_INTTOPTR around the G_ADD. Unfortunately at -O0 we don't run any
register coalescing, so these cross class copies end up escaping as moves, and
we ended up regressing 3 benchmarks on CTMark (though still a winner overall).
This patch changes the lowering to instead directly emit the G_ADD into the
destination register, and then forcibly changes the dest LLT to s64 from p0. This
should be ok, as all uses of the register should now be selected and therefore
the LLT doesn't matter for the users. It does however matter for the importer
patterns, which will fail to select a G_ADD if there's a p0 LLT.
I'm not able to get rid of the G_PTRTOINT on the source yet however. We can't
use the same trick of breaking the type system since that could break the
selection of the defining instruction. Thus with -O0 we still end up with a
cross class copy on source.
Code size improvements on -O0:
Program baseline new diff
test-suite :: CTMark/Bullet/bullet.test 965520 949164 -1.7%
test-suite...TMark/7zip/7zip-benchmark.test 1069456 1052600 -1.6%
test-suite...ark/tramp3d-v4/tramp3d-v4.test 1213692 1199804 -1.1%
test-suite...:: CTMark/sqlite3/sqlite3.test 421680 419736 -0.5%
test-suite...-typeset/consumer-typeset.test 837076 833380 -0.4%
test-suite :: CTMark/lencod/lencod.test 799712 796976 -0.3%
test-suite...:: CTMark/ClamAV/clamscan.test 688264 686132 -0.3%
test-suite :: CTMark/kimwitu++/kc.test 1002344 999648 -0.3%
test-suite...Mark/mafft/pairlocalalign.test 422296 421768 -0.1%
test-suite :: CTMark/SPASS/SPASS.test 656792 656532 -0.0%
Geomean difference -0.6%
Differential Revision: https://reviews.llvm.org/D73910
Start using a new strategy with a combination of merge and unmerges.
This allows scalarizing before lowering, which in cases like
<2 x s128> avoids producing giant illegal shifts.
RegAllocGreedy uses a fairly compile-time-intensive splitting heuristic
called region splitting. This heuristic is disabled via another heuristic
when it is likely not to be worth the compile time. The only way
to control this other heuristic was via a command line option (huge-size-for-split).
This commit gives more control over this heuristic by making it overridable
by the target using a target hook in TargetRegisterInfo called
shouldRegionSplitForVirtReg.
The default implementation of this hook keeps the heuristic as it was
before this patch.
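A hedged sketch of how a target might override the new hook; the parameter types and the threshold are assumptions here, not taken from the patch:
```cpp
// Sketch only: opt out of the expensive region-splitting heuristic for very
// large live ranges to save compile time (threshold is made up).
bool MyTargetRegisterInfo::shouldRegionSplitForVirtReg(
    const MachineFunction &MF, const LiveInterval &VirtReg) const {
  constexpr unsigned HugeLiveRangeThreshold = 5000; // hypothetical value
  return VirtReg.size() < HugeLiveRangeThreshold;
}
```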
ClangBuildAnalyzer results show that a lot of time is spent
instantiating AnalysisManager::getResultImpl across the code base:
**** Templates that took longest to instantiate:
50445 ms: llvm::AnalysisManager<llvm::Function>::getResultImpl (412 times, avg 122 ms)
47797 ms: llvm::AnalysisManager<llvm::Function>::getResult<llvm::TargetLibraryAnalysis> (389 times, avg 122 ms)
46894 ms: std::tie<const unsigned long long, const bool> (2452 times, avg 19 ms)
43851 ms: llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096, 4096>::Allocate (3228 times, avg 13 ms)
33911 ms: std::tie<const unsigned int, const unsigned int, const unsigned int, const unsigned int> (897 times, avg 37 ms)
33854 ms: std::tie<const unsigned long long, const unsigned long long> (1897 times, avg 17 ms)
27886 ms: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (11156 times, avg 2 ms)
I mentioned this result to @chandlerc, and he suggested this direction.
AnalysisManager is already explicitly instantiated, and getResultImpl
doesn't need to be inlined. Move the definition to an Impl header, and
include that header in files that explicitly instantiate
AnalysisManager. There are only four (real) IR units:
- function
- module
- loop
- cgscc
Looking at a specific transform (ArgumentPromotion.cpp), here are three
compilations before & after this change:
BEFORE:
$ for i in $(seq 3) ; do ./ccit.bat ; done
peak memory: 258.15MB
real: 0m6.297s
peak memory: 257.54MB
real: 0m5.906s
peak memory: 257.47MB
real: 0m6.219s
AFTER:
$ for i in $(seq 3) ; do ./ccit.bat ; done
peak memory: 235.35MB
real: 0m5.454s
peak memory: 234.72MB
real: 0m5.235s
peak memory: 234.39MB
real: 0m5.469s
The 20MB of memory saved seems real, and the time improvement looks
real as well.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D73817
Summary:
The method appendLoopsToWorklist is duplicated in LoopUnroll and in the
LoopPassManager as an internal method. Make it a utility.
Reviewers: dmgreen, chandlerc, fedor.sergeev, yamauchi
Subscribers: mehdi_amini, hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73569
Summary:
* Most of the simplifications in SimplifyShuffleVectorInst depend on the
concrete value of, or the length of, the mask vector. For scalable
vectors, this cannot be known at compile time.
** For these cases, detect whether the vector is scalable before attempting
the transformation.
* The functions ShuffleVectorInst::getMaskValue and
ShuffleVectorInst::getShuffleMask access the value of the constant mask.
However, since the length of the mask is unknown at compile time, these
functions do not work for scalable vectors. Add asserts to ensure that
the input mask is not scalable.
Reviewers: efriedma, sdesmalen, apazos, chrisj, huihuiz
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73555
Adds a replaceOperand() helper, which is like Instruction.setOperand()
but adds the old operand to the worklist. This reduces the amount of
missing or incorrect worklist management.
This only applies the helper to a relatively small subset of
setOperand() calls in InstCombine, namely those of the pattern
`I.setOperand(); return &I;`, where it is most obviously applicable.
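A rough sketch of what such a helper can look like; the exact worklist method and member names are approximations, not the patch itself:
```cpp
// Approximate sketch: setOperand() plus re-queuing the old operand so that it
// is revisited (it may now be dead or further simplifiable).
Instruction *InstCombiner::replaceOperand(Instruction &I, unsigned OpNum,
                                          Value *V) {
  if (auto *OldOp = dyn_cast<Instruction>(I.getOperand(OpNum)))
    Worklist.add(OldOp); // keep the replaced operand on the worklist
  I.setOperand(OpNum, V);
  return &I;
}
```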
Differential Revision: https://reviews.llvm.org/D73803
This renames Worklist.AddDeferred() to Worklist.add() and
Worklist.Add() to Worklist.push(). The intention here is that
Worklist.add() should be the go-to method for explicit worklist
management, while the raw Worklist.push() is mostly for
InstCombine internals. I will then migrate uses of Worklist.push()
to Worklist.add() in followup changes.
As suggested by spatel on D73411 I'm also changing the remaining
method names to lowercase first character, in line with current
coding standards.
Differential Revision: https://reviews.llvm.org/D73745
Followup to D73135. If the target doesn't have hard float (default
for ARM), then we assert when trying to soften the result of vector
reduction intrinsics. This patch marks these for expansion as well.
(A bit odd to use vectors on a target without hard float ... but
that's where you end up if you expose target-independent vector types.)
Differential Revision: https://reviews.llvm.org/D73854
Summary:
A recent change to enable more importing of global variables with
references exposed some efficiency issues with export computation.
See D73724 for more information and detailed analysis.
The first was specific to variable importing. The code was marking every
copy of a referenced value (from possibly thousands of files in the case
of linkonce_odr) as exported, and we only need to mark the copy in the
module containing the variable def being imported as exported. The
reason is that this is tracking what values are newly exported as a
result of importing. Anything that was defined in another module and
simply used in the exporting module is already exported, and would have
been identified by the caller (e.g. the LTO API implementations).
The second issue is that the code was re-adding previously exported
values (along with all references). It is easy to identify when a
variable was already imported into the same module (via the
import list insert call return value), and we already did this for
function importing. However, what we weren't doing for either function
or variable importing was avoiding a re-insertion when it was previously
exported into a different importing module. The reason we couldn't do
this is that there was no way of telling from the export list whether a
value was previously inserted there because its definition was exported
(in which case we already marked all its references as exported) or
because it was referenced by another exported value (in which case we
haven't yet inserted its own references).
To address this we can restructure the way the export list is
constructed. This patch only adds the actual imported definitions
(variable or function) to the export list for its module during the
import computation. After import computation is complete, where we were
already post-processing the export list we go ahead and add all
references made by those exported values to the export list.
These changes speed up the thin link not only with constant variable
importing enabled, but also without (due to the efficiency improvement
in function importing).
Some thin link user time measurements for one large application, average
of 5 runs:
With constant variable importing enabled:
- without this patch: 479.5s
- with this patch: 74.6s
Without constant variable importing enabled:
- without this patch: 80.6s
- with this patch: 70.3s
Note I have not re-enabled constant variable importing here, as I would
like to do additional compile time measurements with these fixes first.
Reviewers: evgeny777
Subscribers: mehdi_amini, inglorion, hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73851
As detailed on PR43463, this fixes a static analyzer null dereference warning by sinking Changed = true into the if() blocks where the MIB is actually created.
I did a quick check that suggested that one of those if() blocks is always guaranteed to be hit (so we could change it to if-else), but this seems like a safer approach.
Differential Revision: https://reviews.llvm.org/D73883
We have to be careful in SimplifyDemandedBits with loads in case we attempt to combine back to a constant (which then gets turned into a constant pool load again), but we can at least set the upper KnownBits for a ZEXTLoad to zero.
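A hedged sketch of the KnownBits side of this (not the DAG combiner code itself): for a ZEXTLoad of, say, an i8 value into an i32, every bit above bit 7 is known to be zero.
```cpp
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Sketch: known bits for a zero-extending load of MemBits bits into DstBits.
KnownBits knownBitsForZExtLoad(unsigned DstBits, unsigned MemBits) {
  KnownBits Known(DstBits);
  Known.Zero.setBitsFrom(MemBits); // the upper DstBits - MemBits bits are zero
  return Known;
}
```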
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73885
These instructions can set the exception in FPSW. But I
don't think they can change FPCW. So this looks like a typo.
Differential Revision: https://reviews.llvm.org/D73864
Add support for the Master and Critical directives in the OMPIRBuilder. Both make use of a new common interface for emitting inlined OMP regions called `emitInlinedRegion`, which was added in this patch as well.
Also this patch modifies clang to use the new directives when `-fopenmp-enable-irbuilder` commandline option is passed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D72304
bo (splat X), (bo Y, OtherOp) --> bo (splat (bo X, Y)), OtherOp
This patch depends on the splat analysis enhancement in D73549.
See the test with comment:
; Negative test - mismatched splat elements
...as the motivation for that first patch.
The motivating case for reassociating splatted ops is shown in PR42174:
https://bugs.llvm.org/show_bug.cgi?id=42174
In that example, a slight change in order-of-associative math results
in a big difference in IR and codegen. This patch gets all of the
unnecessary shuffles out of the way, but doesn't address the potential
scalarization (see D50992 or D73480 for that).
Differential Revision: https://reviews.llvm.org/D73703
These can be lowered to code sequences using CMPFP and CMPFPE which then get
selected to VCMP and VCMPE. The implementation isn't fully correct, as the chain
operand isn't handled correctly, but resolving that looks like it would involve
changes around FPSCR-handling instructions and how the FPSCR is modelled.
The fp-intrinsics test was already testing some of this but as the entire test
was being XFAILed it wasn't noticed. Un-XFAIL the test and instead leave the
cases where we aren't generating the right instruction sequences as FIXME.
Differential Revision: https://reviews.llvm.org/D73194
Summary:
In big-endian MVE, the simple vector load/store instructions (i.e.
both contiguous and non-widening) don't all store the bytes of a
register to memory in the same order: it matters whether you did a
VSTRB.8, VSTRH.16 or VSTRW.32. Put another way, the in-register
formats of different vector types relate to each other in a different
way from the in-memory formats.
So, if you want to 'bitcast' or 'reinterpret' one vector type as
another, you have to carefully specify which you mean: did you want to
reinterpret the //register// format of one type as that of the other,
or the //memory// format?
The ACLE `vreinterpretq` intrinsics are specified to reinterpret the
register format. But I had implemented them as LLVM IR bitcast, which
is specified for all types as a reinterpretation of the memory format.
So a `vreinterpretq` intrinsic, applied to values already in registers,
would code-generate incorrectly if compiled big-endian: instead of
emitting no code, it would emit a `vrev`.
To fix this, I've introduced a new IR intrinsic to perform a
register-format reinterpretation: `@llvm.arm.mve.vreinterpretq`. It's
implemented by a trivial isel pattern that expects the input in an
MQPR register, and just returns it unchanged.
In the clang codegen, I only emit this new intrinsic where it's
actually needed: I prefer a bitcast wherever it will have the right
effect, because LLVM understands bitcasts better. So we still generate
bitcasts in little-endian mode, and even in big-endian when you're
casting between two vector types with the same lane size.
For testing, I've moved all the codegen tests of vreinterpretq out
into their own file, so that they can have a different set of RUN
lines to check both big- and little-endian.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73786
Summary:
These instructions generate a vector of consecutive elements starting
from a given base value and incrementing by 1, 2, 4 or 8. The `wdup`
versions also wrap the values back to zero when they reach a given
limit value. The instruction updates the scalar base register so that
another use of the same instruction will continue the sequence from
where the previous one left off.
At the IR level, I've represented these instructions as a family of
target-specific intrinsics with two return values (the constructed
vector and the updated base). The user-facing ACLE API provides a set
of intrinsics that throw away the written-back base and another set
that receive it as a pointer so they can update it, plus the usual
predicated versions.
Because the intrinsics return two values (as do the underlying
instructions), the isel has to be done in C++.
This is the first family of MVE intrinsics that use the `imm_1248`
immediate type in the clang Tablegen framework, so naturally, I found
I'd given it the wrong C integer type. Also added some tests of the
check that the immediate has a legal value, because this is the first
time those particular checks have been exercised.
Finally, I also had to fix a bug in MveEmitter which failed an
assertion when I nested two `seq` nodes (the inner one used to extract
the two values from the pair returned by the IR intrinsic, and the
outer one put on by the predication multiclass).
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73357
Summary:
The unpredicated case of this is trivial: the clang codegen just makes
a vector splat of the input, and LLVM isel is already prepared to
handle that. For the predicated version, I've generated a `select`
between the same vector splat and the `inactive` input parameter, and
added new Tablegen isel rules to match that pattern into a predicated
`MVE_VDUP` instruction.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73356
Summary:
A Copy with a source that is zeros is the same as a Set of zeros.
This fixes the invariant that SrcAlign should always be non-null.
Reviewers: courbet
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73791
Summary: There are no counters for individual ports, but this is already
enough to find a lot of issues in the current model (upcoming patch).
Reviewers: dblaikie, gchatelet
Subscribers: hiraditya, tschuett, RKSimon, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72032
Summary:
D68092 introduced a new SIRemoveShortExecBranches optimization pass and
broke some graphics shaders. The problem is that it was removing
branches over KILL pseudo instructions, and the fix is to explicitly
check for that in mustRetainExeczBranch.
Reviewers: critson, arsenm, nhaehnle, cdevadas, hakzsam
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73771
Duplicating instructions can lead to code size increases, but using
a threshold of 3 is good for reducing code size.
Differential Revision: https://reviews.llvm.org/D72916
If all call sites are in `norecurse` functions we can derive `norecurse`
as the ReversePostOrderFunctionAttrsPass does. This should make
ReversePostOrderFunctionAttrsLegacyPass obsolete once the Attributor is
enabled.
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D72017
If we know that all call sites have been processed we can derive an
early fixpoint. The use in this patch is unlikely to trigger right now,
but a follow-up patch will make use of it.
Reviewed By: uenoku, baziotis
Differential Revision: https://reviews.llvm.org/D72016
We only need to call this on floating point comparisons. In this
case these are known to be integer compares. One of them even
has a SUB opcode instead of CMP.
With this patch new trivial edges can be added to an SCC in a CGSCC
pass via the updateCGAndAnalysisManagerForCGSCCPass method. It shares
almost all the code with the existing
updateCGAndAnalysisManagerForFunctionPass method but it implements the
first step towards the TODOs.
This was initially part of D70927.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D72025
If we had `noalias` on an argument the inliner created alias scope
metadata already. However, the call site `noalias` annotation was not
considered. Since the Attributor can derive such call site `noalias`
annotation we should treat them the same as argument annotations.
Reviewed By: hfinkel
Differential Revision: https://reviews.llvm.org/D73528
This is the first of multiple parts to make OpenMP context/trait
handling reusable and generic. This patch was originally part of D71830
but with the unit tests it can be tested independently.
This patch implements an almost complete handling of OpenMP
contexts/traits such that we can reuse most of the logic in Flang
through the OMPContext.{h,cpp} in llvm/Frontend/OpenMP.
All but the construct SIMD specifiers, e.g., inbranch, and the device ISA
selector are defined in llvm/lib/Frontend/OpenMP/OMPKinds.def. From
these definitions we generate the enum classes TraitSet,
TraitSelector, and TraitProperty as well as conversion and helper
functions in llvm/lib/Frontend/OpenMP/OMPContext.{h,cpp}.
The OpenMP context is now an explicit object (see `struct OMPContext`).
This is in anticipation of construct traits that need to be tracked. The
OpenMP context, as well as the VariantMatchInfo, are basically made up
of a set of active (or, respectively, required) traits, e.g., 'host', and an
ordered container of constructs which allows duplication. Matching and
scoring is kept as generic as possible to allow easy extension in the
future.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D71847
Fix attempt
This is part of the implementation of http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
This patch provides the basis for building an assume to preserve all information from an instruction, and adds support for building an assume that preserves the information from a call.
The code paths in the absence of TargetMachine, TargetLowering or
TargetRegisterInfo are poorly tested. As rL285987 said, requiring
TargetPassConfig allows us to delete many (untested) checks littered
everywhere.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D73754
Summary:
This is part of the implementation of http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
This patch provides the basis for building an assume to preserve all information from an instruction, and adds support for building an assume that preserves the information from a call.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: mgrang, fhahn, mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72475
Summary:
This is part of the implementation of http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
This patch provides the basis for building an assume to preserve all information from an instruction, and adds support for building an assume that preserves the information from a call.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: mgrang, fhahn, mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72475
Summary:
This is part of the implementation of http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
This patch provides the basis for building an assume to preserve all information from an instruction, and adds support for building an assume that preserves the information from a call.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: mgrang, fhahn, mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72475
We want to allow splat value transforms to improve PR44588 and related bugs:
https://bugs.llvm.org/show_bug.cgi?id=44588
...but to do that, we need to know if values are splatted from the same,
specific index (lane) rather than splatted from an arbitrary index.
We can improve the undef handling with 1-liner follow-ups because the
Constant API optionally allows undefs now.
Differential Revision: https://reviews.llvm.org/D73549
Summary:
This patch makes tablegen generate llvm attributes in a way that is more generic and simpler (at least to me).
Changes: make tablegen generate
...
ATTRIBUTE_ENUM(Alignment,align)
ATTRIBUTE_ENUM(AllocSize,allocsize)
...
which can be used to generate most of what was previously used and more.
Tablegen was also generating attributes from 2 identical files, leading to identical output. So I removed one of them and made users use the other.
Reviewers: jdoerfert, thakis, aaron.ballman
Reviewed By: aaron.ballman
Subscribers: mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72455
Summary:
This is part of the implementation of http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
This patch provides the basis for building an assume to preserve all information from an instruction, and adds support for building an assume that preserves the information from a call.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: mgrang, fhahn, mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72475
Summary:
This patch makes tablegen generate llvm attributes in a way that is more generic and simpler (at least to me).
Changes: make tablegen generate
...
ATTRIBUTE_ENUM(Alignment,align)
ATTRIBUTE_ENUM(AllocSize,allocsize)
...
which can be used to generate most of what was previously used and more.
Tablegen was also generating attributes from 2 identical files, leading to identical output. So I removed one of them and made users use the other.
Reviewers: jdoerfert, thakis, aaron.ballman
Reviewed By: aaron.ballman
Subscribers: mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72455
This reverts commit e34801c8e6 and the followup due to multiple
problems.
I've tried to keep the tests and RDA parts where possible, as those
still seem useful.
The current FirstMI.getDebugLoc() is actually null in almost all cases.
If it isn't, the generated .loc will be considered initial. The .loc
will have the prologue_end flag and terminate the prologue prematurely.
Also use an overload of BuildMI that will not prepend
PATCHABLE_FUNCTION_ENTRY to a MachineInstr bundle.
Summary:
Virtual registers that are undef have an empty LiveInterval at this
point, which means beginIndex() and endIndex() cannot be used. We
only need those indices to determine the range in which to scan for
affected other NSA instructions, and undef operands cannot contribute
to that range.
Reviewers: arsenm, rampitec, mareko
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73831
We were checking that the original Value * for the compare operands
were null. But that can never happen.
I believe we intended to check for 0 registers here instead.
Fixes PR44749.
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate that AM.Disp with the symbolic
displacement.
This is my second attempt at committing this after failing
build bots previously. One thing I realized about the previous
attempt is that it's possible that AM.Disp is already non-zero
and the new Offset changes it back to zero. In that case my
previous attempt failed to update AM.Disp to zero. So this patch
removes the early out for 0 and appropriately handle the 0 case
in each check so we still update AM.Disp at the end.
This is based on this llvm-dev thread http://lists.llvm.org/pipermail/llvm-dev/2019-December/137521.html
The current strategy for f16 is to promote the type to float everywhere except where the specific width is required, like loads, stores, and bitcasts. This results in rounding occurring in odd places instead of immediately after arithmetic operations. This interacts in weird ways with the __fp16 type in clang, which is a storage-only type where arithmetic is always promoted to float. InstCombine can remove some fpext/fptruncs around such arithmetic and turn it into arithmetic on half. This wouldn't be so bad if SelectionDAG was able to put those fpext/fpround back in when it promotes.
It is also not obvious how to make the existing strategy work with STRICT fp. We need to use STRICT versions of the conversions, which require chain operands. But if the conversions are created for a bitcast, there is no place to get an appropriate chain from.
This patch implements a different strategy where conversions are emitted directly around arithmetic operations. Otherwise the value is passed around as an i16, including in arguments and return values. This can result in more conversions between arithmetic operations, but is closer to matching the IR the frontend generates for __fp16. And it will allow us to use the chain from constrained arithmetic nodes to link the STRICT_FP_TO_FP16/STRICT_FP16_TO_FP that will need to be added. I've set it up so that each target can opt into the new behavior. Converting all the targets myself was more than I was able to handle.
Differential Revision: https://reviews.llvm.org/D73749
This fixes legalizations of global stores > 128-bits. It seems work is
needed on how this split actually occurs. For example, we get the
right code for s160, with an s128 and s32 load, but get 5 s32 loads
for <5 x s32>.
This patch adds initial support for a DemandedElts mask to the internal computeKnownBits/ComputeNumSignBits methods, matching the SelectionDAG and GlobalISel equivalents.
So far only a couple of instructions have been setup to handle the DemandedElts, the remainder still using the existing 'all elements' default. The plan is to extend support as we have test coverage.
Differential Revision: https://reviews.llvm.org/D73435
Copy it instead. Otherwise, key registers (such as RBP) may get zeroed
out by the stack unwinder.
Fixes CrashRecoveryTest.DumpStackCleanup with MSVC in release builds.
Reviewed By: stella.stamenova
Differential Revision: https://reviews.llvm.org/D73809
Significant missing hashing: as per the comment, this was only meant to
skip member functions (unspecified, but I think it's legible as member
function declarations, not definitions) but was skipping all named
subprograms, so it only hashed child DIEs for member function definitions,
because those didn't have a direct name, only a name given indirectly
in the DW_AT_specification-referenced DIE.
Summary:
Implements the jump pseudo-instruction, which is used in e.g. the Linux kernel.
Reviewers: asb, lenary
Reviewed By: lenary
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73178
When dumping a Split DWARF file that hasn't been split into separate
files (such as output from llc, which includes the plain and .dwo sections in
the same file), allow both the macinfo and macinfo.dwo sections to be dumped.
Summary: With the new pass manager, it is not possible to obtain a pointer to the pass.
Reviewers: jfb, rinon, yln
Subscribers: hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73390
Summary:
Applying this cleanup:
- MIRBuilder.buildInstr(TargetOpcode::G_ASHR)
- .addDef(Shifted)
- .addUse(Res)
- .addUse(ShiftAmt);
+ MIRBuilder.buildAShr(Shifted, Res, ShiftAmt);
caused an assertion failure here:
llc: /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/MachineRegisterInfo.cpp:404: llvm::MachineInstr *llvm::MachineRegisterInfo::getVRegDef(unsigned int) const: Assertion `(I.atEnd() || std::next(I) == def_instr_end()) && "getVRegDef assumes a single definition or no definition"' failed.
#4 0x00000000050a6d96 in llvm::MachineRegisterInfo::getVRegDef (this=0x74606a0, Reg=2147483650) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/MachineRegisterInfo.cpp:403
#5 0x00000000066148f6 in llvm::getConstantVRegValWithLookThrough (VReg=2147483650, MRI=..., LookThroughInstrs=false, HandleFConstant=true) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/GlobalISel/Utils.cpp:244
#6 0x00000000066147da in llvm::getConstantVRegVal (VReg=2147483650, MRI=...) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/GlobalISel/Utils.cpp:210
#7 0x0000000006615367 in llvm::ConstantFoldBinOp (Opcode=101, Op1=2147483650, Op2=2147483656, MRI=...) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/GlobalISel/Utils.cpp:341
#8 0x000000000657eee0 in llvm::CSEMIRBuilder::buildInstr (this=0x7465010, Opc=101, DstOps=..., SrcOps=..., Flag=...) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/GlobalISel/CSEMIRBuilder.cpp:160
#9 0x0000000003645958 in llvm::MachineIRBuilder::buildAShr (this=0x7465010, Dst=..., Src0=..., Src1=..., Flags=...) at /home/jayfoad2/git/llvm-project/llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h:1298
#10 0x00000000065c35b1 in llvm::LegalizerHelper::lower (this=0x7fffffffb5f8, MI=..., TypeIdx=0, Ty=...) at /home/jayfoad2/git/llvm-project/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp:2020
because at this point there are two instructions defining Res: the
original G_SMULO/G_UMULO and the new G_MUL that we built. The fix is
to modify the original mul in place, so that there is only ever one
definition of Res.
Reviewers: arsenm, aditya_nandakumar
Subscribers: wdng, rovka, hiraditya, volkan, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72842
When you encounter a G_TRUNC, you are moving from a larger type to a smaller
type.
Asking for the i-th bit on a larger value is the same as asking for the i-th
bit on a smaller value.
So, we should always be able to walk through G_TRUNC when computing the bit
for a TB(N)Z.
Differential Revision: https://reviews.llvm.org/D73748
This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::INSERT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow.
Some code gen passes use MBFIWrapper to keep track of the frequency of new
blocks. This was not taken into account and could lead to incorrect frequencies
as MBFI silently returns zero frequency for unknown/new blocks.
Add a variant for MBFIWrapper in the PGSO query interface.
Depends on D73494.
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
Try out using combine definition rules.
This really should be a post-legalizer combine, but the combiner pass
is currently pre-legalize. Most of the target combines are really
post-legalize, so we should probably move the pass.
We may calculate reassociable math ops in arbitrary order when creating a shuffle reduction,
so there's no guarantee that things like 'nsw' hold on those intermediate values. Drop all
poison-generating flags for safety.
This change is limited to shuffle reductions because I don't think we have a problem in the
general case (where we intersect flags of each scalar op that goes into a vector op), but if
there's evidence of other cases being wrong, we can extend this fix to cover those cases.
https://bugs.llvm.org/show_bug.cgi?id=44536
Differential Revision: https://reviews.llvm.org/D73727
First attempt at implementing -fsemantic-interposition.
Rely on GlobalValue::isInterposable that already captures most of the expected
behavior.
Rely on a ModuleFlag to state whether we should respect SemanticInterposition or
not. The default remains no.
So this should be a no-op if -fsemantic-interposition isn't used, and if it is,
since isInterposable is already used in most optimisations, they should honor it
properly.
Note that it only impacts code compiled with -fPIC and no PIE.
Differential Revision: https://reviews.llvm.org/D72829
This patch wraps an external thread local storage variable inside of a
getter function and makes it have internal linkage. This allows LLVM to
be built with BUILD_SHARED_LIBS on windows with MinGW. Additionally it
allows Clang versions prior to 10 to compile current trunk for MinGW.
Differential Revision: https://reviews.llvm.org/D73639
One of the exit criteria of computeKnownBits is whether we reach the max
recursive call depth. Before this patch we would check that the
depth is exactly equal to max depth to exit.
Depth may get bigger than max depth if it gets passed to a different
GISelKnownBits object.
This may happen when say a generic part uses a GISelKnownBits object
with some max depth, but then we hit TL.computeKnownBitsForTargetInstr
which creates a new GISelKnownBits object with a different and smaller
depth. In that situation, when we hit the max depth check for the first
time in the target specific GISelKnownBits object, depth may already
be bigger than the current max depth. Hence we would continue to compute
the known bits, until we ran through the full depth of the chain of
computation or ran out of stack space.
For instance, let's say we have:
```
GISelKnownBits Info(/*MaxDepth=*/10);
Info.getKnownBits(Foo)
// 9 recursive calls to computeKnownBitsImpl.
// Then we hit a target specific instruction.
// The target specific GISelKnownBits does this:
GISelKnownBits TargetSpecificInfo(/*MaxDepth=*/6);
TargetSpecificInfo.computeKnownBitsImpl() // <-- subsequent max depth checks
                                          //     would always return false.
```
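A simplified, hedged sketch of the resulting check (not the exact source): with an equality test, a caller that arrives with Depth already larger than this object's MaxDepth never triggers the early exit, whereas a >= test always terminates the recursion.
```cpp
// Sketch only: the cutoff must fire even if Depth is already past MaxDepth.
bool shouldStopRecursion(unsigned Depth, unsigned MaxDepth) {
  return Depth >= MaxDepth; // previously the check was Depth == MaxDepth
}
```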
This commit does not have any test case, none of the in-tree targets
use computeKnownBitsForTargetInstr.
For a MC_GlobalAddress reference to a dso_local external GlobalValue with a definition, emit .Lfoo$local to avoid a relocation.
-fno-pic and -fpie can infer dso_local but -fpic cannot. In the future,
we can explore the possibility of inferring dso_local with -fpic. As the
description of D73228 says, LLVM's existing IPO optimization behaviors
(like -fno-semantic-interposition) and a previous assembly behavior give
us enough license to be aggressive here.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D73230
This patch addresses the issue found in https://bugs.llvm.org/show_bug.cgi?id=44585
where a DW_OP_deref was placed at the end of a dwarf expression, resulting in corrupt
symbols when debugging.
This is an attempt to reland with a few fixes for buildbot since I
haven't merged from master in a bit.
Differential Revision: https://reviews.llvm.org/D73526
We can have geps that have a scalar base pointer, and a vector index value, which
means that the base pointer must be splatted into a vector of pointers.
This fixes crashes on arm64 GlobalISel with optimizations enabled.
I believe this also fixes bugs with CI 32-bit handling, which was
incorrectly skipping offsets that look like signed 32-bit values. Also
validate the offsets are dword aligned before folding.
This is similar to the code in getTestBitOperand in AArch64ISelLowering. Instead
of implementing all of the TB(N)Z optimizations at once, this patch implements
the simplest case first. The way that this is set up should make it fairly easy
to add the rest as we go along.
The idea here is that after determining that we can use a TB(N)Z, we can
continue looking through instructions and perform further folding.
In this case, when we have a G_ZEXT or G_ANYEXT where the extended bits are not
used, we can fold it into the TB(N)Z.
Differential Revision: https://reviews.llvm.org/D73673
Found by inspection, but there's no test for this yet because G_PTR_ADD is
currently illegal for vectors. I'll add the test at a later time when the
legalizer support has landed.
There's not much value to this separate node from the intrinsic. Make
the operand structure the same as the intrinsic, so we can reuse the
same pattern for GlobalISel.
In line with current conventions, create new instructions rather
than modify two operands in place and performing manual worklist
management.
This should be NFC apart from possible worklist order changes.
Summary:
Added file headers for files which implement iterative lightweight scheduling
strategies. This is basically an exercise which I undertook in order to get
used to the LLVM development process.
Reviewers: arsenm, vpykhtin, cdevadas
Reviewed By: vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73417
- Extends the comments related to function descriptors, noting how they
are only used on AIX.
- Changes the condition used to gate the creation of the current function
symbol in AsmPrinter::SetupMachineFunction to reflect being AIX
specific. The creation of the symbol is different because of AIXs
linkage conventions, not because AIX uses function descriptors.
Differential Revision: https://reviews.llvm.org/D73115
Summary:
For -fpatchable-function-entry=N,0 -mbranch-protection=bti, after
9a24488cb6, we place the NOP sled after
the initial BTI.
```
.Lfunc_begin0:
bti c
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lfunc_begin0
```
This patch adds a label after the initial BTI and changes the __patchable_function_entries entry to reference the label:
```
.Lfunc_begin0:
bti c
.Lpatch0:
nop
nop
.section __patchable_function_entries,"awo",@progbits,f,unique,0
.p2align 3
.xword .Lpatch0
```
This placement is compatible with the resolution in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424 .
A local linkage function whose address is not taken does not need a BTI.
Placing the patch label after BTI has the advantage that code does not
need to differentiate whether the function has an initial BTI.
Reviewers: mrutland, nickdesaulniers, nsz, ostannard
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73680
Summary:
Similar to issue D71445. Scalable vector should not be evaluated element by element.
Add support to handle scalable vector UndefValue.
Reviewers: sdesmalen, efriedma, apazos, huntergr, willlovett
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73678
Summary:
Disable the always importing of constants introduced in D70404 by
default under a new internal option, since it is causing order of
magnitude compile time regressions during the thin link. Will continue
investigating why the regressions occur.
Reviewers: evgeny777, wmi
Subscribers: mehdi_amini, inglorion, hiraditya, steven_wu, dexonsmith, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73724
from FC0.ExitBlock to FC1.ExitBlock when proven safe.
Summary:
Currently LoopFusion gives up when the second loop nest guard
block or the first loop nest exit block is not empty. For example:
```
if (0 < N) {
  for (int i = 0; i < N; ++i) {}
  x+=1;
}
y+=1;
if (0 < N) {
  for (int i = 0; i < N; ++i) {}
}
```
The above example should be safe to fuse.
This PR moves instructions in the FC1 guard block (e.g. y+=1;) to the
FC0 guard block, or instructions in the FC0 exit block (e.g. x+=1;) to the
FC1 exit block, so that LoopFusion is then able to fuse them.
Reviewer: kbarton, jdoerfert, Meinersbur, dmgreen, fhahn, hfinkel,
bmahjour, etiotto
Reviewed By: jdoerfert
Subscribers: hiraditya, llvm-commits
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D73641
fadd/fmul reductions without reassoc are lowered to
VECREDUCE_STRICT_FADD/FMUL nodes, which don't have legalization
support. Until that is in place, expand these intrinsics on
ARM and AArch64. Other targets always expand the vector reduction
intrinsics.
Additionally expand fmax/fmin reductions without nonan flag on
AArch64, as the backend asserts that the flag is present when
lowering VECREDUCE_FMIN/FMAX.
This fixes https://bugs.llvm.org/show_bug.cgi?id=44600.
Differential Revision: https://reviews.llvm.org/D73135
The recommended optimization level for BPF programs
is O2 since (1) BPF programs run inside the kernel and
the Linux kernel won't work at the -O0 level, and (2) the verifier
is not able to handle -O0 code properly, e.g., potentially
large stack size and a lot of spills.
But we should keep -O0 at least compiling.
This patch fixes a bug in the BPFMISimplifyPatchable phase
where, with -O0, a segmentation fault would happen for a
simple program like:
int test(int a, int b) { return a + b; }
A test case is added to capture such a case.
Differential Revision: https://reviews.llvm.org/D73681
Summary:
This patch intends to support the three most common relocation types
on AIX: R_POS, R_TOC, and R_RBR.
These three relocation types will be needed for object file generation
on AIX for small code model.
We will have follow up patches to bring relocation support for
large code model on AIX.
Reviewers: hubert.reinterpretcast, daltenty, DiggerLin
Differential Revision: https://reviews.llvm.org/D72027
By adding the prefixed instructions the branch distances are no longer
computed correctly. Since prefixed instructions cannot cross a 64 byte
boundary we have to assume that a prefixed instruction may have a nop
prepended to it. This patch tries to take that nop into consideration
when computing the size of basic blocks.
Differential Revision: https://reviews.llvm.org/D72572
Summary:
When constant folding, constants that are wrapped in metadata were not
folded. This could lead to dbg.values being the only user of a constant
expression, due to the non-dbg uses having been rewritten, resulting in
the constant later on being removed by some other pass. This occurred
with the attached test case, in which the non-rewritten GEP in the
dbg.value intrinsic was later on removed by globalopt.
This patch makes the code look through metadata and fold such constants.
I guess that we in the future may want to allow dbg.values using GEPs and
other constant expressions to be emittable even if there are no non-dbg
uses, but for example SelectionDAG does not support that.
Reviewers: jmorse, aprantl, vsk, davide
Reviewed By: aprantl, vsk, davide
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D73630
The commit https://reviews.llvm.org/rG14fc20ca6 added some options to the X86
back end that cause the help text for opt/llc to become much harder to read.
The issue is that the cl::value_desc is part of the option name and is used to
compute the indentation of the description text (i.e. the maximum length option
name is what everything aligns to). Since the commit puts a large number of
characters into that text, everything is aligned to that width.
This patch just reformats the option so that the description is contained in the
description and the list of possible values is within the angle brackets.
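For illustration, a hedged example of the intended shape of such an option (the option name and values here are made up): keep the enumeration of accepted values in cl::desc and leave cl::value_desc as a short placeholder so the help-text column stays narrow.
```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical option, for layout illustration only.
static cl::opt<std::string> ExampleOpt(
    "x86-example-option",
    cl::desc("Select a mode: one of 'fast', 'small' or 'balanced'"),
    cl::value_desc("mode"), cl::init("fast"));
```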
Note: the readability issue of the helptext was fixed in commit
70cbf8c71c, but the re-formatting wasn't
added on that commit so I am still committing this.
Differential revision: https://reviews.llvm.org/D73267
Strict fp-to-int and int-to-fp conversions can be handled in the same way that
the non-strict versions are (by using the appropriate instruction or converting
to a function call when we have no instruction).
Differential Revision: https://reviews.llvm.org/D73625
On targets that don't have the normal packed f16 layout, handle these
during legalization. Directly modify the register types. We can infer
this was a d16 load based on the mem operand size during selection.
A16 operands should possibly be handled here as well, but don't worry
about that yet.
This trivially avoids violating the constant bus restriction.
Previously this was allowing one SGPR in the first source
operand, which technically also avoided violating this for most
operations (but not for special cases reading vcc).
We do need to write some new, smarter operand folds to pick the
optimal SGPR to use in some kind of post-isel fold, but that's purely
an optimization.
I was originally thinking we would pick which operands should be SGPRs
in RegBankSelect, but I think this isn't really manageable. There
would be additional complexity to handle every G_* instruction, and
then any nontrivial instruction patterns would need to know when to
avoid violating it, which is likely to be very error prone.
I think having all inputs being canonically copies to VGPRs will
simplify the operand folding logic. The current folding we do is
backwards, and only considers one operand at a time, relative to
operands it already has. It therefore poorly handles the case where
there is already a constant bus operand user. If all operands are
copies, it's somewhat simpler to consider all input operands at once
to choose the optimal constant bus user.
Since the failure mode for constant bus violations is now a verifier
error and not a selection failure, this moves towards a place where
we can turn on the fallback mode. The SGPR copy folding optimizations
can be left for later.
Commit 9965b12fd1 was supposed to change the offset constant when
lowering load/stores, but only introduced this change for loads. This
patch adds the same fix for stores.
A known limitation for Future CPU is that the new prefixed instructions may
not cross 64 Byte boundaries.
All instructions are already 4 byte aligned so the only situation where this
can occur is when the prefix is in one 64 byte block and the instruction that
is prefixed is at the top of the next 64 byte block. To fix this case
PPCELFStreamer was added to intercept EmitInstruction. When a prefixed
instruction is emitted we try to align it to 64 Bytes by adding a maximum of
4 bytes. If the prefixed instruction crosses the 64 Byte boundary then the
alignment would trigger and a 4 byte nop would be added to push the
instruction into the next 64 byte block.
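A toy model of the boundary check (illustrative, not the emitter's code): instruction words are 4-byte aligned and a prefixed instruction occupies 8 bytes, so it can only straddle a 64-byte boundary when the prefix lands in the last word of a block.
```cpp
#include <cassert>
#include <cstdint>

// Padding (in bytes) needed so an 8-byte prefixed instruction at Offset does
// not cross a 64-byte boundary; a single 4-byte nop is enough.
uint64_t paddingForPrefixedInst(uint64_t Offset) {
  assert(Offset % 4 == 0 && "instructions are 4-byte aligned");
  return (Offset % 64 == 60) ? 4 : 0;
}

int main() {
  assert(paddingForPrefixedInst(56) == 0); // prefix+inst end at the boundary
  assert(paddingForPrefixedInst(60) == 4); // would straddle: emit a nop first
  assert(paddingForPrefixedInst(64) == 0); // starts a fresh 64-byte block
}
```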
Differential Revision: https://reviews.llvm.org/D72570
This gets selected to the appropriate fcvt instruction. Handling from there on
isn't fully correct yet, as we need to model fcvt reading and writing to fpsr
and fpcr.
Differential Revision: https://reviews.llvm.org/D73201
We are missing the ability to override the sh_entsize field for
SHT_REL[A] sections. It would be useful for writing test cases.
Differential revision: https://reviews.llvm.org/D73621
These become STRICT_FCMP and STRICT_FCMPE, which then get selected to the
corresponding FCMP and FCMPE instructions, though the handling from there on
isn't fully correct as we don't model reads and writes to FPCR and FPSR.
Differential Revision: https://reviews.llvm.org/D73368
Summary:
The code was assuming in a few places that if there was only one exit
from the function that it was a normal return, which is invalid. It
could be an infinite loop, in which case we still need to insert the
usual fake edge so that the null export happens. This fixes shaders that
end with an infinite loop that discards.
Reviewers: arsenm, nhaehnle, critson
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71192
Summary:
Add trimming of unused components of s_buffer_load.
For s_buffer_load and unformatted buffer_load also trim unused
components at the beginning of vector and update offset accordingly.
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71785
The function a) returned 32 bits even though in DWARF64 the PrologueLength
field is 64 bits in size, and b) didn't work for DWARF version 5.
Also deleted some related dead code. With this deletion, getLength is
itself dead, but another change is about to make use of it.
Reviewed by: probinson
Differential Revision: https://reviews.llvm.org/D73626
Summary:
In a release build this variable becomes unused and may break the build
with `-Werror,-Wunused-variable`.
Reviewers: gribozavr2, jdoerfert, sstefan1
Reviewed By: gribozavr2
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73683
Summary:
As determined with `llvm-exegesis`.
Some of these look like typos/misunderstandings of the sched model td
spec:
- latency defaults to `1` when not set => Maybe we can avoid
having a default ?
- problems with regexps not being anchored by default (XCHG matching
CMPXCHG)
Note that this is not complete, it fixes only the most obvious mistakes,
and only for latency (not uops).
Reviewers: RKSimon, GGanesh
Subscribers: hiraditya, jfb, mstojanovic, hfinkel, craig.topper, andreadb, lebedev.ri, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73172
InstCombine operates on the basic premise that the operands of the
currently processed instruction have already been simplified. It
achieves this by pushing instructions to the worklist in reverse
program order, so that instructions are popped off in program order.
The worklist management in the main combining loop also makes sure
to uphold this invariant.
However, the same is not true for all the code that is performing
manual worklist management. The largest problem (addressed in this
patch) are instructions inserted by InstCombine's IRBuilder. These
will be pushed onto the worklist in order of insertion (generally
matching program order), which means that a) the users of the
original instruction will be visited first, as they are pushed later
in the main loop and b) the newly inserted instructions will be
visited in reverse program order.
This causes a number of problems: First, folds operate on instructions
that have not had their operands simplified, which may result in
optimizations being missed (ran into this in
https://reviews.llvm.org/D72048#1800424, which was the original
motivation for this patch). Additionally, this increases the amount
of folds InstCombine has to perform, both within one iteration, and
by increasing the number of total iterations.
This patch addresses the issue by adding a Worklist.AddDeferred()
method, which is used for instructions inserted by IRBuilder. These
will only be added to the real worklist after the combine finished,
and in reverse order, so they will end up processed in program order.
I should note that the same should also be done to nearly all other
uses of Worklist.Add(), but I'm starting with just this occurrence,
which has by far the largest test fallout.
Most of the test changes are due to
https://bugs.llvm.org/show_bug.cgi?id=44521 or other cases where
we don't canonicalize something. These are neutral. One regression
has been addressed in D73575 and D73647. The remaining regression
in an shl+sdiv fold can't really be fixed without dropping another
transform, but does not seem particularly problematic in the first
place.
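For illustration, a standalone sketch of the deferral idea (this is not the actual InstCombine worklist class, just the shape of it):
```
#include <vector>

// Instructions created during a combine are parked in a side list and only
// pushed onto the real stack-like worklist once the combine finishes.
template <typename InstT> class DeferredWorklist {
  std::vector<InstT *> Worklist;  // Popped from the back (LIFO).
  std::vector<InstT *> Deferred;  // Filled as the IRBuilder inserts.

public:
  void add(InstT *I) { Worklist.push_back(I); }
  void addDeferred(InstT *I) { Deferred.push_back(I); }

  // Called once the current combine is finished: push deferred instructions
  // in reverse insertion order so they pop off in program order.
  void flushDeferred() {
    for (auto It = Deferred.rbegin(), End = Deferred.rend(); It != End; ++It)
      Worklist.push_back(*It);
    Deferred.clear();
  }

  InstT *pop() {
    if (Worklist.empty())
      return nullptr;
    InstT *I = Worklist.back();
    Worklist.pop_back();
    return I;
  }
};
```
Pushing the deferred instructions in reverse means they pop off the LIFO worklist in program order, so operands are revisited before their users.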
Differential Revision: https://reviews.llvm.org/D73411
This lowering tries to look for G_PTR_ADD instructions and then converts
them to a standard G_ADD with a COPY on the source, and G_INTTOPTR on the
result. This is ok for address space 0 on AArch64 as p0 can be treated as
s64.
The motivation behind this is to expose the add semantics to the imported
tablegen patterns. We shouldn't need to check for uses being loads/stores,
because the selector works bottom up, uses before defs. By the time we
end up trying to select a G_PTR_ADD, we should have already attempted to
fold this into addressing modes and were therefore unsuccessful.
This gives some performance and code size improvements across the board.
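As a rough sketch (assuming a MachineIRBuilder-based lowering hook; the function name and surrounding plumbing here are illustrative, not the exact code in this patch):
```
#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
#include "llvm/CodeGen/MachineInstr.h"

// %dst:p0 = G_PTR_ADD %base:p0, %off:s64
//   becomes
// %b:s64   = COPY %base
// %sum:s64 = G_ADD %b, %off
// %dst:p0  = G_INTTOPTR %sum
static void lowerPtrAddToAdd(llvm::MachineInstr &MI,
                             llvm::MachineIRBuilder &MIRBuilder) {
  using namespace llvm;
  const LLT S64 = LLT::scalar(64);
  Register Dst = MI.getOperand(0).getReg();
  Register Base = MI.getOperand(1).getReg();
  Register Off = MI.getOperand(2).getReg();

  auto BaseInt = MIRBuilder.buildCopy(S64, Base); // reinterpret p0 as s64
  auto Sum = MIRBuilder.buildAdd(S64, BaseInt, Off);
  MIRBuilder.buildIntToPtr(Dst, Sum);
  MI.eraseFromParent();
}
```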
Differential Revision: https://reviews.llvm.org/D73673
Currently some prefixes are emitted as instructions; to distinguish them
from real instructions, the function isPrefix() is added. The kinds of prefixes
are consistent with X86GenInstrInfo.inc.
Differential Revision: https://reviews.llvm.org/D73013
Summary:
This patch makes sure that the field VFShape.VF is greater than zero
when demangling the vector function name of scalable vector functions
encoded in the "vector-function-abi-variant" attribute.
This change is required to be able to provide instances of VFShape
that can be used to query the VFDatabase for the vectorization passes,
as such passes always require a positive value for the Vectorization Factor (VF)
needed by the vectorization process.
It is not possible to extract the value of VFShape.VF from the mangled
name of scalable vector functions, because it is encoded as
`x`. Therefore, the VFABI demangling function has been modified to
extract such information from the IR declaration of the vector
function, under the assumption that _all_ vectors in the signature of
the vector function have the same number of lanes. This assumption is
valid because it is also assumed by the Vector Function ABI
specifications supported by the demangling function (x86, AArch64, and
the LLVM internal one).
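For illustration, a hedged sketch of that extraction (not the actual demangler code, assuming the current VectorType/ElementCount API; the helper name is made up):
```
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/Casting.h"

// Take the lane count from the first vector parameter of the vector function
// declaration, relying on the ABI guarantee that all vector parameters share
// the same number of lanes. Returns 0 if no vector parameter is found, which
// the caller must treat as a demangling failure.
static unsigned getVFFromSignature(const llvm::Function &VecFn) {
  for (const llvm::Argument &Arg : VecFn.args())
    if (auto *VTy = llvm::dyn_cast<llvm::VectorType>(Arg.getType()))
      return VTy->getElementCount().getKnownMinValue();
  return 0;
}
```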
The unit tests that demangle scalable names have been modified by
adding the IR module that carries the declaration of the vector
function name being demangled.
In particular, the demangling function fails in the following cases:
1. When the declaration of the scalable vector function is not
present in the module.
2. When the value of VFShape.VF is not greater than 0.
Reviewers: jdoerfert, sdesmalen, andwar
Reviewed By: jdoerfert
Subscribers: mgorny, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73286
This is an alternate fix for the issue D73606 was trying to
solve.
The main issue here is that we bailed out of
foldOffsetIntoAddress if Offset is 0. But if we just found a
symbolic displacement and AM.Disp became non-zero
earlier, we still need to validate that AM.Disp together with the symbolic
displacement.
This passes fold-add-pcrel.ll.
Differential Revision: https://reviews.llvm.org/D73608
Summary:
It is called when instructions aren't simplified, and the implementation
is expected to account for a penalty. Renamed to
onCommonInstructionMissedSimplification.
Reviewers: davidxl, eraman
Reviewed By: davidxl
Subscribers: hiraditya, baloghadamsoftware, haicheng, a.sidorin, Szelethus, donat.nagy, dkrupp, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73662
A pointer is privatizable if it can be replaced by a new, private one.
Privatizing a pointer reduces its use count and the interaction between unrelated
code parts. This is a first step towards replacing argument promotion.
While we can already handle recursion (unlike argument promotion!) we
are restricted to stack allocations for now because we do not analyze
the uses in the callee.
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D68852
The helpers AAReturnedFromReturnedValues and
AACallSiteReturnedFromReturned are useful not only to avoid code
duplication but also to avoid recomputation of results. If we have N
call sites we should not recompute the function return information N
times but once. These are mostly straightforward usages with some minor
improvements on the helpers and addition of a new one
(IRPosition::getAssociatedType) that knows about function return types.
We seem to be inheriting the cost from SSE4.1. But if we have 256-bit registers, we should be able to do this with just one extract to split the v16i16 and two v8i16->v8i32 operations, so our cost should be 3, not 4.
Differential Revision: https://reviews.llvm.org/D73646
This is passed to legalizeCustom, but not intrinsic. Also remove the
MRI argument, since you can get that from the MachineIRBuilder.
I'm not sure why MachineIRBuilder has a private observer member, and
this is passed separately.
Rename Destructive enumerator in preparation for a larger set of patches to
support prefixing destructive operations with MOVPRFX.
Differential Revision: https://reviews.llvm.org/D73212
For pow2 constants we should use G_SHL for pattern matching (and perf)
purposes later.
Vector support not yet implemented.
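As a trivial scalar illustration of the equivalence (made-up helper, nothing target-specific):
```
#include <cassert>
#include <cstdint>

// Multiplying by a power-of-two constant C is the same as shifting left by
// log2(C); G_SHL is the form later combines and selection prefer.
uint64_t mulByPow2(uint64_t X, uint64_t C) {
  assert(C != 0 && (C & (C - 1)) == 0 && "C must be a power of two");
  unsigned ShiftAmt = 0;
  while ((1ULL << ShiftAmt) != C)
    ++ShiftAmt;
  return X << ShiftAmt; // equivalent to X * C
}
```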
Differential Revision: https://reviews.llvm.org/D73659
When the bit is <= 32, we have to use the W register variant for TB(N)Z.
This is because of the way the instruction is encoded.
Differential Revision: https://reviews.llvm.org/D73660
Summary:
This will help with devirtualization (store forwarding with vtable pointers in
the presence of other stores into members in the constructor). During inlining,
we don't have AA.
Reviewers: davidxl
Subscribers: mgorny, Prazek, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71307
Some housekeeping for the DestructiveInstType enum before a larger set of patches to support prefixing destructive operations with MOVPRFX.
Differential Revision: https://reviews.llvm.org/D73141
A previous patch should have added pld and pstd and any support code in
the backend that is required for prefixed load and store type operations.
This patch adds a number of additional prefixed load and store type
instructions for the future CPU.
Differential Revision: https://reviews.llvm.org/D72577
Summary:
gnu addr2line prints DWARF line table discriminators like so:
<file>:<line> (discriminator <Number>)
This matches that behavior.
Document how and when --output-style=GNU prints discriminators
Add test for new GNU-style discriminator printing.
Reviewers: rupprecht, labath, jhenderson
Subscribers: aprantl, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73318
For the
icmp eq (add X, C1), C2 => icmp eq X, C2-C1
icmp eq (sub C1, X), C2 => icmp eq X, C1-C2
folds, this allows C1 to be non-splat and contain undefs.
C2 is still splat, due to the structure of the code.
This is to address the remaining part of the regression in D73411,
where demanded element analysis replaces some elements with undef.
Differential Revision: https://reviews.llvm.org/D73647
For `MC_GlobalAddress` operands referencing **certain** GlobalObjects,
we can lower them to STB_LOCAL aliases to avoid costs brought by
assembler/linker's conservative decisions about symbol interposition:
* An assembler conservatively assumes a global default visibility symbol is interposable (ELF
semantics). So relocations in object files are needed even if the code generator assumed
the definition to be exact and non-interposable.
* The relocations can cause the creation of PLT entries on some targets for -shared links.
A linker conservatively assumes a global default visibility symbol is interposable (if not
otherwise constrained by -Bsymbolic/--dynamic-list/VER_NDX_LOCAL/etc).
"certain" refers to GlobalObjects in the intersection of
`hasExactDefinition() and !isInterposable()`: `external`, `appending`, `internal`, `private`.
Local linkages (`internal` and `private`) cannot be interposed. `appending` is used for the very
few objects LLVM interprets specially. So the set effectively reduces to `external`.
This patch emits STB_LOCAL aliases (.Lfoo$local) for such GlobalObjects, so that targets can lower
MC_GlobalAddress operands to STB_LOCAL aliases if applicable.
We may extend the scope and include GlobalAlias in the future.
LLVM's existing -fno-semantic-interposition behaviors give us license to do such optimizations:
* Various optimizations (ipconstprop, inliner, sccp, sroa, etc) treat normal ExternalLinkage
GlobalObjects as non-interposable.
* Before D72197, MC resolved a PC-relative VK_None fixup to a non-local symbol at assembly time (no
outstanding relocation), if the target is defined in the same section. Put simply, even if IR
optimizations failed to optimize and allowed interposition for the function call in
`void foo() {} void bar() { foo(); }`, the assembler would disallow it.
This patch sets up AsmPrinter infrastructure to take -fno-semantic-interposition further.
With and without the patch, the object file output should be identical:
`.Lfoo$local` does not take a symbol table entry.
Reviewed By: sfertile
Differential Revision: https://reviews.llvm.org/D73228
Summary:
Add test case for the same. This test case will also serve as a
starting point for later symbolizer tests.
Reviewers: dblaikie, jdoerfert
Subscribers: hiraditya, llvm-commits, jhenderson
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73583
Summary:
The initialization of RegisterBank needs to be done only once. The
logic around AlreadyInit has a data race; use llvm::call_once instead.
This is continuing work of D73587.
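For illustration, a standalone sketch of the pattern with std::call_once (the patch itself uses llvm::call_once, which has the same shape):
```
#include <mutex>

// The racy "if (!AlreadyInit) { AlreadyInit = true; ... }" pattern is replaced
// by a once-flag, so concurrent callers initialize exactly once.
static std::once_flag InitFlag;

void ensureRegisterBankInfoInitialized() {
  std::call_once(InitFlag, [] {
    // One-time setup of the register bank tables goes here.
  });
}
```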
Reviewers: arsenm, rovka, dsanders, t.p.northover, efriedma, apazos
Reviewed By: arsenm
Subscribers: wdng, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73605
Summary:
The initialization of RegisterBank needs to be done only once. The
logic around AlreadyInit has a data race; use llvm::call_once instead.
This is continuing work of D73587.
Reviewers: arsenm, tstellar, ronlieb, efriedma, apazos, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73604
Summary:
The initialization of RegisterBank needs to be done only once. The
logic around AlreadyInit has a data race; use llvm::call_once instead.
This issue was identified through thread sanitizer.
Reviewers: efriedma, apazos, qcolombet, dsanders
Reviewed By: efriedma
Subscribers: arsenm, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73587
ISD::FROUND is defined to round to nearest with ties rounding
away from 0. This mode isn't supported in hardware on X86.
But as long as we aren't compiling with trapping math, we can
emulate this with floor(X + copysign(nextafter(0.5, 0.0), X)).
We have to use nextafter to avoid some corner cases that adding
0.5 would have. For example, if X is nextafter(0.5, 0.0) it should
round to 0.0, but adding 0.5 would need one more bit of mantissa
than can be stored, so it rounds to 1.0. Adding nextafter(0.5, 0.0)
instead will just increase the exponent by 1 and leave the mantissa
as all 1s. This would be nextafter(1.0, 0.0) which will floor to 0.0.
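A minimal scalar sketch of the emulation (plain libm, ignoring exception behavior):
```
#include <cmath>

// Round half away from zero via floor: bias by the largest value strictly
// below 0.5, with the sign of the input.
double roundHalfAwayFromZero(double X) {
  double Bias = std::copysign(std::nextafter(0.5, 0.0), X);
  return std::floor(X + Bias);
}
```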
Technically this requires -fno-trapping-math, which isn't our default.
But if we care about exceptions we should be using constrained
intrinsics. Constrained intrinsics would use STRICT_FROUND which
won't go through this code.
Fixes PR42195.
Differential Revision: https://reviews.llvm.org/D73607
This code needs to map from the FPCW 2-bit encoding for rounding mode to the 2-bit encoding defined for FLT_ROUNDS. The previous implementation did some clever swapping of bits and adding 1 modulo 4 to do the mapping.
This patch instead uses an 8-bit immediate as a lookup table of four 2-bit values. Then we use the 2-bit FPCW encoding to index the lookup table with a right shift and an AND. This requires extracting the 2-bit value from FPCW and multiplying it by 2 to make it usable as a shift amount, but it still results in less code.
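In C terms, the mapping looks roughly like this (the table byte below is derived from the standard FPCW and FLT_ROUNDS encodings; the emitted x86 sequence differs):
```
#include <cassert>
#include <cstdint>

// FPCW rounding control: 0 = nearest, 1 = down, 2 = up, 3 = toward zero.
// FLT_ROUNDS values:     0 = toward zero, 1 = nearest, 2 = up, 3 = down.
unsigned fltRoundsFromFPCW(unsigned RC) {
  assert(RC < 4 && "rounding-control field is two bits");
  const uint8_t Table = 0x2D; // 0b00'10'11'01 == entries {1, 3, 2, 0}
  return (Table >> (RC * 2)) & 0x3;
}
```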
Differential Revision: https://reviews.llvm.org/D73599
Fixes selection for scalar G_SMULH/G_UMULH. Also switches to using
tablegen selected add/sub, which switch to the signed version of the
opcode. This matches the current DAG behavior. We can't drop the
manual selection for add/sub yet, because it's still both for VALU
add/sub and for G_PTR_ADD.
Manually select this as a tablegen workaround. Both SelectionDAG
and GlobalISel end up misplacing the copy to m0 when both instructions
in the output need it. Neither considers that both output instructions
depend on m0. I don't know of any other pattern that needs to handle this
case, so it's less effort to just work around this for now.
Summary:
BaseMemOpClusterMutation::apply forms store chains by looking for
control (i.e. non-data) dependencies from one mem op to another.
In the test case, clusterNeighboringMemOps successfully clusters the
loads, and then adds artificial edges to the loads' successors as
described in the comment:
// Copy successor edges from SUa to SUb. Interleaving computation
// dependent on SUa can prevent load combining due to register reuse.
The effect of this is that *data* dependencies from one load to a store
are copied as *artificial* dependencies from a different load to the
same store.
Then when BaseMemOpClusterMutation::apply looks at the stores, it finds
that some of them have a control dependency on a previous load, which
breaks the chains and means that the stores are not all considered part
of the same chain and won't all be clustered.
The fix is to only consider non-artificial control dependencies when
forming chains.
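A rough sketch of the chain-forming predicate, assuming the ScheduleDAG SDep API (the surrounding code in BaseMemOpClusterMutation differs):
```
#include "llvm/CodeGen/ScheduleDAG.h"

// Only honest control (non-data) dependencies should link mem ops into a
// chain; artificial edges added by earlier mutations must be ignored.
static bool isChainPredecessor(const llvm::SDep &Pred) {
  return Pred.isCtrl() && !Pred.isArtificial();
}
```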
Subscribers: MatzeB, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71717
This commit fixes PR39321.
GlobalExtensions is not guaranteed to be destroyed when optimizer plugins are unloaded. If it is indeed destroyed after a plugin is dlclose-d, the destructor of the corresponding ExtensionFn is not mapped anymore, causing a call to unmapped memory during destruction.
This commit guarantees that extensions coming from external plugins are removed from GlobalExtensions when the plugin is unloaded if GlobalExtensions has not been destroyed yet.
Differential Revision: https://reviews.llvm.org/D71959
Summary:
Due to the fact that kill is just a normal intrinsic, even though it's
supposed to terminate the thread, we can end up with provably infinite
loops that are actually supposed to end successfully. The
AMDGPUUnifyDivergentExitNodes pass breaks up these loops, but because
there's no obvious place to make the loop branch to, it just makes it
return immediately, which skips the exports that are supposed to happen
at the end and hangs the GPU if all the threads end up being killed.
While it would be nice if the fact that kill terminates the thread were
modeled in the IR, I think that the structurizer as-is would make a mess if we
did that when the kill is inside control flow. For now, we just add a null
export at the end to make sure that it always exports something, which fixes
the immediate problem without penalizing the more common case. This means that
we sometimes do two "done" exports when only some of the threads enter the
discard loop, but from tests the hardware seems ok with that.
This fixes dEQP-VK.graphicsfuzz.while-inside-switch with radv.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70781
SIMachineScheduler uses isHighLatencyInstruction with the same
semantics, but TargetInstrInfo has a virtual isHighLatencyDef
method, so override it instead.
Added FLAT to the list of high latency opcodes and a check for
mayLoad since stores are not technically high latency in terms
of data dependency.
This change did not produce any visible impact on our tests.
Differential Revision: https://reviews.llvm.org/D73582
Move instructions from FC1 preheader to FC0 preheader when proven safe.
Summary:
Currently LoopFusion gives up when the second loop nest's preheader is
not empty. For example:
for (int i = 0; i < 100; ++i) {}
x+=1;
for (int i = 0; i < 100; ++i) {}
The above example should be safe to fuse.
This PR moves instructions in the FC1 preheader (e.g. x+=1;) to the
FC0 preheader, after which LoopFusion is able to fuse the two loops.
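Sketch of the transformed shape (assuming the hoisted statement does not depend on the first loop, so the move is safe):
```
void fusedShape(int &x) {
  x += 1;                           // moved from FC1's preheader into FC0's
  for (int i = 0; i < 100; ++i) {
    // bodies of both original loops, now fused
  }
}
```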
Reviewer: kbarton, Meinersbur, jdoerfert, dmgreen, fhahn, hfinkel,
bmahjour, etiotto
Reviewed By: jdoerfert
Subscribers: hiraditya, llvm-commits
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D71821
Summary:
udiv/sdiv/urem/srem/mul integer isel patterns and tests.
Pretend for now that integer division is always cheap in HW.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D73623
This should be no problem to support with a pattern, but it turns out
there are just too many yaks to shave. The main problem is in the DAG
emitter, which I have no desire to sink effort into fixing.
If we had a bit to disable patterns in the DAG importer, fixing the
GlobalISelEmitter would be more manageable.