Loop peeling removes conditions from loop bodies that become invariant
after a small number of iterations. When triggered, this leads to fewer
compares and possibly PHIs in loop bodies, enabling further
optimizations. The current cost-model of loop peeling should be quite
conservative/safe, i.e. only peel if a condition in the loop becomes
known after peeling.
For example, see PR47671, where loop peeling enables vectorization by
removing a PHI the vectorizer does not understand. Granted, the
loop-vectorizer could also be taught about constant PHIs, but loop
peeling is likely to enable other optimizations as well.
This has an impact on quite a few benchmarks from
MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto, for example
Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsVectorized
Program base patch diff
test-suite...ve-susan/automotive-susan.test 8.00 9.00 12.5%
test-suite...nal/skidmarks10/skidmarks.test 35.00 31.00 -11.4%
test-suite...lications/sqlite3/sqlite3.test 41.00 43.00 4.9%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test 25.00 26.00 4.0%
test-suite...006/450.soplex/450.soplex.test 88.00 89.00 1.1%
test-suite...TimberWolfMC/timberwolfmc.test 120.00 119.00 -0.8%
test-suite.../CINT2006/403.gcc/403.gcc.test 215.00 216.00 0.5%
test-suite...006/447.dealII/447.dealII.test 957.00 958.00 0.1%
test-suite...ternal/HMMER/hmmcalibrate.test 75.00 75.00 0.0%
Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsAnalyzed
Program base patch diff
test-suite...ks/Prolangs-C/agrep/agrep.test 440.00 434.00 -1.4%
test-suite...nal/skidmarks10/skidmarks.test 312.00 308.00 -1.3%
test-suite...marks/7zip/7zip-benchmark.test 6399.00 6323.00 -1.2%
test-suite...lications/minisat/minisat.test 134.00 135.00 0.7%
test-suite...rks/FreeBench/pifft/pifft.test 295.00 297.00 0.7%
test-suite...TimberWolfMC/timberwolfmc.test 1879.00 1869.00 -0.5%
test-suite...pplications/treecc/treecc.test 689.00 691.00 0.3%
test-suite...T2000/300.twolf/300.twolf.test 1593.00 1597.00 0.3%
test-suite.../Benchmarks/Bullet/bullet.test 1394.00 1392.00 -0.1%
test-suite...ications/JM/ldecod/ldecod.test 1431.00 1429.00 -0.1%
test-suite...6/464.h264ref/464.h264ref.test 2229.00 2230.00 0.0%
test-suite...lications/sqlite3/sqlite3.test 2590.00 2589.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 2732.00 2733.00 0.0%
test-suite...006/453.povray/453.povray.test 3395.00 3394.00 -0.0%
Note the -11% regression in number of loops vectorized for skidmarks. I
suspect this corresponds to the fact that those loops are gone now (see
the reduction in number of loops analyzed by LV).
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D88471
We already set the `sh_entsize` field in a single place
for all non-implicit sections.
This patch reorders the logic slightly and with it
we finally have the only one place where the `sh_entsize` is set.
obj2yaml will not dump the `EntSize` key for `SHT_DYNSYM/SHT_SYMTAB` sections anymore,
when the value of `sh_entsize` is equal to `sizeof(Elf_Sym)`
Note that this also seems revealed an issue in llvm-objcopy:
Previously yaml2obj set the `sh_entsize` for the `.symtab` section to 0x18,
now we it sets it for `SHT_SYMTAB` sections, i.e. by type.
But the `llvm-objcopy/ELF/only-keep-debug.test` has a `.symtab` section of type `SHT_STRTAB`,
and now yaml2obj sets the `sh_entsize` to 0 for it.
I had to update the corresponding check lines for `ES`, but the behavior of
`llvm-objcopy` should be fixed instead I think.
I've added a TODO and a comment.
Differential revision: https://reviews.llvm.org/D95364
A default version (@@) is only available for defined symbols.
Currently we use "@@" for undefined symbols too.
This patch fixes the issue and improves our test case.
Differential revision: https://reviews.llvm.org/D95219
To be able to refer to constant keypaths (e.g. `defvar cplusplus = LangOpts<"CPlusPlus">`) inside `ImpliedByAnyOf`, let's accept strings instead of `Option` instances.
This somewhat weakens the guarantees that we're referring to an existing (option) record, but we can still use the option.KeyPath syntax to simulate this.
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D95344
AMDGPULegalizerInfo.h needs MachineIRBuilder but relies on a forward
declaration of MachineIRBuilder in LegalizerInfo.h. This patch adds a
forward declaration right in AMDGPULegalizerInfo.h.
While we are at it, this patch removes the one in LegalizerInfo.h,
where it is unnecessary.
The current algorithm just tries to localize defs as far as they can go, and in
the case of G_PHI operands, it clones the def into the predecessor block for
each incoming edge. When multiple edges have the same register value, this can
cause unnecessary code bloat, and inhibit later optimizations.
This change checks if a given phi operand is unique in the phi, if not the
def of that register is not localized to the predecessor.
Differential Revision: https://reviews.llvm.org/D95406
This change leverages the work done in D83743 to replay in the SampleProfile inliner to also be used in the CGSCC inliner. NOTE: currently restricted to non-ML advisors only.
The added switch `-cgscc-inline-replay=<remarks file>` will replay the inlining decisions in that file where the remarks file is generated via `-Rpass=inline`. The aim here is to make it easier to analyze changes that would modify inlining heuristics to be separated from this behavior. Doing so allows easier examination of assembly and runtime behavior compared to the baseline rather than trying to dig through the large churn caused by inlining.
In LTO compilation, since inlining is done twice you can separately specify replay by passing the flag to the FE (`-cgscc-inline-replay=`) and to the linker (`-Wl,cgscc-inline-replay=`) with the remarks generated from their respective places.
Testing on mysqld by comparing the inline decisions between base (generates remarks.txt) and diff (replay using identical input/tools with remarks.txt) and examining the inlining sites with `diff` shows 14,000 mismatches out of 247,341 for a ~94% replay accuracy. I believe this gap can be narrowed further though for the general case we may never achieve full accuracy. For my personal use, this is close enough to be representative: I set the baseline as the one generated by the replay on identical input/toolset and compare that to my modified input/toolset using the same replay.
Testing:
ninja check-llvm
newly added test correctly replays CGSCC inlining decisions
Reviewed By: mtrofin, wenlei
Differential Revision: https://reviews.llvm.org/D94334
Refactor the duplicated canonicalize-path logic in `FileCollector` and
`ModulesDependencyCollector` into a new utility called
`PathCanonicalizer` that's shared. This popped up when tracking down a
bug common to both in https://reviews.llvm.org/D95202.
As drive-bys, update a few names and comments to better reflect the
effect of the code, delay removal of `..`s to avoid an unnecessary extra
string copy, and leave behind a couple of FIXMEs for future
consideration.
Differential Revision: https://reviews.llvm.org/D95279
Don't emit an output dash for an empty sequence. Take emitting a vector
of strings for example:
std::vector<std::string> Strings = {"foo", "bar"};
LLVM_YAML_IS_SEQUENCE_VECTOR(std::string)
yout << Strings;
This emits the following YAML document.
---
- foo
- bar
...
When the vector is empty, this generates the following result:
---
- []
...
Although this is valid YAML, it does not match what we meant to emit.
The result is a one-element sequence consisting of an empty list.
Indeed, if we were to try to read this again we get an error:
YAML:2:4: error: not a mapping
- []
The problem is the output dash before the empty list. The correct output
would be:
---
[]
...
This patch fixes that by not emitting the output dash for an empty
sequence.
Differential revision: https://reviews.llvm.org/D95280
or claimRV calls in the IR
Background:
This patch makes changes to the front-end and middle-end that are
needed to fix a longstanding problem where llvm breaks ARC's autorelease
optimization (see the link below) by separating calls from the marker
instructions or retainRV/claimRV calls. The backend changes are in
https://reviews.llvm.org/D92569.
https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue
What this patch does to fix the problem:
- The front-end annotates calls with attribute "clang.arc.rv"="retain"
or "clang.arc.rv"="claim", which indicates the call is implicitly
followed by a marker instruction and a retainRV/claimRV call that
consumes the call result. This is currently done only when the target
is arm64 and the optimization level is higher than -O0.
- ARC optimizer temporarily emits retainRV/claimRV calls after the
annotated calls in the IR and removes the inserted calls after
processing the function.
- ARC contract pass emits retainRV/claimRV calls after the annotated
calls. It doesn't remove the attribute on the call since the backend
needs it to emit the marker instruction. The retainRV/claimRV calls
are emitted late in the pipeline to prevent optimization passes from
transforming the IR in a way that makes it harder for the ARC
middle-end passes to figure out the def-use relationship between the
call and the retainRV/claimRV calls (which is the cause of PR31925).
- The function inliner removes the autoreleaseRV call in the callee that
returns the result if nothing in the callee prevents it from being
paired up with the calls annotated with "clang.arc.rv"="retain/claim"
in the caller. If the call is annotated with "claim", a release call
is inserted since autoreleaseRV+claimRV is equivalent to a release. If
it cannot find an autoreleaseRV call, it tries to transfer the
attributes to a function call in the callee. This is important since
ARC optimizer can remove the autoreleaseRV call returning the callee
result, which makes it impossible to pair it up with the retainRV or
claimRV call in the caller. If that fails, it simply emits a retain
call in the IR if the call is annotated with "retain" and does nothing
if it's annotated with "claim".
- This patch teaches dead argument elimination pass not to change the
return type of a function if any of the calls to the function are
annotated with attribute "clang.arc.rv". This is necessary since the
pass can incorrectly determine nothing in the IR uses the function
return, which can happen since the front-end no longer explicitly
emits retainRV/claimRV calls in the IR, and change its return type to
'void'.
Future work:
- Use the attribute on x86-64.
- Fix the auto upgrader to convert call+retainRV/claimRV pairs into
calls annotated with the attributes.
rdar://71443534
Differential Revision: https://reviews.llvm.org/D92808
For a function that returns InstructionCost, it is very tempting to write:
return InstructionCost::Invalid;
But that actually returns InstructionCost(1 /* int value of Invalid */))
which has a totally different meaning. By marking this constructor as
`delete`, this can no longer happen.
This was discussed in D93678 thread.
Currently we have one special chunk - Fill.
This patch re implements the "SectionHeaderTable" key to become a special chunk too.
With that we are able to place the section header table at any location,
just like we place sections.
Differential revision: https://reviews.llvm.org/D95140
For now, we correct the result for sqrt if iteration > 0. This doesn't make
sense as they are not strict relative.
Reviewed By: dmgreen, spatel, RKSimon
Differential Revision: https://reviews.llvm.org/D94480
Reassociating some patterns to generate more fma instructions to
reduce register pressure.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D92071
Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving lowering of llvm.amdgcn.init.exec later in backend.
Also remove SI_INIT_EXEC_LO pseudo as this is not necessary.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D94645
InstrEmitter.h needs TargetMachine but relies on a forward declaration
of TargetMachine in MachineOperand.h. This patch adds a forward
declaration right in InstrEmitter.h.
While we are at it, this patch removes the one in MachineOperand.h,
where it is unnecessary.
In the cloning infrastructure, only track an MDNode mapping,
without explicitly storing the Metadata mapping, same as is done
during inlining. This makes things slightly simpler.
To simplify the transition to using LTOBackend, move DisableVerify to
the LTOCodeGenerator class, like most/all other options.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D95223
Similar to D92887, LoopRotation also needs duplicate the noalias scopes when rotating a `@llvm.experimental.noalias.scope.decl` across a block boundary.
This is based on the version from the Full Restrict paches (D68511).
The problem it fixes also showed up in Transforms/Coroutines/ex5.ll after D93040 (when enabling strict checking with -verify-noalias-scope-decl-dom).
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D94306
This is a fix for https://bugs.llvm.org/show_bug.cgi?id=39282. Compared to D90104, this version is based on part of the full restrict patched (D68484) and uses the `@llvm.experimental.noalias.scope.decl` intrinsic to track the location where !noalias and !alias.scope scopes have been introduced. This allows us to only duplicate the scopes that are really needed.
Notes:
- it also includes changes and tests from D90104
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D92887
The tileLoops method implements the code generation part of the tile directive introduced in OpenMP 5.1. It takes a list of loops forming a loop nest, tiles it, and returns the CanonicalLoopInfo representing the generated loops.
The implementation takes n CanonicalLoopInfos, n tile size Values and returns 2*n new CanonicalLoopInfos. The input CanonicalLoopInfos are invalidated and BBs not reused in the new loop nest removed from the function.
In a modified version of D76342, I was able to correctly compile and execute a tiled loop nest.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D92974
Add an intrinsic type class to represent the
llvm.experimental.noalias.scope.decl intrinsic, to make code
working with it a bit nicer by hiding the metadata extraction
from view.
This patch adds a new InstModificationIRStrategy to mutate flags/options
for instructions. For example, it may add or remove nuw/nsw flags from
add, mul, sub, shl instructions or change the predicate for icmp
instructions.
Subtle changes such as those mentioned above should lead to a more
interesting range of inputs. The presence or absence of overflow flags
can expose subtle bugs, for example.
Reviewed By: bogner
Differential Revision: https://reviews.llvm.org/D94905
LoopUtils.h needs ICFLoopSafetyInfo but relies on a forward
declaration of ICFLoopSafetyInfo in IVDescriptors.h. This patch adds
a forward declaration right in LoopUtils.h.
While we are at it, this patch removes the one in IVDescriptors.h,
where it is unnecessary.
With the addition of the `willreturn` attribute, functions that may
not return (e.g. due to an infinite loop) are well defined, if they are
not marked as `willreturn`.
This patch updates `wouldInstructionBeTriviallyDead` to not consider
calls that may not return as dead.
This patch still provides an escape hatch for intrinsics, which are
still assumed as willreturn unconditionally. It will be removed once
all intrinsics definitions have been reviewed and updated.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94106
If i change it to AssertingVH instead, a number of existing tests fail,
which means we don't consistently remove from the set when deleting blocks,
which means newly-created blocks may happen to appear in that set
if they happen to occupy the same memory chunk as did some block
that was in the set originally.
There are many places where we delete blocks,
and while we could probably consistently delete from LoopHeaders
when deleting a block in transforms located in SimplifyCFG.cpp itself,
transforms located elsewhere (Local.cpp/BasicBlockUtils.cpp) also may
delete blocks, and it doesn't seem good to teach them to deal with it.
Since we at most only ever delete from LoopHeaders,
let's just delegate to WeakVH to do that automatically.
But to be honest, personally, i'm not sure that the idea
behind LoopHeaders is sound.
The target features are obtained as a list of features/attributes.
Instead of storing them in a single string, store the vector. This
matches lto::Config's behavior and simplifies the transition to
lto::backend().
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D95224
Rather than reimplement, use a `using` declaration to bring in
`SmallVectorImpl<char>`'s assign and append implementations in
`SmallString`.
The `SmallString` versions were missing reference invalidation
assertions from `SmallVector`. This patch also fixes a bug in
`llvm::FileCollector::addFileImpl`, which was a copy/paste from
`clang::ModuleDependencyCollector::copyToRoot`, both caught by the
no-longer-skipped assertions.
As a drive-by, this also sinks the `const SmallVectorImpl&` versions of
these methods down into `SmallVectorImpl`, since I imagine they'd be
useful elsewhere.
Differential Revision: https://reviews.llvm.org/D95202
The only caller of this function is in the LocalStackSlotAllocation
and it creates base register of class returned by the target's
getPointerRegClass(). AMDGPU wants to use a different reg class
here so let materializeFrameBaseRegister to just create and return
whatever it wants.
Differential Revision: https://reviews.llvm.org/D95268
This patch addresses inconsistencies in the way fallthrough is handled
in the RedirectingFileSystem. Rather than trying to change the working
directory of the external filesystem, the RedirectingFileSystem will
canonicalize every path before handing it down. This guarantees that
relative paths are resolved relative to the RedirectingFileSystem's
working directory.
This allows us to have a strictly virtual working directory, and still
fallthrough for absolute paths, but not for relative paths that would
get resolved incorrectly at the lower layer (for example, in case of the
RealFileSystem, because the strictly virtual path does not exist).
Differential revision: https://reviews.llvm.org/D95188
The widenScalar implementation for signed and unsigned overflowing
operations were very similar: both are checked by truncating the result
and then re-sign/zero-extending it and checking that it matches the
computed operation.
Using a truncate + zero-extend for the unsigned case instead of manually
producing the AND instruction like before leads to an extra copy
instruction during legalization, but this should be harmless.
Differential Revision: https://reviews.llvm.org/D95035
This is to support the memory routines vec_malloc, vec_calloc, vec_realloc, and vec_free. These routines manage memory that is 16-byte aligned. And they are only available on AIX.
Differential Revision: https://reviews.llvm.org/D94710
This patch implements codegen for __managed__ variable attribute for HIP.
Diagnostics will be added later.
Differential Revision: https://reviews.llvm.org/D94814
When adding an enum attribute to an AttributeList, avoid going
through an AttrBuilder and instead directly add the attribute to
the correct set. Going through AttrBuilder is expensive, because
it requires all string attributes to be reconstructed.
This can be further improved by inserting the attribute at the
right position and using the AttributeSetNode::getSorted() API.
This recovers the small compile-time regression from D94633.
This speeds up setLastUser enough to give a 5% to 10% speed up on
trivial invocations of opt and llc, as measured by:
perf stat -r 100 opt -S -o /dev/null -O3 /dev/null
perf stat -r 100 llc -march=amdgcn /dev/null -filetype null
Don't dump last use information unless -debug-pass=Details to avoid
printing lots of spam that will break some existing lit tests. Before
this patch, dumping last use information was broken anyway, because it
used InversedLastUser before it had been populated.
Differential Revision: https://reviews.llvm.org/D92309
Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.
amdgpu-inline only does a couple of custom things compared to the normal
inliner:
1) It disables inlining if the number of BBs in a function would exceed
some limit
2) It increases the threshold if there are pointers to private arrays(?)
These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.
This way we can remove the custom amdgpu-inline pass.
This caused inline-hint.ll to fail, and after some investigation, it
looks like getInliningThresholdMultiplier() was previously getting
applied twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it
not applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D94153
Add factory to create streams for logging the reproducer. Allows for more general logging (beyond file) and logging the configuration/module separately (logged in order, configuration before module).
Also enable querying filename of ToolOutputFile.
Differential Revision: https://reviews.llvm.org/D94868
The fault-only-first-load instructions can reduce VL if an element
other than element 0 triggers a memory fault. This can be used to
vectorize loops with data dependent exit conditions like strcmp or
strlen.
This patch adds a VL output to these intrinsics so that the new
VL value can be captured by software. This will be expanded to
'csrr gpr, vl' after the vleff instruction during SelectionDAG.
By doing this with one intrinsic we are able to guarantee that the
csrr reads the VL value produced by the vleff instruction. Having
it as a separate intrinsic would make it impossible to guarantee
ordering without making every other vector intrinsic have side
effects.
The intrinsics are expanded during lowering into two ISD nodes
that are glued together. These ISD nodes will go
through isel separately, but should maintain the glue so that they
get emitted adjacently by InstrEmitter.
I've only ran the chain through the vleff instruction, allowing
the READ_VL to be deleted if it is unused.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D94286
Upgrade RISC-V V extension to v1.0-08a0b46.
Indexed load/store have ordered and unordered form.
New whole vector load/store.
Differential Revision: https://reviews.llvm.org/D93614
This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
This fixes the final (I think?) reference invalidation in `SmallVector`
that we need to fix to align with `std::vector`. (There is still some
left in the range insert / append / assign, but the standard calls that
UB for `std::vector` so I think we don't care?)
For POD-like types, reimplement `emplace_back()` in terms of
`push_back()`, taking a copy even for large `T` rather than lose the
realloc optimization in `grow_pod()`.
For other types, split the grow operation in three and construct the new
element in the middle.
- `mallocForGrow()` calculates the new capacity and returns the result
of `safe_malloc()`. We only need a single definition per
`SmallVectorBase` so this is defined in SmallVector.cpp to avoid code
size bloat. Moving this part of non-POD grow to the source file also
allows the logic to be easily shared with `grow_pod`, and
`report_size_overflow()` and `report_at_maximum_capacity()` can move
there too.
- `moveElementsForGrow()` moves elements from the old to the new
allocation.
- `takeAllocationForGrow()` frees the old allocation and saves the
new allocation and capacity .
`SmallVector:assign(size_type, const T&)` also uses the split-grow
operations for non-POD, but it also has a semantic change when not
growing. Previously, assign would start with `clear()`, and so the old
elements were destructed and all elements of the new vector were
copy-constructed (potentially invalidating references). The new
implementation skips destruction and uses copy-assignment for the prefix
of the new vector that fits. The new semantics match what libc++ does
for `std::vector::assign()`.
Note that the following is another possible implementation:
```
void assign(size_type NumElts, ValueParamT Elt) {
std::fill_n(this->begin(), std::min(NumElts, this->size()), Elt);
this->resize(NumElts, Elt);
}
```
The downside of this simpler implementation is that if the vector has to
grow there will be `size()` redundant copy operations.
(I had planned on splitting this patch up into three for committing
(after getting performance numbers / initial review), but I've realized
that if this does for some reason need to be reverted we'll probably
want to revert the whole package...)
Differential Revision: https://reviews.llvm.org/D94739
Make this look more like the DAG handling and move to common code.
I also noticed AArch64 seems to not be properly adding the
physreg:virtreg mapping to the function live ins.
Summary:
The custom mapper API did not previously support the mapping names added previously. This means they were not present if a user requested debugging information while using the mapper functions. This adds basic support for passing the mapped names to the runtime library.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D94806
Previous code build the model that tile config register is the user of
each AMX instruction. There is a problem for the tile config register
spill. When across function, the ldtilecfg instruction may be inserted
on each AMX instruction which use tile config register. This cause all
tile data register clobber.
To fix this issue, we remove the model of tile config register. We
analyze the regmask of call instruction and insert ldtilecfg if there is
any tile data register live across the call. Inserting the sttilecfg
before the call is unneccessary, because the tile config doesn't change
and we can just reload the config.
Besides we also need check tile config register interference. Since we
don't model the config register we should check interference from the
ldtilecfg to each tile data register def.
ldtilecfg
/ \
BB1 BB2
/ \
call BB3
/ \
%1=tileload %2=tilezero
We can start from the instruction of each tile def, and backward to
ldtilecfg. If there is any call instruction, and tile data register is
not preserved, we should insert ldtilecfg after the call instruction.
Differential Revision: https://reviews.llvm.org/D94155
This makes the following improvements.
For `SHT_GNU_versym`:
* yaml2obj: set `sh_link` to index of `.dynsym` section automatically.
For `SHT_GNU_verdef`:
* yaml2obj: set `sh_link` to index of `.dynstr` section automatically.
* yaml2obj: set `sh_info` field automatically.
* obj2yaml: don't dump the `Info` field when its value matches the number of version definitions.
For `SHT_GNU_verneed`:
* yaml2obj: set `sh_link` to index of `.dynstr` section automatically.
* yaml2obj: set `sh_info` field automatically.
* obj2yaml: don't dump the `Info` field when its value matches the number of version dependencies.
Also, simplifies few test cases.
Differential revision: https://reviews.llvm.org/D94956
This reverts commit d97f776be5.
The original problem was due to build failures in shared lib builds. D95079
moved ImportedFunctionsInliningStatistics under Analysis, unblocking
this.
This is related to D94982. We want to call these APIs from the Analysis
component, so we can't leave them under Transforms.
Differential Revision: https://reviews.llvm.org/D95079
This reverts commit 5b7aef6eb4 and relands
6529d7c5a4.
The ASan error was debugged and determined to be the fault of an invalid
object file input in our test suite, which was fixed by my last change.
LLD's project policy is that it assumes input objects are valid, so I
have added a comment about this assumption to the relocation bounds
check.
Run the ObjCARCContractPass during LTO. The legacy LTO backend (under
LTO/ThinLTOCodeGenerator.cpp) already does this; this diff just adds that
behavior to the new LTO backend. Without that pass, the objc.clang.arc.use
intrinsic will get passed to the instruction selector, which doesn't know how to
handle it.
In order to test both the new and old pass managers, I've also added support for
the `--[no-]lto-legacy-pass-manager` flags.
P.S. Not sure if the ordering of the pass within the pipeline matters...
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D94547
When using 2 InlinePass instances in the same CGSCC - one for other
mandatory inlinings, the other for the heuristic-driven ones - the order
in which the ImportedFunctionStats would be output-ed would depend on
the destruction order of the inline passes, which is not deterministic.
This patch moves the ImportedFunctionStats responsibility to the
InlineAdvisor to address this problem.
Differential Revision: https://reviews.llvm.org/D94982
Add the aarch64[_be]-*-gnu_ilp32 targets to support the GNU ILP32 ABI for AArch64.
The needed codegen changes were mostly already implemented in D61259, which added support for the watchOS ILP32 ABI. The main changes are:
- Wiring up the new target to enable ILP32 codegen and MC.
- ILP32 va_list support.
- ILP32 TLSDESC relocation support.
There was existing MC support for ELF ILP32 relocations from D25159 which could be enabled by passing "-target-abi ilp32" to llvm-mc. This was changed to check for "gnu_ilp32" in the target triple instead. This shouldn't cause any issues since the existing support was slightly broken: it was generating ELF64 objects instead of the ELF32 object files expected by the GNU ILP32 toolchain.
This target has been tested by running the full rustc testsuite on a big-endian ILP32 system based on the GCC ILP32 toolchain.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D94143
The pass analysis uses "sets" implemented using a SmallVector type
to keep track of Used, Preserved, Required and RequiredTransitive
passes. When having nested analyses we could end up with duplicates
in those sets, as there was no checks to see if a pass already
existed in the "set" before pushing to the vectors. This idea with
this patch is to avoid such duplicates by avoiding pushing elements
that already is contained when adding elements to those sets.
To align with the above PMDataManager::collectRequiredAndUsedAnalyses
is changed to skip adding both the Required and RequiredTransitive
passes to its result vectors (since RequiredTransitive always is
a subset of Required we ended up with duplicates when traversing
both sets).
Main goal with this is to avoid spending time verifying the same
analysis mulitple times in PMDataManager::verifyPreservedAnalysis
when iterating over the Preserved "set". It is assumed that removing
duplicates from a "set" shouldn't have any other negative impact
(I have not seen any problems so far). If this ends up causing
problems one could do some uniqueness filtering of the vector being
traversed in verifyPreservedAnalysis instead.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D94416
If constants are hidden behind G_ANYEXT we can treat them same way as G_SEXT.
For that purpose we extend getConstantVRegValWithLookThrough with option
to handle G_ANYEXT same way as G_SEXT.
Differential Revision: https://reviews.llvm.org/D92219
When constraining an operand register using constrainOperandRegClass(),
the function may emit a COPY in case the provided register class does
not match the current operand register class. However, the operand
itself is not updated to make use of the COPY, thereby resulting in
incorrect code. This patch fixes that bug by updating the machine
operand accordingly.
Reviewed By: dsanders
Differential Revision: https://reviews.llvm.org/D91244
For Zvlsseg, we need continuous vector registers for the values. We need
to define new register classes for the different combinations of (number
of fields and LMUL). For example,
when the number of fields(NF) = 3, LMUL = 2, the values will be assigned
to (V0M2, V2M2, V4M2), (V2M2, V4M2, V6M2), (V4M2, V6M2, V8M2), ...
We define the vlseg intrinsics with multiple outputs. There is no way to
describe the codegen patterns with multiple outputs in the tablegen
files. We do the codegen in RISCVISelDAGToDAG and use EXTRACT_SUBREG to
extract the values of output.
The multiple scalable vector values will be put into a struct. This
patch is depended on the support for scalable vector struct.
Differential Revision: https://reviews.llvm.org/D94229
Currently LLVM is relying on ValueTracking's `isKnownNonZero` to attach `nonnull`, which can return true when the value is poison.
To make the semantics of `nonnull` consistent with the behavior of `isKnownNonZero`, this makes the semantics of `nonnull` to accept poison, and return poison if the input pointer isn't null.
This makes many transformations like below legal:
```
%p = gep inbounds %x, 1 ; % p is non-null pointer or poison
call void @f(%p) ; instcombine converts this to call void @f(nonnull %p)
```
Instead, this semantics makes propagation of `nonnull` to caller illegal.
The reason is that, passing poison to `nonnull` does not immediately raise UB anymore, so such program is still well defined, if the callee does not use the argument.
Having `noundef` attribute there re-allows this.
```
define void @f(i8* %p) { ; functionattr cannot mark %p nonnull here anymore
call void @g(i8* nonnull %p) ; .. because @g never raises UB if it never uses %p.
ret void
}
```
Another attribute that needs to be updated is `align`. This patch updates the semantics of align to accept poison as well.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D90529
separate sections.
For ThinLTO, all the function profiles without context has been annotated to
outline functions if possible in prelink phase. In postlink phase, profile
annotation in postlink phase is only meaningful for function profile with
context. If the profile is large, it is better to split the profile into two
parts, one with context and one without, so the profile reading in postlink
phase only has to read the part with context. To have the profile splitting,
we extend the ExtBinary format to support different section arrangement. It
will be flexible to add other section layout in the future without the need
to create new class inheriting from ExtBinary class.
Differential Revision: https://reviews.llvm.org/D94435
If we are able to compare with 0 instead of 1, we might be able
to fold the setcc into a beqz/bnez.
Often these setccs start life as an xor that gets converted to
a setcc by DAG combiner's rebuildSetcc. I looked into a detecting
(xor X, 1) and converting to (seteq X, 0) based on boolean contents
being 0/1 in rebuildSetcc instead of using computeKnownBits. It was
very perturbing to AMDGPU tests which I didn't look closely at.
It had a few changes on a couple other targets, but didn't seem
to be much if any improvement.
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D94730
Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93042
This is a restricted version of the combine in `DAGCombiner::MatchLoadCombine`.
(See D27861)
This tries to recognize patterns like below (assuming a little-endian target):
```
s8* x = ...
s32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
->
s32 val = *((i32)a)
s8* x = ...
s32 val = a[3] | (a[2] << 8) | (a[1] << 16) | (a[0] << 24)
->
s32 val = BSWAP(*((s32)a))
```
(This patch also handles the big-endian target case as well, in which the first
example above has a BSWAP, and the second example above does not.)
To recognize the pattern, this searches from the last G_OR in the expression
tree.
E.g.
```
Reg Reg
\ /
OR_1 Reg
\ /
OR_2
\ Reg
.. /
Root
```
Each non-OR register in the tree is put in a list. Each register in the list is
then checked to see if it's an appropriate load + shift logic.
If every register is a load + potentially a shift, the combine checks if those
loads + shifts, when OR'd together, are equivalent to a wide load (possibly with
a BSWAP.)
To simplify things, this patch
(1) Only handles G_ZEXTLOADs (which appear to be the common case)
(2) Only works in a single MachineBasicBlock
(3) Only handles G_SHL as the bit twiddling to stick the small load into a
specific location
An IR example of this is here: https://godbolt.org/z/4sP9Pj (lifted from
test/CodeGen/AArch64/load-combine.ll)
At -Os on AArch64, this is a 0.5% code size improvement for CTMark/sqlite3,
and a 0.4% improvement for CTMark/7zip-benchmark.
Also fix a bug in `isPredecessor` which caused it to fail whenever `DefMI` was
the first instruction in the block.
Differential Revision: https://reviews.llvm.org/D94350
The TableGen emitter for directives has two slots for flangClass information and this was mainly
to be able to keep up with the legacy openmp parser at the time. Now that all clauses are encapsulated in
AccClause or OmpClause, these two strings are not necessary anymore and were the the source of couple
of problem while working with the generic structure checker for OpenMP.
This patch remove the flangClassValue string from DirectiveBase.td and use the string flangClass as the
placeholder for the encapsulated class.
Reviewed By: sameeranjoshi
Differential Revision: https://reviews.llvm.org/D94821
This CPU supports all v8.5a features except BTI, and so identifies as v8.5a to
Clang. A bit weird, but the best way for things like xnu to detect the new
features it cares about.
This patch updates the llvm module map to reflect changes made in
`24672ddea3c97fd1eca3e905b23c0116d7759ab8` and fixes the module builds
(`-DLLVM_ENABLE_MODULES=On`).
Signed-off-by: Med Ismail Bennani <medismail.bennani@gmail.com>
This patch computes the cost for vector.reduce<operand> for scalable vectors.
The cost is split into two parts: the legalization cost and the horizontal
reduction.
Differential Revision: https://reviews.llvm.org/D93639
D84108 exposed a bad interaction between inlining and loop-rotation
during regular LTO, which is causing notable regressions in at least
CINT2006/473.astar.
The problem boils down to: we now rotate a loop just before the vectorizer
which requires duplicating a function call in the preheader when compiling
the individual files ('prepare for LTO'). But this then prevents further
inlining of the function during LTO.
This patch tries to resolve this issue by making LoopRotate more
conservative with respect to rotating loops that have inline-able calls
during the 'prepare for LTO' stage.
I think this change intuitively improves the current situation in
general. Loop-rotate tries hard to avoid creating headers that are 'too
big'. At the moment, it assumes all inlining already happened and the
cost of duplicating a call is equal to just doing the call. But with LTO,
inlining also happens during full LTO and it is possible that a previously
duplicated call is actually a huge function which gets inlined
during LTO.
From the perspective of LV, not much should change overall. Most loops
calling user-provided functions won't get vectorized to start with
(unless we can infer that the function does not touch memory, has no
other side effects). If we do not inline the 'inline-able' call during
the LTO stage, we merely delayed loop-rotation & vectorization. If we
inline during LTO, chances should be very high that the inlined code is
itself vectorizable or the user call was not vectorizable to start with.
There could of course be scenarios where we inline a sufficiently large
function with code not profitable to vectorize, which would have be
vectorized earlier (by scalarzing the call). But even in that case,
there probably is no big performance impact, because it should be mostly
down to the cost-model to reject vectorization in that case. And then
the version with scalarized calls should also not be beneficial. In a way,
LV should have strictly more information after inlining and make more
accurate decisions (barring cost-model issues).
There is of course plenty of room for things to go wrong unexpectedly,
so we need to keep a close look at actual performance and address any
follow-up issues.
I took a look at the impact on statistics for
MultiSource/SPEC2000/SPEC2006. There are a few benchmarks with fewer
loops rotated, but no change to the number of loops vectorized.
Reviewed By: sanwou01
Differential Revision: https://reviews.llvm.org/D94232