Update both memprof and callsite metadata to reflect inlined functions.
For callsite metadata this is simply a concatenation of each cloned
call's call stack with that of the inlined callsite's.
For memprof metadata, each profiled memory info block (MIB) is either
moved to the cloned allocation call or left on the original allocation
call depending on whether its context matches the newly refined call
stack context on the cloned call. We also reapply context trimming
optimizations based on the refined set of contexts on each of the calls
(cloned and original).
Depends on D128142.
Reviewed By: snehasish
Differential Revision: https://reviews.llvm.org/D128143
This reverts commit 0d7f3464ce and
commit f9403ca41e. The latter was
"Profile matching and IR annotation for memprof profiles." and was left
from a bad rebase from a commit already pushed upstream.
Update both memprof and callsite metadata to reflect inlined functions.
For callsite metadata this is simply a concatenation of each cloned
call's call stack with that of the inlined callsite's.
For memprof metadata, each profiled memory info block (MIB) is either
moved to the cloned allocation call or left on the original allocation
call depending on whether its context matches the newly refined call
stack context on the cloned call. We also reapply context trimming
optimizations based on the refined set of contexts on each of the calls
(cloned and original), via utilities in MemoryProfileInfo.
Depends on D128142.
Differential Revision: https://reviews.llvm.org/D128143
Second patch in the series to remove legacy PM and
associated -enable-new-pm=0 flag targets pass that
has not been ported to new PM - PruneEH.
Discussion about this can be found in D44415.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D134686
Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
when the callee's function body is executed in a different streaming mode than
its caller. This is needed because function calls are the boundaries for
streaming mode changes.
More details about the SME attributes and design can be found
in D131562.
Differential Revision: https://reviews.llvm.org/D131581
This is the first patch in a series intended for removing flag
-enable-new-pm=0 from lit tests. This is part of a bigger
effort of completely removing legacy code related to legacy
pass manager in favor of currently default new pass manager.
In this patch flag has been removed only from tests where no significant
change has been required because checks has been duplicated for
both PMs.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D134150
TLite is a lightweight, statically linkable[1], model evaluator, supporting a
subset of what the full tensorflow library does, sufficient for the
types of scenarios we envision having. It is also faster.
We still use saved models as "source of truth" - 'release' mode's AOT
starts from a saved model; and the ML training side operates in terms of
saved models.
Using TFLite solves the following problems compared to using the full TF
C API:
- a compiler-friendly implementation for runtime-loadable (as opposed
to AOT-embedded) models: it's statically linked; it can be built via
cmake;
- solves an issue we had when building the compiler with both AOT and
full TF C API support, whereby, due to a packaging issue on the TF
side, we needed to have the pip package and the TF C API library at
the same version. We have no such constraints now.
The main liability is it supporting a subset of what the full TF
framework does. We do not expect that to cause an issue, but should that
be the case, we can always revert back to using the full framework
(after also figuring out a way to address the problems that motivated
the move to TFLite).
Details:
This change switches the development mode to TFLite. Models are still
expected to be placed in a directory - i.e. the parameters to clang
don't change; what changes is the directory content: we still need
an `output_spec.json` file; but instead of the saved_model protobuf and
the `variables` directory, we now just have one file, `model.tflite`.
The change includes a utility showing how to take a saved model and
convert it to TFLite, which it uses for testing.
The full TF implementation can still be built (not side-by-side). We
intend to remove it shortly, after patching downstream dependencies. The
build behavior, however, prioritizes TFLite - i.e. trying to enable both
full TF C API and TFLite will just pick TFLite.
[1] thanks to @petrhosek's changes to TFLite's cmake support and its deps!
The value of the attribute is a size in bytes. It has the effect of
suppressing inlining of functions whose stacksizes exceed the given value.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D129904
When F calls G calls H, G is nounwind, and G is inlined into F, then the
inlined call-site to H should be effectively nounwind so as not to lose
information during inlining.
If H itself is nounwind (which often happens when H is an intrinsic), we
no longer mark the callsite explicitly as nounwind. Previously, there
were cases where the inlined call-site of H differs from a pre-existing
call-site of H in F *only* in the explicitly added nounwind attribute,
thus preventing common subexpression elimination.
v2:
- just check CI->doesNotThrow
v3 (resubmit after revert at 3443788087):
- update Clang tests
Differential Revision: https://reviews.llvm.org/D129860
When F calls G calls H, G is nounwind, and G is inlined into F, then the
inlined call-site to H should be effectively nounwind so as not to lose
information during inlining.
If H itself is nounwind (which often happens when H is an intrinsic), we
no longer mark the callsite explicitly as nounwind. Previously, there
were cases where the inlined call-site of H differs from a pre-existing
call-site of H in F *only* in the explicitly added nounwind attribute,
thus preventing common subexpression elimination.
v2:
- just check CI->doesNotThrow
Differential Revision: https://reviews.llvm.org/D129860
Following some recent discussions, this changes the representation
of callbrs in IR. The current blockaddress arguments are replaced
with `!` label constraints that refer directly to callbr indirect
destinations:
; Before:
%res = callbr i8* asm "", "=r,r,i"(i8* %x, i8* blockaddress(@test8, %foo))
to label %asm.fallthrough [label %foo]
; After:
%res = callbr i8* asm "", "=r,r,!i"(i8* %x)
to label %asm.fallthrough [label %foo]
The benefit of this is that we can easily update the successors of
a callbr, without having to worry about also updating blockaddress
references. This should allow us to remove some limitations:
* Allow unrolling/peeling/rotation of callbr, or any other
clone-based optimizations
(https://github.com/llvm/llvm-project/issues/41834)
* Allow duplicate successors
(https://github.com/llvm/llvm-project/issues/45248)
This is just the IR representation change though, I will follow up
with patches to remove limtations in various transformation passes
that are no longer needed.
Differential Revision: https://reviews.llvm.org/D129288
For recursive callers, we want to be conservative when inlining callees with large stack size. We currently have a limit `InlineConstants::TotalAllocaSizeRecursiveCaller`, but that is hard coded.
We found the current limit insufficient to suppress problematic inlining that bloats stack size for deep recursion. This change adds a switch to make the limit tunable as a mitigation.
Differential Revision: https://reviews.llvm.org/D129411
The unidentified objects recognized in `getUnderlyingObjects` may
still alias to the noalias parameter because `getUnderlyingObjects`
may not check deep enough to get the underlying object because of
`MaxLookup`. The real underlying object for the unidentified object
may still be the noalias parameter.
Originally Patched By: tingwang
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D127202
Use a common ConstantFoldInstOperands-based constant folding
implementation, instead of specifying the folding function for
each function individually. Going through the generic handling
doesn't appear to have any significant compile-time impact.
As the test change shows, this is not NFC, because we now use
DataLayout-aware constant folding, which can do slightly better
in some cases (e.g. those involving GEPs).
Support for the legacy pass manager in ArgPromotion causes
complications in D125485. As the legacy pass manager for middle-end
optimizations is unsupported, drop ArgPromotion from the legacy
pipeline, rather than introducing additional complexity to deal
with it.
Differential Revision: https://reviews.llvm.org/D128536
The hidden option max-inline-stacksize=<N> prevents the inlining of functions
with a stack size larger than N.
Reviewed By: mtrofin, aeubanks
Differential Review: https://reviews.llvm.org/D127988
This fixes a useless filecheck and wrong comment for always-inline.ll. Testing
has been done using ninja check-llvm and llvm-lit always-inline.ll --show-all.
Reviewed By: modimo, hoy
Differential Revision: https://reviews.llvm.org/D127815
Previously if the inliner split an SCC such that an empty one remained, the MLInlineAdvisor could potentially lose track of the EdgeCount if a subsequent CGSCC pass modified the calls of a function that was initially in the SCC pre-split. Saving the seen nodes in onPassEntry resolves this.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D127693
Adds option to print the contents of the Inline Advisor after each SCC Inliner pass
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D127689
When a callee function is inlined via an invoke instruction, every function call inside the callee, if not an invoke, will be converted to an invoke after cloned to the caller body. I found that during the conversion the !prof metadata was dropped. This in turned caused a cloned indirect call not properly promoted in subsequent passes.
The particular scenario I was investigating was with AutoFDO and thinLTO. In prelink, no ICP was triggered (neither by the sample loader nor PGO ICP), no indirect call was promoted. This is because 1) the particular indirect call did not have inlined samples; and 2) PGO ICP was intentionally disabled. After inlining, the prof metadata was dropped. Then in postlink, PGO ICP jumped in but didn't do anything. Thus the opportunity was missed.
I'm making a simple fix to preserve !prof metadata when converting call to invoke.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D125249
Since the size of most of SCC's is 1, the PriorityInlineOrder would not change the inline
order in SCC inliner.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D123608
Retain the behavior we get without opaque pointers: A call to a
known function with different function type is considered an
indirect call.
This fixes the crash reported in https://reviews.llvm.org/D123300#3444772.
According to the current design, if a floating point operation is
represented by a constrained intrinsic somewhere in a function, all
floating point operations in the function must be represented by
constrained intrinsics. It imposes additional requirements to inlining
mechanism. If non-strictfp function is inlined into strictfp function,
all ordinary FP operations must be replaced with their constrained
counterparts.
Inlining strictfp function into non-strictfp is not implemented as it
would require replacement of all FP operations in the host function,
which now is undesirable due to expected performance loss.
Differential Revision: https://reviews.llvm.org/D69798
The motivation for this is that while both memcpyopt and dse will catch this case, both are limited by MSSA's walk back threshold when finding clobbers. As such, if you have a memcpy of an otherwise dead alloca placed towards the end of a long basic block with lots of other memory instructions, it would be missed. This is a bit undesirable for such an "obviously" useless bit of code.
As noted in comments, we should probably generalize instcombine's escape analysis peephole (see visitAllocInst) to allow read xor write. Doing that would subsume this code in a more general way, but is also a more involved change. For the moment, I went with the easiest fix.
With D107249 I saw huge compile time regressions on a module (150s ->
5700s). This turned out to be due to a huge RefSCC in
the module. As we ran the function simplification pipeline on functions
in the SCCs in the RefSCC, some of those SCCs would be split out to
their RefSCC, a child of the current RefSCC. We'd skip the remaining
SCCs in the huge RefSCC because the current RefSCC is now the RefSCC
just split out, then revisit the original huge RefSCC from the
beginning. This happened many times because many functions in the
RefSCC were optimizable to the point of becoming their own RefSCC.
This patch makes it so we don't skip SCCs not in the current RefSCC so
that we split out all the child RefSCCs on the first iteration of
RefSCC. When we split out a RefSCC, we invalidate the original RefSCC
and add the remainder of the SCCs into a new RefSCC in
RCWorklist. This happens repeatedly until we finish visiting all
SCCs, at which point there is only one valid RefSCC in
RCWorklist from the original RefSCC containing all the SCCs that
were not split out, and we visit that.
For example, in the newly added test cgscc-refscc-mutation-order.ll,
we'd previously run instcombine in this order:
f1, f2, f1, f3, f1, f4, f1
Now it's:
f1, f2, f3, f4, f1
This can cause more passes to be run in some specific cases,
e.g. if f1<->f2 gets optimized to f1<-f2, we'd previously run f1, f2;
now we run f1, f2, f2.
This improves kimwitu++ compile times by a lot (12-15% for various -O3 configs):
https://llvm-compile-time-tracker.com/compare.php?from=2371c5a0e06d22b48da0427cebaf53a5e5c54635&to=00908f1d67400cab1ad7bcd7cacc7558d1672e97&stat=instructions
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D121953
Failures in `InlineFunction()` are caught after D121722, but `emitInlinedIntoBasedOnCost()` should only be called when inlining is successful. This also removes an unnecessary call to `shouldInline()` which always returned `InlineCost::getAlways()`.
Reviewed By: kyulee, nikic
Differential Revision: https://reviews.llvm.org/D121946
When we build clang without asserts we should still check the result of
`InlineFunction()` to be sure there wasn't an error. Otherwise we could
incorrectly merge attributes in the next line.
This also removes a redundent call to `getCaller()`.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121722
Introduce a new attribute "function-inline-cost-multiplier" which
multiplies the inline cost of a call site (or all calls to a callee) by
the multiplier.
When processing the list of calls created by inlining, check each call
to see if the new call's callee is in the same SCC as the original
callee. If so, set the "function-inline-cost-multiplier" attribute of
the new call site to double the original call site's attribute value.
This does not happen when the original call site is intra-SCC.
This is an alternative to D120584, which marks the call sites as
noinline.
Hopefully fixes PR45253.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D121084
```
always_inline foo() { }
bar () {
noinline foo();
}
```
We should prefer call site attribute over attribute on decl. This is fix for AlwaysInliner, similar fix is needed for normal Inliner (follow up).
Related to https://reviews.llvm.org/D119061
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D119553
The pruning cloner already tries to remove unreachable blocks. The
original cloning process will simplify instructions and constant
terminators, and only clone blocks that are reachable at that point.
However, phi nodes can only be simplified after everything has been
cloned. For that reason, additional blocks may become unreachable
after phi simplification.
The code does try to handle this as well, but only removes blocks
that don't have predecessors. It misses unreachable cycles. This
can cause issues if SEH exception handling code is part of an
unreachable cycle, as the inliner is not prepared to deal with that.
This patch instead performs an explicit scan for reachable blocks,
and drops everything else.
Fixes https://github.com/llvm/llvm-project/issues/53206.
Differential Revision: https://reviews.llvm.org/D118449
Problem: Migration to new PM broke flatten attribute.
This is one use case why LLVM should support inlining call-site with alwaysinline. The flatten attribute is nowdays broken, so we should either land patch like this one or remove everything related to flatten attribute from Clang.
Second use case is something like "per call site inlining intrinsics" to control inlining even more; mentioned in
https://lists.llvm.org/pipermail/cfe-dev/2018-September/059232.html
Fixes https://github.com/llvm/llvm-project/issues/53360
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D117965
The tensorflow AOT compiler can cross-target, but it can't run on (for
example) arm64. We added earlier support where the AOT-ed header and object
would be built on a separate builder and then passed at build time to
a build host where the AOT compiler can't run, but clang can be otherwise
built.
To simplify such scenarios given we now support more than one AOT-able
case (regalloc and inliner), we make the AOT scenario centered on whether
files are generated, case by case (this includes the "passed from a
different builder" scenario).
This means we shouldn't need an 'umbrella' LLVM_HAVE_TF_AOT, in favor of
case by case control. A builder can opt out of an AOT case by passing that case's
model path as `none`. Note that the overrides still take precedence.
This patch controls conditional compilation with case-specific flags,
which can be enabled locally, for the component where those are
available. We still keep an overall flag for some tests.
The 'development/training' mode is unchanged, because there the model is
passed from the command line and interpreted.
Differential Revision: https://reviews.llvm.org/D117752