DenseMap is the wrong data structure to use for sample records and call
sites. The keys are too large, causing massive core memory growth when
reading profiles.
Before this patch, a 21Mb input profile was causing the compiler to grow
to 3Gb in memory. By switching to std::map, the compiler now grows to
300Mb in memory.
There still are some opportunities for memory footprint reduction. I'll
be looking at those next.
llvm-svn: 255389
Added some missing spaces between the module identifier and the start of
the debug message. Also added a ":" after the module identifier to make
this look a little nicer.
llvm-svn: 255259
- This simplifies the CallSite class, arg_begin / arg_end are now
simple wrapper getters.
- In several places, we were creating CallSite instances solely to call
arg_begin and arg_end. With this change, that's no longer required.
llvm-svn: 255226
loading the source Module, linking the function in the destination
module, and destroying the source Module before repeating with the
next function to import (potentially from the same Module).
Ideally we would keep the source Module alive and import the next
Function needed from this Module. Unfortunately this is not possible
because the linker does not leave it in a usable state.
However we can do better by first computing the list of all candidates
per Module, and only then load the source Module and import all the
function we need for it.
The trick to process callees is to materialize function in the source
module when building the list of function to import, and inspect them
in their source module, collecting the list of callees for each
callee.
When we move the the actual import, we will import from each source
module exactly once. Each source module is loaded exactly once.
The only drawback it that it requires to have all the lazy-loaded
source Module in memory at the same time.
Currently this patch already improves considerably the link time,
a multithreaded link of llvm-dis on my laptop was:
real 1m12.175s user 6m32.430s sys 0m10.529s
and is now:
real 0m40.697s user 2m10.237s sys 0m4.375s
Note: this is the full link time (linker+Import+Optimizer+CodeGen)
Differential Revision: http://reviews.llvm.org/D15178
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 255100
For an invoke with operand bundles, the [op_begin(), op_end()-3] range
can contain things other than invoke arguments. This change teaches
PruneEH to use arg_begin() and arg_end() explicitly.
llvm-svn: 255073
Summary:
Add a field on the PassManagerBuilder that clang or gold can use to pass
down a pointer to the function index in memory to use for importing when
the ThinLTO backend is triggered. Add support to supply this to the
function import pass.
Reviewers: joker.eph, dexonsmith
Subscribers: davidxl, llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D15024
llvm-svn: 254926
For efficiency reason, when importing multiple functions for the same Module,
we can avoid reparsing it every time.
Differential Revision: http://reviews.llvm.org/D15102
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 254486
When linking static archive, there is no individual module files to
load. Instead they can be mmap'ed and could be initialized from a
buffer directly. The callback provide flexibility to override the
scheme for loading module from the summary.
Differential Revision: http://reviews.llvm.org/D15101
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 254479
This fixes buildbots in systems that std::to_string is not present. It
also tidies the output of the diagnostic to render doubles a bit better
(thanks Ben Kramer for help with string streams and format).
llvm-svn: 254261
This adds two thresholds to the sample profiler to affect inlining
decisions: the concept of global hotness and coldness.
Functions that have accumulated more than a certain fraction of samples at
runtime, are annotated with the InlineHint attribute. Conversely,
functions that accumulate less than a certain fraction of samples, are
annotated with the Cold attribute.
This is very similar to the hints emitted by Clang when using
instrumentation profiles.
Notice that this is a very blunt instrument. A function may have
globally collected a significant fraction of samples, but that does not
necessarily mean that every callsite for that function is hot.
Ideally, we would annotate each callsite with the samples collected at
that callsite. This way, the inliner can incorporate all these weights
into its cost model.
Once the inliner offers this functionality, we can change the hints
emitted here to a more precise per-callsite annotation. For now, this is
providing some measure of speedups with our internal benchmarks. I've
observed speedups of up to 23% (though the geo mean is about 3%). I expect
these numbers to improve as the inliner gets better annotations.
llvm-svn: 254212
Based on testing of internal benchmarks, I'm lowering this threshold to
a value of 0.1%. This means that SamplePGO will respect 99.9% of the
original inline decisions when following a profile.
The performance difference is noticeable in some tests. With the
previous threshold, the speedups over baseline -O2 was about 0.63%. With
the new default, the speedups are around 3% on average.
The point of this threshold is not to do more aggressive inlining. When
an inlined callsite crosses this threshold, SamplePGO will redo the
inline decision so that it can better apply the input profile.
By respecting most original inline decisions, we can apply more of the
input profile because the shape of the code follows the profile more
closely.
In the next series, I'll be looking at adding some inline hints for the
cold callsites and for toplevel functions that are hot/cold as well.
llvm-svn: 254211
They are as much trouble as aliases to declarations. They are requiring
the code generator to define a symbol with the same value as another
symbol, but the second symbol is undefined.
If representing this is important for some optimization, we could add
support for available_externally aliases. They would be *required* to
point to a declaration (or available_externally definition).
llvm-svn: 254170
Add a simple initial heuristic to control importing based on the number
of instructions recorded in the function's summary. Add option to
control the limit, and test using option.
llvm-svn: 254036
When the original binary is executed and sampled, the resulting profile
contains information on the original inline stack. We currently follow
the original inline plan if we notice that the inlined callsite has more
than 0 samples to it.
A better way is to determine whether the callsite is actually worth
inlining. If the callsite accumulates a small fraction of the samples
spent in the parent function, then we don't want to bother inlining it
(as it means that the callsite is actually cold).
This patch introduces a threshold expressed in percentage of samples
in relation to the parent function. If the callsite uses less than N%
of the total samples used by its parent, the original inline decision is
not re-applied.
I've set the threshold to the very arbitrary value of 5%. I'm yet to do
any actual experiments to see what's a good value. I wanted to separate
the basic mechanism from the tuning.
llvm-svn: 254034
This patch implements a minimum spanning tree (MST) based instrumentation for
PGO. The use of MST guarantees minimum number of CFG edges getting
instrumented. An addition optimization is to instrument the less executed
edges to further reduce the instrumentation overhead. The patch contains both the
instrumentation and the use of the profile to set the branch weights.
Differential Revision: http://reviews.llvm.org/D12781
llvm-svn: 254021
Analyze imported function bodies and add any new external calls to
the worklist for importing. Currently no controls on the importing
so this will end up importing everything possible in the call tree
below the importing module. Basic profitability checks coming next.
Update test to check for iteratively inlined functions.
llvm-svn: 254011
Skip imports for weak_any aliases as well. Fix the test to check
non-import of weak aliases and functions, and import of normal alias.
llvm-svn: 253991
Summary:
This is a helper to perform cross-module import for ThinLTO. Right now
it is importing naively every possible called functions.
Reviewers: tejohnson
Subscribers: dexonsmith, llvm-commits
Differential Revision: http://reviews.llvm.org/D14914
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 253954
The existing coverage tracker counts the number of records that were used
from the input profile. An alternative view of coverage is to check how
many available samples were applied.
This way, if the profile contains several records with few samples, it
doesn't really matter much that they were not applied. The more
interesting records to apply are the ones that contribute many samples.
llvm-svn: 253912
If a function was originally inlined but not actually hot at runtime,
its samples will not be counted inside the parent function. This throws
off the coverage calculation because it expects to find more used
records than it should.
Fixed by ignoring functions that will not be inlined into the parent.
Currently, this is inlined functions with 0 samples. In subsequent
patches, I'll change this to mean "cold" functions.
llvm-svn: 253716
This reverts r253661.
Turns out that the assignment is not redundant (despite the Clang static analyzer claiming the opposite).
The variable is being used by the lambda function AddUsersToWorklistIfCapturing().
llvm-svn: 253696
While debugging some sampling coverage problems, I found this useful:
When applying samples from a profile, it helps to also know what line
offset and discriminator the sample belongs to. This makes it easy to
correlate against the input profile.
llvm-svn: 253670
We currently bail out of global localization if the global has non-instruction users. However, often these can be simple bitcasts or constant-GEPs, which we can easily turn into instructions before localizing. Be a bit more aggressive.
llvm-svn: 253584
This provides a way to force a function to have certain attributes from the command line. This can be useful when debugging or doing workload exploration, where manually editing IR is tedious or not possible (due to build systems etc).
The syntax is -force-attribute=function_name:attribute_name
All function attributes are parsed except alignstack as it requires an argument.
llvm-svn: 253550
While setting function attributes we check all instructions that may access memory. For a call instruction we check all arguments. The special check is required for pointers.
I added vector-of-pointers to the call arguments types that should be checked.
Differential Revision: http://reviews.llvm.org/D14693
llvm-svn: 253363
Address Duncan Exon Smith's comments on D14148, which was added after the patch had been LGTM'd and committed:
* clang-format one area where whitespace diffs occurred.
* Add a threshold to limit the store/load dominance checks as they are quadratic.
llvm-svn: 253192
Global to local demotion can speed up programs that use globals a lot. It is particularly useful with LTO, when the entire call graph is known and most functions have been internalized.
For a global to be demoted, it must only be accessed by one function and that function:
1. Must never recurse directly or indirectly, else the GV would be clobbered.
2. Must never rely on the value in GV at the start of the function (apart from the initializer).
GlobalOpt can already do this, but it is hamstrung and only ever tries to demote globals inside "main", because C++ gives extra guarantees about how main is called - once and only once.
In LTO mode, we can often prove the first property (if the function is internal by this point, we know enough about the callgraph to determine if it could possibly recurse). FunctionAttrs now infers the "norecurse" attribute for this reason.
The second property can be proven for a subset of functions by proving that all loads from GV are dominated by a store to GV. This is conservative in the name of compile time - this only requires a DominatorTree which is fairly cheap in the grand scheme of things. We could do more fancy stuff with MemoryDependenceAnalysis too to catch more cases but this appears to catch most of the useful ones in my testing.
llvm-svn: 253168
This reapplies r252949. I've changed the type of FuncName to be
std::string instead of StringRef in emitFnAttrCompatCheck.
Original commit message for r252949:
Provide a way to specify inliner's attribute compatibility and merging
rules using table-gen. NFC.
This commit adds new classes CompatRule and MergeRule to Attributes.td,
which are used to generate code to check attribute compatibility and
merge attributes of the caller and callee.
rdar://problem/19836465
llvm-svn: 252990