Summary: In iterative sample pgo where profile is collected from PGOed binary, we may see indirect call targets promoted and inlined in the profile. Before profile annotation, we need to make this happen in order to annotate correctly on IR. This patch explicitly promotes these indirect calls and inlines them before profile annotation.
Reviewers: xur, davidxl
Reviewed By: davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29040
llvm-svn: 292979
Summary: Callsites in the same basic block should share the same hotness. This patch checks for the hottest callsite in the same basic block, and use the hotness for all callsites in that basic block for early inline decisions. It also fixes the test to add "-S" so theat the "CHECK-NOT" is actually checking the content.
Reviewers: dnovillo
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D24734
llvm-svn: 281927
Summary: The call target count profile is directly derived from LBR branch->target data. This is more reliable than instruction frequency profiles that could be moved across basic block boundaries. This patches uses call target count profile to annotate call instructions.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D24410
llvm-svn: 281911
Summary: Previously we reline on inst-combine to remove inlinable invoke instructions. This causes trouble because a few extra optimizations are schedule early that could introduce too much CFG change (e.g. simplifycfg removes too much control flow). This patch handles invoke instruction in-place during sample profile annotation, so that we do not rely on instcombine to remove those invoke instructions.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D24409
llvm-svn: 281870
Summary: The refined propagation algorithm is more accurate and robust.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D23224
llvm-svn: 278522
Summary: Set ProfileSummary in SampleProfilerLoader.
Reviewers: davidxl, eraman
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D21702
llvm-svn: 273745
Summary: Inliner needs ACT when calling InlineFunction. Instead of nullptr, we need to pass it in from SampleProfileLoader
Reviewers: davidxl
Subscribers: eraman, vsk, danielcdh, llvm-commits
Differential Revision: http://reviews.llvm.org/D21205
llvm-svn: 273199
Summary:
Instead of using maximum IR weight as the basic block weight, this patch uses the voting algorithm to find the most likely weight for the basic block. This can effectively avoid the cases when some IRs are annotated incorrectly due to code motion of the profiled binary.
This patch also updates propagate.ll unittest to include discriminator in the input file so that it is testing something meaningful.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D19301
llvm-svn: 267519
Summary: SampleProfile pass needs to be performed after InstructionCombiningPass, which helps eliminate un-inlinable function calls.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17742
llvm-svn: 262419
This adds two thresholds to the sample profiler to affect inlining
decisions: the concept of global hotness and coldness.
Functions that have accumulated more than a certain fraction of samples at
runtime, are annotated with the InlineHint attribute. Conversely,
functions that accumulate less than a certain fraction of samples, are
annotated with the Cold attribute.
This is very similar to the hints emitted by Clang when using
instrumentation profiles.
Notice that this is a very blunt instrument. A function may have
globally collected a significant fraction of samples, but that does not
necessarily mean that every callsite for that function is hot.
Ideally, we would annotate each callsite with the samples collected at
that callsite. This way, the inliner can incorporate all these weights
into its cost model.
Once the inliner offers this functionality, we can change the hints
emitted here to a more precise per-callsite annotation. For now, this is
providing some measure of speedups with our internal benchmarks. I've
observed speedups of up to 23% (though the geo mean is about 3%). I expect
these numbers to improve as the inliner gets better annotations.
llvm-svn: 254212
If a function was originally inlined but not actually hot at runtime,
its samples will not be counted inside the parent function. This throws
off the coverage calculation because it expects to find more used
records than it should.
Fixed by ignoring functions that will not be inlined into the parent.
Currently, this is inlined functions with 0 samples. In subsequent
patches, I'll change this to mean "cold" functions.
llvm-svn: 253716
The initial coverage checking code for sample records failed to count
records inside inlined profiles. This change fixes the oversight.
llvm-svn: 251752
This adds the flag -mllvm -sample-profile-check-coverage=N to the
SampleProfile pass. N is the percent of input sample records that the
user expects to apply. If the pass does not use N% (or more) of the
sample records in the input, it emits a warning.
This is useful to detect some forms of stale profiles. If the code has
drifted enough from the original profile, there will be records that do
not match the IR anymore.
This will not detect cases where a sample profile record for line L is
referring to some other instructions that also used to be at line L.
llvm-svn: 251568
When emitting a remark for a conditional branch annotation, the remark
uses the line location information of the conditional branch in the
message. In some cases, that information is unavailable and the
optimization would segfaul. I'm still not sure whether this is a bug or
WAI, but the optimizer should not die because of this.
llvm-svn: 251420
This adds a couple of optimization remarks to the SamplePGO
transformation. When it decides to inline a hot function (to mimic the
inline stack and repeat useful inline decisions in the original build).
It will also report branch destinations. For instance, given the code
fragment:
6 if (i < 1000)
7 sum -= i;
8 else
9 sum += -i * rand();
If the 'else' branch is taken most of the time, building this code with
-Rpass=sample-profile will produce:
a.cc:9:14: remark: most popular destination for conditional branches at small.cc:6:9 [-Rpass=sample-profile]
sum += -i * rand();
^
llvm-svn: 251330
In some cases (as illustrated in the unittest), lineno can be less than the heade_lineno because the function body are included from some other files. In this case, offset will be negative. This patch makes clang still able to match the profile to IR in this situation.
http://reviews.llvm.org/D13914
llvm-svn: 250873
The number of samples collected at the head of a function only make
sense for top-level functions (i.e., those actually called as opposed to
being inlined inside another).
Head samples essentially count the time spent inside the function's
prologue. This clearly doesn't make sense for inlined functions, so we
were always emitting 0 in those.
llvm-svn: 250539
Binary encoded profiles used to encode all function names inline at
every reference. This is clearly suboptimal in terms of space. This
patch fixes this by adding a name table to the header of the file.
llvm-svn: 250241
With this patch we can now read and write inline stacks in sample
profiles to the binary encoded profiles.
In a subsequent patch, I will add a string table to the binary encoding.
Right now function names are emitted as strings every time we find them.
This is too bloated and will produce large files in applications with
lots of inlining.
llvm-svn: 249861
This adds enough machinery to support reading simple GCC AutoFDO
profiles. It now supports reading flat profiles (no function calls).
Subsequent patches will add support for:
- Inlined calls (in particular, the inline call stack is not traversed
to accumulate samples).
- Working sets and modules. These are used mostly for GCC's LIPO
optimizations, so they're not needed in LLVM atm. I'm not sure that
we will ever need them. For now, I've if0'd around the calls.
The patch also adds support in GCOV.h for gcov version V704 (generated
by GCC's profile conversion tool).
llvm-svn: 247874
This patch uses the new function profile metadata "function_entry_count"
to annotate entry counts from sample profiles.
In a sampling profile, the total samples collected at the function entry
are an approximation for the number of times that function was invoked.
llvm-svn: 237265
Summary:
This patch finishes up support for handling sampling profiles in both
text and binary formats. The new binary format uses uleb128 encoding to
represent numeric values. This makes profiles files about 25% smaller.
The profile writer class can write profiles in the existing text and the
new binary format. In subsequent patches, I will add the capability to
read (and perhaps write) profiles in the gcov format used by GCC.
Additionally, I will be adding support in llvm-profdata to manipulate
sampling profiles.
There was a bit of refactoring needed to separate some code that was in
the reader files, but is actually common to both the reader and writer.
The new test checks that reading the same profile encoded as text or
raw, produces the same results.
Reviewers: bogner, dexonsmith
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6000
llvm-svn: 220915
Summary:
The compiler does not always generate linkage names. If a function
has been inlined and its body elided, its linkage name may not be
generated.
When the binary executes, the profiler will use its unmangled name
when attributing samples. This results in unmangled names in the
input profile.
We are currently failing hard when this happens. However, in this case
all that happens is that we fail to attribute samples to the inlined
function. While this means fewer optimization opportunities, it should
not cause a compilation failure.
This patch accepts all valid function names, regardless of whether
they were mangled or not.
Reviewers: chandlerc
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D3087
llvm-svn: 204142
Summary:
When the sample profiles include discriminator information,
use the discriminator values to distinguish instruction weights
in different basic blocks.
This modifies the BodySamples mapping to map <line, discriminator> pairs
to weights. Instructions on the same line but different blocks, will
use different discriminator values. This, in turn, means that the blocks
may have different weights.
Other changes in this patch:
- Add tests for positive values of line offset, discriminator and samples.
- Change data types from uint32_t to unsigned and int and do additional
validation.
Reviewers: chandlerc
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2857
llvm-svn: 203508
1- Use the line_iterator class to read profile files.
2- Allow comments in profile file. Lines starting with '#'
are completely ignored while reading the profile.
3- Add parsing support for discriminators and indirect call samples.
Our external profiler can emit more profile information that we are
currently not handling. This patch does not add new functionality to
support this information, but it allows profile files to provide it.
I will add actual support later on (for at least one of these
features, I need support for DWARF discriminators in Clang).
A sample line may contain the following additional information:
Discriminator. This is used if the sampled program was compiled with
DWARF discriminator support
(http://wiki.dwarfstd.org/index.php?title=Path_Discriminators). This
is currently only emitted by GCC and we just ignore it.
Potential call targets and samples. If present, this line contains a
call instruction. This models both direct and indirect calls. Each
called target is listed together with the number of samples. For
example,
130: 7 foo:3 bar:2 baz:7
The above means that at relative line offset 130 there is a call
instruction that calls one of foo(), bar() and baz(). With baz()
being the relatively more frequent call target.
Differential Revision: http://llvm-reviews.chandlerc.com/D2355
4- Simplify format of profile input file.
This implements earlier suggestions to simplify the format of the
sample profile file. The symbol table is not necessary and function
profiles do not need to know the number of samples in advance.
Differential Revision: http://llvm-reviews.chandlerc.com/D2419
llvm-svn: 198973
This adds a propagation heuristic to convert instruction samples
into branch weights. It implements a similar heuristic to the one
implemented by Dehao Chen on GCC.
The propagation proceeds in 3 phases:
1- Assignment of block weights. All the basic blocks in the function
are initial assigned the same weight as their most frequently
executed instruction.
2- Creation of equivalence classes. Since samples may be missing from
blocks, we can fill in the gaps by setting the weights of all the
blocks in the same equivalence class to the same weight. To compute
the concept of equivalence, we use dominance and loop information.
Two blocks B1 and B2 are in the same equivalence class if B1
dominates B2, B2 post-dominates B1 and both are in the same loop.
3- Propagation of block weights into edges. This uses a simple
propagation heuristic. The following rules are applied to every
block B in the CFG:
- If B has a single predecessor/successor, then the weight
of that edge is the weight of the block.
- If all the edges are known except one, and the weight of the
block is already known, the weight of the unknown edge will
be the weight of the block minus the sum of all the known
edges. If the sum of all the known edges is larger than B's weight,
we set the unknown edge weight to zero.
- If there is a self-referential edge, and the weight of the block is
known, the weight for that edge is set to the weight of the block
minus the weight of the other incoming edges to that block (if
known).
Since this propagation is not guaranteed to finalize for every CFG, we
only allow it to proceed for a limited number of iterations (controlled
by -sample-profile-max-propagate-iterations). It currently uses the same
GCC default of 100.
Before propagation starts, the pass builds (for each block) a list of
unique predecessors and successors. This is necessary to handle
identical edges in multiway branches. Since we visit all blocks and all
edges of the CFG, it is cleaner to build these lists once at the start
of the pass.
Finally, the patch fixes the computation of relative line locations.
The profiler emits lines relative to the function header. To discover
it, we traverse the compilation unit looking for the subprogram
corresponding to the function. The line number of that subprogram is the
line where the function begins. That becomes line zero for all the
relative locations.
llvm-svn: 198972
The profile file parser needed some tests for its parsing actions.
This adds tests for each of the error messages emitted by the parser.
llvm-svn: 196106
This adds a new scalar pass that reads a file with samples generated
by 'perf' during runtime. The samples read from the profile are
incorporated and emmited as IR metadata reflecting that profile.
The profile file is assumed to have been generated by an external
profile source. The profile information is converted into IR metadata,
which is later used by the analysis routines to estimate block
frequencies, edge weights and other related data.
External profile information files have no fixed format, each profiler
is free to define its own. This includes both the on-disk representation
of the profile and the kind of profile information stored in the file.
A common kind of profile is based on sampling (e.g., perf), which
essentially counts how many times each line of the program has been
executed during the run.
The SampleProfileLoader pass is organized as a scalar transformation.
On startup, it reads the file given in -sample-profile-file to
determine what kind of profile it contains. This file is assumed to
contain profile information for the whole application. The profile
data in the file is read and incorporated into the internal state of
the corresponding profiler.
To facilitate testing, I've organized the profilers to support two file
formats: text and native. The native format is whatever on-disk
representation the profiler wants to support, I think this will mostly
be bitcode files, but it could be anything the profiler wants to
support. To do this, every profiler must implement the
SampleProfile::loadNative() function.
The text format is mostly meant for debugging. Records are separated by
newlines, but each profiler is free to interpret records as it sees fit.
Profilers must implement the SampleProfile::loadText() function.
Finally, the pass will call SampleProfile::emitAnnotations() for each
function in the current translation unit. This function needs to
translate the loaded profile into IR metadata, which the analyzer will
later be able to use.
This patch implements the first steps towards the above design. I've
implemented a sample-based flat profiler. The format of the profile is
fairly simplistic. Each sampled function contains a list of relative
line locations (from the start of the function) together with a count
representing how many samples were collected at that line during
execution. I generate this profile using perf and a separate converter
tool.
Currently, I have only implemented a text format for these profiles. I
am interested in initial feedback to the whole approach before I send
the other parts of the implementation for review.
This patch implements:
- The SampleProfileLoader pass.
- The base ExternalProfile class with the core interface.
- A SampleProfile sub-class using the above interface. The profiler
generates branch weight metadata on every branch instructions that
matches the profiles.
- A text loader class to assist the implementation of
SampleProfile::loadText().
- Basic unit tests for the pass.
Additionally, the patch uses profile information to compute branch
weights based on instruction samples.
This patch converts instruction samples into branch weights. It
does a fairly simplistic conversion:
Given a multi-way branch instruction, it calculates the weight of
each branch based on the maximum sample count gathered from each
target basic block.
Note that this assignment of branch weights is somewhat lossy and can be
misleading. If a basic block has more than one incoming branch, all the
incoming branches will get the same weight. In reality, it may be that
only one of them is the most heavily taken branch.
I will adjust this assignment in subsequent patches.
llvm-svn: 194566