Sorry for the commit spam. My clang-format crashed on me and the vim
plugin did not print an error, but instead just left the formatting
untouched.
llvm-svn: 208358
To compute the dimensions of the array in a unique way, we split the
delinearization analysis in three steps:
- find parametric terms in all memory access functions
- compute the array dimensions from the set of terms
- compute the delinearized access functions for each dimension
The first step is executed on all the memory access functions such that we
gather all the patterns in which an array is accessed. The second step reduces
all this information in a unique description of the sizes of the array. The
third step is delinearizing each memory access function following the common
description of the shape of the array computed in step 2.
This rewrite of the delinearization pass also solves a problem we had with the
previous implementation: because the previous algorithm was by induction on the
structure of the SCEV, it would not correctly recognize the shape of the array
when the memory access was not following the nesting of the loops: for example,
see polly/test/ScopInfo/multidim_only_ivs_3d_reverse.ll
; void foo(long n, long m, long o, double A[n][m][o]) {
;
; for (long i = 0; i < n; i++)
; for (long j = 0; j < m; j++)
; for (long k = 0; k < o; k++)
; A[i][k][j] = 1.0;
Starting with this patch we no longer delinearize access functions that do not
contain parameters, for example in test/Analysis/DependenceAnalysis/GCD.ll
;; for (long int i = 0; i < 100; i++)
;; for (long int j = 0; j < 100; j++) {
;; A[2*i - 4*j] = i;
;; *B++ = A[6*i + 8*j];
these accesses will not be delinearized as the upper bound of the loops are
constants, and their access functions do not contain SCEVUnknown parameters.
llvm-svn: 208232
operations on the call graph. This one forms a cycle, and while not as
complex as removing an internal edge from an SCC, it involves
a reasonable amount of work to find all of the nodes newly connected in
a cycle.
Also somewhat alarming is the worst case complexity here: it might have
to walk roughly the entire SCC inverse DAG to insert a single edge. This
is carefully documented in the API (I hope).
llvm-svn: 207935
This fix simply ensures that both metadata nodes are path-aware before
performing path-aware alias analysis.
This issue isn't normally triggered in LLVM, because we perform an autoupgrade
of the TBAA metadata to the new format when reading in LL or BC files. This
issue only appears when a client creates the IR manually and mixes old and new
TBAA metadata format.
This fixes <rdar://problem/16760860>.
llvm-svn: 207923
just connects an SCC to one of its descendants directly. Not much of an
impact. The last one is the hard one -- connecting an SCC to one of its
ancestors, and thereby forming a cycle such that we have to merge all
the SCCs participating in the cycle.
llvm-svn: 207751
of SCCs in the SCC DAG. Exercise them in the big graph test case. These
will be especially useful for establishing invariants in insertion
logic.
llvm-svn: 207749
edge entirely within an existing SCC. Shockingly, making the connected
component more connected is ... a total snooze fest. =]
Anyways, its wired up, and I even added a test case to make sure it
pretty much sorta works. =D
llvm-svn: 207631
bits), and discover that it's totally broken. Yay tests. Boo bug. Fix
the basic edge removal so that it works by nulling out the removed edges
rather than actually removing them. This leaves the indices valid in the
map from callee to index, and preserves some of the locality for
iterating over edges. The iterator is made bidirectional to reflect that
it now has to skip over null entries, and the skipping logic is layered
onto it.
As future work, I would like to track essentially the "load factor" of
the edge list, and when it falls below a threshold do a compaction.
An alternative I considered (and continue to consider) is storing the
callees in a doubly linked list where each element of the list is in
a set (which is essentially the classical linked-hash-table
datastructure). The problem with that approach is that either you need
to heap allocate the linked list nodes and use pointers to them, or use
a bucket hash table (with even *more* linked list pointer overhead!),
etc. It's pretty easy to get 5x overhead for values that are just
pointers. So far, I think punching holes in the vector, and periodic
compaction is likely to be much more efficient overall in the space/time
tradeoff.
llvm-svn: 207619
This reverts commit r207287, reapplying r207286.
I'm hoping that declaring an explicit struct and instantiating
`addBlockEdges()` directly works around the GCC crash from r207286.
This is a lot more boilerplate, though.
llvm-svn: 207438
contract (and be much more useful). It now provides exactly the
post-order traversal a caller might need to perform on newly formed
SCCs.
llvm-svn: 207410
by avoiding inlining massive switches merely because they have no
instructions in them. These switches still show up where we fail to form
lookup tables, and in those cases they are actually going to cause
a very significant code size hit anyways, so inlining them is not the
right call. The right way to fix any performance regressions stemming
from this is to enhance the switch-to-lookup-table logic to fire in more
places.
This makes PR19499 about 5x less bad. It uncovers a second compile time
problem in that test case that is unrelated (surprisingly!).
llvm-svn: 207403
API requirements much more obvious.
The key here is that there are two totally different use cases for
mutating the graph. Prior to doing any SCC formation, it is very easy to
mutate the graph. There may be users that want to do small tweaks here,
and then use the already-built graph for their SCC-based operations.
This method remains on the graph itself and is documented carefully as
being cheap but unavailable once SCCs are formed.
Once SCCs are formed, and there is some in-flight DFS building them, we
have to be much more careful in how we mutate the graph. These mutation
operations are sunk onto the SCCs themselves, which both simplifies
things (the code was already there!) and helps make it obvious that
these interfaces are only applicable within that context. The other
primary constraint is that the edge being mutated is actually related to
the SCC on which we call the method. This helps make it obvious that you
cannot arbitrarily mutate some other SCC.
I've tried to write much more complete documentation for the interesting
mutation API -- intra-SCC edge removal. Currently one aspect of this
documentation is a lie (the result list of SCCs) but we also don't even
have tests for that API. =[ I'm going to add tests and fix it to match
the documentation next.
llvm-svn: 207339
them, just skip over any DFS-numbered nodes when finding the next root
of a DFS. This allows the entry set to just be a vector as we populate
it from a uniqued source. It also removes the possibility for a linear
scan of the entry set to actually do the removal which can make things
go quadratic if we get unlucky.
llvm-svn: 207312
the DFS stack for leaves in the call graph. As mentioned in my previous
commit, this is particularly interesting for graphs which have high fan
out but low connectivity resulting in many leaves. For such graphs, this
can remove a large % of the DFS stack traffic even though it doesn't
make the stack much smaller.
It's a bit easier to formulate this for the full algorithm because that
one stops completely for each SCC. For example, I was able to directly
eliminate the "Recurse" boolean used to continue an outer loop from the
inner loop.
llvm-svn: 207311
makes working through the worklist much cleaner, and makes it possible
to avoid the 'bool-to-continue-the-outer-loop' hack. Not a huge
difference, but I think this is approaching as polished as I can make
it.
llvm-svn: 207310
processed in the DFS out of the stack completely. Keep it exclusively in
a variable. Re-shuffle some code structure to make this easier. This can
have a very dramatic effect in some cases because call graphs tend to
look like a high fan-out spanning tree. As a consequence, there are
a large number of leaf nodes in the graph, and this technique causes
leaf nodes to never even go into the stack. While this only reduces the
max depth by 1, it may cause the total number of round trips through the
stack to drop by a lot.
Now, most of this isn't really relevant for the incremental version. =]
But I wanted to prototype it first here as this variant is in ways more
complex. As long as I can get the code factored well here, I'll next
make the primary walk look the same. There are several refactorings this
exposes I think.
llvm-svn: 207306
graph in any way because we don't track edges in the SCC graph, just
nodes. This also lets us add a nice assert about the invariant that
we're working on at least a certain number of nodes within the SCC.
llvm-svn: 207305
a helper function. Also factor the other two places where we did the
same thing into the helper function. =] Much cleaner this way. NFC.
llvm-svn: 207300
This reverts commit r207286. It causes an ICE on the
cmake-llvm-x86_64-linux buildbot [1]:
llvm/lib/Analysis/BlockFrequencyInfo.cpp: In lambda function:
llvm/lib/Analysis/BlockFrequencyInfo.cpp:182:1: internal compiler error: in get_expr_operands, at tree-ssa-operands.c:1035
[1]: http://bb.pgr.jp/builders/cmake-llvm-x86_64-linux/builds/12093/steps/build_llvm/logs/stdio
llvm-svn: 207287
Previously, irreducible backedges were ignored. With this commit,
irreducible SCCs are discovered on the fly, and modelled as loops with
multiple headers.
This approximation specifies the headers of irreducible sub-SCCs as its
entry blocks and all nodes that are targets of a backedge within it
(excluding backedges within true sub-loops). Block frequency
calculations act as if we insert a new block that intercepts all the
edges to the headers. All backedges and entries to the irreducible SCC
point to this imaginary block. This imaginary block has an edge (with
even probability) to each header block.
The result is now reasonable enough that I've added a number of
testcases for irreducible control flow. I've outlined in
`BlockFrequencyInfoImpl.h` ways to improve the approximation.
<rdar://problem/14292693>
llvm-svn: 207286
It's fishy to be changing the `std::vector<>` owned by the iterator, and
no one actual does it, so I'm going to remove the ability in a
subsequent commit. First, update the users.
<rdar://problem/14292693>
llvm-svn: 207252
SCCMap to test for nodes that have been re-added to the root SCC rather
than a set vector. We already have done the SCCMap lookup, we juts need
to test it in two different ways. In turn, do most of the processing of
these nodes as they go into the root SCC rather than lazily. This
simplifies the final loop to just stitch the root SCC into its
children's parent sets. No functionlatiy changed.
However, this makes a few things painfully obvious, which was my intent.
=] There is tons of repeated code introduced here and elsewhere. I'm
splitting the refactoring of that code into helpers from this change so
its clear that this is the change which switches the datastructures used
around, and the other is a pure factoring & deduplication of code
change.
llvm-svn: 207217
remove the nodes in the SCC from the SCC map entirely prior to the DFS
walk. This allows the SCC map to represent both the state of
not-yet-re-added-to-an-SCC and added-back-to-this-SCC independently. The
first is being missing from the SCC map, the second is mapping back to
'this'. In a subsequent commit, I'm going to use this property to
simplify the new node list for this SCC.
In theory, I think this also makes the contract for orphaning a node
from the graph slightly less confusing. Now it is also orphaned from the
SCC graph. Still, this isn't quite right either, and so I'm not adding
test cases here. I'll add test cases for the behavior of orphaning nodes
when the code *actually* supports it. The change here is mostly
incidental, my goal is simplifying the algorithm.
llvm-svn: 207213
child from the worklist, wait until we actually need to pop another
element off of the worklist and skip over any that were already visited
by the DFS. This also enables swapping the nodes of the SCC into the
worklist. No functionality changed.
llvm-svn: 207212
thing, just mucking up the code. I feel bad that I even wrote this loop.
Very sorry. The diff is huge because of the indent change, but I promise
all this is doing is realizing that the outer two loops were actually
the exact same loops, and we didn't need two of them.
llvm-svn: 207202
factored into a more reasonable form, replace the tail call with
a simple outer-loop continuation. It's sad that C++ makes this so
awkward to write, but it seems more direct and clear than the tail call
at this point.
llvm-svn: 207201
Remove the concepts of "forward" and "general" mass distributions, which
was wrong. The split might have made sense in an early version of the
algorithm, but it's definitely wrong now.
<rdar://problem/14292693>
llvm-svn: 207195
Rather than scaling loop headers and then scaling all the loop members
by the header frequency, scale `LoopData::Scale` itself, and scale the
loop members by it. It's much more obvious what's going on this way,
and doesn't cost any extra multiplies.
<rdar://problem/14292693>
llvm-svn: 207189
Make `getPackagedNode()` a member function of
`BlockFrequencyInfoImplBase` so that it's available for templated code.
<rdar://problem/14292693>
llvm-svn: 207183
As pointed out by David Blaikie in code review, a `std::list<T>` is
simpler than a `std::vector<std::unique_ptr<T>>`. Another option is a
`std::deque<T>` (which allocates in chunks), but I'd like to leave open
the option of inserting in the middle of the sequence for handling
irreducible control flow on the fly.
<rdar://problem/14292693>
llvm-svn: 207177
algorithm here: http://dl.acm.org/citation.cfm?id=177301.
The idea of isolating the roots has even more relevance when using the
stack not just to implement the DFS but also to implement the recursive
step. Because we use it for the recursive step, to isolate the roots we
need to maintain two stacks: one for our recursive DFS walk, and another
of the nodes that have been walked. The nice thing is that the latter
will be half the size. It also fixes a complete hack where we scanned
backwards over the stack to find the next potential-root to continue
processing. Now that is always the top of the DFS stack.
While this is a really nice improvement already (IMO) it further opens
the door for two important simplifications:
1) De-duplicating some of the code across the two different walks. I've
actually made the duplication a bit worse in some senses with this
patch because the two are starting to converge.
2) Dramatically simplifying the loop structures of both walks.
I wanted to do those separately as they'll be essentially *just* CFG
restructuring. This patch on the other hand actually uses different
datastructures to implement the algorithm itself.
llvm-svn: 207098
applied prior to pushing a node onto the DFSStack. This is the first
step toward avoiding the stack entirely for leaf nodes. It also
simplifies things a bit and I think is pointing the way toward factoring
some more of the shared logic out of the two implementations.
It is also making it more obvious how to restructure the loops
themselves to be a bit easier to read (although no different in terms of
functionality).
llvm-svn: 207095
a SmallPtrSet. Currently, there is no need for stable iteration in this
dimension, and I now thing there won't need to be going forward.
If this is ever re-introduced in any form, it needs to not be
a SetVector based solution because removal cannot be linear. There will
be many SCCs with large numbers of parents. When encountering these, the
incremental SCC update for intra-SCC edge removal was quadratic due to
linear removal (kind of).
I'm really hoping we can avoid having an ordering property here at all
though...
llvm-svn: 207091
than functions. So far, this access pattern is *much* more common. It
seems likely that any user of this interface is going to have nodes at
the point that they are querying the SCCs.
No functionality changed.
llvm-svn: 207045
values rather than an expensive dense map query to test whether children
have already been popped into an SCC. This matches the incremental SCC
building code. I've also included the assert that I put there but
updated both of their text.
No functionality changed here.
I still don't have any great ideas for sharing the code between the two
implementations, but I may try a brute-force approach to factoring it at
some point.
llvm-svn: 207042
This implements the core functionality necessary to remove an edge from
the call graph and correctly update both the basic graph and the SCC
structure. As part of that it has to run a tiny (in number of nodes)
Tarjan-style DFS walk of an SCC being mutated to compute newly formed
SCCs, etc.
This is *very rough* and a WIP. I have a bunch of FIXMEs for code
cleanup that will reduce the boilerplate in this change substantially.
I also have a bunch of simplifications to various parts of both
algorithms that I want to make, but first I'd like to have a more
holistic picture. Ideally, I'd also like more testing. I'll probably add
quite a few more unit tests as I go here to cover the various different
aspects and corner cases of removing edges from the graph.
Still, this is, so far, successfully updating the SCC graph in-place
without disrupting the identity established for the existing SCCs even
when we do challenging things like delete the critical edge that made an
SCC cycle at all and have to reform things as a tree of smaller SCCs.
Getting this to work is really critical for the new pass manager as it
is going to associate significant state with the SCC instance and needs
it to be stable. That is also the motivation behind the return of the
newly formed SCCs. Eventually, I'll wire this all the way up to the
public API so that the pass manager can use it to correctly re-enqueue
newly formed SCCs into a fresh postorder traversal.
llvm-svn: 206968
up the stack finishing the exploration of each entries children before
we're finished in addition to accounting for their low-links. Added
a unittest that really hammers home the need for this with interlocking
cycles that would each appear distinct otherwise and crash or compute
the wrong result. As part of this, nuke a stale fixme and bring the rest
of the implementation still more closely in line with the original
algorithm.
llvm-svn: 206966
resisted this for too long. Just with the basic testing here I was able
to exercise the analysis in more detail and sift out both type signature
bugs in the API and a bug in the DFS numbering. All of these are fixed
here as well.
The unittests will be much more important for the mutation support where
it is necessary to craft minimal mutations and then inspect the state of
the graph. There is just no way to do that with a standard FileCheck
test. However, unittesting these kinds of analyses is really quite easy,
especially as they're designed with the new pass manager where there is
essentially no infrastructure required to rig up the core logic and
exercise it at an API level.
As a minor aside about the DFS numbering bug, the DFS numbering used in
LCG is a bit unusual. Rather than numbering from 0, we number from 1,
and use 0 as the sentinel "unvisited" state. Other implementations often
use '-1' for this, but I find it easier to deal with 0 and it shouldn't
make any real difference provided someone doesn't write silly bugs like
forgetting to actually initialize the DFS numbering. Oops. ;]
llvm-svn: 206954
the Callee list. This is going to be quite important to prevent removal
from going quadratic. No functionality changed at this point, this is
one of the refactoring patches I've broken out of my initial work toward
mutation updates of the call graph.
llvm-svn: 206938
The branch that skips irreducible backedges was only active when
propagating mass at the top-level. In particular, when propagating mass
through a loop recognized by `LoopInfo` with irreducible control flow
inside, irreducible backedges would not be skipped.
Not sure where that idea came from, but the result was that mass was
lost until after loop exit. Added a testcase that covers this case.
llvm-svn: 206860
Store pointers directly to loops inside the nodes. This could have been
done without changing the type stored in `std::vector<>`. However,
rather than computing the number of loops before constructing them
(which `LoopInfo` doesn't provide directly), I've switched to a
`vector<unique_ptr<LoopData>>`.
This adds some heap overhead, but the number of loops is typically
small.
llvm-svn: 206857
This was implicitly with copy assignment before, which fails to actually
clear `std::vector<>`'s heap storage. Move assignment would work, but
since MSVC can't imply those anyway, explicitly `clear()`-ing members
makes more sense.
llvm-svn: 206856
definition below all the header #include lines, lib/Analysis/...
edition.
This one has a bit extra as there were *other* #define's before #include
lines in addition to DEBUG_TYPE. I've sunk all of them as a block.
llvm-svn: 206843
define below all header includes in the lib/CodeGen/... tree. While the
current modules implementation doesn't check for this kind of ODR
violation yet, it is likely to grow support for it in the future. It
also removes one layer of macro pollution across all the included
headers.
Other sub-trees will follow.
llvm-svn: 206837
behavior based on other files defining DEBUG_TYPE, which means it cannot
define DEBUG_TYPE at all. This is actually better IMO as it forces folks
to define relevant DEBUG_TYPEs for their files. However, it requires all
files that currently use DEBUG(...) to define a DEBUG_TYPE if they don't
already. I've updated all such files in LLVM and will do the same for
other upstream projects.
This still leaves one important change in how LLVM uses the DEBUG_TYPE
macro going forward: we need to only define the macro *after* header
files have been #include-ed. Previously, this wasn't possible because
Debug.h required the macro to be pre-defined. This commit removes that.
By defining DEBUG_TYPE after the includes two things are fixed:
- Header files that need to provide a DEBUG_TYPE for some inline code
can do so by defining the macro before their inline code and undef-ing
it afterward so the macro does not escape.
- We no longer have rampant ODR violations due to including headers with
different DEBUG_TYPE definitions. This may be mostly an academic
violation today, but with modules these types of violations are easy
to check for and potentially very relevant.
Where necessary to suppor headers with DEBUG_TYPE, I have moved the
definitions below the includes in this commit. I plan to move the rest
of the DEBUG_TYPE macros in LLVM in subsequent commits; this one is big
enough.
The comments in Debug.h, which were hilariously out of date already,
have been updated to reflect the recommended practice going forward.
llvm-svn: 206822
This reverts commit r206707, reapplying r206704. The preceding commit
to CalcSpillWeights should have sorted out the failing buildbots.
<rdar://problem/14292693>
llvm-svn: 206766
LazyCallGraph analysis framework. Wire it up all the way through the opt
driver and add some very basic testing that we can build pass pipelines
including these components. Still a lot more to do in terms of testing
that all of this works, but the basic pieces are here.
There is a *lot* of boiler plate here. It's something I'm going to
actively look at reducing, but I don't have any immediate ideas that
don't end up making the code terribly complex in order to fold away the
boilerplate. Until I figure out something to minimize the boilerplate,
almost all of this is based on the code for the existing pass managers,
copied and heavily adjusted to suit the needs of the CGSCC pass
management layer.
The actual CG management still has a bunch of FIXMEs in it. Notably, we
don't do *any* updating of the CG as it is potentially invalidated.
I wanted to get this in place to motivate the new analysis, and add
update APIs to the analysis and the pass management layers in concert to
make sure that the *right* APIs are present.
llvm-svn: 206745
This reverts commit r206677, reapplying my BlockFrequencyInfo rewrite.
I've done a careful audit, added some asserts, and fixed a couple of
bugs (unfortunately, they were in unlikely code paths). There's a small
chance that this will appease the failing bots [1][2]. (If so, great!)
If not, I have a follow-up commit ready that will temporarily add
-debug-only=block-freq to the two failing tests, allowing me to compare
the code path between what the failing bots and what my machines (and
the rest of the bots) are doing. Once I've triggered those builds, I'll
revert both commits so the bots go green again.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
<rdar://problem/14292693>
llvm-svn: 206704
This reverts commit r206666, as planned.
Still stumped on why the bots are failing. Sanitizer bots haven't
turned anything up. If anyone can help me debug either of the failures
(referenced in r206666) I'll owe them a beer. (In the meantime, I'll be
auditing my patch for undefined behaviour.)
llvm-svn: 206677
This reverts commit r206628, reapplying r206622 (and r206626).
Two tests are failing only on buildbots [1][2]: i.e., I can't reproduce
on Darwin, and Chandler can't reproduce on Linux. Asan and valgrind
don't tell us anything, but we're hoping the msan bot will catch it.
So, I'm applying this again to get more feedback from the bots. I'll
leave it in long enough to trigger builds in at least the sanitizer
buildbots (it was failing for reasons unrelated to my commit last time
it was in), and hopefully a few others.... and then I expect to revert a
third time.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
llvm-svn: 206666
This reverts commit r206622 and the MSVC fixup in r206626.
Apparently the remotely failing tests are still failing, despite my
attempt to fix the nondeterminism in r206621.
llvm-svn: 206628
This reverts commit r206556, effectively reapplying commit r206548 and
its fixups in r206549 and r206550.
In an intervening commit I've added target triples to the tests that
were failing remotely [1] (but passing locally). I'm hoping the mystery
is solved? I'll revert this again if the tests are still failing
remotely.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
llvm-svn: 206622
Reality is that we're never going to copy one of these. Supporting this
was becoming a nightmare because nothing even causes it to compile most
of the time. Lots of subtle errors built up that wouldn't have been
caught by any "normal" testing.
Also, make the move assignment actually work rather than the bogus swap
implementation that would just infloop if used. As part of that, factor
out the graph pointer updates into a helper to share between move
construction and move assignment.
llvm-svn: 206583
LazyCallGraph. This is the start of the whole point of this different
abstraction, but it is just the initial bits. Here is a run-down of
what's going on here. I'm planning to incorporate some (or all) of this
into comments going forward, hopefully with better editing and wording.
=]
The crux of the problem with the traditional way of building SCCs is
that they are ephemeral. The new pass manager however really needs the
ability to associate analysis passes and results of analysis passes with
SCCs in order to expose these analysis passes to the SCC passes. Making
this work is kind-of the whole point of the new pass manager. =]
So, when we're building SCCs for the call graph, we actually want to
build persistent nodes that stick around and can be reasoned about
later. We'd also like the ability to walk the SCC graph in more complex
ways than just the traditional postorder traversal of the current CGSCC
walk. That means that in addition to being persistent, the SCCs need to
be connected into a useful graph structure.
However, we still want the SCCs to be formed lazily where possible.
These constraints are quite hard to satisfy with the SCC iterator. Also,
using that would bypass our ability to actually add data to the nodes of
the call graph to facilite implementing the Tarjan walk. So I've
re-implemented things in a more direct and embedded way. This
immediately makes it easy to get the persistence and connectivity
correct, and it also allows leveraging the existing nodes to simplify
the algorithm. I've worked somewhat to make this implementation more
closely follow the traditional paper's nomenclature and strategy,
although it is still a bit obtuse because it isn't recursive, using
an explicit stack and a tail call instead, and it is interruptable,
resuming each time we need another SCC.
The other tricky bit here, and what actually took almost all the time
and trials and errors I spent building this, is exactly *what* graph
structure to build for the SCCs. The naive thing to build is the call
graph in its newly acyclic form. I wrote about 4 versions of this which
did precisely this. Inevitably, when I experimented with them across
various use cases, they became incredibly awkward. It was all
implementable, but it felt like a complete wrong fit. Square peg, round
hole. There were two overriding aspects that pushed me in a different
direction:
1) We want to discover the SCC graph in a postorder fashion. That means
the root node will be the *last* node we find. Using the call-SCC DAG
as the graph structure of the SCCs results in an orphaned graph until
we discover a root.
2) We will eventually want to walk the SCC graph in parallel, exploring
distinct sub-graphs independently, and synchronizing at merge points.
This again is not helped by the call-SCC DAG structure.
The structure which, quite surprisingly, ended up being completely
natural to use is the *inverse* of the call-SCC DAG. We add the leaf
SCCs to the graph as "roots", and have edges to the caller SCCs. Once
I switched to building this structure, everything just fell into place
elegantly.
Aside from general cleanups (there are FIXMEs and too few comments
overall) that are still needed, the other missing piece of this is
support for iterating across levels of the SCC graph. These will become
useful for implementing #2, but they aren't an immediate priority.
Once SCCs are in good shape, I'll be working on adding mutation support
for incremental updates and adding the pass manager that this analysis
enables.
llvm-svn: 206581
Rewrite the shared implementation of BlockFrequencyInfo and
MachineBlockFrequencyInfo entirely.
The old implementation had a fundamental flaw: precision losses from
nested loops (or very wide branches) compounded past loop exits (and
convergence points).
The @nested_loops testcase at the end of
test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating. This
function has three nested loops, with branch weights in the loop headers
of 1:4000 (exit:continue). The old analysis gives non-sensical results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
---- Block Freqs ----
entry = 1.0
for.cond1.preheader = 1.00103
for.cond4.preheader = 5.5222
for.body6 = 18095.19995
for.inc8 = 4.52264
for.inc11 = 0.00109
for.end13 = 0.0
The new analysis gives correct results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
block-frequency-info: nested_loops
- entry: float = 1.0, int = 8
- for.cond1.preheader: float = 4001.0, int = 32007
- for.cond4.preheader: float = 16008001.0, int = 128064007
- for.body6: float = 64048012001.0, int = 512384096007
- for.inc8: float = 16008001.0, int = 128064007
- for.inc11: float = 4001.0, int = 32007
- for.end13: float = 1.0, int = 8
Most importantly, the frequency leaving each loop matches the frequency
entering it.
The new algorithm leverages BlockMass and PositiveFloat to maintain
precision, separates "probability mass distribution" from "loop
scaling", and uses dithering to eliminate probability mass loss. I have
unit tests for these types out of tree, but it was decided in the review
to make the classes private to BlockFrequencyInfoImpl, and try to shrink
them (or remove them entirely) in follow-up commits.
The new algorithm should generally have a complexity advantage over the
old. The previous algorithm was quadratic in the worst case. The new
algorithm is still worst-case quadratic in the presence of irreducible
control flow, but it's linear without it.
The key difference between the old algorithm and the new is that control
flow within a loop is evaluated separately from control flow outside,
limiting propagation of precision problems and allowing loop scale to be
calculated independently of mass distribution. Loops are visited
bottom-up, their loop scales are calculated, and they are replaced by
pseudo-nodes. Mass is then distributed through the function, which is
now a DAG. Finally, loops are revisited top-down to multiply through
the loop scales and the masses distributed to pseudo nodes.
There are some remaining flaws.
- Irreducible control flow isn't modelled correctly. LoopInfo and
MachineLoopInfo ignore irreducible edges, so this algorithm will
fail to scale accordingly. There's a note in the class
documentation about how to get closer. See also the comments in
test/Analysis/BlockFrequencyInfo/irreducible.ll.
- Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
the 64-bit integer precision used downstream.
- The "bias" calculation proposed on llvmdev is *not* incorporated
here. This will be added in a follow-up commit, once comments from
this review have been handled.
llvm-svn: 206548