The language reference says that:
"If a symbol appears in the @llvm.used list, then the compiler,
assembler, and linker are required to treat the symbol as if there is
a reference to the symbol that it cannot see"
Since even the linker cannot see the reference, we must assume that
the reference can be using the symbol table. For example, a user can add
__attribute__((used)) to a debug helper function like dump and use it from
a debugger.
llvm-svn: 187103
GlobalOpt simplifies llvm.compiler.used by removing any members that are also
in the more strict llvm.used. Handle the special case where llvm.compiler.used
becomes empty.
llvm-svn: 186778
We were incorrectly using compiler_used instead of compiler.used. Unfortunately
the passes using the broken name had tests also using the broken name.
llvm-svn: 186705
Duncan pointed out a mistake in my fix in r186425 when only one of the allocas
being compared had the target-default alignment. This is essentially his
suggested solution. Thanks!
llvm-svn: 186510
For safety, the inliner cannot decrease the allignment on an alloca when
merging it with another.
I've included two variants of the test case for this: one with DataLayout
available, and one without. When DataLayout is not available, if only one of
the allocas uses the default alignment (getAlignment() == 0), then they cannot
be safely merged.
llvm-svn: 186425
This implies annotating it as nounwind and its arguments as nocapture. To be
conservative, we do not annotate the arguments with noalias since some platforms
do not have restrict on the declaration for gettimeofday.
llvm-svn: 185502
No functionality change.
It should suffice to check the type of a debug info metadata, instead of
calling Verify. For cases where we know the type of a DI metadata, use
assert.
Also update testing cases to make them conform to the format of DI classes.
llvm-svn: 185135
CGSCC pass manager. This should insulate the inlining decisions from the
vectorization decisions, however it may have both compile time and code
size problems so it is just an experimental option right now.
Adding this based on a discussion with Arnold and it seems at least
worth having this flag for us to both run some experiments to see if
this strategy is workable. It may solve some of the regressions seen
with the loop vectorizer.
llvm-svn: 184698
This commit completely removes what is left of the simplify-libcalls
pass. All of the functionality has now been migrated to the instcombine
and functionattrs passes. The following C API functions are now NOPs:
1. LLVMAddSimplifyLibCallsPass
2. LLVMPassManagerBuilderSetDisableSimplifyLibCalls
llvm-svn: 184459
This pass was assuming that if hasAddressTaken() returns false for a
function, the function's only uses are call sites. That's not true
because there can be references by BlockAddresses too.
Fix the pass to handle this case. Fix
BlockAddress::replaceUsesOfWithOnConstant() to allow a function's type
to be changed by RAUW'ing the function with a bitcast of the recreated
function.
Patch by Mark Seaborn.
llvm-svn: 183933
Instead of a custom implementation of replaceAllUsesWith, we just call
replaceAllUsesWith and recreate llvm.used and llvm.compiler-used.
This change is particularity interesting because it makes llvm see
through what clang is doing with static used functions in extern "C"
contexts. With this change, running clang -O2 in
extern "C" {
__attribute__((used)) static void foo() {}
}
produces
@llvm.used = appending global [1 x i8*] [i8* bitcast (void ()* @foo to
i8*)], section "llvm.metadata"
define internal void @foo() #0 {
entry:
ret void
}
llvm-svn: 183756
CXAAtExitFn was set outside a loop and before optimizations where functions
can be deleted. This patch will set CXAAtExitFn inside the loop and after
optimizations.
Seg fault when running LTO because of accesses to a deleted function.
rdar://problem/13838828
llvm-svn: 181838
We used to disable constant merging not only if a constant is llvm.used, but
also if an alias of a constant is llvm.used. This change fixes that.
llvm-svn: 181175
the things, and renames it to CBindingWrapping.h. I also moved
CBindingWrapping.h into Support/.
This new file just contains the macros for defining different wrap/unwrap
methods.
The calls to those macros, as well as any custom wrap/unwrap definitions
(like for array of Values for example), are put into corresponding C++
headers.
Doing this required some #include surgery, since some .cpp files relied
on the fact that including Wrap.h implicitly caused the inclusion of a
bunch of other things.
This also now means that the C++ headers will include their corresponding
C API headers; for example Value.h must include llvm-c/Core.h. I think
this is harmless, since the C API headers contain just external function
declarations and some C types, so I don't believe there should be any
nasty dependency issues here.
llvm-svn: 180881
The logic that actually compares the types considers pointers and integers the
same if they are of the same size. This created a strange mismatch between hash
and reality and made the test case for this fail on some platforms (yay,
test cases).
llvm-svn: 179905
Two return types are not equivalent if one is a pointer and the other is an
integral. This is because we cannot bitcast a pointer to an integral value.
PR15185
llvm-svn: 179569
This is basically the same fix in three different places. We use a set to avoid
walking the whole tree of a big ConstantExprs multiple times.
For example: (select cmp, (add big_expr 1), (add big_expr 2))
We don't want to visit big_expr twice here, it may consist of thousands of
nodes.
The testcase exercises this by creating an insanely large ConstantExprs out of
a loop. It's questionable if the optimizer should ever create those, but this
can be triggered with real C code. Fixes PR15714.
llvm-svn: 179458
The iterator could be invalidated when it's recursively deleting a whole bunch
of constant expressions in a constant initializer.
Note: This was only reproducible if `opt' was run on a `.bc' file. If `opt' was
run on a `.ll' file, it wouldn't crash. This is why the test first pushes the
`.ll' file through `llvm-as' before feeding it to `opt'.
PR15440
llvm-svn: 178531
The simplify-libcalls pass implemented a doInitialization hook to infer
function prototype attributes for well-known functions. Given that the
simplify-libcalls pass is going away *and* that the functionattrs pass
is already in place to deduce function attributes, I am moving this logic
to the functionattrs pass. This approach was discussed during patch
review:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20121126/157465.html.
llvm-svn: 177619
It's possible (e.g. after an LTO build) that an internal global may be used for
debugging purposes. If that's the case appending a '.b' to it makes it hard to
find that variable. Steal the name from the old GV before deleting it so that
they can find that variable again.
llvm-svn: 175104
says, but that's a defect (to be filed). "Cls::purevfn()" is still an odr use.
Also fixes a bug in the previous patch that caused us to not mark the function
referenced just because we didn't want to mark it odr used.
llvm-svn: 174240
Because BBVectorize may significantly shorten a loop body, unroll
again after vectorization. This is especially important when using
runtime or partial unrolling.
llvm-svn: 173730
In the future, AttributeWithIndex won't be used anymore. Besides, it exposes the
internals of the AttributeSet to outside users, which isn't goodness.
llvm-svn: 173601
In the future, AttributeWithIndex won't be used anymore. Besides, it exposes the
internals of the AttributeSet to outside users, which isn't goodness.
llvm-svn: 173600
The 'getSlot' function and its ilk allow introspection into the AttributeSet
class. However, that class should be opaque. Allow access through accessor
methods instead.
llvm-svn: 173522
SSPStrong applies a heuristic to insert stack protectors in these situations:
* A Protector is required for functions which contain an array, regardless of
type or length.
* A Protector is required for functions which contain a structure/union which
contains an array, regardless of type or length. Note, there is no limit to
the depth of nesting.
* A protector is required when the address of a local variable (i.e., stack
based variable) is exposed. (E.g., such as through a local whose address is
taken as part of the RHS of an assignment or a local whose address is taken as
part of a function argument.)
This patch implements the SSPString attribute to be equivalent to
SSPRequired. This will change in a subsequent patch.
llvm-svn: 173230
Collections of attributes are handled via the AttributeSet class now. This
finally frees us up to make significant changes to how attributes are structured.
llvm-svn: 173228
Use the AttributeSet when we're talking about more than one attribute. Add a
function that adds a single attribute. No functionality change intended.
llvm-svn: 173196
This is more code to isolate the use of the Attribute class to that of just
holding one attribute instead of a collection of attributes.
llvm-svn: 173094
a dynamic analysis done on each call to the routine. However, now it can
use the standard pass infrastructure to reference other analyses,
instead of a silly setter method. This will become more interesting as
I teach it about more analysis passes.
This updates the two inliner passes to use the inline cost analysis.
Doing so highlights how utterly redundant these two passes are. Either
we should find a cheaper way to do always inlining, or we should merge
the two and just fiddle with the thresholds to get the desired behavior.
I'm leaning increasingly toward the latter as it would also remove the
Inliner sub-class split.
llvm-svn: 173030
Because the Attribute class is going to stop representing a collection of
attributes, limit the use of it as an aggregate in favor of using AttributeSet.
This replaces some of the uses for querying the function attributes.
llvm-svn: 172844
Specifically:
1. Added a missing new line when we emit a debug message saying that we are marking a global variable as constant.
2. Added debug messages that describe what is occuring when GlobalOpt is evaluating a block/function.
3. Added a debug message that says what specific constructor is being evaluated.
llvm-svn: 172247
into their new header subdirectory: include/llvm/IR. This matches the
directory structure of lib, and begins to correct a long standing point
of file layout clutter in LLVM.
There are still more header files to move here, but I wanted to handle
them in separate commits to make tracking what files make sense at each
layer easier.
The only really questionable files here are the target intrinsic
tablegen files. But that's a battle I'd rather not fight today.
I've updated both CMake and Makefile build systems (I think, and my
tests think, but I may have missed something).
I've also re-sorted the includes throughout the project. I'll be
committing updates to Clang, DragonEgg, and Polly momentarily.
llvm-svn: 171366
The later API is nicer than the former, and is correct regarding wrap-around offsets (if anyone cares).
There are a few more places left with duplicated code, which I'll remove soon.
llvm-svn: 171259
directly.
This is in preparation for removing the use of the 'Attribute' class as a
collection of attributes. That will shift to the AttributeSet class instead.
llvm-svn: 171253
Better controls the inlining of functions when the caller function has MinSize attribute.
Basically, when the caller function has this attribute, we do not "force" the inlining
of callee functions carrying the InlineHint attribute (i.e., functions defined with
inline keyword)
llvm-svn: 170065
Sooooo many of these had incorrect or strange main module includes.
I have manually inspected all of these, and fixed the main module
include to be the nearest plausible thing I could find. If you own or
care about any of these source files, I encourage you to take some time
and check that these edits were sensible. I can't have broken anything
(I strictly added headers, and reordered them, never removed), but they
may not be the headers you'd really like to identify as containing the
API being implemented.
Many forward declarations and missing includes were added to a header
files to allow them to parse cleanly when included first. The main
module rule does in fact have its merits. =]
llvm-svn: 169131
Also check in a case to repeat the issue, on which 'opt -globalopt' consumes 1.6GB memory.
The big memory footprint cause is that current GlobalOpt one by one hoists and stores the leaf element constant into the global array, in each iteration, it recreates the global array initializer constant and leave the old initializer alone. This may result in many obsolete constants left.
For example: we have global array @rom = global [16 x i32] zeroinitializer
After the first element value is hoisted and installed: @rom = global [16 x i32] [ 1, 0, 0, ... ]
After the second element value is installed: @rom = global [16 x 32] [ 1, 2, 0, 0, ... ] // here the previous initializer is obsolete
...
When the transform is done, we have 15 obsolete initializers left useless.
llvm-svn: 169079
When code deletes the context, the AttributeImpls that the AttrListPtr points to
are now invalid. Therefore, instead of keeping a separate managed static for the
AttrListPtrs that's reference counted, move it into the LLVMContext and delete
it when deleting the AttributeImpls.
llvm-svn: 168354
This patch moves the isInlineViable function from the InlineAlways pass into
the InlineCostAnalyzer and then changes the InlineCost computation to use that
simple check for always-inline functions. All the special-case checks for
AlwaysInline in the CallAnalyzer can then go away.
llvm-svn: 168300
For global variables that get the same value stored into them
everywhere, GlobalOpt will replace them with a constant. The problem is
that a thread-local GlobalVariable looks like one value (the address of
the TLS var), but is different between threads.
This patch introduces Constant::isThreadDependent() which returns true
for thread-local variables and constants which depend on them (e.g. a GEP
into a thread-local array), and teaches GlobalOpt not to track such
values.
llvm-svn: 168037
getIntPtrType support for multiple address spaces via a pointer type,
and also introduced a crasher bug in the constant folder reported in
PR14233.
These commits also contained several problems that should really be
addressed before they are re-committed. I have avoided reverting various
cleanups to the DataLayout APIs that are reasonable to have moving
forward in order to reduce the amount of churn, and minimize the number
of commits that were reverted. I've also manually updated merge
conflicts and manually arranged for the getIntPtrType function to stay
in DataLayout and to be defined in a plausible way after this revert.
Thanks to Duncan for working through this exact strategy with me, and
Nick Lewycky for tracking down the really annoying crasher this
triggered. (Test case to follow in its own commit.)
After discussing with Duncan extensively, and based on a note from
Micah, I'm going to continue to back out some more of the more
problematic patches in this series in order to ensure we go into the
LLVM 3.2 branch with a reasonable story here. I'll send a note to
llvmdev explaining what's going on and why.
Summary of reverted revisions:
r166634: Fix a compiler warning with an unused variable.
r166607: Add some cleanup to the DataLayout changes requested by
Chandler.
r166596: Revert "Back out r166591, not sure why this made it through
since I cancelled the command. Bleh, sorry about this!
r166591: Delete a directory that wasn't supposed to be checked in yet.
r166578: Add in support for getIntPtrType to get the pointer type based
on the address space.
llvm-svn: 167221
output of both
llvm-extract foo.ll -func=bar
and
llvm-extract foo.ll -func=bar -delete
so the two new files could not be linked together anymore. With this change
alias are handled almost like functions and global variables. Almost because
with alias we cannot just clear the initializer/body, we have to create a new
declaration and replace the alias with it.
The net result is that now the output of the above commands can be linked
even if foo.ll has aliases.
llvm-svn: 166907
over the implicitly-formed-and-nesting CGSCC pass manager and function
pass managers, especially when using them on the opt commandline or
using extension points in the module builder. The '-barrier' opt flag
(or the pass itself) will create a no-op module pass in the pipeline,
resetting the pass manager stack, and allowing the creation of a new
pipeline of function passes or CGSCC passes to be created that is
independent from any previous pipelines.
For example, this can be used to test running two CGSCC passes in
independent CGSCC pass managers as opposed to in the same CGSCC pass
manager. It also allows us to introduce a further hack into the
PassManagerBuilder to separate the O0 pipeline extension passes from the
always-inliner's CGSCC pass manager, which they likely do not want to
participate in... At the very least none of the Sanitizer passes want
this behavior.
This fixes a bug with ASan at O0 currently, and I'll commit the ASan
test which covers this pass. I'm happy to add a test case that this pass
exists and works, but not sure how much time folks would like me to
spend adding test cases for the details of its behavior of partition
pass managers.... The whole thing is just vile, and mostly intended to
unblock ASan, so I'm hoping to rip this all out in a brave new pass
manager world.
llvm-svn: 166172
Convert the internal representation of the Attributes class into a pointer to an
opaque object that's uniqued by and stored in the LLVMContext object. The
Attributes class then becomes a thin wrapper around this opaque
object. Eventually, the internal representation will be expanded to include
attributes that represent code generation options, etc.
llvm-svn: 165917
DeadArgumentElimination pass can replace one LLVM function with another,
invalidating a pointer stored in debug info metadata entry for this function.
To fix this, we collect debug info descriptors for functions before
running a DeadArgumentElimination pass and "patch" pointers in metadata nodes
if we replace a function.
llvm-svn: 165490
We use the enums to query whether an Attributes object has that attribute. The
opaque layer is responsible for knowing where that specific attribute is stored.
llvm-svn: 165488
FCAs. This is essential in order to promote allocas that are used in
struct returns by frontends like Clang. The FCA load would block the
rest of the pass from firing, resulting is significant regressions with
the bullet benchmark in the nightly test suite.
Thanks to Duncan for repeated discussions about how best to do this, and
to both him and Benjamin for review.
This appears to have blocked many places where the pass tries to fire,
and so I'm expect somewhat different results with this fix added.
As with the last big patch, I'm including a change to enable the SROA by
default *temporarily*. Ben is going to remove this as soon as the LNT
bots pick up the patch. I'm just trying to get a round of LNT numbers
from the stable machines in the lab.
NOTE: Four clang tests are expected to fail in the brief window where
this is enabled. Sorry for the noise!
llvm-svn: 164119
new one, and add support for running the new pass in that mode and in
that slot of the pass manager. With this the new pass can completely
replace the old one within the pipeline.
The strategy for enabling or disabling the SSAUpdater logic is to do it
by making the requirement of the domtree analysis optional. By default,
it is required and we get the standard mem2reg approach. This is usually
the desired strategy when run in stand-alone situations. Within the
CGSCC pass manager, we disable requiring of the domtree analysis and
consequentially trigger fallback to the SSAUpdater promotion.
In theory this would allow the pass to re-use a domtree if one happened
to be available even when run in a mode that doesn't require it. In
practice, it lets us have a single pass rather than two which was
simpler for me to wrap my head around.
There is a hidden flag to force the use of the SSAUpdater code path for
the purpose of testing. The primary testing strategy is just to run the
existing tests through that path. One notable difference is that it has
custom code to handle lifetime markers, and one of the tests has been
enhanced to exercise that code.
This has survived a bootstrap and the test suite without serious
correctness issues, however my run of the test suite produced *very*
alarming performance numbers. I don't entirely understand or trust them
though, so more investigation is on-going.
To aid my understanding of the performance impact of the new SROA now
that it runs throughout the optimization pipeline, I'm enabling it by
default in this commit, and will disable it again once the LNT bots have
picked up one iteration with it. I want to get those bots (which are
much more stable) to evaluate the impact of the change before I jump to
any conclusions.
NOTE: Several Clang tests will fail because they run -O3 and check the
result's order of output. They'll go back to passing once I disable it
again.
llvm-svn: 163965
This is essentially a ground up re-think of the SROA pass in LLVM. It
was initially inspired by a few problems with the existing pass:
- It is subject to the bane of my existence in optimizations: arbitrary
thresholds.
- It is overly conservative about which constructs can be split and
promoted.
- The vector value replacement aspect is separated from the splitting
logic, missing many opportunities where splitting and vector value
formation can work together.
- The splitting is entirely based around the underlying type of the
alloca, despite this type often having little to do with the reality
of how that memory is used. This is especially prevelant with unions
and base classes where we tail-pack derived members.
- When splitting fails (often due to the thresholds), the vector value
replacement (again because it is separate) can kick in for
preposterous cases where we simply should have split the value. This
results in forming i1024 and i2048 integer "bit vectors" that
tremendously slow down subsequnet IR optimizations (due to large
APInts) and impede the backend's lowering.
The new design takes an approach that fundamentally is not susceptible
to many of these problems. It is the result of a discusison between
myself and Duncan Sands over IRC about how to premptively avoid these
types of problems and how to do SROA in a more principled way. Since
then, it has evolved and grown, but this remains an important aspect: it
fixes real world problems with the SROA process today.
First, the transform of SROA actually has little to do with replacement.
It has more to do with splitting. The goal is to take an aggregate
alloca and form a composition of scalar allocas which can replace it and
will be most suitable to the eventual replacement by scalar SSA values.
The actual replacement is performed by mem2reg (and in the future
SSAUpdater).
The splitting is divided into four phases. The first phase is an
analysis of the uses of the alloca. This phase recursively walks uses,
building up a dense datastructure representing the ranges of the
alloca's memory actually used and checking for uses which inhibit any
aspects of the transform such as the escape of a pointer.
Once we have a mapping of the ranges of the alloca used by individual
operations, we compute a partitioning of the used ranges. Some uses are
inherently splittable (such as memcpy and memset), while scalar uses are
not splittable. The goal is to build a partitioning that has the minimum
number of splits while placing each unsplittable use in its own
partition. Overlapping unsplittable uses belong to the same partition.
This is the target split of the aggregate alloca, and it maximizes the
number of scalar accesses which become accesses to their own alloca and
candidates for promotion.
Third, we re-walk the uses of the alloca and assign each specific memory
access to all the partitions touched so that we have dense use-lists for
each partition.
Finally, we build a new, smaller alloca for each partition and rewrite
each use of that partition to use the new alloca. During this phase the
pass will also work very hard to transform uses of an alloca into a form
suitable for promotion, including forming vector operations, speculating
loads throguh PHI nodes and selects, etc.
After splitting is complete, each newly refined alloca that is
a candidate for promotion to a scalar SSA value is run through mem2reg.
There are lots of reasonably detailed comments in the source code about
the design and algorithms, and I'm going to be trying to improve them in
subsequent commits to ensure this is well documented, as the new pass is
in many ways more complex than the old one.
Some of this is still a WIP, but the current state is reasonbly stable.
It has passed bootstrap, the nightly test suite, and Duncan has run it
successfully through the ACATS and DragonEgg test suites. That said, it
remains behind a default-off flag until the last few pieces are in
place, and full testing can be done.
Specific areas I'm looking at next:
- Improved comments and some code cleanup from reviews.
- SSAUpdater and enabling this pass inside the CGSCC pass manager.
- Some datastructure tuning and compile-time measurements.
- More aggressive FCA splitting and vector formation.
Many thanks to Duncan Sands for the thorough final review, as well as
Benjamin Kramer for lots of review during the process of writing this
pass, and Daniel Berlin for reviewing the data structures and algorithms
and general theory of the pass. Also, several other people on IRC, over
lunch tables, etc for lots of feedback and advice.
llvm-svn: 163883
This disables malloc-specific optimization when -fno-builtin (or -ffreestanding)
is specified. This has been a problem for a long time but became more severe
with the recent memory builtin improvements.
Since the memory builtin functions are used everywhere, this required passing
TLI in many places. This means that functions that now have an optional TLI
argument, like RecursivelyDeleteTriviallyDeadFunctions, won't remove dead
mallocs anymore if the TLI argument is missing. I've updated most passes to do
the right thing.
Fixes PR13694 and probably others.
llvm-svn: 162841
The "findUsedStructTypes" method is very expensive to run. It needs to be
optimized so that LTO can run faster. Splitting this method out of the Module
class will help this occur. For instance, it can keep a list of seen objects so
that it doesn't process them over and over again.
llvm-svn: 161228
might be deliberate "one time" leaks, so that leak checkers can find them.
This is a reapply of r160602 with the fix that this time I'm committing the
code I thought I was committing last time; the I->eraseFromParent() goes
*after* the break out of the loop.
llvm-svn: 160664
r160529 that was subsequently reverted. The fix was to not call
GV->eraseFromParent() right before the caller does the same. The existing
testcases already caught this bug if run under valgrind.
llvm-svn: 160602
This was always part of the VMCore library out of necessity -- it deals
entirely in the IR. The .cpp file in fact was already part of the VMCore
library. This is just a mechanical move.
I've tried to go through and re-apply the coding standard's preferred
header sort, but at 40-ish files, I may have gotten some wrong. Please
let me know if so.
I'll be committing the corresponding updates to Clang and Polly, and
Duncan has DragonEgg.
Thanks to Bill and Eric for giving the green light for this bit of cleanup.
llvm-svn: 159421
include/llvm/Analysis/DebugInfo.h to include/llvm/DebugInfo.h.
The reasoning is because the DebugInfo module is simply an interface to the
debug info MDNodes and has nothing to do with analysis.
llvm-svn: 159312
Original commit message:
If a constant or a function has linkonce_odr linkage and unnamed_addr, mark it
hidden. Being linkonce_odr guarantees that it is available in every dso that
needs it. Being a constant/function with unnamed_addr guarantees that the
copies don't have to be merged.
llvm-svn: 159272
hidden. Being linkonce_odr guarantees that it is available in every dso that
needs it. Being a constant/function with unnamed_addr guarantees that the
copies don't have to be merged.
llvm-svn: 159136
This allows the user/front-end to specify a model that is better
than what LLVM would choose by default. For example, a variable
might be declared as
@x = thread_local(initialexec) global i32 42
if it will not be used in a shared library that is dlopen'ed.
If the specified model isn't supported by the target, or if LLVM can
make a better choice, a different model may be used.
llvm-svn: 159077
and expose it as a utility class rather than as free function wrappers.
The simple free-function interface works well for the bugpoint-specific
pass's uses of code extraction, but in an upcoming patch for more
advanced code extraction, they simply don't expose a rich enough
interface. I need to expose various stages of the process of doing the
code extraction and query information to decide whether or not to
actually complete the extraction or give up.
Rather than build up a new predicate model and pass that into these
functions, just take the class that was actually implementing the
functions and lift it up into a proper interface that can be used to
perform code extraction. The interface is cleaned up and re-documented
to work better in a header. It also is now setup to accept the blocks to
be extracted in the constructor rather than in a method.
In passing this essentially reverts my previous commit here exposing
a block-level query for eligibility of extraction. That is no longer
necessary with the more rich interface as clients can query the
extraction object for eligibility directly. This will reduce the number
of walks of the input basic block sequence by quite a bit which is
useful if this enters the normal optimization pipeline.
llvm-svn: 156163
As has been suggested by Duncan and others, Early-CSE and GVN should
do similar redundancy elimination, but Early-CSE is much less expensive.
Most of my autovectorization benchmarks show a performance regresion, but
all of these are < 0.1%, and so I think that it is still worth using
the less expensive pass.
llvm-svn: 154673
obviously cannot know that this code is present, let alone used. So prevent the
internalize pass from internalizing those global values which code-gen may
insert.
llvm-svn: 154645
As a side note, I really dislike array_pod_sort... Do we really still
care about any STL implementations that get this so wrong? Does libc++?
llvm-svn: 153834
a single missing character. Somehow, this had gone untested. I've added
tests for returns-twice logic specifically with the always-inliner that
would have caught this, and fixed the bug.
Thanks to Matt for the careful review and spotting this!!! =D
llvm-svn: 153832
the very high overhead of the complex inline cost analysis when all it
wants to do is detect three patterns which must not be inlined. Comment
the code, clean it up, and leave some hints about possible performance
improvements if this ever shows up on a profile.
Moving this off of the (now more expensive) inline cost analysis is
particularly important because we have to run this inliner even at -O0.
llvm-svn: 153814
interfaces. These methods were used in the old inline cost system where
there was a persistent cache that had to be updated, invalidated, and
cleared. We're now doing more direct computations that don't require
this intricate dance. Even if we resume some level of caching, it would
almost certainly have a simpler and more narrow interface than this.
llvm-svn: 153813
on a per-callsite walk of the called function's instructions, in
breadth-first order over the potentially reachable set of basic blocks.
This is a major shift in how inline cost analysis works to improve the
accuracy and rationality of inlining decisions. A brief outline of the
algorithm this moves to:
- Build a simplification mapping based on the callsite arguments to the
function arguments.
- Push the entry block onto a worklist of potentially-live basic blocks.
- Pop the first block off of the *front* of the worklist (for
breadth-first ordering) and walk its instructions using a custom
InstVisitor.
- For each instruction's operands, re-map them based on the
simplification mappings available for the given callsite.
- Compute any simplification possible of the instruction after
re-mapping, and store that back int othe simplification mapping.
- Compute any bonuses, costs, or other impacts of the instruction on the
cost metric.
- When the terminator is reached, replace any conditional value in the
terminator with any simplifications from the mapping we have, and add
any successors which are not proven to be dead from these
simplifications to the worklist.
- Pop the next block off of the front of the worklist, and repeat.
- As soon as the cost of inlining exceeds the threshold for the
callsite, stop analyzing the function in order to bound cost.
The primary goal of this algorithm is to perfectly handle dead code
paths. We do not want any code in trivially dead code paths to impact
inlining decisions. The previous metric was *extremely* flawed here, and
would always subtract the average cost of two successors of
a conditional branch when it was proven to become an unconditional
branch at the callsite. There was no handling of wildly different costs
between the two successors, which would cause inlining when the path
actually taken was too large, and no inlining when the path actually
taken was trivially simple. There was also no handling of the code
*path*, only the immediate successors. These problems vanish completely
now. See the added regression tests for the shiny new features -- we
skip recursive function calls, SROA-killing instructions, and high cost
complex CFG structures when dead at the callsite being analyzed.
Switching to this algorithm required refactoring the inline cost
interface to accept the actual threshold rather than simply returning
a single cost. The resulting interface is pretty bad, and I'm planning
to do lots of interface cleanup after this patch.
Several other refactorings fell out of this, but I've tried to minimize
them for this patch. =/ There is still more cleanup that can be done
here. Please point out anything that you see in review.
I've worked really hard to try to mirror at least the spirit of all of
the previous heuristics in the new model. It's not clear that they are
all correct any more, but I wanted to minimize the change in this single
patch, it's already a bit ridiculous. One heuristic that is *not* yet
mirrored is to allow inlining of functions with a dynamic alloca *if*
the caller has a dynamic alloca. I will add this back, but I think the
most reasonable way requires changes to the inliner itself rather than
just the cost metric, and so I've deferred this for a subsequent patch.
The test case is XFAIL-ed until then.
As mentioned in the review mail, this seems to make Clang run about 1%
to 2% faster in -O0, but makes its binary size grow by just under 4%.
I've looked into the 4% growth, and it can be fixed, but requires
changes to other parts of the inliner.
llvm-svn: 153812
size bloat. Unfortunately, I expect this to disable the majority of the
benefit from r152737. I'm hopeful at least that it will fix PR12345. To
explain this requires... quite a bit of backstory I'm afraid.
TL;DR: The change in r152737 actually did The Wrong Thing for
linkonce-odr functions. This change makes it do the right thing. The
benefits we saw were simple luck, not any actual strategy. Benchmark
numbers after a mini-blog-post so that I've written down my thoughts on
why all of this works and doesn't work...
To understand what's going on here, you have to understand how the
"bottom-up" inliner actually works. There are two fundamental modes to
the inliner:
1) Standard fixed-cost bottom-up inlining. This is the mode we usually
think about. It walks from the bottom of the CFG up to the top,
looking at callsites, taking information about the callsite and the
called function and computing th expected cost of inlining into that
callsite. If the cost is under a fixed threshold, it inlines. It's
a touch more complicated than that due to all the bonuses, weights,
etc. Inlining the last callsite to an internal function gets higher
weighth, etc. But essentially, this is the mode of operation.
2) Deferred bottom-up inlining (a term I just made up). This is the
interesting mode for this patch an r152737. Initially, this works
just like mode #1, but once we have the cost of inlining into the
callsite, we don't just compare it with a fixed threshold. First, we
check something else. Let's give some names to the entities at this
point, or we'll end up hopelessly confused. We're considering
inlining a function 'A' into its callsite within a function 'B'. We
want to check whether 'B' has any callers, and whether it might be
inlined into those callers. If so, we also check whether inlining 'A'
into 'B' would block any of the opportunities for inlining 'B' into
its callers. We take the sum of the costs of inlining 'B' into its
callers where that inlining would be blocked by inlining 'A' into
'B', and if that cost is less than the cost of inlining 'A' into 'B',
then we skip inlining 'A' into 'B'.
Now, in order for #2 to make sense, we have to have some confidence that
we will actually have the opportunity to inline 'B' into its callers
when cheaper, *and* that we'll be able to revisit the decision and
inline 'A' into 'B' if that ever becomes the correct tradeoff. This
often isn't true for external functions -- we can see very few of their
callers, and we won't be able to re-consider inlining 'A' into 'B' if
'B' is external when we finally see more callers of 'B'. There are two
cases where we believe this to be true for C/C++ code: functions local
to a translation unit, and functions with an inline definition in every
translation unit which uses them. These are represented as internal
linkage and linkonce-odr (resp.) in LLVM. I enabled this logic for
linkonce-odr in r152737.
Unfortunately, when I did that, I also introduced a subtle bug. There
was an implicit assumption that the last caller of the function within
the TU was the last caller of the function in the program. We want to
bonus the last caller of the function in the program by a huge amount
for inlining because inlining that callsite has very little cost.
Unfortunately, the last caller in the TU of a linkonce-odr function is
*not* the last caller in the program, and so we don't want to apply this
bonus. If we do, we can apply it to one callsite *per-TU*. Because of
the way deferred inlining works, when it sees this bonus applied to one
callsite in the TU for 'B', it decides that inlining 'B' is of the
*utmost* importance just so we can get that final bonus. It then
proceeds to essentially force deferred inlining regardless of the actual
cost tradeoff.
The result? PR12345: code bloat, code bloat, code bloat. Another result
is getting *damn* lucky on a few benchmarks, and the over-inlining
exposing critically important optimizations. I would very much like
a list of benchmarks that regress after this change goes in, with
bitcode before and after. This will help me greatly understand what
opportunities the current cost analysis is missing.
Initial benchmark numbers look very good. WebKit files that exhibited
the worst of PR12345 went from growing to shrinking compared to Clang
with r152737 reverted.
- Bootstrapped Clang is 3% smaller with this change.
- Bootstrapped Clang -O0 over a single-source-file of lib/Lex is 4%
faster with this change.
Please let me know about any other performance impact you see. Thanks to
Nico for reporting and urging me to actually fix, Richard Smith, Duncan
Sands, Manuel Klimek, and Benjamin Kramer for talking through the issues
today.
llvm-svn: 153506
to instead rely on much more generic and powerful instruction
simplification in the function cloner (and thus inliner).
This teaches the pruning function cloner to use instsimplify rather than
just the constant folder to fold values during cloning. This can
simplify a large number of things that constant folding alone cannot
begin to touch. For example, it will realize that 'or' and 'and'
instructions with certain constant operands actually become constants
regardless of what their other operand is. It also can thread back
through the caller to perform simplifications that are only possible by
looking up a few levels. In particular, GEPs and pointer testing tend to
fold much more heavily with this change.
This should (in some cases) have a positive impact on compile times with
optimizations on because the inliner itself will simply avoid cloning
a great deal of code. It already attempted to prune proven-dead code,
but now it will be use the stronger simplifications to prove more code
dead.
llvm-svn: 153403
It was added in 2007 as the first cut at supporting no-inline
attributes, but we didn't have function attributes of any form at the
time. However, it was added without any mention in the LangRef or other
documentation.
Later on, in 2008, Devang added function notes for 'inline=never' and
then turned them into proper function attributes. From that point
onward, as far as I can tell, the world moved on, and no one has touched
'llvm.noinline' in any meaningful way since.
It's time has now come. We have had better mechanisms for doing this for
a long time, all the frontends I'm aware of use them, and this is just
holding back progress. Given that it was never a documented feature of
the IR, I've provided no auto-upgrade support. If people know of real,
in-the-wild bitcode that relies on this, yell at me and I'll add it, but
I *seriously* doubt anyone cares.
llvm-svn: 152904
directly query the function information which this set was representing.
This simplifies the interface of the inline cost analysis, and makes the
always-inline pass significantly more efficient.
Previously, always-inline would first make a single set of every
function in the module *except* those marked with the always-inline
attribute. It would then query this set at every call site to see if the
function was a member of the set, and if so, refuse to inline it. This
is quite wasteful. Instead, simply check the function attribute directly
when looking at the callsite.
The normal inliner also had similar redundancy. It added every function
in the module with the noinline attribute to its set to ignore, even
though inside the cost analysis function we *already tested* the
noinline attribute and produced the same result.
The only tricky part of removing this is that we have to be able to
correctly remove only the functions inlined by the always-inline pass
when finalizing, which requires a bit of a hack. Still, much less of
a hack than the set of all non-always-inline functions was. While I was
touching this function, I switched a heavy-weight set to a vector with
sort+unique. The algorithm already had a two-phase insert and removal
pattern, we were just needlessly paying the uniquing cost on every
insert.
This probably speeds up some compiles by a small amount (-O0 compiles
with lots of always-inline, so potentially heavy libc++ users), but I've
not tried to measure it.
I believe there is no functional change here, but yell if you spot one.
None are intended.
Finally, the direction this is going in is to greatly simplify the
inline cost query interface so that we can replace its implementation
with a much more clever one. Along the way, all the APIs get simplified,
so it seems incrementally good.
llvm-svn: 152903
which are small enough to themselves be inlined. Delaying in this manner
can be harmful if the function is inelligible for inlining in some (or
many) contexts as it pessimizes the code of the function itself in the
event that inlining does not eventually happen.
Previously the check was written to only do this delaying of inlining
for static functions in the hope that they could be entirely deleted and
in the knowledge that all callers of static functions will have the
opportunity to inline if it is in fact profitable. However, with C++ we
get two other important sources of functions where the definition is
always available for inlining: inline functions and templated functions.
This patch generalizes the inliner to allow linkonce-ODR (the linkage
such C++ routines receive) to also qualify for this delay-based
inlining.
Benchmarking across a range of large real-world applications shows
roughly 2% size increase across the board, but an average speedup of
about 0.5%. Some benhcmarks improved over 2%, and the 'clang' binary
itself (when bootstrapped with this feature) shows a 1% -O0 performance
improvement when run over all Sema, Lex, and Parse source code smashed
into a single file. A clean re-build of Clang+LLVM with a bootstrapped
Clang shows approximately 2% improvement, but that measurement is often
noisy.
llvm-svn: 152737
candidate set for subsequent inlining, try to simplify the arguments to
the inner call site now that inlining has been performed.
The goal here is to propagate and fold constants through deeply nested
call chains. Without doing this, we loose the inliner bonus that should
be applied because the arguments don't match the exact pattern the cost
estimator uses.
Reviewed on IRC by Benjamin Kramer.
llvm-svn: 152556
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120130/136146.html
Implemented CaseIterator and it solves almost all described issues: we don't need to mix operand/case/successor indexing anymore. Base iterator class is implemented as a template since it may be initialized either from "const SwitchInst*" or from "SwitchInst*".
ConstCaseIt is just a read-only iterator.
CaseIt is read-write iterator; it allows to change case successor and case value.
Usage of iterator allows totally remove resolveXXXX methods. All indexing convertions done automatically inside the iterator's getters.
Main way of iterator usage looks like this:
SwitchInst *SI = ... // intialize it somehow
for (SwitchInst::CaseIt i = SI->caseBegin(), e = SI->caseEnd(); i != e; ++i) {
BasicBlock *BB = i.getCaseSuccessor();
ConstantInt *V = i.getCaseValue();
// Do something.
}
If you want to convert case number to TerminatorInst successor index, just use getSuccessorIndex iterator's method.
If you want initialize iterator from TerminatorInst successor index, use CaseIt::fromSuccessorIndex(...) method.
There are also related changes in llvm-clients: klee and clang.
llvm-svn: 152297
are optimization hints, but at -O0 we're not optimizing. This becomes a problem
when the alwaysinline attribute is abused.
rdar://10921594
llvm-svn: 151429
they'll be simple enough to simulate, and to reduce the chance we'll encounter
equal but different simple pointer constants.
This removes the symptoms from PR11352 but is not a full fix. A proper fix would
either require a guarantee that two constant objects we simulate are folded
when equal, or a different way of handling equal pointers (ie., trying a
constantexpr icmp on them to see whether we know they're equal or non-equal or
unsure).
llvm-svn: 151093
useful to represent a variable that is const in the source but can't be constant
in the IR because of a non-trivial constructor. If globalopt evaluates the
constructor, and there was an invariant.start with no matching invariant.end
possible, it will mark the global constant afterwards.
llvm-svn: 150794
GlobalOpt runs early in the pipeline (before inlining) and complex class
hierarchies often introduce bitcasts or GEPs which weren't optimized away.
Teach it to ignore side-effect free instructions instead of depending on
other passes to remove them.
llvm-svn: 150174
* Most of the transforms come through intact by having each transformed load or
store copy the ordering and synchronization scope of the original.
* The transform that turns a global only accessed in main() into an alloca
(since main is non-recursive) with a store of the initial value uses an
unordered store, since it's guaranteed to be the first thing to happen in main.
(Threads may have started before main (!) but they can't have the address of a
function local before the point in the entry block we insert our code.)
* The heap-SRoA transforms are disabled in the face of atomic operations. This
can probably be improved; it seems odd to have atomic accesses to an alloca
that doesn't have its address taken.
AnalyzeGlobal keeps track of the strongest ordering found in any use of the
global. This is more information than we need right now, but it's cheap to
compute and likely to be useful.
llvm-svn: 149847
The purpose of refactoring is to hide operand roles from SwitchInst user (programmer). If you want to play with operands directly, probably you will need lower level methods than SwitchInst ones (TerminatorInst or may be User). After this patch we can reorganize SwitchInst operands and successors as we want.
What was done:
1. Changed semantics of index inside the getCaseValue method:
getCaseValue(0) means "get first case", not a condition. Use getCondition() if you want to resolve the condition. I propose don't mix SwitchInst case indexing with low level indexing (TI successors indexing, User's operands indexing), since it may be dangerous.
2. By the same reason findCaseValue(ConstantInt*) returns actual number of case value. 0 means first case, not default. If there is no case with given value, ErrorIndex will returned.
3. Added getCaseSuccessor method. I propose to avoid usage of TerminatorInst::getSuccessor if you want to resolve case successor BB. Use getCaseSuccessor instead, since internal SwitchInst organization of operands/successors is hidden and may be changed in any moment.
4. Added resolveSuccessorIndex and resolveCaseIndex. The main purpose of these methods is to see how case successors are really mapped in TerminatorInst.
4.1 "resolveSuccessorIndex" was created if you need to level down from SwitchInst to TerminatorInst. It returns TerminatorInst's successor index for given case successor.
4.2 "resolveCaseIndex" converts low level successors index to case index that curresponds to the given successor.
Note: There are also related compatability fix patches for dragonegg, klee, llvm-gcc-4.0, llvm-gcc-4.2, safecode, clang.
llvm-svn: 149481
This is the initial checkin of the basic-block autovectorization pass along with some supporting vectorization infrastructure.
Special thanks to everyone who helped review this code over the last several months (especially Tobias Grosser).
llvm-svn: 149468
with other symbols.
An object in the __cfstring section is suppoed to be filled with CFString
objects, which have a pointer to ___CFConstantStringClassReference followed by a
pointer to a __cstring. If we allow the object in the __cstring section to be
merged with another global, then it could end up in any section. Because the
linker is going to remove these symbols in the final executable, we shouldn't
bother to merge them.
<rdar://problem/10564621>
llvm-svn: 147899