dead code, including dead return instructions in some cases. Otherwise,
we end up having a bogus poniter to a return instruction that blows up
much further down the road.
It turns out that this pattern is both simpler to code, easier to update
in the face of enhancements to the inliner cleanup, and likely cheaper
given that it won't add dead instructions to the list.
Thanks to John Regehr's numerous test cases for teasing this out.
llvm-svn: 154157
We had special instructions for iOS because r9 is call-clobbered, but
that is represented dynamically by the register mask operands now, so
there is no need for the pseudo-instructions.
llvm-svn: 154144
The load/store optimizer splits LDRD/STRD into two instructions when the
register pairing doesn't work out. For negative offsets in Thumb2, it uses
t2STRi8 to do that. That's fine, except for the case when the offset is in
the range [-4,-1]. In that case, we'll also form a second t2STRi8 with
the original offset plus 4, resulting in a t2STRi8 with a non-negative
offset, which ends up as if it were an STRT, which is completely bogus.
Similarly for loads.
No testcase, unfortunately, as any I've been able to construct is both large
and extremely fragile.
rdar://11193937
llvm-svn: 154141
The empty 1-argument operator delete is for the benefit of the
destructor. A couple of spot checks of running yaml-bench under
valgrind against a few of the files under test/YAMLParser did
not reveal any leaks introduced by this change.
llvm-svn: 154137
Consider the following program:
$ cat main.c
void foo(void) { }
int main(int argc, char *argv[]) {
foo();
return 0;
}
$ cat bundle.c
extern void foo(void);
void bar(void) {
foo();
}
$ clang -o main main.c
$ clang -o bundle.so bundle.c -bundle -bundle_loader ./main
$ nm -m bundle.so
0000000000000f40 (__TEXT,__text) external _bar
(undefined) external _foo (from executable)
(undefined) external dyld_stub_binder (from libSystem)
$ clang -o main main.c -O4
$ clang -o bundle.so bundle.c -bundle -bundle_loader ./main
Undefined symbols for architecture x86_64:
"_foo", referenced from:
_bar in bundle-elQN6d.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
The linker was told that the 'foo' in 'main' was 'internal' and had no uses, so
it was dead stripped.
Another situation is something like:
define void @foo() {
ret void
}
define void @bar() {
call asm volatile "call _foo" ...
ret void
}
The only use of 'foo' is inside of an inline ASM call. Since we don't look
inside those for uses of functions, we don't specify this as a "use."
Get around this by not invoking the 'internalize' pass by default. This is an
admitted hack for LTO correctness.
<rdar://problem/11185386>
llvm-svn: 154124
'add r2, #-1024' should just use 'sub r2, #1024' rather than erroring out.
Thumb1 aliases for adding a negative immediate to the stack pointer,
also.
rdar://11192734
llvm-svn: 154123
LSR always tries to make the ICmp in the loop latch use the incremented
induction variable. This allows the induction variable to be kept in a
single register.
When the induction variable limit is equal to the stride,
SimplifySetCC() would break LSR's hard work by transforming:
(icmp (add iv, stride), stride) --> (cmp iv, 0)
This forced us to use lea for the IC update, preventing the simpler
incl+cmp.
<rdar://problem/7643606>
<rdar://problem/11184260>
llvm-svn: 154119
of the BBVectorizePass without using command line option. As pointed out
by Hal, we can ask the TargetLoweringInfo for the architecture specific
VectorizeConfig to perform vectorizing with architecture specific
information.
llvm-svn: 154096
the caller requested a null-terminated one.
When mapping the file there could be a racing issue that resulted in the file being larger
than the FileSize passed by the caller. We already have an assertion
for this in MemoryBuffer::init() but have a runtime guarantee that
the buffer will be null-terminated, so do a copy that adds a null-terminator.
Protects against crash of rdar://11161822.
llvm-svn: 154082
LSR can fold three addressing modes into its ICmpZero node:
ICmpZero BaseReg + Offset => ICmp BaseReg, -Offset
ICmpZero -1*ScaleReg + Offset => ICmp ScaleReg, Offset
ICmpZero BaseReg + -1*ScaleReg => ICmp BaseReg, ScaleReg
The first two cases are only used if TLI->isLegalICmpImmediate() likes
the offset.
Make sure the right Offset sign is passed to this method in the second
case. The ARM version is not symmetric.
<rdar://problem/11184260>
llvm-svn: 154079
A MOVCCr instruction can be commuted by inverting the condition. This
can help reduce register pressure and remove unnecessary copies in some
cases.
<rdar://problem/11182914>
llvm-svn: 154033
This allows us to keep passing reduced masks to SimplifyDemandedBits, but
know about all the bits if SimplifyDemandedBits fails. This allows instcombine
to simplify cases like the one in the included testcase.
llvm-svn: 154011
svn r145378 inadvertently changed the destination for the Embedded target
in the makefile. Add a "/Developer" suffix to DSTROOT to compensate.
llvm-svn: 153980
So far all of configure tests have been run against the default SDK and
architecture, regardless of what is actually being built. We've gotten
lucky until now. <rdar://problem/11112479>
llvm-svn: 153972
When folding X == X we need to check getBooleanContents() to determine if the
result is a vector of ones or a vector of negative ones.
I tried creating a test case, but the problem seems to only be exposed on a
much older version of clang (around r144500).
rdar://10923049
llvm-svn: 153966
brace) so that we get more accurate line number information about the
declaration of a given function and the line where the function
first starts.
Part of rdar://11026482
llvm-svn: 153916
This patch allows llvm to recognize that a 64 bit object file is being produced
and that the subsequently generated ELF header has the correct information.
The test case checks for both big and little endian flavors.
Patch by Jack Carter.
llvm-svn: 153889
http://llvm.org/bugs/show_bug.cgi?id=12343
We have not trivial way for splitting edges that are goes from indirect branch. We can do it with some tricks, but it should be additionally discussed. And it is still dangerous due to difficulty of indirect branches controlling.
Fix forbids this case for unswitching.
llvm-svn: 153879
reflected in the LLVM IR (as a declare or something), then treat it like a data
object.
N.B. This isn't 100% correct. The ASM parser should supply more information so
that we know what type of object it is, and what attributes it should have.
llvm-svn: 153870
Do not try to optimize swizzles of shuffles if the source shuffle has more than
a single user, except when the source shuffle is also a swizzle.
llvm-svn: 153864
rather than a bitfield, a great suggestion by Chris during code review.
There is still quite a bit of cruft in the interface, but that requires
sorting out some awkward uses of the cost inside the actual inliner.
No functionality changed intended here.
llvm-svn: 153853
Post-RA scheduling gives a significant performance improvement on
the embedded cores, so turn it on. Using full anti-dep. breaking is
important for FP-intensive blocks, so turn it on (just on the
embedded cores for now; this should also be good on the 970s because
post-ra scheduling is all that we have for now, but that should have
more testing first).
llvm-svn: 153843
This adds a full itinerary for IBM's PPC64 A2 embedded core. These
cores form the basis for the CPUs in the new IBM BG/Q supercomputer.
llvm-svn: 153842
This also avoids emitting the information twice, which led to code bloat. On i386-linux-Release+Asserts
with all targets built this change shaves a whopping 1.3 MB off clang. The number is probably exaggerated
by recent inliner changes but the methods were already enormous with the old inline cost computation.
The DWARF reg -> LLVM reg mapping doesn't seem to have holes in it, so it could be a simple lookup table.
I didn't implement that optimization yet to avoid potentially changing functionality.
There is still some duplication both in tablegen and the generated code that should be cleaned up eventually.
llvm-svn: 153837
As a side note, I really dislike array_pod_sort... Do we really still
care about any STL implementations that get this so wrong? Does libc++?
llvm-svn: 153834
a single missing character. Somehow, this had gone untested. I've added
tests for returns-twice logic specifically with the always-inliner that
would have caught this, and fixed the bug.
Thanks to Matt for the careful review and spotting this!!! =D
llvm-svn: 153832
Loads and stores can have different pipeline behavior, especially on
embedded chips. This change allows those differences to be expressed.
Except for the 440 scheduler, there are no functionality changes.
On the 440, the latency adjustment is only by one cycle, and so this
probably does not affect much. Nevertheless, it will make a larger
difference in the future and this removes a FIXME from the 440 itin.
llvm-svn: 153821
This is the CodeGen equivalent of r153747. I tested that there is not noticeable
performance difference with any combination of -O0/-O2 /-g when compiling
gcc as a single compilation unit.
llvm-svn: 153817
Dynamic linking on PPC64 has had problems since we had to move the top-down
hazard-detection logic post-ra. For dynamic linking to work there needs to be
a nop placed after every call. It turns out that it is really hard to guarantee
that nothing will be placed in between the call (bl) and the nop during post-ra
scheduling. Previous attempts at fixing this by placing logic inside the
hazard detector only partially worked.
This is now fixed in a different way: call+nop codegen-only instructions. As far
as CodeGen is concerned the pair is now a single instruction and cannot be split.
This solution works much better than previous attempts.
The scoreboard hazard detector is also renamed to be more generic, there is currently
no cpu-specific logic in it.
llvm-svn: 153816
the very high overhead of the complex inline cost analysis when all it
wants to do is detect three patterns which must not be inlined. Comment
the code, clean it up, and leave some hints about possible performance
improvements if this ever shows up on a profile.
Moving this off of the (now more expensive) inline cost analysis is
particularly important because we have to run this inliner even at -O0.
llvm-svn: 153814
interfaces. These methods were used in the old inline cost system where
there was a persistent cache that had to be updated, invalidated, and
cleared. We're now doing more direct computations that don't require
this intricate dance. Even if we resume some level of caching, it would
almost certainly have a simpler and more narrow interface than this.
llvm-svn: 153813
on a per-callsite walk of the called function's instructions, in
breadth-first order over the potentially reachable set of basic blocks.
This is a major shift in how inline cost analysis works to improve the
accuracy and rationality of inlining decisions. A brief outline of the
algorithm this moves to:
- Build a simplification mapping based on the callsite arguments to the
function arguments.
- Push the entry block onto a worklist of potentially-live basic blocks.
- Pop the first block off of the *front* of the worklist (for
breadth-first ordering) and walk its instructions using a custom
InstVisitor.
- For each instruction's operands, re-map them based on the
simplification mappings available for the given callsite.
- Compute any simplification possible of the instruction after
re-mapping, and store that back int othe simplification mapping.
- Compute any bonuses, costs, or other impacts of the instruction on the
cost metric.
- When the terminator is reached, replace any conditional value in the
terminator with any simplifications from the mapping we have, and add
any successors which are not proven to be dead from these
simplifications to the worklist.
- Pop the next block off of the front of the worklist, and repeat.
- As soon as the cost of inlining exceeds the threshold for the
callsite, stop analyzing the function in order to bound cost.
The primary goal of this algorithm is to perfectly handle dead code
paths. We do not want any code in trivially dead code paths to impact
inlining decisions. The previous metric was *extremely* flawed here, and
would always subtract the average cost of two successors of
a conditional branch when it was proven to become an unconditional
branch at the callsite. There was no handling of wildly different costs
between the two successors, which would cause inlining when the path
actually taken was too large, and no inlining when the path actually
taken was trivially simple. There was also no handling of the code
*path*, only the immediate successors. These problems vanish completely
now. See the added regression tests for the shiny new features -- we
skip recursive function calls, SROA-killing instructions, and high cost
complex CFG structures when dead at the callsite being analyzed.
Switching to this algorithm required refactoring the inline cost
interface to accept the actual threshold rather than simply returning
a single cost. The resulting interface is pretty bad, and I'm planning
to do lots of interface cleanup after this patch.
Several other refactorings fell out of this, but I've tried to minimize
them for this patch. =/ There is still more cleanup that can be done
here. Please point out anything that you see in review.
I've worked really hard to try to mirror at least the spirit of all of
the previous heuristics in the new model. It's not clear that they are
all correct any more, but I wanted to minimize the change in this single
patch, it's already a bit ridiculous. One heuristic that is *not* yet
mirrored is to allow inlining of functions with a dynamic alloca *if*
the caller has a dynamic alloca. I will add this back, but I think the
most reasonable way requires changes to the inliner itself rather than
just the cost metric, and so I've deferred this for a subsequent patch.
The test case is XFAIL-ed until then.
As mentioned in the review mail, this seems to make Clang run about 1%
to 2% faster in -O0, but makes its binary size grow by just under 4%.
I've looked into the 4% growth, and it can be fixed, but requires
changes to other parts of the inliner.
llvm-svn: 153812
visitor will now visit a CallInst and an InvokeInst with
instruction-specific visitors, then visit a generic CallSite visitor,
then delegate back to the Instruction visitor and the TerminatorInst
visitors depending on whether a call or an invoke originally. This will
be used in the soon-to-land inline cost rewrite.
llvm-svn: 153811
The powi intrinsic requires special handling because it always takes a single
integer power regardless of the result type. As a result, we can vectorize
only if the powers are equal. Fixes PR12364.
llvm-svn: 153797
First small step toward modeling multi-register multi-pressure. In the
future, register units can also be used to model liveness and
aliasing.
llvm-svn: 153794
ARMConstantIslandPass still has bugs where jump table compression can
cause constant pool entries to go out of range.
Add a safety margin of 2 bytes when placing constant islands, but use
the real max displacement for verification.
<rdar://problem/11156595>
llvm-svn: 153789
Use an explicit comparator instead of the default.
The sets are sorted, but not using the default comparator. Hopefully,
this will unbreak the Linux builders.
llvm-svn: 153772
When an immediate is both a value [t2_]so_imm and a [t2_]so_imm_neg,
we want to use the non-negated form to make sure we prefer the normal
encoding, not the aliased encoding via the negation of, e.g., 'cmp.w'.
llvm-svn: 153770
TableGen emits lists of sub-registers, super-registers, and overlaps. Put
them all in a single table and use a SequenceToOffsetTable to share
suffixes.
llvm-svn: 153761
This is similar to the StringToOffsetTable we use to produce string
tables, but it can be used for other sequences than strings, and it
eliminates entries for suffixes.
llvm-svn: 153760
For 'adds r2, r2, #56' outside of an IT block, the 16-bit encoding T2
can be used for this syntax. Prefer the narrow encoding when possible.
rdar://11156277
llvm-svn: 153759
1. The main works will made in the RuntimeDyLdImpl with uses the ObjectFile class. RuntimeDyLdMachO and RuntimeDyLdELF now only parses relocations and resolve it. This is allows to make improvements of the RuntimeDyLd more easily. In addition the support for COFF can be easily added.
2. Added ARM relocations to RuntimeDyLdELF.
3. Added support for stub functions for the ARM, allowing to do a long branch.
4. Added support for external functions that are not loaded from the object files, but can be loaded from external libraries. Now MCJIT can correctly execute the code containing the printf, putc, and etc.
5. The sections emitted instead functions, thanks Jim Grosbach. MemoryManager.startFunctionBody() and MemoryManager.endFunctionBody() have been removed.
6. MCJITMemoryManager.allocateDataSection() and MCJITMemoryManager. allocateCodeSection() used JMM->allocateSpace() instead of JMM->allocateCodeSection() and JMM->allocateDataSection(), because I got an error: "Cannot allocate an allocated block!" with object file contains more than one code or data sections.
llvm-svn: 153754
here but it has no other uses, then we have a problem. E.g.,
int foo (const int *x) {
char a[*x];
return 0;
}
If we assign 'a' a vreg and fast isel later on has to use the selection
DAG isel, it will want to copy the value to the vreg. However, there are
no uses, which goes counter to what selection DAG isel expects.
<rdar://problem/11134152>
llvm-svn: 153705
This pass splits basic blocks to insert constant islands, and it
doesn't recompute the live-in lists. No later passes depend on accurate
liveness information.
This fixes PR12410 where the machine code verifier was complaining.
llvm-svn: 153700
We are sometimes allocatinog from the DPair register class which
contains odd-even pairs in addition to the Q registers.
Place the Q registers first in the DPair allocation order as they can be
copied with a single instruction. The odd-even pairs should only be
allocated as a last resort.
llvm-svn: 153699
ARM recently gained DPair, DTriple, and DQuad register classes.
Update copyPhysReg() to handle copies in these register classes.
No test case, it is difficult to make the register allocator emit the
odd copies reliably. The missing DPair copy caused a failure on
partialsums in the nightly test suite.
<rdar://problem/11147997>
llvm-svn: 153686
CodeGenPrepare sinks compare instructions down to their uses to prevent
live flags and predicate registers across basic blocks.
PRE of a compare instruction prevents that, forcing the i1 compare
result into a general purpose register. That is usually more expensive
than the redundant compare PRE was trying to eliminate in the first
place.
llvm-svn: 153657
Module-level ASM may contain definitions of functions and globals. However, we
were not telling the linker that these globals had definitions. As far as it was
concerned, they were just declarations.
Attempt to resolve this by inserting module-level ASM functions and globals into
the '_symbol' set so that the linker will know that they have values.
This gets us further towards our goal of compiling LLVM, but it still has
problems when linking libLTO.dylib because of the `-dead_strip' flag that's
passed to the linker.
<rdar://problem/11124216>
llvm-svn: 153638
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
llvm-svn: 153635
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
llvm-svn: 153617
Some targets still mess up the liveness information, but that isn't
verified after MRI->invalidateLiveness().
The verifier can still check other useful things like register classes
and CFG, so it should be enabled after all passes.
llvm-svn: 153615
The late scheduler depends on accurate liveness information if it is
breaking anti-dependencies, so we should be able to verify it.
Relax the terminator checking in the machine code verifier so it can
handle the basic blocks created by if conversion.
llvm-svn: 153614
When an strd instruction doesn't get the registers it wants, it can be
expanded into two str instructions. Make sure the first str doesn't kill
the base register in the case where the base and data registers are
identical:
t2STRi12 %R0<kill>, %R0, 4, pred:14, pred:%noreg
t2STRi12 %R2<kill>, %R0, 8, pred:14, pred:%noreg
<rdar://problem/11101911>
llvm-svn: 153611
When a number of sub-register VLRDS instructions are combined into a
VLDM, preserve any super-register implicit defs. This is required to
keep the register scavenger and machine code verifier happy.
Enable machine code verification after ARMLoadStoreOptimizer.
ARM/2012-01-26-CopyPropKills.ll was failing because of this.
llvm-svn: 153610
The arm_neon intrinsics can create virtual registers from the DPair
register class which allows both even-odd and odd-even D-register pairs.
This fixes PR12389.
llvm-svn: 153603
Extract the liveness verification into its own method.
This makes it possible to run the machine code verifier after liveness
information is no longer required to be valid.
llvm-svn: 153596
Revert r153519: "ARMLoadStoreOptimizer invalidates register liveness."
These patches caused miscompilations in povray by turning off branch
folding's updating of live-in lists.
It turns out the the late scheduler depends on the live-in lists, even
if it doesn't need correct kill flags.
<rdar://problem/11139228>
llvm-svn: 153593
Original commit message for r153521 (aka r153423):
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loding a boolean value.
llvm-svn: 153587
blocks in the function cloner. This removes the last case of trivially
dead code that I've been seeing in the wild getting inlined, analyzed,
re-inlined, optimized, only to be deleted. Nukes a FIXME from the
cleanup tests.
llvm-svn: 153572
them as machine instructions. Directives ".set noat" and ".set at" are now
emitted only at the beginning and end of a function except in the case where
they are emitted to enclose .cpload with an immediate operand that doesn't fit
in 16-bit field or unaligned load/stores.
Also, make the following changes:
- Remove function isUnalignedLoadStore and use a switch-case statement to
determine whether an instruction is an unaligned load or store.
- Define helper function CreateMCInst which generates an instance of an MCInst
from an opcode and a list of operands.
llvm-svn: 153552
executable has been moved to another machine). If that's not available
(read-only or something), then exit gracefully.
<rdar://problem/11111686>
llvm-svn: 153538
undefined behavior, which Rafael was kind enough to fix.
Original commit message for r153423:
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loding a boolean value.
llvm-svn: 153521
This pass tries to update kill flags, but there are still many bugs.
Passes after the load/store optimizer don't need accurate liveness, so
don't even try.
<rdar://problem/11101911>
llvm-svn: 153519
Branch folding can use a register scavenger to update liveness
information when required. Don't do that if liveness information is
already invalid.
llvm-svn: 153517
Late optimization passes like branch folding and tail duplication can
transform the machine code in a way that makes it expensive to keep the
register liveness information up to date. There is a fuzzy line between
register allocation and late scheduling where the liveness information
degrades.
The MRI::tracksLiveness() flag makes the line clear: While true,
liveness information is accurate, and can be used for register
scavenging. Once the flag is false, liveness information is not
accurate, and can only be used as a hint.
Late passes generally don't need the liveness information, but they will
sometimes use the register scavenger to help update it. The scavenger
enforces strict correctness, and we have to spend a lot of code to
update register liveness that may never be used.
llvm-svn: 153511
size bloat. Unfortunately, I expect this to disable the majority of the
benefit from r152737. I'm hopeful at least that it will fix PR12345. To
explain this requires... quite a bit of backstory I'm afraid.
TL;DR: The change in r152737 actually did The Wrong Thing for
linkonce-odr functions. This change makes it do the right thing. The
benefits we saw were simple luck, not any actual strategy. Benchmark
numbers after a mini-blog-post so that I've written down my thoughts on
why all of this works and doesn't work...
To understand what's going on here, you have to understand how the
"bottom-up" inliner actually works. There are two fundamental modes to
the inliner:
1) Standard fixed-cost bottom-up inlining. This is the mode we usually
think about. It walks from the bottom of the CFG up to the top,
looking at callsites, taking information about the callsite and the
called function and computing th expected cost of inlining into that
callsite. If the cost is under a fixed threshold, it inlines. It's
a touch more complicated than that due to all the bonuses, weights,
etc. Inlining the last callsite to an internal function gets higher
weighth, etc. But essentially, this is the mode of operation.
2) Deferred bottom-up inlining (a term I just made up). This is the
interesting mode for this patch an r152737. Initially, this works
just like mode #1, but once we have the cost of inlining into the
callsite, we don't just compare it with a fixed threshold. First, we
check something else. Let's give some names to the entities at this
point, or we'll end up hopelessly confused. We're considering
inlining a function 'A' into its callsite within a function 'B'. We
want to check whether 'B' has any callers, and whether it might be
inlined into those callers. If so, we also check whether inlining 'A'
into 'B' would block any of the opportunities for inlining 'B' into
its callers. We take the sum of the costs of inlining 'B' into its
callers where that inlining would be blocked by inlining 'A' into
'B', and if that cost is less than the cost of inlining 'A' into 'B',
then we skip inlining 'A' into 'B'.
Now, in order for #2 to make sense, we have to have some confidence that
we will actually have the opportunity to inline 'B' into its callers
when cheaper, *and* that we'll be able to revisit the decision and
inline 'A' into 'B' if that ever becomes the correct tradeoff. This
often isn't true for external functions -- we can see very few of their
callers, and we won't be able to re-consider inlining 'A' into 'B' if
'B' is external when we finally see more callers of 'B'. There are two
cases where we believe this to be true for C/C++ code: functions local
to a translation unit, and functions with an inline definition in every
translation unit which uses them. These are represented as internal
linkage and linkonce-odr (resp.) in LLVM. I enabled this logic for
linkonce-odr in r152737.
Unfortunately, when I did that, I also introduced a subtle bug. There
was an implicit assumption that the last caller of the function within
the TU was the last caller of the function in the program. We want to
bonus the last caller of the function in the program by a huge amount
for inlining because inlining that callsite has very little cost.
Unfortunately, the last caller in the TU of a linkonce-odr function is
*not* the last caller in the program, and so we don't want to apply this
bonus. If we do, we can apply it to one callsite *per-TU*. Because of
the way deferred inlining works, when it sees this bonus applied to one
callsite in the TU for 'B', it decides that inlining 'B' is of the
*utmost* importance just so we can get that final bonus. It then
proceeds to essentially force deferred inlining regardless of the actual
cost tradeoff.
The result? PR12345: code bloat, code bloat, code bloat. Another result
is getting *damn* lucky on a few benchmarks, and the over-inlining
exposing critically important optimizations. I would very much like
a list of benchmarks that regress after this change goes in, with
bitcode before and after. This will help me greatly understand what
opportunities the current cost analysis is missing.
Initial benchmark numbers look very good. WebKit files that exhibited
the worst of PR12345 went from growing to shrinking compared to Clang
with r152737 reverted.
- Bootstrapped Clang is 3% smaller with this change.
- Bootstrapped Clang -O0 over a single-source-file of lib/Lex is 4%
faster with this change.
Please let me know about any other performance impact you see. Thanks to
Nico for reporting and urging me to actually fix, Richard Smith, Duncan
Sands, Manuel Klimek, and Benjamin Kramer for talking through the issues
today.
llvm-svn: 153506
MachinePointerInfo when getStore is called to create a node that stores an
argument passed in register to the stack. Without this change, the post RA
scheduler will fail to discover the dependencies between the stores
instructions and the instructions that load from a structure passed by value.
The link to the related discussion is here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-March/048055.html
llvm-svn: 153499
copies being considered for removal. Make sure to track all of the copies,
rather than just the most recent encountered, by holding a DenseSet instead of
an unsigned in SrcMap.
No test case - couldn't reduce something with a sane size.
llvm-svn: 153487
produces a 32-bit immediate which is consumed by the use. It tries to
fold the immediate by breaking it into two parts and fold them into the
immmediate fields of two uses. e.g
movw r2, #40885
movt r3, #46540
add r0, r0, r3
=>
add.w r0, r0, #3019898880
add.w r0, r0, #30146560
;
However, this transformation is incorrect if the user produces a flag. e.g.
movw r2, #40885
movt r3, #46540
adds r0, r0, r3
=>
add.w r0, r0, #3019898880
adds.w r0, r0, #30146560
Note the adds.w may not set the carry flag even if the original sequence
would.
rdar://11116189
llvm-svn: 153484
relocations. The algorithm is the same as
that for x86_64. Scattered relocations, a
feature present in i386 but not on x86_64,
are not yet supported.
llvm-svn: 153466
Original commit message:
Use the new range metadata in computeMaskedBits and add a new optimization to
instruction simplify that lets us remove an and when loading a boolean value.
llvm-svn: 153452
constant-offsets of a common base using the generic GEP-walking logic
I added for computing pointer differences in the same situation.
llvm-svn: 153419
inbounds GEPs. This isn't really necessary for simplifying pointer
differences, but I'm planning to re-use the same code to simplify
pointer comparisons where it is necessary. Since real code almost
exclusively uses inbounds GEPs, it doesn't seem worth it to support the
extra complexity of turning it on and off. If anyone would like that
back, feel free to shout. Note that instcombine will still catch any of
these patterns.
llvm-svn: 153418
aggressively. There are lots of dire warnings about this being expensive
that seem to predate switching to the TrackingVH-based value remapper
that is automatically updated on RAUW. This makes it easy to not just
prune single-entry PHIs, but to fully simplify PHIs, and to recursively
simplify the newly inlined code to propagate PHINode simplifications.
This introduces a bit of a thorny problem though. We may end up
simplifying a branch condition to a constant when we fold PHINodes, and
we would like to nuke any dead blocks resulting from this so that time
isn't wasted continually analyzing them, but this isn't easy. Deleting
basic blocks *after* they are fully cloned and mapped into the new
function currently requires manually updating the value map. The last
piece of the simplification-during-inlining puzzle will require either
switching to WeakVH mappings or some other piece of refactoring. I've
left a FIXME in the testcase about this.
llvm-svn: 153410
* Removed test/lib/llvm.exp - it is no longer needed
* Deleted the dg.exp reading code from test/lit.cfg. There are no dg.exp files
left in the test suite so this code is no longer required. test/lit.cfg is
now much shorter and clearer
* Removed a lot of duplicate code in lit.local.cfg files that need access to
the root configuration, by adding a "root" attribute to the TestingConfig
object. This attribute is dynamically computed to provide the same
information as was previously provided by the custom getRoot functions.
* Documented the config.root attribute in docs/CommandGuide/lit.pod
llvm-svn: 153408
to instead rely on much more generic and powerful instruction
simplification in the function cloner (and thus inliner).
This teaches the pruning function cloner to use instsimplify rather than
just the constant folder to fold values during cloning. This can
simplify a large number of things that constant folding alone cannot
begin to touch. For example, it will realize that 'or' and 'and'
instructions with certain constant operands actually become constants
regardless of what their other operand is. It also can thread back
through the caller to perform simplifications that are only possible by
looking up a few levels. In particular, GEPs and pointer testing tend to
fold much more heavily with this change.
This should (in some cases) have a positive impact on compile times with
optimizations on because the inliner itself will simply avoid cloning
a great deal of code. It already attempted to prune proven-dead code,
but now it will be use the stronger simplifications to prove more code
dead.
llvm-svn: 153403
fire if anything ever invalidates the assumption of a terminator
instruction being unchanged throughout the routine.
I've convinced myself that the current definition of simplification
precludes such a transformation, so I think getting some asserts
coverage that we don't violate this agreement is sufficient to make this
code safe for the foreseeable future.
Comments to the contrary or other suggestions are of course welcome. =]
The bots are now happy with this code though, so it appears the bug here
has indeed been fixed.
llvm-svn: 153401
list. This is a bad idea. ;] I'm hopeful this is the bug that's showing
up with the MSVC bots, but we'll see.
It is definitely unnecessary. InstSimplify won't do anything to
a terminator instruction, we don't need to even include it in the
iteration range. We can also skip the now dead terminator check,
although I've made it an assert to help document that this is an
important invariant.
I'm still a bit queasy about this because there is an implicit
assumption that the terminator instruction cannot be RAUW'ed by the
simplification code. While that appears to be true at the moment, I see
no guarantee that would ensure it remains true in the future. I'm
looking at the cleanest way to solve that...
llvm-svn: 153399
spotted by inspection, and I've crafted no test case that triggers it on
my machine, but some of the windows builders are hitting what looks like
memory corruption, so *something* is amiss here.
This patch takes a more generalized approach to eliminating
double-visits. Imagine code such as:
%x = ...
%y = add %x, 1
%z = add %x, %y
You can imagine that if we simplify %x, we would add %y and %z to the
list. If the use-chain order happens to cause us to add them in reverse
order, we could pull %y off first, and simplify it, adding %z to the
list. We now have %z on the list twice, and will reference it after it
is deleted.
Currently, all my test cases happen to not trigger this, likely due to
the use-chain ordering, but there seems no guarantee that such
a situation could not occur, so we should handle it correctly.
Again, if anyone knows how to craft a testcase that actually triggers
this, please let me know.
llvm-svn: 153397
worklist. This can happen in theory when an instruction uses itself,
such as a PHI node. This was spotted by inspection, and unfortunately
I've not been able to come up with a test case that would trigger it. If
anyone has ideas, let me know...
llvm-svn: 153396
regressed seriously here, we are no longer removing allocas during
inline cleanup. This appears to be because of lifetime markers "using"
them. =/ I'll look into this shortly.
llvm-svn: 153394
bit simpler by handling a common case explicitly.
Also, refactor the implementation to use a worklist based walk of the
recursive users, rather than trying to use value handles to detect and
recover from RAUWs during the recursive descent. This fixes a very
subtle bug in the previous implementation where degenerate control flow
structures could cause mutually recursive instructions (PHI nodes) to
collapse in just such a way that From became equal to To after some
amount of recursion. At that point, we hit the inf-loop that the assert
at the top attempted to guard against. This problem is defined away when
not using value handles in this manner. There are lots of comments
claiming that the WeakVH will protect against just this sort of error,
but they're not accurate about the actual implementation of WeakVHs,
which do still track RAUWs.
I don't have any test case for the bug this fixes because it requires
running the recursive simplification on unreachable phi nodes. I've no
way to either run this or easily write an input that triggers it. It was
found when using instruction simplification inside the inliner when
running over the nightly test-suite.
llvm-svn: 153393
The PPC64 SVR4 ABI requires integer stack arguments, and thus the var. args., that
are smaller than 64 bits be zero extended to 64 bits.
llvm-svn: 153373
Code such as:
%vreg100 = setcc %vreg10, -1, SETNE
brcond %vreg10, %tgt
was being incorrectly morphed into
%vreg100 = and %vreg10, 1
brcond %vreg10, %tgt
where the 'and' instruction could be eliminated since
such logic is on 1-bit types in the PTX back-end, leaving
us with just:
brcond %vreg10, %tgt
which essentially gives us inverted branch conditions.
llvm-svn: 153364
destination module, but one of them isn't used in the destination module. If
another module comes along and the uses the unused type, there could be type
conflicts when the modules are finally linked together. (This happened when
building LLVM.)
The test that was reduced is:
Module A:
%Z = type { %A }
%A = type { %B.1, [7 x x86_fp80] }
%B.1 = type { %C }
%C = type { i8* }
declare void @func_x(%C*, i64, i64)
declare void @func_z(%Z* nocapture)
Module B:
%B = type { %C.1 }
%C.1 = type { i8* }
%A.2 = type { %B.3, [5 x x86_fp80] }
%B.3 = type { %C.1 }
define void @func_z() {
%x = alloca %A.2, align 16
%y = getelementptr inbounds %A.2* %x, i64 0, i32 0, i32 0
call void @func_x(%C.1* %y, i64 37, i64 927) nounwind
ret void
}
declare void @func_x(%C.1*, i64, i64)
declare void @func_y(%B* nocapture)
(Unfortunately, this test doesn't fail under llvm-link, only during an LTO
linking.) The '%C' and '%C.1' clash. The destination module gets the '%C'
declaration. When merging Module B, it looks at the '%C.1' subtype of the '%B'
structure. It adds that in, because that's cool. And when '%B.3' is processed,
it uses the '%C.1'. But the '%B' has used '%C' and we prefer to use '%C'. So the
'@func_x' type is changed to 'void (%C*, i64, i64)', but the type of '%x' in
'@func_z' remains '%A.2'. The GEP resolves to a '%C.1', which conflicts with the
'@func_x' signature.
We can resolve this situation by making sure that the type is used in the
destination before saying that it should be used in the module being merged in.
With this fix, LLVM and Clang both compile under LTO.
<rdar://problem/10913281>
llvm-svn: 153351
same basic block, and it's not safe to insert code in the successor
blocks if the edges are critical edges. Splitting those edges is
possible, but undesirable, especially on the unwind side. Instead,
make the bottom-up code motion to consider invokes to be part of
their successor blocks, rather than part of their parent blocks, so
that it doesn't push code past them and onto the edges. This fixes
PR12307.
llvm-svn: 153343
This is necessary if the client wants to be able to mutate TargetOptions (for example, fast FP math mode) after the initial creation of the ExecutionEngine.
llvm-svn: 153342
dominated by Root, check that B is available throughout the scope. This
is obviously true (famous last words?) given the current logic, but the
check may be helpful if more complicated reasoning is added one day.
llvm-svn: 153323
(and hopefully on Windows). The bots have been down most of the day
because of this, and it's not clear to me what all will be required to
fix it.
The commits started with r153205, then r153207, r153208, and r153221.
The first commit seems to be the real culprit, but I couldn't revert
a smaller number of patches.
When resubmitting, r153207 and r153208 should be folded into r153205,
they were simple build fixes.
llvm-svn: 153241
execution-time regression for nsieve-bits on the ARMv7 -O0 -g nightly tester.
This may also improve compile-time on architectures that would otherwise
generate a libcall for urem (e.g., ARM) or fall back to the DAG selector.
rdar://10810716
llvm-svn: 153230
This is in braces so that it doesn't conflict with the existing %p.
It uses braces instead of parens because parens would have to be
regex-escaped.
llvm-svn: 153213
1. Declare a virtual function getPointerToNamedFunction() in JITMemoryManager
2. Move the implementation of getPointerToNamedFunction() form JIT/MCJIT to DefaultJITMemoryManager.
llvm-svn: 153205
Type legalization can zero-extend the elements of the build_vector node, so,
for example, we may have an <8 x i8> with i32 elements of value 255. That
should return 'true' for the vector being all ones.
llvm-svn: 153203
not attched to a basic block or function. There are conservatively
correct answers in these cases, and this makes the analysis more useful
in contexts where we have a partially formed bit of IR.
I don't have any way to test this directly... suggestions welcome here,
but I'm not seeing anything sadly. I only found this using a subsequent
patch to the inliner which runs instsimplify on partially inlined
instructions, and even then only on a quite large program. I never got
a reasonable testcase out of it, and anything I do get is likely to be
quite fragile due to requiring an interaction of two different passes,
and the only result being a segfault if it goes wrong.
llvm-svn: 153176
Adds /usr/lib/debug early to list, as some systems (debian) have unstripped libs in there
Adds /lib/i386-linux-gnu for systems that does multiarch (debian)
llvm-svn: 153174
We can simply confirm the handle released to open it with EXCLUSIVE. Attempting renaming was bad.
Disable win32file at ImportError. Thanks to Francois to let me know.
FIXME: Could we report warning or notification if win32file were not found?
llvm-svn: 153172
Remaining "uncategorized" functions have been organized into their
proper place in the hierarchy. Some functions were moved around so
groups are defined together.
No code changes were made.
llvm-svn: 153169
This gives a lot of love to the docs for the C API. Like Clang's
documentation, the C API is now organized into a Doxygen "module"
(LLVMC). Each C header file is a child of the main module. Some modules
(like Core) have a hierarchy of there own. The produced documentation is
thus better organized (before everything was in one monolithic list).
This patch also includes a lot of new documentation for APIs in Core.h.
It doesn't document them all, but is better than none. Function docs are
missing @param and @return annotation, but the documentation body now
commonly provides help details (like the expected llvm::Value sub-type
to expect).
llvm-svn: 153157
These changes allow us to compile big endian from the command line for 32 bit
Mips targets. This patch will result in code and data actually being produced
in the correct endianess.
llvm-svn: 153153