-stable-loops enables a new algorithm for generating the Loop
forest. It differs from the original algorithm in a few respects:
- Not determined by use-list order.
- Initially guarantees RPO order of block and subloops.
- Linear in the number of CFG edges.
- Nonrecursive.
I didn't want to change the LoopInfo API yet, so the block lists are
still inclusive. This seems strange to me, and it means that building
LoopInfo is not strictly linear, but it may not be a problem in
practice. At least the block lists start out in RPO order now. In the
future we may add an attribute or wrapper analysis that allows other
passes to assume RPO order.
The primary motivation of this work was not to optimize LoopInfo, but
to allow reproducing performance issues by decomposing the compilation
stages. I'm often unable to do this with the current LoopInfo, because
the loop tree order determines Loop pass order. Serializing the IR
tends to invert the order, which reverses the optimization order. This
makes it nearly impossible to debug interdependent loop optimizations
such as LSR.
I also believe this will provide more stable performance results across time.
llvm-svn: 158790
The implementation only needs inclusion from LoopInfo.cpp and
MachineLoopInfo.cpp. Clients of the interface should only include the
interface. This makes the interface readable and speeds up rebuilds
after modifying the implementation.
llvm-svn: 158787
llvm::RawMemoryObject handles empty ranges just fine, and the assert can
be triggered in the wild by e.g. invoking clang with a file that
included an empty pre-compiled header file when clang has been built
with assertions enabled. Without assertions enabled, clang will properly
report that the empty file is not a valid PCH.
llvm-svn: 158769
When LiveIntervals is tracking fixed interference in regunits, make sure
to update those intervals as well. Currently guarded by -live-regunits.
llvm-svn: 158766
ensureAlignment() in MachineFunction). Also, drop setMaxAlignment() in
favor of this new function. This creates a main entry point to setting
MaxAlignment, which will be helpful for future work. No functionality
change intended.
llvm-svn: 158758
This patch adds DAG combines to form FMAs from pairs of FADD + FMUL or
FSUB + FMUL. The combines are performed when:
(a) Either
AllowExcessFPPrecision option (-enable-excess-fp-precision for llc)
OR
UnsafeFPMath option (-enable-unsafe-fp-math)
are set, and
(b) TargetLoweringInfo::isFMAFasterThanMulAndAdd(VT) is true for the type of
the FADD/FSUB, and
(c) The FMUL only has one user (the FADD/FSUB).
If your target has fast FMA instructions you can make use of these combines by
overriding TargetLoweringInfo::isFMAFasterThanMulAndAdd(VT) to return true for
types supported by your FMA instruction, and adding patterns to match ISD::FMA
to your FMA instructions.
llvm-svn: 158757
The PPC::EXTSW instruction preserves the low 32 bits of its input, just
like some of the x86 instructions. Use it to reduce register pressure
when the low 32 bits have multiple uses.
This requires a small change to PeepholeOptimizer since EXTSW takes a
64-bit input register.
This is related to PR5997.
llvm-svn: 158743
StringMap suffered from the same bug as DenseMap: when you explicitly
construct it with a small number of buckets, you can arrange for the
tombstone-based growth path to be followed when the number of buckets
was less than '8'. In that case, even with a full map, it would compare
'0' as not less than '0', and refuse to grow the table, leading to
inf-loops trying to find an empty bucket on the next insertion. The fix
is very simple: use '<=' as the comparison. The same fix was applied to
DenseMap as well during its recent refactoring.
Thanks to Alex Bolz for the great report and test case. =]
llvm-svn: 158725
For processors with the G5-like instruction-grouping scheme, this helps avoid
early group termination due to a write-after-write dependency within the group.
It should also help on pipelined embedded cores.
On POWER7, over the test suite, this gives an average 0.5% speedup. The largest
speedups are:
SingleSource/Benchmarks/Stanford/Quicksort - 33%
MultiSource/Applications/d/make_dparser - 21%
MultiSource/Benchmarks/FreeBench/analyzer/analyzer - 12%
MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft - 12%
Largest slowdowns:
SingleSource/Benchmarks/Stanford/Bubblesort - 23%
MultiSource/Benchmarks/Prolangs-C++/city/city - 21%
MultiSource/Benchmarks/BitBench/uuencode/uuencode - 16%
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode - 13%
llvm-svn: 158719
TargetLoweringObjectFileELF. Use this to support it on X86. Unlike ARM,
on X86 it is not easy to find out if .init_array should be used or not, so
the decision is made via TargetOptions and defaults to off.
Add a command line option to llc that enables it.
llvm-svn: 158692
Original commit msg:
add the 'alloc' metadata node to represent the size of offset of buffers pointed to by pointers.
This metadata can be attached to any instruction returning a pointer
llvm-svn: 158688
This patch changes the type used to hold the FU bitset from unsigned to uint64_t.
This will be needed for some upcoming PowerPC itineraries.
llvm-svn: 158679
The NOP, WFE, WFI, SEV and YIELD instructions are all hints w/
a different immediate value in bits [7,0]. Define a generic HINT
instruction and refactor NOP, WFI, WFI, SEV and YIELD to be
assembly aliases of that.
rdar://11600518
llvm-svn: 158674
when a compile time constant is known. This occurs when implicitly zero
extending function arguments from 16 bits to 32 bits. The 8 bit case doesn't
need to be handled, as the 8 bit constants are encoded directly, thereby
not needing a separate load instruction to form the constant into a register.
<rdar://problem/11481151>
llvm-svn: 158659
This patch causes problems when both dynamic stack realignment and
dynamic allocas combine in the same function. With this patch, we no
longer build the epilog correctly, and silently restore registers from
the wrong position in the stack.
Thanks to Matt for tracking this down, and getting at least an initial
test case to Chad. I'm going to try to check a variation of that test
case in so we can easily track the fixes required.
llvm-svn: 158654
This cleans up the method used to find trip counts in order to form CTR loops on PPC.
This refactoring allows the pass to find loops which have a constant trip count but also
happen to end with a comparison to zero. This also adds explicit FIXMEs to mark two different
classes of loops that are currently ignored.
In addition, we now search through all potential induction operations instead of just the first.
Also, we check the predicate code on the conditional branch and abort the transformation if the
code is not EQ or NE, and we then make sure that the branch to be transformed matches the
condition register defined by the comparison (multiple possible comparisons will be considered).
llvm-svn: 158607
The present implementation handles only TBAA and FP metadata, discarding everything else.
For debug metadata, the current behavior is maintained (the debug metadata associated with
one of the instructions will be kept, discarding that attached to the other).
This should address PR 13040.
llvm-svn: 158606
Dynamic GEPs created by SROA needed to insert extra "i32 0"
operands to index through structs and arrays to get to the
vector being indexed.
llvm-svn: 158590
This is likely only the tip of the ice berg, but this particular bug
caused any double-free on a glibc system to turn into a deadlock! It is
not generally safe to either allocate or release heap memory from within
the signal handler. The 'pop_back()' in RemoveFilesToRemove was deleting
memory and causing the deadlock. What's worse, eraseFromDisk in PathV1
has lots of allocation and deallocation paths. We even passed 'true' in
a place that would have caused the *signal handler* to try to run the
'system' system call and shell out to 'rm -rf'. That was never going to
work...
This patch switches the file removal to use a vector of strings so that
the exact text needed for the 'unlink' system call can be stored there.
It switches the loop to be a boring indexed loop, and directly calls
unlink without looking at the error. It also works quite hard to ensure
that calling 'c_str()' is safe, by ensuring that the non-signal-handling
code path that manipulates the vector always leaves it in a state where
every element has already had 'c_str()' called at least once.
I dunno exactly how overkill this is, but it fixes the
deadlock-on-double free issue, and seems likely to prevent any other
issues from sneaking up.
Sorry for not having a test case, but I *really* don't know how to test
signal handling code easily....
llvm-svn: 158580
This patch will optimize abs(x-y)
FROM
sub, movs, rsbmi
TO
subs, rsbmi
For abs, we will use cmp instead of movs. This is necessary because we already
have an existing peephole pass which optimizes away cmp following sub.
rdar: 11633193
llvm-svn: 158551
For non-address users, Base and Scaled registers are not specially
associated to fit an address mode, so SCEVExpander should apply normal
expansion rules. Otherwise we may sink computation into inner loops
that have already been optimized.
llvm-svn: 158537
example degenerate phi nodes and binops that use themselves in unreachable code.
Thanks to Charles Davis for the testcase that uncovered this can of worms.
llvm-svn: 158508
to load an immediate that does not fit into 16-bit. Also, take into
consideration the global base register slot on the stack when computing the
stack size.
llvm-svn: 158430
delay slot filler pass of MIPS, per suggestion of Jakob Stoklund Olesen.
This change, along with the fix in r158154, enables machine verification
to be run after delay slot filling.
llvm-svn: 158426
since then the entire expression must equal zero (similarly for other operations
with an absorbing element). With this in place a bunch of reassociate code for
handling constants is dead since it is all taken care of when linearizing. No
intended functionality change.
llvm-svn: 158398
This patch extends FoldBranchToCommonDest to fold unconditional branches.
For unconditional branches, we fold them if it is easy to update the phi nodes
in the common successors.
rdar://10554090
llvm-svn: 158392
For store->load dependencies that may alias, we should always use
TrueMemOrderLatency, which may eventually become a subtarget hook. In
effect, we should guarantee at least TrueMemOrderLatency on at least
one DAG path from a store to a may-alias load.
This should fix the standard mode as well as -enable-aa-sched-mi".
llvm-svn: 158380
POD type, causing memory corruption when mapping to APInts with bitwidth > 64.
Merge another crash testcase into crash.ll while there.
llvm-svn: 158369
POWER4 is a 64-bit CPU (better matched to the 970).
The g3 is really the 750 (no altivec), the g4+ is the 74xx (not the 750).
Patch by Andreas Tobler.
llvm-svn: 158363
topologies, it is quite possible for a leaf node to have huge multiplicity, for
example: x0 = x*x, x1 = x0*x0, x2 = x1*x1, ... rapidly gives a value which is x
raised to a vast power (the multiplicity, or weight, of x). This patch fixes
the computation of weights by correctly computing them no matter how big they
are, rather than just overflowing and getting a wrong value. It turns out that
the weight for a value never needs more bits to represent than the value itself,
so it is enough to represent weights as APInts of the same bitwidth and do the
right overflow-avoiding dance steps when computing weights. As a side-effect it
reduces the number of multiplies needed in some cases of large powers. While
there, in view of external uses (eg by the vectorizer) I made LinearizeExprTree
static, pushing the rank computation out into users. This is progress towards
fixing PR13021.
llvm-svn: 158358
Original commit message:
Move PPC host-CPU detection logic from PPCSubtarget into sys::getHostCPUName().
Both the new Linux functionality and the old Darwin functions have been moved.
This change also allows this information to be queried directly by clang and
other frontends (clang, for example, will now have real -mcpu=native support).
llvm-svn: 158349
thread local data, embed them in the class using a uint64_t and make sure
we get compiler errors if there's a platform where this is not big enough.
This makes ThreadLocal more safe for using it in conjunction with CrashRecoveryContext.
Related to crash in rdar://11434201.
llvm-svn: 158342
Both the new Linux functionality and the old Darwin functions have been moved.
This change also allows this information to be queried directly by clang and
other frontends (clang, for example, will now have real -mcpu=native support).
llvm-svn: 158337
The PPC target feature gpul (IsGigaProcessor) was only used for one thing:
To enable the generation of the MFOCRF instruction. Furthermore, this
instruction is available on other PPC cores outside of the G5 line. This
feature now corresponds to the HasMFOCRF flag.
No functionality change.
llvm-svn: 158323
We turned off the CMN instruction because it had semantics which we weren't
getting correct. If we are comparing with an immediate, then it's okay to use
the CMN instruction.
<rdar://problem/7569620>
llvm-svn: 158302
This saves a cast, and zext is more expensive on platforms with subreg support
than trunc is. This occurs in the BSD implementation of memchr(3), see PR12750.
On the synthetic benchmark from that bug stupid_memchr and bsd_memchr have the
same performance now when not inlining either function.
stupid_memchr: 323.0us
bsd_memchr: 321.0us
memchr: 479.0us
where memchr is the llvm-gcc compiled bsd_memchr from osx lion's libc. When
inlining is enabled bsd_memchr still regresses down to llvm-gcc memchr time,
I haven't fully understood the issue yet, something is grossly mangling the
loop after inlining.
llvm-svn: 158297
Over the entire test-suite, this has an insignificantly negative average
performance impact, but reduces some of the worst slowdowns from the
anti-dep. change (r158294).
Largest speedups:
SingleSource/Benchmarks/Stanford/Quicksort - 28%
SingleSource/Benchmarks/Stanford/Towers - 24%
SingleSource/Benchmarks/Shootout-C++/matrix - 23%
MultiSource/Benchmarks/SciMark2-C/scimark2 - 19%
MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount - 15%
(matrix and automotive-bitcount were both in the top-5 slowdown list from the
anti-dep. change)
Largest slowdowns:
MultiSource/Benchmarks/McCat/03-testtrie/testtrie - 28%
MultiSource/Benchmarks/mediabench/gsm/toast/toast - 26%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan - 21%
SingleSource/Benchmarks/CoyoteBench/lpbench - 20%
MultiSource/Applications/d/make_dparser - 16%
llvm-svn: 158296
Using 'all' instead of 'critical' would be better because it would make it easier to
satisfy the bundling constraints, but, as noted in the FIXME, that is currently not
possible with the crs.
This yields an average 1% speedup over the entire test suite (on Power 7). Largest speedups:
SingleSource/Benchmarks/Shootout-C++/moments - 40%
MultiSource/Benchmarks/McCat/03-testtrie/testtrie - 28%
SingleSource/Benchmarks/BenchmarkGame/nsieve-bits - 26%
SingleSource/Benchmarks/McGill/misr - 23%
MultiSource/Applications/JM/ldecod/ldecod - 22%
Largest slowdowns:
SingleSource/Benchmarks/Shootout-C++/matrix - -29%
SingleSource/Benchmarks/Shootout-C++/ary3 - -22%
MultiSource/Benchmarks/BitBench/uuencode/uuencode - -18%
SingleSource/Benchmarks/Shootout-C++/ary - -17%
MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount - -15%
llvm-svn: 158294
The PPC64 backend had patterns for i32 <-> i64 extensions and truncations that
would leave self-moves in the final assembly. Replacing those patterns with ones
based on the SUBREG builtins yields better-looking code.
Thanks to Jakob and Owen for their suggestions in this matter.
llvm-svn: 158283
Tail merging had been disabled on PPC because it would disturb bundling decisions
made during pre-RA scheduling on the 970 cores. Now, however, all bundling decisions
are made during post-RA scheduling, and tail merging is generally beneficial (the
average test-suite speedup is insignificantly positive).
Largest test-suite speedups:
MultiSource/Benchmarks/mediabench/gsm/toast/toast - 30%
MultiSource/Benchmarks/BitBench/uuencode/uuencode - 23%
SingleSource/Benchmarks/Shootout-C++/ary - 21%
SingleSource/Benchmarks/Stanford/Queens - 17%
Largest slowdowns:
MultiSource/Benchmarks/MiBench/security-sha/security-sha - 24%
MultiSource/Benchmarks/McCat/03-testtrie/testtrie - 22%
MultiSource/Applications/JM/ldecod/ldecod - 14%
MultiSource/Benchmarks/mediabench/g721/g721encode/encode - 9%
This is improved by using full (instead of just critical) anti-dependency breaking,
but doing so still causes miscompiles and so cannot yet be enabled by default.
llvm-svn: 158259
The LiveRegMatrix represents the live range of assigned virtual
registers in a Live interval union per register unit. This is not
fundamentally different from the interference tracking in RegAllocBase
that both RABasic and RAGreedy use.
The important differences are:
- LiveRegMatrix tracks interference per register unit instead of per
physical register. This makes interference checks cheaper and
assignments slightly more expensive. For example, the ARM D7 reigster
has 24 aliases, so we would check 24 physregs before assigning to one.
With unit-based interference, we check 2 units before assigning to 2
units.
- LiveRegMatrix caches regmask interference checks. That is currently
duplicated functionality in RABasic and RAGreedy.
- LiveRegMatrix is a pass which makes it possible to insert
target-dependent passes between register allocation and rewriting.
Such passes could tweak the register assignments with interference
checking support from LiveRegMatrix.
Eventually, RABasic and RAGreedy will be switched to LiveRegMatrix.
llvm-svn: 158255
This deduplicates some code from the optimizing register allocators, and
it means that it is now possible to change the register allocators'
solutions simply by editing the VirtRegMap between the register
allocator pass and the rewriter.
llvm-svn: 158249
OK, not really. We don't want to reintroduce the old rewriter hacks.
This patch extracts virtual register rewriting as a separate pass that
runs after the register allocator. This is possible now that
CodeGen/Passes.cpp can configure the full optimizing register allocator
pipeline.
The rewriter pass uses register assignments in VirtRegMap to rewrite
virtual registers to physical registers, and it inserts kill flags based
on live intervals.
These finalization steps are the same for the optimizing register
allocators: RABasic, RAGreedy, and PBQP.
llvm-svn: 158244
-%a + 42
into
42 - %a
previously we were emitting:
-(%a + 42)
This fixes the infinite loop in PR12338. The generated code is still not perfect, though.
Will work on that next
llvm-svn: 158237
problem was that by moving instructions around inside the function, the pass
could accidentally move the iterator being used to advance over the function
too. Fix this by only processing the instruction equal to the iterator, and
leaving processing of instructions that might not be equal to the iterator
to later (later = after traversing the basic block; it could also wait until
after traversing the entire function, but this might make the sets quite big).
Original commit message:
Grab-bag of reassociate tweaks. Unify handling of dead instructions and
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158226
Thanks to Jakob's help, this now causes no new test suite failures!
Over the entire test suite, this gives an average 1% speedup. The largest speedups are:
SingleSource/Benchmarks/Misc/pi - 108%
SingleSource/Benchmarks/CoyoteBench/lpbench - 54%
MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail - 50%
SingleSource/Benchmarks/Shootout/ary3 - 32%
SingleSource/Benchmarks/Shootout-C++/matrix - 30%
The largest slowdowns are:
MultiSource/Benchmarks/mediabench/gsm/toast/toast - -30%
MultiSource/Benchmarks/Prolangs-C/bison/mybison - -25%
MultiSource/Benchmarks/BitBench/uuencode/uuencode - -22%
MultiSource/Applications/d/make_dparser - -14%
SingleSource/Benchmarks/Shootout-C++/ary - -13%
In light of these slowdowns, additional profiling work is obviously needed!
llvm-svn: 158223
Marking these classes as non-alocatable allows CTR loop generation to
work correctly with the block placement passes, etc. These register
classes are currently used only by some unused TCRETURN patterns.
In future cleanup, these will be removed.
Thanks again to Jakob for suggesting this fix to the CTR loop problem!
llvm-svn: 158221
Bulk move of TargetInstrInfo implementation into
TargetInstrInfoImpl. This is dirty because the code isn't part of
TargetInstrInfoImpl class, nor should it be, because the methods are
not target hooks. However, it's the current mechanism for keeping
libTarget useful outside the backend. You'll get a not-so-nice link
error if you invoke a TargetInstrInfo method that depends on CodeGen.
The TargetInstrInfoImpl class should probably be removed since it
doesn't really solve this problem.
To really fix this, we probably need separate interfaces for the
CodeGen/nonCodeGen sides of TargetInstrInfo.
llvm-svn: 158212
The pass itself works well, but the something in the Machine* infrastructure
does not understand terminators which define registers. Without the ability
to use the block-placement pass, etc. this causes performance regressions (and
so is turned off by default). Turning off the analysis turns off the problems
with the Machine* infrastructure.
llvm-svn: 158206
The code which tests for an induction operation cannot assume that any
ADDI instruction will have a register operand because the operand could
also be a frame index; for example:
%vreg16<def> = ADDI8 <fi#0>, 0; G8RC:%vreg16
llvm-svn: 158205
This pass is derived from the Hexagon HardwareLoops pass. The only significant enhancement over the Hexagon
pass is that PPCCTRLoops will also attempt to delete the replaced add and compare operations if they are
no longer otherwise used. Also, invalid preheader DebugLoc is not used.
llvm-svn: 158204
can move instructions within the instruction list. If the instruction just
happens to be the one the basic block iterator is pointing to, and it is
moved to a different basic block, then we get into an infinite loop due to
the iterator running off the end of the basic block (for some reason this
doesn't fire any assertions). Original commit message:
Grab-bag of reassociate tweaks. Unify handling of dead instructions and
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158199
This patch will generate the following for integer ABS:
movl %edi, %eax
negl %eax
cmovll %edi, %eax
INSTEAD OF
movl %edi, %ecx
sarl $31, %ecx
leal (%rdi,%rcx), %eax
xorl %ecx, %eax
There exists a target-independent DAG combine for integer ABS, which converts
integer ABS to sar+add+xor. For X86, we match this pattern back to neg+cmov.
This is implemented in PerformXorCombine.
rdar://10695237
llvm-svn: 158175
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 158126
The commit is intended to fix rdar://11540023.
It is implemented as part of peephole optimization. We can actually implement
this in the SelectionDAG lowering phase.
llvm-svn: 158122
Bundles should be treated as one atomic transaction when checking
liveness. That is how the register allocator (and VLIW targets) treats
bundles.
llvm-svn: 158116
LLVM is now -Wunused-private-field clean except for
- lib/MC/MCDisassembler/Disassembler.h. Not sure why it keeps all those unaccessible fields.
- gtest.
llvm-svn: 158096
There are some that I didn't remove this round because they looked like
obvious stubs. There are dead variables in gtest too, they should be
fixed upstream.
llvm-svn: 158090
instructions to reoptimize. Exploit this to more systematically eliminate
dead instructions (this isn't very useful in practice but is convenient for
analysing some testcase I am working on). No need for WeakVH any more: use
an AssertingVH instead.
llvm-svn: 158073
Soon we'll be making LiveIntervalUnions for register units as well.
This was the only place using the RepReg member, so just remove it.
llvm-svn: 158038
Don't print out the register number and spill weight, making the TRI
argument unnecessary.
This allows callers to interpret the reg field. It can currently be a
virtual register, a physical register, a spill slot, or a register unit.
llvm-svn: 158031
Instead of computing a live interval per physreg, LiveIntervals can
compute live intervals per register unit. This makes impossible the
confusing situation where aliasing registers could have overlapping live
intervals. It should also make fixed interferernce checking cheaper
since registers have fewer register units than aliases.
Live intervals for regunits are computed on demand, using MRI use-def
chains and the new LiveRangeCalc class. Only regunits live in to ABI
blocks are precomputed during LiveIntervals::runOnMachineFunction().
The regunit liveness computations don't depend on LiveVariables.
llvm-svn: 158029
expression (a * b + c) that can be implemented as a fused multiply-add (fma)
if the target determines that this will be more efficient. This intrinsic
will be used to implement FP_CONTRACT support and an aggressive FMA formation
mode.
If your target has a fast FMA instruction you should override the
isFMAFasterThanMulAndAdd method in TargetLowering to return true.
llvm-svn: 158014