Add support for switch and indirectbr edges. This works by densely numbering
all blocks which have such terminators, and then separately numbering the
possible successors. Each predecessor writes down a number; the successor knows
its own number (as a ConstantInt) and passes that, along with a pointer to the
number the predecessor wrote down, to the runtime, which looks up the counter
in a per-function table.
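A rough conceptual sketch in C of that scheme (all names and table sizes here
are illustrative, not the real runtime interface):
static unsigned pred_num;               /* the number the predecessor writes down */
static unsigned counters[8][8];         /* per-function table: [pred][succ] edge counts */
static void rt_count_edge(unsigned *pred, unsigned succ) {
  ++counters[*pred][succ];              /* the runtime looks up the counter */
}
/* emitted in a predecessor just before its switch/indirectbr: */
static void predecessor_prologue(void) { pred_num = 7; }
/* emitted at the entry of each possible successor: */
static void successor_prologue(void) { rt_count_edge(&pred_num, 3); }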
Coverage data should now be functional, but I haven't tested it on anything
other than my 2-file synthetic test program for coverage.
llvm-svn: 130186
return it as a clobber. This allows GVN to do smart things.
Enhance GVN to be smart about the case when a small load is clobbered
by a larger overlapping load. In this case, forward the value. This
allows us to compile stuff like this:
int test(void *P) {
int tmp = *(unsigned int*)P;
return tmp+*((unsigned char*)P+1);
}
into:
_test: ## @test
movl (%rdi), %ecx
movzbl %ch, %eax
addl %ecx, %eax
ret
which has one load. We already handled the case where the smaller
load was from a must-aliased base pointer.
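Conceptually (little-endian), the forwarded value is carved out of the wide
load with a shift and a truncate instead of issuing a second load; a rough C
illustration of the equivalent computation (function name is illustrative):
int test_forwarded(void *P) {
  unsigned int wide = *(unsigned int *)P;            /* the one remaining load */
  unsigned char byte1 = (unsigned char)(wide >> 8);  /* *((unsigned char*)P+1) */
  return (int)wide + byte1;                          /* tmp + byte1 */
}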
llvm-svn: 130180
Fixes Thumb2 ADCS and SBCS lowering: <rdar://problem/9275821>.
t2ADCS/t2SBCS are now pseudo instructions, consistent with ARM, so the
assembly printer correctly prints the 's' suffix.
Fixes Thumb2 adde -> SBC matching to check for live/dead carry flags.
Fixes the internal ARM machine opcode mnemonic for ADCS/SBCS.
Fixes ARM SBC lowering to check for live carry (potential bug).
llvm-svn: 130048
<h2>Section Example</h2>
<div> <!-- h2+div is applied -->
<p>Section preamble.</p>
<h3>Subsection Example</h3>
<p> <!-- h3+p is applied -->
Subsection body
</p>
<!-- End of section body -->
</div>
FIXME: Handle h5 better.
llvm-svn: 130040
fix bugs exposed by the gcc dejagnu testsuite:
1. The load may actually be used by a dead instruction, which
would cause an assert.
2. The load may not be used by the current chain of instructions,
and we could move it past a side-effecting instruction. Change
how we process uses to define the problem away.
llvm-svn: 130018
On x86 this allows folding a load into the cmp, greatly reducing register pressure.
movzbl (%rdi), %eax
cmpl $47, %eax
->
cmpb $47, (%rdi)
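A hypothetical source shape that benefits (the function name is illustrative;
the point is that the zero-extended byte load now feeds the compare directly):
int is_slash(const unsigned char *p) {
  return *p == 47;   /* the byte load folds into the compare */
}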
This shaves 8k off gcc.o on i386. I'll leave applying the patch in README.txt to Chris :)
llvm-svn: 130005
This tends to happen a lot with bitfield code generated by clang. A simple example for x86_64 is
uint64_t foo(uint64_t x) { return (x&1) << 42; }
which used to compile into bloated code:
shlq $42, %rdi ## encoding: [0x48,0xc1,0xe7,0x2a]
movabsq $4398046511104, %rax ## encoding: [0x48,0xb8,0x00,0x00,0x00,0x00,0x00,0x04,0x00,0x00]
andq %rdi, %rax ## encoding: [0x48,0x21,0xf8]
ret ## encoding: [0xc3]
with this patch we can fold the immediate into the and:
andq $1, %rdi ## encoding: [0x48,0x83,0xe7,0x01]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
shlq $42, %rax ## encoding: [0x48,0xc1,0xe0,0x2a]
ret ## encoding: [0xc3]
It's possible to save another byte by using 'andl' instead of 'andq' but I currently see no way of doing
that without making this code even more complicated. See the TODOs in the code.
llvm-svn: 129990
add <rd>, sp, #<imm8>
ldr <rd>, [sp, #<imm8>]
When the offset from sp is a multiple of 4 and in the range 0-1020.
This saves code size by utilizing 16-bit instructions.
rdar://9321541
llvm-svn: 129971
An exception is thrown via a call to __cxa_throw, which we don't expect to
return. Therefore, the "true" part of the invoke goes to a BB that has
'unreachable' as its only instruction. This is lowered into an empty MachineBB.
The landing pad for this invoke, however, is directly after the "true" MBB.
When the empty MBB is removed, the landing pad is directly below the BB with the
invoke call. The unconditional branch is removed and then the two blocks are
merged together.
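A hypothetical C++ source shape that produces this situation:
int f() {
  try {
    throw 42;      // lowered to an invoke of __cxa_throw; its normal ("true")
                   // destination contains only 'unreachable'
  } catch (int) {  // the landing pad ends up right after that empty block
    return 1;
  }
  return 0;
}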
The testcase is too big for a regression test.
<rdar://problem/9305728>
llvm-svn: 129965
These intervals are allocatable immediately after splitting, but they may be
evicted because of later splitting. This is rare, but when it happens they
should be split again.
The remainder intervals that cannot be allocated after splitting still move
directly to spilling.
SplitEditor::finish can optionally provide a mapping from new live intervals
back to the original interval indexes returned by openIntv().
Each original interval index can map to multiple new intervals after connected
components have been separated. Dead code elimination may also add existing
intervals to the list.
The reverse mapping allows the SplitEditor client to treat the new intervals
differently depending on the split region they came from.
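A hedged sketch of such client-side use (assuming finish() takes an optional
output vector and that index 0 denotes the unsplit remainder; SE and the
helper are hypothetical):
SmallVector<unsigned, 8> IntvMap;
SE->finish(&IntvMap);                    // IntvMap[i] = openIntv() index of new interval i
for (unsigned i = 0, e = IntvMap.size(); i != e; ++i)
  if (IntvMap[i] == 0)                   // came from the remainder region
    treatAsRemainder(i);                 // hypothetical helper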
llvm-svn: 129925
This patch depends on the prior fix r129908, which changed to use std::find,
rather than std::binary_search, on an unordered array.
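For context, a minimal C++ illustration of why that prior fix matters:
std::binary_search requires a sorted range, so on an unordered array its
answer is unreliable, while std::find does a plain linear scan:
#include <algorithm>
static bool contains(const unsigned *B, const unsigned *E, unsigned V) {
  return std::find(B, E, V) != E;   // safe on unsorted data
}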
Patch by Dan Bailey
llvm-svn: 129909
necessary since gcov counts transitions between blocks. It can't see if you've
run every line in a straight-line function, so we add an edge for it to notice.
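For example, a straight-line function like the one below has no internal
control flow, so without the added edge gcov would see no block transitions
to count (function name is illustrative):
int add_one(int x) {
  return x + 1;   /* the entire body is a single basic block */
}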
llvm-svn: 129905
<h2>Section Example</h2>
<div> <!-- h2+div is applied -->
<p>Section preamble.</p>
<h3>Subsection Example</h3>
<p> <!-- h3+p is applied -->
Subsection body
</p>
<!-- End of section body -->
</div>
llvm-svn: 129901
TII::isTriviallyReMaterializable() shouldn't depend on any properties of the
register being defined by the instruction. Rematerialization is going to create
a new virtual register anyway.
llvm-svn: 129882
On the x86-64 and thumb2 targets, some registers are more expensive to encode
than others in the same register class.
Add a CostPerUse field to the TableGen register description, and make it
available from TRI->getCostPerUse. This represents the cost of a REX prefix or a
32-bit instruction encoding required by choosing a high register.
Teach the greedy register allocator to prefer cheap registers for busy live
ranges (as indicated by spill weight).
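A hedged sketch of querying the new hook (X86 registers chosen purely for
illustration; TRI, LR, isBusy(), and tryAssignFirst() are assumed/hypothetical):
unsigned CostEAX = TRI->getCostPerUse(X86::EAX);  // classic encoding
unsigned CostR8D = TRI->getCostPerUse(X86::R8D);  // needs a REX prefix
if (isBusy(LR) && CostEAX < CostR8D)              // busy ranges pay the cost at every use
  tryAssignFirst(LR, X86::EAX);                   // so hand them the cheap register first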
llvm-svn: 129864
used by Clang. To help Clang integration, the PTX target has been split
into two targets: ptx32 and ptx64, depending on the desired pointer size.
- Add GCCBuiltin class to all intrinsics
- Split PTX target into ptx32 and ptx64
llvm-svn: 129851
llvm is built with unsigned chars where an immediate such as 0xff would be zero
extended to 64-bits, turning "cmp $0xff,%eax" into
"cmp $0xffffffffffffffff,%eax".
llvm-svn: 129845
manually and pass all (now) 4 arguments to the mul libcall. Add a new
ExpandLibCall for just this (copied gratuitously from type legalization).
Fixes rdar://9292577
llvm-svn: 129842
- As before, there is a minor semantic change here (evidenced by the test
change) for Darwin triples that have no version component. I debated changing
the default behavior of isOSVersionLT, but decided it made more sense for
triples to be explicit.
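For reference, a minimal sketch of the query involved (standard llvm::Triple
usage; the triple strings and version numbers are illustrative):
llvm::Triple T("x86_64-apple-darwin10");
bool Old = T.isOSVersionLT(10, 6);   // true if the OS version is below 10.6
llvm::Triple Bare("x86_64-apple-darwin");
bool B = Bare.isOSVersionLT(10, 6);  // the version-less case is exactly the
                                     // semantic detail noted above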
llvm-svn: 129805
- There is a minor semantic change here (evidenced by the test change) for
Darwin triples that have no version component. I debated changing the default
behavior of isOSVersionLT, but decided it made more sense for triples to be
explicit.
llvm-svn: 129802
Making use of VFP / NEON floating point multiply-accumulate / subtraction is
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first few cycles (4? on Cortex-A8) can
cause an additional pipeline stall. So it's frequently better to simply codegen
vmul + vadd.
2. A vmla followed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed by a vmla is a special case. Obviously, issuing back-to-back
RAW vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoided codegen'ing fp vmla / vmls. This works well enough,
but it isn't the optimal solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute both an fmul and an fmla (see the source example after this list).
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Enable these fp vmlx codegen changes for Cortex-A9.
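A source-level illustration of the (fadd (fmul)) pattern from item B (function
name is illustrative):
float mac(float acc, float a, float b) {
  return acc + a * b;   // single-use fmul feeding an fadd: a candidate for vmla
}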
llvm-svn: 129775
Add a avoidWriteAfterWrite() target hook to identify register classes that
suffer from write-after-write hazards. For those register classes, try to avoid
writing the same register in two consecutive instructions.
This is currently disabled by default. We should not spill to avoid hazards!
The command line flag -avoid-waw-hazard can be used to enable waw avoidance.
llvm-svn: 129772
when they are a truncate from something else. This eliminates fully half of all the
fastisel rejections on a test C++ file I'm working with, which should make a substantial
improvement for -O0 compiles of C++ code.
This fixed rdar://9297003 - fast isel bails out on all functions taking bools
llvm-svn: 129752
Before we would bail out on i1 arguments altogether; now we just bail on
non-constant ones. Also, we used to emit extraneous code. e.g. test12 was:
movb $0, %al
movzbl %al, %edi
callq _test12
and test13 was:
movb $0, %al
xorl %edi, %edi
movb %al, 7(%rsp)
callq _test13f
Now we get:
movl $0, %edi
callq _test12
and:
movl $0, %edi
callq _test13f
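A hypothetical source shape for such calls (the actual test signatures are not
shown in this message):
void test12(bool b);
void caller() {
  test12(false);   // constant i1 argument: now a single 'movl $0, %edi' before the call
}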
llvm-svn: 129751
is, it assumes addresses are 64-bit aligned (which should be the more common
case). If the address is found not to be aligned, then getOperandLatency()
adjusts the operand latency computation by one to compensate for it.
rdar://9294833
llvm-svn: 129742
small heap-allocated SmallString because it unconditionally forces a malloc.
(Revised version of r129688, with the necessary flush() call.)
llvm-svn: 129716
registers for fast allocation a different way. This has us updating
used registers only when we're using that exact register.
Fixes rdar://9207598
llvm-svn: 129711
the generated FastISel. X86 doesn't need to generate code to match ADD16ri8
since ADD16ri will do just fine. This is a small codesize win in the generated
instruction selector.
llvm-svn: 129692
value constraints on them (when defined as ImmLeaf's). This is particularly important
for X86-64, where almost all reg/imm instructions take a i64immSExt32 immediate operand,
which has a value constraint. Before this patch we ended up iseling the examples into
such amazing code as:
movabsq $7, %rax
imulq %rax, %rdi
movq %rdi, %rax
ret
now we produce:
imulq $7, %rdi, %rax
ret
This dramatically shrinks the generated code at -O0 on x86-64.
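A hypothetical source for the sequence above:
long test(long x) {
  return x * 7;   // 7 satisfies i64immSExt32, so 'imulq $7, %rdi, %rax' applies
}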
llvm-svn: 129691
simplifying them and exposing more information to tblgen. It would be nice
if other target authors adopted this as well, particularly ARM since it has fastisel.
llvm-svn: 129676
kind of predicate: one that is specific to imm nodes. The predicate function
specified here just checks an int64_t directly instead of messing around with
SDNodes. The virtue of this is that it means that fastisel and other things
can reason about these predicates.
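A minimal sketch of the kind of check such a predicate body performs, written
as plain C++ over the immediate value (the function name is illustrative):
static bool isSExt32Imm(int64_t Imm) {
  return Imm == (int64_t)(int32_t)Imm;   // value fits in a sign-extended 32-bit field
}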
llvm-svn: 129675
structure and fix some fixmes. We now have a TreePredicateFn class
that handles all of the decoding of these things. This is an internal
cleanup that has no impact on the code generated by tblgen.
llvm-svn: 129670
2. implement rdar://9289501 - fast isel should fold trivial multiplies to shifts (see the example after this list)
3. teach tblgen to handle shift immediates that are different sizes than the
shifted operands, eliminating some code from the X86 fast isel backend.
4. Have FastISel::SelectBinaryOp use (the poorly named) FastEmit_ri_ function
instead of FastEmit_ri to simplify code.
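A source-level example of item 2 (function name is illustrative):
long scale(long x) {
  return x * 8;   // a trivial multiply by a power of two, emitted as x << 3
}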
llvm-svn: 129666
when we have a global variable base and an index. Instead, just give up on
folding the global variable.
Before we'd generate:
_test: ## @test
## BB#0:
movq _rtx_length@GOTPCREL(%rip), %rax
leaq (%rax), %rax
addq %rdi, %rax
movzbl (%rax), %eax
ret
now we generate:
_test: ## @test
## BB#0:
movq _rtx_length@GOTPCREL(%rip), %rax
movzbl (%rax,%rdi), %eax
ret
The difference is even more significant when there is a scale
involved.
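A hypothetical source shape matching the code above (rtx_length is the global
from the assembly; the signature is illustrative):
extern unsigned char rtx_length[];
int test(long i) {
  return rtx_length[i];   // global base plus index: fold the index, not the global
}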
This fixes rdar://9289558 - total fail with addr mode formation at -O0/x86-64
llvm-svn: 129664
less trivial things) into a dummy lea. Before we generated:
_test: ## @test
movq _G@GOTPCREL(%rip), %rax
leaq (%rax), %rax
ret
now we produce:
_test: ## @test
movq _G@GOTPCREL(%rip), %rax
ret
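A hypothetical source shape for the code above (G is the global from the
assembly; the signature is illustrative):
extern int G;
int *test(void) {
  return &G;   // the GOT load alone yields the address; no extra lea needed
}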
This is part of rdar://9289558
llvm-svn: 129662
The basic issue here is that bottom-up isel is matching the branch
and compare, and was failing to fold the load into the branch/compare
combo. Fixing this (by allowing folding into any instruction of a
sequence that is selected) allows us to produce things like:
cmpb $0, 52(%rax)
je LBB4_2
instead of:
movb 52(%rax), %cl
cmpb $0, %cl
je LBB4_2
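A hypothetical source shape for the branch above (the struct layout is chosen
so the field lands at offset 52, purely for illustration):
struct S { char pad[52]; char flag; };
int test(struct S *s) {
  if (s->flag == 0)   // load folded into the compare: cmpb $0, 52(%rax)
    return 1;
  return 2;
}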
This makes the generated -O0 code run a bit faster, but also speeds up
compile time by putting less pressure on the register allocator and
generating less code.
This was one of the biggest classes of missing load folding. Implementing
this shrinks 176.gcc's c-decl.s (as a random example) by about 4% in (verbose-asm)
line count.
llvm-svn: 129656
Break the arc-profile code out into its own function, as the notes emission code
already is, and reorder the functions in the file.
The only functionality change is that we no longer modify the Module when the
Module has no debug info to use.
llvm-svn: 129631
The transferValues() function can now handle both singly and multiply defined
values, as long as the resulting live range is known. Only rematerialized values
have their live range recomputed by extendRange().
The updateSSA() function can now insert PHI values in bulk across multiple
values in multiple target registers in one pass. The list of blocks received
from transferValues() is in layout order which seems to work well for the
iterative algorithm. Blocks from extendRange() are still in reverse BFS order,
but this function is used so rarely now that it doesn't matter.
llvm-svn: 129580
Change ELF systems to use CFI for producing the EH tables. This reduces the
size of the clang binary in Debug builds from 690MB to 679MB.
llvm-svn: 129571