("T is 1 if the target symbol S has type STT_FUNC and the
symbol addresses a Thumb instruction; it is 0 otherwise."
from "ELF for the ARM Architecture" 4.7.1.2)
Patch by Koan-Sin Tan!
llvm-svn: 131406
intrinsic call. This prevents it from being reordered so that it appears
*before* the setjmp intrinsic (thus making it completely useless).
<rdar://problem/9409683>
llvm-svn: 131174
who used this flag, and it now emits CFI and doesn't emit this anymore. All
other targets left this flag "false".
<rdar://problem/8486371>
llvm-svn: 130918
LiveVariables doesn't understand that clobbering D0 and D1 completely overwrites
Q0, so if Q0 is live-in to a function, its live range will extend beyond a
function call that only clobbers D0 and D1. This shows up in the
ARM/2009-11-01-NeonMoves test case.
LiveVariables should probably implement the much stricter rules for physreg
liveness that RAFast imposes - a physreg is killed by the first use of any
alias.
llvm-svn: 130801
model constants which can be added to base registers via add-immediate
instructions which don't require an additional register to materialize
the immediate.
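A hedged sketch of the idea (the bound below is illustrative, not any target's
actual hook):
#include <cstdint>
// An offset that fits the target's add-immediate range can be folded into
// "add rD, rBase, #offset" without a scratch register to hold the constant.
bool fitsInAddImmediate(int64_t Offset) {
  return Offset >= 0 && Offset < 4096; // range is target/encoding specific
}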
llvm-svn: 130743
for all symbol differences and can drop the old EmitPCRelSymbolValue
method.
This also makes getExprForFDESymbol on ELF equal to the one on MachO, and it
can be made non-virtual.
llvm-svn: 130634
after folding ADD32ri to ADD32mi, so don't do that.
This only happens when the greedy register allocator gets itself in trouble and
spills %vreg9 here:
16L %vreg9<def> = MOVPC32r 0, %ESP<imp-use>; GR32:%vreg9
48L %vreg9<def> = ADD32ri %vreg9, <es:_GLOBAL_OFFSET_TABLE_>[TF=1], %EFLAGS<imp-def,dead>; GR32:%vreg9
That should never happen, the live range should be split instead.
llvm-svn: 130625
Currently the output should be almost identical to the one produced by CodeGen
to make the transition easier.
The only two differences I know of are:
* Some files get an extra advance loc of size 0. This will be fixed when
relaxations are enabled.
* The optimization of declaring an EH symbol as an external variable is not
implemented. This is a subset of adding the nounwind attribute, so if we really
want this at -O0 we should probably do it at the IL level.
llvm-svn: 130623
- expansion of SELECT_CC into SETCC
- force SETCC result type to i1
- custom selection for handling i1 using SETCC
Patch by Dan Bailey
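For context, a hedged generic example (not from the patch): a compare feeding a
select is the kind of source that reaches the backend as SELECT_CC/SETCC, which
is where the i1 handling above matters.
// The ternary becomes a setcc (i1 result) feeding a select.
int pick(float a, float b, int x, int y) {
  return a < b ? x : y;
}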
llvm-svn: 130358
give it a bit more responsibility. Also implement it for MachO.
If hacked to use cfi, 32 bit MachO will produce
.cfi_personality 155, L___gxx_personality_v0$non_lazy_ptr
and 64 bit will produce
.cfi_personality ___gxx_personality_v0
The general idea is that .cfi_personality gets passed the final symbol. It is
up to codegen to produce it if using indirect representation (like 32 bit
MachO), but it is up to MC to decide which relocations to create.
llvm-svn: 130341
when X has multiple uses. This is useful for exposing secondary optimizations,
but the X86 backend isn't ready for this when X has a single use. For example,
this can disable load folding.
This is inching towards resolving PR6627.
llvm-svn: 130238
The hook will be used by the register allocator when recomputing register
classes after removing constraints.
Thumb1 code doesn't allow anything larger than tGPR, and x86 needs to ensure
that the spill size doesn't change.
llvm-svn: 130228
Fixes Thumb2 ADCS and SBCS lowering: <rdar://problem/9275821>.
t2ADCS/t2SBCS are now pseudo instructions, consistent with ARM, so the
assembly printer correctly prints the 's' suffix.
Fixes Thumb2 adde -> SBC matching to check for live/dead carry flags.
Fixes the internal ARM machine opcode mnemonic for ADCS/SBCS.
Fixes ARM SBC lowering to check for live carry (potential bug).
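For context, a hedged generic example (not from the patch): the adde/sube nodes
involved typically come from 64-bit arithmetic on a 32-bit target, where the
high words consume the carry/borrow produced by the low words.
// On Thumb2 this becomes adds/adc, which is where live vs. dead carry matters.
unsigned long long add64(unsigned long long a, unsigned long long b) {
  return a + b;
}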
llvm-svn: 130048
On x86 this allows folding a load into the cmp, greatly reducing register pressure.
movzbl (%rdi), %eax
cmpl $47, %eax
->
cmpb $47, (%rdi)
This shaves 8k off gcc.o on i386. I'll leave applying the patch in README.txt to Chris :)
llvm-svn: 130005
This tends to happen a lot with bitfield code generated by clang. A simple example for x86_64 is
uint64_t foo(uint64_t x) { return (x&1) << 42; }
which used to compile into bloated code:
shlq $42, %rdi ## encoding: [0x48,0xc1,0xe7,0x2a]
movabsq $4398046511104, %rax ## encoding: [0x48,0xb8,0x00,0x00,0x00,0x00,0x00,0x04,0x00,0x00]
andq %rdi, %rax ## encoding: [0x48,0x21,0xf8]
ret ## encoding: [0xc3]
with this patch we can fold the immediate into the and:
andq $1, %rdi ## encoding: [0x48,0x83,0xe7,0x01]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
shlq $42, %rax ## encoding: [0x48,0xc1,0xe0,0x2a]
ret ## encoding: [0xc3]
It's possible to save another byte by using 'andl' instead of 'andq' but I currently see no way of doing
that without making this code even more complicated. See the TODOs in the code.
llvm-svn: 129990
add <rd>, sp, #<imm8>
ldr <rd>, [sp, #<imm8>]
When the offset from sp is a multiple of 4 and in the range 0-1020.
This saves code size by utilizing 16-bit instructions.
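A hedged sketch of the constraint stated above (not the backend's actual
predicate): the 16-bit sp-relative forms take an 8-bit immediate scaled by 4,
hence "multiple of 4, 0-1020".
bool fitsT1SpOffset(unsigned Offset) {
  return Offset % 4 == 0 && Offset <= 1020; // imm8 * 4, so at most 255 * 4
}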
rdar://9321541
llvm-svn: 129971
This patch depends on the prior fix r129908, which changed the code to use
std::find, rather than std::binary_search, on an unordered array.
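The reason for that prior fix, as a hedged self-contained illustration:
std::binary_search requires a sorted range, so on unordered data it can miss
elements that std::find reports correctly.
#include <algorithm>
#include <cassert>
#include <vector>
int main() {
  std::vector<int> V = {3, 1, 2};                      // deliberately unsorted
  assert(std::find(V.begin(), V.end(), 1) != V.end()); // always finds 1
  // std::binary_search(V.begin(), V.end(), 1) is undefined on this unsorted
  // range and may report "not found"; switching to std::find avoids that.
  return 0;
}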
Patch by Dan Bailey
llvm-svn: 129909
On the x86-64 and thumb2 targets, some registers are more expensive to encode
than others in the same register class.
Add a CostPerUse field to the TableGen register description, and make it
available from TRI->getCostPerUse. This represents the cost of a REX prefix or a
32-bit instruction encoding required by choosing a high register.
Teach the greedy register allocator to prefer cheap registers for busy live
ranges (as indicated by spill weight).
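A hedged sketch of the heuristic (not the allocator's actual code; the
TargetRegisterInfo header path is the one from this era):
#include "llvm/Target/TargetRegisterInfo.h"
// Between two otherwise acceptable candidates, prefer the one that is cheaper
// to encode, but only when the live range is busy enough to justify it.
unsigned pickCheaperReg(const llvm::TargetRegisterInfo &TRI,
                        unsigned RegA, unsigned RegB,
                        float SpillWeight, float Threshold) {
  if (SpillWeight > Threshold &&
      TRI.getCostPerUse(RegB) < TRI.getCostPerUse(RegA))
    return RegB;
  return RegA; // otherwise keep the default allocation order
}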
llvm-svn: 129864
used by Clang. To help Clang integration, the PTX target has been split
into two targets: ptx32 and ptx64, depending on the desired pointer size.
- Add GCCBuiltin class to all intrinsics
- Split PTX target into ptx32 and ptx64
llvm-svn: 129851
llvm is built with signed chars, where an immediate such as 0xff would be sign
extended to 64 bits, turning "cmp $0xff,%eax" into
"cmp $0xffffffffffffffff,%eax".
llvm-svn: 129845
Making use of VFP / NEON floating point multiply-accumulate / subtraction is
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has a latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first few cycles (4? on Cortex-A8) can
cause an additional pipeline stall. So it's frequently better to simply codegen
vmul + vadd.
2. A vmla followed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed by a vmla is a special case. Obviously, issuing back-to-back
RAW vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoided codegen'ing fp vmla / vmls. This works well enough
but it isn't the optimal solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute a fmul and a fmla.
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Enable these fp vmlx codegen changes for Cortex-A9.
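As a source-level illustration (hedged; not a test case from the patch), the
kind of expression affected is an ordinary multiply-accumulate:
// Whether this becomes vmla.f32 or a vmul + vadd pair now depends on the
// surrounding code and the hazards described above.
float mac(float Acc, float A, float B) {
  return Acc + A * B;
}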
llvm-svn: 129775
Add an avoidWriteAfterWrite() target hook to identify register classes that
suffer from write-after-write hazards. For those register classes, try to avoid
writing the same register in two consecutive instructions.
This is currently disabled by default. We should not spill to avoid hazards!
The command line flag -avoid-waw-hazard can be used to enable waw avoidance.
llvm-svn: 129772
when they are a truncate from something else. This eliminates fully half of all the
fastisel rejections on a test c++ file I'm working with, which should make a substantial
improvement for -O0 compile of c++ code.
This fixed rdar://9297003 - fast isel bails out on all functions taking bools
llvm-svn: 129752
Before, we would bail out on i1 arguments altogether; now we just bail on
non-constant ones. Also, we used to emit extraneous code. e.g. test12 was:
movb $0, %al
movzbl %al, %edi
callq _test12
and test13 was:
movb $0, %al
xorl %edi, %edi
movb %al, 7(%rsp)
callq _test13f
Now we get:
movl $0, %edi
callq _test12
and:
movl $0, %edi
callq _test13f
llvm-svn: 129751
is, it assumes addresses are 64-bit aligned (which should be the more common
case). If the address is found not to be aligned, then getOperandLatency()
will adjust the operand latency computation by one to compensate for it.
rdar://9294833
llvm-svn: 129742
the generated FastISel. X86 doesn't need to generate code to match ADD16ri8
since ADD16ri will do just fine. This is a small codesize win in the generated
instruction selector.
llvm-svn: 129692
simplifying them and exposing more information to tblgen. It would be nice
if other target authors adopted this as well, particularly ARM since it has fastisel.
llvm-svn: 129676
kind of predicate: one that is specific to imm nodes. The predicate function
specified here just checks an int64_t directly instead of messing around with
SDNodes. The virtue of this is that it means that fastisel and other things
can reason about these predicates.
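As a hedged illustration of what such a predicate boils down to (the name below
is hypothetical), it is just a function of the constant's value:
#include <cstdint>
// e.g. "is a signed 8-bit immediate", checked straight on the int64_t value
// rather than by digging through a ConstantSDNode.
static bool isInt8Immediate(int64_t Imm) {
  return Imm >= -128 && Imm <= 127;
}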
llvm-svn: 129675
structure and fix some fixmes. We now have a TreePredicateFn class
that handles all of the decoding of these things. This is an internal
cleanup that has no impact on the code generated by tblgen.
llvm-svn: 129670
2. implement rdar://9289501 - fast isel should fold trivial multiplies to shifts
   (see the example after this list)
3. teach tblgen to handle shift immediates that are different sizes than the
shifted operands, eliminating some code from the X86 fast isel backend.
4. Have FastISel::SelectBinaryOp use (the poorly named) FastEmit_ri_ function
instead of FastEmit_ri to simplify code.
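Item 2 at the source level, as a hedged generic example:
// x * 8 and x << 3 are equivalent for unsigned x, so fast isel can emit the
// shift form directly instead of punting to SelectionDAG.
unsigned long scaleBy8(unsigned long X) {
  return X * 8;
}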
llvm-svn: 129666
when we have a global variable base and an index. Instead, just give up on
folding the global variable.
Before we'd generate:
_test: ## @test
## BB#0:
movq _rtx_length@GOTPCREL(%rip), %rax
leaq (%rax), %rax
addq %rdi, %rax
movzbl (%rax), %eax
ret
now we generate:
_test: ## @test
## BB#0:
movq _rtx_length@GOTPCREL(%rip), %rax
movzbl (%rax,%rdi), %eax
ret
The difference is even more significant when there is a scale
involved.
This fixes rdar://9289558 - total fail with addr mode formation at -O0/x86-64
llvm-svn: 129664
less trivial things) into a dummy lea. Before we generated:
_test: ## @test
movq _G@GOTPCREL(%rip), %rax
leaq (%rax), %rax
ret
now we produce:
_test: ## @test
movq _G@GOTPCREL(%rip), %rax
ret
This is part of rdar://9289558
llvm-svn: 129662
Change ELF systems to use CFI for producing the EH tables. This reduces the
size of the clang binary in Debug builds from 690MB to 679MB.
llvm-svn: 129571