Apparently we never added code to expand these pseudo instructions, and in
over a year, no one has noticed. Our register allocator must be awesome!
llvm-svn: 137551
Coalescing can remove copy-like instructions with sub-register operands
that constrained the register class. Examples are:
x86: GR32_ABCD:sub_8bit_hi -> GR32
arm: DPR_VFP2:ssub0 -> DPR
Recompute the register class of any virtual registers that are used by
less instructions after coalescing.
This affects code generation for the Cortex-A8 where we use NEON
instructions for f32 operations, c.f. fp_convert.ll:
vadd.f32 d16, d1, d0
vcvt.s32.f32 d0, d16
The register allocator is now free to use d16 for the temporary, and
that comes first in the allocation order because it doesn't interfere
with any s-registers.
llvm-svn: 137133
This hidden llc option runs the machine code verifier after expanding
ARM pseudo-instructions, but before if-conversion.
The machine code verifier is much better at pointing out liveness errors
that can trip up the register scavenger.
llvm-svn: 136439
Code like that would only be produced by bugpoint, but we should still
handle it correctly.
When a register is defined by a REG_SEQUENCE of undefs, the register
itself is undef. Previously, we would create a register with uses but no
defs.
Fixes part of PR10520.
llvm-svn: 136401
When splitting a live range immediately before an LDR_POST instruction
that redefines the address register, make sure to use the correct value
number in leaveIntvBefore.
We need the value number entering the instruction.
<rdar://problem/9793765>
llvm-svn: 135413
if (x != 0) x = 1
if (x == 1) x = 1
Previous codegen looks like this:
mov r1, r0
cmp r1, #1
mov r0, #0
moveq r0, #1
The naive lowering select between two different values. It should recognize the
test is equality test so it's more a conditional move rather than a select:
cmp r0, #1
movne r0, #0
rdar://9758317
llvm-svn: 135017
Print shifted immediate values directly rather than as a payload+shifter
value pair. This makes for more readable output assembly code, simplifies
the instruction printer, and is consistent with how Thumb immediates are
displayed.
llvm-svn: 134902
RAGreedy::tryAssign will now evict interference from the preferred
register even when another register is free.
To support this, add the EvictionCost struct that counts how many hints
are broken by an eviction. We don't want to break one hint just to
satisfy another.
Rename canEvict to shouldEvict, and add the first bit of eviction policy
that doesn't depend on spill weights: Always make room in the preferred
register as long as the evictees can be split and aren't already
assigned to their preferred register.
Also make the CSR avoidance more accurate. When looking for a cheaper
register it is OK to use a new volatile register. Only CSR aliases that
have never been used before should be avoided.
llvm-svn: 134735
already makes the assumption, which is correct on ARM, that a type's alignment is
less than its alloc size. This improves codegen with Clang (which inserts a lot of
extraneous alignment specifiers) and fixes <rdar://problem/9695089>.
llvm-svn: 134106
instructions can be used to match combinations of multiply/divide and VCVT
(between floating-point and integer, Advanced SIMD). Basically the VCVT
immediate operand that specifies the number of fraction bits corresponds to a
floating-point multiply or divide by the corresponding power of 2.
For example, VCVT (floating-point to fixed-point, Advanced SIMD) can replace a
combination of VMUL and VCVT (floating-point to integer) as follows:
Example (assume d17 = <float 8.000000e+00, float 8.000000e+00>):
vmul.f32 d16, d17, d16
vcvt.s32.f32 d16, d16
becomes:
vcvt.s32.f32 d16, d16, #3
Similarly, VCVT (fixed-point to floating-point, Advanced SIMD) can replace a
combinations of VCVT (integer to floating-point) and VDIV as follows:
Example (assume d17 = <float 8.000000e+00, float 8.000000e+00>):
vcvt.f32.s32 d16, d16
vdiv.f32 d16, d17, d16
becomes:
vcvt.f32.s32 d16, d16, #3
llvm-svn: 133813
1. (((x) & 0xFF00) >> 8) | (((x) & 0x00FF) << 8)
=> (bswap x) >> 16
2. ((x&0xff)<<8)|((x&0xff00)>>8)|((x&0xff000000)>>8)|((x&0x00ff0000)<<8))
=> (rotl (bswap x) 16)
This allows us to eliminate most of the def : Pat patterns for ARM rev16
revsh instructions. It catches many more cases for ARM and x86.
rdar://9609108
llvm-svn: 133503
for pre-2.9 bitcode files. We keep x86 unaligned loads, movnt, crc32, and the
target indep prefetch change.
As usual, updating the testsuite is a PITA.
llvm-svn: 133337
accumulator forwarding. Specifically (from SVN log entry):
Distribute (A + B) * C to (A * C) + (B * C) to make use of NEON multiplier
accumulator forwarding:
vadd d3, d0, d1
vmul d3, d3, d2
=>
vmul d3, d0, d2
vmla d3, d1, d2
Make sure it catches cases where operand 1 is add/fadd/sub/fsub, which was
intended in the original revision.
llvm-svn: 133127
the bits being cleared by the AND are not demanded by the BFI.
The previous BFI dag combine rule was actually incorrect (or used to be
correct until BFI representation changed).
rdar://9609030
llvm-svn: 133034
In particular, don't spill dirty registers only to satisfy a hint. It is
not worth it.
The attached test case provides an example where the fast allocator
would spill a register when other registers are available.
llvm-svn: 132900
causing an assertion failure downstream. This fixes <rdar://problem/9562908>.
This really seems like it should always be set at CCState creation time, so mistakes like
this can never happen. I'll take a look at doing that.
llvm-svn: 132811
Instead, use simpler approach and let DBG_VALUE follow its predecessor instruction. After live debug value analysis pass, all DBG_VALUE instruction are placed at the right place. Thanks Jakob for the hint!
llvm-svn: 132483
This is important for the correct lowering of unwind instructions
(which doesn't matter at all) and llvm.eh.resume calls (which does).
Take 2, now with more basic competence.
llvm-svn: 132295
This is important for the correct lowering of unwind instructions
(which doesn't matter at all) and llvm.eh.resume calls (which does).
llvm-svn: 132291
to load/store i64 values. Since there's no current support to explicitly
declare such restrictions, implement it by using specific hardcoded register
pairs during isel.
llvm-svn: 132248
register allocation dependent and will occasionally break. WIP in the
register allocator to model paired/etc registers.
rdar://9119939
llvm-svn: 132242
The practical effects here are that x86-64 fast-isel can now handle trunc from i8 to i1, and ARM fast-isel can handle many more constructs involving integers narrower than 32 bits (including loads, stores, and many integer casts).
rdar://9437928 .
llvm-svn: 132099
When instructions are deleted, they leave tombstone SlotIndex entries.
The isZeroLength method should ignore these null indexes.
This causes RABasic to sometimes spill a callee-saved register in the
abi-isel.ll test, so don't run that test with -regalloc=basic. Prioritizing
register allocation according to spill weight can cause more registers to be
used.
llvm-svn: 131436
If there is a store after the load node, then there is a chain, which means
that there is another user. Thus, asking hasOneUser would fail. Instead we
ask hasNUsesOfValue on the 'data' value.
llvm-svn: 131183
landing pad as its successor.
SjLj exception handling jumps to the correct landing pad via a switch statement
that's generated right before code-gen. Loosen the constraint in the machine
instruction verifier to allow for this. Note, this isn't the most rigorous check
since we cannot determine where that switch statement came from. But it's
marginally better than turning this check off when SjLj exceptions are used.
<rdar://problem/9187612>
llvm-svn: 130881
model constants which can be added to base registers via add-immediate
instructions which don't require an additional register to materialize
the immediate.
llvm-svn: 130743
successors) and use inverse depth first search to traverse the BBs. However
that doesn't work when the CFG has infinite loops. Simply do a linear
traversal of all BBs work just fine.
rdar://9344645
llvm-svn: 130324
more callee-saved registers and introduce copies. Only allows it if scheduling
a node above calls would end up lessen register pressure.
Call operands also has added ABI restrictions for register allocation, so be
extra careful with hoisting them above calls.
rdar://9329627
llvm-svn: 130245
Fixes Thumb2 ADCS and SBCS lowering: <rdar://problem/9275821>.
t2ADCS/t2SBCS are now pseudo instructions, consistent with ARM, so the
assembly printer correctly prints the 's' suffix.
Fixes Thumb2 adde -> SBC matching to check for live/dead carry flags.
Fixes the internal ARM machine opcode mnemonic for ADCS/SBCS.
Fixes ARM SBC lowering to check for live carry (potential bug).
llvm-svn: 130048
manually and pass all (now) 4 arguments to the mul libcall. Add a new
ExpandLibCall for just this (copied gratuitously from type legalization).
Fixes rdar://9292577
llvm-svn: 129842
- There is a minor semantic change here (evidenced by the test change) for
Darwin triples that have no version component. I debated changing the default
behavior of isOSVersionLT, but decided it made more sense for triples to be
explicit.
llvm-svn: 129802
Making use of VFP / NEON floating point multiply-accumulate / subtraction is
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first (4? on Cortex-a8) can cause
additional pipeline stall. So it's frequently better to single codegen
vmul + vadd.
2. A vmla folowed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed vmla is a special case. Obvious issuing back to back RAW
vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoid codegen'ing fp vmla / vmls. This works well enough
but it isn't the optimial solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute a fmul and a fmla.
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Enable these fp vmlx codegen changes for Cortex-A9.
llvm-svn: 129775
Add a avoidWriteAfterWrite() target hook to identify register classes that
suffer from write-after-write hazards. For those register classes, try to avoid
writing the same register in two consecutive instructions.
This is currently disabled by default. We should not spill to avoid hazards!
The command line flag -avoid-waw-hazard can be used to enable waw avoidance.
llvm-svn: 129772
registers for fast allocation a different way. This has us updating
used registers only when we're using that exact register.
Fixes rdar://9207598
llvm-svn: 129711
the max itself, so it is not easy to write a test case for this, but I added a
test case that would fail if the code in AsmPrinter were removed.
llvm-svn: 129432
Additional fixes:
Do something reasonable for subtargets with generic
itineraries by handle node latency the same as for an empty
itinerary. Now nodes default to unit latency unless an itinerary
explicitly specifies a zero cycle stage or it is a TokenFactor chain.
Original fixes:
UnitsSharePred was a source of randomness in the scheduler: node
priority depended on the queue data structure. I rewrote the recent
VRegCycle heuristics to completely replace the old heuristic without
any randomness. To make the ndoe latency adjustments work, I also
needed to do something a little more reasonable with TokenFactor. I
gave it zero latency to its consumers and always schedule it as low as
possible.
llvm-svn: 129421