I really need to find a way to automate this, but I can't come up with a regex
that has no false positives while handling tricky cases like custom check
prefixes.
llvm-svn: 162097
around. That's not how we do things. Besides, the commit message tells us that
it is covered by the GCC test suite.
------------------------------------------------------------------------
r127497 | zwarich | 2011-03-11 13:51:56 -0800 (Fri, 11 Mar 2011) | 3 lines
Fix the GCC test suite issue exposed by r127477, which was caused by stack
protector insertion not working correctly with unreachable code. Since that
revision was rolled out, this test doesn't actual fail before this fix.
------------------------------------------------------------------------
llvm-svn: 161985
- FP_EXTEND only support extending from vectors with matching elements.
This results in the scalarization of extending to v2f64 from v2f32,
which will be legalized to v4f32 not matching with v2f64.
- add X86-specific VFPEXT supproting extending from v4f32 to v2f64.
- add BUILD_VECTOR lowering helper to recover back the original
extending from v4f32 to v2f64.
- test case is enhanced to include different vector width.
llvm-svn: 161894
and allow some optimizations to turn conditional branches into unconditional.
This commit adds a simple control-flow optimization which merges two consecutive
basic blocks which are connected by a single edge. This allows the codegen to
operate on larger basic blocks.
rdar://11973998
llvm-svn: 161852
It is still possible to if-convert if the tail block has extra
predecessors, but the tail phis must be rewritten instead of being
removed.
llvm-svn: 161781
- FCMOV only supports a subset of X86 conditions. Skip boolean
simplification if X86 condition is not valid for FCMOV.
- add a minimal test case for PR13577.
llvm-svn: 161732
FeatureFastUAMem for Nehalem, Westmere and Sandy Bridge.
FeatureFastUAMem is already on if we pass in nehalem or westmere as a command
argument.
rdar: 7252306
llvm-svn: 161717
- if a boolean test (X86ISD::CMP or X86ISD:SUB) checks a boolean value
generated from X86ISD::SETCC, try to simplify the boolean value
generation and checking by reusing the original EFLAGS with proper
condition code
- add hooks to X86 specific SETCC/BRCOND/CMOV, the major 3 places
consuming EFLAGS
part of patches fixing PR12312
llvm-svn: 161687
When replacing Old with New, it can happen that New is already a
successor. Add the old and new edge weights instead of creating a
duplicate edge.
llvm-svn: 161653
This makes it possible to speed up def_iterator by stopping at the first
use. This makes def_empty() and getUniqueVRegDef() much faster when
there are many uses.
In a +Asserts build, LiveVariables is 100x faster in one case because
getVRegDef() has an assertion that would scan to the end of a
def_iterator chain.
Spill weight calculation is significantly faster (300x in one case)
because isTriviallyReMaterializable() calls MRI->isConstantPhysReg(%RIP)
which calls def_empty(%RIP).
llvm-svn: 161634
Use a more conventional doubly linked list where the Prev pointers form
a cycle. This means it is no longer necessary to adjust the Prev
pointers when reallocating the VRegInfo array.
The test changes are required because the register allocation hint is
using the use-list order to break ties.
llvm-svn: 161633
We perform the following:
1> Use SUB instead of CMP for i8,i16,i32 and i64 in ISel lowering.
2> Modify MachineCSE to correctly handle implicit defs.
3> Convert SUB back to CMP if possible at peephole.
Removed pattern matching of (a>b) ? (a-b):0 and like, since they are handled
by peephole now.
rdar://11873276
llvm-svn: 161462
Previously, MBP essentially aligned every branch target it could. This
bloats code quite a bit, especially non-looping code which has no real
reason to prefer aligned branch targets so heavily.
As Andy said in review, it's still a bit odd to do this without a real
cost model, but this at least has much more plausible heuristics.
Fixes PR13265.
llvm-svn: 161409
If the result of a common subexpression is used at all uses of the candidate
expression, CSE should not increase the live range of the common subexpression.
rdar://11393714 and rdar://11819721
llvm-svn: 161396
This patch is mostly just refactoring a bunch of copy-and-pasted code, but
it also adds a check that the call instructions are readnone or readonly.
That check was already present for sin, cos, sqrt, log2, and exp2 calls, but
it was missing for the rest of the builtins being handled in this code.
llvm-svn: 161282
I noticed that SelectionDAGBuilder::visitCall was missing a check for memcmp
in TargetLibraryInfo, so that it would use custom code for memcmp calls even
with -fno-builtin. I also had to add a new -disable-simplify-libcalls option
to llc so that I could write a test for this.
llvm-svn: 161262
Fast isel doesn't currently have support for translating builtin function
calls to target instructions. For embedded environments where the library
functions are not available, this is a matter of correctness and not
just optimization. Most of this patch is just arranging to make the
TargetLibraryInfo available in fast isel. <rdar://problem/12008746>
llvm-svn: 161232
Add more comments and use early returns to reduce nesting in isLoadFoldable.
Also disable folding for V_SET0 to avoid introducing a const pool entry and
a const pool load.
rdar://10554090 and rdar://11873276
llvm-svn: 161207
Machine CSE and other optimizations can remove instructions so folding
is possible at peephole while not possible at ISel.
This patch is a rework of r160919 and was tested on clang self-host on my local
machine.
rdar://10554090 and rdar://11873276
llvm-svn: 161152
One motivating example is to sink an instruction from a basic block which has
two successors: one outside the loop, the other inside the loop. We should try
to sink the instruction outside the loop.
rdar://11980766
llvm-svn: 161062
We branch to the successor with higher edge weight first.
Convert from
je LBB4_8 --> to outer loop
jmp LBB4_14 --> to inner loop
to
jne LBB4_14
jmp LBB4_8
PR12750
rdar: 11393714
llvm-svn: 161018
Machine CSE and other optimizations can remove instructions so folding
is possible at peephole while not possible at ISel.
rdar://10554090 and rdar://11873276
llvm-svn: 160919
It is possible that an instruction can use and update EFLAGS.
When checking the safety, we should check the usage of EFLAGS first before
declaring it is safe to optimize due to the update.
llvm-svn: 160912
These idempotent sub-register indices don't do anything --- They simply
map XMM registers to themselves. They no longer affect register classes
either since the SubRegClasses field has been removed from Target.td.
This patch replaces XMM->XMM EXTRACT_SUBREG and INSERT_SUBREG patterns
with COPY_TO_REGCLASS patterns which simply become COPY instructions.
The number of IMPLICIT_DEF instructions before register allocation is
reduced, and that is the cause of the test case changes.
llvm-svn: 160816
It is redundant; RegisterCoalescer will do the remat if it can't eliminate
the copy. Collected instruction counts before and after this. A few extra
instructions are generated due to spilling but it is normal to see these kinds
of changes with almost any small codegen change, according to Jakob.
This also fixed rdar://11830760 where xor is expected instead of movi0.
llvm-svn: 160749
struct s {
double x1;
float x2;
};
__attribute__((regparm(3))) struct s f(int a, int b, int c);
void g(void) {
f(41, 42, 43);
}
We need to be able to represent passing the address of s to f (sret) in a
register (inreg). Turns out that all that is needed is to not mark them as
mutually incompatible.
llvm-svn: 160695
are targeting an ELF platform. Only fold gs-relative (and fs-relative) loads
if it is actually sensible to do so for the target platform.
This fixes PR13438.
llvm-svn: 160687
LiveRangeEdit::foldAsLoad() can eliminate a register by folding a load
into its only use. Only do that when the load is safe to move, and it
won't extend any live ranges.
This fixes PR13414.
llvm-svn: 160575
PHIElimination splits critical edges when it predicts it can resolve
interference and eliminate copies. It doesn't split the edge if the
interference wouldn't be resolved anyway because the phi-use register is
live in the critical edge anyway.
Teach PHIElimination to split loop exiting edges with interference, even
if it wouldn't resolve the interference. This removes the necessary
copies from the loop, which is still an improvement from injecting the
copies into the loop.
The test case demonstrates the improvement. Before:
LBB0_1:
cmpb $0, (%rdx)
leaq 1(%rdx), %rdx
movl %esi, %eax
je LBB0_1
After:
LBB0_1:
cmpb $0, (%rdx)
leaq 1(%rdx), %rdx
je LBB0_1
movl %esi, %eax
llvm-svn: 160571
Updated OptimizeCompare in peephole to remove redundant cmp against zero.
We only remove Compare if CF and OF are not used.
rdar://11855129
llvm-svn: 160454
when run on an Intel Atom processor. The failures have arisen due
to changes elsewhere in the trunk over the past 8 weeks or so.
These failures were not detected by the Atom buildbot because the
CPU on the Atom buildbot was not being detected as an Atom CPU.
The fix for this problem is in Host.cpp and X86Subtarget.cpp, but
shall remain commented out until the current set of Atom test failures
are fixed.
Patch by Andy Zhang and Tyler Nowicki!
llvm-svn: 160451
large immediates. Add dag combine logic to recover in case the large
immediates doesn't fit in cmp immediate operand field.
int foo(unsigned long l) {
return (l>> 47) == 1;
}
we produce
%shr.mask = and i64 %l, -140737488355328
%cmp = icmp eq i64 %shr.mask, 140737488355328
%conv = zext i1 %cmp to i32
ret i32 %conv
which codegens to
movq $0xffff800000000000,%rax
andq %rdi,%rax
movq $0x0000800000000000,%rcx
cmpq %rcx,%rax
sete %al
movzbl %al,%eax
ret
TargetLowering::SimplifySetCC would transform
(X & -256) == 256 -> (X >> 8) == 1
if the immediate fails the isLegalICmpImmediate() test. For x86,
that's immediates which are not a signed 32-bit immediate.
Based on a patch by Eli Friedman.
PR10328
rdar://9758774
llvm-svn: 160346
uint32_t hi(uint64_t res)
{
uint_32t hi = res >> 32;
return !hi;
}
llvm IR looks like this:
define i32 @hi(i64 %res) nounwind uwtable ssp {
entry:
%lnot = icmp ult i64 %res, 4294967296
%lnot.ext = zext i1 %lnot to i32
ret i32 %lnot.ext
}
The optimizer has optimize away the right shift and truncate but the resulting
constant is too large to fit in the 32-bit immediate field. The resulting x86
code is worse as a result:
movabsq $4294967296, %rax ## imm = 0x100000000
cmpq %rax, %rdi
sbbl %eax, %eax
andl $1, %eax
This patch teaches the x86 lowering code to handle ult against a large immediate
with trailing zeros. It will issue a right shift and a truncate followed by
a comparison against a shifted immediate.
shrq $32, %rdi
testl %edi, %edi
sete %al
movzbl %al, %eax
It also handles a ugt comparison against a large immediate with trailing bits
set. i.e. X > 0x0ffffffff -> (X >> 32) >= 1
rdar://11866926
llvm-svn: 160312
In the added testcase the constant 55 was behind an AssertZext of type i1, and ComputeDemandedBits
reported that some of the bits were both known to be one and known to be zero.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160305
1. FileCheck-ize epilogue.ll and allow another asm instruction to restore %rsp.
2. Remove check in widen_arith-3.ll that was hitting instruction in epilogue instead of
vector add.
llvm-svn: 160274
undef virtual register. The problem is that ProcessImplicitDefs removes the
definition of the register and marks all uses as undef. If we lose the undef
marker then we get a register which has no def, is not marked as undef. The
live interval analysis does not collect information for these virtual
registers and we crash in later passes.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160260
It is intended to fix PR11468.
Old prologue and epilogue looked like this:
push %rbp
mov %rsp, %rbp
and $alignment, %rsp
push %r14
push %r15
...
pop %r15
pop %r14
mov %rbp, %rsp
pop %rbp
The problem was to reference the locations of callee-saved registers in exception handling:
locations of callee-saved had to be re-calculated regarding the stack alignment operation. It would
take some effort to implement this in LLVM, as currently MachineLocation can only have the form
"Register + Offset". Funciton prologue and epilogue are now changed to:
push %rbp
mov %rsp, %rbp
push %14
push %15
and $alignment, %rsp
...
lea -$size_of_saved_registers(%rbp), %rsp
pop %r15
pop %r14
pop %rbp
Reviewed by Chad Rosier.
llvm-svn: 160248
Allow the folding of vbroadcastRR to vbroadcastRM, where the memory operand is a spill slot.
PR12782.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160230
the input vector, it can be bigger (this is helpful for powerpc where <2 x i16>
is a legal vector type but i16 isn't a legal type, IIRC). However this wasn't
being taken into account by ExpandRes_EXTRACT_VECTOR_ELT, causing PR13220.
Lightly tweaked version of a patch by Michael Liao.
llvm-svn: 160116
X86. Basically, this is a reapplication of r158087 with a few fixes.
Specifically, (1) the stack pointer is restored from the base pointer before
popping callee-saved registers and (2) in obscure cases (see comments in patch)
we must cache the value of the original stack adjustment in the prologue and
apply it in the epilogue.
rdar://11496434
llvm-svn: 160002
multiple scalars and insert them into a vector. Next, we shuffle the elements
into the correct places, as before.
Also fix a small dagcombine bug in SimplifyBinOpWithSameOpcodeHands, when the
migration of bitcasts happened too late in the SelectionDAG process.
llvm-svn: 159991
getCondFromSETOpc, getCondFromCMovOpc, getSETFromCond, getCMovFromCond
No functional change intended.
If we want to update the condition code of CMOV|SET|Jcc, we first analyze the
opcode to get the condition code, then update the condition code, finally
synthesize the new opcode form the new condition code.
llvm-svn: 159955
It is safe if EFLAGS is killed or re-defined.
When we are done with the basic block, check whether EFLAGS is live-out.
Do not optimize away cmp if EFLAGS is live-out.
llvm-svn: 159888
For each Cmp, we check whether there is an earlier Sub which make Cmp
redundant. We handle the case where SUB operates on the same source operands as
Cmp, including the case where the two source operands are swapped.
llvm-svn: 159838
The CopyToReg nodes that set up the argument registers before a call
must be glued to the call instruction. Otherwise, the scheduler may emit
the physreg copies long before the call, causing long live ranges for
the fixed registers.
Besides disabling good register allocation, that can also expose
problems when EmitInstrWithCustomInserter() splits a basic block during
the live range of a physreg.
llvm-svn: 159721
Implement the TII hooks needed by EarlyIfConversion to create cmov
instructions and estimate their latency.
Early if-conversion is still not enabled by default.
llvm-svn: 159695
another mechanical change accomplished though the power of terrible Perl
scripts.
I have manually switched some "s to 's to make escaping simpler.
While I started this to fix tests that aren't run in all configurations,
the massive number of tests is due to a really frustrating fragility of
our testing infrastructure: things like 'grep -v', 'not grep', and
'expected failures' can mask broken tests all too easily.
Essentially, I'm deeply disturbed that I can change the testsuite so
radically without causing any change in results for most platforms. =/
llvm-svn: 159547
versions of Bash. In addition, I can back out the change to the lit
built-in shell test runner to support this.
This should fix the majority of fallout on Darwin, but I suspect there
will be a few straggling issues.
llvm-svn: 159544
This was done through the aid of a terrible Perl creation. I will not
paste any of the horrors here. Suffice to say, it require multiple
staged rounds of replacements, state carried between, and a few
nested-construct-parsing hacks that I'm not proud of. It happens, by
luck, to be able to deal with all the TCL-quoting patterns in evidence
in the LLVM test suite.
If anyone is maintaining large out-of-tree test trees, feel free to poke
me and I'll send you the steps I used to convert things, as well as
answer any painful questions etc. IRC works best for this type of thing
I find.
Once converted, switch the LLVM lit config to use ShTests the same as
Clang. In addition to being able to delete large amounts of Python code
from 'lit', this will also simplify the entire test suite and some of
lit's architecture.
Finally, the test suite runs 33% faster on Linux now. ;]
For my 16-hardware-thread (2x 4-core xeon e5520): 36s -> 24s
llvm-svn: 159525
Before this patch in pic 32 bit code we would add the global base register
and not load from that address. This is a really old bug, but before the
introduction of the tls attributes we would never select initial exec for
pic code.
llvm-svn: 159409
Corrected type for index of llvm.x86.avx2.gather.d.pd.256
from 256-bit to 128-bit.
Corrected types for src|dst|mask of llvm.x86.avx2.gather.q.ps.256
from 256-bit to 128-bit.
Support the following intrinsics:
llvm.x86.avx2.gather.d.q, llvm.x86.avx2.gather.q.q
llvm.x86.avx2.gather.d.q.256, llvm.x86.avx2.gather.q.q.256
llvm.x86.avx2.gather.d.d, llvm.x86.avx2.gather.q.d
llvm.x86.avx2.gather.d.d.256, llvm.x86.avx2.gather.q.d.256
llvm-svn: 159402
The primary advantage is that loop optimizations will be applied in a
stable order. This helps debugging and unit test creation. It is also
a better overall implementation without pathologically bad performance
on deep functions.
On large functions (llvm-stress --size=200000 | opt -loops)
Before: 0.1263s
After: 0.0225s
On deep functions (after tweaking llvm-stress, thanks Nadav):
Before: 0.2281s
After: 0.0227s
See r158790 for more comments.
The loop tree is now consistently generated in forward order, but loop
passes are applied in reverse order over the program. If we have a
loop optimization that prefers forward order, that can easily be
achieved by adding a different type of LoopPassManager.
llvm-svn: 159183
Implicitly defined virtual registers can simply have the <undef> bit set
on all uses, and copies can be turned into implicit defs recursively.
Physical registers are a bit trickier. We handle the common case where a
physreg def is used by a nearby instruction in the same basic block. For
more complicated cases, just leave the IMPLICIT_DEF instruction in.
llvm-svn: 159149
The function live-out registers must be live at all function returns,
and %RCX is only used by eh.return. When a function also has a normal
return, only %RAX holds a return value.
This fixes PR13188.
llvm-svn: 159116
This allows the user/front-end to specify a model that is better
than what LLVM would choose by default. For example, a variable
might be declared as
@x = thread_local(initialexec) global i32 42
if it will not be used in a shared library that is dlopen'ed.
If the specified model isn't supported by the target, or if LLVM can
make a better choice, a different model may be used.
llvm-svn: 159077
The code in X86TargetLowering::LowerEH_RETURN() assumes that a frame
pointer exists, but the frame pointer was forced by the presence of
llvm.eh.unwind.init which isn't guaranteed.
If llvm.eh.unwind.init is actually required in functions calling
eh.return (is it?), we should diagnose that instead of emitting bad
machine code.
This should fix the dragonegg-x86_64-linux-gcc-4.6-test bot.
llvm-svn: 158961
Regunit live ranges are computed on demand, so when mi-sched calls
handleMove, some regunits may not have live ranges yet.
That makes updating them easier: Just skip the non-existing ranges. They
will be computed correctly from the rescheduled machine code when they
are needed.
llvm-svn: 158831
TargetLoweringObjectFileELF. Use this to support it on X86. Unlike ARM,
on X86 it is not easy to find out if .init_array should be used or not, so
the decision is made via TargetOptions and defaults to off.
Add a command line option to llc that enables it.
llvm-svn: 158692
temporarily reverted.
This test is annoyingly overspecified, but I don't know of another way
to thoroughly test the saving and restoring of the registers. While this
will have to be adjusted even with the issue fixed in order to re-apply
r158087, those adjustments should very clearly indicate that it is still
correct (%esp getting restored prior to pops), whereas without it, this
case can easily slip under the radar.
Still, any suggestions for improvements are very welcome.
All credit to Matt Beaumont-Gay for reducing this out of an insane
Address Sanitizer crash to a reasonably small seg-faulting C program
when built with -mstackrealign. I just reduced it to IR, which was much
simpler. =]
llvm-svn: 158656
This patch causes problems when both dynamic stack realignment and
dynamic allocas combine in the same function. With this patch, we no
longer build the epilog correctly, and silently restore registers from
the wrong position in the stack.
Thanks to Matt for tracking this down, and getting at least an initial
test case to Chad. I'm going to try to check a variation of that test
case in so we can easily track the fixes required.
llvm-svn: 158654
The fast register allocator is not supposed to work in the optimizing
pipeline. It doesn't make sense to compute live intervals, run full copy
coalescing, and then run RAFast.
Fast register allocation in the optimizing pipeline is better done by
RABasic.
llvm-svn: 158242
This patch will generate the following for integer ABS:
movl %edi, %eax
negl %eax
cmovll %edi, %eax
INSTEAD OF
movl %edi, %ecx
sarl $31, %ecx
leal (%rdi,%rcx), %eax
xorl %ecx, %eax
There exists a target-independent DAG combine for integer ABS, which converts
integer ABS to sar+add+xor. For X86, we match this pattern back to neg+cmov.
This is implemented in PerformXorCombine.
rdar://10695237
llvm-svn: 158175
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 158126
The commit is intended to fix rdar://11540023.
It is implemented as part of peephole optimization. We can actually implement
this in the SelectionDAG lowering phase.
llvm-svn: 158122
This patch will optimize the following:
sub r1, r3
cmp r3, r1 or cmp r1, r3
bge L1
TO
sub r1, r3
bge L1 or ble L1
If the branch instruction can use flag from "sub", then we can eliminate
the "cmp" instruction.
llvm-svn: 157831
This implements codegen support for accesses to thread-local variables
using the local-dynamic model, and adds a clean-up pass so that the base
address for the TLS block can be re-used between local-dynamic access on
an execution path.
llvm-svn: 157818
types, as well as int<->ptr casts. This allows us to tailcall functions
with some trivial casts between the call and return (i.e. because the
return types disagree).
llvm-svn: 157798
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 157755
I disabled FMA3 autodetection, since the result may differ from expected for some benchmarks.
I added tests for GodeGen and intrinsics.
I did not change llvm.fma.f32/64 - it may be done later.
llvm-svn: 157737
It helps compile exotic inline asm. In the test case, normal GR32
virtual registers use up eax-edx so the final GR32_ABCD live range has
no registers left. Since all the live ranges were tiny, we had no way of
prioritizing the smaller register class.
This patch allows tiny unspillable live ranges to be evicted by tiny
unspillable live ranges from a smaller register class.
<rdar://problem/11542429>
llvm-svn: 157715
integer registers. This is already supported by the fastcc convention, but it doesn't
hurt to support it in the standard conventions as well.
In cases where we can cheat at the calling convention, this allows us to avoid returning
things through memory in more cases.
llvm-svn: 157698
This required light surgery on the assembler and disassembler
because the instructions use an uncommon encoding. They are
the only two instructions in x86 that use register operands
and two immediates.
llvm-svn: 157634
SimplifyCFG tends to form a lot of 2-3 case switches when merging branches. Move
the most likely condition to the front so it is checked first and the others can
be skipped. This is currently not as effective as it could be because SimplifyCFG
destroys profiling metadata when merging branches and switches. Merging branch
weight metadata is tricky though.
This code touches at most 3 cases so I didn't use a proper sorting algorithm.
llvm-svn: 157521
Now that the coalescer keeps live intervals and machine code in sync at
all times, it needs to deal with identity copies differently.
When merging two virtual registers, all identity copies are removed
right away. This means that other identity copies must come from
somewhere else, and they are going to have a value number.
Deal with such copies by merging the value numbers before erasing the
copy instruction. Otherwise, we leave dangling value numbers in the live
interval.
This fixes PR12927.
llvm-svn: 157340
may be RAUW'd by the recursive call to LegalizeOps; instead, retrieve
the other operands when calling UpdateNodeOperands. Fixes PR12889.
llvm-svn: 157162
Dead code elimination during coalescing could cause a virtual register
to be split into connected components. The following rewriting would be
confused about the already joined copies present in the code, but
without a corresponding value number in the live range.
Erase all joined copies instantly when joining intervals such that the
MI and LiveInterval representations are always in sync.
llvm-svn: 157135
The late dead code elimination is no longer necessary.
The test changes are cause by a register hint that can be either %rdi or
%rax. The choice depends on the use list order, which this patch changes.
llvm-svn: 157131
non-profitable commute using outdated info. The test case would still fail
because of poor pre-RA schedule. That will be fixed by MI scheduler.
rdar://11472010
llvm-svn: 157038
This is the same as the other tests: Clever tricks are required to make
the arguments and return value line up in a single-instruction function.
It rarely happens in real life.
We have plenty other examples of this behavior.
llvm-svn: 157030
This option has been disabled for a while, and it is going away so I can
clean up the coalescer code.
The tests that required physreg joining to be enabled were almost all of
the form "tiny function with interference between arguments and return
value". Such functions are usually inlined in the real world.
The problem exposed by phys_subreg_coalesce-3.ll is real, but fairly
rare.
llvm-svn: 157027
This patch will optimize -(x != 0) on X86
FROM
cmpl $0x01,%edi
sbbl %eax,%eax
notl %eax
TO
negl %edi
sbbl %eax %eax
In order to generate negl, I added patterns in Target/X86/X86InstrCompiler.td:
def : Pat<(X86sub_flag 0, GR32:$src), (NEG32r GR32:$src)>;
rdar: 10961709
llvm-svn: 156312
The primitive conservative heuristic seems to give a slight overall
improvement while not regressing stuff. Make it available to wider
testing. If you notice any speed regressions (or significant code
size regressions) let me know!
llvm-svn: 156258
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic, it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-Order architectures are
unlikely to benefit as well, those are included in the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case to him and Evan Cheng for providing
me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
to catch cases like:
%reg1024<def> = MOV r1
%reg1025<def> = MOV r0
%reg1026<def> = ADD %reg1024, %reg1025
r0 = MOV %reg1026
By commuting ADD, it let coalescer eliminate all of the copies. However, there
was a bug in the heuristics where it ended up commuting the ADD in:
%reg1024<def> = MOV r0
%reg1025<def> = MOV 0
%reg1026<def> = ADD %reg1024, %reg1025
r0 = MOV %reg1026
That did no benefit but rather ensure the last MOV would not be coalesced.
rdar://11355268
llvm-svn: 156048
This patch will optimize the following cases on X86
(a > b) ? (a-b) : 0
(a >= b) ? (a-b) : 0
(b < a) ? (a-b) : 0
(b <= a) ? (a-b) : 0
FROM
movl %edi, %ecx
subl %esi, %ecx
cmpl %edi, %esi
movl $0, %eax
cmovll %ecx, %eax
TO
xorl %eax, %eax
subl %esi, %edi
cmovll %eax, %edi
movl %edi, %eax
rdar: 10734411
llvm-svn: 155919
On x86-32, structure return via sret lets the callee pop the hidden
pointer argument off the stack, which the caller then re-pushes.
However if the calling convention is fastcc, then a register is used
instead, and the caller should not adjust the stack. This is
implemented with a check of IsTailCallConvention
X86TargetLowering::LowerCall but is now checked properly in
X86FastISel::DoSelectCall.
(this time, actually commit what was reviewed!)
llvm-svn: 155825
This time, also fix the caller of AddGlue to properly handle
incomplete chains. AddGlue had failure modes, but shamefully hid them
from its caller. It's luck ran out.
Fixes rdar://11314175: BuildSchedUnits assert.
llvm-svn: 155749
On x86-32, structure return via sret lets the callee pop the hidden
pointer argument off the stack, which the caller then re-pushes.
However if the calling convention is fastcc, then a register is used
instead, and the caller should not adjust the stack. This is
implemented with a check of IsTailCallConvention
X86TargetLowering::LowerCall but is now checked properly in
X86FastISel::DoSelectCall.
llvm-svn: 155745
x == -y --> x+y == 0
x != -y --> x+y != 0
On x86, the generated code goes from
negl %esi
cmpl %esi, %edi
je .LBB0_2
to
addl %esi, %edi
je .L4
This case is correctly handled for ARM with "cmn".
Patch by Manman Ren.
rdar://11245199
PR12545
llvm-svn: 155739
* Model FPSW (the FPU status word) as a register.
* Add ISel patterns for the FUCOM*, FNSTSW and SAHF instructions.
* During Legalize/Lowering, build a node sequence to transfer the comparison
result from FPSW into EFLAGS. If you're wondering about the right-shift: That's
an implicit sub-register extraction (%ax -> %ah) which is handled later on by
the instruction selector.
Fixes PR6679. Patch by Christoph Erhardt!
llvm-svn: 155704
DAGCombine strangeness may result in multiple loads from the same
offset. They both may try to glue themselves to another load. We could
insist that the redundant loads glue themselves to each other, but the
beter fix is to bail out from bad gluing at the time we detect it.
Fixes rdar://11314175: BuildSchedUnits assert.
llvm-svn: 155668
using the pattern (vbroadcast (i32load src)). In some cases, after we generate
this pattern new users are added to the load node, which prevent the selection
of the blend pattern. This commit provides fallback patterns which perform
in-vector broadcast (using in-vector vbroadcast in AVX2 and pshufd on AVX1).
llvm-svn: 155437
on X86 Atom. Some of our tests failed because the tail merging part of
the BranchFolding pass was creating new basic blocks which did not
contain live-in information. When the anti-dependency code in the Post-RA
scheduler ran, it would sometimes rename the register containing
the function return value because the fact that the return value was
live-in to the subsequent block had been lost. To fix this, it is necessary
to run the RegisterScavenging code in the BranchFolding pass.
This patch makes sure that the register scavenging code is invoked
in the X86 subtarget only when post-RA scheduling is being done.
Post RA scheduling in the X86 subtarget is only done for Atom.
This patch adds a new function to the TargetRegisterClass to control
whether or not live-ins should be preserved during branch folding.
This is necessary in order for the anti-dependency optimizations done
during the PostRASchedulerList pass to work properly when doing
Post-RA scheduling for the X86 in general and for the Intel Atom in particular.
The patch adds and invokes the new function trackLivenessAfterRegAlloc()
instead of using the existing requiresRegisterScavenging().
It changes BranchFolding.cpp to call trackLivenessAfterRegAlloc() instead of
requiresRegisterScavenging(). It changes the all the targets that
implemented requiresRegisterScavenging() to also implement
trackLivenessAfterRegAlloc().
It adds an assertion in the Post RA scheduler to make sure that post RA
liveness information is available when it is needed.
It changes the X86 break-anti-dependencies test to use –mcpu=atom, in order
to avoid running into the added assertion.
Finally, this patch restores the use of anti-dependency checking
(which was turned off temporarily for the 3.1 release) for
Intel Atom in the Post RA scheduler.
Patch by Andy Zhang!
Thanks to Jakob and Anton for their reviews.
llvm-svn: 155395
The X86 target is editing the selection DAG while isel is selecting
nodes following a topological ordering. When the DAG hacking triggers
CSE, nodes can be deleted and bad things happen.
llvm-svn: 155257
when the set bits aren't the same for both args of the xor.
This transformation is in the function TargetLowering::SimplifyDemandedBits
in the file lib/CodeGen/SelectionDAG/TargetLowering.cpp.
I have tested this test using a previous version of llc which the defect and
the a version of llc which does not. I got the expected fail and pass,
respectively.
This test goes with rdar://11195364 and the check in with the fix: svn r154955
llvm-svn: 155156
also fix SimplifyLibCalls to use TLI rather than compile-time conditionals to enable optimizations on floor, ceil, round, rint, and nearbyint
llvm-svn: 154960
both fallthrough and a conditional branch target the same successor.
Gracefully delete the conditional branch and introduce any unconditional
branch needed to reach the actual successor. This fixes memory
corruption in 2009-06-15-RegScavengerAssert.ll and possibly other tests.
Also, while I'm here fix a latent bug I spotted by inspection. I never
applied the same fundamental fix to this fallthrough successor finding
logic that I did to the logic used when there are no conditional
branches. As a consequence it would have selected landing pads had they
be aligned in just the right way here. I don't have a test case as
I spotted this by inspection, and the previous time I found this
required have of TableGen's source code to produce it. =/ I hate backend
bugs. ;]
Thanks to Jim Grosbach for helping me reason through this and reviewing
the fix.
llvm-svn: 154867
This is mostly to test the waters. I'd like to get results from FNT
build bots and other bots running on non-x86 platforms.
This feature has been pretty heavily tested over the last few months by
me, and it fixes several of the execution time regressions caused by the
inlining work by preventing inlining decisions from radically impacting
block layout.
I've seen very large improvements in yacr2 and ackermann benchmarks,
along with the expected noise across all of the benchmark suite whenever
code layout changes. I've analyzed all of the regressions and fixed
them, or found them to be impossible to fix. See my email to llvmdev for
more details.
I'd like for this to be in 3.1 as it complements the inliner changes,
but if any failures are showing up or anyone has concerns, it is just
a flag flip and so can be easily turned off.
I'm switching it on tonight to try and get at least one run through
various folks' performance suites in case SPEC or something else has
serious issues with it. I'll watch bots and revert if anything shows up.
llvm-svn: 154816
once we start changing the block layout, so just nuke it. If anyone has
ideas about how to craft a code layout agnostic form of the test please
let me know.
llvm-svn: 154815
rotation. When there is a loop backedge which is an unconditional
branch, we will end up with a branch somewhere no matter what. Try
placing this backedge in a fallthrough position above the loop header as
that will definitely remove at least one branch from the loop iteration,
where whole loop rotation may not.
I haven't seen any benchmarks where this is important but loop-blocks.ll
tests for it, and so this will be covered when I flip the default.
llvm-svn: 154812
laid out in a form with a fallthrough into the header and a fallthrough
out of the bottom. In that case, leave the loop alone because any
rotation will introduce unnecessary branches. If either side looks like
it will require an explicit branch, then the rotation won't add any, do
it to ensure the branch occurs outside of the loop (if possible) and
maximize the benefit of the fallthrough in the bottom.
llvm-svn: 154806
This is a complex change that resulted from a great deal of
experimentation with several different benchmarks. The one which proved
the most useful is included as a test case, but I don't know that it
captures all of the relevant changes, as I didn't have specific
regression tests for each, they were more the result of reasoning about
what the old algorithm would possibly do wrong. I'm also failing at the
moment to craft more targeted regression tests for these changes, if
anyone has ideas, it would be welcome.
The first big thing broken with the old algorithm is the idea that we
can take a basic block which has a loop-exiting successor and a looping
successor and use the looping successor as the layout top in order to
get that particular block to be the bottom of the loop after layout.
This happens to work in many cases, but not in all.
The second big thing broken was that we didn't try to select the exit
which fell into the nearest enclosing loop (to which we exit at all). As
a consequence, even if the rotation worked perfectly, it would result in
one of two bad layouts. Either the bottom of the loop would get
fallthrough, skipping across a nearer enclosing loop and thereby making
it discontiguous, or it would be forced to take an explicit jump over
the nearest enclosing loop to earch its successor. The point of the
rotation is to get fallthrough, so we need it to fallthrough to the
nearest loop it can.
The fix to the first issue is to actually layout the loop from the loop
header, and then rotate the loop such that the correct exiting edge can
be a fallthrough edge. This is actually much easier than I anticipated
because we can handle all the hard parts of finding a viable rotation
before we do the layout. We just store that, and then rotate after
layout is finished. No inner loops get split across the post-rotation
backedge because we check for them when selecting the rotation.
That fix exposed a latent problem with our exitting block selection --
we should allow the backedge to point into the middle of some inner-loop
chain as there is no real penalty to it, the whole point is that it
*won't* be a fallthrough edge. This may have blocked the rotation at all
in some cases, I have no idea and no test case as I've never seen it in
practice, it was just noticed by inspection.
Finally, all of these fixes, and studying the loops they produce,
highlighted another problem: in rotating loops like this, we sometimes
fail to align the destination of these backwards jumping edges. Fix this
by actually walking the backwards edges rather than relying on loopinfo.
This fixes regressions on heapsort if block placement is enabled as well
as lots of other cases where the previous logic would introduce an
abundance of unnecessary branches into the execution.
llvm-svn: 154783
Original message:
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendV uses a register for the selection while Vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154483
blendv uses a register for the selection while vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154396