- Rewrite/merge pseudo-atomic instruction emitters to address the
following issue:
* Reduce one unnecessary load in spin-loop
previously the spin-loop looks like
thisMBB:
newMBB:
ld t1 = [bitinstr.addr]
op t2 = t1, [bitinstr.val]
not t3 = t2 (if Invert)
mov EAX = t1
lcs dest = [bitinstr.addr], t3 [EAX is implicit]
bz newMBB
fallthrough -->nextMBB
the 'ld' at the beginning of newMBB should be lift out of the loop
as lcs (or CMPXCHG on x86) will load the current memory value into
EAX. This loop is refined as:
thisMBB:
EAX = LOAD [MI.addr]
mainMBB:
t1 = OP [MI.val], EAX
LCMPXCHG [MI.addr], t1, [EAX is implicitly used & defined]
JNE mainMBB
sinkMBB:
* Remove immopc as, so far, all pseudo-atomic instructions has
all-register form only, there is no immedidate operand.
* Remove unnecessary attributes/modifiers in pseudo-atomic instruction
td
* Fix issues in PR13458
- Add comprehensive tests on atomic ops on various data types.
NOTE: Some of them are turned off due to missing functionality.
- Revise tests due to the new spin-loop generated.
llvm-svn: 164281
- Merge the processing of LOAD_ADD with other atomic load-arith
operations
- Separate the logic getting target constant for atomic-load-op and add
an optimization for atomic-load-add on i16 with negative value
- Optimize a minor case for atomic-fetch-add i16 with negative operand. Test
case is revised.
llvm-svn: 164243
It had patterns for zext-loading and extending. This commit adds patterns for loading a wide type, performing a bitcast,
and extending. This is an odd pattern, but it is commonly used when writing code with intrinsics.
rdar://11897677
llvm-svn: 163995
- Enhance the fix to PR12312 to support wider integer, such as 256-bit
integer. If more than 1 fully evaluated vectors are found, POR them
first followed by the final PTEST.
llvm-svn: 163832
- Find a legal vector type before casting and extracting element from it.
- As the new vector type may have more than 2 elements, build the final
hi/lo pair by BFS pairing them from bottom to top.
llvm-svn: 163830
Add a PatFrag to match X86tcret using 6 fixed registers or less. This
avoids folding loads into TCRETURNmi64 using 7 or more volatile
registers.
<rdar://problem/12282281>
llvm-svn: 163819
by xoring the high-bit. This fails if the source operand is a vector because we need to negate
each of the elements in the vector.
Fix rdar://12281066 PR13813.
llvm-svn: 163802
are within the lifetime zone. Sometime legitimate usages of allocas are
hoisted outside of the lifetime zone. For example, GEPS may calculate the
address of a member of an allocated struct. This commit makes sure that
we only check (abort regions or assert) for instructions that read and write
memory using stack frames directly. Notice that by allowing legitimate
usages outside the lifetime zone we also stop checking for instructions
which use derivatives of allocas. We will catch less bugs in user code
and in the compiler itself.
llvm-svn: 163791
We don't have enough GR64_TC registers when calling a varargs function
with 6 arguments. Since %al holds the number of vector registers used,
only %r11 is available as a scratch register.
This means that addressing modes using both base and index registers
can't be folded into TCRETURNmi64.
<rdar://problem/12282281>
llvm-svn: 163761
- BlockAddress has no support of BA + offset form and there is no way to
propagate that offset into machine operand;
- Add BA + offset support and a new interface 'getTargetBlockAddress' to
simplify target block address forming;
- All targets are modified to use new interface and X86 backend is enhanced to
support BA + offset addressing.
llvm-svn: 163743
The input program may contain intructions which are not inside lifetime
markers. This can happen due to a bug in the compiler or due to a bug in
user code (for example, returning a reference to a local variable).
This commit adds checks that all of the instructions in the function and
invalidates lifetime ranges which do not contain all of the instructions.
llvm-svn: 163678
- If a boolean value is generated from CMOV and tested as boolean value,
simplify the use of test result by referencing the original condition.
RDRAND intrinisc is one of such cases.
llvm-svn: 163516
The RegisterCoalescer understands overlapping live ranges where one
register is defined as a copy of the other. With this change, register
allocators using LiveRegMatrix can do the same, at least for copies
between physical and virtual registers.
When a physreg is defined by a copy from a virtreg, allow those live
ranges to overlap:
%CL<def> = COPY %vreg11:sub_8bit; GR32_ABCD:%vreg11
%vreg13<def,tied1> = SAR32rCL %vreg13<tied0>, %CL<imp-use,kill>
We can assign %vreg11 to %ECX, overlapping the live range of %CL.
llvm-svn: 163336
- CodeGenPrepare pass for identifying div/rem ops
- Backend specifies the type mapping using addBypassSlowDivType
- Enabled only for Intel Atom with O2 32-bit -> 8-bit
- Replace IDIV with instructions which test its value and use DIVB if the value
is positive and less than 256.
- In the case when the quotient and remainder of a divide are used a DIV
and a REM instruction will be present in the IR. In the non-Atom case
they are both lowered to IDIVs and CSE removes the redundant IDIV instruction,
using the quotient and remainder from the first IDIV. However,
due to this optimization CSE is not able to eliminate redundant
IDIV instructions because they are located in different basic blocks.
This is overcome by calculating both the quotient (DIV) and remainder (REM)
in each basic block that is inserted by the optimization and reusing the result
values when a subsequent DIV or REM instruction uses the same operands.
- Test cases check for the presents of the optimization when calculating
either the quotient, remainder, or both.
Patch by Tyler Nowicki!
llvm-svn: 163150
This reverts commit 5dd9e214fb92847e947f9edab170f9b4e52b908f.
Thanks to Duncan for explaining how this should have been done.
Conflicts:
test/CodeGen/X86/vec_select.ll
llvm-svn: 163064
output chain is correctly setup.
As an example, if the original load must happen before later stores, we need
to make sure the constructed VZEXT_LOAD is constrained to be before the stores.
rdar://11457792
llvm-svn: 163036
- In addition to undefined, if V2 is zero vector, skip 2nd PSHUFB and POR as
well as PSHUFB will zero elements with negative indices.
Patch by Sriram Murali <sriram.murali@intel.com>
llvm-svn: 163018
- Add 'UseSSEx' to force SSE legacy insn not being selected when AVX is
enabled.
As the penalty of inter-mixing SSE and AVX instructions, we need
prevent SSE legacy insn from being generated except explicitly
specified through some intrinsics. For patterns supported by both
SSE and AVX, so far, we force AVX insn will be tried first relying on
AddedComplexity or position in td file. It's error-prone and
introduces bugs accidentally.
'UseSSEx' is disabled when AVX is turned on. For SSE insns inherited
by AVX, we need this predicate to force VEX encoding or SSE legacy
encoding only.
For insns not inherited by AVX, we still use the previous predicates,
i.e. 'HasSSEx'. So far, these insns fall into the following
categories:
* SSE insns with MMX operands
* SSE insns with GPR/MEM operands only (xFENCE, PREFETCH, CLFLUSH,
CRC, and etc.)
* SSE4A insns.
* MMX insns.
* x87 insns added by SSE.
2 test cases are modified:
- test/CodeGen/X86/fast-isel-x86-64.ll
AVX code generation is different from SSE one. 'vcvtsi2sdq' cannot be
selected by fast-isel due to complicated pattern and fast-isel
fallback to materialize it from constant pool.
- test/CodeGen/X86/widen_load-1.ll
AVX code generation is different from SSE one after fixing SSE/AVX
inter-mixing. Exec-domain fixing prefers 'vmovapd' instead of
'vmovaps'.
llvm-svn: 162919
- The root cause is that target constant materialization in X86 fast-isel
creates a PC-rel addressing which may overflow 32-bit range in non-Small code
model if .rodata section is allocated too far away from code segment in
MCJIT, which uses Large code model so far.
- Follow the similar logic to fix non-Small code model in fast-isel by skipping
non-Small code model.
llvm-svn: 162881
- Add a target-specific DAG optimization to recognize a pattern PTEST-able.
Such a pattern is a OR'd tree with X86ISD::OR as the root node. When
X86ISD::OR node has only its flag result being used as a boolean value and
all its leaves are extracted from the same vector, it could be folded into an
X86ISD::PTEST node.
llvm-svn: 162735
this allows for better code generation.
Added a new DAGCombine transformation to convert FMAX and FMIN to FMANC and
FMINC, which are commutative.
For example:
movaps %xmm0, %xmm1
movsd LC(%rip), %xmm0
minsd %xmm1, %xmm0
becomes:
minsd LC(%rip), %xmm0
llvm-svn: 162187
arithmetic instructions. However, when small data types are used, a truncate
node appears between the SETCC node and the arithmetic operation. This patch
adds support for this pattern.
Before:
xorl %esi, %edi
testb %dil, %dil
setne %al
ret
After:
xorb %dil, %sil
setne %al
ret
rdar://12081007
llvm-svn: 162160
I really need to find a way to automate this, but I can't come up with a regex
that has no false positives while handling tricky cases like custom check
prefixes.
llvm-svn: 162097
around. That's not how we do things. Besides, the commit message tells us that
it is covered by the GCC test suite.
------------------------------------------------------------------------
r127497 | zwarich | 2011-03-11 13:51:56 -0800 (Fri, 11 Mar 2011) | 3 lines
Fix the GCC test suite issue exposed by r127477, which was caused by stack
protector insertion not working correctly with unreachable code. Since that
revision was rolled out, this test doesn't actual fail before this fix.
------------------------------------------------------------------------
llvm-svn: 161985
- FP_EXTEND only support extending from vectors with matching elements.
This results in the scalarization of extending to v2f64 from v2f32,
which will be legalized to v4f32 not matching with v2f64.
- add X86-specific VFPEXT supproting extending from v4f32 to v2f64.
- add BUILD_VECTOR lowering helper to recover back the original
extending from v4f32 to v2f64.
- test case is enhanced to include different vector width.
llvm-svn: 161894
and allow some optimizations to turn conditional branches into unconditional.
This commit adds a simple control-flow optimization which merges two consecutive
basic blocks which are connected by a single edge. This allows the codegen to
operate on larger basic blocks.
rdar://11973998
llvm-svn: 161852
It is still possible to if-convert if the tail block has extra
predecessors, but the tail phis must be rewritten instead of being
removed.
llvm-svn: 161781
- FCMOV only supports a subset of X86 conditions. Skip boolean
simplification if X86 condition is not valid for FCMOV.
- add a minimal test case for PR13577.
llvm-svn: 161732
FeatureFastUAMem for Nehalem, Westmere and Sandy Bridge.
FeatureFastUAMem is already on if we pass in nehalem or westmere as a command
argument.
rdar: 7252306
llvm-svn: 161717
- if a boolean test (X86ISD::CMP or X86ISD:SUB) checks a boolean value
generated from X86ISD::SETCC, try to simplify the boolean value
generation and checking by reusing the original EFLAGS with proper
condition code
- add hooks to X86 specific SETCC/BRCOND/CMOV, the major 3 places
consuming EFLAGS
part of patches fixing PR12312
llvm-svn: 161687
When replacing Old with New, it can happen that New is already a
successor. Add the old and new edge weights instead of creating a
duplicate edge.
llvm-svn: 161653
This makes it possible to speed up def_iterator by stopping at the first
use. This makes def_empty() and getUniqueVRegDef() much faster when
there are many uses.
In a +Asserts build, LiveVariables is 100x faster in one case because
getVRegDef() has an assertion that would scan to the end of a
def_iterator chain.
Spill weight calculation is significantly faster (300x in one case)
because isTriviallyReMaterializable() calls MRI->isConstantPhysReg(%RIP)
which calls def_empty(%RIP).
llvm-svn: 161634
Use a more conventional doubly linked list where the Prev pointers form
a cycle. This means it is no longer necessary to adjust the Prev
pointers when reallocating the VRegInfo array.
The test changes are required because the register allocation hint is
using the use-list order to break ties.
llvm-svn: 161633
We perform the following:
1> Use SUB instead of CMP for i8,i16,i32 and i64 in ISel lowering.
2> Modify MachineCSE to correctly handle implicit defs.
3> Convert SUB back to CMP if possible at peephole.
Removed pattern matching of (a>b) ? (a-b):0 and like, since they are handled
by peephole now.
rdar://11873276
llvm-svn: 161462
Previously, MBP essentially aligned every branch target it could. This
bloats code quite a bit, especially non-looping code which has no real
reason to prefer aligned branch targets so heavily.
As Andy said in review, it's still a bit odd to do this without a real
cost model, but this at least has much more plausible heuristics.
Fixes PR13265.
llvm-svn: 161409
If the result of a common subexpression is used at all uses of the candidate
expression, CSE should not increase the live range of the common subexpression.
rdar://11393714 and rdar://11819721
llvm-svn: 161396
This patch is mostly just refactoring a bunch of copy-and-pasted code, but
it also adds a check that the call instructions are readnone or readonly.
That check was already present for sin, cos, sqrt, log2, and exp2 calls, but
it was missing for the rest of the builtins being handled in this code.
llvm-svn: 161282
I noticed that SelectionDAGBuilder::visitCall was missing a check for memcmp
in TargetLibraryInfo, so that it would use custom code for memcmp calls even
with -fno-builtin. I also had to add a new -disable-simplify-libcalls option
to llc so that I could write a test for this.
llvm-svn: 161262
Fast isel doesn't currently have support for translating builtin function
calls to target instructions. For embedded environments where the library
functions are not available, this is a matter of correctness and not
just optimization. Most of this patch is just arranging to make the
TargetLibraryInfo available in fast isel. <rdar://problem/12008746>
llvm-svn: 161232
Add more comments and use early returns to reduce nesting in isLoadFoldable.
Also disable folding for V_SET0 to avoid introducing a const pool entry and
a const pool load.
rdar://10554090 and rdar://11873276
llvm-svn: 161207
Machine CSE and other optimizations can remove instructions so folding
is possible at peephole while not possible at ISel.
This patch is a rework of r160919 and was tested on clang self-host on my local
machine.
rdar://10554090 and rdar://11873276
llvm-svn: 161152
One motivating example is to sink an instruction from a basic block which has
two successors: one outside the loop, the other inside the loop. We should try
to sink the instruction outside the loop.
rdar://11980766
llvm-svn: 161062
We branch to the successor with higher edge weight first.
Convert from
je LBB4_8 --> to outer loop
jmp LBB4_14 --> to inner loop
to
jne LBB4_14
jmp LBB4_8
PR12750
rdar: 11393714
llvm-svn: 161018
Machine CSE and other optimizations can remove instructions so folding
is possible at peephole while not possible at ISel.
rdar://10554090 and rdar://11873276
llvm-svn: 160919
It is possible that an instruction can use and update EFLAGS.
When checking the safety, we should check the usage of EFLAGS first before
declaring it is safe to optimize due to the update.
llvm-svn: 160912
These idempotent sub-register indices don't do anything --- They simply
map XMM registers to themselves. They no longer affect register classes
either since the SubRegClasses field has been removed from Target.td.
This patch replaces XMM->XMM EXTRACT_SUBREG and INSERT_SUBREG patterns
with COPY_TO_REGCLASS patterns which simply become COPY instructions.
The number of IMPLICIT_DEF instructions before register allocation is
reduced, and that is the cause of the test case changes.
llvm-svn: 160816
It is redundant; RegisterCoalescer will do the remat if it can't eliminate
the copy. Collected instruction counts before and after this. A few extra
instructions are generated due to spilling but it is normal to see these kinds
of changes with almost any small codegen change, according to Jakob.
This also fixed rdar://11830760 where xor is expected instead of movi0.
llvm-svn: 160749
struct s {
double x1;
float x2;
};
__attribute__((regparm(3))) struct s f(int a, int b, int c);
void g(void) {
f(41, 42, 43);
}
We need to be able to represent passing the address of s to f (sret) in a
register (inreg). Turns out that all that is needed is to not mark them as
mutually incompatible.
llvm-svn: 160695
are targeting an ELF platform. Only fold gs-relative (and fs-relative) loads
if it is actually sensible to do so for the target platform.
This fixes PR13438.
llvm-svn: 160687
LiveRangeEdit::foldAsLoad() can eliminate a register by folding a load
into its only use. Only do that when the load is safe to move, and it
won't extend any live ranges.
This fixes PR13414.
llvm-svn: 160575
PHIElimination splits critical edges when it predicts it can resolve
interference and eliminate copies. It doesn't split the edge if the
interference wouldn't be resolved anyway because the phi-use register is
live in the critical edge anyway.
Teach PHIElimination to split loop exiting edges with interference, even
if it wouldn't resolve the interference. This removes the necessary
copies from the loop, which is still an improvement from injecting the
copies into the loop.
The test case demonstrates the improvement. Before:
LBB0_1:
cmpb $0, (%rdx)
leaq 1(%rdx), %rdx
movl %esi, %eax
je LBB0_1
After:
LBB0_1:
cmpb $0, (%rdx)
leaq 1(%rdx), %rdx
je LBB0_1
movl %esi, %eax
llvm-svn: 160571
Updated OptimizeCompare in peephole to remove redundant cmp against zero.
We only remove Compare if CF and OF are not used.
rdar://11855129
llvm-svn: 160454
when run on an Intel Atom processor. The failures have arisen due
to changes elsewhere in the trunk over the past 8 weeks or so.
These failures were not detected by the Atom buildbot because the
CPU on the Atom buildbot was not being detected as an Atom CPU.
The fix for this problem is in Host.cpp and X86Subtarget.cpp, but
shall remain commented out until the current set of Atom test failures
are fixed.
Patch by Andy Zhang and Tyler Nowicki!
llvm-svn: 160451
large immediates. Add dag combine logic to recover in case the large
immediates doesn't fit in cmp immediate operand field.
int foo(unsigned long l) {
return (l>> 47) == 1;
}
we produce
%shr.mask = and i64 %l, -140737488355328
%cmp = icmp eq i64 %shr.mask, 140737488355328
%conv = zext i1 %cmp to i32
ret i32 %conv
which codegens to
movq $0xffff800000000000,%rax
andq %rdi,%rax
movq $0x0000800000000000,%rcx
cmpq %rcx,%rax
sete %al
movzbl %al,%eax
ret
TargetLowering::SimplifySetCC would transform
(X & -256) == 256 -> (X >> 8) == 1
if the immediate fails the isLegalICmpImmediate() test. For x86,
that's immediates which are not a signed 32-bit immediate.
Based on a patch by Eli Friedman.
PR10328
rdar://9758774
llvm-svn: 160346
uint32_t hi(uint64_t res)
{
uint_32t hi = res >> 32;
return !hi;
}
llvm IR looks like this:
define i32 @hi(i64 %res) nounwind uwtable ssp {
entry:
%lnot = icmp ult i64 %res, 4294967296
%lnot.ext = zext i1 %lnot to i32
ret i32 %lnot.ext
}
The optimizer has optimize away the right shift and truncate but the resulting
constant is too large to fit in the 32-bit immediate field. The resulting x86
code is worse as a result:
movabsq $4294967296, %rax ## imm = 0x100000000
cmpq %rax, %rdi
sbbl %eax, %eax
andl $1, %eax
This patch teaches the x86 lowering code to handle ult against a large immediate
with trailing zeros. It will issue a right shift and a truncate followed by
a comparison against a shifted immediate.
shrq $32, %rdi
testl %edi, %edi
sete %al
movzbl %al, %eax
It also handles a ugt comparison against a large immediate with trailing bits
set. i.e. X > 0x0ffffffff -> (X >> 32) >= 1
rdar://11866926
llvm-svn: 160312
In the added testcase the constant 55 was behind an AssertZext of type i1, and ComputeDemandedBits
reported that some of the bits were both known to be one and known to be zero.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160305
1. FileCheck-ize epilogue.ll and allow another asm instruction to restore %rsp.
2. Remove check in widen_arith-3.ll that was hitting instruction in epilogue instead of
vector add.
llvm-svn: 160274
undef virtual register. The problem is that ProcessImplicitDefs removes the
definition of the register and marks all uses as undef. If we lose the undef
marker then we get a register which has no def, is not marked as undef. The
live interval analysis does not collect information for these virtual
registers and we crash in later passes.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160260
It is intended to fix PR11468.
Old prologue and epilogue looked like this:
push %rbp
mov %rsp, %rbp
and $alignment, %rsp
push %r14
push %r15
...
pop %r15
pop %r14
mov %rbp, %rsp
pop %rbp
The problem was to reference the locations of callee-saved registers in exception handling:
locations of callee-saved had to be re-calculated regarding the stack alignment operation. It would
take some effort to implement this in LLVM, as currently MachineLocation can only have the form
"Register + Offset". Funciton prologue and epilogue are now changed to:
push %rbp
mov %rsp, %rbp
push %14
push %15
and $alignment, %rsp
...
lea -$size_of_saved_registers(%rbp), %rsp
pop %r15
pop %r14
pop %rbp
Reviewed by Chad Rosier.
llvm-svn: 160248
Allow the folding of vbroadcastRR to vbroadcastRM, where the memory operand is a spill slot.
PR12782.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160230
the input vector, it can be bigger (this is helpful for powerpc where <2 x i16>
is a legal vector type but i16 isn't a legal type, IIRC). However this wasn't
being taken into account by ExpandRes_EXTRACT_VECTOR_ELT, causing PR13220.
Lightly tweaked version of a patch by Michael Liao.
llvm-svn: 160116
X86. Basically, this is a reapplication of r158087 with a few fixes.
Specifically, (1) the stack pointer is restored from the base pointer before
popping callee-saved registers and (2) in obscure cases (see comments in patch)
we must cache the value of the original stack adjustment in the prologue and
apply it in the epilogue.
rdar://11496434
llvm-svn: 160002
multiple scalars and insert them into a vector. Next, we shuffle the elements
into the correct places, as before.
Also fix a small dagcombine bug in SimplifyBinOpWithSameOpcodeHands, when the
migration of bitcasts happened too late in the SelectionDAG process.
llvm-svn: 159991
getCondFromSETOpc, getCondFromCMovOpc, getSETFromCond, getCMovFromCond
No functional change intended.
If we want to update the condition code of CMOV|SET|Jcc, we first analyze the
opcode to get the condition code, then update the condition code, finally
synthesize the new opcode form the new condition code.
llvm-svn: 159955
It is safe if EFLAGS is killed or re-defined.
When we are done with the basic block, check whether EFLAGS is live-out.
Do not optimize away cmp if EFLAGS is live-out.
llvm-svn: 159888
For each Cmp, we check whether there is an earlier Sub which make Cmp
redundant. We handle the case where SUB operates on the same source operands as
Cmp, including the case where the two source operands are swapped.
llvm-svn: 159838
The CopyToReg nodes that set up the argument registers before a call
must be glued to the call instruction. Otherwise, the scheduler may emit
the physreg copies long before the call, causing long live ranges for
the fixed registers.
Besides disabling good register allocation, that can also expose
problems when EmitInstrWithCustomInserter() splits a basic block during
the live range of a physreg.
llvm-svn: 159721
Implement the TII hooks needed by EarlyIfConversion to create cmov
instructions and estimate their latency.
Early if-conversion is still not enabled by default.
llvm-svn: 159695
another mechanical change accomplished though the power of terrible Perl
scripts.
I have manually switched some "s to 's to make escaping simpler.
While I started this to fix tests that aren't run in all configurations,
the massive number of tests is due to a really frustrating fragility of
our testing infrastructure: things like 'grep -v', 'not grep', and
'expected failures' can mask broken tests all too easily.
Essentially, I'm deeply disturbed that I can change the testsuite so
radically without causing any change in results for most platforms. =/
llvm-svn: 159547
versions of Bash. In addition, I can back out the change to the lit
built-in shell test runner to support this.
This should fix the majority of fallout on Darwin, but I suspect there
will be a few straggling issues.
llvm-svn: 159544
This was done through the aid of a terrible Perl creation. I will not
paste any of the horrors here. Suffice to say, it require multiple
staged rounds of replacements, state carried between, and a few
nested-construct-parsing hacks that I'm not proud of. It happens, by
luck, to be able to deal with all the TCL-quoting patterns in evidence
in the LLVM test suite.
If anyone is maintaining large out-of-tree test trees, feel free to poke
me and I'll send you the steps I used to convert things, as well as
answer any painful questions etc. IRC works best for this type of thing
I find.
Once converted, switch the LLVM lit config to use ShTests the same as
Clang. In addition to being able to delete large amounts of Python code
from 'lit', this will also simplify the entire test suite and some of
lit's architecture.
Finally, the test suite runs 33% faster on Linux now. ;]
For my 16-hardware-thread (2x 4-core xeon e5520): 36s -> 24s
llvm-svn: 159525
Before this patch in pic 32 bit code we would add the global base register
and not load from that address. This is a really old bug, but before the
introduction of the tls attributes we would never select initial exec for
pic code.
llvm-svn: 159409
Corrected type for index of llvm.x86.avx2.gather.d.pd.256
from 256-bit to 128-bit.
Corrected types for src|dst|mask of llvm.x86.avx2.gather.q.ps.256
from 256-bit to 128-bit.
Support the following intrinsics:
llvm.x86.avx2.gather.d.q, llvm.x86.avx2.gather.q.q
llvm.x86.avx2.gather.d.q.256, llvm.x86.avx2.gather.q.q.256
llvm.x86.avx2.gather.d.d, llvm.x86.avx2.gather.q.d
llvm.x86.avx2.gather.d.d.256, llvm.x86.avx2.gather.q.d.256
llvm-svn: 159402
The primary advantage is that loop optimizations will be applied in a
stable order. This helps debugging and unit test creation. It is also
a better overall implementation without pathologically bad performance
on deep functions.
On large functions (llvm-stress --size=200000 | opt -loops)
Before: 0.1263s
After: 0.0225s
On deep functions (after tweaking llvm-stress, thanks Nadav):
Before: 0.2281s
After: 0.0227s
See r158790 for more comments.
The loop tree is now consistently generated in forward order, but loop
passes are applied in reverse order over the program. If we have a
loop optimization that prefers forward order, that can easily be
achieved by adding a different type of LoopPassManager.
llvm-svn: 159183
Implicitly defined virtual registers can simply have the <undef> bit set
on all uses, and copies can be turned into implicit defs recursively.
Physical registers are a bit trickier. We handle the common case where a
physreg def is used by a nearby instruction in the same basic block. For
more complicated cases, just leave the IMPLICIT_DEF instruction in.
llvm-svn: 159149
The function live-out registers must be live at all function returns,
and %RCX is only used by eh.return. When a function also has a normal
return, only %RAX holds a return value.
This fixes PR13188.
llvm-svn: 159116
This allows the user/front-end to specify a model that is better
than what LLVM would choose by default. For example, a variable
might be declared as
@x = thread_local(initialexec) global i32 42
if it will not be used in a shared library that is dlopen'ed.
If the specified model isn't supported by the target, or if LLVM can
make a better choice, a different model may be used.
llvm-svn: 159077
The code in X86TargetLowering::LowerEH_RETURN() assumes that a frame
pointer exists, but the frame pointer was forced by the presence of
llvm.eh.unwind.init which isn't guaranteed.
If llvm.eh.unwind.init is actually required in functions calling
eh.return (is it?), we should diagnose that instead of emitting bad
machine code.
This should fix the dragonegg-x86_64-linux-gcc-4.6-test bot.
llvm-svn: 158961
Regunit live ranges are computed on demand, so when mi-sched calls
handleMove, some regunits may not have live ranges yet.
That makes updating them easier: Just skip the non-existing ranges. They
will be computed correctly from the rescheduled machine code when they
are needed.
llvm-svn: 158831
TargetLoweringObjectFileELF. Use this to support it on X86. Unlike ARM,
on X86 it is not easy to find out if .init_array should be used or not, so
the decision is made via TargetOptions and defaults to off.
Add a command line option to llc that enables it.
llvm-svn: 158692
temporarily reverted.
This test is annoyingly overspecified, but I don't know of another way
to thoroughly test the saving and restoring of the registers. While this
will have to be adjusted even with the issue fixed in order to re-apply
r158087, those adjustments should very clearly indicate that it is still
correct (%esp getting restored prior to pops), whereas without it, this
case can easily slip under the radar.
Still, any suggestions for improvements are very welcome.
All credit to Matt Beaumont-Gay for reducing this out of an insane
Address Sanitizer crash to a reasonably small seg-faulting C program
when built with -mstackrealign. I just reduced it to IR, which was much
simpler. =]
llvm-svn: 158656
This patch causes problems when both dynamic stack realignment and
dynamic allocas combine in the same function. With this patch, we no
longer build the epilog correctly, and silently restore registers from
the wrong position in the stack.
Thanks to Matt for tracking this down, and getting at least an initial
test case to Chad. I'm going to try to check a variation of that test
case in so we can easily track the fixes required.
llvm-svn: 158654
The fast register allocator is not supposed to work in the optimizing
pipeline. It doesn't make sense to compute live intervals, run full copy
coalescing, and then run RAFast.
Fast register allocation in the optimizing pipeline is better done by
RABasic.
llvm-svn: 158242
This patch will generate the following for integer ABS:
movl %edi, %eax
negl %eax
cmovll %edi, %eax
INSTEAD OF
movl %edi, %ecx
sarl $31, %ecx
leal (%rdi,%rcx), %eax
xorl %ecx, %eax
There exists a target-independent DAG combine for integer ABS, which converts
integer ABS to sar+add+xor. For X86, we match this pattern back to neg+cmov.
This is implemented in PerformXorCombine.
rdar://10695237
llvm-svn: 158175
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 158126
The commit is intended to fix rdar://11540023.
It is implemented as part of peephole optimization. We can actually implement
this in the SelectionDAG lowering phase.
llvm-svn: 158122
This patch will optimize the following:
sub r1, r3
cmp r3, r1 or cmp r1, r3
bge L1
TO
sub r1, r3
bge L1 or ble L1
If the branch instruction can use flag from "sub", then we can eliminate
the "cmp" instruction.
llvm-svn: 157831
This implements codegen support for accesses to thread-local variables
using the local-dynamic model, and adds a clean-up pass so that the base
address for the TLS block can be re-used between local-dynamic access on
an execution path.
llvm-svn: 157818
types, as well as int<->ptr casts. This allows us to tailcall functions
with some trivial casts between the call and return (i.e. because the
return types disagree).
llvm-svn: 157798
This patch will optimize the following
movq %rdi, %rax
subq %rsi, %rax
cmovsq %rsi, %rdi
movq %rdi, %rax
to
cmpq %rsi, %rdi
cmovsq %rsi, %rdi
movq %rdi, %rax
Perform this optimization if the actual result of SUB is not used.
rdar: 11540023
llvm-svn: 157755
I disabled FMA3 autodetection, since the result may differ from expected for some benchmarks.
I added tests for GodeGen and intrinsics.
I did not change llvm.fma.f32/64 - it may be done later.
llvm-svn: 157737
It helps compile exotic inline asm. In the test case, normal GR32
virtual registers use up eax-edx so the final GR32_ABCD live range has
no registers left. Since all the live ranges were tiny, we had no way of
prioritizing the smaller register class.
This patch allows tiny unspillable live ranges to be evicted by tiny
unspillable live ranges from a smaller register class.
<rdar://problem/11542429>
llvm-svn: 157715
integer registers. This is already supported by the fastcc convention, but it doesn't
hurt to support it in the standard conventions as well.
In cases where we can cheat at the calling convention, this allows us to avoid returning
things through memory in more cases.
llvm-svn: 157698
This required light surgery on the assembler and disassembler
because the instructions use an uncommon encoding. They are
the only two instructions in x86 that use register operands
and two immediates.
llvm-svn: 157634
SimplifyCFG tends to form a lot of 2-3 case switches when merging branches. Move
the most likely condition to the front so it is checked first and the others can
be skipped. This is currently not as effective as it could be because SimplifyCFG
destroys profiling metadata when merging branches and switches. Merging branch
weight metadata is tricky though.
This code touches at most 3 cases so I didn't use a proper sorting algorithm.
llvm-svn: 157521
Now that the coalescer keeps live intervals and machine code in sync at
all times, it needs to deal with identity copies differently.
When merging two virtual registers, all identity copies are removed
right away. This means that other identity copies must come from
somewhere else, and they are going to have a value number.
Deal with such copies by merging the value numbers before erasing the
copy instruction. Otherwise, we leave dangling value numbers in the live
interval.
This fixes PR12927.
llvm-svn: 157340
may be RAUW'd by the recursive call to LegalizeOps; instead, retrieve
the other operands when calling UpdateNodeOperands. Fixes PR12889.
llvm-svn: 157162
Dead code elimination during coalescing could cause a virtual register
to be split into connected components. The following rewriting would be
confused about the already joined copies present in the code, but
without a corresponding value number in the live range.
Erase all joined copies instantly when joining intervals such that the
MI and LiveInterval representations are always in sync.
llvm-svn: 157135
The late dead code elimination is no longer necessary.
The test changes are cause by a register hint that can be either %rdi or
%rax. The choice depends on the use list order, which this patch changes.
llvm-svn: 157131