Commit Graph

6812 Commits

Author SHA1 Message Date
Preston Gurd e36b685a94 The current Intel Atom microarchitecture has a feature whereby when a function
returns early then it is slightly faster to execute a sequence of NOP
instructions to wait until the return address is ready,
as opposed to simply stalling on the ret instruction
until the return address is ready.

When compiling for X86 Atom only, this patch will run a pass, called
"X86PadShortFunction" which will add NOP instructions where less than four
cycles elapse between function entry and return.

It includes tests.

Patch by Andy Zhang.

llvm-svn: 171524
2013-01-04 20:54:54 +00:00
Akira Hatanaka b13b33359b [mips] MipsTargetLowering::getSetCCResultType should return a vector type if
vectors are being compared.

llvm-svn: 171517
2013-01-04 20:06:01 +00:00
Nadav Rotem c616a5408a Revert revision: 171467. This transformation is incorrect and makes some tests fail. Original message:
Simplified TRUNCATE operation that comes after SETCC. It is possible since SETCC result is 0 or -1.
Added a test.

llvm-svn: 171468
2013-01-04 17:35:21 +00:00
Elena Demikhovsky 5f2f06d2d9 Simplified TRUNCATE operation that comes after SETCC. It is possible since SETCC result is 0 or -1.
Added a test.

llvm-svn: 171467
2013-01-03 08:48:33 +00:00
Michael Gottesman 820aac1c78 Revert "Mark DIV/IDIV instructions hasSideEffects=1 because they can trap when dividing by 0. This is needed to keep early if conversion from moving them across basic blocks."
This reverts commit r171461 since it breaks the following tests:

Clang :: Analysis/outofbound-notwork.c
Clang :: Analysis/string-fail.c
Clang :: CXX/basic/basic.lookup/basic.lookup.qual/p6-0x.cpp
Clang :: CXX/basic/basic.lookup/basic.lookup.unqual/p15.cpp
Clang :: CXX/dcl.dcl/dcl.spec/dcl.fct.spec/p4.cpp
Clang :: CXX/dcl.dcl/dcl.spec/dcl.stc/p10.cpp
Clang :: CXX/temp/temp.param/p14.cpp
Clang :: CXX/temp/temp.res/temp.dep.res/temp.point/p1.cpp
Clang :: CodeGen/2009-02-13-zerosize-union-field-ppc.c
Clang :: CodeGen/blocks-2.c
Clang :: CodeGen/libcalls-d.c
Clang :: CodeGen/libcalls-ld.c
Clang :: CodeGenCXX/conversion-function.cpp
Clang :: CodeGenCXX/debug-info-limit-type.cpp
Clang :: CodeGenCXX/inheriting-constructor.cpp
Clang :: FixIt/fixit-errors.c
Clang :: FixIt/fixit-pmem.cpp
Clang :: Modules/namespaces.cpp
Clang :: PCH/changed-files.c
Clang :: PCH/pr4489.c
Clang :: PCH/source-manager-stack.c
Clang :: Parser/cxx-ambig-decl-expr-xfail.cpp
Clang :: SemaCXX/switch-implicit-fallthrough-cxx98.cpp
Clang :: SemaTemplate/instantiate-function-1.mm

llvm-svn: 171466
2013-01-03 08:18:30 +00:00
Craig Topper 7c27cc9fd0 Mark DIV/IDIV instructions hasSideEffects=1 because they can trap when dividing by 0. This is needed to keep early if conversion from moving them across basic blocks.
llvm-svn: 171461
2013-01-03 06:40:20 +00:00
Jakob Stoklund Olesen 725d57682b Fix PR14732 by handling all kinds of IMPLICIT_DEF live ranges.
Most IMPLICIT_DEF instructions are removed by the ProcessImplicitDefs
pass, and a few are reinserted by PHIElimination when a PHI argument is
<undef>.

RegisterCoalescer was assuming that all IMPLICIT_DEF live ranges look
like those created by PHIElimination, and that their live range never
leaves the basic block.

The PR14732 test case does tricks with PHI nodes that causes a longer
IMPLICIT_DEF live range to appear. This happens very rarely, but
RegisterCoalescer should be able to handle it.

llvm-svn: 171435
2013-01-03 00:47:51 +00:00
Tom Stellard 567f886eb0 DAGCombiner: Avoid generating illegal vector INT_TO_FP nodes
DAGCombiner::reduceBuildVecConvertToConvertBuildVec() was making two
mistakes:

1. It was checking the legality of scalar INT_TO_FP nodes and then generating
vector nodes.

2. It was passing the result value type to
TargetLoweringInfo::getOperationAction() when it should have been
passing the value type of the first operand.

llvm-svn: 171420
2013-01-02 22:13:01 +00:00
Nadav Rotem c8d7047fa9 AVX: Fix a bug in WidenMaskArithmetic.
llvm-svn: 171397
2013-01-02 17:40:39 +00:00
Hal Finkel 6dbdd4307b Support ppcf128 in SelectionDAG::getConstantFP
Fixes pr14751.

Patch by Kai; Thanks!

llvm-svn: 171261
2012-12-30 19:03:32 +00:00
Dmitri Gribenko 56bf2e1830 Tests: rewrite 'opt ... %s' to 'opt ... < %s' so that opt does not emit a ModuleID
This is done to avoid odd test failures, like the one fixed in r171243.

llvm-svn: 171250
2012-12-30 02:33:22 +00:00
Nadav Rotem 3da9ac72fa AVX: Move the ZEXT/ANYEXT DAGCo optimizations to the lowering of these optimizations. The old test cases still cover all of these lowering/optimizations. The single change that we have is that now anyext does not need to zero a register, because it does not use the exact code path as the zero_extend.
llvm-svn: 171178
2012-12-28 05:45:24 +00:00
Nadav Rotem 2a054b4475 On AVX/AVX2 the type v8i1 is legalized to v8i16, which is an XMM sized
register. In most cases we actually compare or select YMM-sized registers
and mixing the two types creates horrible code. This commit optimizes
some of the transition sequences.

PR14657.

llvm-svn: 171148
2012-12-27 08:15:45 +00:00
NAKAMURA Takumi 40aa3285f4 llvm/test/CodeGen/X86: FileCheck-ize two tests in r171083.
llvm-svn: 171084
2012-12-26 03:19:30 +00:00
NAKAMURA Takumi 334f685328 llvm/test/CodeGen/X86: Disable avx in two tests corresponding to r171082.
llvm-svn: 171083
2012-12-26 03:08:55 +00:00
Hal Finkel 2ebe6d08cd Loosen scheduling restrictions on the PPC dcbt intrinsic
As with the prefetch intrinsic to which it maps, simply have dcbt
marked as reading from and writing to its arguments instead of having
unmodeled side effects. While this might cause unwanted code motion
(because aliasing checks don't really capture cache-line sharing),
it is more important that prefetches in unrolled loops don't block
the scheduler from rearranging the unrolled loop body.

llvm-svn: 171073
2012-12-25 18:51:18 +00:00
Hal Finkel 1b5ff08d43 Expand PPC64 atomic load and store
Use of store or load with the atomic specifier on 64-bit types would
cause instruction-selection failures. As with the 32-bit case, these
can use the default expansion in terms of cmp-and-swap.

llvm-svn: 171072
2012-12-25 17:22:53 +00:00
Benjamin Kramer a9f265ee98 Harden test so it's not affected by changes to compare lowering.
This only failed on hosts that don't have SSE41.

llvm-svn: 171066
2012-12-25 13:23:23 +00:00
Benjamin Kramer 81b5a8fd2e X86: Shave off one shuffle from the pcmpeqq sequence for SSE2 by making use of and commutativity.
llvm-svn: 171064
2012-12-25 13:09:08 +00:00
Benjamin Kramer df4af41b9b X86: Custom lower <2 x i64> eq and ne when SSE41 is not available.
pcmpeqd, pshufd, pshufd, pand is cheaper than unpack + cmpq, sbbq, cmpq, sbbq + pack.
Small speedup on loop-vectorized viterbi (-march=core2).

llvm-svn: 171063
2012-12-25 12:54:19 +00:00
NAKAMURA Takumi 1b18db7ea3 llvm/test/CodeGen/X86/fold-vex.ll: Add explicit triple.
llvm-svn: 171029
2012-12-24 11:14:06 +00:00
Nadav Rotem dc0ad92b64 Some x86 instructions can load/store one of the operands to memory. On SSE, this memory needs to be aligned.
When these instructions are encoded in VEX (on AVX) there is no such requirement. This changes the folding
tables and removes the alignment restrictions from VEX-encoded instructions.

llvm-svn: 171024
2012-12-24 09:40:33 +00:00
Benjamin Kramer 76268ac682 X86: Turn mul of <4 x i32> into pmuludq when no SSE4.1 is available.
pmuludq is slow, but it turns out that all the unpacking and packing of the
scalarized mul is even slower. 10% speedup on loop-vectorized paq8p.

llvm-svn: 170985
2012-12-22 16:07:56 +00:00
Benjamin Kramer b2f0a2bd4b X86: Emit vector sext as shuffle + sra if vpmovsx is not available.
Also loosen the SSSE3 dependency a bit, expanded pshufb + psra is still better
than scalarized loads. Fixes PR14590.

llvm-svn: 170984
2012-12-22 11:34:28 +00:00
Nadav Rotem d5aae980cb In some cases, due to scheduling constraints we copy the EFLAGS.
The only way to read the eflags is using push and pop. If we don't
adjust the stack then we run over the first frame index. This is
not something that we want to do, so we have to make sure that
our machine function does not copy the flags. If it does then
we have to emit the prolog that adjusts the stack.

rdar://12896831

llvm-svn: 170961
2012-12-21 23:48:49 +00:00
Benjamin Kramer b4688f84bd try to unbreak ppc buildbots.
llvm-svn: 170913
2012-12-21 18:11:45 +00:00
Benjamin Kramer 82d1c371e2 X86: Match pmin/pmax as a target specific dag combine. This occurs during vectorization.
Part of PR14667.

llvm-svn: 170908
2012-12-21 17:46:58 +00:00
Tom Stellard a8b0351720 R600: Expand vec4 INT <-> FP conversions
llvm-svn: 170901
2012-12-21 16:33:24 +00:00
Reed Kotler 93f778d2bd Add test case for r170674
llvm-svn: 170823
2012-12-21 00:55:10 +00:00
Eric Christopher 6e47b725ff Move these files over to the debug info directory.
llvm-svn: 170810
2012-12-21 00:03:42 +00:00
Bob Wilson 7bba4f8957 Revert "Adding support for llvm.arm.neon.vaddl[su].* and"
This reverts r170694.  The operations can be represented in IR without
adding any new intrinsics.

llvm-svn: 170765
2012-12-20 21:09:38 +00:00
Evan Cheng ddc0cb6dc5 On some ARM cpus, flags setting movs with shifter operand, i.e. lsl, lsr, asr,
are more expensive than the non-flag setting variant. Teach thumb2 size
reduction pass to avoid generating them unless we are optimizing for size.

rdar://12892707

llvm-svn: 170728
2012-12-20 19:59:30 +00:00
Rafael Espindola 642c7cd56e Simplify the testcase a bit.
I checked that it would still crash llc before the corresponding fix.

llvm-svn: 170709
2012-12-20 17:47:27 +00:00
Renato Golin 6b2ea4a48f Adding support for llvm.arm.neon.vaddl[su].* and
llvm.arm.neon.vsub[su].* intrinsics.

Patch by Pete Couperus <pjcoup@gmail.com>

llvm-svn: 170694
2012-12-20 13:52:11 +00:00
Reed Kotler d019dbf75e fix most of remaining issues with large frames.
these patches are tested a lot by test-suite but
make check tests are forthcoming once the next
few patches that complete this are committed.
with the next few patches the pass rate for mips16 is
near 100%

llvm-svn: 170656
2012-12-20 04:07:42 +00:00
Akira Hatanaka f423672117 [mips] Use "or $r0, $r1, $zero" instead of "addu $r0, $zero, $r1" to copy
physical register $r1 to $r0.

GNU disassembler recognizes an "or" instruction as a "move", and this change
makes the disassembled code easier to read.

Original patch by Reed Kotler.

llvm-svn: 170655
2012-12-20 04:06:06 +00:00
Bob Wilson 3365b80290 Do not introduce vector operations in functions marked with noimplicitfloat.
<rdar://problem/12879313>

llvm-svn: 170630
2012-12-20 01:36:20 +00:00
Evan Cheng eae6d2ccea LLVM sdisel normalize bit extraction of the form:
((x & 0xff00) >> 8) << 2
to
 (x >> 6) & 0x3fc

This is general goodness since it folds a left shift into the mask. However,
the trailing zeros in the mask prevents the ARM backend from using the bit
extraction instructions. And worse since the mask materialization may require
an addition instruction. This comes up fairly frequently when the result of 
the bit twiddling is used as memory address. e.g.

 = ptr[(x & 0xFF0000) >> 16]

We want to generate:
  ubfx   r3, r1, #16, #8
  ldr.w  r3, [r0, r3, lsl #2]

vs.
  mov.w  r9, #1020
  and.w  r2, r9, r1, lsr #14
  ldr    r2, [r0, r2]

Add a late ARM specific isel optimization to
ARMDAGToDAGISel::PreprocessISelDAG(). It folds the left shift to the
'base + offset' address computation; change the mask to one which doesn't have
trailing zeros and enable the use of ubfx.

Note the optimization has to be done late since it's target specific and we
don't want to change the DAG normalization. It's also fairly restrictive
as shifter operands are not always free. It's only done for lsh 1 / 2. It's
known to be free on some cpus and they are most common for address
computation.

This is a slight win for blowfish, rijndael, etc.

rdar://12870177

llvm-svn: 170581
2012-12-19 20:16:09 +00:00
Benjamin Kramer c5071466d4 PowerPC: Expand VSELECT nodes.
There's probably a better expansion for those nodes than the default for
altivec, but this is better than crashing. VSELECTs occur in loop vectorizer
output.

llvm-svn: 170551
2012-12-19 15:49:14 +00:00
Elena Demikhovsky 14a4af0e66 Optimized load + SIGN_EXTEND patterns in the X86 backend.
llvm-svn: 170506
2012-12-19 07:50:20 +00:00
Nadav Rotem 33360d8ae9 After reducing the size of an operation in the DAG we zero-extend the reduced
bitwidth op back to the original size. If we reduce ANDs then this can cause
an endless loop. This patch changes the ZEXT to ANY_EXTEND if the demanded bits
are equal or smaller than the size of the reduced operation.

llvm-svn: 170505
2012-12-19 07:39:08 +00:00
Craig Topper 63f5921776 Teach SimplifySetCC that comparing AssertZext i1 against a constant 1 can be rewritten as a compare against a constant 0 with the opposite condition.
llvm-svn: 170495
2012-12-19 06:12:28 +00:00
Quentin Colombet 23b404d5ad Disable ARM partial flag dependency optimization at -Oz
To not over constrain the scheduler for ARM in thumb mode, some optimizations  for code size reduction, specific to ARM thumb, are blocked when they add a dependency (like write after read dependency).

Disables this check when code size is the priority, i.e., code is compiled with -Oz.

llvm-svn: 170462
2012-12-18 22:47:16 +00:00
Andrew Trick ec2564818c MISched: add dependence to ExitSU to model live-out latency.
llvm-svn: 170454
2012-12-18 20:53:01 +00:00
Hal Finkel 943f76d1b3 Check multiple register classes for inline asm tied registers
A register can be associated with several distinct register classes.
For example, on PPC, the floating point registers are each associated with
both F4RC (which holds f32) and F8RC (which holds f64). As a result, this code
would fail when provided with a floating point register and an f64 operand
because it would happen to find the register in the F4RC class first and
return that. From the F4RC class, SDAG would extract f32 as the register
type and then assert because of the invalid implied conversion between
the f64 value and the f32 register.

Instead, search all register classes. If a register class containing the
the requested register has the requested type, then return that register
class. Otherwise, as before, return the first register class found that
contains the requested register.

llvm-svn: 170436
2012-12-18 17:50:58 +00:00
Craig Topper f924a58af1 Add rest of BMI/BMI2 instructions to the folding tables as well as popcnt and lzcnt.
llvm-svn: 170304
2012-12-17 05:02:29 +00:00
Reed Kotler aee4d5d194 This patch is needed to make c++ exceptions work for mips16.
Mips16 is really a processor decoding mode (ala thumb 1) and in the same
program, mips16 and mips32 functions can exist and can call each other.

If a jal type instruction encounters an address with the lower bit set, then
the processor switches to mips16 mode (if it is not already in it). If the
lower bit is not set, then it switches to mips32 mode.

The linker knows which functions are mips16 and which are mips32.
When relocation is performed on code labels, this lower order bit is
set if the code label is a mips16 code label.

In general this works just fine, however when creating exception handling
tables and dwarf, there are cases where you don't want this lower order
bit added in.

This has been traditionally distinguished in gas assembly source by using a
different syntax for the label.

lab1:      ; this will cause the lower order bit to be added
lab2=.     ; this will not cause the lower order bit to be added

In some cases, it does not matter because in dwarf and debug tables
the difference of two labels is used and in that case the lower order
bits subtract each other out.

To fix this, I have added to mcstreamer the notion of a debuglabel.
The default is for label and debug label to be the same. So calling
EmitLabel and EmitDebugLabel produce the same result.

For various reasons, there is only one set of labels that needs to be
modified for the mips exceptions to work. These are the "$eh_func_beginXXX" 
labels.

Mips overrides the debug label suffix from ":" to "=." .

This initial patch fixes exceptions. More changes most likely
will be needed to DwarfCFException to make all of this work
for actual debugging. These changes will be to emit debug labels in some
places where a simple label is emitted now.

Some historical discussion on this from gcc can be found at:
http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00623.html
http://gcc.gnu.org/ml/gcc-patches/2008-11/msg01273.html 

llvm-svn: 170279
2012-12-16 04:00:45 +00:00
Benjamin Kramer b16ccde7a4 X86: Add a couple of target-specific dag combines that turn VSELECTS into psubus if possible.
We match the pattern "x >= y ? x-y : 0" into "subus x, y" and two special cases
if y is a constant. DAGCombiner canonicalizes those so we first have to undo the
canonicalization for those cases. The pattern occurs in gzip when the loop
vectorizer is enabled. Part of PR14613.

llvm-svn: 170273
2012-12-15 16:47:44 +00:00
Reed Kotler 5fdeb21249 This code implements most of mips16 hardfloat as it is done by gcc.
In this case, essentially it is soft float with different library routines.
The next step will be to make this fully interoperational with mips32 floating
point and that requires creating stubs for functions with signatures that
contain floating point types.

I have a more sophisticated design for mips16 hardfloat which I hope to
implement at a later time that directly does floating point without the need
for function calls.

The mips16 encoding has no floating point instructions so one needs to
switch to mips32 mode to execute floating point instructions.

llvm-svn: 170259
2012-12-15 00:20:05 +00:00
Nadav Rotem 8487537bdb TypeLegalizer: Do not generate target specific nodes with illegal types, because we cant type-legalize them.
llvm-svn: 170245
2012-12-14 21:20:37 +00:00
Bill Schmidt a4f898448c This patch removes some nondeterminism from direct object file output
for TLS dynamic models on 64-bit PowerPC ELF.  The default sort routine
for relocations only sorts on the r_offset field; but with TLS, there
can be two relocations with the same r_offset.  For PowerPC, this patch
sorts secondarily on descending r_type, which matches the behavior
expected by the linker.

llvm-svn: 170237
2012-12-14 20:28:38 +00:00
Bill Schmidt 9f0b4ec0f5 This patch improves the 64-bit PowerPC InitialExec TLS support by providing
for a wider range of GOT entries that can hold thread-relative offsets.
This matches the behavior of GCC, which was not documented in the PPC64 TLS
ABI.  The ABI will be updated with the new code sequence.

Former sequence:

  ld 9,x@got@tprel(2)
  add 9,9,x@tls

New sequence:

  addis 9,2,x@got@tprel@ha
  ld 9,x@got@tprel@l(9)
  add 9,9,x@tls

Note that a linker optimization exists to transform the new sequence into
the shorter sequence when appropriate, by replacing the addis with a nop
and modifying the base register and relocation type of the ld.

llvm-svn: 170209
2012-12-14 17:02:38 +00:00
Akira Hatanaka cf9a61b6ee [mips] Do not copy GOT address to register $gp if the function being called has
internal linkage.

llvm-svn: 170092
2012-12-13 03:17:29 +00:00
Evan Cheng bf0baa9de7 Fix a bug in DAGCombiner::MatchBSwapHWord. Make sure the node has operands before referencing them. rdar://12868039
llvm-svn: 170078
2012-12-13 01:34:32 +00:00
Evan Cheng b7d3d03bf9 Fix a logic bug in inline expansion of memcpy / memset with an overlapping
load / store pair. It's not legal to use a wider load than the size of
the remaining bytes if it's the first pair of load / store.

llvm-svn: 170018
2012-12-12 20:43:23 +00:00
Bill Schmidt 25ffd19502 The ordering of two relocations on the same instruction is apparently not
predictable when compiled on at least one non-PowerPC host.  Source of
nondeterminism not apparent.  Restrict the test to build on PowerPC hosts
for now while looking into the issue further.

llvm-svn: 170016
2012-12-12 20:29:20 +00:00
Bill Schmidt 24b8dd6eb7 This patch implements local-dynamic TLS model support for the 64-bit
PowerPC target.  This is the last of the four models, so we now have 
full TLS support.

This is mostly a straightforward extension of the general dynamic model.
I had to use an additional Chain operand to tie ADDIS_DTPREL_HA to the
register copy following ADDI_TLSLD_L; otherwise everything above the
ADDIS_DTPREL_HA appeared dead and was removed.

As before, there are new test cases to test the assembly generation, and
the relocations output during integrated assembly.  The expected code
gen sequence can be read in test/CodeGen/PowerPC/tls-ld.ll.

There are a couple of things I think can be done more efficiently in the
overall TLS code, so there will likely be a clean-up patch forthcoming;
but for now I want to be sure the functionality is in place.

Bill

llvm-svn: 170003
2012-12-12 19:29:35 +00:00
NAKAMURA Takumi be230b8fdb llvm/test/CodeGen/X86/atom-bypass-slow-division.ll: Fix possible typo(s) in CHECK-NOT lines.
Found by Alexander Zinenko, thanks!

llvm-svn: 169978
2012-12-12 13:34:20 +00:00
NAKAMURA Takumi cae5321a3b llvm/test/CodeGen/X86/atom-bypass-slow-division.ll: Rename symbols, s/test_/Test/g, not to mismatch "CHECK(-NOT): test".
llvm-svn: 169977
2012-12-12 13:34:14 +00:00
NAKAMURA Takumi 69d1405e48 llvm/test/CodeGen/X86/store_op_load_fold.ll: Fix typo, s/CHECK_NEXT/CHECK-NEXT/
llvm-svn: 169957
2012-12-12 01:41:01 +00:00
NAKAMURA Takumi 01ac65af00 llvm/test/CodeGen/X86/store_op_load_fold.ll: Add explicit triple.
llvm-svn: 169956
2012-12-12 01:40:56 +00:00
Manman Ren 82751a105c DAGCombine: clamp hi bit in APInt::getBitsSet to avoid assertion
rdar://12838504

llvm-svn: 169951
2012-12-12 01:13:50 +00:00
Evan Cheng 04e5518783 Avoid using lossy load / stores for memcpy / memset expansion. e.g.
f64 load / store on non-SSE2 x86 targets.

llvm-svn: 169944
2012-12-12 00:42:09 +00:00
Tom Stellard 75aadc2813 Add R600 backend
A new backend supporting AMD GPUs: Radeon HD2XXX - HD7XXX

llvm-svn: 169915
2012-12-11 21:25:42 +00:00
Bill Schmidt c56f1d34bc This patch implements the general dynamic TLS model for 64-bit PowerPC.
Given a thread-local symbol x with global-dynamic access, the generated
code to obtain x's address is:

     Instruction                            Relocation            Symbol
  addis ra,r2,x@got@tlsgd@ha           R_PPC64_GOT_TLSGD16_HA       x
  addi  r3,ra,x@got@tlsgd@l            R_PPC64_GOT_TLSGD16_L        x
  bl __tls_get_addr(x@tlsgd)           R_PPC64_TLSGD                x
                                       R_PPC64_REL24           __tls_get_addr
  nop
  <use address in r3>

The implementation borrows from the medium code model work for introducing
special forms of ADDIS and ADDI into the DAG representation.  This is made
slightly more complicated by having to introduce a call to the external
function __tls_get_addr.  Using the full call machinery is overkill and,
more importantly, makes it difficult to add a special relocation.  So I've
introduced another opcode GET_TLS_ADDR to represent the function call, and
surrounded it with register copies to set up the parameter and return value.

Most of the code is pretty straightforward.  I ran into one peculiarity
when I introduced a new PPC opcode BL8_NOP_ELF_TLSGD, which is just like
BL8_NOP_ELF except that it takes another parameter to represent the symbol
("x" above) that requires a relocation on the call.  Something in the 
TblGen machinery causes BL8_NOP_ELF and BL8_NOP_ELF_TLSGD to be treated
identically during the emit phase, so this second operand was never
visited to generate relocations.  This is the reason for the slightly
messy workaround in PPCMCCodeEmitter.cpp:getDirectBrEncoding().

Two new tests are included to demonstrate correct external assembly and
correct generation of relocations using the integrated assembler.

Comments welcome!

Thanks,
Bill

llvm-svn: 169910
2012-12-11 20:30:11 +00:00
Chad Rosier d4c0c6cb22 Add a triple to this test.
llvm-svn: 169803
2012-12-11 00:51:36 +00:00
Chandler Carruth b27041c50b Fix a miscompile in the DAG combiner. Previously, we would incorrectly
try to reduce the width of this load, and would end up transforming:

  (truncate (lshr (sextload i48 <ptr> as i64), 32) to i32)
to
  (truncate (zextload i32 <ptr+4> as i64) to i32)

We lost the sext attached to the load while building the narrower i32
load, and replaced it with a zext because lshr always zext's the
results. Instead, bail out of this combine when there is a conflict
between a sextload and a zext narrowing. The rest of the DAG combiner
still optimize the code down to the proper single instruction:

  movswl 6(...),%eax

Which is exactly what we wanted. Previously we read past the end *and*
missed the sign extension:

  movl 6(...), %eax

llvm-svn: 169802
2012-12-11 00:36:57 +00:00
Paul Redmond c4550d4967 move X86-specific test
This test case uses -mcpu=corei7 so it belongs in CodeGen/X86

Reviewed by: Nadav

llvm-svn: 169801
2012-12-11 00:36:43 +00:00
Chad Rosier df42cf39ab Fall back to the selection dag isel to select tail calls.
This shouldn't affect codegen for -O0 compiles as tail call markers are not
emitted in unoptimized compiles.  Testing with the external/internal nightly
test suite reveals no change in compile time performance.  Testing with -O1,
-O2 and -O3 with fast-isel enabled did not cause any compile-time or
execution-time failures.  All tests were performed on my x86 machine.
I'll monitor our arm testers to ensure no regressions occur there.

In an upcoming clang patch I will be marking the objc_autoreleaseReturnValue
and objc_retainAutoreleaseReturnValue as tail calls unconditionally.  While
it's theoretically true that this is just an optimization, it's an
optimization that we very much want to happen even at -O0, or else ARC
applications become substantially harder to debug.

Part of rdar://12553082

llvm-svn: 169796
2012-12-11 00:18:02 +00:00
Evan Cheng 79e2ca90bc Some enhancements for memcpy / memset inline expansion.
1. Teach it to use overlapping unaligned load / store to copy / set the trailing
   bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is. e.g.
   x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant
   if it's not possible to materialize an integer immediate with a single
   instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they
   are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
   Also increase the threshold to something reasonable (8 for memset, 4 pairs
   for memcpy).

This significantly improves Dhrystone, up to 50% on ARM iOS devices.

rdar://12760078

llvm-svn: 169791
2012-12-10 23:21:26 +00:00
Hal Finkel 66859ae0f6 Use GetUnderlyingObjects in misched
misched used GetUnderlyingObject in order to break false load/store
dependencies, and the -enable-aa-sched-mi feature similarly relied on
GetUnderlyingObject in order to ensure it is safe to use the aliasing analysis.
Unfortunately, GetUnderlyingObject does not recurse through phi nodes, and so
(especially due to LSR) all of these mechanisms failed for
induction-variable-dependent loads and stores inside loops.

This change replaces uses of GetUnderlyingObject with GetUnderlyingObjects
(which will recurse through phi and select instructions) in misched.

Andy reviewed, tested and simplified this patch; Thanks!

llvm-svn: 169744
2012-12-10 18:49:16 +00:00
Craig Topper d8005db486 Teach DAG combine to handle vector add/sub with vectors of all 0s.
llvm-svn: 169727
2012-12-10 08:12:29 +00:00
Craig Topper a183ddb0fe Teach DAG combine to handle vector logical operations with vectors of all 1s or all 0s. These cases can show up when vectors are split for legalizing. Fix some tests that were dependent on these cases not being combined.
llvm-svn: 169684
2012-12-08 22:49:19 +00:00
Nadav Rotem ad0b5fbe8c When we use the BLEND instruction that uses the MSB as a mask, we can remove
the VSRI instruction before it since it does not affect the MSB.

Thanks Craig Topper for suggesting this.

llvm-svn: 169638
2012-12-07 21:43:11 +00:00
Matthew Curtis 7a93811e8b In hexagon convertToHardwareLoop, don't deref end() iterator
In particular, check if MachineBasicBlock::iterator is end() before
using it to call getDebugLoc();

See also this thread on llvm-commits:
   http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20121112/155914.html

llvm-svn: 169634
2012-12-07 21:03:15 +00:00
Nadav Rotem 481e50efe0 X86: Prefer using VPSHUFD over VPERMIL because it has better throughput.
llvm-svn: 169624
2012-12-07 19:01:13 +00:00
Tim Northover 5cc3dc86bb Added Mapping Symbols for ARM ELF
Before this patch, when you objdump an LLVM-compiled file, objdump tried to
decode data-in-code sections as if they were code.  This patch adds the missing
Mapping Symbols, as defined by "ELF for the ARM Architecture" (ARM IHI 0044D).

Patch based on work by Greg Fitzgerald.

llvm-svn: 169609
2012-12-07 16:50:23 +00:00
Dmitri Gribenko 1c704355cf Fix typos in CHECK lines.
Patch by Alexander Zinenko.

llvm-svn: 169547
2012-12-06 21:24:47 +00:00
Nadav Rotem ac450eb59e Fix a bug in the code that merges consecutive stores. Previously we did not
check if loads that happen in between stores alias with the first store in the
chain, only with the second store onwards.

llvm-svn: 169516
2012-12-06 17:34:13 +00:00
Craig Topper 216bcd522b Remove intrinsic specific instructions for (V)MOVQUmr with patterns pointing to the normal instructions.
llvm-svn: 169482
2012-12-06 07:31:16 +00:00
Evan Cheng 16846051db Properly fix the tes.
llvm-svn: 169464
2012-12-06 02:29:29 +00:00
NAKAMURA Takumi 1eccd286fd llvm/test/CodeGen/ARM/extload-knownzero.ll: Try to unbreak, to add -O0. I guess Chad expects fastisel here.
llvm-svn: 169463
2012-12-06 02:22:58 +00:00
Chad Rosier 9f5c68af4c [arm fast-isel] Make the fast-isel implementation of memcpy respect alignment.
rdar://12821569

llvm-svn: 169460
2012-12-06 01:34:31 +00:00
Evan Cheng 5213139f48 Let targets provide hooks that compute known zero and ones for any_extend
and extload's. If they are implemented as zero-extend, or implicitly
zero-extend, then this can enable more demanded bits optimizations. e.g.

define void @foo(i16* %ptr, i32 %a) nounwind {
entry:
  %tmp1 = icmp ult i32 %a, 100
  br i1 %tmp1, label %bb1, label %bb2
bb1:
  %tmp2 = load i16* %ptr, align 2
  br label %bb2
bb2:
  %tmp3 = phi i16 [ 0, %entry ], [ %tmp2, %bb1 ]
  %cmp = icmp ult i16 %tmp3, 24
  br i1 %cmp, label %bb3, label %exit
bb3:
  call void @bar() nounwind
  br label %exit
exit:
  ret void
}

This compiles to the followings before:
        push    {lr}
        mov     r2, #0
        cmp     r1, #99
        bhi     LBB0_2
@ BB#1:                                 @ %bb1
        ldrh    r2, [r0]
LBB0_2:                                 @ %bb2
        uxth    r0, r2
        cmp     r0, #23
        bhi     LBB0_4
@ BB#3:                                 @ %bb3
        bl      _bar
LBB0_4:                                 @ %exit
        pop     {lr}
        bx      lr

The uxth is not needed since ldrh implicitly zero-extend the high bits. With
this change it's eliminated.

rdar://12771555

llvm-svn: 169459
2012-12-06 01:28:01 +00:00
Andrew Trick fda7a8832d RegisterPressureTracker: fix findUseBetween to handle DebugValue
llvm-svn: 169427
2012-12-05 21:37:50 +00:00
Andrew Trick 7f7cee39ab RegisterPresssureTracker: Track live physical register by unit.
This is much simpler to reason about, more efficient, and
fixes some corner cases involving implicit super-register defs.
Fixed rdar://12797931.

llvm-svn: 169425
2012-12-05 21:37:42 +00:00
Justin Holewinski fb711156ae [NVPTX] Fix crash with unnamed struct arguments
Patch by Eric Holk

llvm-svn: 169418
2012-12-05 20:50:28 +00:00
Jyotsna Verma 90295156d8 Use multiclass to define store instructions with base+immediate offset
addressing mode and immediate stored value.

llvm-svn: 169408
2012-12-05 19:32:03 +00:00
Elena Demikhovsky cd3c1c4a16 Simplified BLEND pattern matching for shuffles.
Generate VPBLENDD for AVX2 and VPBLENDW for v16i16 type on AVX2.

llvm-svn: 169366
2012-12-05 09:24:57 +00:00
Evan Cheng d31802c1f6 Add x86 isel lowering logic to form bit test with inverted condition. e.g.
x ^ -1.

Patch by David Majnemer.
rdar://12755626

llvm-svn: 169339
2012-12-05 00:10:38 +00:00
Evan Cheng b4eae1361c ARM custom lower ctpop for vector types. Patch by Pete Couperus.
llvm-svn: 169325
2012-12-04 22:41:50 +00:00
Bill Wendling d7767125d5 Use the 'count' attribute to calculate the upper bound of an array.
The count attribute is more accurate with regards to the size of an array. It
also obviates the upper bound attribute in the subrange. We can also better
handle an unbound array by setting the count to -1 instead of the lower bound to
1 and upper bound to 0.

llvm-svn: 169312
2012-12-04 21:34:03 +00:00
Bill Schmidt ca4a0c9dbd This patch introduces initial-exec model support for thread-local storage
on 64-bit PowerPC ELF.

The patch includes code to handle external assembly and MC output with the
integrated assembler.  It intentionally does not support the "old" JIT.

For the initial-exec TLS model, the ABI requires the following to calculate
the address of external thread-local variable x:

 Code sequence            Relocation                  Symbol
  ld 9,x@got@tprel(2)      R_PPC64_GOT_TPREL16_DS      x
  add 9,9,x@tls            R_PPC64_TLS                 x

The register 9 is arbitrary here.  The linker will replace x@got@tprel
with the offset relative to the thread pointer to the generated GOT
entry for symbol x.  It will replace x@tls with the thread-pointer
register (13).

The two test cases verify correct assembly output and relocation output
as just described.

PowerPC-specific selection node variants are added for the two
instructions above:  LD_GOT_TPREL and ADD_TLS.  These are inserted
when an initial-exec global variable is encountered by
PPCTargetLowering::LowerGlobalTLSAddress(), and later lowered to
machine instructions LDgotTPREL and ADD8TLS.  LDgotTPREL is a pseudo
that uses the same LDrs support added for medium code model's LDtocL,
with a different relocation type.

The rest of the processing is straightforward.

llvm-svn: 169281
2012-12-04 16:18:08 +00:00
Bill Wendling bfc0e5725f Add a 'count' field to the DWARF subrange.
The count field is necessary because there isn't a difference between the 'lo'
and 'hi' attributes for a one-element array and a zero-element array. When the
count is '0', we know that this is a zero-element array. When it's >=1, then
it's a normal constant sized array. When it's -1, then the array is unbounded.

llvm-svn: 169218
2012-12-04 06:20:49 +00:00
Manman Ren f563941adc Stack Alignment: when creating stack objects in MachineFrameInfo, make sure
the alignment is clamped to TargetFrameLowering.getStackAlignment if the target
does not support stack realignment or the option "realign-stack" is off.

This will cause miscompile if the address is treated as aligned and add is
replaced with or in DAGCombine.

Added a bool StackRealignable to TargetFrameLowering to check whether stack
realignment is implemented for the target. Also added a bool RealignOption
to MachineFrameInfo to check whether the option "realign-stack" is on.

rdar://12713765

llvm-svn: 169197
2012-12-04 00:52:33 +00:00
Nadav Rotem 1157e1410c Allow merging multiple store sequences on the same chain.
llvm-svn: 169111
2012-12-02 17:14:09 +00:00
Eli Bendersky b7b1ffc8e7 Fix an invalid regex in the test
llvm-svn: 169108
2012-12-02 15:46:02 +00:00
Andrew Trick b767d1eba8 misched: Fix RegisterPressureTracker handling of DebugVals.
Assertion failed: (TopRPTracker.getPos() == RegionBegin && "bad initial Top tracker").
rdar://12790302.

llvm-svn: 169072
2012-12-01 01:22:49 +00:00
Andrew Trick d5953622ce misched: Fix the DAG builder to handle an undef operand at ExitSU.
Assertion failed: (VNI && "No value to read by operand")
rdar://12790267.

llvm-svn: 169071
2012-12-01 01:22:44 +00:00
Andrew Trick a01302182c misched: Fix LiveInterval update to better handle DebugVal.
Assertion failed: (itr != mi2iMap.end() && "Instruction not found in maps.")
rdar://12777252.

llvm-svn: 169070
2012-12-01 01:22:41 +00:00