Commit Graph

1336 Commits

Author SHA1 Message Date
Kit Barton 13894c7f35 [PPC] Implement vmrgew and vmrgow instructions
This patch adds support for the vector merge even word and vector merge odd word
instructions introduced in POWER8.

Phabricator review: http://reviews.llvm.org/D10704

llvm-svn: 240650
2015-06-25 15:17:40 +00:00
Rafael Espindola 9ac06a0e6b Improve the --expand-relocs handling of MachO.
In a relocation target can take 3 basic forms

* A r_value in scattered relocations.
* A symbol in external relocations.
* A section is non-external relocations.

Have the dump reflect that. With this change we go from

CHECK-NEXT:       Extern: 0
CHECK-NEXT:       Type: X86_64_RELOC_SUBTRACTOR (5)
CHECK-NEXT:       Symbol: 0x2
CHECK-NEXT:       Scattered: 0

To just

// CHECK-NEXT:       Type: X86_64_RELOC_SUBTRACTOR (5)
// CHECK-NEXT:       Section: __data (2)

Since the relocation is with a section, we print the seciton name and don't
need to say that it is not scattered or external.

Someone motivated can add further special cases for things like
ARM64_RELOC_ADDEND and ARM_RELOC_PAIR.

llvm-svn: 240073
2015-06-18 22:38:20 +00:00
Rafael Espindola aaaa575f71 Use --expand-relocs in a test. It will make the next change easier to read.
llvm-svn: 240053
2015-06-18 20:57:35 +00:00
David Majnemer 7fddeccb8b Move the personality function from LandingPadInst to Function
The personality routine currently lives in the LandingPadInst.

This isn't desirable because:
- All LandingPadInsts in the same function must have the same
  personality routine.  This means that each LandingPadInst beyond the
  first has an operand which produces no additional information.

- There is ongoing work to introduce EH IR constructs other than
  LandingPadInst.  Moving the personality routine off of any one
  particular Instruction and onto the parent function seems a lot better
  than have N different places a personality function can sneak onto an
  exceptional function.

Differential Revision: http://reviews.llvm.org/D10429

llvm-svn: 239940
2015-06-17 20:52:32 +00:00
Kit Barton 4f79f96fd7 Properly handle the mftb instruction.
The mftb instruction was incorrectly marked as deprecated in the PPC
Backend. Instead, it should not be treated as deprecated, but rather be
implemented using the mfspr instruction. A similar patch was put into GCC last
year. Details can be found at:

https://sourceware.org/ml/binutils/2014-11/msg00383.html.
This change will replace instances of the mftb instruction with the mfspr
instruction for all CPUs except 601 and pwr3. This will also be the default
behaviour.

Additional details can be found in:

https://llvm.org/bugs/show_bug.cgi?id=23680

Phabricator review: http://reviews.llvm.org/D10419

llvm-svn: 239827
2015-06-16 16:01:15 +00:00
Nemanja Ivanovic ea1db8a697 LLVM support for vector quad bit permute and gather instructions through builtins
This patch corresponds to review:
http://reviews.llvm.org/D10096

This is the back end portion of the patch related to D10095.
The patch adds the instructions and back end intrinsics for:
vbpermq
vgbbd

llvm-svn: 239505
2015-06-11 06:21:25 +00:00
Nemanja Ivanovic 376e17364f Add support for VSX FMA single-precision instructions to the PPC back end
This patch corresponds to review:
http://reviews.llvm.org/D9941

It adds the various FMA instructions introduced in the version 2.07 of
the ISA along with the testing for them. These are operations on single
precision scalar values in VSX registers.

llvm-svn: 238578
2015-05-29 17:13:25 +00:00
Kit Barton 6646033e6e This patch adds support for the vector quadword add/sub instructions introduced
in POWER8:

vadduqm
vaddeuqm
vaddcuq
vaddecuq
vsubuqm
vsubeuqm
vsubcuq
vsubecuq
In addition to adding the instructions themselves, it also adds support for the
v1i128 type for intrinsics (Intrinsics.td, Function.cpp, and
IntrinsicEmitter.cpp).

http://reviews.llvm.org/D9081

llvm-svn: 238144
2015-05-25 15:49:26 +00:00
Hal Finkel 5f2a1379ef [PowerPC] Fix fast-isel when compare is split from branch
When the compare feeding a branch was in a different BB from the branch, we'd
try to "regenerate" the compare in the block with the branch, possibly trying
to make use of values not available there. Copy a page from AArch64's play book
here to fix the problem (at least in terms of correctness).

Fixes PR23640.

llvm-svn: 238097
2015-05-23 12:18:10 +00:00
Bill Schmidt e13ac91c5d [PPC64] Handle vpkudum mask pattern correctly when vpkudum isn't available
My recent patch to add support for ISA 2.07 vector pack/unpack
instructions didn't properly check for availability of the vpkudum
instruction when recognizing it as a special vector shuffle case.
This causes us to leave the vector shuffle in place (rather than
converting it to a vector permute) so that it can be recognized later
as a vpkudum, but that pattern is invalid for processors prior to
POWER8.  Thus LLVM crashes with an "unable to select" message.  We
observed this since one of our buildbots is configured to generate
code for a POWER7.

This patch fixes the problem by checking for availability of the
vpkudum instruction during custom lowering of vector shuffles.

I've added a test case variant for the vpkudum pattern when the
instruction isn't available.

llvm-svn: 237952
2015-05-21 20:48:49 +00:00
Nemanja Ivanovic f02def6cbc Add support for VSX scalar single-precision arithmetic in the PPC target
http://reviews.llvm.org/D9891
Following up on the VSX single precision loads and stores added earlier, this
adds support for elementary arithmetic operations on single precision values
in VSX registers. These instructions utilize the new VSSRC register class.
Instructions added:
xsaddsp
xsdivsp
xsmulsp
xsresp
xsrsqrtesp
xssqrtsp
xssubsp

llvm-svn: 237937
2015-05-21 19:32:49 +00:00
Hal Finkel 8340de142c [PowerPC] Add extra r2 read deps on @toc@l relocations
If some commits are happy, and some commits are sad, this is a sad commit. It
is sad because it restricts instruction scheduling to work around a binutils
linker bug, and moreover, one that may never be fixed. On 2012-05-21, GCC was
updated not to produce code triggering this bug, and now we'll do the same...

When resolving an address using the ELF ABI TOC pointer, two relocations are
generally required: one for the high part and one for the low part. Only
the high part generally explicitly depends on r2 (the TOC pointer). And, so,
we might produce code like this:

.Ltmp526:
        addis 3, 2, .LC12@toc@ha
.Ltmp1628:
        std 2, 40(1)
        ld 5, 0(27)
        ld 2, 8(27)
        ld 11, 16(27)
        ld 3, .LC12@toc@l(3)
        rldicl 4, 4, 0, 32
        mtctr 5
        bctrl
        ld 2, 40(1)

And there is nothing wrong with this code, as such, but there is a linker bug
in binutils (https://sourceware.org/bugzilla/show_bug.cgi?id=18414) that will
misoptimize this code sequence to this:
        nop
        std     r2,40(r1)
        ld      r5,0(r27)
        ld      r2,8(r27)
        ld      r11,16(r27)
        ld      r3,-32472(r2)
        clrldi  r4,r4,32
        mtctr   r5
        bctrl
        ld      r2,40(r1)
because the linker does not know (and does not check) that the value in r2
changed in between the instruction using the .LC12@toc@ha (TOC-relative)
relocation and the instruction using the .LC12@toc@l(3) relocation.
Because it finds these instructions using the relocations (and not by
scanning the instructions), it has been asserted that there is no good way
to detect the change of r2 in between. As a result, this bug may never be
fixed (i.e. it may become part of the definition of the ABI). GCC was
updated to add extra dependencies on r2 to instructions using the @toc@l
relocations to avoid this problem, and we'll do the same here.

This is done as a separate pass because:
 1. These extra r2 dependencies are not really properties of the
    instructions, but rather due to a linker bug, and maybe one day we'll be
    able to get rid of them when targeting linkers without this bug (and,
    thus, keeping the logic centralized here will make that
    straightforward).
 2. There are ISel-level peephole optimizations that propagate the @toc@l
    relocations to some user instructions, and so the exta dependencies do
    not apply only to a fixed set of instructions (without undesirable
    definition replication).

The test case was reduced with the help of bugpoint, with minimal cleaning. I'm
looking forward to our upcoming MI serialization support, and with that, much
better tests can be created.

llvm-svn: 237556
2015-05-18 06:25:59 +00:00
Bill Schmidt 5ed84cdba8 [PPC64] Add vector pack/unpack support from ISA 2.07
This patch adds support for the following new instructions in the
Power ISA 2.07:

  vpksdss
  vpksdus
  vpkudus
  vpkudum
  vupkhsw
  vupklsw

These instructions are available through the vec_packs, vec_packsu,
vec_unpackh, and vec_unpackl built-in interfaces.  These are
lane-sensitive instructions, so the built-ins have different
implementations for big- and little-endian, and the instructions must
be marked as killing the vector swap optimization for now.

The first three instructions perform saturating pack operations.  The
fourth performs a modulo pack operation, which means it can be
represented with a vector shuffle, and conversely the appropriate
vector shuffles may cause this instruction to be generated.  The other
instructions are only generated via built-in support for now.

Appropriate tests have been added.

There is a companion patch to clang for the rest of this support.

llvm-svn: 237499
2015-05-16 01:02:12 +00:00
James Y Knight 38514d2177 Fix test added in r236850 for OSX builders.
Need to specify triple so that llvm emits the asm syntax that the
test expected.

llvm-svn: 236855
2015-05-08 14:04:54 +00:00
James Y Knight 284e7b3d6c Fix alignment checks in MergeConsecutiveStores.
1) check whether the alignment of the memory is sufficient for the
*merged* store or load to be efficient.

Not doing so can result in some ridiculously poor code generation, if
merging creates a vector operation which must be aligned but isn't.

2) DON'T check that the alignment of each load/store is equal. If
you're merging 2 4-byte stores, the first *might* have 8-byte
alignment, but the second certainly will have 4-byte alignment. We do
want to allow those to be merged.

llvm-svn: 236850
2015-05-08 13:47:01 +00:00
Nemanja Ivanovic f3c94b1e3c Add VSX Scalar loads and stores to the PPC back end
This patch corresponds to review:
http://reviews.llvm.org/D9440

It adds a new register class to the PPC back end to contain single precision
values in VSX registers. Additionally, it adds scalar loads and stores for
VSX registers.

llvm-svn: 236755
2015-05-07 18:24:05 +00:00
Bill Schmidt 5fe2e25f7c [PPC64LE] Adjust vector splats during VSX swap optimization
The initial code drop for VSX swap optimization permitted the
optimization only when all operations in a web of related computation
are lane-insensitive.  For some lane-sensitive operations, we can
still permit the optimization provided that we make adjustments to
those operations.  This patch adds special handling for vector splats
so that their presence doesn't kill the optimization.

Vector splats are lane-sensitive since they identify by number a
vector element to be used as the source of a splat.  When swap
optimizations take place, the desired vector element will move to the
opposite doubleword of the quadword vector.  We thus replace the index
I by (I + N/2) % N, where N is the number of elements in the vector.

A new test case is added to test that swap optimization succeeds when
vector splats are present, and that the proper input element is used
as the source of the splat.

An ancillary change removes SH_BUILDVEC as one of the kinds of special
handling that may be required by VSX swap optimization.  From
experience with GCC, I had expected to need some modifications for
vector build operations, but I did not find that to be the case.

llvm-svn: 236606
2015-05-06 15:40:46 +00:00
Kit Barton d4eb73c00e This patch adds ABI support for v1i128 data type.
It adds v1i128 to the appropriate register classes and checks parameter passing
and return values.

This is related to http://reviews.llvm.org/D9081, which will add instructions
that exploit the v1i128 datatype.

Phabricator review: http://reviews.llvm.org/D9475

llvm-svn: 236503
2015-05-05 16:10:44 +00:00
Duncan P. N. Exon Smith a9308c49ef IR: Give 'DI' prefix to debug info metadata
Finish off PR23080 by renaming the debug info IR constructs from `MD*`
to `DI*`.  The last of the `DIDescriptor` classes were deleted in
r235356, and the last of the related typedefs removed in r235413, so
this has all baked for about a week.

Note: If you have out-of-tree code (like a frontend), I recommend that
you get everything compiling and tests passing with the *previous*
commit before updating to this one.  It'll be easier to keep track of
what code is using the `DIDescriptor` hierarchy and what you've already
updated, and I think you're extremely unlikely to insert bugs.  YMMV of
course.

Back to *this* commit: I did this using the rename-md-di-nodes.sh
upgrade script I've attached to PR23080 (both code and testcases) and
filtered through clang-format-diff.py.  I edited the tests for
test/Assembler/invalid-generic-debug-node-*.ll by hand since the columns
were off-by-three.  It should work on your out-of-tree testcases (and
code, if you've followed the advice in the previous paragraph).

Some of the tests are in badly named files now (e.g.,
test/Assembler/invalid-mdcompositetype-missing-tag.ll should be
'dicompositetype'); I'll come back and move the files in a follow-up
commit.

llvm-svn: 236120
2015-04-29 16:38:44 +00:00
Bill Schmidt fe723b9a6d [PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations
This patch adds a new SSA MI pass that runs on little-endian PPC64
code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors
without alignment constraints are accomplished for little-endian using
lxvd2x/xxswapd and xxswapd/stxvd2x. The existence of the additional
xxswapd instructions hurts performance in comparison with big-endian
code, but they are necessary in the general case to support correct
semantics.

However, the general case does not apply to most vector code. Many
vector instructions are lane-insensitive; they do not "care" which
lanes the parallel computations are performed within, provided that
the resulting data is stored into the correct locations. Thus this
pass looks for computations that perform only lane-insensitive
operations, and remove the unnecessary swaps from loads and stores in
such computations.

Future improvements will allow computations using certain
lane-sensitive operations to also be optimized in this manner, by
modifying the lane-sensitive operations to account for the permuted
order of the lanes. However, this patch only adds the infrastructure
to permit this; no lane-sensitive operations are optimized at this
time.

This code is heavily exercised by the various vectorizing applications
in the projects/test-suite tree. For the time being, I have only added
one simple test case to demonstrate what the pass is doing. Although
it is quite simple, it provides coverage for much of the code,
including the special case handling of copies and subreg-to-reg
operations feeding the swaps. I plan to add additional tests in the
future as I fill in more of the "special handling" code.

Two existing tests were affected, because they expected the swaps to
be present, but they are now removed.

llvm-svn: 235910
2015-04-27 19:57:34 +00:00
Hal Finkel d86e90abdd [PowerPC] Use sync inst alias when printing
So long as the choice between printing msync and sync is not ambiguous, we can
print 'sync 0' and just 'sync'.

llvm-svn: 235663
2015-04-23 23:05:08 +00:00
Hal Finkel 7c5cb066d0 [PowerPC] Enable printing instructions using aliases
TableGen had been nicely generating code to print a number of instructions using
shorter aliases (and PowerPC has plenty of short mnemonics), but we were not
calling it. For some of the aliases we support in the parser, TableGen can't
infer the "inverse" alias relationship, so there is still more to do.

Thus, after some hours of updating test cases...

llvm-svn: 235616
2015-04-23 18:30:38 +00:00
Hans Wennborg 0867b151c9 Re-commit r235560: Switch lowering: extract jump tables and bit tests before building binary tree (PR22262)
Third time's the charm. The previous commit was reverted as a
reverse for-loop in SelectionDAGBuilder::lowerWorkItem did 'I--'
on an iterator at the beginning of a vector, causing asserts
when using debugging iterators. This commit fixes that.

llvm-svn: 235608
2015-04-23 16:45:24 +00:00
Aaron Ballman 0be238cebd Revert r235560; this commit was causing several failed assertions in Debug builds using MSVC's STL. The iterator is being used outside of its valid range.
llvm-svn: 235597
2015-04-23 13:41:59 +00:00
Hans Wennborg 15823d49b6 Switch lowering: extract jump tables and bit tests before building binary tree (PR22262)
This is a re-commit of r235101, which also fixes the problems with the previous patch:

- Switches with only a default case and non-fallthrough were handled incorrectly

- The previous patch tickled a bug in PowerPC Early-Return Creation which is fixed here.

> This is a major rewrite of the SelectionDAG switch lowering. The previous code
> would lower switches as a binary tre, discovering clusters of cases
> suitable for lowering by jump tables or bit tests as it went along. To increase
> the likelihood of finding jump tables, the binary tree pivot was selected to
> maximize case density on both sides of the pivot.
>
> By not selecting the pivot in the middle, the binary trees would not always
> be balanced, leading to performance problems in the generated code.
>
> This patch rewrites the lowering to search for clusters of cases
> suitable for jump tables or bit tests first, and then builds the binary
> tree around those clusters. This way, the binary tree will always be balanced.
>
> This has the added benefit of decoupling the different aspects of the lowering:
> tree building and jump table or bit tests finding are now easier to tweak
> separately.
>
> For example, this will enable us to balance the tree based on profile info
> in the future.
>
> The algorithm for finding jump tables is quadratic, whereas the previous algorithm
> was O(n log n) for common cases, and quadratic only in the worst-case. This
> doesn't seem to be major problem in practice, e.g. compiling a file consisting
> of a 10k-case switch was only 30% slower, and such large switches should be rare
> in practice. Compiling e.g. gcc.c showed no compile-time difference.  If this
> does turn out to be a problem, we could limit the search space of the algorithm.
>
> This commit also disables all optimizations during switch lowering in -O0.
>
> Differential Revision: http://reviews.llvm.org/D8649

llvm-svn: 235560
2015-04-22 23:14:56 +00:00
Hal Finkel 0d49cf2645 [DAGCombine] Disable select(c, load,load) for indexed loads
This turned up after r235333, but was a pre-existing bug. The optimization
which transforms select(c, load, load) into a load of a select of the addresses
does not handle indexed loads (pre/post inc/dec). However, it did not check for
them either, leading to a crash if it tried to transform one of them.

llvm-svn: 235497
2015-04-22 11:32:25 +00:00
Olivier Sallenave b99c2eb0f0 Refactoring and enhancement to FMA combine.
llvm-svn: 235344
2015-04-20 20:29:40 +00:00
Hal Finkel 1e5733bbed [InlineAsm] Remove EarlyClobber on registers that are also inputs
When an inline asm call has an output register marked as early-clobber, but
that same register is also an input operand, what should we do? GCC accepts
this, and is documented to accept this for read/write operands saying,
"Furthermore, if the earlyclobber operand is also a read/write operand, then
that operand is written only after it's used." For write-only operands, the
situation seems less clear, but I have at least one existing codebase that
assumes this will work, in part because it has syscall macros like this:

({                                                                         \
  register uint64_t r0 __asm__ ("r0") = (__NR_ ## name);                   \
  register uint64_t r3 __asm__ ("r3") = ((uint64_t) (arg0));               \
  register uint64_t r4 __asm__ ("r4") = ((uint64_t) (arg1));               \
  register uint64_t r5 __asm__ ("r5") = ((uint64_t) (arg2));               \
  __asm__ __volatile__                                                     \
  ("sc"                                                                    \
   : "=&r"(r0),"=&r"(r3),"=&r"(r4),"=&r"(r5)                               \
   :   "0"(r0),  "1"(r3),  "2"(r4),  "3"(r5)                               \
   : "r6","r7","r8","r9","r10","r11","r12","cr0","memory");                \
  r3;                                                                      \
})

Furthermore, with register aliases and subregister relationships that only the
backend knows about, rejecting this in the frontend seems like a difficult
proposition (if we wanted to do so). However, keeping the early-clobber flag on
the INLINEASM MI does not work for us, because it will cause the register's
live interval to end to soon (so it will not appear defined to be used as an
input).

Fortunately, fixing this does not seem hard: When forming the INLINEASM MI,
check to see if any of the early-clobber outputs are also inputs, and if so,
remove the early-clobber flag.

llvm-svn: 235283
2015-04-20 00:01:30 +00:00
David Blaikie 23af64846f [opaque pointer type] Add textual IR support for explicit type parameter to the call instruction
See r230786 and r230794 for similar changes to gep and load
respectively.

Call is a bit different because it often doesn't have a single explicit
type - usually the type is deduced from the arguments, and just the
return type is explicit. In those cases there's no need to change the
IR.

When that's not the case, the IR usually contains the pointer type of
the first operand - but since typed pointers are going away, that
representation is insufficient so I'm just stripping the "pointerness"
of the explicit type away.

This does make the IR a bit weird - it /sort of/ reads like the type of
the first operand: "call void () %x(" but %x is actually of type "void
()*" and will eventually be just of type "ptr". But this seems not too
bad and I don't think it would benefit from repeating the type
("void (), void () * %x(" and then eventually "void (), ptr %x(") as has
been done with gep and load.

This also has a side benefit: since the explicit type is no longer a
pointer, there's no ambiguity between an explicit type and a function
that returns a function pointer. Previously this case needed an explicit
type (eg: a function returning a void() function was written as
"call void () () * @x(" rather than "call void () * @x(" because of the
ambiguity between a function returning a pointer to a void() function
and a function returning void).

No ambiguity means even function pointer return types can just be
written alone, without writing the whole function's type.

This leaves /only/ the varargs case where the explicit type is required.

Given the special type syntax in call instructions, the regex-fu used
for migration was a bit more involved in its own unique way (as every
one of these is) so here it is. Use it in conjunction with the apply.sh
script and associated find/xargs commands I've provided in rr230786 to
migrate your out of tree tests. Do let me know if any of this doesn't
cover your cases & we can iterate on a more general script/regexes to
help others with out of tree tests.

About 9 test cases couldn't be automatically migrated - half of those
were functions returning function pointers, where I just had to manually
delete the function argument types now that we didn't need an explicit
function type there. The other half were typedefs of function types used
in calls - just had to manually drop the * from those.

import fileinput
import sys
import re

pat = re.compile(r'((?:=|:|^|\s)call\s(?:[^@]*?))(\s*$|\s*(?:(?:\[\[[a-zA-Z0-9_]+\]\]|[@%](?:(")?[\\\?@a-zA-Z0-9_.]*?(?(3)"|)|{{.*}}))(?:\(|$)|undef|inttoptr|bitcast|null|asm).*$)')
addrspace_end = re.compile(r"addrspace\(\d+\)\s*\*$")
func_end = re.compile("(?:void.*|\)\s*)\*$")

def conv(match, line):
  if not match or re.search(addrspace_end, match.group(1)) or not re.search(func_end, match.group(1)):
    return line
  return line[:match.start()] + match.group(1)[:match.group(1).rfind('*')].rstrip() + match.group(2) + line[match.end():]

for line in sys.stdin:
  sys.stdout.write(conv(re.search(pat, line), line))

llvm-svn: 235145
2015-04-16 23:24:18 +00:00
Hans Wennborg a9e2057416 Revert the switch lowering change (r235101, r235103, r235106)
Looks like it broke the sanitizer-ppc64-linux1 build. Reverting for now.

llvm-svn: 235108
2015-04-16 15:43:26 +00:00
Hans Wennborg d403664ed8 Switch lowering: extract jump tables and bit tests before building binary tree (PR22262)
This is a major rewrite of the SelectionDAG switch lowering. The previous code
would lower switches as a binary tre, discovering clusters of cases
suitable for lowering by jump tables or bit tests as it went along. To increase
the likelihood of finding jump tables, the binary tree pivot was selected to
maximize case density on both sides of the pivot.

By not selecting the pivot in the middle, the binary trees would not always
be balanced, leading to performance problems in the generated code.

This patch rewrites the lowering to search for clusters of cases
suitable for jump tables or bit tests first, and then builds the binary
tree around those clusters. This way, the binary tree will always be balanced.

This has the added benefit of decoupling the different aspects of the lowering:
tree building and jump table or bit tests finding are now easier to tweak
separately.

For example, this will enable us to balance the tree based on profile info
in the future.

The algorithm for finding jump tables is O(n^2), whereas the previous algorithm
was O(n log n) for common cases, and quadratic only in the worst-case. This
doesn't seem to be major problem in practice, e.g. compiling a file consisting
of a 10k-case switch was only 30% slower, and such large switches should be rare
in practice. Compiling e.g. gcc.c showed no compile-time difference.  If this
does turn out to be a problem, we could limit the search space of the algorithm.

This commit also disables all optimizations during switch lowering in -O0.

Differential Revision: http://reviews.llvm.org/D8649

llvm-svn: 235101
2015-04-16 14:49:23 +00:00
Rafael Espindola 10f3de6889 Update tests to not be as dependent on section numbers.
Many of these predate llvm-readobj. With elf-dump we had to match
a relocation to symbol number and symbol number to symbol name or
section number.

llvm-svn: 235015
2015-04-15 15:59:37 +00:00
Hal Finkel 5551f2528c [PowerPC] Really iterate over all loops in PPCLoopDataPrefetch/PPCLoopPreIncPrep
When I fixed these a couple of days ago to iterate over all loops, not just
depth == 1 loops, I inadvertently made it such that we'd only look at the first
top-level loop. Make sure that we really look at all of them.

llvm-svn: 234705
2015-04-12 17:18:56 +00:00
Hal Finkel 58f5f9c393 [PowerPC] Disable part-word atomics on the P7
As it turns out, even though these are part of ISA 2.06, the P7 does not
support them (or, at least, not any P7s we're tested so far).

llvm-svn: 234686
2015-04-11 13:40:36 +00:00
Nemanja Ivanovic c38b5311cb Add direct moves to/from VSR and exploit them for FP/INT conversions
This patch corresponds to review:
http://reviews.llvm.org/D8928

It adds direct move instructions to/from VSX registers to GPR's. These are
exploited for FP <-> INT conversions.

llvm-svn: 234682
2015-04-11 10:40:42 +00:00
Hal Finkel ffb460fdf0 [PowerPC] Fix PPCLoopPreIncPrep for depth > 1 loops
This pass had the same problem as the data-prefetching pass: it was only
checking for depth == 1 loops in practice. Fix that, add some debugging
statements, and make sure that, when we grab an AddRec, it is for the loop we
expect.

llvm-svn: 234670
2015-04-11 00:33:08 +00:00
Hal Finkel a9fceb803d [PowerPC] Prefetching should also consider depth > 1 loops
Iterating over loops from the LoopInfo instance only provides top-level loops.
We need to search the whole tree of loops to find the inner ones.

llvm-svn: 234603
2015-04-10 15:05:02 +00:00
Hal Finkel 93138503ae [PowerPC] Don't crash on PPC32 i64 fp_to_uint on modern cores
When we have an instruction for this (and, thus, don't generate a runtime
call), we need to custom type legalize this (in a trivial way, just as we do
for fp_to_sint).

Fixes PR23173.

llvm-svn: 234561
2015-04-10 03:39:00 +00:00
Nemanja Ivanovic c09047916a Add LLVM support for remaining integer divide and permute instructions from ISA 2.06
This is the patch corresponding to review:
http://reviews.llvm.org/D8406

It adds some missing instructions from ISA 2.06 to the PPC back end.

llvm-svn: 234546
2015-04-09 23:54:37 +00:00
Rafael Espindola 1c84271694 Revert "Refactoring and enhancement to FMA combine."
This reverts commit r234513. It was failing on the bots.

llvm-svn: 234518
2015-04-09 18:29:32 +00:00
Olivier Sallenave 53703d0862 Refactoring and enhancement to FMA combine.
llvm-svn: 234513
2015-04-09 17:55:26 +00:00
Eric Christopher d9bbc4d2ee Strip trailing whitespace and reword explanatory comment.
llvm-svn: 234078
2015-04-04 02:26:47 +00:00
Bill Schmidt 91dd765a04 [PowerPC] Enable splat generation for BUILD_VECTOR with little endian
When enabling PPC64LE, I disabled some optimizations of BUILD_VECTOR
nodes for little endian because wrong results were produced.  I've
subsequently investigated and found this is due to a call to
BuildVectorSDNode::isConstantSplat that was always specifying
big-endian.  With this changed to correctly identify the target
endianness, the optimizations work as expected.

I found another case of a call to the same method with big-endian
hardcoded, in PPC::isAllNegativeZeroVector().  I discovered this was
an orphaned method with no callers, so I've just removed it.

The existing test/CodeGen/PowerPC/vec_constants.ll checks these
optimizations, so for testing I've just added a variant for little
endian.

llvm-svn: 234011
2015-04-03 13:48:24 +00:00
Simon Pilgrim ed2ba33ba0 [DAGCombiner] Combine shuffles of BUILD_VECTOR and SCALAR_TO_VECTOR
This patch attempts to fold the shuffling of 'scalar source' inputs - BUILD_VECTOR and SCALAR_TO_VECTOR nodes - if the shuffle node is the only user. This folds away a lot of unnecessary shuffle nodes, and allows quite a bit of constant folding that was being missed.

Differential Revision: http://reviews.llvm.org/D8516

llvm-svn: 234004
2015-04-03 10:02:21 +00:00
Hal Finkel 50271aae7e [PowerPC] FastISel can't handle i1 return values when using CR bits
Under normal circumstances, use of CR bits is disabled when running at -O0, but
it is enabled by default otherwise, and if you have optnone functions, they'll
still generally be generated with crbits turned on (because nothing else turns
them off). FastISel can't handle most things dealing with i1 values when using
CR bits, and checks for that, but was not checking the return type on
functions; we can't fast-isel function calls with i1 return values either when
using CR bits for boolean values.

Fixes PR22664.

llvm-svn: 233775
2015-04-01 00:40:48 +00:00
Hal Finkel 52368d4437 [PowerPC] Don't use a vector preferred memory type at -O0
Even at -O0, we fall back to SDAG when we hit intrinsics, and if the intrinsic
is a memset/memcpy/etc. we might normally use vector types. At -O0, this is
probably not a good idea (because, if there is a bug in the lowering code,
there would be no good way to turn it off). At -O0, only use scalar preferred
types.

Related to PR22754.

llvm-svn: 233755
2015-03-31 20:56:09 +00:00
Hal Finkel 17b6d77a5f [SDAG] Handle non-integer preferred memset types for non-constant values
The existing code in getMemsetValue only handled integer-preferred types when
the fill value was not a constant. Make this more robust in two ways:

  1. If the preferred type is a floating-point value, do the mul-splat trick on
     the corresponding integer type and then bitcast.
  2. If the preferred type is a vector, do the mul-splat trick on one vector
     element, and then build a vector out of them.

Fixes PR22754 (although, we should also turn off use of vector types at -O0).

llvm-svn: 233749
2015-03-31 20:35:26 +00:00
Duncan P. N. Exon Smith 988a7f8b79 DebugInfo: Fix bad debug info for compile units and types
Fix debug info in these tests, which started failing with a WIP patch to
verify compile units and types.  The problems look like they were all
caused by bitrot.  They fell into these categories:

  - Using `!{i32 0}` instead of `!{}`.
  - Using `!{null}` instead of `!{}`.
  - Using `!MDExpression()` instead of `!{}`.
  - Using `!8` instead of `!{!8}`.
  - `file:` references that pointed at `MDCompileUnit`s instead of the
    same `MDFile` as the compile unit.
  - `file:` references that were numerically off-by-one or (off-by-ten).

llvm-svn: 233415
2015-03-27 20:46:33 +00:00
Andrew Trick 43adfb30d5 Complete the MachineScheduler fix made way back in r210390.
"Fix the MachineScheduler's logic for updating ready times for in-order.
 Now the scheduler updates a node's ready time as soon as it is
 scheduled, before releasing dependent nodes."

This fix was only made in one variant of the ScheduleDAGMI driver.
Francois de Ferriere reported the issue in the other bit of code where
it was also needed.
I never got around to coming up with a test case, but it's an
obvious fix that shouldn't be delayed any longer.
I'll try to refactor this code a little better.

I did verify performance on a wide variety of targets and saw no
negative impact with this fix.

llvm-svn: 233366
2015-03-27 06:10:13 +00:00
Eric Christopher 9f74ca5e0f Testcase for r233239.
llvm-svn: 233240
2015-03-26 00:57:33 +00:00
Kit Barton 535e69de34 Add Hardware Transactional Memory (HTM) Support
This patch adds Hardware Transaction Memory (HTM) support supported by ISA 2.07
(POWER8). The intrinsic support is based on GCC one [1], but currently only the
'PowerPC HTM Low Level Built-in Function' are implemented.

The HTM instructions follows the RC ones and the transaction initiation result
is set on RC0 (with exception of tcheck). Currently approach is to create a
register copy from CR0 to GPR and comapring. Although this is suboptimal, since
the branch could be taken directly by comparing the CR0 value, it generates code
correctly on both test and branch and just return value. A possible future
optimization could be elimitate the MFCR instruction to branch directly.

The HTM usage requires a recently newer kernel with PPC HTM enabled. Tested on
powerpc64 and powerpc64le.

This is send along a clang patch to enabled the builtins and option switch.

[1] https://gcc.gnu.org/onlinedocs/gcc/PowerPC-Hardware-Transactional-Memory-Built-in-Functions.html

Phabricator Review: http://reviews.llvm.org/D8247

llvm-svn: 233204
2015-03-25 19:36:23 +00:00
Hal Finkel 8f7c5a7f18 [SDAG] Don't widen VSETCC during type legalization for split operands
Because the operands of a vector SETCC node can be of a different type from the
result (and often are), it can happen that even if we'd prefer to widen the
result type of the SETCC, the operands have been split instead. In this case,
the SETCC result also must be split. This mirrors what is done in
WidenVecRes_SELECT, and should be NFC elsewhere because if the operands are not
widened the following calls to GetWidenedVector will assert (which is what was
happening in the test case).

llvm-svn: 232935
2015-03-23 08:22:43 +00:00
Eric Christopher 83eb13c967 Remove the bare getSubtargetImpl call from the PPC port. As part
of this add a test that shows we can generate code with
for functions that differ by subtarget feature.

llvm-svn: 232882
2015-03-21 03:36:02 +00:00
Owen Anderson db4201235b Fix a nasty bug in DAGCombine of STORE nodes.
This is very related to the bug fixed in r174431.  The problem is that
SelectionDAG does not include alignment in the uniquing of loads and
stores.  When an otherwise no-op DAGCombine would increase the alignment
of a load or store, the original node would be returned (with the
alignment increased), which would cause the node not to be processed by
any further DAGCombines.

I don't have a direct testcase for this that manifests on an in-tree
target, but I did see some noise in the tests for other targets and have
updated them for it.

llvm-svn: 232780
2015-03-19 22:48:57 +00:00
Rafael Espindola 7f45440bd9 Note that we don't support COFF on PPC.
Should bring back the windows bots.

llvm-svn: 232701
2015-03-19 02:40:56 +00:00
Samuel Antao f68156015d Fix R0 use in PowerPC VSX store for FastIsel.
The VSX stores are sometimes generated with a undefined index register, causing %noreg to be used and R0 to be emitted later on. The semantics of the VSX store (e.g. stdsdx) requires R0 to be used as base if we want zero to be used in the computation of the effective address instead of the content of R0. This patch checks if no index register was generated and forces R0 to be used as base address.

llvm-svn: 232486
2015-03-17 15:00:57 +00:00
Rafael Espindola ccfbbd5596 Use createTempSymbol to avoid collisions instead of an ad hoc method.
llvm-svn: 232483
2015-03-17 14:50:32 +00:00
Ahmed Bougacha 082c5c707a Add a bunch of CHECK missing colons in tests. NFC.
Some wouldn't pass;  fixed most, the rest will be fixed separately.

llvm-svn: 232239
2015-03-14 01:43:57 +00:00
David Blaikie f72d05bc7b [opaque pointer type] Add textual IR support for explicit type parameter to gep operator
Similar to gep (r230786) and load (r230794) changes.

Similar migration script can be used to update test cases, which
successfully migrated all of LLVM and Polly, but about 4 test cases
needed manually changes in Clang.

(this script will read the contents of stdin and massage it into stdout
- wrap it in the 'apply.sh' script shown in previous commits + xargs to
apply it over a large set of test cases)

import fileinput
import sys
import re

rep = re.compile(r"(getelementptr(?:\s+inbounds)?\s*\()((<\d*\s+x\s+)?([^@]*?)(|\s*addrspace\(\d+\))\s*\*(?(3)>)\s*)(?=$|%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|zeroinitializer|<|\[\[[a-zA-Z]|\{\{)", re.MULTILINE | re.DOTALL)

def conv(match):
  line = match.group(1)
  line += match.group(4)
  line += ", "
  line += match.group(2)
  return line

line = sys.stdin.read()
off = 0
for match in re.finditer(rep, line):
  sys.stdout.write(line[off:match.start()])
  sys.stdout.write(conv(match))
  off = match.end()
sys.stdout.write(line[off:])

llvm-svn: 232184
2015-03-13 18:20:45 +00:00
Nemanja Ivanovic 0adf26b9b0 Add support for part-word atomics for PPC
http://reviews.llvm.org/D8090#inline-67337

llvm-svn: 231843
2015-03-10 20:51:07 +00:00
Kit Barton 20d3981e15 Change the generation of the vmuluwm instruction to be based on the MUL opcode.
Phabricator review: http://reviews.llvm.org/D8185

llvm-svn: 231827
2015-03-10 19:49:38 +00:00
Rafael Espindola 092b619e55 Use the correct func begin symbol in all places in ppc.
I missed an occurrence of the old symbol in my previous patch.

llvm-svn: 231398
2015-03-05 19:47:50 +00:00
Rafael Espindola 86bd6a1202 Use the generic Lfunc_begin label on ppc.
This removes yet another custom label to mark the start of a function.

llvm-svn: 231390
2015-03-05 18:55:50 +00:00
Kit Barton e48b1e1c4f While reviewing the changes to Clang to add builtin support for the vsld, vsrd, and vsrad instructions, it was pointed out that the builtins are generating the LLVM opcodes (shl, lshr, and ashr) not calls to the intrinsics. This patch changes the implementation of the vsld, vsrd, and vsrad instructions from from intrinsics to VXForm_1 instructions and makes them legal with P8 Altivec. It also removes the definition of the int_ppc_altivec_vsld, int_ppc_altivec_vsrd, and int_ppc_altivec_vsrad intrinsics.
llvm-svn: 231378
2015-03-05 16:24:38 +00:00
Nemanja Ivanovic e8effe1edb Add LLVM support for PPC cryptography builtins
Review: http://reviews.llvm.org/D7955

llvm-svn: 231285
2015-03-04 20:44:33 +00:00
Rafael Espindola 310e4b592f Use the vanilla func_end symbol for .size.
No need to create yet another temp symbol.

llvm-svn: 231198
2015-03-04 01:35:23 +00:00
Kit Barton 0cfa7b7ad0 Add the following 64-bit vector integer arithmetic instructions added in POWER8:
vaddudm
vsubudm
vmulesw
vmulosw
vmuleuw
vmulouw
vmuluwm
vmaxsd
vmaxud
vminsd
vminud
vcmpequd
vcmpequd.
vcmpgtsd
vcmpgtsd.
vcmpgtud
vcmpgtud.
vrld
vsld
vsrd
vsrad

Phabricator review: http://reviews.llvm.org/D7959

llvm-svn: 231115
2015-03-03 19:55:45 +00:00
Duncan P. N. Exon Smith e274180f0e DebugInfo: Move new hierarchy into place
Move the specialized metadata nodes for the new debug info hierarchy
into place, finishing off PR22464.  I've done bootstraps (and all that)
and I'm confident this commit is NFC as far as DWARF output is
concerned.  Let me know if I'm wrong :).

The code changes are fairly mechanical:

  - Bumped the "Debug Info Version".
  - `DIBuilder` now creates the appropriate subclass of `MDNode`.
  - Subclasses of DIDescriptor now expect to hold their "MD"
    counterparts (e.g., `DIBasicType` expects `MDBasicType`).
  - Deleted a ton of dead code in `AsmWriter.cpp` and `DebugInfo.cpp`
    for printing comments.
  - Big update to LangRef to describe the nodes in the new hierarchy.
    Feel free to make it better.

Testcase changes are enormous.  There's an accompanying clang commit on
its way.

If you have out-of-tree debug info testcases, I just broke your build.

  - `upgrade-specialized-nodes.sh` is attached to PR22564.  I used it to
    update all the IR testcases.
  - Unfortunately I failed to find way to script the updates to CHECK
    lines, so I updated all of these by hand.  This was fairly painful,
    since the old CHECKs are difficult to reason about.  That's one of
    the benefits of the new hierarchy.

This work isn't quite finished, BTW.  The `DIDescriptor` subclasses are
almost empty wrappers, but not quite: they still have loose casting
checks (see the `RETURN_FROM_RAW()` macro).  Once they're completely
gutted, I'll rename the "MD" classes to "DI" and kill the wrappers.  I
also expect to make a few schema changes now that it's easier to reason
about everything.

llvm-svn: 231082
2015-03-03 17:24:31 +00:00
Bill Schmidt 164350e2ea Regenerated test case from pr 230801 for change in LLVM IR syntax
llvm-svn: 230811
2015-02-27 23:29:57 +00:00
Bill Schmidt bb9460a3bc Revert test case until it can be fixed
llvm-svn: 230803
2015-02-27 22:31:14 +00:00
Bill Schmidt e3959eb54e [PowerPC] Fix PR22711 - Misaligned .toc section
Straightforward patch to emit an alignment directive when emitting a
TOC entry.  The test case was generated from the test in PR22711 that
demonstrated a misaligned .toc section.  The object code is run
through llvm-readobj to verify that the correct alignment has been
applied to the .toc section.

Thanks to Ulrich Weigand for running down where the fix was needed.

llvm-svn: 230801
2015-02-27 22:14:10 +00:00
David Blaikie a79ac14fa6 [opaque pointer type] Add textual IR support for explicit type parameter to load instruction
Essentially the same as the GEP change in r230786.

A similar migration script can be used to update test cases, though a few more
test case improvements/changes were required this time around: (r229269-r229278)

import fileinput
import sys
import re

pat = re.compile(r"((?:=|:|^)\s*load (?:atomic )?(?:volatile )?(.*?))(| addrspace\(\d+\) *)\*($| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$)")

for line in sys.stdin:
  sys.stdout.write(re.sub(pat, r"\1, \2\3*\4", line))

Reviewers: rafael, dexonsmith, grosser

Differential Revision: http://reviews.llvm.org/D7649

llvm-svn: 230794
2015-02-27 21:17:42 +00:00
Hal Finkel 5c3cacf5c0 [PowerPC] Use vector types for memcpy and friends (sometimes)
When using Altivec, we can use vector loads and stores for aligned memcpy and
friends. Starting with the P7 and VXS, we have reasonable unaligned vector
stores. Starting with the P8, we have fast unaligned loads too.

For QPX, we use vector loads are stores, but only for aligned memory accesses.

llvm-svn: 230788
2015-02-27 19:58:28 +00:00
David Blaikie 79e6c74981 [opaque pointer type] Add textual IR support for explicit type parameter to getelementptr instruction
One of several parallel first steps to remove the target type of pointers,
replacing them with a single opaque pointer type.

This adds an explicit type parameter to the gep instruction so that when the
first parameter becomes an opaque pointer type, the type to gep through is
still available to the instructions.

* This doesn't modify gep operators, only instructions (operators will be
  handled separately)

* Textual IR changes only. Bitcode (including upgrade) and changing the
  in-memory representation will be in separate changes.

* geps of vectors are transformed as:
    getelementptr <4 x float*> %x, ...
  ->getelementptr float, <4 x float*> %x, ...
  Then, once the opaque pointer type is introduced, this will ultimately look
  like:
    getelementptr float, <4 x ptr> %x
  with the unambiguous interpretation that it is a vector of pointers to float.

* address spaces remain on the pointer, not the type:
    getelementptr float addrspace(1)* %x
  ->getelementptr float, float addrspace(1)* %x
  Then, eventually:
    getelementptr float, ptr addrspace(1) %x

Importantly, the massive amount of test case churn has been automated by
same crappy python code. I had to manually update a few test cases that
wouldn't fit the script's model (r228970,r229196,r229197,r229198). The
python script just massages stdin and writes the result to stdout, I
then wrapped that in a shell script to handle replacing files, then
using the usual find+xargs to migrate all the files.

update.py:
import fileinput
import sys
import re

ibrep = re.compile(r"(^.*?[^%\w]getelementptr inbounds )(((?:<\d* x )?)(.*?)(| addrspace\(\d\)) *\*(|>)(?:$| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$))")
normrep = re.compile(       r"(^.*?[^%\w]getelementptr )(((?:<\d* x )?)(.*?)(| addrspace\(\d\)) *\*(|>)(?:$| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$))")

def conv(match, line):
  if not match:
    return line
  line = match.groups()[0]
  if len(match.groups()[5]) == 0:
    line += match.groups()[2]
  line += match.groups()[3]
  line += ", "
  line += match.groups()[1]
  line += "\n"
  return line

for line in sys.stdin:
  if line.find("getelementptr ") == line.find("getelementptr inbounds"):
    if line.find("getelementptr inbounds") != line.find("getelementptr inbounds ("):
      line = conv(re.match(ibrep, line), line)
  elif line.find("getelementptr ") != line.find("getelementptr ("):
    line = conv(re.match(normrep, line), line)
  sys.stdout.write(line)

apply.sh:
for name in "$@"
do
  python3 `dirname "$0"`/update.py < "$name" > "$name.tmp" && mv "$name.tmp" "$name"
  rm -f "$name.tmp"
done

The actual commands:
From llvm/src:
find test/ -name *.ll | xargs ./apply.sh
From llvm/src/tools/clang:
find test/ -name *.mm -o -name *.m -o -name *.cpp -o -name *.c | xargs -I '{}' ../../apply.sh "{}"
From llvm/src/tools/polly:
find test/ -name *.ll | xargs ./apply.sh

After that, check-all (with llvm, clang, clang-tools-extra, lld,
compiler-rt, and polly all checked out).

The extra 'rm' in the apply.sh script is due to a few files in clang's test
suite using interesting unicode stuff that my python script was throwing
exceptions on. None of those files needed to be migrated, so it seemed
sufficient to ignore those cases.

Reviewers: rafael, dexonsmith, grosser

Differential Revision: http://reviews.llvm.org/D7636

llvm-svn: 230786
2015-02-27 19:29:02 +00:00
Mehdi Amini 945a660cbc Change the fast-isel-abort option from bool to int to enable "levels"
Summary:
Currently fast-isel-abort will only abort for regular instructions,
and just warn for function calls, terminators, function arguments.
There is already fast-isel-abort-args but nothing for calls and
terminators.

This change turns the fast-isel-abort options into an integer option,
so that multiple levels of strictness can be defined.
This will help no being surprised when the "abort" option indeed does
not abort, and enables the possibility to write test that verifies
that no intrinsics are forgotten by fast-isel.

Reviewers: resistor, echristo

Subscribers: jfb, llvm-commits

Differential Revision: http://reviews.llvm.org/D7941

From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 230775
2015-02-27 18:32:11 +00:00
Hal Finkel cf59921670 [PowerPC] Make LDtocL and friends invariant loads
LDtocL, and other loads that roughly correspond to the TOC_ENTRY SDAG node,
represent loads from the TOC, which is invariant. As a result, these loads can
be hoisted out of loops, etc. In order to do this, we need to generate
GOT-style MMOs for TOC_ENTRY, which requires treating it as a legitimate memory
intrinsic node type. Once this is done, the MMO transfer is automatically
handled for TableGen-driven instruction selection, and for nodes generated
directly in PPCISelDAGToDAG, we need to transfer the MMOs manually.

Also, we were not transferring MMOs associated with pre-increment loads, so do
that too.

Lastly, this fixes an exposed bug where R30 was not added as a defined operand of
UpdateGBR.

This problem was highlighted by an example (used to generate the test case)
posted to llvmdev by Francois Pichet.

llvm-svn: 230553
2015-02-25 21:36:59 +00:00
Hal Finkel 6b6e9e2b5c [PowerPC] Add triples to QPX tests
Some of these tests fail on Darwin systems because of a lack of a triple;
fix that.

llvm-svn: 230421
2015-02-25 01:26:59 +00:00
Hal Finkel c93a9a2cb4 [PowerPC] Add support for the QPX vector instruction set
This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bytes
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1> are actually also represented as floating-point values
(essentially  { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirrors that supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).

I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).

The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.

llvm-svn: 230413
2015-02-25 01:06:45 +00:00
Kit Barton 263edb99ab I incorrectly marked the VORC instruction as isCommutable when I added it.
This fix removes the VORC instruction definition from the isCommutable block.

Phabricator review: http://reviews.llvm.org/D7772

llvm-svn: 230020
2015-02-20 15:54:58 +00:00
Hal Finkel e5aaf3f2cd [PowerPC] Loop Data Prefetching for the BG/Q
The IBM BG/Q supercomputer's A2 cores have a hardware prefetching unit, the
L1P, but it does not prefetch directly into the A2's L1 cache. Instead, it
prefetches into its own L1P buffer, and the latency to access that buffer is
significantly higher than that to the L1 cache (although smaller than the
latency to the L2 cache). As a result, especially when multiple hardware
threads are not actively busy, explicitly prefetching data into the L1 cache is
advantageous.

I've been using this pass out-of-tree for data prefetching on the BG/Q for well
over a year, and it has worked quite well. It is enabled by default only for
the BG/Q, but can be enabled for other cores as well via a command-line option.

Eventually, we might want to add some TTI interfaces and move this into
Transforms/Scalar (there is nothing particularly target dependent about it,
although only machines like the BG/Q will benefit from its simplistic
strategy).

llvm-svn: 229966
2015-02-20 05:08:21 +00:00
Kit Barton 298beb5e86 This patch adds the VSX logical instructions introduced in the Power ISA 2.07. It also removes the added complexity that favors VMX versions of the three instructions.
Phabricator review: http://reviews.llvm.org/D7616

Commiting on Nemanja's behalf.

llvm-svn: 229694
2015-02-18 16:21:46 +00:00
Eric Christopher fee6aaf683 Move ABI handling and 64-bitness to the PowerPC target machine.
This required changing how the computation of the ABI is handled
and how some of the checks for ABI/target are done.

llvm-svn: 229471
2015-02-17 06:45:15 +00:00
Hal Finkel 5cedafb8cd [PowerPC] Support non-direct-sub/superclass VSX copies
Our register allocation has become better recently, it seems, and is now
starting to generate cross-block copies into inflated register classes. These
copies are not transformed into subregister insertions/extractions by the
PPCVSXCopy class, and so need to be handled directly by
PPCInstrInfo::copyPhysReg. The code to do this was *almost* there, but not
quite (it was unnecessarily restricting itself to only the direct
sub/super-register-class case (not copying between, for example, something in
VRRC and the lower-half of VSRC which are super-registers of F8RC).

Triggering this behavior manually is difficult; I'm including two
bugpoint-reduced test cases from the test suite.

llvm-svn: 229457
2015-02-16 23:46:30 +00:00
Andrea Di Biagio b14ae8692d [CodeGenPrepare] Removed duplicate logic. SimplifyCFG already knows how to speculate calls to cttz/ctlz.
SimplifyCFG now knows how to speculate calls to intrinsic cttz/ctlz that are
'cheap' for the target. Therefore, some of the logic in CodeGenPrepare
that was originally added at revision 224899 can now be removed.

This patch is basically a no functional change. It removes the duplicated
logic in CodeGenPrepare and converts all the existing target specific tests
for cttz/ctlz into SimplifyCFG tests.

Differential Revision: http://reviews.llvm.org/D7608

llvm-svn: 229105
2015-02-13 14:15:48 +00:00
Hal Finkel 271e9f2870 [SDAG] Don't try to use FP_EXTEND/FP_ROUND for int<->fp promotions
The PowerPC backend has long promoted some floating-point vector operations
(such as select) to integer vector operations. Unfortunately, this behavior was
broken by r216555. When using FP_EXTEND/FP_ROUND for promotions, we must check
that both the old and new types are floating-point types. Otherwise, we must
use BITCAST as we did prior to r216555 for everything.

llvm-svn: 228969
2015-02-12 22:43:52 +00:00
Hal Finkel 7a0516ea66 [PowerPC] Mark jumps as expensive (using using CR bits)
On PowerPC, which has a full set of logical operations on (its multiple sets
of) condition-register bits, it is not profitable to break of complex
conditions feeding a jump into multiple jumps. We can turn off this feature of
CGP/SDAGBuilder by marking jumps as "expensive".

P7 test-suite speedups (no regressions):
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2
	-0.626647% +/- 0.323583%
MultiSource/Benchmarks/Olden/power/power
	-18.2821% +/- 8.06481%

llvm-svn: 228895
2015-02-12 01:02:52 +00:00
Daniel Jasper 1d966eff08 Fix overly prescriptive test that broken on Mac after r228725.
llvm-svn: 228742
2015-02-10 20:49:05 +00:00
Bill Schmidt 82f1c775a0 [PowerPC] Fix reverted patch r227976 to avoid register assignment issues
See full discussion in http://reviews.llvm.org/D7491.

We now hide the add-immediate and call instructions together in a
separate pseudo-op, which is tagged to define GPR3 and clobber the
call-killed registers.  The PPCTLSDynamicCall pass prior to RA now
expands this op into the two separate addi and call ops, with explicit
definitions of GPR3 on both instructions, and explicit clobbers on the
call instruction.  The pass is now marked as requiring and preserving
the LiveIntervals and SlotIndexes analyses, and fixes these up after
the replacement sequences are introduced.

Self-hosting has been verified on LE P8 and BE P7 with various
optimization levels, etc.  It has also been verified with the
--no-tls-optimize flag workaround removed.

llvm-svn: 228725
2015-02-10 19:09:05 +00:00
Kit Barton 0b0cdb1cd4 This change implements the following three logical vector operations:
veqv (vector equivalence)
vnand
vorc
I increased the AddedComplexity for these instructions to 500 to ensure they are generated instead of issuing other VSX instructions.


Phabricator review: http://reviews.llvm.org/D7469

llvm-svn: 228580
2015-02-09 17:03:18 +00:00
Hal Finkel 291cc7bacd [PowerPC] Handle loop predecessor invokes
If a loop predecessor has an invoke as its terminator, and the return value
from that invoke is used to determine the loop iteration space, then we can't
insert a computation based on that value in the loop predecessor prior to the
terminator (oops). If there's such an invoke, or just no predecessor for that
matter, insert a new loop preheader.

llvm-svn: 228488
2015-02-07 07:32:58 +00:00
Hal Finkel 12b607ae1b [PowerPC] Fixup incomplete revert of test/CodeGen/PowerPC/tls-pic.ll
llvm-svn: 228467
2015-02-06 23:30:06 +00:00
Hal Finkel 0d2a1515d5 Revert "r227976 - [PowerPC] Yet another approach to __tls_get_addr" and related fixups
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).

Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.

llvm-svn: 228460
2015-02-06 23:07:40 +00:00
Hal Finkel c9dd02066c [PowerPC] Prepare loops for pre-increment loads/stores
PowerPC supports pre-increment load/store instructions (except for Altivec/VSX
vector load/stores). Using these on embedded cores can be very important, but
most loops are not naturally set up to use them. We can often change that,
however, by placing loops into a non-canonical form. Generically, this means
transforming loops like this:

  for (int i = 0; i < n; ++i)
    array[i] = c;

to look like this:

  T *p = array[-1];
  for (int i = 0; i < n; ++i)
    *++p = c;

the key point is that addresses accessed are pulled into dedicated PHIs and
"pre-decremented" in the loop preheader. This allows the use of pre-increment
load/store instructions without loop peeling.

A target-specific late IR-level pass (running post-LSR), PPCLoopPreIncPrep, is
introduced to perform this transformation. I've used this code out-of-tree for
generating code for the PPC A2 for over a year. Somewhat to my surprise,
running the test suite + externals on a P7 with this transformation enabled
showed no performance regressions, and one speedup:

External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
	-2.32514% +/- 1.03736%

So I'm going to enable it on everything for now. I was surprised by this
because, on the POWER cores, these pre-increment load/store instructions are
cracked (and, thus, harder to schedule effectively). But seeing no regressions,
and feeling that it is generally easier to split instructions apart late than
it is to combine them late, this might be the better approach regardless.

In the future, we might want to integrate this functionality into LSR (but
currently LSR does not create new PHI nodes, so (for that and other reasons)
significant work would need to be done).

llvm-svn: 228328
2015-02-05 18:43:00 +00:00
Hal Finkel 65d1cbf9df [PowerPC] Generate pre-increment floating-point ld/st instructions
PowerPC supports pre-increment floating-point load/store instructions, both r+r
and r+i, and we had patterns for them, but they were not marked as legal. Mark
them as legal (and add a test case).

llvm-svn: 228327
2015-02-05 18:42:53 +00:00
Bill Schmidt 433b1c3aae [PowerPC] Implement the vclz instructions for PWR8
Patch by Kit Barton.

Add the vector count leading zeros instruction for byte, halfword,
word, and doubleword sizes.  This is a fairly straightforward addition
after the changes made for vpopcnt:

 1. Add the correct definitions for the various instructions in
    PPCInstrAltivec.td
 2. Make the CTLZ operation legal on vector types when using P8Altivec
    in PPCISelLowering.cpp 

Test Plan

Created new test case in test/CodeGen/PowerPC/vec_clz.ll to check the
instructions are being generated when the CTLZ operation is used in
LLVM.

Check the encoding and decoding in test/MC/PowerPC/ppc_encoding_vmx.s
and test/Disassembler/PowerPC/ppc_encoding_vmx.txt respectively.

llvm-svn: 228301
2015-02-05 15:24:47 +00:00
Bill Schmidt caf7e8b147 Add missing test case from r228046
llvm-svn: 228182
2015-02-04 20:00:04 +00:00
Bill Schmidt 1354f7c5fa [PowerPC] Handle 32-bit targets properly in PPCTLSDynamicCall.cpp
llvm-svn: 228116
2015-02-04 05:51:56 +00:00
Bill Schmidt e2062dbb29 Disable 32-bit tests in tls-pic.ll until they can be repaired
llvm-svn: 227981
2015-02-03 16:57:38 +00:00
Bill Schmidt a5908c74e6 Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll
llvm-svn: 227980
2015-02-03 16:33:55 +00:00
Bill Schmidt c2208acfbe Further revise too-restrictive test CodeGen/PowerPC/tls-pic.ll
llvm-svn: 227978
2015-02-03 16:29:52 +00:00
Bill Schmidt 50f486b447 Revise too-restrictive test CodeGen/PowerPC/tls-pic.ll
llvm-svn: 227977
2015-02-03 16:24:05 +00:00
Bill Schmidt 685aa8b0c5 [PowerPC] Yet another approach to __tls_get_addr
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.

In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations.  This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3.  This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.

In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering.  This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.

However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy.  The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity.  It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves.  MachineCSE
does not have any machinery to common idempotent calls.  This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace.  In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.

I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes.  However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.

All of which leads to the design presented here.  This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering.  Virtual registers
are used until prior to register allocation.  At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them.  The register allocator then coalesces these copies away.  With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).

One additional problem must be dealt with:  After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead.  Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.

One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes.  Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address.  This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.

Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not.  I've fixed the test case to reflect this.

There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.

I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.

llvm-svn: 227976
2015-02-03 16:16:01 +00:00
Hal Finkel e3d2b20c2b [PowerPC] VSX stores don't also read
The VSX store instructions were also picking up an implicit "may read" from the
default pattern, which was an intrinsic (and we don't currently have a way of
specifying write-only intrinsics).

This was causing MI verification to fail for VSX spill restores.

llvm-svn: 227759
2015-02-01 19:07:41 +00:00
Hal Finkel 11d3c561a4 [PowerPC] Better scheduling for isel on P7/P8
isel is actually a cracked instruction on the P7/P8, and must start a dispatch
group. The scheduling model should reflect this so that we don't bunch too many
of them together when possible.

Thanks to Bill Schmidt and Pat Haugen for helping to sort this out.

llvm-svn: 227758
2015-02-01 17:52:16 +00:00
Hal Finkel e6698d5305 [PowerPC] Make r2 allocatable on PPC64/ELF for some leaf functions
The TOC base pointer is passed in r2, and we normally reserve this register so
that we can depend on it being there. However, for leaf functions, and
specifically those leaf functions that don't do any TOC access of their own
(which is generally due to accessing the constant pool, using TLS, etc.),
we can treat r2 as an ordinary callee-saved register (it must be callee-saved
because, for local direct calls, the linker will not insert any save/restore
code).

The allocation order has been changed slightly for PPC64/ELF systems to put r2
at the end of the list (while leaving it near the beginning for Darwin systems
to prevent unnecessary output changes). While r2 is allocatable, using it still
requires spill/restore traffic, and thus comes at the end of the list.

llvm-svn: 227745
2015-02-01 15:03:28 +00:00
Bill Schmidt 8cf15ced8c [PowerPC] Complete setting the baseline for ppc64le
Patch by Nemanja Ivanovic.

As was uncovered by the failing test case (when run on non-PPC
platforms), the feature set when compiling with -march=ppc64le was not
being picked up. This change ensures that if the -mcpu option is not
specified, the correct feature set is picked up regardless of whether
we are on PPC or not.

llvm-svn: 227455
2015-01-29 15:59:09 +00:00
Bill Schmidt 258ac3cc56 [PowerPC] Revert ppc64le-aggregates.ll test changes from r227053
It appears we have different behavior with and without -mcpu=pwr8 even
with ppc64le defaulting to POWER8.  The failure appears as follows:

/home/bb/cmake-llvm-x86_64-linux/llvm-project/llvm/test/CodeGen/PowerPC/ppc64le-aggregates.ll:268:14: error: expected string not found in input
; CHECK-DAG: lfs 1, 0([[REG]])
             ^
<stdin>:497:11: note: scanning from here
 ld 3, .LC1@toc@l(3)
          ^
<stdin>:497:11: note: with variable "REG" equal to "3"
 ld 3, .LC1@toc@l(3)
          ^
<stdin>:514:2: note: possible intended match here
 lfs 1, 0(4)
 ^

Reverting this particular test case change.  Nemanja, please have a look
at the reason for the failure.

llvm-svn: 227055
2015-01-25 18:18:54 +00:00
Bill Schmidt 279cabb450 [PowerPC] Reset the baseline for ppc64le to be equivalent to pwr8
Test by Nemanja Ivanovic.

Since ppc64le implies POWER8 as a minimum, it makes sense that the
same features are included. Since the pwr8 processor model will likely
be getting new features until the implementation is complete, I
created a new list to add these updates to. This will include them in
both pwr8 and ppc64le.

Furthermore, it seems that it would make sense to compose the feature
lists for other processor models (pwr3 and up). Per discussion in the
review, I will make this change in a subsequent patch.

In order to test the changes, I've added an additional run step to
test cases that specify -march=ppc64le -mcpu=pwr8 to omit the -mcpu
option. Since the feature lists are the same, the behaviour should be
unchanged.

llvm-svn: 227053
2015-01-25 18:05:42 +00:00
Hal Finkel af51993ee1 [PowerPC] Add r2 as an operand for all calls under both PPC64 ELF V1 and V2
Our PPC64 ELF V2 call lowering logic added r2 as an operand to all direct call
instructions in order to represent the dependency on the TOC base pointer
value. Restricting this to ELF V2, however, does not seem to make sense: calls
under ELF V1 have the same dependence, and indirect calls have an r2 dependence
just as direct ones. Make sure the dependence is noted for all calls under both
ELF V1 and ELF V2.

llvm-svn: 226432
2015-01-19 07:20:27 +00:00
Hal Finkel f81b6dd7a2 [PowerPC] Initial PPC64 calling-convention changes for fastcc
The default calling convention specified by the PPC64 ELF (V1 and V2) ABI is
designed to work with both prototyped and non-prototyped/varargs functions. As
a result, GPRs and stack space are allocated for every argument, even those
that are passed in floating-point or vector registers.

GlobalOpt::OptimizeFunctions will transform local non-varargs functions (that
do not have their address taken) to use the 'fast' calling convention.

When functions are using the 'fast' calling convention, don't allocate GPRs for
arguments passed in other types of registers, and don't allocate stack space for
arguments passed in registers. Other changes for the fast calling convention
may be added in the future.

llvm-svn: 226399
2015-01-18 12:08:47 +00:00
Hal Finkel c19805a75d [PowerPC] Don't list R11 as a patchpoint scratch register
R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it is
reserved for use as an "environment pointer" for compilation models that
require such a thing. We don't, we also don't need a second scratch register,
and because we support only "local" patchpoint call targets, we might as well
let R11 be used for anyregcc patchpoints.

llvm-svn: 226369
2015-01-17 03:57:34 +00:00
Hal Finkel 52f7c018d3 [PowerPC] Adjust PatchPoints for ppc64le
Bill Schmidt pointed out that some adjustments would be needed to properly
support powerpc64le (using the ELF V2 ABI). For one thing, R11 is not available
as a scratch register, so we need to use R12. R12 is also available under ELF
V1, so to maintain consistency, I flipped the order to make R12 the first
scratch register in the array under both ABIs.

llvm-svn: 226247
2015-01-16 04:40:58 +00:00
Hal Finkel e2ab0f17cf [PowerPC] Loosen ELFv1 PPC64 func descriptor loads for indirect calls
Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the
POWER7, A2 and earlier cores) are really pointers to a function descriptor, a
structure with three pointers: the actual pointer to the code to which to jump,
the pointer to the TOC needed by the callee, and an environment pointer. We
used to chain these loads, and make them opaque to the rest of the optimizer,
so that they'd always occur directly before the call. This is not necessary,
and in fact, highly suboptimal on embedded cores. Once the function pointer is
known, the loads can be performed ahead of time; in fact, they can be hoisted
out of loops.

Now these function descriptors are almost always generated by the linker, and
thus the contents of the descriptors are invariant. As a result, by default,
we'll mark the associated loads as invariant (allowing them to be hoisted out
of loops). I've added a target feature to turn this off, however, just in case
someone needs that option (constructing an on-stack descriptor, casting it to a
function pointer, and then calling it cannot be well-defined C/C++ code, but I
can imagine some JIT-compilation system doing so).

Consider this simple test:
  $ cat call.c

  typedef void (*fp)();
  void bar(fp x) {
    for (int i = 0; i < 1600000000; ++i)
      x();
  }

  $ cat main.c

  typedef void (*fp)();
  void bar(fp x);
  void foo() {}
  int main() {
    bar(foo);
  }

On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads
as invariant brings the execution time down to ~8 seconds from ~32 seconds with
the loads in the loop.

The difference on the POWER7 is smaller. Compiling with:

  gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]

  clang -O3 -mcpu=native call.c main.c : ~5.3 seconds

  clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
  (looks like we'd benefit from additional loop unrolling here, as a first
   guess, because this is faster with the extra loads)

The -mno-invariant-function-descriptors will be added to Clang shortly.

llvm-svn: 226207
2015-01-15 21:17:34 +00:00
Duncan P. N. Exon Smith 9885469922 IR: Move MDLocation into place
This commit moves `MDLocation`, finishing off PR21433.  There's an
accompanying clang commit for frontend testcases.  I'll attach the
testcase upgrade script I used to PR21433 to help out-of-tree
frontends/backends.

This changes the schema for `DebugLoc` and `DILocation` from:

    !{i32 3, i32 7, !7, !8}

to:

    !MDLocation(line: 3, column: 7, scope: !7, inlinedAt: !8)

Note that empty fields (line/column: 0 and inlinedAt: null) don't get
printed by the assembly writer.

llvm-svn: 226048
2015-01-14 22:27:36 +00:00
Bill Schmidt 082cfc05f1 [PPC64] Add support for the ICBT instruction on POWER8.
Patch by Kit Barton.

Support for the ICBT instruction is currently present, but limited to
embedded processors. This change adds a new FeatureICBT that can be used
to identify whether the ICBT instruction is available on a specific processor.

Two new tests are added:
 * Positive test to ensure the icbt instruction is present when using
-mcpu=pwr8
 * Negative test to ensure the icbt instruction is not generated when
using -mcpu=pwr7

Both test cases use the Prefetch opcode in LLVM. They are based on the
ppc64-prefetch.ll test case.

llvm-svn: 226033
2015-01-14 20:17:10 +00:00
JF Bastien eeea8970b4 Revert "Insert random noops to increase security against ROP attacks (llvm)"
This reverts commit:
http://reviews.llvm.org/D3392

llvm-svn: 225948
2015-01-14 05:24:33 +00:00
Hal Finkel 2307a2f088 [PowerPC] Fix the noop-insert test
The form of nops used is CPU-specific (some CPUs, such as the POWER7, have
special group-terminating nops). We probably want a different callback for this
kind of nop insertion (something more like MCAsmBackend::writeNopData), or for
PPC to use a different mechanism for scheduling nops, but this will stop the
test from failing for now.

llvm-svn: 225928
2015-01-14 01:37:21 +00:00
Hal Finkel 934361a4b8 Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support""
This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.

Original commit message:

This commit does two things:

 1. Refactors PPCFastISel to use more of the common infrastructure for call
    lowering (this lets us take advantage of this common code for lowering some
    common intrinsics, stackmap/patchpoint among them).

 2. Adds support for stackmap/patchpoint lowering. For the most part, this is
    very similar to the support in the AArch64 target, with the obvious differences
    (different registers, NOP instructions, etc.). The test cases are adapted
    from the AArch64 test cases.

One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).

StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!

llvm-svn: 225909
2015-01-14 01:07:51 +00:00
JF Bastien dcdd5ad252 Insert random noops to increase security against ROP attacks (llvm)
A pass that adds random noops to X86 binaries to introduce diversity with the goal of increasing security against most return-oriented programming attacks.

Command line options:
  -noop-insertion // Enable noop insertion.
  -noop-insertion-percentage=X // X% of assembly instructions will have a noop prepended (default: 50%, requires -noop-insertion)
  -max-noops-per-instruction=X // Randomly generate X noops per instruction. ie. roll the dice X times with probability set above (default: 1). This doesn't guarantee X noop instructions.

In addition, the following 'quick switch' in clang enables basic diversity using default settings (currently: noop insertion and schedule randomization; it is intended to be extended in the future).
  -fdiversify

This is the llvm part of the patch.
clang part: D3393

http://reviews.llvm.org/D3392
Patch by Stephen Crane (@rinon)

llvm-svn: 225908
2015-01-14 01:07:26 +00:00
Ulrich Weigand 6b577e26f0 Use the integrated assembler as default on PowerPC
This was already done in clang, this commit now uses the integrated
assembler as default when using LLVM tools directly.

A number of test cases using inline asm had to be adapted, either by
updating the expected output, or by using -no-integrated-as (for such
tests that deliberately use an invalid instruction in inline asm).

llvm-svn: 225819
2015-01-13 19:43:45 +00:00
Hal Finkel 63fb928109 Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support"
Reverting this while I investiage buildbot failures (segfaulting in
GetCostForDef at ScheduleDAGRRList.cpp:314).

llvm-svn: 225811
2015-01-13 18:25:05 +00:00
Hal Finkel 821befd52b [PowerPC] Add StackMap/PatchPoint support
This commit does two things:

 1. Refactors PPCFastISel to use more of the common infrastructure for call
    lowering (this lets us take advantage of this common code for lowering some
    common intrinsics, stackmap/patchpoint among them).

 2. Adds support for stackmap/patchpoint lowering. For the most part, this is
    very similar to the support in the AArch64 target, with the obvious differences
    (different registers, NOP instructions, etc.). The test cases are adapted
    from the AArch64 test cases.

One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).

StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!

llvm-svn: 225808
2015-01-13 17:48:12 +00:00
Hal Finkel 87deb0b8e3 [PowerPC] Fix calls to non-function objects
Looking at r225438 inspired me to see how the PowerPC backend handled the
situation (calling a bitcasted TLS global), and it turns out we also produced
an error (cannot select ...). What it means to "call" something that is not a
function is implementation and platform specific, but in the name of doing
something (besides crashing), this makes sure we do what GCC does (treat all
such calls as calls through a function pointer -- meaning that the pointer is
assumed, as is the convention on PPC, to point to a function descriptor
structure holding the actual code address along with the function's TOC pointer
and environment pointer). As GCC does, we now do the same for calling regular
(non-TLS) non-function globals too.

I'm not sure whether this is the most useful way to define the behavior, but at
least we won't be alone.

llvm-svn: 225617
2015-01-12 04:34:47 +00:00
Hal Finkel 5d5d1539cc [PowerPC] Mark zext of a small scalar load as free
This initial implementation of PPCTargetLowering::isZExtFree marks as free
zexts of small scalar loads (that are not sign-extending). This callback is
used by SelectionDAGBuilder's RegsForValue::getCopyToRegs, and thus to
determine whether a zext or an anyext is used to lower illegally-typed PHIs.
Because later truncates of zero-extended values are nops, this allows for the
elimination of later unnecessary truncations.

Fixes the initial complaint associated with PR22120.

llvm-svn: 225584
2015-01-10 08:21:59 +00:00
Justin Hibbits 654346e6f9 Fully fix Bug #22115.
Summary:
In the previous commit, the register was saved, but space was not allocated.
This resulted in the parameter save area potentially clobbering r30, leading to
nasty results.

Test Plan: Tests updated

Reviewers: hfinkel

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D6906

llvm-svn: 225573
2015-01-10 01:57:21 +00:00
Hal Finkel 6c39269a4c [PowerPC] Fold [sz]ext with fp_to_int lowering where possible
On modern cores with lfiw[az]x, we can fold a sign or zero extension from i32
to i64 into the load necessary for an i64 -> fp conversion.

llvm-svn: 225493
2015-01-09 01:34:30 +00:00
Hal Finkel 3c0952b072 [PowerPC] Mark all instructions as non-cheap for MachineLICM
MachineLICM uses a callback named hasLowDefLatency to determine if an
instruction def operand has a 'low' latency. If all relevant operands have a
'low' latency, the instruction is considered too cheap to hoist out of loops
even in low-register-pressure situations. On PowerPC cores, both the embedded
cores and the others, there is no reason to believe that this is a good choice:
all instructions have a cost inside a loop, and hoisting them when not limited
by register pressure is a reasonable default.

llvm-svn: 225471
2015-01-08 22:11:49 +00:00
Justin Hibbits 98a532dd8e Add saving and restoring of r30 to the prologue and epilogue, respectively
Summary: The PIC additions didn't update the prologue and epilogue code to save and restore r30 (PIC base register).  This does that.

Test Plan: Tests updated.

Reviewers: hfinkel

Reviewed By: hfinkel

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D6876

llvm-svn: 225450
2015-01-08 15:47:19 +00:00
Olivier Sallenave 0451532996 More FMA folding opportunities.
llvm-svn: 225380
2015-01-07 20:54:17 +00:00
Hal Finkel ed844c4ad1 [PowerPC] Reuse a load operand in int->fp conversions
int->fp conversions on PPC must be done through memory loads and stores. On a
modern core, this process begins by storing the int value to memory, then
loading it using a (sometimes special) FP load instruction. Unfortunately, we
would do this even when the value to be converted was itself a load, and we can
just use that same memory location instead of copying it to another first.
There is a slight complication when handling int_to_fp(fp_to_int(x)) pairs,
because the fp_to_int operand has not been lowered when the int_to_fp is being
lowered. We handle this specially by invoking fp_to_int's lowering logic
(partially) and getting the necessary memory location (some trivial refactoring
was done to make this possible).

This is all somewhat ugly, and it would be nice if some later CodeGen stage
could just clean this stuff up, but because doing so would involve modifying
target-specific nodes (or instructions), it is not immediately clear how that
would work.

Also, remove a related entry from the README.txt for which we now generate
reasonable code.

llvm-svn: 225301
2015-01-06 22:31:02 +00:00
Hal Finkel ef35ed0892 [PowerPC] Add a regression test for r225251
In r225251, I removed an old entry from the README.txt file. While there are
several contributing factors (including pieces in Clang's ABI code), upon
further reflection, the backend part deserves a regression test.

llvm-svn: 225268
2015-01-06 16:46:37 +00:00
Hal Finkel 5efb918844 [PowerPC] Improve int_to_fp(fp_to_int(x)) combining
The old target DAG combine that allowed for performing int_to_fp(fp_to_int(x))
without a load/store pair is updated here with support for unsigned integers,
and to support single-precision values without a third rounding step, on newer
cores with the appropriate instructions.

llvm-svn: 225248
2015-01-06 06:01:57 +00:00
Hal Finkel f93f56d619 [PowerPC] Fix test to pass on Darwin hosts
llvm-svn: 225220
2015-01-05 23:17:43 +00:00
Hal Finkel 9187711f08 [PowerPC] Convert a README.txt entry into a better test
We now produce the desired code as noted in the README.txt file (no spurious
or). Remove the README entry and improve the regression test.

llvm-svn: 225214
2015-01-05 21:53:52 +00:00
Hal Finkel c7d35bb5b1 [PowerPC] Add a test for truncating a shifted load
We now produce the desired code as noted in the README.txt file. Remove the
README entry and add a regression test.

llvm-svn: 225209
2015-01-05 21:33:14 +00:00
Hal Finkel a4750dec99 [PowerPC] Add another test for load/store with update
We now produce the desired code as noted in the README.txt file. Remove the
README entry and add a regression test.

llvm-svn: 225205
2015-01-05 21:22:42 +00:00
Hal Finkel 200d2ad188 [PowerPC] Fold i1 extensions with other ops
Consider this function from our README.txt file:

  int foo(int a, int b) { return (a < b) << 4; }

We now explicitly track CR bits by default, so the comment in the README.txt
about not really having a SETCC is no longer accurate, but we did generate this
somewhat silly code:

        cmpw 0, 3, 4
        li 3, 0
        li 12, 1
        isel 3, 12, 3, 0
        sldi 3, 3, 4
        blr

which generates the zext as a select between 0 and 1, and then shifts the
result by a constant amount. Here we preprocess the DAG in order to fold the
results of operations on an extension of an i1 value into the SELECT_I[48]
pseudo instruction when the resulting constant can be materialized using one
instruction (just like the 0 and 1). This was not implemented as a DAGCombine
because the resulting code would have been anti-canonical and depends on
replacing chained user nodes, which does not fit well into the lowering
paradigm. Now we generate:

        cmpw 0, 3, 4
        li 3, 0
        li 12, 16
        isel 3, 12, 3, 0
        blr

which is less silly.

llvm-svn: 225203
2015-01-05 21:10:24 +00:00
Hal Finkel 49557f1b42 [PowerPC] Remove zexts after i32 ctlz
The 64-bit semantics of cntlzw are not special, the 32-bit population count is
stored as a 64-bit value in the range [0,32]. As a result, it is always zero
extended, and it can be added to the PPCISelDAGToDAG peephole optimization as a
frontier instruction for the removal of unnecessary zero extensions.

llvm-svn: 225192
2015-01-05 18:52:29 +00:00
Hal Finkel 4e2c78228a [PowerPC] Remove zexts after byte-swapping loads
lhbrx and lwbrx not only load their data with byte swapping, but also clear the
upper 32 bits (at least). As a result, they can be added to the PPCISelDAGToDAG
peephole optimization as frontier instructions for the removal of unnecessary
zero extensions.

llvm-svn: 225189
2015-01-05 18:09:06 +00:00
Hal Finkel 9bb61de1be [PowerPC] Enable speculation of cttz/ctlz
PPC has an instruction for ctlz with defined zero behavior, and our lowering of
cttz (provided by DAGCombine) is also efficient and branchless, so speculating
these makes sense.

llvm-svn: 225150
2015-01-05 05:24:42 +00:00
Hal Finkel 2f61879ff4 [PowerPC] Materialize i64 constants using rotation with masking
r225135 added the ability to materialize i64 constants using rotations in order
to reduce the instruction count. Sometimes we can use a rotation only with some
extra masking, so that we take advantage of the fact that generating a bunch of
extra higher-order 1 bits is easy using li/lis.

llvm-svn: 225147
2015-01-05 03:41:38 +00:00
Hal Finkel 241ba79f95 [PowerPC] Materialize i64 constants using rotation
Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
5 instructions depending on the locations of the non-zero bits. Sometimes
materializing a rotated constant, and then applying the inverse rotation, requires
fewer instructions than the direct method. If so, do that instead.

In r225132, I added support for forming constants using bit inversion. In
effect, this reverts that commit and replaces it with rotation support. The bit
inversion is useful for turning constants that are mostly ones into ones that
are mostly zeros (thus enabling a more-efficient shift-based materialization),
but the same effect can be obtained by using negative constants and a rotate,
and that is at least as efficient, if not more.

llvm-svn: 225135
2015-01-04 15:43:55 +00:00
Hal Finkel ca6375fb75 [PowerPC] Materialize i64 constants using bit inversion
Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
5 instructions depending on the locations of the non-zero bits. Sometimes
materializing the bit-reversed constant, and then flipping the bits, requires
fewer instructions than the direct method. If so, do that instead.

llvm-svn: 225132
2015-01-04 12:35:03 +00:00
Hal Finkel 5772566ed6 [PowerPC/BlockPlacement] Allow target to provide a per-loop alignment preference
The existing code provided for specifying a global loop alignment preference.
However, the preferred loop alignment might depend on the loop itself. For
recent POWER cores, loops between 5 and 8 instructions should have 32-byte
alignment (while the others are better with 16-byte alignment) so that the
entire loop will fit in one i-cache line.

To support this, getPrefLoopAlignment has been made virtual, and can be
provided with an optional MachineLoop* so the target can inspect the loop
before answering the query. The default behavior, as before, is to return the
value set with setPrefLoopAlignment. MachineBlockPlacement now queries the
target for each loop instead of only once per function. There should be no
functional change for other targets.

llvm-svn: 225117
2015-01-03 17:58:24 +00:00
Hal Finkel d73bfba7eb [PowerPC] Use 16-byte alignment for modern cores for functions/loops
Most modern PowerPC cores prefer that functions and loops start on
16-byte-aligned boundaries (*), so instruct block placement, etc. to make this
happen. The branch selector has also been adjusted so account for the extra
nops that might now be inserted before loop headers.

(*) Some cores actually prefer other alignments for small loops, but that will
    be addressed in a follow-up commit.

llvm-svn: 225115
2015-01-03 14:58:25 +00:00
Hal Finkel 4edc66b8de [PowerPC] Add support for the CMPB instruction
Newer POWER cores, and the A2, support the cmpb instruction. This instruction
compares its operands, treating each of the 8 bytes in the GPRs separately,
returning a 'mask' result of 0 (for false) or -1 (for true) in each byte.

Code generation support is added, in the form of a PPCISelDAGToDAG
DAG-preprocessing routine, that recognizes patterns close to what the
instruction computes (either exactly, or related by a constant masking
operation), and generates the cmpb instruction (along with any necessary
constant masking operation). This can be expanded if use cases arise.

llvm-svn: 225106
2015-01-03 01:16:37 +00:00
Hal Finkel c58ce4132a [PowerPC] Improve instruction selection bit-permuting operations (64-bit)
This is the second installment of improvements to instruction selection for "bit
permutation" instruction sequences. r224318 added logic for instruction
selection for 32-bit bit permutation sequences, and this adds lowering for
64-bit sequences. The 64-bit sequences are more complicated than the 32-bit
ones because:
  a) the 64-bit versions of the 32-bit rotate-and-mask instructions
     work by replicating the lower 32-bits of the value-to-be-rotated into the
     upper 32 bits -- and integrating this into the cost modeling for the various
     bit group operations is non-trivial
  b) unlike the 32-bit instructions in 32-bit mode, the rotate-and-mask instructions
     cannot, in one instruction, specify the
     mask starting index, the mask ending index, and the rotation factor. Also,
     forming arbitrary 64-bit constants is more complicated than in 32-bit mode
     because the number of instructions necessary is value dependent.

Plus, support for 'late masking' was added: it is sometimes more efficient to
treat the overall value as if it had no mandatory zero bits when planning the
bit-group insertions, and then mask them in at the very end. Unfortunately, as
the structure of the bit groups is different in the two cases, the more
feasible implementation technique was to generate both instruction sequences,
and then pick the shorter one.

And finally, we now generate reasonable code for i64 bswap:

        rldicl 5, 3, 16, 0
        rldicl 4, 3, 8, 0
        rldicl 6, 3, 24, 0
        rldimi 4, 5, 8, 48
        rldicl 5, 3, 32, 0
        rldimi 4, 6, 16, 40
        rldicl 6, 3, 48, 0
        rldimi 4, 5, 24, 32
        rldicl 5, 3, 56, 0
        rldimi 4, 6, 40, 16
        rldimi 4, 5, 48, 8
        rldimi 4, 3, 56, 0

vs. what we used to produce:

        li 4, 255
        rldicl 5, 3, 24, 40
        rldicl 6, 3, 40, 24
        rldicl 7, 3, 56, 8
        sldi 8, 3, 8
        sldi 10, 3, 24
        sldi 12, 3, 40
        rldicl 0, 3, 8, 56
        sldi 9, 4, 32
        sldi 11, 4, 40
        sldi 4, 4, 48
        andi. 5, 5, 65280
        andis. 6, 6, 255
        andis. 7, 7, 65280
        sldi 3, 3, 56
        and 8, 8, 9
        and 4, 12, 4
        and 9, 10, 11
        or 6, 7, 6
        or 5, 5, 0
        or 3, 3, 4
        or 7, 9, 8
        or 4, 6, 5
        or 3, 3, 7
        or 3, 3, 4

which is 12 instructions, instead of 25, and seems optimal (at least in terms
of code size).

llvm-svn: 225056
2015-01-01 02:53:29 +00:00
David Majnemer d0bcef2040 PowerPC: CTR shouldn't fire if a TLS call is in the loop
Determining the address of a TLS variable results in a function call in
certain TLS models.  This means that a simple ICmpInst might actually
result in invalidating the CTR register.

In such cases, do not attempt to rely on the CTR register for loop
optimization purposes.

This fixes PR22034.

Differential Revision: http://reviews.llvm.org/D6786

llvm-svn: 224890
2014-12-27 19:45:38 +00:00
Rafael Espindola 5d94634c13 No need to run llvm-as. NFC.
llvm-svn: 224859
2014-12-26 16:42:47 +00:00
Hal Finkel 0c505b08a5 [PowerPC] [FastISel] i1 constants must be zero extended
When materializing constant i1 values, they must be zero extended. We represent
i1 values as [0, 1], not [0, -1], in i32 registers. As it turns out, this code
path was dead for i1 values prior to r216006 (which is why this did not manifest in
miscompiles until recently).

Fixes -O0 self-hosting on PPC64/Linux.

llvm-svn: 224842
2014-12-25 23:08:25 +00:00
Hal Finkel fc096c98f3 [PowerPC] Ensure that the TOC reload directly follows bctrl on PPC64
On non-Darwin PPC64, the TOC reload needs to come directly after the bctrl
instruction (for indirect calls) because the 'bctrl/ld 2, 40(1)' instruction
sequence is interpreted by the unwinding code in libgcc. To make sure these
occur as a pair, as with other pairings interpreted by the linker, fuse the two
instructions into one instruction (for code generation only).

In the future, we might wish to do this by emitting CFI directives instead,
but this solution is simpler, and mirrors what GCC does. Additional discussion
on this point is contained in the PR.

Fixes PR22015.

llvm-svn: 224788
2014-12-23 22:29:40 +00:00
Hal Finkel 6e27c6d450 [PowerPC] Don't mark the return-address slot as immutable
It is tempting to mark the fixed stack slot used to store the return address as
immutable when lowering @llvm.returnaddress(i32 0). Unfortunately, within the
function, it is not completely immutable: it is written during the function
prologue. When using post-RA instruction scheduling, the prologue instructions
are available for scheduling, and we're not free to interchange the order of a
particular store in the prologue with loads from that stack location.

Fixes PR21976.

llvm-svn: 224761
2014-12-23 09:45:06 +00:00
Hal Finkel 04b16b51ec [PowerPC] Don't attempt a 64-bit pow2 division on PPC32
In r224033, in moving the signed power-of-2 division expansion into
BuildSDIVPow2, I accidentally made it possible to attempt the lowering for a
64-bit division on PPC32. This later asserts.

Fixes PR21928.

llvm-svn: 224758
2014-12-23 08:38:50 +00:00
Hal Finkel 8adf2254ef [PowerPC] Improve instruction selection bit-permuting operations (32-bit)
The PowerPC backend, somewhat embarrassingly, did not generate an
optimal-length sequence of instructions for a 32-bit bswap. While adding a
pattern for the bswap intrinsic to fix this would not have been terribly
difficult, doing so would not have addressed the real problem: we had been
generating poor code for many bit-permuting operations (by which I mean things
like byte-swap that permute the bits of one or more inputs around in various
ways). Here are some initial steps toward solving this deficiency.

Bit-permuting operations are represented, at the SDAG level, using ISD::ROTL,
SHL, SRL, AND and OR (mostly with constant second operands). Looking back
through these operations, we can build up a description of the bits in the
resulting value in terms of bits of one or more input values (and constant
zeros). For each bit, we compute the rotation amount from the original value,
and then group consecutive (value, rotation factor) bits into groups. Groups
sharing these attributes are then collected and sorted, and we can then
instruction select the entire permutation using a combination of masked
rotations (rlwinm), imm ands (andi/andis), and masked rotation inserts
(rlwimi).

The result is that instead of lowering an i32 bswap as:

	rlwinm 5, 3, 24, 16, 23
	rlwinm 4, 3, 24, 0, 7
	rlwimi 4, 3, 8, 8, 15
	rlwimi 5, 3, 8, 24, 31
	rlwimi 4, 5, 0, 16, 31

we now produce:

	rlwinm 4, 3, 8, 0, 31
	rlwimi 4, 3, 24, 16, 23
	rlwimi 4, 3, 24, 0, 7

and for the 'test6' example in the PowerPC/README.txt file:

 unsigned test6(unsigned x) {
   return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
 }

we used to produce:

	lis 4, 255
	rlwinm 3, 3, 16, 0, 31
	ori 4, 4, 255
	and 3, 3, 4

and now we produce:

	rlwinm 4, 3, 16, 24, 31
	rlwimi 4, 3, 16, 8, 15

and, as a nice bonus, this fixes the FIXME in
test/CodeGen/PowerPC/rlwimi-and.ll.

This commit does not include instruction-selection for i64 operations, those
will come later.

llvm-svn: 224318
2014-12-16 05:51:41 +00:00
Duncan P. N. Exon Smith be7ea19b58 IR: Make metadata typeless in assembly
Now that `Metadata` is typeless, reflect that in the assembly.  These
are the matching assembly changes for the metadata/value split in
r223802.

  - Only use the `metadata` type when referencing metadata from a call
    intrinsic -- i.e., only when it's used as a `Value`.

  - Stop pretending that `ValueAsMetadata` is wrapped in an `MDNode`
    when referencing it from call intrinsics.

So, assembly like this:

    define @foo(i32 %v) {
      call void @llvm.foo(metadata !{i32 %v}, metadata !0)
      call void @llvm.foo(metadata !{i32 7}, metadata !0)
      call void @llvm.foo(metadata !1, metadata !0)
      call void @llvm.foo(metadata !3, metadata !0)
      call void @llvm.foo(metadata !{metadata !3}, metadata !0)
      ret void, !bar !2
    }
    !0 = metadata !{metadata !2}
    !1 = metadata !{i32* @global}
    !2 = metadata !{metadata !3}
    !3 = metadata !{}

turns into this:

    define @foo(i32 %v) {
      call void @llvm.foo(metadata i32 %v, metadata !0)
      call void @llvm.foo(metadata i32 7, metadata !0)
      call void @llvm.foo(metadata i32* @global, metadata !0)
      call void @llvm.foo(metadata !3, metadata !0)
      call void @llvm.foo(metadata !{!3}, metadata !0)
      ret void, !bar !2
    }
    !0 = !{!2}
    !1 = !{i32* @global}
    !2 = !{!3}
    !3 = !{}

I wrote an upgrade script that handled almost all of the tests in llvm
and many of the tests in cfe (even handling many `CHECK` lines).  I've
attached it (or will attach it in a moment if you're speedy) to PR21532
to help everyone update their out-of-tree testcases.

This is part of PR21532.

llvm-svn: 224257
2014-12-15 19:07:53 +00:00
Hal Finkel 4104a1a346 [PowerPC] Handle cmp op promotion for SELECT[_CC] nodes in PPCTL::DAGCombineExtBoolTrunc
PPCTargetLowering::DAGCombineExtBoolTrunc contains logic to remove unwanted
truncations and extensions when dealing with nodes of the form:
  zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)

There was a FIXME in the implementation (now removed) regarding the fact that
the function would abort the transformations if any of the non-output operands
of a SELECT or SELECT_CC node would need to be promoted (because they were
also output operands, for example). As a result, we continued to generate
unnecessary zero-extends for code such as this:

  unsigned foo(unsigned a, unsigned b) {
    return  (a <= b) ? a : b;
  }

which would produce:

  cmplw 0, 3, 4
  isel 3, 4, 3, 1
  rldicl 3, 3, 0, 32
  blr

and now we produce:

  cmplw 0, 3, 4
  isel 3, 4, 3, 1
  blr

which is better in the obvious way.

llvm-svn: 224213
2014-12-14 05:53:19 +00:00
Hal Finkel 4c6658feb0 [PowerPC] Add a DAGToDAG peephole to remove unnecessary zero-exts
On PPC64, we end up with lots of i32 -> i64 zero extensions, not only from all
of the usual places, but also from the ABI, which specifies that values passed
are zero extended. Almost all 32-bit PPC instructions in PPC64 mode are defined
to do *something* to the higher-order bits, and for some instructions, that
action clears those bits (thus providing a zero-extended result). This is
especially common after rotate-and-mask instructions. Adding an additional
instruction to zero-extend the results of these instructions is unnecessary.

This PPCISelDAGToDAG peephole optimization examines these zero-extensions, and
looks back through their operands to see if all instructions will implicitly
zero extend their results. If so, we convert these instructions to their 64-bit
variants (which is an internal change only, the actual encoding of these
instructions is the same as the original 32-bit ones) and remove the
unnecessary zero-extension (changing where the INSERT_SUBREG instructions are
to make everything internally consistent).

llvm-svn: 224169
2014-12-12 23:59:36 +00:00
Hal Finkel b5e9b0426a [PowerPC] Better lowering for add/or of a FrameIndex
If we have an add (or an or that is really an add), where one operand is a
FrameIndex and the other operand is a small constant, we can combine the
lowering of the FrameIndex (which is lowered as an add of the FI and a zero
offset) with the constant operand.

Amusingly, this is an old potential improvement entry from
lib/Target/PowerPC/README.txt which had never been resolved. In short, we used
to lower:

        %X = alloca { i32, i32 }
        %Y = getelementptr {i32,i32}* %X, i32 0, i32 1
        ret i32* %Y

as:

        addi 3, 1, -8
        ori 3, 3, 4
        blr

and now we produce:

        addi 3, 1, -4
        blr

which is much more sensible.

llvm-svn: 224071
2014-12-11 22:51:06 +00:00
Hal Finkel 13d104bf78 [PowerPC] Implement BuildSDIVPow2, lower i64 pow2 sdiv using sradi
PPCISelDAGToDAG contained existing code to lower i32 sdiv by a power-of-2 using
srawi/addze, but did not implement the i64 case. DAGCombine now contains a
callback specifically designed for this purpose (BuildSDIVPow2), and part of
the logic has been moved to an implementation of that callback. Doing this
lowering using BuildSDIVPow2 likely does not matter, compared to handling
everything in PPCISelDAGToDAG, for the positive divisor case, but the negative
divisor case, which generates an additional negation, can potentially benefit
from additional folding from DAGCombine. Now, both the i32 and the i64 cases
have been implemented.

Fixes PR20732.

llvm-svn: 224033
2014-12-11 18:37:52 +00:00
Bill Schmidt efe9ce216e [PowerPC 4/4] Enable little-endian support for VSX.
With the foregoing three patches, VSX instructions can be used for
little endian.  This patch removes the restriction that prevented
this, and re-enables the test cases from the first three patches.

llvm-svn: 223792
2014-12-09 16:59:57 +00:00
Bill Schmidt 3014435ca9 [PowerPC 3/4] Little-endian adjustments for VSX vector shuffle
When performing instruction selection for ISD::VECTOR_SHUFFLE, there
is special code for handling v2f64 and v2i64 using VSX instructions.
This code must be adjusted for little-endian.  Because the two inputs
are treated as a double-wide register, we must swap their order for
little endian.  To get the appropriate mask elements to use with the
big-endian biased XXPERMDI instruction, we must reverse their order
and invert the bits.

A new test is added to test the 16 possible values of the shuffle
mask.  It is initially disabled for reasons specified in the test.  It
is re-enabled by patch 4/4.

llvm-svn: 223791
2014-12-09 16:52:29 +00:00
Bill Schmidt 4187962697 Add test cases that were inadvertently omitted from r223783 and r223788
llvm-svn: 223789
2014-12-09 16:44:58 +00:00
Bill Schmidt fae5d71584 [PowerPC 1/4] Little-endian adjustments for VSX loads/stores
This patch addresses the inherent big-endian bias in the lxvd2x,
lxvw4x, stxvd2x, and stxvw4x instructions.  These instructions load
vector elements into registers left-to-right (with the first element
loaded into the high-order bits of the register), regardless of the
endian setting of the processor.  However, these are the only
vector memory instructions that permit unaligned storage accesses, so
we want to use them for little-endian.

To make this work, a lxvd2x or lxvw4x is replaced with an lxvd2x
followed by an xxswapd, which swaps the doublewords.  This works for
lxvw4x as well as lxvd2x, because for lxvw4x on an LE system the
vector elements are in LE order (right-to-left) within each
doubleword.  (Thus after lxvw2x of a <4 x float> the elements will
appear as 1, 0, 3, 2.  Following the swap, they will appear as 3, 2,
0, 1, as desired.)   For stores, an stxvd2x or stxvw4x is replaced
with an stxvd2x preceded by an xxswapd.

Introduction of extra swap instructions provides correctness, but
obviously is not ideal from a performance perspective.  Future patches
will address this with optimizations to remove most of the introduced
swaps, which have proven effective in other implementations.

The introduction of the swaps is performed during lowering of LOAD,
STORE, INTRINSIC_W_CHAIN, and INTRINSIC_VOID operations.  The latter
are used to translate intrinsics that specify the VSX loads and stores
directly into equivalent sequences for little endian.  Thus code that
uses vec_vsx_ld and vec_vsx_st does not have to be modified to be
ported from BE to LE.

We introduce new PPCISD opcodes for LXVD2X, STXVD2X, and XXSWAPD for
use during this lowering step.  In PPCInstrVSX.td, we add new SDType
and SDNode definitions for these (PPClxvd2x, PPCstxvd2x, PPCxxswapd).
These are recognized during instruction selection and mapped to the
correct instructions.

Several tests that were written to use -mcpu=pwr7 or pwr8 are modified
to disable VSX on LE variants because code generation changes with
this and subsequent patches in this set.  I chose to include all of
these in the first patch than try to rigorously sort out which tests
were broken by one or another of the patches.  Sorry about that.

The new test vsx-ldst-builtin-le.ll, and the changes to vsx-ldst.ll,
are disabled until LE support is enabled because of breakages that
occur as noted in those tests.  They are re-enabled in patch 4/4.

llvm-svn: 223783
2014-12-09 16:35:51 +00:00
Hal Finkel c8cf2b88bc Handle early-clobber registers in the aggressive anti-dep breaker
The aggressive anti-dep breaker, used by the PowerPC backend during post-RA
scheduling (but is available to all targets), did not handle early-clobber MI
operands (at all). When constructing the list of available registers for the
replacement of some def operand, check the using instructions, and remove
registers assigned to early-clobbered defs from the set.

Fixes PR21452.

llvm-svn: 223727
2014-12-09 01:00:59 +00:00
Hal Finkel aa10b3caaf [PowerPC] Don't use a non-allocatable register to implement the 'cc' alias
GCC accepts 'cc' as an alias for 'cr0', and we need to do the same when
processing inline asm constraints. This had previously been implemented using a
non-allocatable register, named 'cc', that was listed as an alias of 'cr0', but
the infrastructure does not seem to support this properly (neither the register
allocator nor the scheduler properly accounts for the alias). Instead, we can
just process this as a naming alias inside of the inline asm
constraint-processing code, so we'll do that instead.

There are two regression tests, one where the post-RA scheduler did the wrong
thing with the non-allocatable alias, and one where the register allocator did
the wrong thing. Fixes PR21742.

llvm-svn: 223708
2014-12-08 22:54:22 +00:00
Bill Seurer 1baeeb029b [PowerPC]Update Power VSX test cases to also test fast-isel
Update of some of the VSX test cases for Power to check fast-isel codegen as well as the regular codegen.

http://reviews.llvm.org/D6357

llvm-svn: 223509
2014-12-05 20:32:05 +00:00
Hal Finkel 66d7791176 Revert "r223440 - Consider subregs when calling MI::registerDefIsDead for phys deps"
Reverting this because, while it fixes the problem in the reduced test case, it
does not fix the problem in the full test case from the bug report.

llvm-svn: 223442
2014-12-05 02:07:35 +00:00
Hal Finkel d013d99fe0 Consider subregs when calling MI::registerDefIsDead for phys deps
The scheduling dependency graph is built bottom-up within each scheduling
region, and ScheduleDAGInstrs::addPhysRegDeps is called to add output/anti
dependencies, based on physical registers, to the SUs for instructions
based on those that come before them.

In the test case, we start before post-RA scheduling with a block that looks
like this:

...
	INLINEASM <...
andc $0,$0,$2
stdcx. $0,0,$3
bne- 1b
> [sideeffect] [mayload] [maystore] [attdialect], $0:[regdef-ec:G8RC], %X6<earlyclobber,def,dead>, $1:[mem], %X3<kill>, $2:[reguse:G8RC], %X5<kill>, $3:[reguse:G8RC], %X3, $4:[mem], %X3, $5:[clobber], %CC<earlyclobber,imp-def,dead>, <<badref>>
	...
	%X4<def,dead> = ANDIo8 %X4<kill>, 1, %CR0<imp-def,dead>, %CR0GT<imp-def>
	...
	%R29<def> = ISEL %R3<undef>, %R4<kill>, %CR0GT<kill>

where it is relevant that %CC is an alias to %CR0, and that %CR0GT is a
subregister of %CR0. However, for post-RA scheduling, no dependency was added
to prevent the INLINEASM from being scheduled in between the ANDIo8 and the
ISEL (which communicate via the %CR0GT register).

In ScheduleDAGInstrs::addPhysRegDeps, when called for the %CC operand, we'd
iterate over all of its aliases (which include %CC itself and also %CR0), and
look for previously-encountered defs of those registers. We'd find the ANDIo8,
but decide not to add a dependency between the INLINEASM and the ANDIo8 because
both the INLINEASM's def of %CC is dead, and also the ANDIo8 def of %CR0 is
dead. This ignores, however, that ANDIo8 has a non-dead def of %CR0GT, a
subregister of %CR0, and thus a dependency still must exist.

To fix this problem, when calling registerDefIsDead on the SU with the def, we
also check all subregisters for possible non-dead defs, and add the dependency
if any are found.

Fixes PR21742.

llvm-svn: 223440
2014-12-05 01:57:22 +00:00
Hal Finkel 029042b278 [PowerPC] 'cc' should be an alias only to 'cr0'
We had mistakenly believed that GCC's 'cc' referred to the entire
condition-code register (cr0 through cr7) -- and implemented this in r205630 to
fix PR19326, but 'cc' is actually an alias only to 'cr0'. This is causing LLVM
to clobber too much with legacy code with inline asm using the 'cc' clobber.

Fixes PR21451.

llvm-svn: 223328
2014-12-04 00:46:20 +00:00
Hal Finkel d433838adf [PowerPC] Fix inline asm memory operands not to use r0
On PowerPC, inline asm memory operands might be expanded as 0($r), where $r is
a register containing the address. As a result, this register cannot be r0, and
we need to enforce this register subclass constraint to prevent miscompiling
the code (we'd get this constraint for free with the usual instruction
definitions, but that scheme has no knowledge of how we end up printing inline
asm memory operands, and so here we need to do it 'by hand'). We can accomplish
this within the current address-mode selection framework by introducing an
explicit COPY_TO_REGCLASS node.

Fixes PR21443.

llvm-svn: 223318
2014-12-03 23:40:13 +00:00
Hal Finkel c91fc11181 [PowerPC] Print all inline-asm consts as signed numbers
Almost all immediates in PowerPC assembly (both 32-bit and 64-bit) are signed
numbers, and it is important that we print them as such. To make sure that
happens, we change PPCTargetLowering::LowerAsmOperandForConstraint so that it
does all intermediate checks on a signed-extended int64_t value, and then
creates the resulting target constant using MVT::i64. This will ensure that all
negative values are printed as negative values (mirroring what is done in other
backends to achieve the same sign-extension effect).

This came up in the context of inline assembly like this:
  "add%I2   %0,%0,%2", ..., "Ir"(-1ll)
where we used to print:
  addi   3,3,4294967295
and gcc would print:
  addi   3,3,-1
and gas accepts both forms, but our builtin assembler (correctly) does not. Now
we print -1 like gcc does.

While here, I replaced a bunch of custom integer checks with isInt<16> and
friends from MathExtras.h.

Thanks to Paul Hargrove for the bug report.

llvm-svn: 223220
2014-12-03 09:37:50 +00:00
Hal Finkel 01fa7701e6 [PowerPC] Fix readcyclecounter to be custom expanded for all 32-bit targets
We need to use the custom expansion of readcyclecounter on all 32-bit targets
(even those with 64-bit registers). This should fix the ppc64 buildbot.

llvm-svn: 223182
2014-12-03 00:19:17 +00:00
Hal Finkel bbdee93638 [PowerPC] Implement readcyclecounter for PPC32
We've long supported readcyclecounter on PPC64, but it is easier there (the
read of the 64-bit time-base register can be accomplished via a single
instruction). This now provides an implementation for PPC32 as well. On PPC32,
the time-base register is still 64 bits, but can only be read 32 bits at a time
via two separate SPRs. The ISA manual explains how to do this properly (it
involves re-reading the upper bits and looping if the counter has wrapped while
being read).

This requires PPC to implement a custom integer splitting legalization for the
READCYCLECOUNTER node, turning it into a target-specific SDAG node, which then
gets turned into a pseudo-instruction, which is then expanded to the necessary
sequence (which has three SPR reads, the comparison and the branch).

Thanks to Paul Hargrove for pointing out to me that this was still unimplemented.

llvm-svn: 223161
2014-12-02 22:01:00 +00:00
Jay Foad 1f0a44e662 [PowerPC] Fix unwind info with dynamic stack realignment
Summary:
PowerPC DWARF unwind info defined CFA as SP + offset even in a function
where the stack had been dynamically realigned. This clearly doesn't
work because the offset from SP to CFA is not a constant. Fix it by
defining CFA as BP instead.

This was causing the AddressSanitizer null_deref test to fail 50% of
the time, depending on whether SP happened to be 32-byte aligned on
entry to a particular function or not.

Reviewers: willschm, uweigand, hfinkel

Reviewed By: hfinkel

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D6410

llvm-svn: 222996
2014-12-01 09:42:32 +00:00
Hal Finkel 360f213d03 [PowerPC] Implement combineRepeatedFPDivisors
This does not matter on newer cores (where we can use reciprocal estimates in
fast-math mode anyway), but for older cores this allows us to generate better
fast-math code where we have multiple FDIVs with a common divisor.

llvm-svn: 222710
2014-11-24 23:45:21 +00:00
Hal Finkel f413be11f0 [PPC] Use SeparateConstOffsetFromGEP
This mirrors r222331, which enabled SeparateConstOffsetFromGEP on AArch64, in
the PowerPC backend. Yields, on a POWER7 machine, a 30% speedup on
SingleSource/Benchmarks/Shootout/nestedloop (this might just be from LICM,
there is a store moved out of the inner loop) and a potential speedup on
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode. Regardless, it
makes some code look cleaner, and synchronizing the backends in this regard
seems like a generally good thing.

llvm-svn: 222504
2014-11-21 04:35:51 +00:00
Bill Schmidt 7674692961 [PowerPC] Add VSX builtins for vec_div
This patch adds builtin support for xvdivdp and xvdivsp, along with a
test case.  Straightforward stuff.

There's a companion patch for Clang.

llvm-svn: 221983
2014-11-14 12:10:40 +00:00
Justin Hibbits 21c5353f54 Revert part of the PIC tests (TLS part)
This change actually wasn't warranted for -O0, and the new changes prove it and
break the build.

llvm-svn: 221793
2014-11-12 16:50:15 +00:00
Justin Hibbits b296c9735e Fix thet tests.
I seem to have missed the update I made for changing 'flag_pic' to "PIC Level".
Mea culpa.

llvm-svn: 221792
2014-11-12 16:40:00 +00:00
Justin Hibbits a88b605721 Add support for small-model PIC for PowerPC.
Summary:
Large-model was added first.  With the addition of support for multiple PIC
models in LLVM, now add small-model PIC for 32-bit PowerPC, SysV4 ABI.  This
generates more optimal code, for shared libraries with less than about 16380
data objects.

Test Plan: Test cases added or updated

Reviewers: joerg, hfinkel

Reviewed By: hfinkel

Subscribers: jholewinski, mcrosier, emaste, llvm-commits

Differential Revision: http://reviews.llvm.org/D5399

llvm-svn: 221791
2014-11-12 15:16:30 +00:00
Bill Schmidt 729547847f [PowerPC] Add vec_vsx_ld and vec_vsx_st intrinsics
This patch enables the vec_vsx_ld and vec_vsx_st intrinsics for
PowerPC, which provide programmer access to the lxvd2x, lxvw4x,
stxvd2x, and stxvw4x instructions.

New LLVM intrinsics are provided to represent these four instructions
in IntrinsicsPowerPC.td.  These are patterned after the similar
intrinsics for lvx and stvx (Altivec).  In PPCInstrVSX.td, these
intrinsics are tied to the code gen patterns, with additional patterns
to allow plain vanilla loads and stores to still generate these
instructions.

At -O1 and higher the intrinsics are immediately converted to loads
and stores in InstCombineCalls.cpp.  This will open up more
optimization opportunities while still allowing the correct
instructions to be generated.  (Similar code exists for aligned
Altivec loads and stores.)

The new intrinsics are added to the code that checks for consecutive
loads and stores in PPCISelLowering.cpp, as well as to
PPCTargetLowering::getTgtMemIntrinsic().

There's a new test to verify the correct instructions are generated.
The loads and stores tend to be reordered, so the test just counts
their number.  It runs at -O2, as it's not very effective to test this
at -O0, when many unnecessary loads and stores are generated.

I ended up having to modify vsx-fma-m.ll.  It turns out this test case
is slightly unreliable, but I don't know a good way to prevent
problems with it.  The xvmaddmdp instructions read and write the same
register, which is one of the multiplicands.  Commutativity allows
either to be chosen.  If the FMAs are reordered differently than
expected by the test, the register assignment can be different as a
result.  Hopefully this doesn't change often.

There is a companion patch for Clang.

llvm-svn: 221767
2014-11-12 04:19:40 +00:00
Bill Schmidt 3d9674cfb1 [PowerPC] Replace foul hackery with real calls to __tls_get_addr
My original support for the general dynamic and local dynamic TLS
models contained some fairly obtuse hacks to generate calls to
__tls_get_addr when lowering a TargetGlobalAddress.  Rather than
generating real calls, special GET_TLS_ADDR nodes were used to wrap
the calls and only reveal them at assembly time.  I attempted to
provide correct parameter and return values by chaining CopyToReg and
CopyFromReg nodes onto the GET_TLS_ADDR nodes, but this was also not
fully correct.  Problems were seen with two back-to-back stores to TLS
variables, where the call sequences ended up overlapping with unhappy
results.  Additionally, since these weren't real calls, the proper
register side effects of a call were not recorded, so clobbered values
were kept live across the calls.

The proper thing to do is to lower these into calls in the first
place.  This is relatively straightforward; see the changes to
PPCTargetLowering::LowerGlobalTLSAddress() in PPCISelLowering.cpp.
The changes here are standard call lowering, except that we need to
track the fact that these calls will require a relocation.  This is
done by adding a machine operand flag of MO_TLSLD or MO_TLSGD to the
TargetGlobalAddress operand that appears earlier in the sequence.

The calls to LowerCallTo() eventually find their way to
LowerCall_64SVR4() or LowerCall_32SVR4(), which call FinishCall(),
which calls PrepareCall().  In PrepareCall(), we detect the calls to
__tls_get_addr and immediately snag the TargetGlobalTLSAddress with
the annotated relocation information.  This becomes an extra operand
on the call following the callee, which is expected for nodes of type
tlscall.  We change the call opcode to CALL_TLS for this case.  Back
in FinishCall(), we change it again to CALL_NOP_TLS for 64-bit only,
since we require a TOC-restore nop following the call for the 64-bit
ABIs.

During selection, patterns in PPCInstrInfo.td and PPCInstr64Bit.td
convert the CALL_TLS nodes into BL_TLS nodes, and convert the
CALL_NOP_TLS nodes into BL8_NOP_TLS nodes.  This replaces the code
removed from PPCAsmPrinter.cpp, as the BL_TLS or BL8_NOP_TLS
nodes can now be emitted normally using their patterns and the
associated printTLSCall print method.

Finally, as a result of these changes, all references to get-tls-addr
in its various guises are no longer used, so they have been removed.

There are existing TLS tests to verify the changes haven't messed
anything up).  I've added one new test that verifies that the problem
with the original code has been fixed.

llvm-svn: 221703
2014-11-11 20:44:09 +00:00
Bill Schmidt 1ca69fa64d [PowerPC] Initial VSX intrinsic support, with min/max for vector double
Now that we have initial support for VSX, we can begin adding
intrinsics for programmer access to VSX instructions.  This patch adds
basic support for VSX intrinsics in general, and tests it by
implementing intrinsics for minimum and maximum for the vector double
data type.

The LLVM portion of this is quite straightforward.  There is a
companion patch for Clang.

llvm-svn: 220988
2014-10-31 19:19:07 +00:00
Ulrich Weigand c8c2ea2854 [PowerPC] Load BlockAddress values from the TOC in 64-bit SVR4 code
Since block address values can be larger than 2GB in 64-bit code, they
cannot be loaded simply using an @l / @ha pair, but instead must be
loaded from the TOC, just like GlobalAddress, ConstantPool, and
JumpTable values are.

The commit also fixes a bug in PPCLinuxAsmPrinter::doFinalization where
temporary labels could not be used as TOC values, since code would
attempt (and fail) to use GetOrCreateSymbol to create a symbol of the
same name as the temporary label.

llvm-svn: 220959
2014-10-31 10:33:14 +00:00
Bill Schmidt 9c54bbd791 [PATCH] Support select-cc for VSFRC when VSX is enabled
A previous patch enabled SELECT_VSRC and SELECT_CC_VSRC for VSX to
handle <2 x double> cases.  This patch adds SELECT_VSFRC and
SELECT_CC_VSFRC to allow use of all 64 vector-scalar registers for the
f64 type when VSX is enabled.  The changes are analogous to those in
the previous patch.  I've added a new variant to vsx.ll to test the
code generation.

(I also cleaned up a little formatting in PPCInstrVSX.td from the
previous patch.)

llvm-svn: 220395
2014-10-22 16:58:20 +00:00
Matt Arsenault 7c93690be0 Add minnum / maxnum codegen
llvm-svn: 220342
2014-10-21 23:01:01 +00:00
Bill Schmidt 5c6cb813b6 [PowerPC] Avoid VSX FMA mutate when killed product reg = addend reg
With VSX enabled, test/CodeGen/PowerPC/recipest.ll exposes a bug in
the FMA mutation pass.  If we have a situation where a killed product
register is the same register as the FMA target, such as:

   %vreg5<def,tied1> = XSNMSUBADP %vreg5<tied0>, %vreg11, %vreg5,
                       %RM<imp-use>; VSFRC:%vreg5 F8RC:%vreg11 

then the substitution makes no sense.  We end up getting a crash when
we try to extend the interval associated with the killed product
register, as there is already a live range for %vreg5 there.  This
patch just disables the mutation under those circumstances.

Since recipest.ll generates different code with VMX enabled, I've
modified that test to use -mattr=-vsx.  I've borrowed the code from
that test that exposed the bug and placed it in fma-mutate.ll, where
it tests several mutation opportunities including the "bad" one.

llvm-svn: 220290
2014-10-21 13:02:37 +00:00
Bill Schmidt 941a3244ff [PowerPC] Clean up -mattr=+vsx tests to always specify -mcpu
We recently discovered an issue that reinforces what a good idea it is
to always specify -mcpu in our code generation tests, particularly for
-mattr=+vsx.  This patch ensures that all tests that specify
-mattr=+vsx also specify -mcpu=pwr7 or -mcpu=pwr8, as appropriate.

Some of the uses of -mattr=+vsx added recently don't make much sense
(when specified for -mtriple=powerpc-apple-darwin8 or -march=ppc32,
for example).  For cases like this I've just removed the extra VSX
test commands; there's enough coverage without them.

llvm-svn: 220173
2014-10-19 21:29:21 +00:00
Bill Schmidt 87982a1e9b [PowerPC] Temporarily disable VSX for PowerPC fast-isel tests
Patch by Bill Seurer; some comment formatting changes by me.

There are a few PowerPC test cases for FastISel support that currently
fail with VSX support enabled.  The temporary workaround under
discussion in http://reviews.llvm.org/D5362 helps, but the tests still
fail because they specify -fast-isel-abort, and the VSX workaround
punts back to SelectionDAG.  We have plans to fix FastISel permanently
for VSX, but until that's in place these tests are preventing us from
enabling VSX by default.  Therefore we are adding -mattr=-vsx to these
tests until the full support is ready.

llvm-svn: 220172
2014-10-19 20:48:47 +00:00
Bill Schmidt fff860979d [PowerPC] Re-enable VSX test line for fma.ll with -mcpu=pwr7
The VSX testing variant in test/CodeGen/PowerPC/fma.ll had to be
disabled because of unexpected behavior on many of the builders.  I
tracked this down to a situation that occurs when the VSX attribute is
enabled for a target that disables the MI early scheduling pass.  This
patch adds -mcpu=pwr7 to make this predictable.  The other issue will
be addressed separately.

llvm-svn: 220171
2014-10-19 20:27:56 +00:00
Bill Schmidt 296bb31f64 [PowerPC] Disable +vsx RUN line for fma.ll due to inconsistency on other builders
llvm-svn: 220094
2014-10-17 21:32:22 +00:00
Bill Schmidt a087d74250 [PowerPC] Change liveness testing in VSX FMA mutation pass
With VSX enabled, LLVM crashes when compiling
test/CodeGen/PowerPC/fma.ll.  I traced this to the liveness test
that's revised in this patch. The interval test is designed to only
work for virtual registers, but in this case the AddendSrcReg is
physical. Since there is already a walk of the MIs between the
AddendMI and the FMA, I added a check for def/kill of the AddendSrcReg
in that loop.  At Hal Finkel's request, I converted the liveness test
to an assert restricted to virtual registers.

I've changed the fma.ll test to have VSX and non-VSX variants so we
can test both kinds of multiply-adds.

llvm-svn: 220090
2014-10-17 21:02:44 +00:00
Bill Schmidt 2d1128acb2 [PowerPC] Enable use of lxvw4x/stxvw4x in VSX code generation
Currently the VSX support enables use of lxvd2x and stxvd2x for 2x64
types, but does not yet use lxvw4x and stxvw4x for 4x32 types.  This
patch adds that support.

As with lxvd2x/stxvd2x, this involves straightforward overriding of
the patterns normally recognized for lvx/stvx, with preference given
to the VSX patterns when VSX is enabled.

In addition, the logic for permitting misaligned memory accesses is
modified so that v4r32 and v4i32 are treated the same as v2f64 and
v2i64 when VSX is enabled.  Finally, the DAG generation for unaligned
loads is changed to just use a normal LOAD (which will become lxvw4x)
on P8 and later hardware, where unaligned loads are preferred over
lvsl/lvx/lvx/vperm.

A number of tests now generate the VSX loads/stores instead of
lvx/stvx, so this patch adds VSX variants to those tests.  I've also
added <4 x float> tests to the vsx.ll test case, and created a
vsx-p8.ll test case to be used for testing code generation for the
P8Vector feature.  For now, that simply tests the unaligned load/store
behavior.

This has been tested along with a temporary patch to enable the VSX
and P8Vector features, with no new regressions encountered with or
without the temporary patch applied.

llvm-svn: 220047
2014-10-17 15:13:38 +00:00
Bill Schmidt 32e9c6465b [PPC] Adjust some PowerPC tests to account for presence/absence of VSX
Patch by Bill Seurer; committed on his behalf.

These test cases generate slightly different code sequences when VSX
is activated and thus fail. The update turns off VSX explicitly for
the existing checks and then adds a second set of checks for most of
them that test the VSX instruction output.

llvm-svn: 220019
2014-10-17 01:41:22 +00:00
Rafael Espindola 11aaaeebe0 Delete -std-compile-opts.
These days -std-compile-opts was just a silly alias for -O3.

llvm-svn: 219951
2014-10-16 20:00:02 +00:00
Sanjay Patel 3d497cd778 Improve sqrt estimate algorithm (fast-math)
This patch changes the fast-math implementation for calculating sqrt(x) from:
y = 1 / (1 / sqrt(x))
to:
y = x * (1 / sqrt(x))

This has 2 benefits: less code / faster code and one less estimate instruction 
that may lose precision.

The only target that will be affected (until http://reviews.llvm.org/D5658 is approved)
is PPC. The difference in codegen for PPC is 2 less flops for a single-precision sqrtf
or vector sqrtf and 4 less flops for a double-precision sqrt. 
We also eliminate a constant load and extra register usage.

Differential Revision: http://reviews.llvm.org/D5682

llvm-svn: 219445
2014-10-09 21:26:35 +00:00
Samuel Antao 1194b8fd40 Fix bug in GPR to FPR moves in PPC64LE.
The current implementation of GPR->FPR register moves uses a stack slot. This mechanism writes a double word and reads a word. In big-endian the load address must be displaced by 4-bytes in order to get the right value. In little endian this is no longer required. This patch fixes the issue and adds LE regression tests to fast-isel-conversion which currently expose this problem.

llvm-svn: 219441
2014-10-09 20:42:56 +00:00
Sanjay Patel 7bc9185ab5 Fast-math fold: x / (y * sqrt(z)) -> x * (rsqrt(z) / y)
The motivation is to recognize code such as this from /llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c:

float distance = sqrt(dx * dx + dy * dy + dz * dz);
float mag = dt / (distance * distance * distance);

Without this patch, we don't match the sqrt as a reciprocal sqrt, so for PPC the new testcase in this patch produces:

   addis 3, 2, .LCPI4_2@toc@ha
   lfs 4, .LCPI4_2@toc@l(3)
   addis 3, 2, .LCPI4_1@toc@ha
   lfs 0, .LCPI4_1@toc@l(3)
   fcmpu 0, 1, 4
   beq 0, .LBB4_2
# BB#1:
   frsqrtes 4, 1
   addis 3, 2, .LCPI4_0@toc@ha
   lfs 5, .LCPI4_0@toc@l(3)
   fnmsubs 13, 1, 5, 1
   fmuls 6, 4, 4
   fmadds 1, 13, 6, 5
   fmuls 1, 4, 1
   fres 4, 1                <--- reciprocal of reciprocal square root
   fnmsubs 1, 1, 4, 0
   fmadds 4, 4, 1, 4
.LBB4_2:
   fmuls 1, 4, 2
   fres 2, 1
   fnmsubs 0, 1, 2, 0
   fmadds 0, 2, 0, 2
   fmuls 1, 3, 0
   blr

After the patch, this simplifies to:

frsqrtes 0, 1
addis 3, 2, .LCPI4_1@toc@ha
fres 5, 2
lfs 4, .LCPI4_1@toc@l(3)
addis 3, 2, .LCPI4_0@toc@ha
lfs 7, .LCPI4_0@toc@l(3)
fnmsubs 13, 1, 4, 1
fmuls 6, 0, 0
fnmsubs 2, 2, 5, 7
fmadds 1, 13, 6, 4
fmadds 2, 5, 2, 5
fmuls 0, 0, 1
fmuls 0, 0, 2
fmuls 1, 3, 0
blr

Differential Revision: http://reviews.llvm.org/D5628

llvm-svn: 219139
2014-10-06 19:31:18 +00:00
Duncan P. N. Exon Smith 176b691d32 Revert "Revert "DI: Fold constant arguments into a single MDString""
This reverts commit r218918, effectively reapplying r218914 after fixing
an Ocaml bindings test and an Asan crash.  The root cause of the latter
was a tightened-up check in `DILexicalBlock::Verify()`, so I'll file a
PR to investigate who requires the loose check (and why).

Original commit message follows.

--

This patch addresses the first stage of PR17891 by folding constant
arguments together into a single MDString.  Integers are stringified and
a `\0` character is used as a separator.

Part of PR17891.

Note: I've attached my testcases upgrade scripts to the PR.  If I've
just broken your out-of-tree testcases, they might help.

llvm-svn: 219010
2014-10-03 20:01:09 +00:00
Robin Morisset 028eabcdaa [Power] Delete redundant test Atomics-32.ll
The test Atomics-32.ll was both redundant (all operations are also checked by
atomics.ll at least) and not actually checking correctness (it was not using
FileCheck, just verifying that the compiler does not crash).

llvm-svn: 218997
2014-10-03 18:10:07 +00:00
Robin Morisset 9098fee690 [Power] Use lwsync for non-seq_cst fences
Summary:
hwsync is only required for seq_cst fences, acquire and release one can use
the cheaper lwsync.

Test Plan: Added some cases to atomics.ll + make check-all

Reviewers: jfb, wschmidt

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5317

llvm-svn: 218995
2014-10-03 18:04:36 +00:00
Hal Finkel fe3368cb57 [PowerPC] Modern Book-E cores support sync
Older Book-E cores, such as the PPC 440, support only msync (which has the same
encoding as sync 0), but not any of the other sync forms. Newer Book-E cores,
however, do support sync, and for performance reasons we should allow the use
of the more-general form.

This refactors msync use into its own feature group so that it applies by
default only to older Book-E cores (of the relevant cores, we only have
definitions for the PPC440/450 currently).

llvm-svn: 218923
2014-10-02 22:34:22 +00:00
Robin Morisset e1ca44bd4c [Power] Improve the expansion of atomic loads/stores
Summary:
Atomic loads and store of up to the native size (32 bits, or 64 for PPC64)
can be lowered to a simple load or store instruction (as the synchronization
is already handled by AtomicExpand, and the atomicity is guaranteed thanks to
the alignment requirements of atomic accesses). This is exactly what this patch
does. Previously, these were implemented by complex
load-linked/store-conditional loops.. an obvious performance problem.

For example, this patch turns
```
define void @store_i8_unordered(i8* %mem) {
  store atomic i8 42, i8* %mem unordered, align 1
  ret void
}
```
from
```
_store_i8_unordered:                    ; @store_i8_unordered
; BB#0:
    rlwinm r2, r3, 3, 27, 28
    li r4, 42
    xori r5, r2, 24
    rlwinm r2, r3, 0, 0, 29
    li r3, 255
    slw r4, r4, r5
    slw r3, r3, r5
    and r4, r4, r3
LBB4_1:                                 ; =>This Inner Loop Header: Depth=1
    lwarx r5, 0, r2
    andc r5, r5, r3
    or r5, r4, r5
    stwcx. r5, 0, r2
    bne cr0, LBB4_1
; BB#2:
    blr
```
into
```
_store_i8_unordered:                    ; @store_i8_unordered
; BB#0:
    li r2, 42
    stb r2, 0(r3)
    blr

```
which looks like a pretty clear win to me.

Test Plan:
fixed the tests + new test for indexed accesses + make check-all

Reviewers: jfb, wschmidt, hfinkel

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5587

llvm-svn: 218922
2014-10-02 22:27:07 +00:00
Duncan P. N. Exon Smith 786cd049fc Revert "DI: Fold constant arguments into a single MDString"
This reverts commit r218914 while I investigate some bots.

llvm-svn: 218918
2014-10-02 22:15:31 +00:00
Duncan P. N. Exon Smith 571f97bd90 DI: Fold constant arguments into a single MDString
This patch addresses the first stage of PR17891 by folding constant
arguments together into a single MDString.  Integers are stringified and
a `\0` character is used as a separator.

Part of PR17891.

Note: I've attached my testcases upgrade scripts to the PR.  If I've
just broken your out-of-tree testcases, they might help.

llvm-svn: 218914
2014-10-02 21:56:57 +00:00
Adrian Prantl 87b7eb9d0f Move the complex address expression out of DIVariable and into an extra
argument of the llvm.dbg.declare/llvm.dbg.value intrinsics.

Previously, DIVariable was a variable-length field that has an optional
reference to a Metadata array consisting of a variable number of
complex address expressions. In the case of OpPiece expressions this is
wasting a lot of storage in IR, because when an aggregate type is, e.g.,
SROA'd into all of its n individual members, the IR will contain n copies
of the DIVariable, all alike, only differing in the complex address
reference at the end.

By making the complex address into an extra argument of the
dbg.value/dbg.declare intrinsics, all of the pieces can reference the
same variable and the complex address expressions can be uniqued across
the CU, too.
Down the road, this will allow us to move other flags, such as
"indirection" out of the DIVariable, too.

The new intrinsics look like this:
declare void @llvm.dbg.declare(metadata %storage, metadata %var, metadata %expr)
declare void @llvm.dbg.value(metadata %storage, i64 %offset, metadata %var, metadata %expr)

This patch adds a new LLVM-local tag to DIExpressions, so we can detect
and pretty-print DIExpression metadata nodes.

What this patch doesn't do:

This patch does not touch the "Indirect" field in DIVariable; but moving
that into the expression would be a natural next step.

http://reviews.llvm.org/D4919
rdar://problem/17994491

Thanks to dblaikie and dexonsmith for reviewing this patch!

Note: I accidentally committed a bogus older version of this patch previously.
llvm-svn: 218787
2014-10-01 18:55:02 +00:00
Adrian Prantl b458dc2eee Revert r218778 while investigating buldbot breakage.
"Move the complex address expression out of DIVariable and into an extra"

llvm-svn: 218782
2014-10-01 18:10:54 +00:00
Adrian Prantl 25a7174e7a Move the complex address expression out of DIVariable and into an extra
argument of the llvm.dbg.declare/llvm.dbg.value intrinsics.

Previously, DIVariable was a variable-length field that has an optional
reference to a Metadata array consisting of a variable number of
complex address expressions. In the case of OpPiece expressions this is
wasting a lot of storage in IR, because when an aggregate type is, e.g.,
SROA'd into all of its n individual members, the IR will contain n copies
of the DIVariable, all alike, only differing in the complex address
reference at the end.

By making the complex address into an extra argument of the
dbg.value/dbg.declare intrinsics, all of the pieces can reference the
same variable and the complex address expressions can be uniqued across
the CU, too.
Down the road, this will allow us to move other flags, such as
"indirection" out of the DIVariable, too.

The new intrinsics look like this:
declare void @llvm.dbg.declare(metadata %storage, metadata %var, metadata %expr)
declare void @llvm.dbg.value(metadata %storage, i64 %offset, metadata %var, metadata %expr)

This patch adds a new LLVM-local tag to DIExpressions, so we can detect
and pretty-print DIExpression metadata nodes.

What this patch doesn't do:

This patch does not touch the "Indirect" field in DIVariable; but moving
that into the expression would be a natural next step.

http://reviews.llvm.org/D4919
rdar://problem/17994491

Thanks to dblaikie and dexonsmith for reviewing this patch!

llvm-svn: 218778
2014-10-01 17:55:39 +00:00
Sanjay Patel bdf1e38856 Refactor reciprocal and reciprocal square root estimate into target-independent functions (part 2).
This is purely refactoring. No functional changes intended. PowerPC is the only target
that is currently using this interface.

The ultimate goal is to allow targets other than PowerPC (certainly X86 and Aarch64) to turn this:

z = y / sqrt(x)

into:

z = y * rsqrte(x)

And:

z = y / x

into:

z = y * rcpe(x)

using whatever HW magic they can use. See http://llvm.org/bugs/show_bug.cgi?id=20900 .

There is one hook in TargetLowering to get the target-specific opcode for an estimate instruction
along with the number of refinement steps needed to make the estimate usable.

Differential Revision: http://reviews.llvm.org/D5484

llvm-svn: 218553
2014-09-26 23:01:47 +00:00
Robin Morisset 2212996936 [Power] Use AtomicExpandPass for fence insertion, and use lwsync where appropriate
Summary:
This patch makes use of AtomicExpandPass in Power for inserting fences around
atomic as part of an effort to remove fence insertion from SelectionDAGBuilder.
As a big bonus, it lets us use sync 1 (lightweight sync, often used by the mnemonic
lwsync) instead of sync 0 (heavyweight sync) in many cases.

I also added a test, as there was no test for the barriers emitted by the Power
backend for atomic loads and stores.

Test Plan: new test + make check-all

Reviewers: jfb

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5180

llvm-svn: 218331
2014-09-23 20:46:49 +00:00
Sanjay Patel 4bc685c206 tighten up checks
We manage to generate all of the matching instructions (and a lot more) via
the reciprocal optimization function - even if we completely remove the square
root optimization. With CHECK_NEXT, we assure that we're executing the
expected square root optimization paths and not generating extra insts.

llvm-svn: 218284
2014-09-22 22:46:44 +00:00
Sanjay Patel 5cf7561d21 remove unnecessary labels; NFC
llvm-svn: 218278
2014-09-22 21:52:53 +00:00
Hal Finkel 62ac736faa Optionally enable more-aggressive FMA formation in DAGCombine
The heuristic used by DAGCombine to form FMAs checks that the FMUL has only one
use, but this is overly-conservative on some systems. Specifically, if the FMA
and the FADD have the same latency (and the FMA does not compete for resources
with the FMUL any more than the FADD does), there is no need for the
restriction, and furthermore, forming the FMA leaving the FMUL can still allow
for higher overall throughput and decreased critical-path length.

Here we add a new TLI callback, enableAggressiveFMAFusion, false by default, to
elide the hasOneUse check. This is enabled for PowerPC by default, as most
PowerPC systems will benefit.

Patch by Olivier Sallenave, thanks!

llvm-svn: 218120
2014-09-19 11:42:56 +00:00
Samuel Antao 61570df715 Fix FastISel bug in boolean returns for PowerPC.
For PPC targets, FastISel does not take the sign extension information into account when selecting return instructions whose operands are constants. A consequence of this is that the return of boolean values is not correct. This patch fixes the problem by evaluating the sign extension information also for constants, forwarding this information to PPCMaterializeInt which takes this information to drive the sign extension during the materialization. 

llvm-svn: 217993
2014-09-17 23:25:06 +00:00
Rafael Espindola 9dd2d5810f Add back tests for empty function in SPARC and PowerPC.
llvm-svn: 217834
2014-09-15 22:11:07 +00:00
Rafael Espindola 6865d6f08a Fix a lot of confusion around inserting nops on empty functions.
On MachO, and MachO only, we cannot have a truly empty function since that
breaks the linker logic for atomizing the section.

When we are emitting a frame pointer, the presence of an unreachable will
create a cfi instruction pointing past the last instruction. This is perfectly
fine. The FDE information encodes the pc range it applies to. If some tool
cannot handle this, we should explicitly say which bug we are working around
and only work around it when it is actually relevant (not for ELF for example).

Given the unreachable we could omit the .cfi_def_cfa_register, but then
again, we could also omit the entire function prologue if we wanted to.

llvm-svn: 217801
2014-09-15 18:32:58 +00:00
Bill Schmidt b73b370809 Address comments on r217622
llvm-svn: 217680
2014-09-12 14:26:36 +00:00
Bill Schmidt 3ae268076b Add missing colon to RUN line...
llvm-svn: 217623
2014-09-11 20:13:52 +00:00
Bill Schmidt be95fd5357 [PATCH, PowerPC] Accept 'U' and 'X' constraints in inline asm
Inline asm may specify 'U' and 'X' constraints to print a 'u' for an
update-form memory reference, or an 'x' for an indexed-form memory
reference.  However, these are really only useful in GCC internal code
generation.  In inline asm the operand of the memory constraint is
typically just a register containing the address, so 'U' and 'X' make
no sense.

This patch quietly accepts 'U' and 'X' in inline asm patterns, but
otherwise does nothing.  If we ever unexpectedly see a non-register,
we'll assert and sort it out afterwards.

I've added a new test for these constraints; the test case should be
used for other asm-constraints changes down the road.

llvm-svn: 217622
2014-09-11 20:10:03 +00:00
Hal Finkel e19006ea22 Enable splitting indexing from loads with TargetConstants
When I recommitted r208640 (in r216898) I added an exclusion for TargetConstant
offsets, as there is no guarantee that a backend can handle them on generic
ADDs (even if it generates them during address-mode matching) -- and,
specifically, applying this transformation directly with TargetConstants caused
a self-hosting failure on PPC64. Ignoring all TargetConstants, however, is less
than ideal. Instead, for non-opaque constants, we can convert them into regular
constants for use with the generated ADD (or SUB).

llvm-svn: 216908
2014-09-02 16:05:23 +00:00
Hal Finkel 584a70c820 [PowerPC] Add support for dcbtst and icbt (prefetch)
Adds code generation support for dcbtst (data cache prefetch for write) and
icbt (instruction cache prefetch for read - Book E cores only).

We still end up with a 'cannot select' error for the non-supported prefetch
intrinsic forms. This will be fixed in a later commit.

Fixes PR20692.

llvm-svn: 216339
2014-08-23 23:21:04 +00:00
Juergen Ributzka 4bf6c01cdb Reapply [FastISel] Let the target decide first if it wants to materialize a constant (215588).
Note: This was originally reverted to track down a buildbot error. This commit
exposed a latent bug that was fixed in r215753. Therefore it is reapplied
without any modifications.

I run it through SPEC2k and SPEC2k6 for AArch64 and it didn't introduce any new
regeressions.

Original commit message:
This changes the order in which FastISel tries to materialize a constant.
Originally it would try to use a simple target-independent approach, which
can lead to the generation of inefficient code.

On X86 this would result in the use of movabsq to materialize any 64bit
integer constant - even for simple and small values such as 0 and 1. Also
some very funny floating-point materialization could be observed too.

On AArch64 it would materialize the constant 0 in a register even the
architecture has an actual "zero" register.

On ARM it would generate unnecessary mov instructions or not use mvn.

This change simply changes the order and always asks the target first if it
likes to materialize the constant. This doesn't fix all the issues
mentioned above, but it enables the targets to implement such
optimizations.

Related to <rdar://problem/17420988>.

llvm-svn: 216006
2014-08-19 19:05:24 +00:00
Hal Finkel 41a55ad0a5 [PowerPC] Mark fixed-offset byvals as pointed-to by IR values
A byval object, even if allocated at a fixed offset (prescribed by the ABI) is
pointed to by IR values. Most fixed-offset stack objects are not pointed-to by
IR values, so the default is to assume this is not possible. However, we need
to override the default in this case (instruction scheduling can cause
miscompiles otherwise).

Fixes PR20280.

llvm-svn: 215795
2014-08-16 00:17:05 +00:00
Bill Schmidt 9bac9ca796 [PPC64] Add test case for r215685.
I had deferred adding this test case until I could get it down to a
reasonable size.  That's done now.

Thanks,
Bill

llvm-svn: 215711
2014-08-15 13:51:57 +00:00
Juergen Ributzka 790bacf232 Revert several FastISel commits to track down a buildbot error.
This reverts:
r215595 "[FastISel][X86] Add large code model support for materializing floating-point constants."
r215594 "[FastISel][X86] Use XOR to materialize the "0" value."
r215593 "[FastISel][X86] Emit more efficient instructions for integer constant materialization."
r215591 "[FastISel][AArch64] Make use of the zero register when possible."
r215588 "[FastISel] Let the target decide first if it wants to materialize a constant."
r215582 "[FastISel][AArch64] Cleanup constant materialization code. NFCI."

llvm-svn: 215673
2014-08-14 19:56:28 +00:00
Juergen Ributzka 7cee768e55 [FastISel] Let the target decide first if it wants to materialize a constant.
This changes the order in which FastISel tries to materialize a constant.
Originally it would try to use a simple target-independent approach, which
can lead to the generation of inefficient code.

On X86 this would result in the use of movabsq to materialize any 64bit
integer constant - even for simple and small values such as 0 and 1. Also
some very funny floating-point materialization could be observed too.

On AArch64 it would materialize the constant 0 in a register even the
architecture has an actual "zero" register.

On ARM it would generate unnecessary mov instructions or not use mvn.

This change simply changes the order and always asks the target first if it
likes to materialize the constant. This doesn't fix all the issues
mentioned above, but it enables the targets to implement such
optimizations.

Related to <rdar://problem/17420988>.

llvm-svn: 215588
2014-08-13 22:08:02 +00:00
Hal Finkel 46ef7ce283 [PowerPC] Implement PPCTargetLowering::getTgtMemIntrinsic
This implements PPCTargetLowering::getTgtMemIntrinsic for Altivec load/store
intrinsics. As with the construction of the MachineMemOperands for the
intrinsic calls used for unaligned load/store lowering, the only slight
complication is that we need to represent a larger memory range than the
loaded/stored value-type size (because the address is rounded down to an
aligned address, and we need to conservatively represent the entire possible
range of the actual access). This required adding an extra size field to
TargetLowering::IntrinsicInfo, and this was done in a way that required no
modifications to other targets (the size defaults to the store size of the
provided memory data type).

This fixes test/CodeGen/PowerPC/unal-altivec-wint.ll (so it can be un-XFAILed).

llvm-svn: 215512
2014-08-13 01:15:40 +00:00
Hal Finkel 415e344f29 Fix classof for ISD::INTRINSIC_W_CHAIN and INTRINSIC_VOID
Unfortunately, our use of the SDNode class hierarchy for INTRINSIC_W_CHAIN and
INTRINSIC_VOID nodes is somewhat broken right now. These nodes sometimes are
used for memory intrinsics (those with MachineMemOperands), and sometimes not.
When not, the nodes are not created as instances of MemIntrinsicSDNode, but
rather created as some other subclass of SDNode using DAG::getNode. When they
are memory intrinsics, they are created using DAG::getMemIntrinsicNode as
instances of MemIntrinsicSDNode. MemIntrinsicSDNode is a subclass of
MemSDNode, but prior to r214452, we had a non-self-consistent setup whereby
MemIntrinsicSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID would
return true but MemSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID
would return false. In r214452, MemSDNode::classof was changed to return true
for INTRINSIC_W_CHAIN and INTRINSIC_VOID, which is now self-consistent. The
problem is that neither the pre-r214452 logic and the post-r214452 logic are
really right. The truth is that not all INTRINSIC_W_CHAIN and INTRINSIC_VOID
nodes are instances of MemIntrinsicSDNode (or MemSDNode for that matter), and
the return value from classof needs to reflect that. This was broken before
r214452 (because MemIntrinsicSDNode::classof always returned true), and was
broken afterward (because MemSDNode::classof also always returned true), and
will now be correct.

The minimal solution is to grab one of the SubclassData bits (there is one left
for MemIntrinsicSDNode nodes) and use it to store whether or not a particular
INTRINSIC_W_CHAIN or INTRINSIC_VOID is really an instance of
MemIntrinsicSDNode or not. Doing this allows both MemIntrinsicSDNode::classof
and MemSDNode::classof to return the correct answer for the underlying object
for both the memory-intrinsic and non-memory-intrinsic cases.

This fixes the problem that r214452 created in the SelectionDAGDumper (thanks
to Matt Arsenault for pointing it out).

Because PowerPC does not implement getTgtMemIntrinsic, this change breaks
test/CodeGen/PowerPC/unal-altivec-wint.ll. I've XFAILed it for now, and will
fix it in a follow-up commit.

llvm-svn: 215511
2014-08-13 01:15:37 +00:00
Joerg Sonnenberger 7ee0f31a8b Provide an implementation of getNoopForMachoTarget for PPC, otherwise
empty functions will assert in the MC object writer.

llvm-svn: 215238
2014-08-08 19:13:23 +00:00
Bill Schmidt 42a6936c78 [PowerPC] Swap arguments and adjust shift count for vsldoi on little endian
Commits r213915 and r214718 fix recognition of shuffle masks for vmrg*
and vpku*um instructions for a little-endian target, by swapping the
input arguments.  The vsldoi instruction requires similar treatment,
and also needs its shift count adjusted for little endian.

Reviewed by Ulrich Weigand.

This is a bug fix candidate for release 3.5 (and hopefully the last of
those for PowerPC).

llvm-svn: 214923
2014-08-05 20:47:25 +00:00
Bill Schmidt f04e998e00 [PPC64LE] Fix wrong IR for vec_sld and vec_vsldoi
My original LE implementation of the vsldoi instruction, with its
altivec.h interfaces vec_sld and vec_vsldoi, produces incorrect
shufflevector operations in the LLVM IR.  Correct code is generated
because the back end handles the incorrect shufflevector in a
consistent manner.

This patch and a companion patch for Clang correct this problem by
removing the fixup from altivec.h and the corresponding fixup from the
PowerPC back end.  Several test cases are also modified to reflect the
now-correct LLVM IR.

llvm-svn: 214800
2014-08-04 23:21:01 +00:00
Joerg Sonnenberger 6d05a2b461 MC uses .lcomm now, so adjust.
llvm-svn: 214776
2014-08-04 21:06:00 +00:00
Ulrich Weigand 983341d3f3 [PowerPC] Add target triple to vec_urem_const.ll test case
This should hopefully fix build bots on other architectures.

llvm-svn: 214721
2014-08-04 14:55:26 +00:00
Ulrich Weigand cc9909b881 [PowerPC] Swap arguments to vpkuhum/vpkuwum on little-endian
In commit r213915, Bill fixed little-endian usage of vmrgh* and vmrgl*
by swapping the input arguments.  As it turns out, the exact same fix
is also required for the vpkuhum/vpkuwum patterns.

This fixes another regression in llvmpipe when vector support is
enabled.

Reviewed by Bill Schmidt.

llvm-svn: 214718
2014-08-04 13:53:40 +00:00
Ulrich Weigand 51eccec5d9 [PowerPC] MULHU/MULHS are not legal for vector types
I ran into some test failures where common code changed vector division
by constant into a multiply-high operation (MULHU).  But these are not
implemented by the back-end, so we failed to recognize the insn.

Fixed by marking MULHU/MULHS as Expand for vector types.

llvm-svn: 214716
2014-08-04 13:27:12 +00:00
Ulrich Weigand c4cc7febb0 [PowerPC] Fix and improve vector comparisons
This patch refactors code generation of vector comparisons.

This fixes a wrong code-gen bug for ISD::SETGE for floating-point types,
and improves generated code for vector comparisons in general.

Specifically, the patch moves all logic deciding how to implement vector
comparisons into getVCmpInst, which gets two extra boolean outputs
indicating to its caller whether its needs to swap the input operands
and/or negate the result of the comparison.  Apart from implementing
these two modifications as directed by getVCmpInst, there is no need
to ever implement vector comparisons in any other manner; in particular,
there is never a need to perform two separate comparisons (e.g. one for
equal and one for greater-than, as code used to do before this patch).

Reviewed by Bill Schmidt.

llvm-svn: 214714
2014-08-04 13:13:57 +00:00
Hal Finkel 3604bf7fe7 [PowerPC] Recognize consecutive memory accesses from intrinsics
When generating unaligned vector loads, we need to search for other loads or
stores nearby offset by one vector width. If we find one, then we know that we
can safely generate another aligned load at that address. Otherwise, we must
generate the next load using an offset of the vector width minus one byte (so
we don't read off the end of the allocation if the base unaligned address
happened to be aligned at runtime). We had previously done this using only
other vector loads and stores, but did not consider the PowerPC-specific vector
load/store intrinsics. Now we'll also consider vector intrinsics. By itself,
this change is a feature enhancement, but is a necessary step toward fixing the
underlying problem behind PR19991.

llvm-svn: 214469
2014-08-01 01:02:01 +00:00
Will Schmidt 44ff8f06ec Disable IsSub subregister assert. pr18663.
This is a follow-up to the activity in the bug at
http://llvm.org/bugs/show_bug.cgi?id=18663 .  The underlying issue has
to do with how the KILL pseudo-instruction is handled.  I defer to
Hal/Jakob/Uli for additional details and background.

This will disable the (bad?) assert, add an associated fixme comment,
and add a pair of tests.

The code change and the pr18663-2.ll test are copied from the referenced
bug.  That test does not immediately fail in my environment, but I have
added the pr18663.ll test which does.

(Comment from Hal)
to provide everyone else with some context, this assert was not bad when
it was written. At that time, we only generated KILL pseudo instructions
around subregister copies. This logic, unfortunately, had its own problems.
In r199797, the relevant logic in MachineCopyPropagation was replaced to
generate KILLs for other kinds of copies too. This change in semantics broke
this now-problematic assumption in AggressiveAntiDepBreaker. The
AggressiveAntiDepBreaker really needs a proper cleanup to deal with the
change, but removing the assert (which just allows the function to return
false) is a safe conservative behavior, and should do for the time being.

llvm-svn: 214429
2014-07-31 19:50:53 +00:00
Hal Finkel 36eff0f854 Fix ScalarEvolutionExpander when creating a PHI in a block with duplicate predecessors
It seems that when I fixed this, almost exactly a year ago, I did not quite do
it correctly. When we have duplicate block predecessors, we can indeed not have
different incoming values for the same block, but we *must* have duplicate
entries. So, instead of skipping the duplicates, we explicitly add the
duplicate incoming values.

Fixes PR20442.

llvm-svn: 214423
2014-07-31 19:13:38 +00:00
Ulrich Weigand e09f73716a [PowerPC] Fix ppc64-elf-abi.ll test case on Darwin
Use full -mtriple instead of just -march to ensure Linux ABI
(ELFv1 or ELFv2) is selected.

llvm-svn: 214179
2014-07-29 12:48:14 +00:00
Ulrich Weigand 085a10c49e [PowerPC] Add testcase forgotten in the 214072 commit.
llvm-svn: 214073
2014-07-28 13:10:25 +00:00
Hal Finkel 7c8ae53506 [PowerPC] Support TLS on PPC32/ELF
Patch by Justin Hibbits!

llvm-svn: 213960
2014-07-25 17:47:22 +00:00
Bill Schmidt c9fa5dd618 [PATCH][PPC64LE] Correct little-endian usage of vmrgh* and vmrgl*.
Because the PowerPC vmrgh* and vmrgl* instructions have a built-in
big-endian bias, it is necessary to swap their inputs in little-endian
mode when using them to implement a vector shuffle.  This was
previously missed in the vector LE implementation.

There was already logic to distinguish between unary and "normal"
vmrg* vector shuffles, so this patch extends that logic to use a third
option:  "swapped" vmrg* vector shuffles that are used for little
endian in place of the "normal" ones.

I've updated the vec-shuffle-le.ll test to check for the expected
register ordering on the generated instructions.

This bug was discovered when testing the LE and ELFv2 patches for
safety if they were backported to 3.4.  A different vectorization
decision was made in 3.4 than on mainline trunk, and that exposed the
problem.  I've verified this fix takes care of that issue.

llvm-svn: 213915
2014-07-25 01:55:55 +00:00
Joerg Sonnenberger b5459e6e22 Don't use 128bit functions on PPC32.
llvm-svn: 213899
2014-07-24 22:20:10 +00:00
Chandler Carruth 9a0051cd59 [SDAG] Make the DAGCombine worklist not grow endlessly due to duplicate
insertions.

The old behavior could cause arbitrarily bad memory usage in the DAG
combiner if there was heavy traffic of adding nodes already on the
worklist to it. This commit switches the DAG combine worklist to work
the same way as the instcombine worklist where we null-out removed
entries and only add new entries to the worklist. My measurements of
codegen time shows slight improvement. The memory utilization is
unsurprisingly dominated by other factors (the IR and DAG itself
I suspect).

This change results in subtle, frustrating churn in the particular order
in which DAG combines are applied which causes a number of minor
regressions where we fail to match a pattern previously matched by
accident. AFAICT, all of these should be using AddToWorklist to directly
or should be written in a less brittle way. None of the changes seem
drastically bad, and a few of the changes seem distinctly better.

A major change required to make this work is to significantly harden the
way in which the DAG combiner handle nodes which become dead
(zero-uses). Previously, we relied on the ability to "priority-bump"
them on the combine worklist to achieve recursive deletion of these
nodes and ensure that the frontier of remaining live nodes all were
added to the worklist. Instead, I've introduced a routine to just
implement that precise logic with no indirection. It is a significantly
simpler operation than that of the combiner worklist proper. I suspect
this will also fix some other problems with the combiner.

I think the x86 changes are really minor and uninteresting, but the
avx512 change at least is hiding a "regression" (despite the test case
being just noise, not testing some performance invariant) that might be
looked into. Not sure if any of the others impact specific "important"
code paths, but they didn't look terribly interesting to me, or the
changes were really minor. The consensus in review is to fix any
regressions that show up after the fact here.

Thanks to the other reviewers for checking the output on other
architectures. There is a specific regression on ARM that Tim already
has a fix prepped to commit.

Differential Revision: http://reviews.llvm.org/D4616

llvm-svn: 213727
2014-07-23 07:08:53 +00:00
Ulrich Weigand 85d5df25de [PowerPC] ELFv2 aggregate passing support
This patch adds infrastructure support for passing array types
directly.  These can be used by the front-end to pass aggregate
types (coerced to an appropriate array type).  The details of the
array type being used inform the back-end about ABI-relevant
properties.  Specifically, the array element type encodes:
- whether the parameter should be passed in FPRs, VRs, or just
  GPRs/stack slots  (for float / vector / integer element types,
  respectively)
- what the alignment requirements of the parameter are when passed in
  GPRs/stack slots  (8 for float / 16 for vector / the element type
  size for integer element types) -- this corresponds to the
  "byval align" field

Using the infrastructure provided by this patch, a companion patch
to clang will enable two features:
- In the ELFv2 ABI, pass (and return) "homogeneous" floating-point
  or vector aggregates in FPRs and VRs (this is similar to the ARM
  homogeneous aggregate ABI)
- As an optimization for both ELFv1 and ELFv2 ABIs, pass aggregates
  that fit fully in registers without using the "byval" mechanism

The patch uses the functionArgumentNeedsConsecutiveRegisters callback
to encode that special treatment is required for all directly-passed
array types.  The isInConsecutiveRegs / isInConsecutiveRegsLast bits set
as a results are then used to implement the required size and alignment
rules in CalculateStackSlotSize / CalculateStackSlotAlignment etc.

As a related change, the ABI routines have to be modified to support
passing floating-point types in GPRs.  This is necessary because with
homogeneous aggregates of 4-byte float type we can now run out of FPRs
*before* we run out of the 64-byte argument save area that is shadowed
by GPRs.  Any extra floating-point arguments that no longer fit in FPRs
must now be passed in GPRs until we run out of those too.

Note that there was already code to pass floating-point arguments in
GPRs used with vararg parameters, which was done by writing the argument
out to the argument save area first and then reloading into GPRs.  The
patch re-implements this, however, in favor of code packing float arguments
directly via extension/truncation, BITCAST, and BUILD_PAIR operations.

This is required to support the ELFv2 ABI, since we cannot unconditionally
write to the argument save area (which the caller might not have allocated).
The change does, however, affect ELFv1 varags routines too; but even here
the overall effect should be advantageous: Instead of loading the argument
into the FPR, then storing the argument to the stack slot, and finally
reloading the argument from the stack slot into a GPR, the new code now
just loads the argument into the FPR, and subsequently loads the argument
into the GPR (via BITCAST).  That BITCAST might imply a save/reload from
a stack temporary (in which case we're no worse than before); but it
might be implemented more efficiently in some cases.

The final part of the patch enables up to 8 FPRs and VRs for argument
return in PPCCallingConv.td; this is required to support returning
ELFv2 homogeneous aggregates.  (Note that this doesn't affect other ABIs
since LLVM wil only look for which register to use if the parameter is
marked as "direct" return anyway.)

Reviewed by Hal Finkel.

llvm-svn: 213493
2014-07-21 00:13:26 +00:00
Ulrich Weigand be928cc276 [PowerPC] ELFv2 explicit CFI for CR fields
This is a minor improvement in the ELFv2 ABI.   In ELFv1, DWARF CFI
would represent a saved CR word (holding CR fields CR2, CR3, and CR4)
using just a single CFI record refering to CR2.   In ELFv2 instead,
each of the CR fields is represented by its own CFI record.  The
advantage is that the compiler can now chose to save just a single
(or two) CR fields instead of all of them, if those are the only ones
that actually need saving.  That can lead to more efficient code using
mf(o)crf instead of the (slow) mfcr instruction.

Note that this patch does not (yet) implement this more efficient
code generation, but it does implement the part that is required to
be ABI compliant: creating multiple CFI records if multiple CR fields
are saved.

Reviewed by Hal Finkel.

llvm-svn: 213492
2014-07-21 00:03:18 +00:00
Ulrich Weigand 8658f17eef [PowerPC] ELFv2 stack space reduction
The ELFv2 ABI reduces the amount of stack required to implement an
ABI-compliant function call in two ways:
* the "linkage area" is reduced from 48 bytes to 32 bytes by
  eliminating two unused doublewords
* the 64-byte "parameter save area" is now optional and need not be
  present in certain cases (it remains mandatory in functions with
  variable arguments, and functions that have any parameter that is
  passed on the stack)

The following patch implements this required changes:
- reducing the linkage area, and associated relocation of the TOC save
  slot, in getLinkageSize / getTOCSaveOffset (this requires updating all
  callers of these routines to pass in the isELFv2ABI flag).
- (partially) handling the case where the parameter save are is optional

This latter part requires some extra explanation:  Currently, we still
always allocate the parameter save area when *calling* a function.
That is certainly always compliant with the ABI, but may cause code to
allocate stack unnecessarily.  This can be addressed by a follow-on
optimization patch.

On the *callee* side, in LowerFormalArguments, we *must* track
correctly whether the ABI guarantees that the caller has allocated
the parameter save area for our use, and the patch does so. However,
there is one complication: the code that handles incoming "byval"
arguments will currently *always* write to the parameter save area,
because it has to force incoming register arguments to the stack since
it must return an *address* to implement the byval semantics.

To fix this, the patch changes the LowerFormalArguments code to write
arguments to a freshly allocated stack slot on the function's own stack
frame instead of the argument save area in those cases where that area
is not present.

Reviewed by Hal Finkel.

llvm-svn: 213490
2014-07-20 23:43:15 +00:00
Ulrich Weigand aa0ac4f11c [PowerPC] ELFv2 function call changes
This patch builds upon the two preceding MC changes to implement the
basic ELFv2 function call convention.  In the ELFv1 ABI, a "function
descriptor" was associated with every function, pointing to both the
entry address and the related TOC base (and a static chain pointer
for nested functions).  Function pointers would actually refer to that
descriptor, and the indirect call sequence needed to load up both entry
address and TOC base.

In the ELFv2 ABI, there are no more function descriptors, and function
pointers simply refer to the (global) entry point of the function code.
Indirect function calls simply branch to that address, after loading it
up into r12 (as required by the ABI rules for a global entry point).
Direct function calls continue to just do a "bl" to the target symbol;
this will be resolved by the linker to the local entry point of the
target function if it is local, and to a PLT stub if it is global.
That PLT stub would then load the (global) entry point address of the
final target into r12 and branch to it.  Note that when performing a
local function call, r2 must be set up to point to the current TOC
base: if the target ends up local, the ABI requires that its local
entry point is called with r2 set up; if the target ends up global,
the PLT stub requires that r2 is set up.

This patch implements all LLVM changes to implement that scheme:
- No longer create a function descriptor when emitting a function
  definition (in EmitFunctionEntryLabel)
- Emit two entry points *if* the function needs the TOC base (r2)
  anywhere (this is done EmitFunctionBodyStart; note that this cannot
  be done in EmitFunctionBodyStart because the global entry point
  prologue code must be *part* of the function as covered by debug info).
- In order to make use tracking of r2 (as needed above) work correctly,
  mark direct function calls as implicitly using r2.
- Implement the ELFv2 indirect function call sequence (no function
  descriptors; load target address into r12).
- When creating an ELFv2 object file, emit the .abiversion 2 directive
  to tell the linker to create the appropriate version of PLT stubs.  

Reviewed by Hal Finkel.

llvm-svn: 213489
2014-07-20 23:31:44 +00:00
Ulrich Weigand 55a96650d9 [PowerPC] Fix FrameIndex handling in SelectAddressRegImm
The PPCTargetLowering::SelectAddressRegImm routine needs to handle
FrameIndex nodes in a special manner, by tranlating them into a
TargetFrameIndex node.  This was done in most cases, but seems to
have been neglected in one path: when the input tree has an OR of
the FrameIndex with an immediate.  This can happen if the FrameIndex
can be proven to be sufficiently aligned that an OR of that immediate
is equivalent to an ADD.

The missing handling of FrameIndex in that case caused the SelectionDAG
instruction selection to miss opportunities to merge the OR back into
the FrameIndex node, leading to superfluous addi/ori instructions in
the final assembler output.

llvm-svn: 213482
2014-07-20 22:26:40 +00:00
Hal Finkel 3ee2af7d1c [PowerPC] 32-bit ELF PIC support
This adds initial support for PPC32 ELF PIC (Position Independent Code; the
-fPIC variety), thus rectifying a long-standing deficiency in the PowerPC
backend.

Patch by Justin Hibbits!

llvm-svn: 213427
2014-07-18 23:29:49 +00:00
Ulrich Weigand ea147a9d43 [PowerPC] Fix invalid displacement created by LocalStackAlloc
This commit fixes a bug in PPCRegisterInfo::isFrameOffsetLegal that
could result in the LocalStackAlloc pass creating an MI instruction
out-of-range displacement:
        %vreg17<def> = LD 33184, %vreg31; mem:LD8[%g](align=32)
        %G8RC:%vreg17 G8RC_and_G8RC_NOX0:%vreg31
(In final assembler output the top bits are stripped off, resulting
in a negative offset loading from below the stack pointer.)

Common code expects the isFrameOffsetLegal routine to verify whether
adding a given offset to the offset already present in the instruction
results in a valid displacement.  However, on PowerPC the routine
did not take the already present instruction offset into account.

This commit fixes isFrameOffsetLegal to add the instruction offset,
and updates a local caller (needsFrameBaseReg) to no longer add the
instruction offset itself before calling isFrameOffsetLegal.

Reviewed by Hal Finkel.

llvm-svn: 212832
2014-07-11 17:19:31 +00:00
Ulrich Weigand 5fd91e0c0b [PowerPC] Fix testcase regression
Use -mcpu to avoid different codegen depending on host platform.

llvm-svn: 212478
2014-07-07 19:41:54 +00:00
Ulrich Weigand ec2bf93895 [PowerPC] Fix "byval align" arguments
Arguments passed as "byval align" should get the specified alignment
in the parameter save area.  There was some code in PPCISelLowering.cpp
that attempted to implement this, but this didn't work correctly:
while code did update the ArgOffset value, it neglected to update
the PtrOff value (which was already computed from the old ArgOffset),
and it also neglected to update GPR_idx -- fields skipped due to
alignment in the save area must likewise be skipped in GPRs.

This patch fixes and simplifies this logic by:
- handling argument offset alignment right at the beginning
  of argument processing, using a new helper routine
  CalculateStackSlotAlignment (this avoids having to update
  PtrOff and other derived values later on)
- not tracking GPR_idx separately, but always computing the
  correct GPR_idx for each argument *from* its ArgOffset
- removing some redundant computation in LowerFormalArguments:
  MinReservedArea must equal ArgOffset after argument processing,
  so there's no use in computing it twice.

[This doesn't change the behavior of the current clang front-end,
since that never creates "byval align" arguments at the moment.
This will change with a follow-on patch, however.]

llvm-svn: 212476
2014-07-07 19:26:41 +00:00
Tim Northover 07f99fb769 llvm-readobj: fix MachO relocatoin printing a bit.
There were two issues here:
1. At the very least, scattered relocations cannot use the same code to
   determine the corresponding symbol being referred to. For some reason we
   pretend there is no symbol, even when one actually exists in the symtab, so to
   match this behaviour getRelocationSymbol should simply return symbols_end for
   scattered relocations.
2. Printing "-" when we can't get a symbol (including the scattered case, but
   not exclusively), isn't that helpful. In both cases there *is* interesting
   information in that field, so we should print it. As hex will do.

Small part of rdar://problem/17553104

llvm-svn: 212332
2014-07-04 10:57:56 +00:00
Ulrich Weigand f236bb1b5b Fix ppcf128 component access on little-endian systems
The PowerPC 128-bit long double data type (ppcf128 in LLVM) is in fact a
pair of two doubles, where one is considered the "high" or
more-significant part, and the other is considered the "low" or
less-significant part.  When a ppcf128 value is stored in memory or a
register pair, the high part always comes first, i.e. at the lower
memory address or in the lower-numbered register, and the low part
always comes second.  This is true both on big-endian and little-endian
PowerPC systems.  (Similar to how with a complex number, the real part
always comes first and the imaginary part second, no matter the byte
order of the system.)

This was implemented incorrectly for little-endian systems in LLVM.
This commit fixes three related issues:

- When printing an immediate ppcf128 constant to assembler output
  in emitGlobalConstantFP, emit the high part first on both big-
  and little-endian systems.

- When lowering a ppcf128 type to a pair of f64 types in SelectionDAG
  (which is used e.g. when generating code to load an argument into a
  register pair), use correct low/high part ordering on little-endian
  systems.

- In a related issue, because lowering ppcf128 into a pair of f64 must
  operate differently from lowering an int128 into a pair of i64,
  bitcasts between ppcf128 and int128 must not be optimized away by the
  DAG combiner on little-endian systems, but must effect a word-swap.

Reviewed by Hal Finkel.

llvm-svn: 212274
2014-07-03 15:06:47 +00:00
Ulrich Weigand 14bd521f4c [PowerPC] Constrain base register in PPCRegisterInfo::resolveFrameIndex
I've run into a bug where current LLVM at -O0 (with fast-isel)
generated invalid code like:

        ld 0, 20936(1)                  # 8-byte Folded Reload
        stw 12, 10348(0)
        stw 12, 10344(0)

The underlying vreg had been introduced as base register by the
Local Stack Slot Allocation pass.  That register was constrained
to G8RC by PPCRegisterInfo::materializeFrameBaseRegister to match
the ADDI instruction used to set it, but it was *not* constrained
to G8RC_NOX0 to fit the *use* of the register in an address.

That should have happened in PPCRegisterInfo::resolveFrameIndex.
This patch adds an appropriate constrainRegClass call.

Reviewed by Hal Finkel.

llvm-svn: 211897
2014-06-27 13:04:12 +00:00
Eli Bendersky 5d5e18da3e Rename loop unrolling and loop vectorizer metadata to have a common prefix.
[LLVM part]

These patches rename the loop unrolling and loop vectorizer metadata
such that they have a common 'llvm.loop.' prefix.  Metadata name
changes:

llvm.vectorizer.* => llvm.loop.vectorizer.*
llvm.loopunroll.* => llvm.loop.unroll.*

This was a suggestion from an earlier review
(http://reviews.llvm.org/D4090) which added the loop unrolling
metadata. 

Patch by Mark Heffernan.

llvm-svn: 211710
2014-06-25 15:41:00 +00:00
Bill Schmidt 83973ef23b [PPC64] Fix PR20071 (fctiduz generated for targets lacking that instruction)
PR20071 identifies a problem in PowerPC's fast-isel implementation for
floating-point conversion to integer.  The fctiduz instruction was added in
Power ISA 2.06 (i.e., Power7 and later).  However, this instruction is being
generated regardless of which 64-bit PowerPC target is selected.

The intent is for fast-isel to punt to DAG selection when this instruction is
not available.  This patch implements that change.  For testing purposes, the
existing fast-isel-conversion.ll test adds a RUN line for -mcpu=970 and tests
for the expected code generation.  Additionally, the existing test
fast-isel-conversion-p5.ll was found to be incorrectly expecting the
unavailable instruction to be generated.  I've removed these test variants
since we have adequate coverage in fast-isel-conversion.ll.

llvm-svn: 211627
2014-06-24 20:05:18 +00:00
Ulrich Weigand f316e1db75 [PowerPC] Allow stack frames without parameter save area
The PPCFrameLowering::determineFrameLayout routine currently ensures
that every function that allocates a stack frame provides space for the
parameter save area (via PPCFrameLowering::getMinCallFrameSize).

This is actually not necessary.  There may be functions that never call
another routine but still allocate a frame; those do not require the
parameter save area.  In the future, with the ELFv2 ABI, even some
routines that do call other functions do not need to allocate the
parameter save area.

While it is not a bug to allocate the parameter area when it is not
needed, it is better to avoid it to save stack space.

Note that when any particular function call requires the parameter save
area, this space will already have been included by ABI code in the size
the CALLSEQ_START insn is annotated with, and therefore included in the
size returned by MFI->getMaxCallFrameSize().

This means that determineFrameLayout simply does not need to care about
the parameter save area.  (It still needs to ensure that every frame
provides the linkage area.)  This is implemented by this patch.

Note that this exposed a bug in the new fast-isel code where the parameter
area was *not* included in the CALLSEQ_START size; this is also fixed.

A couple of test cases needed to be adapted for the new (smaller) stack
frame size those tests now see.

llvm-svn: 211495
2014-06-23 13:47:52 +00:00
Ulrich Weigand 9ba552db89 [PowerPC] Fix on-stack AltiVec arguments with 64-bit SVR4
Current 64-bit SVR4 code seems to have some remnants of Darwin code
in AltiVec argument handing.  This had the effect that AltiVec arguments
(or subsequent arguments) were not correctly placed in the parameter area
in some cases.

The correct behaviour with the 64-bit SVR4 ABI is:
- All AltiVec arguments take up space in the parameter area, just like
  any other arguments, whether vararg or not.
- They are always 16-byte aligned, skipping a parameter area doubleword
  (and the associated GPR, if any), if necessary.

This patch implements the correct behaviour and adds a test case.
(Verified against GCC behaviour via the ABI compat test suite.)

llvm-svn: 211492
2014-06-23 12:36:34 +00:00
Ulrich Weigand 59c6ab20d6 [PowerPC] Fix small argument stack slot offset for LE
When small arguments (structures < 8 bytes or "float") are passed in a
stack slot in the ppc64 SVR4 ABI, they must reside in the least
significant part of that slot.  On BE, this means that an offset needs
to be added to the stack address of the parameter, but on LE, the least
significant part of the slot has the same address as the slot itself.

This changes the PowerPC back-end ABI code to only add the small
argument stack slot offset for BE.  It also adds test cases to verify
the correct behavior on both BE and LE.

llvm-svn: 211368
2014-06-20 16:34:05 +00:00
Alp Toker 1d099d9339 Fix typos
llvm-svn: 211304
2014-06-19 19:41:26 +00:00
Ulrich Weigand ad0cb91ed9 [PowerPC] Simplify and improve loading into TOC register
During an indirect function call sequence on the 64-bit SVR4 ABI,
generate code must load and then restore the TOC register.

This does not use a regular LOAD instruction since the TOC
register r2 is marked as reserved.  Instead, the are two
special instruction patterns:

 let RST = 2, DS = 2 in
 def LDinto_toc: DSForm_1a<58, 0, (outs), (ins g8rc:$reg),
                     "ld 2, 8($reg)", IIC_LdStLD,
                     [(PPCload_toc i64:$reg)]>, isPPC64;
 
 let RST = 2, DS = 10, RA = 1 in
 def LDtoc_restore : DSForm_1a<58, 0, (outs), (ins),
                     "ld 2, 40(1)", IIC_LdStLD,
                     [(PPCtoc_restore)]>, isPPC64;

Note that these not only restrict the destination of the
load to r2, but they also restrict the *source* of the
load to particular address combinations.  The latter is
a problem when we want to support the ELFv2 ABI, since
there the TOC save slot is no longer at 40(1).

This patch replaces those two instructions with a single
instruction pattern that only hard-codes r2 as destination,
but supports generic addresses as source.  This will allow
supporting the ELFv2 ABI, and also helps generate more
efficient code for calls to absolute addresses (allowing
simplification of the ppc64-calls.ll test case).

llvm-svn: 211193
2014-06-18 17:52:49 +00:00
Ulrich Weigand e581920d12 [PowerPC] Add back test case for absolute calls (removed in r211174)
As requested by Hal Finkel, this adds back a test for calls to
a known-constant function pointer value, and verifies that the
64-bit SVR4 indirect function call sequence is used.

llvm-svn: 211190
2014-06-18 17:28:56 +00:00
Ulrich Weigand 9aa09ef30f [PowerPC] Do not use BLA with the 64-bit SVR4 ABI
The PowerPC back-end uses BLA to implement calls to functions at
known-constant addresses, which is apparently used for certain
system routines on Darwin.

However, with the 64-bit SVR4 ABI, this is actually incorrect.
An immediate function pointer value on this platform is not
directly usable as a target address for BLA:
- in the ELFv1 ABI, the function pointer value refers to the
  *function descriptor*, not the code address
- in the ELFv2 ABI, the function pointer value refers to the
  global entry point, but BL(A) would only be correct when
  calling the *local* entry point

This bug didn't show up since using immediate function pointer
values is not usually done in the 64-bit SVR4 ABI in the first
place.  However, I ran into this issue with a certain use case
of LLVM as JIT, where immediate function pointer values were
uses to implement callbacks from JITted code to helpers in
statically compiled code.

Fixed by simply not using BLA with the 64-bit SVR4 ABI.

llvm-svn: 211174
2014-06-18 16:14:04 +00:00
Bill Schmidt 5d82f09b53 [PPC64] Fix PR19893 - improve code generation for local function addresses
Rafael opened http://llvm.org/bugs/show_bug.cgi?id=19893 to track non-optimal
code generation for forming a function address that is local to the compile
unit.  The existing code was treating both local and non-local functions
identically.

This patch fixes the problem by properly identifying local functions and
generating the proper addis/addi code.  I also noticed that Rafael's earlier
changes to correct the surrounding code in PPCISelLowering.cpp were also
needed for fast instruction selection in PPCFastISel.cpp, so this patch
fixes that code as well.

The existing test/CodeGen/PowerPC/func-addr.ll is modified to test the new
code generation.  I've added a -O0 run line to test the fast-isel code as
well.

Tested on powerpc64[le]-unknown-linux-gnu with no regressions.

llvm-svn: 211056
2014-06-16 21:36:02 +00:00
Tim Northover 420a216817 IR: add "cmpxchg weak" variant to support permitted failure.
This commit adds a weak variant of the cmpxchg operation, as described
in C++11. A cmpxchg instruction with this modifier is permitted to
fail to store, even if the comparison indicated it should.

As a result, cmpxchg instructions must return a flag indicating
success in addition to their original iN value loaded. Thus, for
uniformity *all* cmpxchg instructions now return "{ iN, i1 }". The
second flag is 1 when the store succeeded.

At the DAG level, a new ATOMIC_CMP_SWAP_WITH_SUCCESS node has been
added as the natural representation for the new cmpxchg instructions.
It is a strong cmpxchg.

By default this gets Expanded to the existing ATOMIC_CMP_SWAP during
Legalization, so existing backends should see no change in behaviour.
If they wish to deal with the enhanced node instead, they can call
setOperationAction on it. Beware: as a node with 2 results, it cannot
be selected from TableGen.

Currently, no use is made of the extra information provided in this
patch. Test updates are almost entirely adapting the input IR to the
new scheme.

Summary for out of tree users:
------------------------------

+ Legacy Bitcode files are upgraded during read.
+ Legacy assembly IR files will be invalid.
+ Front-ends must adapt to different type for "cmpxchg".
+ Backends should be unaffected by default.

llvm-svn: 210903
2014-06-13 14:24:07 +00:00
Bill Schmidt f910a0650e [PPC64LE] Recognize shufflevector patterns for little endian
Various masks on shufflevector instructions are recognizable as
specific PowerPC instructions (vector pack, vector merge, etc.).
There is existing code in PPCISelLowering.cpp to recognize the correct
patterns for big endian code.  The masks for these instructions are
different for little endian code due to the big-endian numbering
employed by these instructions.  This patch adds the recognition code
for little endian.

I've added a new test case test/CodeGen/PowerPC/vec_shuffle_le.ll for
this.  The existing recognizer test (vec_shuffle.ll) is unnecessarily
verbose and difficult to read, so I felt it was better to add a new
test rather than modify the old one.

llvm-svn: 210536
2014-06-10 14:35:01 +00:00
Alp Toker d3d017cf00 Reduce verbiage of lit.local.cfg files
We can just split targets_to_build in one place and make it immutable.

llvm-svn: 210496
2014-06-09 22:42:55 +00:00
Bill Schmidt 6b5a7dfc24 [PPC64LE] Generate correct code for unaligned little-endian vector loads
The code in PPCTargetLowering::PerformDAGCombine() that handles
unaligned Altivec vector loads generates a lvsl followed by a vperm.
As we've seen in numerous other places, the vperm instruction has a
big-endian bias, and this is fixed for little endian by complementing
the permute control vector and swapping the input operands.  In this
case the lvsl is providing the permute control vector.  Rather than
generating an lvsl and a complement operation, it is sufficient to
generate an lvsr instruction instead.  Thus for LE code generation we
will generate an lvsr rather than an lvsl, and swap the other input
arguments on the vperm.

The existing test/CodeGen/PowerPC/vec_misalign.ll is updated to test
the code generation for PPC64 and PPC64LE, in addition to the existing
PPC32/G5 testing.

llvm-svn: 210493
2014-06-09 22:00:52 +00:00
Bill Schmidt 42995e8c74 [PPC64LE] Generate correct little-endian code for v16i8 multiply
The existing code in PPCTargetLowering::LowerMUL() for multiplying two
v16i8 values assumes that vector elements are numbered in big-endian
order.  For little-endian targets, the vector element numbering is
reversed, but the vmuleub, vmuloub, and vperm instructions still
assume big-endian numbering.  To account for this, we must adjust the
permute control vector and reverse the order of the input registers on
the vperm instruction.

The existing test/CodeGen/PowerPC/vec_mul.ll is updated to be executed
on powerpc64 and powerpc64le targets as well as the original powerpc
(32-bit) target.

llvm-svn: 210474
2014-06-09 16:06:29 +00:00
Bill Schmidt 4aedff8995 [PPC64LE] Fix lowering of BUILD_VECTOR and SHUFFLE_VECTOR for little endian
This patch fixes a couple of lowering issues for little endian
PowerPC.  The code for lowering BUILD_VECTOR contains a number of
optimizations that are only valid for big endian.  For now, we disable
those optimizations for correctness.  In the future, we will add
analogous optimizations that are correct for little endian.

When lowering a SHUFFLE_VECTOR to a VPERM operation, we again need to
make the now-familiar transformation of swapping the input operands
and complementing the permute control vector.  Correctness of this
transformation is tested by the accompanying test case.

llvm-svn: 210336
2014-06-06 14:06:26 +00:00
Bill Schmidt 72dfa16aa9 [PPC64LE] Add test case for r210282 commit
Chandler correctly pointed out that I need an LLVM IR test for
r210282, which modified the vperm -> shuffle transform for little
endian PowerPC.  This patch provides that test.

llvm-svn: 210297
2014-06-05 22:57:38 +00:00
Rafael Espindola 04902862a8 [PPC] Use alias symbols in address computation.
This seems to match what gcc does for ppc and what every other llvm
backend does.

This is a fixed version of r209638. The difference is to avoid any change
in behavior for functions. The logic for using constant pools for function
addresseses is spread over a few places and we have to keep them in sync.

llvm-svn: 209821
2014-05-29 15:41:38 +00:00
Rafael Espindola 1613650e90 Add a test showing the ppc code sequence for getting a function pointer.
This would have found the miscompile in r209638.

llvm-svn: 209820
2014-05-29 15:13:23 +00:00
Hal Finkel f5c07ada1d Revert "[PPC] Use alias symbols in address computation."
This reverts commit r209638 because it broke self-hosting on ppc64/Linux. (the
Clang-compiled TableGen would segfault because it jumped to an invalid address
from within _ZNK4llvm17ManagedStaticBase21RegisterManagedStaticEPFPvvEPFvS1_E
(which is within the command-line parameter registration process)).

llvm-svn: 209745
2014-05-28 15:25:06 +00:00
Bill Schmidt 71dddd51d9 [PATCH] Correct type used for VADD_SPLAT optimization on PowerPC
In PPCISelLowering.cpp: PPCTargetLowering::LowerBUILD_VECTOR(), there
is an optimization for certain patterns to generate one or two vector
splats followed by a vector add or subtract.  This operation is
represented by a VADD_SPLAT in the selection DAG.  Prior to this
patch, it was possible for the VADD_SPLAT to be assigned the wrong
data type, causing incorrect code generation.  This patch corrects the
problem.

Specifically, the code previously assigned the value type of the
BUILD_VECTOR node to the newly generated VADD_SPLAT node.  This is
correct much of the time, but not always.  The problem is that the
call to isConstantSplat() may return a SplatBitSize that is not the
same as the number of bits in the original element vector type.  The
correct type to assign is a vector type with the same element bit size
as SplatBitSize.

The included test case shows an example of this, where the
BUILD_VECTOR node has a type of v16i8.  The vector to be built is {0,
16, 0, 16, 0, 16, 0, 16, 0, 16, 0, 16, 0, 16, 0, 16}.  isConstantSplat
detects that we can generate a splat of 16 for type v8i16, which is
the type we must assign to the VADD_SPLAT node.  If we do not, we
generate a vspltisb of 8 and a vaddubm, which generates the incorrect
result {16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16}.  The correct code generation is a vspltish of 8 and a vadduhm.

This patch also corrected code generation for
CodeGen/PowerPC/2008-07-10-SplatMiscompile.ll, which had been marked
as an XFAIL, so we can remove the XFAIL from the test case.

llvm-svn: 209662
2014-05-27 15:57:51 +00:00
Rafael Espindola ac69cee6a2 [PPC] Use alias symbols in address computation.
This seems to match what gcc does for ppc and what every other llvm
backend does.

llvm-svn: 209638
2014-05-26 19:08:19 +00:00
Adam Nemet 571eb5fc91 [PowerPC] PR19796: Also match ISD::TargetConstant in isIntS16Immediate
The SplitIndexingFromLoad changes exposed a latent isel bug in the PowerPC64
backend.  We matched an immediate offset with STWX8 even though it only
supports register offset.

The culprit is the complex-pattern predicate, SelectAddrIdx, which decides
that if the offset is not ISD::Constant it must be a register.

Many thanks to Bill Schmidt for testing this.

llvm-svn: 209219
2014-05-20 17:20:34 +00:00
David Blaikie 9ba7254688 DebugInfo: Sure up subprogram variable list handling with more assertions and fewer conditionals.
Many old tests using prior schemas still had some brokenness here (both
indirect arrays and arrays with single bogus elements). Fixed those up
so they don't hit the new assertions.

Also reduced nesting in some places, etc.

llvm-svn: 208817
2014-05-14 21:52:46 +00:00
Hal Finkel 0d8db46799 [PowerPC] Add global named register support
Support for the intrinsics that read from and write to global named registers
is added for r1, r2 and r13 (depending on the subtarget).

llvm-svn: 208509
2014-05-11 19:29:11 +00:00
Hal Finkel c4c6c87666 [PowerPC] On PPC32, 128-bit shifts might be runtime calls
The counter-loops formation pass needs to know what operations might be
function calls (because they can't appear in counter-based loops). On PPC32,
128-bit shifts might be runtime calls (even though you can't use __int128 on
PPC32, it seems that SROA might form them).

Fixes PR19709.

llvm-svn: 208501
2014-05-11 16:23:29 +00:00
Hal Finkel d9963c75da [PowerPC] Fix rlwimi isel when mask is not constant
We had been using the known-zero values of the operand of the or to construct
the mask for an rlwimi; this is not quite correct, but fine when the mask is
constant. When the mask is constant, then the known zeros of the operand must
be a superset of the zeros in the mask. However, when the mask is not a
constant, then there might be bits in the operand that are not known to be zero
that, at runtime, might be zero in the mask. Therefore, we check that any bits
not known to be zero *are* known to be one in the mask. Otherwise, we can't
fold the mask with the or and shift.

This was revealed as a miscompile of
MultiSource/Benchmarks/BitBench/drop3/drop3 when I started experimenting with
constant hoisting.

llvm-svn: 206136
2014-04-13 17:10:58 +00:00
Hal Finkel 34974ed503 [PowerPC] Implement some additional TLI callbacks
Add implementations of:
  bool isLegalICmpImmediate(int64_t Imm) const
  bool isLegalAddImmediate(int64_t Imm) const
  bool isTruncateFree(Type *Ty1, Type *Ty2) const
  bool isTruncateFree(EVT VT1, EVT VT2) const
  bool shouldConvertConstantLoadToIntImm(const APInt &Imm, Type *Ty) const

Unfortunately, this regresses counter-register-based loop formation because
some of the loops now end up in forms were SE cannot compute loop counts.
However, nevertheless, the test-suite results favor committing:

SingleSource/Benchmarks/BenchmarkGame/puzzle: 26% speedup
MultiSource/Benchmarks/FreeBench/analyzer/analyzer: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan: 20% speedup
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv: 19% speedup
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv: 15% speedup
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2: 2% speedup

MultiSource/Benchmarks/VersaBench/bmm/bmm: 26% slowdown

llvm-svn: 206120
2014-04-12 21:52:38 +00:00
Hal Finkel 3b48d08f54 Reenable use of TBAA during CodeGen
We had disabled use of TBAA during CodeGen (even when otherwise using AA)
because the ptrtoint/inttoptr used by CGP for address sinking caused BasicAA to
miss basic type punning that it should catch (and, thus, we'd fail to override
TBAA when we should).

However, when AA is in use during CodeGen, CGP now uses normal GEPs and
bitcasts, instead of ptrtoint/inttoptr, when doing address sinking. As a
result, BasicAA should be able to make us do the right thing in the face of
type-punning, and it seems safe to enable use of TBAA again. self-hosting seems
fine on PPC64/Linux on the P7, with TBAA enabled and -misched=shuffle.

Note: We still don't update TBAA when merging stack slots, although because
BasicAA should now catch all such cases, this is no longer a blocking issue.
Nevertheless, I plan to commit code to deal with this properly in the near
future.

llvm-svn: 206093
2014-04-12 01:26:00 +00:00
Hal Finkel c3998306f4 Add the ability to use GEPs for address sinking in CGP
The current memory-instruction optimization logic in CGP, which sinks parts of
the address computation that can be adsorbed by the addressing mode, does this
by explicitly converting the relevant part of the address computation into
IR-level integer operations (making use of ptrtoint and inttoptr). For most
targets this is currently not a problem, but for targets wishing to make use of
IR-level aliasing analysis during CodeGen, the use of ptrtoint/inttoptr is a
problem for two reasons:
  1. BasicAA becomes less powerful in the face of the ptrtoint/inttoptr
  2. In cases where type-punning was used, and BasicAA was used
     to override TBAA, BasicAA may no longer do so. (this had forced us to disable
     all use of TBAA in CodeGen; something which we can now enable again)

This (use of GEPs instead of ptrtoint/inttoptr) is not currently enabled by
default (except for those targets that use AA during CodeGen), and so aside
from some PowerPC subtargets and SystemZ, there should be no change in
behavior. We may be able to switch completely away from the ptrtoint/inttoptr
sinking on all targets, but further testing is required.

I've doubled-up on a number of existing tests that are sensitive to the
address sinking behavior (including some store-merging tests that are
sensitive to the order of the resulting ADD operations at the SDAG level).

llvm-svn: 206092
2014-04-12 00:59:48 +00:00
Hal Finkel fbf7e2a1a1 [PowerPC] Add a full condition code register to make the "cc" clobber work
gcc inline asm supports specifying "cc" as a clobber of all condition
registers. Add just enough modeling of the full register to make this work.
Fixed PR19326.

llvm-svn: 205630
2014-04-04 15:15:57 +00:00
Hal Finkel 9e0baa6d3a [PowerPC] Add some missing VSX bitcast patterns
llvm-svn: 205352
2014-04-01 19:24:27 +00:00
Hal Finkel b4240ca0f4 [PowerPC] Don't ever expand BUILD_VECTOR of v2i64 with shuffles
If we have two unique values for a v2i64 build vector, this will always result
in two vector loads if we expand using shuffles. Only one is necessary.

llvm-svn: 205231
2014-03-31 17:48:16 +00:00
Hal Finkel 02807595fb Look at shuffles of build_vectors in DAGCombiner::visitEXTRACT_VECTOR_ELT
When the loop vectorizer vectorizes code that uses the loop induction variable,
we often end up with IR like this:

  %b1 = insertelement <2 x i32> undef, i32 %v, i32 0
  %b2 = shufflevector <2 x i32> %b1, <2 x i32> undef, <2 x i32> zeroinitializer
  %i = add <2 x i32> %b2, <i32 2, i32 3>

If the add in this example is not legal (as is the case on PPC with VSX), it
will be scalarized, and we'll end up with a number of extract_vector_elt nodes
with the vector shuffle as the input operand, and that vector shuffle is fed by
one or more build_vector nodes. By the time that vector operations are
expanded, visitEXTRACT_VECTOR_ELT will not create new extract_vector_elt by
looking through the vector shuffle (to make sure that no illegal operations are
created), and so the extract_vector_elt -> vector shuffle -> build_vector is
never simplified to an operand of the build vector.

By looking at build_vectors through a shuffle we fix this particular situation,
preventing a vector from being built, only to be deconstructed again (for the
scalarized add) -- an expensive proposition when this all needs to be done via
the stack. We probably want a more comprehensive fix here where we look back
recursively through any shuffles to any build_vectors or scalar_to_vectors,
etc. but that can come later.

llvm-svn: 205179
2014-03-31 11:43:19 +00:00
Hal Finkel 90adf0fe06 Make use of previously generated stores in SelectionDAGLegalize::ExpandExtractFromVectorThroughStack
When expanding EXTRACT_VECTOR_ELT and EXTRACT_SUBVECTOR using
SelectionDAGLegalize::ExpandExtractFromVectorThroughStack, we store the entire
vector and then load the piece we want. This is fine in isolation, but
generating a new store (and corresponding stack slot) for each extraction ends
up producing code of poor quality. When we scalarize a vector operation (using
SelectionDAG::UnrollVectorOp for example) we generate one EXTRACT_VECTOR_ELT
for each element in the vector. This used to generate one stored copy of the
vector for each element in the vector. Now we search the uses of the vector for
a suitable store before generating a new one, which results in much more
efficient scalarization code.

llvm-svn: 205153
2014-03-30 15:10:18 +00:00
Hal Finkel 5c0d1454d6 [PowerPC] Handle VSX v2i64 SIGN_EXTEND_INREG
sitofp from v2i32 to v2f64 ends up generating a SIGN_EXTEND_INREG v2i64 node
(and similarly for v2i16 and v2i8). Even though there are no sign-extension (or
algebraic shifts) for v2i64 types, we can handle v2i32 sign extensions by
converting two and from v2i64. The small trick necessary here is to shift the
i32 elements into the right lanes before the i32 -> f64 step. This is because
of the big Endian nature of the system, we need the i32 portion in the high
word of the i64 elements.

For v2i16 and v2i8 we can do the same, but we first use the default Altivec
shift-based expansion from v2i16 or v2i8 to v2i32 (by casting to v4i32) and
then apply the above procedure.

llvm-svn: 205146
2014-03-30 13:22:59 +00:00
Hal Finkel 777c9dd90a [PowerPC] Handle v2i64 comparisons
v2i64 is a legal type under VSX, however we don't have native vector
comparisons. We can handle eq/ne by casting it to an Altivec type, but
everything else must be expanded.

llvm-svn: 205106
2014-03-29 16:04:40 +00:00
Hal Finkel 19be506a5e [PowerPC] Add subregister classes for f64 VSX values
We had stored both f64 values and v2f64, etc. values in the VSX registers. This
worked, but was suboptimal because we would always spill 16-byte values even
through we almost always had scalar 8-byte values. This resulted in an
increase in stack-size use, extra memory bandwidth, etc. To fix this, I've
added 64-bit subregisters of the Altivec registers, and combined those with the
existing scalar floating-point registers to form a class of VSX scalar
floating-point registers. The ABI code has also been enhanced to use this
register class and some other necessary improvements have been made.

llvm-svn: 205075
2014-03-29 05:29:01 +00:00
Hal Finkel 2583b06310 [PowerPC] Fix VSX permutation isel
Not only did I invert the indices when I wrote the code, but I also did the
same thing when I wrote the regression test. Oops.

llvm-svn: 205046
2014-03-28 20:24:55 +00:00
Hal Finkel 7811c6188e [PowerPC] v2[fi]64 need to be explicitly passed in VSX registers
v2[fi]64 values need to be explicitly passed in VSX registers. This is because
the code in TRI that finds the minimal register class given a register and a
value type will assert if given an Altivec register and a non-Altivec type.

llvm-svn: 205041
2014-03-28 19:58:11 +00:00
Hal Finkel c6fc9b8960 [PowerPC] Use a small cleanup pass to remove VSX self copies
As explained in r204976, because of how the allocation of VSX registers
interacts with the call-lowering code, we sometimes end up generating self VSX
copies. Specifically, things like this:
  %VSL2<def> = COPY %F2, %VSL2<imp-use,kill>
(where %F2 is really a sub-register of %VSL2, and so this copy is a nop)

This adds a small cleanup pass to remove these prior to post-RA scheduling.

llvm-svn: 204980
2014-03-27 23:12:31 +00:00
Hal Finkel 82569b6366 [PowerPC] Fix v2f64 vector extract and related patterns
First, v2f64 vector extract had not been declared legal (and so the existing
patterns were not being used). Second, the patterns for that, and for
scalar_to_vector, should really be a regclass copy, not a subregister
operation, because the VSX registers directly hold both the vector and scalar data.

llvm-svn: 204971
2014-03-27 22:22:48 +00:00
Hal Finkel ad801b7459 [PowerPC] Expand v2i64 shifts
These operations need to be expanded during legalization so that isel does not
crash. In theory, we might be able to custom lower some of these. That,
however, would need to be follow-up work.

llvm-svn: 204963
2014-03-27 21:26:33 +00:00
Hal Finkel df3e34d944 [PowerPC] Generate VSX permutations for v2[fi]64 vectors
llvm-svn: 204873
2014-03-26 22:58:37 +00:00
Hal Finkel 6e28e6aaaf [PowerPC] VSX loads and stores support unaligned access
I've not yet updated PPCTTI because I'm not sure what the actual relative cost
is compared to the aligned uses.

llvm-svn: 204848
2014-03-26 19:39:09 +00:00
Hal Finkel 7279f4b00d [PowerPC] Use v2f64 <-> v2i64 VSX conversion instructions
llvm-svn: 204843
2014-03-26 19:13:54 +00:00
Hal Finkel 9281c9a38b [PowerPC] Use VSX vector load/stores for v2[fi]64
These instructions have access to the complete VSX register file. In addition,
they "swap" the order of the elements so that element 0 (the scalar part) comes
first in memory and element 1 follows at a higher address.

llvm-svn: 204838
2014-03-26 18:26:30 +00:00
Hal Finkel a6c8b51212 [PowerPC] Add v2i64 as a legal VSX type
v2i64 needs to be a legal VSX type because it is the SetCC result type from
v2f64 comparisons. We need to expand all non-arithmetic v2i64 operations.

This fixes the lowering for v2f64 VSELECT.

llvm-svn: 204828
2014-03-26 16:12:58 +00:00
Hal Finkel 732f0f73a7 [PowerPC] Lower VSELECT using xxsel when VSX is available
With VSX there is a real vector select instruction, and so we should use it.
Note that VSELECT will still scalarize for v2f64 because the corresponding
SetCC result type (v2i64) is not currently a legal type.

llvm-svn: 204801
2014-03-26 12:49:28 +00:00
Hal Finkel bd4de9d478 [PowerPC] Generate logical vector VSX instructions
These instructions are essentially the same as their Altivec counterparts, but
have access to the larger VSX register file.

llvm-svn: 204782
2014-03-26 04:55:40 +00:00
Hal Finkel 174e590966 [PowerPC] Select between VSX A-type and M-type FMA instructions just before RA
The VSX instruction set has two types of FMA instructions: A-type (where the
addend is taken from the output register) and M-type (where one of the product
operands is taken from the output register). This adds a small pass that runs
just after MI scheduling (and, thus, just before register allocation) that
mutates A-type instructions (that are created during isel) into M-type
instructions when:

 1. This will eliminate an otherwise-necessary copy of the addend

 2. One of the product operands is killed by the instruction

The "right" moment to make this decision is in between scheduling and register
allocation, because only there do we know whether or not one of the product
operands is killed by any particular instruction. Unfortunately, this also
makes the implementation somewhat complicated, because the MIs are not in SSA
form and we need to preserve the LiveIntervals analysis.

As a simple example, if we have:

%vreg5<def> = COPY %vreg9; VSLRC:%vreg5,%vreg9
%vreg5<def,tied1> = XSMADDADP %vreg5<tied0>, %vreg17, %vreg16,
                        %RM<imp-use>; VSLRC:%vreg5,%vreg17,%vreg16
  ...
  %vreg9<def,tied1> = XSMADDADP %vreg9<tied0>, %vreg17, %vreg19,
                        %RM<imp-use>; VSLRC:%vreg9,%vreg17,%vreg19
  ...

We can eliminate the copy by changing from the A-type to the
M-type instruction. This means:

  %vreg5<def,tied1> = XSMADDADP %vreg5<tied0>, %vreg17, %vreg16,
                        %RM<imp-use>; VSLRC:%vreg5,%vreg17,%vreg16

is replaced by:

  %vreg16<def,tied1> = XSMADDMDP %vreg16<tied0>, %vreg18, %vreg9,
                        %RM<imp-use>; VSLRC:%vreg16,%vreg18,%vreg9

and we remove: %vreg5<def> = COPY %vreg9; VSLRC:%vreg5,%vreg9

llvm-svn: 204768
2014-03-25 23:29:21 +00:00
Hal Finkel 4a912250fa [PowerPC] Make use of VSX f64 <-> i64 conversion instructions
When VSX is available, these instructions should be used in preference to the
older variants that only have access to the scalar floating-point registers.

llvm-svn: 204559
2014-03-23 05:35:00 +00:00
Hal Finkel 55805eb562 [PowerPC] Fix the VSX v2f64 return register
v2f64 values, like other 128-bit values, are returned under VSX in register
vs34 (Altivec register v2).

llvm-svn: 204543
2014-03-22 18:24:43 +00:00
Rafael Espindola d2bd8def3f Remove redundant test.
This is tested from MC already.

llvm-svn: 204491
2014-03-21 18:00:51 +00:00
Bill Schmidt ff9622ef0e Fix PR19144: Incorrect offset generated for int-to-fp conversion at -O0.
When converting a signed 32-bit integer to double-precision floating point on
hardware without a lfiwax instruction, we have to instead use a lfd followed
by fcfid.  We were erroneously offsetting the address by 4 bytes in
preparation for either a lfiwax or lfiwzx when generating the lfd.  This fixes
that silly error.

This was not caught in the test suite since the conversion tests were run with
-mcpu=pwr7, which implies availability of lfiwax.  I've added another test
case for older hardware that checks the code we expect in the absence of
lfiwax and other flavors of fcfid.  There are fewer tests in this test case
because we punt to DAG selection in more cases on older hardware.  (We must
generate complex fiddly sequences in those cases, and there is marginal
benefit in duplicating that logic in fast-isel.)

llvm-svn: 204155
2014-03-18 14:32:50 +00:00
Ulrich Weigand f445399870 [ppc64] Avoid copy relocs in named rodata sections
Commit r181723 introduced code to avoid placing initialized variables
needing relocations into the .rodata section, which avoid copy relocs
that do not work as expected on ppc64 function references.

The same treatment is also needed for *named* .rodata.XXX sections.
This patch changes PPC64LinuxTargetObjectFile::SelectSectionForGlobal
to modify "Kind" *before* calling the default SelectSectionForGlobal
routine, instead of first calling the default routine and then just
checking for the (main) .rodata section afterwards.

llvm-svn: 203921
2014-03-14 12:45:22 +00:00
Rafael Espindola 2fb5bc33a3 Remove the linker_private and linker_private_weak linkages.
These linkages were introduced some time ago, but it was never very
clear what exactly their semantics were or what they should be used
for. Some investigation found these uses:

* utf-16 strings in clang.
* non-unnamed_addr strings produced by the sanitizers.

It turns out they were just working around a more fundamental problem.
For some sections a MachO linker needs a symbol in order to split the
section into atoms, and llvm had no idea that was the case. I fixed
that in r201700 and it is now safe to use the private linkage. When
the object ends up in a section that requires symbols, llvm will use a
'l' prefix instead of a 'L' prefix and things just work.

With that, these linkages were already dead, but there was a potential
future user in the objc metadata information. I am still looking at
CGObjcMac.cpp, but at this point I am convinced that linker_private
and linker_private_weak are not what they need.

The objc uses are currently split in

* Regular symbols (no '\01' prefix). LLVM already directly provides
whatever semantics they need.
* Uses of a private name (start with "\01L" or "\01l") and private
linkage. We can drop the "\01L" and "\01l" prefixes as soon as llvm
agrees with clang on L being ok or not for a given section. I have two
patches in code review for this.
* Uses of private name and weak linkage.

The last case is the one that one could think would fit one of these
linkages. That is not the case. The semantics are

* the linker will merge these symbol by *name*.
* the linker will hide them in the final DSO.

Given that the merging is done by name, any of the private (or
internal) linkages would be a bad match. They allow llvm to rename the
symbols, and that is really not what we want. From the llvm point of
view, these objects should really be (linkonce|weak)(_odr)?.

For now, just keeping the "\01l" prefix is probably the best for these
symbols. If we one day want to have a more direct support in llvm,
IMHO what we should add is not a linkage, it is just a hidden_symbol
attribute. It would be applicable to multiple linkages. For example,
on weak it would produce the current behavior we have for objc
metadata. On internal, it would be equivalent to private (and we
should then remove private).

llvm-svn: 203866
2014-03-13 23:18:37 +00:00
Hal Finkel 27774d9274 [PowerPC] Initial support for the VSX instruction set
VSX is an ISA extension supported on the POWER7 and later cores that enhances
floating-point vector and scalar capabilities. Among other things, this adds
<2 x double> support and generally helps to reduce register pressure.

The interesting part of this ISA feature is the register configuration: there
are 64 new 128-bit vector registers, the 32 of which are super-registers of the
existing 32 scalar floating-point registers, and the second 32 of which overlap
with the 32 Altivec vector registers. This makes things like vector insertion
and extraction tricky: this can be free but only if we force a restriction to
the right register subclass when needed. A new "minipass" PPCVSXCopy takes care
of this (although it could do a more-optimal job of it; see the comment about
unnecessary copies below).

Please note that, currently, VSX is not enabled by default when targeting
anything because it is not yet ready for that.  The assembler and disassembler
are fully implemented and tested. However:

 - CodeGen support causes miscompiles; test-suite runtime failures:
      MultiSource/Benchmarks/FreeBench/distray/distray
      MultiSource/Benchmarks/McCat/08-main/main
      MultiSource/Benchmarks/Olden/voronoi/voronoi
      MultiSource/Benchmarks/mafft/pairlocalalign
      MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
      SingleSource/Benchmarks/CoyoteBench/almabench
      SingleSource/Benchmarks/Misc/matmul_f64_4x4

 - The lowering currently falls back to using Altivec instructions far more
   than it should. Worse, there are some things that are scalarized through the
   stack that shouldn't be.

 - A lot of unnecessary copies make it past the optimizers, and this needs to
   be fixed.

 - Many more regression tests are needed.

Normally, I'd fix these things prior to committing, but there are some
students and other contributors who would like to work this, and so it makes
sense to move this development process upstream where it can be subject to the
regular code-review procedures.

llvm-svn: 203768
2014-03-13 07:58:58 +00:00
Tim Northover e94a518a22 IR: add a second ordering operand to cmpxhg for failure
The syntax for "cmpxchg" should now look something like:

	cmpxchg i32* %addr, i32 42, i32 3 acquire monotonic

where the second ordering argument gives the required semantics in the case
that no exchange takes place. It should be no stronger than the first ordering
constraint and cannot be either "release" or "acq_rel" (since no store will
have taken place).

rdar://problem/15996804

llvm-svn: 203559
2014-03-11 10:48:52 +00:00
Hal Finkel 7f908e8ef4 Fixup PPC Darwin i1 argument handling
Like on other targets, we need to zero_extend/truncate i1 args before copying
them to GPRs.

llvm-svn: 203045
2014-03-06 00:45:19 +00:00
Hal Finkel 2a9d318e4a When using CR bit registers on PPC32, handle the i1 vaarg case
When copying an i1 value into a GPR for a vaarg call, we need to explicitly
zero-extend the i1 value (otherwise an invalid CRBIT -> GPR copy will be
generated).

llvm-svn: 203041
2014-03-06 00:23:33 +00:00
Hal Finkel 6a56b21729 With PPC CR bit registers, handle int_to_fp on older cores
On cores without fpcvt support, we cannot promote int_to_fp i1 operations,
because there is nothing to promote them to. The most straightforward
implementation of this uses a select to choose between the two possible
resulting floating-point values (and that's what is done here).

llvm-svn: 203015
2014-03-05 22:14:00 +00:00
Hal Finkel 6aca2373f2 Add a PPC inline asm constraint type for single CR bits
Now that the PowerPC backend can track individual CR bits as first-class
registers, we should also have a way of allocating them for inline asm
statements. Because these registers are only one bit, if an output variable is
implicitly cast to a larger integer size, we'll get an any_extend to that
larger type (this is part of the existing target-independent logic). As a
result, regardless of the size of the output type, only the first bit is
meaningful.

The constraint identifier "wc" has been chosen for this purpose. Although gcc
does not currently support allocating individual CR bits, this identifier
choice has been coordinated with the gcc PowerPC team, and will be marked as
reserved for this purpose in the gcc constraints.md file.

llvm-svn: 202657
2014-03-02 18:23:39 +00:00
Hal Finkel 46043edc56 Remove extra truncs/exts around i32 bit operations on PPC64
This generalizes the code to eliminate extra truncs/exts around i1 bit
operations to also do the same on PPC64 for i32 bit operations. This eliminates
a fairly prevalent code wart:

int foo(int a) {
  return a == 5 ? 7 : 8;
}

On PPC64, because of the extension implied by the ABI, this would generate:

	cmplwi 0, 3, 5
	li 12, 8
	li 4, 7
	isel 3, 4, 12, 2
	rldicl 3, 3, 0, 32
	blr

where the 'rldicl 3, 3, 0, 32', the extension, is completely unnecessary. At
least for the single-BB case (which is all that the DAG combine mechanism can
handle), this unnecessary extension is no longer generated.

llvm-svn: 202600
2014-03-01 21:36:57 +00:00
Hal Finkel b998915ee1 Swap PPC isel operands to allow for 0-folding
The PPC isel instruction can fold 0 into the first operand (thus eliminating
the need to materialize a zero-containing register when the 'true' result of
the isel is 0). When the isel is fed by a bit register operation that we can
invert, do so as part of the bit-register-operation peephole routine.

llvm-svn: 202469
2014-02-28 06:11:16 +00:00
Hal Finkel 940ab934d4 Add CR-bit tracking to the PowerPC backend for i1 values
This change enables tracking i1 values in the PowerPC backend using the
condition register bits. These bits can be treated on PowerPC as separate
registers; individual bit operations (and, or, xor, etc.) are supported.
Tracking booleans in CR bits has several advantages:

 - Reduction in register pressure (because we no longer need GPRs to store
   boolean values).

 - Logical operations on booleans can be handled more efficiently; we used to
   have to move all results from comparisons into GPRs, perform promoted
   logical operations in GPRs, and then move the result back into condition
   register bits to be used by conditional branches. This can be very
   inefficient, because the throughput of these CR <-> GPR moves have high
   latency and low throughput (especially when other associated instructions
   are accounted for).

 - On the POWER7 and similar cores, we can increase total throughput by using
   the CR bits. CR bit operations have a dedicated functional unit.

Most of this is more-or-less mechanical: Adjustments were needed in the
calling-convention code, support was added for spilling/restoring individual
condition-register bits, and conditional branch instruction definitions taking
specific CR bits were added (plus patterns and code for generating bit-level
operations).

This is enabled by default when running at -O2 and higher. For -O0 and -O1,
where the ability to debug is more important, this feature is disabled by
default. Individual CR bits do not have assigned DWARF register numbers,
and storing values in CR bits makes them invisible to the debugger.

It is critical, however, that we don't move i1 values that have been promoted
to larger values (such as those passed as function arguments) into bit
registers only to quickly turn around and move the values back into GPRs (such
as happens when values are returned by functions). A pair of target-specific
DAG combines are added to remove the trunc/extends in:
  trunc(binary-ops(binary-ops(zext(x), zext(y)), ...)
and:
  zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)
In short, we only want to use CR bits where some of the i1 values come from
comparisons or are used by conditional branches or selects. To put it another
way, if we can do the entire i1 computation in GPRs, then we probably should
(on the POWER7, the GPR-operation throughput is higher, and for all cores, the
CR <-> GPR moves are expensive).

POWER7 test-suite performance results (from 10 runs in each configuration):

SingleSource/Benchmarks/Misc/mandel-2: 35% speedup
MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup
SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup
SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup
SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup

SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown
MultiSource/Applications/lemon/lemon: 8% slowdown

llvm-svn: 202451
2014-02-28 00:27:01 +00:00
Hal Finkel 22304046c7 Account for 128-bit integer operations in PPCCTRLoops
We need to abort the formation of counter-register-based loops where there are
128-bit integer operations that might become function calls.

llvm-svn: 202192
2014-02-25 20:51:50 +00:00
Rafael Espindola daeafb4c2a Add back r201608, r201622, r201624 and r201625
r201608 made llvm corretly handle private globals with MachO. r201622 fixed
a bug in it and r201624 and r201625 were changes for using private linkage,
assuming that llvm would do the right thing.

They all got reverted because r201608 introduced a crash in LTO. This patch
includes a fix for that. The issue was that TargetLoweringObjectFile now has
to be initialized before we can mangle names of private globals. This is
trivially true during the normal codegen pipeline (the asm printer does it),
but LTO has to do it manually.

llvm-svn: 201700
2014-02-19 17:23:20 +00:00
Daniel Jasper 7e198ad862 Revert r201622 and r201608.
This causes the LLVMgold plugin to segfault. More information on the
replies to r201608.

llvm-svn: 201669
2014-02-19 12:26:01 +00:00
Rafael Espindola 09dcc6a536 Fix PR18743.
The IR
@foo = private constant i32 42

is valid, but before this patch we would produce an invalid MachO from it. It
was invalid because it would use an L label in a section where the liker needs
the labels in order to atomize it.

One way of fixing it would be to just reject this IR in the backend, but that
would not be very front end friendly.

What this patch does is use an 'l' prefix in sections that we know the linker
requires symbols for atomizing them. This allows frontends to just use
private and not worry about which sections they go to or how the linker handles
them.

One small issue with this strategy is that now a symbol name depends on the
section, which is not available before codegen. This is not a problem in
practice. The reason is that it only happens with private linkage, which will
be ignored by the non codegen users (llvm-nm and llvm-ar).

llvm-svn: 201608
2014-02-18 22:24:57 +00:00
Nico Rieck 35a237d4ed Actually call FileCheck in tests
llvm-svn: 201491
2014-02-16 13:27:39 +00:00
Rafael Espindola e6f37b908a "foo" is not a ppc instruction, don't try to parse it.
llvm-svn: 201336
2014-02-13 15:33:35 +00:00
Daniel Sanders 753e17629d Re-commit: Demote EmitRawText call in AsmPrinter::EmitInlineAsm() and remove hasRawTextSupport() call
Summary:
AsmPrinter::EmitInlineAsm() will no longer use the EmitRawText() call for
targets with mature MC support. Such targets will always parse the inline
assembly (even when emitting assembly). Targets without mature MC support
continue to use EmitRawText() for assembly output.

The hasRawTextSupport() check in AsmPrinter::EmitInlineAsm() has been replaced
with MCAsmInfo::UseIntegratedAs which when true, causes the integrated assembler
to parse inline assembly (even when emitting assembly output). UseIntegratedAs
is set to true for targets that consider any failure to parse valid assembly
to be a bug. Target specific subclasses generally enable the integrated
assembler in their constructor. The default value can be overridden with
-no-integrated-as.

All tests that rely on inline assembly supporting invalid assembly (for example,
those that use mnemonics such as 'foo' or 'hello world') have been updated to
disable the integrated assembler.

Changes since review (and last commit attempt):
- Fixed test failures that were missed due to configuration of local build.
  (fixes crash.ll and a couple others).
- Fixed tests that happened to pass because the local build was on X86
  (should fix 2007-12-17-InvokeAsm.ll)
- mature-mc-support.ll's should no longer require all targets to be compiled.
  (should fix ARM and PPC buildbots)
- Object output (-filetype=obj and similar) now forces the integrated assembler
  to be enabled regardless of default setting or -no-integrated-as.
  (should fix SystemZ buildbots)

Reviewers: rafael

Reviewed By: rafael

CC: llvm-commits

Differential Revision: http://llvm-reviews.chandlerc.com/D2686

llvm-svn: 201333
2014-02-13 14:44:26 +00:00
Daniel Sanders abe212a3b8 Revert r201237+r201238: Demote EmitRawText call in AsmPrinter::EmitInlineAsm() and remove hasRawTextSupport() call
It introduced multiple test failures in the buildbots.

llvm-svn: 201241
2014-02-12 15:39:20 +00:00
Daniel Sanders a7d504cf58 Demote EmitRawText call in AsmPrinter::EmitInlineAsm() and remove hasRawTextSupport() call
Summary:
AsmPrinter::EmitInlineAsm() will no longer use the EmitRawText() call for targets with mature MC support. Such targets will always parse the inline assembly (even when emitting assembly). Targets without mature MC support continue to use EmitRawText() for assembly output.

The hasRawTextSupport() check in AsmPrinter::EmitInlineAsm() has been replaced with MCAsmInfo::UseIntegratedAs which when true, causes the integrated assembler to parse inline assembly (even when emitting assembly output). UseIntegratedAs is set to true for targets that consider any failure to parse valid assembly to be a bug. Target specific subclasses generally enable the integrated assembler in their constructor. The default value can be overridden with -no-integrated-as.

All tests that rely on inline assembly supporting invalid assembly (for example, those that use mnemonics such as 'foo' or 'hello world') have been updated to disable the integrated assembler.

Reviewers: rafael

Reviewed By: rafael

CC: llvm-commits

Differential Revision: http://llvm-reviews.chandlerc.com/D2686

llvm-svn: 201237
2014-02-12 14:44:54 +00:00
Rafael Espindola 61acf5d9b0 Fix a bug with .weak_def_can_be_hidden: Mutable variables cannot use it.
Thanks to John McCall for noticing it.

llvm-svn: 200977
2014-02-07 16:21:30 +00:00
Rafael Espindola 803fb10874 Convert test to FileCheck.
llvm-svn: 200955
2014-02-06 23:35:22 +00:00
David Blaikie 5e390e4df7 DebugInfo: Remove some unneeded conditionals now that DIBuilder no longer emits zero-length arrays as {i32 0}
A bunch of test cases needed to be cleaned up for this, many my fault -
when implementid imported modules I updated test cases by simply
duplicating the prior metadata field - which wasn't always the empty
metadata entry.

llvm-svn: 200731
2014-02-04 01:23:52 +00:00
Hal Finkel 4e703bcecd Handle spilling the PPC GPRC_NOR0 register class
GPRC_NOR0 is not a subclass of GPRC (because it also contains the ZERO pseudo
register). As a result, we also need to check for it in the spilling code.

llvm-svn: 200288
2014-01-28 05:32:58 +00:00
Hal Finkel 5eb2466243 Add a TBAA CodeGen failure test case
I disabled the use of TBAA in CodeGen in r200093. This adds a test case that
demonstrates the problems with inttoptr and TBAA in CodeGen (and, specifically,
the problem that causes LLVM to miscompile itself in Release mode). This test
will currently fail if -use-tbaa-in-sched-mi is enabled.

llvm-svn: 200097
2014-01-25 20:16:36 +00:00
Hal Finkel 3e4a34c8c3 Fix pointer info on PPC byval stores
For PPC64 SVR (and Darwin), the stores that take byval aggregate parameters
from registers into the stack frame had MachinePointerInfo objects with
incorrect offsets. These offsets are relative to the object itself, not to the
stack frame base.

This fixes self hosting on PPC64 when compiling with -enable-aa-sched-mi.

llvm-svn: 199763
2014-01-21 20:15:58 +00:00
Roman Divacky 32143e2bda Implement initial-exec TLS for PPC32.
llvm-svn: 197824
2013-12-20 18:08:54 +00:00
Rafael Espindola 382ee385fd One ppc32-darwin, a i64 inside a structure can have 32 bit alignment.
Thanks for Iain Sandoe for testing this with the original gcc.

Clang was already getting this right.

llvm-svn: 197572
2013-12-18 14:35:37 +00:00
Rafael Espindola c56960871f Add a reduced testcase from the recent bootstrap crash.
llvm-svn: 197426
2013-12-16 21:24:00 +00:00
Iain Sandoe e0b4cb62f5 [Powerpc darwin] AsmParser Base implementation.
This is a base implementation of the powerpc-apple-darwin asm parser dialect.

* Enables infrastructure (essentially isDarwin()) and fixes up the parsing of asm directives to separate out ELF and MachO/Darwin additions.
* Enables parsing of {r,f,v}XX as register identifiers.
* Enables parsing of lo16() hi16() and ha16() as modifiers.

The changes to the test case are from David Fang (fangism).

llvm-svn: 197324
2013-12-14 13:34:02 +00:00
Tim Northover 444eba2d4d PowerPC: add Linux triple to TLS tests
The tests were failing on OS X.

llvm-svn: 197146
2013-12-12 11:51:23 +00:00
Hal Finkel ceb1f12d9a Improve instruction scheduling for the PPC POWER7
Aside from a few minor latency corrections, the major change here is a new
hazard recognizer which focuses on better dispatch-group formation on the
POWER7. As with the PPC970's hazard recognizer, the most important thing it
does is avoid load-after-store hazards within the same dispatch group. It uses
the POWER7's special dispatch-group-terminating nop instruction (instead of
inserting multiple regular nop instructions). This new hazard recognizer makes
use of the scheduling dependency graph itself, built using AA information, to
robustly detect the possibility of load-after-store hazards.

significant test-suite performance changes (the error bars are 99.5% confidence
intervals based on 5 test-suite runs both with and without the change --
speedups are negative):

speedups:

MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2
	-0.55171% +/- 0.333168%

MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl
	-17.5576% +/- 14.598%

MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl
	-29.5708% +/- 7.09058%

MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt
	-34.9471% +/- 11.4391%

SingleSource/Benchmarks/BenchmarkGame/puzzle
	-25.1347% +/- 11.0104%

SingleSource/Benchmarks/Misc/flops-8
	-17.7297% +/- 9.79061%

SingleSource/Benchmarks/Shootout-C++/ary3
	-35.5018% +/- 23.9458%

SingleSource/Regression/C/uint64_to_float
	-56.3165% +/- 25.4234%

SingleSource/UnitTests/Vectorizer/gcc-loops
	-18.5309% +/- 6.8496%

regressions:

MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000
	18.351% +/- 12.156%

SingleSource/Benchmarks/Shootout-C++/methcall
	27.3086% +/- 14.4733%

llvm-svn: 197099
2013-12-12 00:19:11 +00:00
Hal Finkel 94a6f380bb Fix the PPC subsumes-predicate check
For one predicate to subsume another, they must both check the same condition
register. Failure to check this prerequisite was causing miscompiles.

Fixes PR18003.

llvm-svn: 197089
2013-12-11 23:12:25 +00:00
Roman Divacky 1bab70580d Merge all tls tests to two files. One for normal codegen (initial and local
exec) and one for PIC codegen (local and general dynamic).

llvm-svn: 197081
2013-12-11 22:25:39 +00:00
Roman Divacky e439f92a6c Remove test thats testing the same thing as tls.ll.
llvm-svn: 197074
2013-12-11 21:37:04 +00:00
David Fang 1b01849f2d on darwin<10, fallback to .weak_definition (PPC,X86)
.weak_def_can_be_hidden was not yet supported by the system assembler

llvm-svn: 196970
2013-12-10 21:37:41 +00:00
Alp Toker f907b891da Correct word hyphenations
This patch tries to avoid unrelated changes other than fixing a few
hyphen-related ambiguities and contractions in nearby lines.

llvm-svn: 196471
2013-12-05 05:44:44 +00:00
Hal Finkel ca93e47258 Convert a PPC test from grep to FileCheck
Convert this test to FileCheck, and improve it to check for the instructions it
is trying to exclude instead of checking for register use (especially because
grepping for r1 can be thrown off, for example, by a use of r12).

llvm-svn: 195979
2013-11-30 20:04:33 +00:00