This patch allows the read_register and write_register intrinsics to
read/write the RBP/EBP registers on X86 iff the targeted register is
the frame pointer for the containing function.
Differential Revision: http://reviews.llvm.org/D10977
llvm-svn: 241827
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
Reviewers: echristo
Subscribers: jholewinski, llvm-commits, rafael, yaron.keren
Differential Revision: http://reviews.llvm.org/D11040
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 241778
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
Reviewers: echristo
Subscribers: yaron.keren, rafael, llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D11038
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 241777
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
Reviewers: echristo
Subscribers: jholewinski, llvm-commits, rafael, yaron.keren
Differential Revision: http://reviews.llvm.org/D11037
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 241776
Summary:
This change is part of a series of commits dedicated to have a single
DataLayout during compilation by using always the one owned by the
module.
Reviewers: echristo
Subscribers: jholewinski, ted, yaron.keren, rafael, llvm-commits
Differential Revision: http://reviews.llvm.org/D11028
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 241775
There is some functional change here because it changes target code from
atoi(3) to StringRef::getAsInteger which has error checking. For valid
constraints there should be no difference.
llvm-svn: 241411
This patch adds support for the vector merge even word and vector merge odd word
instructions introduced in POWER8.
Phabricator review: http://reviews.llvm.org/D10704
llvm-svn: 240650
The patch is generated using this command:
tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py -fix \
-checks=-*,llvm-namespace-comment -header-filter='llvm/.*|clang/.*' \
llvm/lib/
Thanks to Eugene Kosov for the original patch!
llvm-svn: 240137
This is important because of different addressing modes
depending on the address space for GPU targets.
This only adds the argument, and does not update
any of the uses to provide the correct address space.
llvm-svn: 238723
This patch adds support for the ISA 2.07 additions involving the
branch history rolling buffer and event-based branching. These will
not be used by typical applications, so built-in support is not
required. They will only be available via inline assembly.
Assembly/disassembly tests are included in the patch.
llvm-svn: 238032
This patch adds support for the following new instructions in the
Power ISA 2.07:
vpksdss
vpksdus
vpkudus
vpkudum
vupkhsw
vupklsw
These instructions are available through the vec_packs, vec_packsu,
vec_unpackh, and vec_unpackl built-in interfaces. These are
lane-sensitive instructions, so the built-ins have different
implementations for big- and little-endian, and the instructions must
be marked as killing the vector swap optimization for now.
The first three instructions perform saturating pack operations. The
fourth performs a modulo pack operation, which means it can be
represented with a vector shuffle, and conversely the appropriate
vector shuffles may cause this instruction to be generated. The other
instructions are only generated via built-in support for now.
Appropriate tests have been added.
There is a companion patch to clang for the rest of this support.
llvm-svn: 237499
This patch corresponds to review:
http://reviews.llvm.org/D8928
It adds direct move instructions to/from VSX registers to GPR's. These are
exploited for FP <-> INT conversions.
llvm-svn: 234682
When enabling PPC64LE, I disabled some optimizations of BUILD_VECTOR
nodes for little endian because wrong results were produced. I've
subsequently investigated and found this is due to a call to
BuildVectorSDNode::isConstantSplat that was always specifying
big-endian. With this changed to correctly identify the target
endianness, the optimizations work as expected.
I found another case of a call to the same method with big-endian
hardcoded, in PPC::isAllNegativeZeroVector(). I discovered this was
an orphaned method with no callers, so I've just removed it.
The existing test/CodeGen/PowerPC/vec_constants.ll checks these
optimizations, so for testing I've just added a variant for little
endian.
llvm-svn: 234011
Summary:
But still handle them the same way since I don't know how they differ on
this target.
Of these, 'es', and 'Q' do not have backend tests but are accepted by
clang.
No functional change intended. Depends on D8173.
Reviewers: hfinkel
Reviewed By: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8213
llvm-svn: 232466
Summary:
This is instead of doing this in target independent code and is the last
non-functional change before targets begin to distinguish between
different memory constraints when selecting code for the ISD::INLINEASM
node.
Next, each target will individually move away from the idea that all
memory constraints behave like 'm'.
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D8173
llvm-svn: 232373
a lookup, pass that in rather than use a naked call to getSubtargetImpl.
This involved passing down and around either a TargetMachine or
TargetRegisterInfo. Update all callers/definitions around the targets
and SelectionDAG.
llvm-svn: 230699
LDtocL, and other loads that roughly correspond to the TOC_ENTRY SDAG node,
represent loads from the TOC, which is invariant. As a result, these loads can
be hoisted out of loops, etc. In order to do this, we need to generate
GOT-style MMOs for TOC_ENTRY, which requires treating it as a legitimate memory
intrinsic node type. Once this is done, the MMO transfer is automatically
handled for TableGen-driven instruction selection, and for nodes generated
directly in PPCISelDAGToDAG, we need to transfer the MMOs manually.
Also, we were not transferring MMOs associated with pre-increment loads, so do
that too.
Lastly, this fixes an exposed bug where R30 was not added as a defined operand of
UpdateGBR.
This problem was highlighted by an example (used to generate the test case)
posted to llvmdev by Francois Pichet.
llvm-svn: 230553
We had somehow accumulated a few target-specific SDAG nodes dealing with PPC64
TOC access that were referenced only in TableGen patterns. The associated
(pseudo-)instructions are used, but are being generated directly. NFC.
llvm-svn: 230518
This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bytes
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1> are actually also represented as floating-point values
(essentially { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirrors that supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).
I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).
The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.
llvm-svn: 230413
See full discussion in http://reviews.llvm.org/D7491.
We now hide the add-immediate and call instructions together in a
separate pseudo-op, which is tagged to define GPR3 and clobber the
call-killed registers. The PPCTLSDynamicCall pass prior to RA now
expands this op into the two separate addi and call ops, with explicit
definitions of GPR3 on both instructions, and explicit clobbers on the
call instruction. The pass is now marked as requiring and preserving
the LiveIntervals and SlotIndexes analyses, and fixes these up after
the replacement sequences are introduced.
Self-hosting has been verified on LE P8 and BE P7 with various
optimization levels, etc. It has also been verified with the
--no-tls-optimize flag workaround removed.
llvm-svn: 228725
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.
llvm-svn: 228460
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.
In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations. This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3. This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.
In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering. This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.
However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy. The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity. It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves. MachineCSE
does not have any machinery to common idempotent calls. This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace. In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.
I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes. However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.
All of which leads to the design presented here. This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering. Virtual registers
are used until prior to register allocation. At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them. The register allocator then coalesces these copies away. With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).
One additional problem must be dealt with: After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead. Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.
One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes. Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address. This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.
Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not. I've fixed the test case to reflect this.
There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.
I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.
llvm-svn: 227976
Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the
POWER7, A2 and earlier cores) are really pointers to a function descriptor, a
structure with three pointers: the actual pointer to the code to which to jump,
the pointer to the TOC needed by the callee, and an environment pointer. We
used to chain these loads, and make them opaque to the rest of the optimizer,
so that they'd always occur directly before the call. This is not necessary,
and in fact, highly suboptimal on embedded cores. Once the function pointer is
known, the loads can be performed ahead of time; in fact, they can be hoisted
out of loops.
Now these function descriptors are almost always generated by the linker, and
thus the contents of the descriptors are invariant. As a result, by default,
we'll mark the associated loads as invariant (allowing them to be hoisted out
of loops). I've added a target feature to turn this off, however, just in case
someone needs that option (constructing an on-stack descriptor, casting it to a
function pointer, and then calling it cannot be well-defined C/C++ code, but I
can imagine some JIT-compilation system doing so).
Consider this simple test:
$ cat call.c
typedef void (*fp)();
void bar(fp x) {
for (int i = 0; i < 1600000000; ++i)
x();
}
$ cat main.c
typedef void (*fp)();
void bar(fp x);
void foo() {}
int main() {
bar(foo);
}
On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads
as invariant brings the execution time down to ~8 seconds from ~32 seconds with
the loads in the loop.
The difference on the POWER7 is smaller. Compiling with:
gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]
clang -O3 -mcpu=native call.c main.c : ~5.3 seconds
clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
(looks like we'd benefit from additional loop unrolling here, as a first
guess, because this is faster with the extra loads)
The -mno-invariant-function-descriptors will be added to Clang shortly.
llvm-svn: 226207
This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.
Original commit message:
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
llvm-svn: 225909
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
llvm-svn: 225808
This initial implementation of PPCTargetLowering::isZExtFree marks as free
zexts of small scalar loads (that are not sign-extending). This callback is
used by SelectionDAGBuilder's RegsForValue::getCopyToRegs, and thus to
determine whether a zext or an anyext is used to lower illegally-typed PHIs.
Because later truncates of zero-extended values are nops, this allows for the
elimination of later unnecessary truncations.
Fixes the initial complaint associated with PR22120.
llvm-svn: 225584
On modern cores with lfiw[az]x, we can fold a sign or zero extension from i32
to i64 into the load necessary for an i64 -> fp conversion.
llvm-svn: 225493
int->fp conversions on PPC must be done through memory loads and stores. On a
modern core, this process begins by storing the int value to memory, then
loading it using a (sometimes special) FP load instruction. Unfortunately, we
would do this even when the value to be converted was itself a load, and we can
just use that same memory location instead of copying it to another first.
There is a slight complication when handling int_to_fp(fp_to_int(x)) pairs,
because the fp_to_int operand has not been lowered when the int_to_fp is being
lowered. We handle this specially by invoking fp_to_int's lowering logic
(partially) and getting the necessary memory location (some trivial refactoring
was done to make this possible).
This is all somewhat ugly, and it would be nice if some later CodeGen stage
could just clean this stuff up, but because doing so would involve modifying
target-specific nodes (or instructions), it is not immediately clear how that
would work.
Also, remove a related entry from the README.txt for which we now generate
reasonable code.
llvm-svn: 225301
The old target DAG combine that allowed for performing int_to_fp(fp_to_int(x))
without a load/store pair is updated here with support for unsigned integers,
and to support single-precision values without a third rounding step, on newer
cores with the appropriate instructions.
llvm-svn: 225248
PPC has an instruction for ctlz with defined zero behavior, and our lowering of
cttz (provided by DAGCombine) is also efficient and branchless, so speculating
these makes sense.
llvm-svn: 225150
The existing code provided for specifying a global loop alignment preference.
However, the preferred loop alignment might depend on the loop itself. For
recent POWER cores, loops between 5 and 8 instructions should have 32-byte
alignment (while the others are better with 16-byte alignment) so that the
entire loop will fit in one i-cache line.
To support this, getPrefLoopAlignment has been made virtual, and can be
provided with an optional MachineLoop* so the target can inspect the loop
before answering the query. The default behavior, as before, is to return the
value set with setPrefLoopAlignment. MachineBlockPlacement now queries the
target for each loop instead of only once per function. There should be no
functional change for other targets.
llvm-svn: 225117
Newer POWER cores, and the A2, support the cmpb instruction. This instruction
compares its operands, treating each of the 8 bytes in the GPRs separately,
returning a 'mask' result of 0 (for false) or -1 (for true) in each byte.
Code generation support is added, in the form of a PPCISelDAGToDAG
DAG-preprocessing routine, that recognizes patterns close to what the
instruction computes (either exactly, or related by a constant masking
operation), and generates the cmpb instruction (along with any necessary
constant masking operation). This can be expanded if use cases arise.
llvm-svn: 225106
On non-Darwin PPC64, the TOC reload needs to come directly after the bctrl
instruction (for indirect calls) because the 'bctrl/ld 2, 40(1)' instruction
sequence is interpreted by the unwinding code in libgcc. To make sure these
occur as a pair, as with other pairings interpreted by the linker, fuse the two
instructions into one instruction (for code generation only).
In the future, we might wish to do this by emitting CFI directives instead,
but this solution is simpler, and mirrors what GCC does. Additional discussion
on this point is contained in the PR.
Fixes PR22015.
llvm-svn: 224788
PPCISelDAGToDAG contained existing code to lower i32 sdiv by a power-of-2 using
srawi/addze, but did not implement the i64 case. DAGCombine now contains a
callback specifically designed for this purpose (BuildSDIVPow2), and part of
the logic has been moved to an implementation of that callback. Doing this
lowering using BuildSDIVPow2 likely does not matter, compared to handling
everything in PPCISelDAGToDAG, for the positive divisor case, but the negative
divisor case, which generates an additional negation, can potentially benefit
from additional folding from DAGCombine. Now, both the i32 and the i64 cases
have been implemented.
Fixes PR20732.
llvm-svn: 224033
This patch addresses the inherent big-endian bias in the lxvd2x,
lxvw4x, stxvd2x, and stxvw4x instructions. These instructions load
vector elements into registers left-to-right (with the first element
loaded into the high-order bits of the register), regardless of the
endian setting of the processor. However, these are the only
vector memory instructions that permit unaligned storage accesses, so
we want to use them for little-endian.
To make this work, a lxvd2x or lxvw4x is replaced with an lxvd2x
followed by an xxswapd, which swaps the doublewords. This works for
lxvw4x as well as lxvd2x, because for lxvw4x on an LE system the
vector elements are in LE order (right-to-left) within each
doubleword. (Thus after lxvw2x of a <4 x float> the elements will
appear as 1, 0, 3, 2. Following the swap, they will appear as 3, 2,
0, 1, as desired.) For stores, an stxvd2x or stxvw4x is replaced
with an stxvd2x preceded by an xxswapd.
Introduction of extra swap instructions provides correctness, but
obviously is not ideal from a performance perspective. Future patches
will address this with optimizations to remove most of the introduced
swaps, which have proven effective in other implementations.
The introduction of the swaps is performed during lowering of LOAD,
STORE, INTRINSIC_W_CHAIN, and INTRINSIC_VOID operations. The latter
are used to translate intrinsics that specify the VSX loads and stores
directly into equivalent sequences for little endian. Thus code that
uses vec_vsx_ld and vec_vsx_st does not have to be modified to be
ported from BE to LE.
We introduce new PPCISD opcodes for LXVD2X, STXVD2X, and XXSWAPD for
use during this lowering step. In PPCInstrVSX.td, we add new SDType
and SDNode definitions for these (PPClxvd2x, PPCstxvd2x, PPCxxswapd).
These are recognized during instruction selection and mapped to the
correct instructions.
Several tests that were written to use -mcpu=pwr7 or pwr8 are modified
to disable VSX on LE variants because code generation changes with
this and subsequent patches in this set. I chose to include all of
these in the first patch than try to rigorously sort out which tests
were broken by one or another of the patches. Sorry about that.
The new test vsx-ldst-builtin-le.ll, and the changes to vsx-ldst.ll,
are disabled until LE support is enabled because of breakages that
occur as noted in those tests. They are re-enabled in patch 4/4.
llvm-svn: 223783
We've long supported readcyclecounter on PPC64, but it is easier there (the
read of the 64-bit time-base register can be accomplished via a single
instruction). This now provides an implementation for PPC32 as well. On PPC32,
the time-base register is still 64 bits, but can only be read 32 bits at a time
via two separate SPRs. The ISA manual explains how to do this properly (it
involves re-reading the upper bits and looping if the counter has wrapped while
being read).
This requires PPC to implement a custom integer splitting legalization for the
READCYCLECOUNTER node, turning it into a target-specific SDAG node, which then
gets turned into a pseudo-instruction, which is then expanded to the necessary
sequence (which has three SPR reads, the comparison and the branch).
Thanks to Paul Hargrove for pointing out to me that this was still unimplemented.
llvm-svn: 223161
This does not matter on newer cores (where we can use reciprocal estimates in
fast-math mode anyway), but for older cores this allows us to generate better
fast-math code where we have multiple FDIVs with a common divisor.
llvm-svn: 222710
My original support for the general dynamic and local dynamic TLS
models contained some fairly obtuse hacks to generate calls to
__tls_get_addr when lowering a TargetGlobalAddress. Rather than
generating real calls, special GET_TLS_ADDR nodes were used to wrap
the calls and only reveal them at assembly time. I attempted to
provide correct parameter and return values by chaining CopyToReg and
CopyFromReg nodes onto the GET_TLS_ADDR nodes, but this was also not
fully correct. Problems were seen with two back-to-back stores to TLS
variables, where the call sequences ended up overlapping with unhappy
results. Additionally, since these weren't real calls, the proper
register side effects of a call were not recorded, so clobbered values
were kept live across the calls.
The proper thing to do is to lower these into calls in the first
place. This is relatively straightforward; see the changes to
PPCTargetLowering::LowerGlobalTLSAddress() in PPCISelLowering.cpp.
The changes here are standard call lowering, except that we need to
track the fact that these calls will require a relocation. This is
done by adding a machine operand flag of MO_TLSLD or MO_TLSGD to the
TargetGlobalAddress operand that appears earlier in the sequence.
The calls to LowerCallTo() eventually find their way to
LowerCall_64SVR4() or LowerCall_32SVR4(), which call FinishCall(),
which calls PrepareCall(). In PrepareCall(), we detect the calls to
__tls_get_addr and immediately snag the TargetGlobalTLSAddress with
the annotated relocation information. This becomes an extra operand
on the call following the callee, which is expected for nodes of type
tlscall. We change the call opcode to CALL_TLS for this case. Back
in FinishCall(), we change it again to CALL_NOP_TLS for 64-bit only,
since we require a TOC-restore nop following the call for the 64-bit
ABIs.
During selection, patterns in PPCInstrInfo.td and PPCInstr64Bit.td
convert the CALL_TLS nodes into BL_TLS nodes, and convert the
CALL_NOP_TLS nodes into BL8_NOP_TLS nodes. This replaces the code
removed from PPCAsmPrinter.cpp, as the BL_TLS or BL8_NOP_TLS
nodes can now be emitted normally using their patterns and the
associated printTLSCall print method.
Finally, as a result of these changes, all references to get-tls-addr
in its various guises are no longer used, so they have been removed.
There are existing TLS tests to verify the changes haven't messed
anything up). I've added one new test that verifies that the problem
with the original code has been fixed.
llvm-svn: 221703
This is a first step for generating SSE rsqrt instructions for
reciprocal square root calcs when fast-math is allowed.
For now, be conservative and only enable this for AMD btver2
where performance improves significantly - for example, 29%
on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c
(if we convert the data type to single-precision float).
This patch adds a two constant version of the Newton-Raphson
refinement algorithm to DAGCombiner that can be selected by any target
via a parameter returned by getRsqrtEstimate()..
See PR20900 for more details:
http://llvm.org/bugs/show_bug.cgi?id=20900
Differential Revision: http://reviews.llvm.org/D5658
llvm-svn: 220570
It was hacky to use an opcode as a switch because it won't always match
(rsqrte != sqrte), and it looks like we'll need to add more special casing
per arch than I had hoped for. Eg, x86 will prefer a different NR estimate
implementation. ARM will want to use it's 'step' instructions. There also
don't appear to be any new estimate instructions in any arch in a long,
long time. Altivec vloge and vexpte may have been the first and last in
that field...
llvm-svn: 218698
This is purely refactoring. No functional changes intended. PowerPC is the only target
that is currently using this interface.
The ultimate goal is to allow targets other than PowerPC (certainly X86 and Aarch64) to turn this:
z = y / sqrt(x)
into:
z = y * rsqrte(x)
And:
z = y / x
into:
z = y * rcpe(x)
using whatever HW magic they can use. See http://llvm.org/bugs/show_bug.cgi?id=20900 .
There is one hook in TargetLowering to get the target-specific opcode for an estimate instruction
along with the number of refinement steps needed to make the estimate usable.
Differential Revision: http://reviews.llvm.org/D5484
llvm-svn: 218553
Summary:
This patch makes use of AtomicExpandPass in Power for inserting fences around
atomic as part of an effort to remove fence insertion from SelectionDAGBuilder.
As a big bonus, it lets us use sync 1 (lightweight sync, often used by the mnemonic
lwsync) instead of sync 0 (heavyweight sync) in many cases.
I also added a test, as there was no test for the barriers emitted by the Power
backend for atomic loads and stores.
Test Plan: new test + make check-all
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5180
llvm-svn: 218331
This is purely a plumbing patch. No functional changes intended.
The ultimate goal is to allow targets other than PowerPC (certainly X86 and Aarch64) to turn this:
z = y / sqrt(x)
into:
z = y * rsqrte(x)
using whatever HW magic they can use. See http://llvm.org/bugs/show_bug.cgi?id=20900 .
The first step is to add a target hook for RSQRTE, take the already target-independent code selfishly hoarded by PPC, and put it into DAGCombiner.
Next steps:
The code in DAGCombiner::BuildRSQRTE() should be refactored further; tests that exercise that logic need to be added.
Logic in PPCTargetLowering::BuildRSQRTE() should be hoisted into DAGCombiner.
X86 and AArch64 overrides for TargetLowering.BuildRSQRTE() should be added.
Differential Revision: http://reviews.llvm.org/D5425
llvm-svn: 218219
The heuristic used by DAGCombine to form FMAs checks that the FMUL has only one
use, but this is overly-conservative on some systems. Specifically, if the FMA
and the FADD have the same latency (and the FMA does not compete for resources
with the FMUL any more than the FADD does), there is no need for the
restriction, and furthermore, forming the FMA leaving the FMUL can still allow
for higher overall throughput and decreased critical-path length.
Here we add a new TLI callback, enableAggressiveFMAFusion, false by default, to
elide the hasOneUse check. This is enabled for PowerPC by default, as most
PowerPC systems will benefit.
Patch by Olivier Sallenave, thanks!
llvm-svn: 218120
Add header guards to files that were missing guards. Remove #endif comments
as they don't seem common in LLVM (we can easily add them back if we decide
they're useful)
Changes made by clang-tidy with minor tweaks.
llvm-svn: 215558
This implements PPCTargetLowering::getTgtMemIntrinsic for Altivec load/store
intrinsics. As with the construction of the MachineMemOperands for the
intrinsic calls used for unaligned load/store lowering, the only slight
complication is that we need to represent a larger memory range than the
loaded/stored value-type size (because the address is rounded down to an
aligned address, and we need to conservatively represent the entire possible
range of the actual access). This required adding an extra size field to
TargetLowering::IntrinsicInfo, and this was done in a way that required no
modifications to other targets (the size defaults to the store size of the
provided memory data type).
This fixes test/CodeGen/PowerPC/unal-altivec-wint.ll (so it can be un-XFAILed).
llvm-svn: 215512
Commits r213915 and r214718 fix recognition of shuffle masks for vmrg*
and vpku*um instructions for a little-endian target, by swapping the
input arguments. The vsldoi instruction requires similar treatment,
and also needs its shift count adjusted for little endian.
Reviewed by Ulrich Weigand.
This is a bug fix candidate for release 3.5 (and hopefully the last of
those for PowerPC).
llvm-svn: 214923
In commit r213915, Bill fixed little-endian usage of vmrgh* and vmrgl*
by swapping the input arguments. As it turns out, the exact same fix
is also required for the vpkuhum/vpkuwum patterns.
This fixes another regression in llvmpipe when vector support is
enabled.
Reviewed by Bill Schmidt.
llvm-svn: 214718
Rename to allowsMisalignedMemoryAccess.
On R600, 8 and 16 byte accesses are mostly OK with 4-byte alignment,
and don't need to be split into multiple accesses. Vector loads with
an alignment of the element type are not uncommon in OpenCL code.
llvm-svn: 214055
Because the PowerPC vmrgh* and vmrgl* instructions have a built-in
big-endian bias, it is necessary to swap their inputs in little-endian
mode when using them to implement a vector shuffle. This was
previously missed in the vector LE implementation.
There was already logic to distinguish between unary and "normal"
vmrg* vector shuffles, so this patch extends that logic to use a third
option: "swapped" vmrg* vector shuffles that are used for little
endian in place of the "normal" ones.
I've updated the vec-shuffle-le.ll test to check for the expected
register ordering on the generated instructions.
This bug was discovered when testing the LE and ELFv2 patches for
safety if they were backported to 3.4. A different vectorization
decision was made in 3.4 than on mainline trunk, and that exposed the
problem. I've verified this fix takes care of that issue.
llvm-svn: 213915
This patch adds infrastructure support for passing array types
directly. These can be used by the front-end to pass aggregate
types (coerced to an appropriate array type). The details of the
array type being used inform the back-end about ABI-relevant
properties. Specifically, the array element type encodes:
- whether the parameter should be passed in FPRs, VRs, or just
GPRs/stack slots (for float / vector / integer element types,
respectively)
- what the alignment requirements of the parameter are when passed in
GPRs/stack slots (8 for float / 16 for vector / the element type
size for integer element types) -- this corresponds to the
"byval align" field
Using the infrastructure provided by this patch, a companion patch
to clang will enable two features:
- In the ELFv2 ABI, pass (and return) "homogeneous" floating-point
or vector aggregates in FPRs and VRs (this is similar to the ARM
homogeneous aggregate ABI)
- As an optimization for both ELFv1 and ELFv2 ABIs, pass aggregates
that fit fully in registers without using the "byval" mechanism
The patch uses the functionArgumentNeedsConsecutiveRegisters callback
to encode that special treatment is required for all directly-passed
array types. The isInConsecutiveRegs / isInConsecutiveRegsLast bits set
as a results are then used to implement the required size and alignment
rules in CalculateStackSlotSize / CalculateStackSlotAlignment etc.
As a related change, the ABI routines have to be modified to support
passing floating-point types in GPRs. This is necessary because with
homogeneous aggregates of 4-byte float type we can now run out of FPRs
*before* we run out of the 64-byte argument save area that is shadowed
by GPRs. Any extra floating-point arguments that no longer fit in FPRs
must now be passed in GPRs until we run out of those too.
Note that there was already code to pass floating-point arguments in
GPRs used with vararg parameters, which was done by writing the argument
out to the argument save area first and then reloading into GPRs. The
patch re-implements this, however, in favor of code packing float arguments
directly via extension/truncation, BITCAST, and BUILD_PAIR operations.
This is required to support the ELFv2 ABI, since we cannot unconditionally
write to the argument save area (which the caller might not have allocated).
The change does, however, affect ELFv1 varags routines too; but even here
the overall effect should be advantageous: Instead of loading the argument
into the FPR, then storing the argument to the stack slot, and finally
reloading the argument from the stack slot into a GPR, the new code now
just loads the argument into the FPR, and subsequently loads the argument
into the GPR (via BITCAST). That BITCAST might imply a save/reload from
a stack temporary (in which case we're no worse than before); but it
might be implemented more efficiently in some cases.
The final part of the patch enables up to 8 FPRs and VRs for argument
return in PPCCallingConv.td; this is required to support returning
ELFv2 homogeneous aggregates. (Note that this doesn't affect other ABIs
since LLVM wil only look for which register to use if the parameter is
marked as "direct" return anyway.)
Reviewed by Hal Finkel.
llvm-svn: 213493
The PPCISelLowering.cpp routines PPCTargetLowering::setMinReservedArea and
CalculateParameterAndLinkageAreaSize are currently used as subroutines
from both 64-bit SVR4 and Darwin ABI code.
However, the two ABIs are already quite different w.r.t. AltiVec
conventions, and they will become more different when the ELFv2 ABI is
supported. Also, in general it seems better to disentangle ABI support
routines for different ABIs to avoid accidentally affecting one ABI when
intending to change only the other.
(Actually, the current code strictly speaking already contains a bug:
these routines call PPCFrameLowering::getMinCallFrameSize and
PPCFrameLowering::getLinkageSize with the IsDarwin parameter set to
"true" even on 64-bit SVR4. This bug currently has no adverse effect
since those routines always return the same for 64-bit SVR4 and 64-bit
Darwin, but it still seems wrong ... I'll fix this in a follow-up
commit shortly.)
To remove this code sharing, I'm simply inlining both routines into all
call sites (there are just two each, one for 64-bit SVR4 and one for
Darwin), and simplifying due to constant parameters where possible.
A small piece of code that *does* make sense to share is refactored into
the new routine EnsureStackAlignment, now also called from 32-bit SVR4
ABI code.
No change in generated code is expected.
llvm-svn: 211493
During an indirect function call sequence on the 64-bit SVR4 ABI,
generate code must load and then restore the TOC register.
This does not use a regular LOAD instruction since the TOC
register r2 is marked as reserved. Instead, the are two
special instruction patterns:
let RST = 2, DS = 2 in
def LDinto_toc: DSForm_1a<58, 0, (outs), (ins g8rc:$reg),
"ld 2, 8($reg)", IIC_LdStLD,
[(PPCload_toc i64:$reg)]>, isPPC64;
let RST = 2, DS = 10, RA = 1 in
def LDtoc_restore : DSForm_1a<58, 0, (outs), (ins),
"ld 2, 40(1)", IIC_LdStLD,
[(PPCtoc_restore)]>, isPPC64;
Note that these not only restrict the destination of the
load to r2, but they also restrict the *source* of the
load to particular address combinations. The latter is
a problem when we want to support the ELFv2 ABI, since
there the TOC save slot is no longer at 40(1).
This patch replaces those two instructions with a single
instruction pattern that only hard-codes r2 as destination,
but supports generic addresses as source. This will allow
supporting the ELFv2 ABI, and also helps generate more
efficient code for calls to absolute addresses (allowing
simplification of the ppc64-calls.ll test case).
llvm-svn: 211193
Various masks on shufflevector instructions are recognizable as
specific PowerPC instructions (vector pack, vector merge, etc.).
There is existing code in PPCISelLowering.cpp to recognize the correct
patterns for big endian code. The masks for these instructions are
different for little endian code due to the big-endian numbering
employed by these instructions. This patch adds the recognition code
for little endian.
I've added a new test case test/CodeGen/PowerPC/vec_shuffle_le.ll for
this. The existing recognizer test (vec_shuffle.ll) is unnecessarily
verbose and difficult to read, so I felt it was better to add a new
test rather than modify the old one.
llvm-svn: 210536
Support for the intrinsics that read from and write to global named registers
is added for r1, r2 and r13 (depending on the subtarget).
llvm-svn: 208509
If we have two unique values for a v2i64 build vector, this will always result
in two vector loads if we expand using shuffles. Only one is necessary.
llvm-svn: 205231
sitofp from v2i32 to v2f64 ends up generating a SIGN_EXTEND_INREG v2i64 node
(and similarly for v2i16 and v2i8). Even though there are no sign-extension (or
algebraic shifts) for v2i64 types, we can handle v2i32 sign extensions by
converting two and from v2i64. The small trick necessary here is to shift the
i32 elements into the right lanes before the i32 -> f64 step. This is because
of the big Endian nature of the system, we need the i32 portion in the high
word of the i64 elements.
For v2i16 and v2i8 we can do the same, but we first use the default Altivec
shift-based expansion from v2i16 or v2i8 to v2i32 (by casting to v4i32) and
then apply the above procedure.
llvm-svn: 205146
This change enables tracking i1 values in the PowerPC backend using the
condition register bits. These bits can be treated on PowerPC as separate
registers; individual bit operations (and, or, xor, etc.) are supported.
Tracking booleans in CR bits has several advantages:
- Reduction in register pressure (because we no longer need GPRs to store
boolean values).
- Logical operations on booleans can be handled more efficiently; we used to
have to move all results from comparisons into GPRs, perform promoted
logical operations in GPRs, and then move the result back into condition
register bits to be used by conditional branches. This can be very
inefficient, because the throughput of these CR <-> GPR moves have high
latency and low throughput (especially when other associated instructions
are accounted for).
- On the POWER7 and similar cores, we can increase total throughput by using
the CR bits. CR bit operations have a dedicated functional unit.
Most of this is more-or-less mechanical: Adjustments were needed in the
calling-convention code, support was added for spilling/restoring individual
condition-register bits, and conditional branch instruction definitions taking
specific CR bits were added (plus patterns and code for generating bit-level
operations).
This is enabled by default when running at -O2 and higher. For -O0 and -O1,
where the ability to debug is more important, this feature is disabled by
default. Individual CR bits do not have assigned DWARF register numbers,
and storing values in CR bits makes them invisible to the debugger.
It is critical, however, that we don't move i1 values that have been promoted
to larger values (such as those passed as function arguments) into bit
registers only to quickly turn around and move the values back into GPRs (such
as happens when values are returned by functions). A pair of target-specific
DAG combines are added to remove the trunc/extends in:
trunc(binary-ops(binary-ops(zext(x), zext(y)), ...)
and:
zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)
In short, we only want to use CR bits where some of the i1 values come from
comparisons or are used by conditional branches or selects. To put it another
way, if we can do the entire i1 computation in GPRs, then we probably should
(on the POWER7, the GPR-operation throughput is higher, and for all cores, the
CR <-> GPR moves are expensive).
POWER7 test-suite performance results (from 10 runs in each configuration):
SingleSource/Benchmarks/Misc/mandel-2: 35% speedup
MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup
SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup
SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup
SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup
SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown
MultiSource/Applications/lemon/lemon: 8% slowdown
llvm-svn: 202451
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
This is the first of many upcoming patches for PowerPC fast
instruction selection support. This patch implements the minimum
necessary for a functional (but extremely limited) FastISel pass. It
allows the table-generated portions of the selector to be created and
used, but in most cases selection will fall back to the DAG selector.
None of the block terminator instructions are implemented yet, and
most interesting instructions require some special handling.
Therefore there aren't any new test cases with this patch. There will
be quite a few tests coming with future patches.
This patch adds the make/CMake support for the new code (including
tablegen -gen-fast-isel) and creates the FastISel object for PPC64 ELF
only. It instantiates the necessary virtual functions
(TargetSelectInstruction, TargetMaterializeConstant,
TargetMaterializeAlloca, tryToFoldLoadIntoMI, and FastLowerArguments),
but of these, only TargetMaterializeConstant contains any useful
implementation. This is present since the table-generated code
requires the ability to materialize integer constants for some
instructions.
This patch has been tested by building and running the
projects/test-suite code with -O0. All tests passed with the
exception of a couple of long-running tests that time out using -O0
code generation.
llvm-svn: 187399
in-tree implementations of TargetLoweringBase::isFMAFasterThanMulAndAdd in
order to resolve the following issues with fmuladd (i.e. optional FMA)
intrinsics:
1. On X86(-64) targets, ISD::FMA nodes are formed when lowering fmuladd
intrinsics even if the subtarget does not support FMA instructions, leading
to laughably bad code generation in some situations.
2. On AArch64 targets, ISD::FMA nodes are formed for operations on fp128,
resulting in a call to a software fp128 FMA implementation.
3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics on types
like v2f32, v8f32, v4f64, etc., even though they promote, split, scalarize,
etc. to types that support hardware FMAs.
The function has also been slightly renamed for consistency and to force a
merge/build conflict for any out-of-tree target implementing it. To resolve,
see comments and fixed in-tree examples.
llvm-svn: 185956
When accessing just a single CR register, it is always preferable to
use mfocrf instead of mfcr, if the former is available on the CPU.
Current code makes that distinction in many, but not all places
where a single CR register value is retrieved. One missing
location is PPCRegisterInfo::lowerCRSpilling.
To fix this and make this simpler in the future, this patch changes
the bulk of the back-end to always assume mfocrf is available and
simply generate it when needed.
On machines that actually do not support mfocrf, the instruction
is replaced by mfcr at the very end, in EmitInstruction.
This has the additional benefit that we no longer need the
MFCRpseud hack, since before EmitInstruction we always have
a MFOCRF instruction pattern, which already models data flow
as required.
The patch also adds the MFOCRF8 version of the instruction,
which was missing so far.
Except for the PPCRegisterInfo::lowerCRSpilling case, no change
in generated code intended.
llvm-svn: 185556
This is a preparatory patch for fast-isel support. The instruction
selector will need to access some functions in PPCGenCallingConv.inc,
which in turn requires several helper functions to be defined. These
are currently defined near the only use of PCCGenCallingConv.inc,
inside PPCISelLowering.cpp. This patch moves the declaration of the
functions into the associated header file to provide the needed
visibility.
No functional change intended.
llvm-svn: 183844
This is the second part of the change to always return "true"
offset values from getPreIndexedAddressParts, tackling the
case of "memrix" type operands.
This is about instructions like LD/STD that only have a 14-bit
field to encode immediate offsets, which are implicitly extended
by two zero bits by the machine, so that in effect we can access
16-bit offsets as long as they are a multiple of 4.
The PowerPC back end currently handles such instructions by
carrying the 14-bit value (as it will get encoded into the
actual machine instructions) in the machine operand fields
for such instructions. This means that those values are
in fact not the true offset, but rather the offset divided
by 4 (and then truncated to an unsigned 14-bit value).
Like in the case fixed in r182012, this makes common code
operations on such offset values not work as expected.
Furthermore, there doesn't really appear to be any strong
reason why we should encode machine operands this way.
This patch therefore changes the encoding of "memrix" type
machine operands to simply contain the "true" offset value
as a signed immediate value, while enforcing the rules that
it must fit in a 16-bit signed value and must also be a
multiple of 4.
This change must be made simultaneously in all places that
access machine operands of this type. However, just about
all those changes make the code simpler; in many cases we
can now just share the same code for memri and memrix
operands.
llvm-svn: 182032
The old PPCCTRLoops pass, like the Hexagon pass version from which it was
derived, could only handle some simple loops in canonical form. We cannot
directly adapt the new Hexagon hardware loops pass, however, because the
Hexagon pass contains a fundamental assumption that non-constant-trip-count
loops will contain a guard, and this is not always true (the result being that
incorrect negative counts can be generated). With this commit, we replace the
pass with a late IR-level pass which makes use of SE to calculate the
backedge-taken counts and safely generate the loop-count expressions (including
any necessary max() parts). This IR level pass inserts custom intrinsics that
are lowered into the desired decrement-and-branch instructions.
The most fragile part of this new implementation is that interfering uses of
the counter register must be detected on the IR level (and, on PPC, this also
includes any indirect branches in addition to function calls). Also, to make
all of this work, we need a variant of the mtctr instruction that is marked
as having side effects. Without this, machine-code level CSE, DCE, etc.
illegally transform the resulting code. Hopefully, this can be improved
in the future.
This new pass is smaller than the original (and much smaller than the new
Hexagon hardware loops pass), and can handle many additional cases correctly.
In addition, the preheader-creation code has been copied from LoopSimplify, and
after we decide on where it belongs, this code will be refactored so that it
can be explicitly shared (making this implementation even smaller).
The new test-case files ctrloop-{le,lt,ne}.ll have been adapted from tests for
the new Hexagon pass. There are a few classes of loops that this pass does not
transform (noted by FIXMEs in the files), but these deficiencies can be
addressed within the SE infrastructure (thus helping many other passes as well).
llvm-svn: 181927
On cores for which we know the misprediction penalty, and we have
the isel instruction, we can profitably perform early if conversion.
This enables us to replace some small branch sequences with selects
and avoid the potential stalls from mispredicting the branches.
Enabling this feature required implementing canInsertSelect and
insertSelect in PPCInstrInfo; isel code in PPCISelLowering was
refactored to use these functions as well.
llvm-svn: 178926
When unsafe FP math operations are enabled, we can use the fre[s] and
frsqrte[s] instructions, which generate reciprocal (sqrt) estimates, together
with some Newton iteration, in order to quickly generate floating-point
division and sqrt results. All of these instructions are separately optional,
and so each has its own feature flag (except for the Altivec instructions,
which are covered under the existing Altivec flag). Doing this is not only
faster than using the IEEE-compliant fdiv/fsqrt instructions, but allows these
computations to be pipelined with other computations in order to hide their
overall latency.
I've also added a couple of missing fnmsub patterns which turned out to be
missing (but are necessary for good code generation of the Newton iterations).
Altivec needs a similar fix, but that will probably be more complicated because
fneg is expanded for Altivec's v4f32.
llvm-svn: 178617
The P7 and A2 have additional floating-point conversion instructions which
allow a direct two-instruction sequence (plus load/store) to convert from all
combinations (signed/unsigned i32/i64) <--> (float/double) (on previous cores,
only some combinations were directly available).
llvm-svn: 178480