2008-09-26 12:40:32 +08:00
|
|
|
set(LLVM_TARGET_DEFINITIONS PPC.td)
|
|
|
|
|
2011-11-05 03:04:23 +08:00
|
|
|
tablegen(LLVM PPCGenAsmWriter.inc -gen-asm-writer)
|
2013-05-04 03:49:39 +08:00
|
|
|
tablegen(LLVM PPCGenAsmMatcher.inc -gen-asm-matcher)
|
2013-12-20 00:13:01 +08:00
|
|
|
tablegen(LLVM PPCGenDisassemblerTables.inc -gen-disassembler)
|
2014-09-03 06:28:02 +08:00
|
|
|
tablegen(LLVM PPCGenMCCodeEmitter.inc -gen-emitter)
|
2011-11-05 03:04:23 +08:00
|
|
|
tablegen(LLVM PPCGenRegisterInfo.inc -gen-register-info)
|
|
|
|
tablegen(LLVM PPCGenInstrInfo.inc -gen-instr-info)
|
|
|
|
tablegen(LLVM PPCGenDAGISel.inc -gen-dag-isel)
|
2013-07-30 08:50:39 +08:00
|
|
|
tablegen(LLVM PPCGenFastISel.inc -gen-fast-isel)
|
2011-11-05 03:04:23 +08:00
|
|
|
tablegen(LLVM PPCGenCallingConv.inc -gen-callingconv)
|
|
|
|
tablegen(LLVM PPCGenSubtargetInfo.inc -gen-subtarget)
|
Clean up a pile of hacks in our CMake build relating to TableGen.
The first problem to fix is to stop creating synthetic *Table_gen
targets next to all of the LLVM libraries. These had no real effect as
CMake specifies that add_custom_command(OUTPUT ...) directives (what the
'tablegen(...)' stuff expands to) are implicitly added as dependencies
to all the rules in that CMakeLists.txt.
These synthetic rules started to cause problems as we started more and
more heavily using tablegen files from *subdirectories* of the one where
they were generated. Within those directories, the set of tablegen
outputs was still available and so these synthetic rules added them as
dependencies of those subdirectories. However, they were no longer
properly associated with the custom command to generate them. Most of
the time this "just worked" because something would get to the parent
directory first, and run tablegen there. Once run, the files existed and
the build proceeded happily. However, as more and more subdirectories
have started using this, the probability of this failing to happen has
increased. Recently with the MC refactorings, it became quite common for
me to hit this failure when touching a large enough number of targets.
To add insult to injury, several of the backends *tried* to fix this by
adding explicit dependencies back to the parent directory's tablegen
rules, but those dependencies didn't work as expected -- they weren't
forming a linear chain, they were adding another thread in the race.
This patch removes these synthetic rules completely, and adds a much
simpler function to declare explicitly that a collection of tablegen'ed
files are referenced by other libraries. From that, we can add explicit
dependencies from the smaller libraries (such as every architecture's
Desc library) on this and correctly form a linear sequence. All of the
backends are updated to use it, sometimes replacing the existing attempt
at adding a dependency, sometimes adding a previously missing dependency
edge.
Please let me know if this causes any problems, but it fixes a rather
persistent and problematic source of build flakiness on our end.
llvm-svn: 136023
2011-07-26 08:09:08 +08:00
|
|
|
add_public_tablegen_target(PowerPCCommonTableGen)
|
2008-09-26 12:40:32 +08:00
|
|
|
|
|
|
|
add_llvm_target(PowerPCCodeGen
|
2015-12-08 04:50:29 +08:00
|
|
|
PPCBoolRetToInt.cpp
|
2010-11-15 02:33:33 +08:00
|
|
|
PPCAsmPrinter.cpp
|
2008-09-26 12:40:32 +08:00
|
|
|
PPCBranchSelector.cpp
|
2012-06-08 23:38:21 +08:00
|
|
|
PPCCTRLoops.cpp
|
2008-09-26 12:40:32 +08:00
|
|
|
PPCHazardRecognizers.cpp
|
|
|
|
PPCInstrInfo.cpp
|
|
|
|
PPCISelDAGToDAG.cpp
|
|
|
|
PPCISelLowering.cpp
|
2015-02-02 06:58:46 +08:00
|
|
|
PPCEarlyReturn.cpp
|
2013-07-30 08:50:39 +08:00
|
|
|
PPCFastISel.cpp
|
2011-01-10 20:39:23 +08:00
|
|
|
PPCFrameLowering.cpp
|
[PowerPC] Loop Data Prefetching for the BG/Q
The IBM BG/Q supercomputer's A2 cores have a hardware prefetching unit, the
L1P, but it does not prefetch directly into the A2's L1 cache. Instead, it
prefetches into its own L1P buffer, and the latency to access that buffer is
significantly higher than that to the L1 cache (although smaller than the
latency to the L2 cache). As a result, especially when multiple hardware
threads are not actively busy, explicitly prefetching data into the L1 cache is
advantageous.
I've been using this pass out-of-tree for data prefetching on the BG/Q for well
over a year, and it has worked quite well. It is enabled by default only for
the BG/Q, but can be enabled for other cores as well via a command-line option.
Eventually, we might want to add some TTI interfaces and move this into
Transforms/Scalar (there is nothing particularly target dependent about it,
although only machines like the BG/Q will benefit from its simplistic
strategy).
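A source-level analogue of the idea above, as a minimal sketch only. The pass described in this commit works inside the compiler, not on user code; the function name sumWithPrefetch and the constant kPrefetchDistance are hypothetical, and the distance value is an arbitrary assumption rather than anything tuned for the BG/Q. __builtin_prefetch is the GCC/Clang builtin and is only a hint.

```cpp
#include <cstddef>

// Hypothetical prefetch distance in elements. A real heuristic would derive
// this from memory latency and loop body cost; 64 is an arbitrary assumption.
constexpr std::size_t kPrefetchDistance = 64;

// Sums n doubles while asking the hardware to start fetching the element
// kPrefetchDistance iterations ahead of the one being read.
double sumWithPrefetch(const double *data, std::size_t n) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    // Guard keeps the address expression inside the array.
    if (i + kPrefetchDistance < n)
      __builtin_prefetch(&data[i + kPrefetchDistance], /*rw=*/0, /*locality=*/3);
    sum += data[i];
  }
  return sum;
}
```

The result is the same as a plain summation loop; only the memory access timing may differ, and whether it helps depends on the core.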
llvm-svn: 229966
2015-02-20 13:08:21 +08:00
|
|
|
PPCLoopDataPrefetch.cpp
|
[PowerPC] Prepare loops for pre-increment loads/stores
PowerPC supports pre-increment load/store instructions (except for Altivec/VSX
vector load/stores). Using these on embedded cores can be very important, but
most loops are not naturally set up to use them. We can often change that,
however, by placing loops into a non-canonical form. Generically, this means
transforming loops like this:
for (int i = 0; i < n; ++i)
array[i] = c;
to look like this:
T *p = &array[-1];
for (int i = 0; i < n; ++i)
*++p = c;
The key point is that addresses accessed are pulled into dedicated PHIs and
"pre-decremented" in the loop preheader. This allows the use of pre-increment
load/store instructions without loop peeling.
A target-specific late IR-level pass (running post-LSR), PPCLoopPreIncPrep, is
introduced to perform this transformation. I've used this code out-of-tree for
generating code for the PPC A2 for over a year. Somewhat to my surprise,
running the test suite + externals on a P7 with this transformation enabled
showed no performance regressions, and one speedup:
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
-2.32514% +/- 1.03736%
So I'm going to enable it on everything for now. I was surprised by this
because, on the POWER cores, these pre-increment load/store instructions are
cracked (and, thus, harder to schedule effectively). But seeing no regressions,
and feeling that it is generally easier to split instructions apart late than
it is to combine them late, this might be the better approach regardless.
In the future, we might want to integrate this functionality into LSR (but
currently LSR does not create new PHI nodes, so (for that and other reasons)
significant work would need to be done).
llvm-svn: 228328
2015-02-06 02:43:00 +08:00
|
|
|
PPCLoopPreIncPrep.cpp
|
2010-11-15 03:53:02 +08:00
|
|
|
PPCMCInstLower.cpp
|
2011-12-20 16:42:11 +08:00
|
|
|
PPCMachineFunctionInfo.cpp
|
2015-11-11 05:43:45 +08:00
|
|
|
PPCMIPeephole.cpp
|
2008-09-26 12:40:32 +08:00
|
|
|
PPCRegisterInfo.cpp
|
|
|
|
PPCSubtarget.cpp
|
|
|
|
PPCTargetMachine.cpp
|
2013-05-14 03:34:37 +08:00
|
|
|
PPCTargetObjectFile.cpp
|
2013-01-26 07:05:59 +08:00
|
|
|
PPCTargetTransformInfo.cpp
|
[PowerPC] Add extra r2 read deps on @toc@l relocations
If some commits are happy, and some commits are sad, this is a sad commit. It
is sad because it restricts instruction scheduling to work around a binutils
linker bug, and moreover, one that may never be fixed. On 2012-05-21, GCC was
updated not to produce code triggering this bug, and now we'll do the same...
When resolving an address using the ELF ABI TOC pointer, two relocations are
generally required: one for the high part and one for the low part. Only
the high part generally explicitly depends on r2 (the TOC pointer). And, so,
we might produce code like this:
.Ltmp526:
addis 3, 2, .LC12@toc@ha
.Ltmp1628:
std 2, 40(1)
ld 5, 0(27)
ld 2, 8(27)
ld 11, 16(27)
ld 3, .LC12@toc@l(3)
rldicl 4, 4, 0, 32
mtctr 5
bctrl
ld 2, 40(1)
And there is nothing wrong with this code, as such, but there is a linker bug
in binutils (https://sourceware.org/bugzilla/show_bug.cgi?id=18414) that will
misoptimize this code sequence to this:
nop
std r2,40(r1)
ld r5,0(r27)
ld r2,8(r27)
ld r11,16(r27)
ld r3,-32472(r2)
clrldi r4,r4,32
mtctr r5
bctrl
ld r2,40(r1)
because the linker does not know (and does not check) that the value in r2
changed in between the instruction using the .LC12@toc@ha (TOC-relative)
relocation and the instruction using the .LC12@toc@l(3) relocation.
Because it finds these instructions using the relocations (and not by
scanning the instructions), it has been asserted that there is no good way
to detect the change of r2 in between. As a result, this bug may never be
fixed (i.e. it may become part of the definition of the ABI). GCC was
updated to add extra dependencies on r2 to instructions using the @toc@l
relocations to avoid this problem, and we'll do the same here.
This is done as a separate pass because:
1. These extra r2 dependencies are not really properties of the
instructions, but rather due to a linker bug, and maybe one day we'll be
able to get rid of them when targeting linkers without this bug (and,
thus, keeping the logic centralized here will make that
straightforward).
2. There are ISel-level peephole optimizations that propagate the @toc@l
relocations to some user instructions, and so the extra dependencies do
not apply only to a fixed set of instructions (without undesirable
definition replication).
The test case was reduced with the help of bugpoint, with minimal cleaning. I'm
looking forward to our upcoming MI serialization support, and with that, much
better tests can be created.
llvm-svn: 237556
2015-05-18 14:25:59 +08:00
|
|
|
PPCTOCRegDeps.cpp
|
2015-02-11 03:09:05 +08:00
|
|
|
PPCTLSDynamicCall.cpp
|
2015-02-02 06:01:29 +08:00
|
|
|
PPCVSXCopy.cpp
|
2015-02-02 05:51:22 +08:00
|
|
|
PPCVSXFMAMutate.cpp
|
[PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations
This patch adds a new SSA MI pass that runs on little-endian PPC64
code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors
without alignment constraints are accomplished for little-endian using
lxvd2x/xxswapd and xxswapd/stxvd2x. The existence of the additional
xxswapd instructions hurts performance in comparison with big-endian
code, but they are necessary in the general case to support correct
semantics.
However, the general case does not apply to most vector code. Many
vector instructions are lane-insensitive; they do not "care" which
lanes the parallel computations are performed within, provided that
the resulting data is stored into the correct locations. Thus this
pass looks for computations that perform only lane-insensitive
operations, and removes the unnecessary swaps from loads and stores in
such computations.
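Why the swaps can be dropped for lane-insensitive code can be shown with a small model. This is an illustrative sketch only: V2x64, swapDoublewords, addLanes, addWithSwaps and addWithoutSwaps are hypothetical names, a two-element array stands in for a 2x64 vector register, and the element exchange stands in for xxswapd.

```cpp
#include <array>
#include <cstdint>

// Stand-in for a 2x64 vector register.
using V2x64 = std::array<std::uint64_t, 2>;

// Models the doubleword exchange performed by a swap instruction.
V2x64 swapDoublewords(const V2x64 &v) { return V2x64{{v[1], v[0]}}; }

// A lane-insensitive operation: each lane is computed independently.
V2x64 addLanes(const V2x64 &a, const V2x64 &b) {
  return V2x64{{a[0] + b[0], a[1] + b[1]}};
}

// General case: swap after each load, compute, swap before the store.
V2x64 addWithSwaps(const V2x64 &a, const V2x64 &b) {
  return swapDoublewords(addLanes(swapDoublewords(a), swapDoublewords(b)));
}

// Swaps removed: compute on the lanes in whatever order they were loaded.
V2x64 addWithoutSwaps(const V2x64 &a, const V2x64 &b) {
  return addLanes(a, b);
}
```

Both functions return the same value for any inputs, because the permutation applied on the way in is undone on the way out and the operation in between does not depend on lane position. An operation that did depend on lane position, such as extracting a fixed element, would break this equivalence.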
Future improvements will allow computations using certain
lane-sensitive operations to also be optimized in this manner, by
modifying the lane-sensitive operations to account for the permuted
order of the lanes. However, this patch only adds the infrastructure
to permit this; no lane-sensitive operations are optimized at this
time.
This code is heavily exercised by the various vectorizing applications
in the projects/test-suite tree. For the time being, I have only added
one simple test case to demonstrate what the pass is doing. Although
it is quite simple, it provides coverage for much of the code,
including the special case handling of copies and subreg-to-reg
operations feeding the swaps. I plan to add additional tests in the
future as I fill in more of the "special handling" code.
Two existing tests were affected, because they expected the swaps to
be present, but they are now removed.
llvm-svn: 235910
2015-04-28 03:57:34 +08:00
|
|
|
PPCVSXSwapRemoval.cpp
|
2008-09-26 12:40:32 +08:00
|
|
|
)
|
2011-02-20 10:55:27 +08:00
|
|
|
|
2013-05-04 03:49:39 +08:00
|
|
|
add_subdirectory(AsmParser)
|
2013-12-20 00:13:01 +08:00
|
|
|
add_subdirectory(Disassembler)
|
2011-02-20 10:55:27 +08:00
|
|
|
add_subdirectory(InstPrinter)
|
|
|
|
add_subdirectory(TargetInfo)
|
2011-07-15 04:59:42 +08:00
|
|
|
add_subdirectory(MCTargetDesc)
|