Summary:
This patch makes use of AtomicExpandPass in Power for inserting fences around
atomic operations, as part of an effort to remove fence insertion from
SelectionDAGBuilder. As a big bonus, it lets us use sync 1 (lightweight sync,
often written with the mnemonic lwsync) instead of sync 0 (heavyweight sync)
in many cases.
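For illustration, here is the barrier placement this enables, following the
usual C++11-to-Power mapping (a sketch; not code or a test from the patch):

    #include <atomic>

    std::atomic<int> flag;

    void publish(int v) {
      // A release store only needs the lightweight barrier:
      //   lwsync; stw            (sync 1)
      flag.store(v, std::memory_order_release);
      // A seq_cst store still needs the full barrier:
      //   sync; stw              (sync 0)
      flag.store(v, std::memory_order_seq_cst);
    }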
I also added a test, as there was no test for the barriers emitted by the Power
backend for atomic loads and stores.
Test Plan: new test + make check-all
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5180
llvm-svn: 218331
Adds code generation support for dcbtst (data cache prefetch for write) and
icbt (instruction cache prefetch for read - Book E cores only).
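As a sketch of how these are reached from source (an assumed mapping, not part
of this commit): Clang lowers __builtin_prefetch to the @llvm.prefetch
intrinsic, whose rw argument distinguishes the read and write forms:

    void warm(char *p) {
      __builtin_prefetch(p, /*rw=*/1, /*locality=*/3);  // write prefetch -> dcbtst
      __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);  // read prefetch  -> dcbt
      // icbt corresponds to @llvm.prefetch on the instruction cache
      // (cache-type argument 0), available on Book E cores only.
    }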
We still end up with a 'cannot select' error for the unsupported prefetch
intrinsic forms. This will be fixed in a later commit.
Fixes PR20692.
llvm-svn: 216339
This adds initial support for PPC32 ELF PIC (Position Independent Code; the
-fPIC variety), thus rectifying a long-standing deficiency in the PowerPC
backend.
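For reference, the kind of code that needs this support (illustrative only;
'counter' is a made-up global): under -fPIC, globals are reached through the
GOT via a base register set up in the prologue, not by absolute address:

    extern int counter;

    int read_counter(void) {
      return counter;  // PIC: load the address from the GOT, then the value
    }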
Patch by Justin Hibbits!
llvm-svn: 213427
During an indirect function call sequence on the 64-bit SVR4 ABI,
generated code must load and then restore the TOC register.
This does not use a regular LOAD instruction since the TOC
register r2 is marked as reserved. Instead, there are two
special instruction patterns:
let RST = 2, DS = 2 in
def LDinto_toc: DSForm_1a<58, 0, (outs), (ins g8rc:$reg),
                          "ld 2, 8($reg)", IIC_LdStLD,
                          [(PPCload_toc i64:$reg)]>, isPPC64;

let RST = 2, DS = 10, RA = 1 in
def LDtoc_restore : DSForm_1a<58, 0, (outs), (ins),
                              "ld 2, 40(1)", IIC_LdStLD,
                              [(PPCtoc_restore)]>, isPPC64;
Note that these patterns not only restrict the destination of the
load to r2, but they also restrict the *source* of the
load to particular address combinations. The latter is
a problem when we want to support the ELFv2 ABI, since
there the TOC save slot is no longer at 40(1).
This patch replaces those two instructions with a single
instruction pattern that only hard-codes r2 as destination,
but supports generic addresses as source. This will allow
supporting the ELFv2 ABI, and also helps generate more
efficient code for calls to absolute addresses (allowing
simplification of the ppc64-calls.ll test case).
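For context (illustrative, not taken from the patch): a plain indirect call is
enough to trigger this sequence. On ELFv1 the caller's TOC pointer lives in the
save slot at 40(1); under ELFv2 it moves to 24(1), hence the need for a
generic-address source:

    void call_through(void (*fp)(void)) {
      fp();
      // ELFv1, approximately:
      //   std 2, 40(1)   ; save the caller's TOC pointer
      //   ...            ; load the target address and the callee's TOC
      //   bctrl          ; indirect call
      //   ld  2, 40(1)   ; restore the caller's TOC (LDtoc_restore)
    }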
llvm-svn: 211193
I'm under the impression that we used to infer the isCommutable flag from the
instruction-associated pattern. Regardless, we don't seem to do this (at least
by default) any more. I've gone through all of our instruction definitions, and
marked as commutative all of those that should be trivial to commute (by
exchanging the first two operands). There is special code for handling the RL*
instructions, and that has not changed.
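To see why exchanging the first two operands is safe, consider an FMA-style
instruction, which computes a*c + b from its first two operands; since
multiplication commutes, swapping them never changes the result (a reasoning
sketch only, not code from this change):

    // fmadd-style semantics: a*c + b == c*a + b
    double fmadd_equiv(double a, double c, double b) {
      return a * c + b;
    }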
Before this change, we had the following commutative instructions:
RLDIMI
RLDIMIo
RLWIMI
RLWIMI8
RLWIMI8o
RLWIMIo
XSADDDP
XSMULDP
XVADDDP
XVADDSP
XVMULDP
XVMULSP
After:
ADD4
ADD4o
ADD8
ADD8o
ADDC
ADDC8
ADDC8o
ADDCo
ADDE
ADDE8
ADDE8o
ADDEo
AND
AND8
AND8o
ANDo
CRAND
CREQV
CRNAND
CRNOR
CROR
CRXOR
EQV
EQV8
EQV8o
EQVo
FADD
FADDS
FADDSo
FADDo
FMADD
FMADDS
FMADDSo
FMADDo
FMSUB
FMSUBS
FMSUBSo
FMSUBo
FMUL
FMULS
FMULSo
FMULo
FNMADD
FNMADDS
FNMADDSo
FNMADDo
FNMSUB
FNMSUBS
FNMSUBSo
FNMSUBo
MULHD
MULHDU
MULHDUo
MULHDo
MULHW
MULHWU
MULHWUo
MULHWo
MULLD
MULLDo
MULLW
MULLWo
NAND
NAND8
NAND8o
NANDo
NOR
NOR8
NOR8o
NORo
OR
OR8
OR8o
ORo
RLDIMI
RLDIMIo
RLWIMI
RLWIMI8
RLWIMI8o
RLWIMIo
VADDCUW
VADDFP
VADDSBS
VADDSHS
VADDSWS
VADDUBM
VADDUBS
VADDUHM
VADDUHS
VADDUWM
VADDUWS
VAND
VAVGSB
VAVGSH
VAVGSW
VAVGUB
VAVGUH
VAVGUW
VMADDFP
VMAXFP
VMAXSB
VMAXSH
VMAXSW
VMAXUB
VMAXUH
VMAXUW
VMHADDSHS
VMHRADDSHS
VMINFP
VMINSB
VMINSH
VMINSW
VMINUB
VMINUH
VMINUW
VMLADDUHM
VMULESB
VMULESH
VMULEUB
VMULEUH
VMULOSB
VMULOSH
VMULOUB
VMULOUH
VNMSUBFP
VOR
VXOR
XOR
XOR8
XOR8o
XORo
XSADDDP
XSMADDADP
XSMAXDP
XSMINDP
XSMSUBADP
XSMULDP
XSNMADDADP
XSNMSUBADP
XVADDDP
XVADDSP
XVMADDADP
XVMADDASP
XVMAXDP
XVMAXSP
XVMINDP
XVMINSP
XVMSUBADP
XVMSUBASP
XVMULDP
XVMULSP
XVNMADDADP
XVNMADDASP
XVNMSUBADP
XVNMSUBASP
XXLAND
XXLNOR
XXLOR
XXLXOR
This is a by-inspection change, and I'm not sure how to write a reliable test
case; I would welcome advice on this.
llvm-svn: 204609
VSX is an ISA extension supported on the POWER7 and later cores that enhances
floating-point vector and scalar capabilities. Among other things, this adds
<2 x double> support and generally helps to reduce register pressure.
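As a small illustration of the new capability (using the GCC/Clang vector
extension; not a test from this commit):

    typedef double v2df __attribute__((vector_size(16)));

    // With VSX this can select xvmuldp/xvadddp (or a fused xvmaddadp)
    // operating on the 128-bit VSX registers.
    v2df axpy(v2df a, v2df x, v2df y) {
      return a * x + y;
    }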
The interesting part of this ISA feature is the register configuration: there
are 64 new 128-bit vector registers, the first 32 of which are super-registers
of the existing 32 scalar floating-point registers, and the second 32 of which
overlap with the 32 Altivec vector registers. This makes things like vector
insertion
and extraction tricky: this can be free but only if we force a restriction to
the right register subclass when needed. A new "minipass" PPCVSXCopy takes care
of this (although it could do a more-optimal job of it; see the comment about
unnecessary copies below).
Please note that, currently, VSX is not enabled by default when targeting
anything because it is not yet ready for that. The assembler and disassembler
are fully implemented and tested. However:
- CodeGen support causes miscompiles; test-suite runtime failures:
    MultiSource/Benchmarks/FreeBench/distray/distray
    MultiSource/Benchmarks/McCat/08-main/main
    MultiSource/Benchmarks/Olden/voronoi/voronoi
    MultiSource/Benchmarks/mafft/pairlocalalign
    MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
    SingleSource/Benchmarks/CoyoteBench/almabench
    SingleSource/Benchmarks/Misc/matmul_f64_4x4
- The lowering currently falls back to using Altivec instructions far more
than it should. Worse, there are some things that are scalarized through the
stack that shouldn't be.
- A lot of unnecessary copies make it past the optimizers, and this needs to
be fixed.
- Many more regression tests are needed.
Normally, I'd fix these things prior to committing, but there are some
students and other contributors who would like to work on this, and so it makes
sense to move this development process upstream where it can be subject to the
regular code-review procedures.
llvm-svn: 203768
This change enables tracking i1 values in the PowerPC backend using the
condition register bits. These bits can be treated on PowerPC as separate
registers; individual bit operations (and, or, xor, etc.) are supported.
Tracking booleans in CR bits has several advantages:
- Reduction in register pressure (because we no longer need GPRs to store
boolean values).
- Logical operations on booleans can be handled more efficiently; we used to
have to move all results from comparisons into GPRs, perform promoted
logical operations in GPRs, and then move the result back into condition
register bits to be used by conditional branches. This can be very
inefficient, because these CR <-> GPR moves have high latency and low
throughput (especially when other associated instructions are accounted
for).
- On the POWER7 and similar cores, we can increase total throughput by using
the CR bits. CR bit operations have a dedicated functional unit.
Most of this is more-or-less mechanical: Adjustments were needed in the
calling-convention code, support was added for spilling/restoring individual
condition-register bits, and conditional branch instruction definitions taking
specific CR bits were added (plus patterns and code for generating bit-level
operations).
This is enabled by default when running at -O2 and higher. For -O0 and -O1,
where the ability to debug is more important, this feature is disabled by
default. Individual CR bits do not have assigned DWARF register numbers,
and storing values in CR bits makes them invisible to the debugger.
It is critical, however, that we don't move i1 values that have been promoted
to larger values (such as those passed as function arguments) into bit
registers only to quickly turn around and move the values back into GPRs (such
as happens when values are returned by functions). A pair of target-specific
DAG combines are added to remove the trunc/extends in:
trunc(binary-ops(binary-ops(zext(x), zext(y)), ...))
and:
zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...))
In short, we only want to use CR bits where some of the i1 values come from
comparisons or are used by conditional branches or selects. To put it another
way, if we can do the entire i1 computation in GPRs, then we probably should
(on the POWER7, the GPR-operation throughput is higher, and for all cores, the
CR <-> GPR moves are expensive).
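A sketch of the profitable case (illustrative, not from the commit): both i1
values below come from comparisons and feed a single boolean use, so the whole
computation can stay in CR bits (two compares combined with a crand) with no
CR <-> GPR moves:

    bool in_range(int lo, int hi, int x) {
      return lo <= x && x < hi;
    }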
POWER7 test-suite performance results (from 10 runs in each configuration):
SingleSource/Benchmarks/Misc/mandel-2: 35% speedup
MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup
SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup
SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup
SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup
SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown
MultiSource/Applications/lemon/lemon: 8% slowdown
llvm-svn: 202451
My understanding (from reading just the LLVM code) is that:
* most PPC CPUs have a "sync n" instruction, plus an msync alias for "sync 0".
* "Book E" CPUs instead have an msync instruction, and not the more
general "sync n".
This patch reflects that in the .td files, allowing a single codepath for
the asm and obj streamers, and incidentally fixes a crash when EmitRawText
was called on an obj streamer.
llvm-svn: 199832
Several of the 64-bit fixed-point instructions with immediate operands were
using the 32-bit (i32) operand nodes instead of the corresponding 64-bit (i64)
operand definitions (u16imm instead of u16imm64, for example).
This error has had no effect so far, but would have caused type-checking
violations with an upcoming change.
llvm-svn: 198356
The tests for the disassembler were adapted from the encoder tests, and for the
most part, the output from the disassembler matches the encoder-test inputs.
There are some places where more-informative mnemonics could be produced
(notably for the branch instructions), and those cases are noted in the tests
with FIXMEs.
Future work includes:
- Generating more-informative mnemonics when possible (this may also be done
in the printer).
- Removing the dependence on positional "numbered" operand-to-variable mapping
(for both encoding and decoding).
- Internally using 64-bit instruction variants in 64-bit mode (if this turns
out to matter).
llvm-svn: 197693