Currently, pre-increment store patterns are written to use two separate
operands to represent address base and displacement:
stwu $rS, $ptroff($ptrreg)
This causes problems when implementing the assembler parser, so this
commit changes the patterns to use standard (complex) memory operands
like in all other memory access instruction patterns:
stwu $rS, $dst
To still match those instructions against the appropriate pre_store
SelectionDAG nodes, the patch uses the new feature that allows a Pat
to match multiple DAG operands against a single (complex) instruction
operand.
Approved by Hal Finkel.
llvm-svn: 177429
Currently the PPC r0 register is unconditionally reserved. There are two reasons
for this:
1. r0 is treated specially (as the constant 0) by certain instructions, and so
cannot be used with those instructions as a regular register.
2. r0 is used as a temporary register in the CR-register spilling process
(where, under some circumstances, we require two GPRs).
This change addresses the first reason by introducing a restricted register
class (without r0) for use by those instructions that treat r0 specially. These
register classes have a new pseudo-register, ZERO, which represents the r0-as-0
use. This has the side benefit of making the existing target code simpler (and
easier to understand), and will make it clear to the register allocator that
uses of r0 as 0 don't conflict will real uses of the r0 register.
Once the CR spilling code is improved, we'll be able to allocate r0.
Adding these extra register classes, for some reason unclear to me, causes
requests to the target to copy 32-bit registers to 64-bit registers. The
resulting code seems correct (and causes no test-suite failures), and the new
test case covers this new kind of asymmetric copy.
As r0 is still reserved, no functionality change intended.
llvm-svn: 177423
PPC64 supports unaligned loads and stores of 64-bit values, but
in order to use the r+i forms, the offset must be a multiple of 4.
Unfortunately, this cannot always be determined by examining the
immediate itself because it might be available only via a TOC entry.
In order to get around this issue, we additionally predicate the
selection of the r+i form on the alignment of the load or store
(forcing it to be at least 4 in order to select the r+i form).
llvm-svn: 177338
Large code model is identical to medium code model except that the
addis/addi sequence for "local" accesses is never used. All accesses
use the addis/ld sequence.
The coding changes are straightforward; most of the patch is taken up
with creating variants of the medium model tests for large model.
llvm-svn: 175767
for a wider range of GOT entries that can hold thread-relative offsets.
This matches the behavior of GCC, which was not documented in the PPC64 TLS
ABI. The ABI will be updated with the new code sequence.
Former sequence:
ld 9,x@got@tprel(2)
add 9,9,x@tls
New sequence:
addis 9,2,x@got@tprel@ha
ld 9,x@got@tprel@l(9)
add 9,9,x@tls
Note that a linker optimization exists to transform the new sequence into
the shorter sequence when appropriate, by replacing the addis with a nop
and modifying the base register and relocation type of the ld.
llvm-svn: 170209
PowerPC target. This is the last of the four models, so we now have
full TLS support.
This is mostly a straightforward extension of the general dynamic model.
I had to use an additional Chain operand to tie ADDIS_DTPREL_HA to the
register copy following ADDI_TLSLD_L; otherwise everything above the
ADDIS_DTPREL_HA appeared dead and was removed.
As before, there are new test cases to test the assembly generation, and
the relocations output during integrated assembly. The expected code
gen sequence can be read in test/CodeGen/PowerPC/tls-ld.ll.
There are a couple of things I think can be done more efficiently in the
overall TLS code, so there will likely be a clean-up patch forthcoming;
but for now I want to be sure the functionality is in place.
Bill
llvm-svn: 170003
Given a thread-local symbol x with global-dynamic access, the generated
code to obtain x's address is:
Instruction Relocation Symbol
addis ra,r2,x@got@tlsgd@ha R_PPC64_GOT_TLSGD16_HA x
addi r3,ra,x@got@tlsgd@l R_PPC64_GOT_TLSGD16_L x
bl __tls_get_addr(x@tlsgd) R_PPC64_TLSGD x
R_PPC64_REL24 __tls_get_addr
nop
<use address in r3>
The implementation borrows from the medium code model work for introducing
special forms of ADDIS and ADDI into the DAG representation. This is made
slightly more complicated by having to introduce a call to the external
function __tls_get_addr. Using the full call machinery is overkill and,
more importantly, makes it difficult to add a special relocation. So I've
introduced another opcode GET_TLS_ADDR to represent the function call, and
surrounded it with register copies to set up the parameter and return value.
Most of the code is pretty straightforward. I ran into one peculiarity
when I introduced a new PPC opcode BL8_NOP_ELF_TLSGD, which is just like
BL8_NOP_ELF except that it takes another parameter to represent the symbol
("x" above) that requires a relocation on the call. Something in the
TblGen machinery causes BL8_NOP_ELF and BL8_NOP_ELF_TLSGD to be treated
identically during the emit phase, so this second operand was never
visited to generate relocations. This is the reason for the slightly
messy workaround in PPCMCCodeEmitter.cpp:getDirectBrEncoding().
Two new tests are included to demonstrate correct external assembly and
correct generation of relocations using the integrated assembler.
Comments welcome!
Thanks,
Bill
llvm-svn: 169910
on 64-bit PowerPC ELF.
The patch includes code to handle external assembly and MC output with the
integrated assembler. It intentionally does not support the "old" JIT.
For the initial-exec TLS model, the ABI requires the following to calculate
the address of external thread-local variable x:
Code sequence Relocation Symbol
ld 9,x@got@tprel(2) R_PPC64_GOT_TPREL16_DS x
add 9,9,x@tls R_PPC64_TLS x
The register 9 is arbitrary here. The linker will replace x@got@tprel
with the offset relative to the thread pointer to the generated GOT
entry for symbol x. It will replace x@tls with the thread-pointer
register (13).
The two test cases verify correct assembly output and relocation output
as just described.
PowerPC-specific selection node variants are added for the two
instructions above: LD_GOT_TPREL and ADD_TLS. These are inserted
when an initial-exec global variable is encountered by
PPCTargetLowering::LowerGlobalTLSAddress(), and later lowered to
machine instructions LDgotTPREL and ADD8TLS. LDgotTPREL is a pseudo
that uses the same LDrs support added for medium code model's LDtocL,
with a different relocation type.
The rest of the processing is straightforward.
llvm-svn: 169281
The default for 64-bit PowerPC is small code model, in which TOC entries
must be addressable using a 16-bit offset from the TOC pointer. Additionally,
only TOC entries are addressed via the TOC pointer.
With medium code model, TOC entries and data sections can all be addressed
via the TOC pointer using a 32-bit offset. Cooperation with the linker
allows 16-bit offsets to be used when these are sufficient, reducing the
number of extra instructions that need to be executed. Medium code model
also does not generate explicit TOC entries in ".section toc" for variables
that are wholly internal to the compilation unit.
Consider a load of an external 4-byte integer. With small code model, the
compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
With medium model, it instead generates:
addis 3, 2, .LC1@toc@ha
ld 3, .LC1@toc@l(3)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
Here .LC1@toc@ha is a relocation requesting the upper 16 bits of the
32-bit offset of ei's TOC entry from the TOC base pointer. Similarly,
.LC1@toc@l is a relocation requesting the lower 16 bits. Note that if
the linker determines that ei's TOC entry is within a 16-bit offset of
the TOC base pointer, it will replace the "addis" with a "nop", and
replace the "ld" with the identical "ld" instruction from the small
code model example.
Consider next a load of a function-scope static integer. For small code
model, the compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc test_fn_static.si[TC],test_fn_static.si
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
For medium code model, the compiler generates:
addis 3, 2, test_fn_static.si@toc@ha
addi 3, 3, test_fn_static.si@toc@l
lwz 4, 0(3)
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
Again, the linker may replace the "addis" with a "nop", calculating only
a 16-bit offset when this is sufficient.
Note that it would be more efficient for the compiler to generate:
addis 3, 2, test_fn_static.si@toc@ha
lwz 4, test_fn_static.si@toc@l(3)
The current patch does not perform this optimization yet. This will be
addressed as a peephole optimization in a later patch.
For the moment, the default code model for 64-bit PowerPC will remain the
small code model. We plan to eventually change the default to medium code
model, which matches current upstream GCC behavior. Note that the different
code models are ABI-compatible, so code compiled with different models will
be linked and execute correctly.
I've tested the regression suite and the application/benchmark test suite in
two ways: Once with the patch as submitted here, and once with additional
logic to force medium code model as the default. The tests all compile
cleanly, with one exception. The mandel-2 application test fails due to an
unrelated ABI compatibility with passing complex numbers. It just so happens
that small code model was incredibly lucky, in that temporary values in
floating-point registers held the expected values needed by the external
library routine that was called incorrectly. My current thought is to correct
the ABI problems with _Complex before making medium code model the default,
to avoid introducing this "regression."
Here are a few comments on how the patch works, since the selection code
can be difficult to follow:
The existing logic for small code model defines three pseudo-instructions:
LDtoc for most uses, LDtocJTI for jump table addresses, and LDtocCPT for
constant pool addresses. These are expanded by SelectCodeCommon(). The
pseudo-instruction approach doesn't work for medium code model, because
we need to generate two instructions when we match the same pattern.
Instead, new logic in PPCDAGToDAGISel::Select() intercepts the TOC_ENTRY
node for medium code model, and generates an ADDIStocHA followed by either
a LDtocL or an ADDItocL. These new node types correspond naturally to
the sequences described above.
The addis/ld sequence is generated for the following cases:
* Jump table addresses
* Function addresses
* External global variables
* Tentative definitions of global variables (common linkage)
The addis/addi sequence is generated for the following cases:
* Constant pool entries
* File-scope static global variables
* Function-scope static variables
Expanding to the two-instruction sequences at select time exposes the
instructions to subsequent optimization, particularly scheduling.
The rest of the processing occurs at assembly time, in
PPCAsmPrinter::EmitInstruction. Each of the instructions is converted to
a "real" PowerPC instruction. When a TOC entry needs to be created, this
is done here in the same manner as for the existing LDtoc, LDtocJTI, and
LDtocCPT pseudo-instructions (I factored out a new routine to handle this).
I had originally thought that if a TOC entry was needed for LDtocL or
ADDItocL, it would already have been generated for the previous ADDIStocHA.
However, at higher optimization levels, the ADDIStocHA may appear in a
different block, which may be assembled textually following the block
containing the LDtocL or ADDItocL. So it is necessary to include the
possibility of creating a new TOC entry for those two instructions.
Note that for LDtocL, we generate a new form of LD called LDrs. This
allows specifying the @toc@l relocation for the offset field of the LD
instruction (i.e., the offset is replaced by a SymbolLo relocation).
When the peephole optimization described above is added, we will need
to do similar things for all immediate-form load and store operations.
The seven "mcm-n.ll" test cases are kept separate because otherwise the
intermingling of various TOC entries and so forth makes the tests fragile
and hard to understand.
The above assumes use of an external assembler. For use of the
integrated assembler, new relocations are added and used by
PPCELFObjectWriter. Testing is done with "mcm-obj.ll", which tests for
proper generation of the various relocations for the same sequences
tested with the external assembler.
llvm-svn: 168708
"Instruction 'foo' has no tokens" errors during llvm-tblgen
-gen-asm-matcher attempts. At this time, the added
tokens are "#comment" style rather than the actual mnemonic. This will
be revisited once the rest of the base asmparser bits get straightened
out for ppc64-elf-linux.
llvm-svn: 165237
Slight reorganisation of PPC instruction classes for scheduling. No
functionality change for existing subtargets.
- Clearly separate load/store-with-update instructions from regular loads and stores.
- Split IntRotateD -> IntRotateD and IntRotateDI
- Split out fsub and fadd from FPGeneral -> FPAddSub
- Update existing itineraries
Patch by Tobias von Koch.
llvm-svn: 162729
Allow load-immediates to be rematerialised in the register coalescer for
PPC. This makes test/CodeGen/PowerPC/big-endian-formal-args.ll fail,
because it relies on a register move getting emitted. The immediate load is
equivalent, so change this test case.
Patch by Tobias von Koch.
llvm-svn: 162727
The 32-bit ABI requires CR bit 6 to be set if the call has fp arguments and
unset if it doesn't. The solution up to now was to insert a MachineNode to
set/unset the CR bit, which produces a CR vreg. This vreg was then copied
into CR bit 6. When the register allocator saw a bunch of these in the same
function, it allocated the set/unset CR bit in some random CR register (1
extra instruction) and then emitted CR moves before every vararg function
call, rather than just setting and unsetting CR bit 6 directly before every
vararg function call. This patch instead inserts a PPCcrset/PPCcrunset
instruction which are then matched by a dedicated instruction pattern.
Patch by Tobias von Koch.
llvm-svn: 162725
Call instructions are no longer required to be variadic, and
variable_ops should only be used for instructions that encode a variable
number of arguments, like the ARM stm/ldm instructions.
llvm-svn: 160189
This patch adds DAG combines to form FMAs from pairs of FADD + FMUL or
FSUB + FMUL. The combines are performed when:
(a) Either
AllowExcessFPPrecision option (-enable-excess-fp-precision for llc)
OR
UnsafeFPMath option (-enable-unsafe-fp-math)
are set, and
(b) TargetLoweringInfo::isFMAFasterThanMulAndAdd(VT) is true for the type of
the FADD/FSUB, and
(c) The FMUL only has one user (the FADD/FSUB).
If your target has fast FMA instructions you can make use of these combines by
overriding TargetLoweringInfo::isFMAFasterThanMulAndAdd(VT) to return true for
types supported by your FMA instruction, and adding patterns to match ISD::FMA
to your FMA instructions.
llvm-svn: 158757
This pass is derived from the Hexagon HardwareLoops pass. The only significant enhancement over the Hexagon
pass is that PPCCTRLoops will also attempt to delete the replaced add and compare operations if they are
no longer otherwise used. Also, invalid preheader DebugLoc is not used.
llvm-svn: 158204
Loads and stores can have different pipeline behavior, especially on
embedded chips. This change allows those differences to be expressed.
Except for the 440 scheduler, there are no functionality changes.
On the 440, the latency adjustment is only by one cycle, and so this
probably does not affect much. Nevertheless, it will make a larger
difference in the future and this removes a FIXME from the 440 itin.
llvm-svn: 153821
Dynamic linking on PPC64 has had problems since we had to move the top-down
hazard-detection logic post-ra. For dynamic linking to work there needs to be
a nop placed after every call. It turns out that it is really hard to guarantee
that nothing will be placed in between the call (bl) and the nop during post-ra
scheduling. Previous attempts at fixing this by placing logic inside the
hazard detector only partially worked.
This is now fixed in a different way: call+nop codegen-only instructions. As far
as CodeGen is concerned the pair is now a single instruction and cannot be split.
This solution works much better than previous attempts.
The scoreboard hazard detector is also renamed to be more generic, there is currently
no cpu-specific logic in it.
llvm-svn: 153816
change, now you need a TargetOptions object to create a TargetMachine. Clang
patch to follow.
One small functionality change in PTX. PTX had commented out the machine
verifier parts in their copy of printAndVerify. That now calls the version in
LLVMTargetMachine. Users of PTX who need verification disabled should rely on
not passing the command-line flag to enable it.
llvm-svn: 145714
piclabel operand. The operand in the tablegen definition doesn't actually turn
into an MI operand, so it just confuses anything checking the TargetInstrDesc
for the number of operands. It suffices to just have an implicit def of LR.
llvm-svn: 131626
into the immediate field. This allows us to encode stuff like this:
lbz r3, lo16(__ZL4init)(r4) ; globalopt.cpp:5
; encoding: [0x88,0x64,A,A]
; fixup A - offset: 0, value: lo16(__ZL4init), kind: fixup_ppc_lo16
stw r3, lo16(__ZL1s)(r5) ; globalopt.cpp:6
; encoding: [0x90,0x65,A,A]
; fixup A - offset: 0, value: lo16(__ZL1s), kind: fixup_ppc_lo16
With this, we should have a completely function MCCodeEmitter for PPC, wewt.
llvm-svn: 119134
modes. For example, we now get:
ld r3, lo16(_G)(r3) ; encoding: [0xe8,0x63,A,0bAAAAAA00]
; fixup A - offset: 0, value: lo16(_G), kind: fixup_ppc_lo14
llvm-svn: 119133