//===- PPCCallingConv.td - Calling Conventions for PowerPC -*- tablegen -*-===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This describes the calling conventions for the PowerPC 32- and 64-bit
// architectures.
//
//===----------------------------------------------------------------------===//

/// CCIfSubtarget - Match if the current subtarget has a feature F.
class CCIfSubtarget<string F, CCAction A>
 : CCIf<!strconcat("static_cast<const PPCSubtarget&>"
                   "(State.getMachineFunction().getSubtarget()).",
                   F),
        A>;
|
Add CR-bit tracking to the PowerPC backend for i1 values
This change enables tracking i1 values in the PowerPC backend using the
condition register bits. These bits can be treated on PowerPC as separate
registers; individual bit operations (and, or, xor, etc.) are supported.
Tracking booleans in CR bits has several advantages:
- Reduction in register pressure (because we no longer need GPRs to store
boolean values).
- Logical operations on booleans can be handled more efficiently; we used to
have to move all results from comparisons into GPRs, perform promoted
logical operations in GPRs, and then move the result back into condition
register bits to be used by conditional branches. This can be very
inefficient, because the throughput of these CR <-> GPR moves have high
latency and low throughput (especially when other associated instructions
are accounted for).
- On the POWER7 and similar cores, we can increase total throughput by using
the CR bits. CR bit operations have a dedicated functional unit.
Most of this is more-or-less mechanical: Adjustments were needed in the
calling-convention code, support was added for spilling/restoring individual
condition-register bits, and conditional branch instruction definitions taking
specific CR bits were added (plus patterns and code for generating bit-level
operations).
This is enabled by default when running at -O2 and higher. For -O0 and -O1,
where the ability to debug is more important, this feature is disabled by
default. Individual CR bits do not have assigned DWARF register numbers,
and storing values in CR bits makes them invisible to the debugger.
It is critical, however, that we don't move i1 values that have been promoted
to larger values (such as those passed as function arguments) into bit
registers only to quickly turn around and move the values back into GPRs (such
as happens when values are returned by functions). A pair of target-specific
DAG combines are added to remove the trunc/extends in:
trunc(binary-ops(binary-ops(zext(x), zext(y)), ...)
and:
zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)
In short, we only want to use CR bits where some of the i1 values come from
comparisons or are used by conditional branches or selects. To put it another
way, if we can do the entire i1 computation in GPRs, then we probably should
(on the POWER7, the GPR-operation throughput is higher, and for all cores, the
CR <-> GPR moves are expensive).
POWER7 test-suite performance results (from 10 runs in each configuration):
SingleSource/Benchmarks/Misc/mandel-2: 35% speedup
MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup
SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup
SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup
SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup
SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown
MultiSource/Applications/lemon/lemon: 8% slowdown
llvm-svn: 202451
2014-02-28 08:27:01 +08:00
|
|
|
class CCIfNotSubtarget<string F, CCAction A>
 : CCIf<!strconcat("!static_cast<const PPCSubtarget&>"
                   "(State.getMachineFunction().getSubtarget()).",
                   F),
        A>;
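
// For illustration: TableGen splices the strings above into the predicate of
// the C++ calling-convention function it generates. A hypothetical
// instantiation such as CCIfSubtarget<"hasAltivec()", CCAssignToReg<[V2]>>
// therefore guards its action with code roughly equivalent to:
//
//   if (static_cast<const PPCSubtarget&>(
//           State.getMachineFunction().getSubtarget()).hasAltivec()) {
//     // ...try to assign V2 and record the allocation in State...
//   }
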
//===----------------------------------------------------------------------===//
// Return Value Calling Convention
//===----------------------------------------------------------------------===//

// PPC64 AnyReg return-value convention. No explicit register is specified for
// the return value. The register allocator is allowed and expected to choose
// any free register.
//
// This calling convention is currently only supported by the stackmap and
// patchpoint intrinsics. All other uses will result in an assert on Debug
// builds. On Release builds we fall back to the PPC C calling convention.
def RetCC_PPC64_AnyReg : CallingConv<[
  CCCustom<"CC_PPC_AnyReg_Error">
]>;
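
// CC_PPC_AnyReg_Error is implemented in C++ in the PPC backend sources. As a
// minimal sketch (illustrative, not the authoritative definition), a custom
// handler with the standard CCCustomFn signature that rejects every use looks
// roughly like:
//
//   static bool CC_PPC_AnyReg_Error(unsigned &ValNo, MVT &ValVT, MVT &LocVT,
//                                   CCValAssign::LocInfo &LocInfo,
//                                   ISD::ArgFlagsTy &ArgFlags,
//                                   CCState &State) {
//     report_fatal_error("The AnyReg calling convention is only supported "
//                        "by the stackmap and patchpoint intrinsics.");
//   }
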
// Return-value convention for PowerPC
def RetCC_PPC : CallingConv<[
  CCIfCC<"CallingConv::AnyReg", CCDelegateTo<RetCC_PPC64_AnyReg>>,

  // On PPC64, integer return values are always promoted to i64.
  CCIfType<[i32, i1], CCIfSubtarget<"isPPC64()", CCPromoteToType<i64>>>,
  CCIfType<[i1], CCIfNotSubtarget<"isPPC64()", CCPromoteToType<i32>>>,

  CCIfType<[i32], CCAssignToReg<[R3, R4, R5, R6, R7, R8, R9, R10]>>,
  CCIfType<[i64], CCAssignToReg<[X3, X4, X5, X6]>>,
  CCIfType<[i128], CCAssignToReg<[X3, X4, X5, X6]>>,

  // Floating point types returned as "direct" go into F1 .. F8; note that
  // only the ELFv2 ABI fully utilizes all these registers.
  CCIfType<[f32], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>,
  CCIfType<[f64], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>,

  // QPX vectors are returned in QF1 and QF2.
  CCIfType<[v4f64, v4f32, v4i1],
           CCIfSubtarget<"hasQPX()", CCAssignToReg<[QF1, QF2]>>>,

  // Vector types returned as "direct" go into V2 .. V9; note that only the
  // ELFv2 ABI fully utilizes all these registers.
  CCIfType<[v16i8, v8i16, v4i32, v2i64, v1i128, v4f32],
           CCIfSubtarget<"hasAltivec()",
                         CCAssignToReg<[V2, V3, V4, V5, V6, V7, V8, V9]>>>,
  CCIfType<[v2f64, v2i64], CCIfSubtarget<"hasVSX()",
           CCAssignToReg<[VSH2, VSH3, VSH4, VSH5, VSH6, VSH7, VSH8, VSH9]>>>
]>;
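
// Worked example (hypothetical, following the table above): on a PPC64
// subtarget, an i32 return value is first promoted to i64 and assigned X3;
// an f64 return lands in F1; and, under the ELFv2 ABI, a homogeneous
// aggregate of four doubles returned "direct" occupies F1 through F4.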

// No explicit register is specified for the AnyReg calling convention. The
// register allocator may assign the arguments to any free register.
//
// This calling convention is currently only supported by the stackmap and
// patchpoint intrinsics. All other uses will result in an assert on Debug
// builds. On Release builds we fall back to the PPC C calling convention.
def CC_PPC64_AnyReg : CallingConv<[
  CCCustom<"CC_PPC_AnyReg_Error">
]>;

// Note that we don't currently have calling conventions for 64-bit PowerPC,
// but handle all the complexities of the ABI in the lowering logic. FIXME:
// See if the logic can be simplified with use of CCs. This may require some
// extensions to current table generation.

// Simple calling convention for 64-bit ELF PowerPC fast isel.
// Only handle ints and floats. All ints are promoted to i64.
// Vector types and quadword ints are not handled.
def CC_PPC64_ELF_FIS : CallingConv<[
  CCIfCC<"CallingConv::AnyReg", CCDelegateTo<CC_PPC64_AnyReg>>,

  CCIfType<[i1], CCPromoteToType<i64>>,
  CCIfType<[i8], CCPromoteToType<i64>>,
  CCIfType<[i16], CCPromoteToType<i64>>,
  CCIfType<[i32], CCPromoteToType<i64>>,
  CCIfType<[i64], CCAssignToReg<[X3, X4, X5, X6, X7, X8, X9, X10]>>,
  CCIfType<[f32, f64], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>
]>;
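
// Worked example (hypothetical signature, as computed by the table above):
// for a call foo(i16 %a, double %x, double %y), fast isel promotes %a to i64
// and assigns it X3, while %x and %y receive F1 and F2.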

// Simple return-value convention for 64-bit ELF PowerPC fast isel.
// All small ints are promoted to i64. Vector types, quadword ints,
// and multiple register returns are "supported" to avoid compile
// errors, but none are handled by the fast selector.
def RetCC_PPC64_ELF_FIS : CallingConv<[
  CCIfCC<"CallingConv::AnyReg", CCDelegateTo<RetCC_PPC64_AnyReg>>,

  CCIfType<[i1], CCPromoteToType<i64>>,
  CCIfType<[i8], CCPromoteToType<i64>>,
  CCIfType<[i16], CCPromoteToType<i64>>,
  CCIfType<[i32], CCPromoteToType<i64>>,
  CCIfType<[i64], CCAssignToReg<[X3, X4]>>,
  CCIfType<[i128], CCAssignToReg<[X3, X4, X5, X6]>>,
  CCIfType<[f32], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>,
  CCIfType<[f64], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>,
  CCIfType<[v4f64, v4f32, v4i1],
           CCIfSubtarget<"hasQPX()", CCAssignToReg<[QF1, QF2]>>>,
  CCIfType<[v16i8, v8i16, v4i32, v2i64, v1i128, v4f32],
           CCIfSubtarget<"hasAltivec()",
                         CCAssignToReg<[V2, V3, V4, V5, V6, V7, V8, V9]>>>,
  CCIfType<[v2f64, v2i64], CCIfSubtarget<"hasVSX()",
           CCAssignToReg<[VSH2, VSH3, VSH4, VSH5, VSH6, VSH7, VSH8, VSH9]>>>
]>;

//===----------------------------------------------------------------------===//
// PowerPC System V Release 4 32-bit ABI
//===----------------------------------------------------------------------===//

def CC_PPC32_SVR4_Common : CallingConv<[
  CCIfType<[i1], CCPromoteToType<i32>>,

  // The ABI requires i64 to be passed in two adjacent registers with the first
  // register having an odd register number: R3/R4, R5/R6, R7/R8, or R9/R10.
  // The custom handler skips a GPR when necessary to restore that pairing.
  CCIfType<[i32], CCIfSplit<CCCustom<"CC_PPC32_SVR4_Custom_AlignArgRegs">>>,

  // The first 8 integer arguments are passed in integer registers.
  CCIfType<[i32], CCAssignToReg<[R3, R4, R5, R6, R7, R8, R9, R10]>>,

  // Make sure the i64 words from a long double are either both passed in
  // registers or both passed on the stack.
  CCIfType<[f64], CCIfSplit<CCCustom<"CC_PPC32_SVR4_Custom_AlignFPArgRegs">>>,

  // FP values are passed in F1 - F8.
  CCIfType<[f32, f64], CCAssignToReg<[F1, F2, F3, F4, F5, F6, F7, F8]>>,

  // Split arguments have an alignment of 8 bytes on the stack.
  CCIfType<[i32], CCIfSplit<CCAssignToStack<4, 8>>>,

  CCIfType<[i32], CCAssignToStack<4, 4>>,

  // Floats are stored in double-precision format; thus they have the same
  // alignment and size as doubles.
  CCIfType<[f32, f64], CCAssignToStack<8, 8>>,

  // QPX vectors that are stored in double precision need 32-byte alignment.
  CCIfType<[v4f64, v4i1], CCAssignToStack<32, 32>>,

  // Vectors get 16-byte stack slots that are 16-byte aligned.
  CCIfType<[v16i8, v8i16, v4i32, v4f32, v2f64, v2i64], CCAssignToStack<16, 16>>
]>;
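
// A minimal sketch of the kind of logic CC_PPC32_SVR4_Custom_AlignArgRegs
// implements (the authoritative definition lives in the C++ sources; details
// here are illustrative only): when the next free GPR would give an i64 pair
// an even-numbered first register, burn one register so the pair starts at an
// odd-numbered one.
//
//   static bool CC_PPC32_SVR4_Custom_AlignArgRegs(
//       unsigned &ValNo, MVT &ValVT, MVT &LocVT,
//       CCValAssign::LocInfo &LocInfo, ISD::ArgFlagsTy &ArgFlags,
//       CCState &State) {
//     static const MCPhysReg ArgRegs[] = { PPC::R3, PPC::R4, PPC::R5,
//                                          PPC::R6, PPC::R7, PPC::R8,
//                                          PPC::R9, PPC::R10 };
//     unsigned RegNum = State.getFirstUnallocated(ArgRegs);
//     // Index 0 is R3 (odd register number), so an odd index means the next
//     // free register is even-numbered; skip it to realign the pair.
//     if (RegNum != array_lengthof(ArgRegs) && RegNum % 2 == 1)
//       State.AllocateReg(ArgRegs[RegNum]);
//     // Return false so the two i32 halves are still assigned normally.
//     return false;
//   }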

// This calling convention always puts vector arguments on the stack. It is
// used to assign vector arguments which belong to the variable portion of
// the parameter list of a variable-argument function.
def CC_PPC32_SVR4_VarArg : CallingConv<[
  CCDelegateTo<CC_PPC32_SVR4_Common>
]>;

// In contrast to CC_PPC32_SVR4_VarArg, this calling convention first tries to
// put vector arguments in vector registers before putting them on the stack.
def CC_PPC32_SVR4 : CallingConv<[
  // QPX vectors mirror the scalar FP convention.
  CCIfType<[v4f64, v4f32, v4i1], CCIfSubtarget<"hasQPX()",
           CCAssignToReg<[QF1, QF2, QF3, QF4, QF5, QF6, QF7, QF8]>>>,

  // The first 12 vector arguments are passed in AltiVec registers.
  CCIfType<[v16i8, v8i16, v4i32, v2i64, v1i128, v4f32],
           CCIfSubtarget<"hasAltivec()", CCAssignToReg<[V2, V3, V4, V5, V6, V7,
                                          V8, V9, V10, V11, V12, V13]>>>,
  CCIfType<[v2f64, v2i64], CCIfSubtarget<"hasVSX()",
           CCAssignToReg<[VSH2, VSH3, VSH4, VSH5, VSH6, VSH7, VSH8, VSH9,
                          VSH10, VSH11, VSH12, VSH13]>>>,

  CCDelegateTo<CC_PPC32_SVR4_Common>
]>;
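
// Worked example (hypothetical signature): for
//   void f(<4 x i32> %fixed, ...)
// called with an additional <4 x i32> in the variable portion, CC_PPC32_SVR4
// assigns %fixed to V2, while the variadic vector is routed through
// CC_PPC32_SVR4_VarArg and receives a 16-byte-aligned 16-byte stack slot.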

// Helper "calling convention" to handle aggregate by value arguments.
// Aggregate by value arguments are always placed in the local variable space
// of the caller. This calling convention is only used to assign those stack
// offsets in the caller's stack frame.
//
// Still, the address of the aggregate copy in the caller's stack frame is
// passed in a GPR (or in the parameter list area if all GPRs are allocated)
// from the caller to the callee. The location for the address argument is
// assigned by the CC_PPC32_SVR4 calling convention.
//
// The only purpose of CC_PPC32_SVR4_Custom_Dummy is to skip arguments which
// are not passed by value.

def CC_PPC32_SVR4_ByVal : CallingConv<[
  CCIfByVal<CCPassByVal<4, 4>>,

  CCCustom<"CC_PPC32_SVR4_Custom_Dummy">
]>;
|
|
|
|
|
Cleanup PPC Altivec registers in CSR lists and improve VRSAVE handling
There are a couple of (small) related changes here:
1. The printed name of the VRSAVE register has been changed from VRsave to
vrsave in order to match the name accepted by GNU binutils.
2. Support for parsing vrsave has been added to the asm parser (it seems that
there was no test case specifically covering this code, so I've added one).
3. The list of Altivec registers, which was common to all calling conventions,
has been separated out. This allows us to define the base CSR lists, and then
lists for each ABI with Altivec included. This allows SjLj, for example, to
work correctly on non-Altivec targets without using unnatural definitions of
the NoRegs CSR list.
4. VRSAVE is now always reserved on non-Darwin targets and all Altivec
registers are reserved when Altivec is disabled.
With these changes, it is now possible to compile a function containing
__builtin_unwind_init() on Linux/PPC64 with debugging information. This did not
work previously because GNU binutils assumes that all .cfi_offset offsets will
be 8-byte aligned on PPC64 (and errors out if you provide a non-8-byte-aligned
offset). This is not true for the vrsave register, however; because this
register is used only on Darwin, GCC does not bother printing a .cfi_offset
entry for it (even though there is a slot in the stack frame for it as
specified by the ABI). This change allows us to do the same: we will also not
print .cfi_offset directives for vrsave.
llvm-svn: 185409

def CSR_Altivec : CalleeSavedRegs<(add V20, V21, V22, V23, V24, V25, V26, V27,
                                       V28, V29, V30, V31)>;
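
Since V20-V31 form a contiguous run, the same list could equivalently be
written with the "sequence" operator used further down in this file; a
minimal sketch (the def name is hypothetical):

// def CSR_Altivec_Seq : CalleeSavedRegs<(add (sequence "V%u", 20, 31))>;
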
def CSR_Darwin32 : CalleeSavedRegs<(add R13, R14, R15, R16, R17, R18, R19, R20,
                                        R21, R22, R23, R24, R25, R26, R27, R28,
                                        R29, R30, R31, F14, F15, F16, F17, F18,
                                        F19, F20, F21, F22, F23, F24, F25, F26,
                                        F27, F28, F29, F30, F31, CR2, CR3, CR4
                                   )>;

def CSR_Darwin32_Altivec : CalleeSavedRegs<(add CSR_Darwin32, CSR_Altivec)>;
def CSR_SVR432 : CalleeSavedRegs<(add R14, R15, R16, R17, R18, R19, R20,
                                      R21, R22, R23, R24, R25, R26, R27, R28,
                                      R29, R30, R31, F14, F15, F16, F17, F18,
                                      F19, F20, F21, F22, F23, F24, F25, F26,
                                      F27, F28, F29, F30, F31, CR2, CR3, CR4
                                 )>;

def CSR_SVR432_Altivec : CalleeSavedRegs<(add CSR_SVR432, CSR_Altivec)>;
def CSR_Darwin64 : CalleeSavedRegs<(add X13, X14, X15, X16, X17, X18, X19, X20,
                                        X21, X22, X23, X24, X25, X26, X27, X28,
                                        X29, X30, X31, F14, F15, F16, F17, F18,
                                        F19, F20, F21, F22, F23, F24, F25, F26,
                                        F27, F28, F29, F30, F31, CR2, CR3, CR4
                                   )>;

def CSR_Darwin64_Altivec : CalleeSavedRegs<(add CSR_Darwin64, CSR_Altivec)>;
def CSR_SVR464 : CalleeSavedRegs<(add X14, X15, X16, X17, X18, X19, X20,
                                      X21, X22, X23, X24, X25, X26, X27, X28,
                                      X29, X30, X31, F14, F15, F16, F17, F18,
                                      F19, F20, F21, F22, F23, F24, F25, F26,
                                      F27, F28, F29, F30, F31, CR2, CR3, CR4
                                 )>;

def CSR_SVR464_Altivec : CalleeSavedRegs<(add CSR_SVR464, CSR_Altivec)>;
[PowerPC] Make r2 allocatable on PPC64/ELF for some leaf functions
The TOC base pointer is passed in r2, and we normally reserve this register so
that we can depend on it being there. However, for leaf functions, and
specifically those leaf functions that don't do any TOC access of their own
(which is generally due to accessing the constant pool, using TLS, etc.),
we can treat r2 as an ordinary callee-saved register (it must be callee-saved
because, for local direct calls, the linker will not insert any save/restore
code).
The allocation order has been changed slightly for PPC64/ELF systems to put r2
at the end of the list (while leaving it near the beginning for Darwin systems
to prevent unnecessary output changes). While r2 is allocatable, using it still
requires spill/restore traffic, so it comes at the end of the list.
llvm-svn: 227745
def CSR_SVR464_R2 : CalleeSavedRegs<(add CSR_SVR464, X2)>;
def CSR_SVR464_R2_Altivec : CalleeSavedRegs<(add CSR_SVR464_Altivec, X2)>;
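
The same one-register composition would extend to any of the other base lists
if an ABI ever needed it; a minimal sketch (the def name is hypothetical, and
no such list exists in this file):

// def CSR_Darwin64_R2 : CalleeSavedRegs<(add CSR_Darwin64, X2)>;
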
def CSR_NoRegs : CalleeSavedRegs<(add)>;
Revert "r225811 - Revert "r225808 - [PowerPC] Add StackMap/PatchPoint support""
This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.
Original commit message:
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use fewer than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
llvm-svn: 225909
def CSR_64_AllRegs : CalleeSavedRegs<(add X0, (sequence "X%u", 3, 10),
                                      (sequence "X%u", 14, 31),
                                      (sequence "F%u", 0, 31),
                                      (sequence "CR%u", 0, 7))>;
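
For reference, each "sequence" above is shorthand that TableGen expands into
the corresponding explicit register run; for example, (sequence "X%u", 3, 10)
expands to X3, X4, X5, X6, X7, X8, X9, X10. A minimal sketch of the expanded
form for that portion (the def name is hypothetical):

// def CSR_64_AllRegs_GPRs : CalleeSavedRegs<(add X0, X3, X4, X5, X6, X7, X8,
//                                            X9, X10, (sequence "X%u", 14, 31))>;
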
def CSR_64_AllRegs_Altivec : CalleeSavedRegs<(add CSR_64_AllRegs,
                                              (sequence "V%u", 0, 31))>;

def CSR_64_AllRegs_VSX : CalleeSavedRegs<(add CSR_64_AllRegs_Altivec,
                                          (sequence "VSL%u", 0, 31),
                                          (sequence "VSH%u", 0, 31))>;