Before this patch code wanting to create temporary labels for a given entity
(function, cu, exception range, etc) had to keep its own counter to have stable
symbol names.
createTempSymbol would still add a suffix to make sure a new symbol was always
returned, but it kept a single counter. Because of that, if we were to use
just createTempSymbol("cu_begin"), the label could change from cu_begin42 to
cu_begin43 because some other code started using temporary labels.
Simplify this by just keeping one counter per prefix and removing the various
specialized counters.
llvm-svn: 232535
This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bytes
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1> are actually also represented as floating-point values
(essentially { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirrors that supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).
I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).
The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.
llvm-svn: 230413
changes to remove non-Function based subtargets out of the asm
printer. For module level emission we'll need to construct up
an MCSubtargetInfo so that we can encode instructions for
emission.
llvm-svn: 230050
EmitFunctionStubs is called from doFinalization and so can't
depend on the Subtarget existing. It's also irrelevant as
we know we're darwin since we're in the darwin asm printer.
llvm-svn: 230039
initialization. Initialize the subtarget once per function and
migrate EmitStartOfAsmFile to either use attributes on the
TargetMachine or get information from all of the various
subtargets.
llvm-svn: 229475
See full discussion in http://reviews.llvm.org/D7491.
We now hide the add-immediate and call instructions together in a
separate pseudo-op, which is tagged to define GPR3 and clobber the
call-killed registers. The PPCTLSDynamicCall pass prior to RA now
expands this op into the two separate addi and call ops, with explicit
definitions of GPR3 on both instructions, and explicit clobbers on the
call instruction. The pass is now marked as requiring and preserving
the LiveIntervals and SlotIndexes analyses, and fixes these up after
the replacement sequences are introduced.
Self-hosting has been verified on LE P8 and BE P7 with various
optimization levels, etc. It has also been verified with the
--no-tls-optimize flag workaround removed.
llvm-svn: 228725
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.
llvm-svn: 228460
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.
In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations. This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3. This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.
In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering. This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.
However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy. The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity. It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves. MachineCSE
does not have any machinery to common idempotent calls. This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace. In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.
I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes. However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.
All of which leads to the design presented here. This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering. Virtual registers
are used until prior to register allocation. At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them. The register allocator then coalesces these copies away. With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).
One additional problem must be dealt with: After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead. Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.
One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes. Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address. This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.
Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not. I've fixed the test case to reflect this.
There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.
I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.
llvm-svn: 227976
Any code creating an MCSectionELF knows ELF and already provides the flags.
SectionKind is an abstraction used by common code that uses a plain
MCSection.
Use the flags to compute the SectionKind. This removes a lot of
guessing and boilerplate from the MCSectionELF construction.
llvm-svn: 227476
derived classes.
Since global data alignment, layout, and mangling is often based on the
DataLayout, move it to the TargetMachine. This ensures that global
data is going to be layed out and mangled consistently if the subtarget
changes on a per function basis. Prior to this all targets(*) have
had subtarget dependent code moved out and onto the TargetMachine.
*One target hasn't been migrated as part of this change: R600. The
R600 port has, as a subtarget feature, the size of pointers and
this affects global data layout. I've currently hacked in a FIXME
to enable progress, but the port needs to be updated to either pass
the 64-bitness to the TargetMachine, or fix the DataLayout to
avoid subtarget dependent features.
llvm-svn: 227113
Instructions that have high-order TOC relocations always carry R2 as their base
register, so it does not matter whether we take the register from the
instruction or just hard-code it in PPCAsmPrinter. In the future, however, we
might want to apply these relocations to instructions using a different
register, so taking the register from the instruction is a better thing to do.
No change in functionality here, however.
llvm-svn: 226403
utils/sort_includes.py.
I clearly haven't done this in a while, so more changed than usual. This
even uncovered a missing include from the InstrProf library that I've
added. No functionality changed here, just mechanical cleanup of the
include order.
llvm-svn: 225974
This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.
Original commit message:
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
llvm-svn: 225909
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
llvm-svn: 225808
Summary:
Large-model was added first. With the addition of support for multiple PIC
models in LLVM, now add small-model PIC for 32-bit PowerPC, SysV4 ABI. This
generates more optimal code, for shared libraries with less than about 16380
data objects.
Test Plan: Test cases added or updated
Reviewers: joerg, hfinkel
Reviewed By: hfinkel
Subscribers: jholewinski, mcrosier, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D5399
llvm-svn: 221791
My original support for the general dynamic and local dynamic TLS
models contained some fairly obtuse hacks to generate calls to
__tls_get_addr when lowering a TargetGlobalAddress. Rather than
generating real calls, special GET_TLS_ADDR nodes were used to wrap
the calls and only reveal them at assembly time. I attempted to
provide correct parameter and return values by chaining CopyToReg and
CopyFromReg nodes onto the GET_TLS_ADDR nodes, but this was also not
fully correct. Problems were seen with two back-to-back stores to TLS
variables, where the call sequences ended up overlapping with unhappy
results. Additionally, since these weren't real calls, the proper
register side effects of a call were not recorded, so clobbered values
were kept live across the calls.
The proper thing to do is to lower these into calls in the first
place. This is relatively straightforward; see the changes to
PPCTargetLowering::LowerGlobalTLSAddress() in PPCISelLowering.cpp.
The changes here are standard call lowering, except that we need to
track the fact that these calls will require a relocation. This is
done by adding a machine operand flag of MO_TLSLD or MO_TLSGD to the
TargetGlobalAddress operand that appears earlier in the sequence.
The calls to LowerCallTo() eventually find their way to
LowerCall_64SVR4() or LowerCall_32SVR4(), which call FinishCall(),
which calls PrepareCall(). In PrepareCall(), we detect the calls to
__tls_get_addr and immediately snag the TargetGlobalTLSAddress with
the annotated relocation information. This becomes an extra operand
on the call following the callee, which is expected for nodes of type
tlscall. We change the call opcode to CALL_TLS for this case. Back
in FinishCall(), we change it again to CALL_NOP_TLS for 64-bit only,
since we require a TOC-restore nop following the call for the 64-bit
ABIs.
During selection, patterns in PPCInstrInfo.td and PPCInstr64Bit.td
convert the CALL_TLS nodes into BL_TLS nodes, and convert the
CALL_NOP_TLS nodes into BL8_NOP_TLS nodes. This replaces the code
removed from PPCAsmPrinter.cpp, as the BL_TLS or BL8_NOP_TLS
nodes can now be emitted normally using their patterns and the
associated printTLSCall print method.
Finally, as a result of these changes, all references to get-tls-addr
in its various guises are no longer used, so they have been removed.
There are existing TLS tests to verify the changes haven't messed
anything up). I've added one new test that verifies that the problem
with the original code has been fixed.
llvm-svn: 221703
Since block address values can be larger than 2GB in 64-bit code, they
cannot be loaded simply using an @l / @ha pair, but instead must be
loaded from the TOC, just like GlobalAddress, ConstantPool, and
JumpTable values are.
The commit also fixes a bug in PPCLinuxAsmPrinter::doFinalization where
temporary labels could not be used as TOC values, since code would
attempt (and fail) to use GetOrCreateSymbol to create a symbol of the
same name as the temporary label.
llvm-svn: 220959
Inline asm may specify 'U' and 'X' constraints to print a 'u' for an
update-form memory reference, or an 'x' for an indexed-form memory
reference. However, these are really only useful in GCC internal code
generation. In inline asm the operand of the memory constraint is
typically just a register containing the address, so 'U' and 'X' make
no sense.
This patch quietly accepts 'U' and 'X' in inline asm patterns, but
otherwise does nothing. If we ever unexpectedly see a non-register,
we'll assert and sort it out afterwards.
I've added a new test for these constraints; the test case should be
used for other asm-constraints changes down the road.
llvm-svn: 217622
This patch builds upon the two preceding MC changes to implement the
basic ELFv2 function call convention. In the ELFv1 ABI, a "function
descriptor" was associated with every function, pointing to both the
entry address and the related TOC base (and a static chain pointer
for nested functions). Function pointers would actually refer to that
descriptor, and the indirect call sequence needed to load up both entry
address and TOC base.
In the ELFv2 ABI, there are no more function descriptors, and function
pointers simply refer to the (global) entry point of the function code.
Indirect function calls simply branch to that address, after loading it
up into r12 (as required by the ABI rules for a global entry point).
Direct function calls continue to just do a "bl" to the target symbol;
this will be resolved by the linker to the local entry point of the
target function if it is local, and to a PLT stub if it is global.
That PLT stub would then load the (global) entry point address of the
final target into r12 and branch to it. Note that when performing a
local function call, r2 must be set up to point to the current TOC
base: if the target ends up local, the ABI requires that its local
entry point is called with r2 set up; if the target ends up global,
the PLT stub requires that r2 is set up.
This patch implements all LLVM changes to implement that scheme:
- No longer create a function descriptor when emitting a function
definition (in EmitFunctionEntryLabel)
- Emit two entry points *if* the function needs the TOC base (r2)
anywhere (this is done EmitFunctionBodyStart; note that this cannot
be done in EmitFunctionBodyStart because the global entry point
prologue code must be *part* of the function as covered by debug info).
- In order to make use tracking of r2 (as needed above) work correctly,
mark direct function calls as implicitly using r2.
- Implement the ELFv2 indirect function call sequence (no function
descriptors; load target address into r12).
- When creating an ELFv2 object file, emit the .abiversion 2 directive
to tell the linker to create the appropriate version of PLT stubs.
Reviewed by Hal Finkel.
llvm-svn: 213489
This adds initial support for PPC32 ELF PIC (Position Independent Code; the
-fPIC variety), thus rectifying a long-standing deficiency in the PowerPC
backend.
Patch by Justin Hibbits!
llvm-svn: 213427
Rafael opened http://llvm.org/bugs/show_bug.cgi?id=19893 to track non-optimal
code generation for forming a function address that is local to the compile
unit. The existing code was treating both local and non-local functions
identically.
This patch fixes the problem by properly identifying local functions and
generating the proper addis/addi code. I also noticed that Rafael's earlier
changes to correct the surrounding code in PPCISelLowering.cpp were also
needed for fast instruction selection in PPCFastISel.cpp, so this patch
fixes that code as well.
The existing test/CodeGen/PowerPC/func-addr.ll is modified to test the new
code generation. I've added a -O0 run line to test the fast-isel code as
well.
Tested on powerpc64[le]-unknown-linux-gnu with no regressions.
llvm-svn: 211056
This seems to match what gcc does for ppc and what every other llvm
backend does.
This is a fixed version of r209638. The difference is to avoid any change
in behavior for functions. The logic for using constant pools for function
addresseses is spread over a few places and we have to keep them in sync.
llvm-svn: 209821
This reverts commit r209638 because it broke self-hosting on ppc64/Linux. (the
Clang-compiled TableGen would segfault because it jumped to an invalid address
from within _ZNK4llvm17ManagedStaticBase21RegisterManagedStaticEPFPvvEPFvS1_E
(which is within the command-line parameter registration process)).
llvm-svn: 209745
This adds back r204781.
Original message:
Aliases are just another name for a position in a file. As such, the
regular symbol resolutions are not applied. For example, given
define void @my_func() {
ret void
}
@my_alias = alias weak void ()* @my_func
@my_alias2 = alias void ()* @my_alias
We produce without this patch:
.weak my_alias
my_alias = my_func
.globl my_alias2
my_alias2 = my_alias
That is, in the resulting ELF file my_alias, my_func and my_alias are
just 3 names pointing to offset 0 of .text. That is *not* the
semantics of IR linking. For example, linking in a
@my_alias = alias void ()* @other_func
would require the strong my_alias to override the weak one and
my_alias2 would end up pointing to other_func.
There is no way to represent that with aliases being just another
name, so the best solution seems to be to just disallow it, converting
a miscompile into an error.
llvm-svn: 204934
This reverts commit r204781.
I will follow up to with msan folks to see what is what they
were trying to do with aliases to weak aliases.
llvm-svn: 204784
Aliases are just another name for a position in a file. As such, the
regular symbol resolutions are not applied. For example, given
define void @my_func() {
ret void
}
@my_alias = alias weak void ()* @my_func
@my_alias2 = alias void ()* @my_alias
We produce without this patch:
.weak my_alias
my_alias = my_func
.globl my_alias2
my_alias2 = my_alias
That is, in the resulting ELF file my_alias, my_func and my_alias are
just 3 names pointing to offset 0 of .text. That is *not* the
semantics of IR linking. For example, linking in a
@my_alias = alias void ()* @other_func
would require the strong my_alias to override the weak one and
my_alias2 would end up pointing to other_func.
There is no way to represent that with aliases being just another
name, so the best solution seems to be to just disallow it, converting
a miscompile into an error.
llvm-svn: 204781
VSX is an ISA extension supported on the POWER7 and later cores that enhances
floating-point vector and scalar capabilities. Among other things, this adds
<2 x double> support and generally helps to reduce register pressure.
The interesting part of this ISA feature is the register configuration: there
are 64 new 128-bit vector registers, the 32 of which are super-registers of the
existing 32 scalar floating-point registers, and the second 32 of which overlap
with the 32 Altivec vector registers. This makes things like vector insertion
and extraction tricky: this can be free but only if we force a restriction to
the right register subclass when needed. A new "minipass" PPCVSXCopy takes care
of this (although it could do a more-optimal job of it; see the comment about
unnecessary copies below).
Please note that, currently, VSX is not enabled by default when targeting
anything because it is not yet ready for that. The assembler and disassembler
are fully implemented and tested. However:
- CodeGen support causes miscompiles; test-suite runtime failures:
MultiSource/Benchmarks/FreeBench/distray/distray
MultiSource/Benchmarks/McCat/08-main/main
MultiSource/Benchmarks/Olden/voronoi/voronoi
MultiSource/Benchmarks/mafft/pairlocalalign
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
SingleSource/Benchmarks/CoyoteBench/almabench
SingleSource/Benchmarks/Misc/matmul_f64_4x4
- The lowering currently falls back to using Altivec instructions far more
than it should. Worse, there are some things that are scalarized through the
stack that shouldn't be.
- A lot of unnecessary copies make it past the optimizers, and this needs to
be fixed.
- Many more regression tests are needed.
Normally, I'd fix these things prior to committing, but there are some
students and other contributors who would like to work this, and so it makes
sense to move this development process upstream where it can be subject to the
regular code-review procedures.
llvm-svn: 203768
Summary:
llvm/MC/MCSectionMachO.h and llvm/Support/MachO.h both had the same
definitions for the section flags. Instead, grab the definitions out of
support.
No functionality change.
Reviewers: grosbach, Bigcheese, rafael
Reviewed By: rafael
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2998
llvm-svn: 203211
My understanding (from reading just the llvm code) is that
* most ppc cpus have a "sync n" instruction and an msync alias that is "sync 0".
* "book e" cpus instead have a msync instruction and not the more
general "sync n"
This patch reflects that in the .td files, allowing a single codepath for
asm ond obj streamer and incidentelly fixes a crash when EmitRawText was
called on a obj streamer.
llvm-svn: 199832
This will allow it to be called from target independent parts of the main
streamer that don't know if there is a registered target streamer or not. This
in turn will allow targets to perform extra actions at specified points in the
interface: add extra flags for some labels, extra work during finalization, etc.
llvm-svn: 199174
operand into the Value interface just like the core print method is.
That gives a more conistent organization to the IR printing interfaces
-- they are all attached to the IR objects themselves. Also, update all
the users.
This removes the 'Writer.h' header which contained only a single function
declaration.
llvm-svn: 198836
are part of the core IR library in order to support dumping and other
basic functionality.
Rename the 'Assembly' include directory to 'AsmParser' to match the
library name and the only functionality left their -- printing has been
in the core IR library for quite some time.
Update all of the #includes to match.
All of this started because I wanted to have the layering in good shape
before I started adding support for printing LLVM IR using the new pass
infrastructure, and commandline support for the new pass infrastructure.
llvm-svn: 198688
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
Before this patch any program that wanted to know the final symbol name of a
GlobalValue had to link with Target.
This patch implements a compromise solution where the mangler uses DataLayout.
This way, any tool that already links with Target (llc, clang) gets the exact
behavior as before and new IR files can be mangled without linking with Target.
With this patch the mangler is constructed with just a DataLayout and DataLayout
is extended to include the information the Mangler needs.
llvm-svn: 198438
MO_JumpTableIndex and MO_ExternalSymbol don't show up on inline asm.
Keeping parts of the old asm printer just to print inline asm to a string that
we then parse back looks like a hack.
llvm-svn: 196111
This patch fixes an old FIXME by creating a MCTargetStreamer interface
and moving the target specific functions for ARM, Mips and PPC to it.
The ARM streamer is still declared in a common place because it is
used from lib/CodeGen/ARMException.cpp, but the Mips and PPC are
completely hidden in the corresponding Target directories.
I will send an email to llvmdev with instructions on how to use this.
llvm-svn: 192181
Large code model on PPC64 requires creating and referencing TOC entries when
using the addis/ld form of addressing. This was not being done in all cases.
The changes in this patch to PPCAsmPrinter::EmitInstruction() fix this. Two
test cases are also modified to reflect this requirement.
Fast-isel was not creating correct code for loading floating-point constants
using large code model. This also requires the addis/ld form of addressing.
Previously we were using the addis/lfd shortcut which is only applicable to
medium code model. One test case is modified to reflect this requirement.
llvm-svn: 190882
Yet another chunk of fast-isel code. This one handles various
conversions involving floating-point. (It also includes some
miscellaneous handling throughout the back end for LWA_32 and LWAX_32
that should have been part of the load-store patch.)
llvm-svn: 189677
This patch provides basic support for powerpc64le as an LLVM target.
However, use of this target will not actually generate little-endian
code. Instead, use of the target will cause the correct little-endian
built-in defines to be generated, so that code that tests for
__LITTLE_ENDIAN__, for example, will be correctly parsed for
syntax-only testing. Code generation will otherwise be the same as
powerpc64 (big-endian), for now.
The patch leaves open the possibility of creating a little-endian
PowerPC64 back end, but there is no immediate intent to create such a
thing.
The LLVM portions of this patch simply add ppc64le coverage everywhere
that ppc64 coverage currently exists. There is nothing of any import
worth testing until such time as little-endian code generation is
implemented. In the corresponding Clang patch, there is a new test
case variant to ensure that correct built-in defines for little-endian
code are generated.
llvm-svn: 187179
In the commit message to r185476 I wrote:
>The PowerPC-specific modifiers VK_PPC_TLSGD and VK_PPC_TLSLD
>correspond exactly to the generic modifiers VK_TLSGD and VK_TLSLD.
>This causes some confusion with the asm parser, since VK_PPC_TLSGD
>is output as @tlsgd, which is then read back in as VK_TLSGD.
>
>To avoid this confusion, this patch removes the PowerPC-specific
>modifiers and uses the generic modifiers throughout. (The only
>drawback is that the generic modifiers are printed in upper case
>while the usual convention on PowerPC is to use lower-case modifiers.
>But this is just a cosmetic issue.)
This was unfortunately incorrect, there is is fact another,
serious drawback to using the default VK_TLSLD/VK_TLSGD
variant kinds: using these causes ELFObjectWriter::RelocNeedsGOT
to return true, which in turn causes the ELFObjectWriter to emit
an undefined reference to _GLOBAL_OFFSET_TABLE_.
This is a problem on powerpc64, because it uses the TOC instead
of the GOT, and the linker does not provide _GLOBAL_OFFSET_TABLE_,
so the symbol remains undefined. This means shared libraries
using TLS built with the integrated assembler are currently
broken.
While the whole RelocNeedsGOT / _GLOBAL_OFFSET_TABLE_ situation
probably ought to be properly fixed at some point, for now I'm
simply reverting the r185476 commit. Now this in turn exposes
the breakage of handling @tlsgd/@tlsld in the asm parser that
this check-in was originally intended to fix.
To avoid this regression, I'm also adding a different fix for
this problem: while common code now parses @tlsgd as VK_TLSGD,
a special hack in the asm parser translates this code to the
platform-specific VK_PPC_TLSGD that the back-end now expects.
While this is not really pretty, it's self-contained and
shouldn't hurt anything else for now. One the underlying
problem is fixed, this hack can be reverted again.
llvm-svn: 185945
A setting in MCAsmInfo defines the "assembler dialect" to use. This is used
by common code to choose between alternatives in a multi-alternative GNU
inline asm statement like the following:
__asm__ ("{sfe|subfe} %0,%1,%2" : "=r" (out) : "r" (in1), "r" (in2));
The meaning of these dialects is platform specific, and GCC defines those
for PowerPC to use dialect 0 for old-style (POWER) mnemonics and 1 for
new-style (PowerPC) mnemonics, like in the example above.
To be compatible with inline asm used with GCC, LLVM ought to do the same.
Specifically, this means we should always use assembler dialect 1 since
old-style mnemonics really aren't supported on any current platform.
However, the current LLVM back-end uses:
AssemblerDialect = 1; // New-Style mnemonics.
in PPCMCAsmInfoDarwin, and
AssemblerDialect = 0; // Old-Style mnemonics.
in PPCLinuxMCAsmInfo.
The Linux setting really isn't correct, we should be using new-style
mnemonics everywhere. This is changed by this commit.
Unfortunately, the setting of this variable is overloaded in the back-end
to decide whether or not we are on a Darwin target. This is done in
PPCInstPrinter (the "SyntaxVariant" is initialized from the MCAsmInfo
AssemblerDialect setting), and also in PPCMCExpr. Setting AssemblerDialect
to 1 for both Darwin and Linux no longer allows us to make this distinction.
Instead, this patch uses the MCSubtargetInfo passed to createPPCMCInstPrinter
to distinguish Darwin targets, and ignores the SyntaxVariant parameter.
As to PPCMCExpr, this patch adds an explicit isDarwin argument that needs
to be passed in by the caller when creating a target MCExpr. (To do so
this patch implicitly also reverts commit 184441.)
llvm-svn: 185858
Just as with mfocrf, it is also preferable to use mtocrf instead of
mtcrf when only a single CR register is to be written.
Current code however always emits mtcrf. This probably does not matter
when using an external assembler, since the GNU assembler will in fact
automatically replace mtcrf with mtocrf when possible. It does create
inefficient code with the integrated assembler, however.
To fix this, this patch adds MTOCRF/MTOCRF8 instruction patterns and
uses those instead of MTCRF/MTCRF8 everything. Just as done in the
MFOCRF patch committed as 185556, these patterns will be converted
back to MTCRF if MTOCRF is not available on the machine.
As a side effect, this allows to modify the MTCRF pattern to accept
the full range of mask operands for the benefit of the asm parser.
llvm-svn: 185561
When accessing just a single CR register, it is always preferable to
use mfocrf instead of mfcr, if the former is available on the CPU.
Current code makes that distinction in many, but not all places
where a single CR register value is retrieved. One missing
location is PPCRegisterInfo::lowerCRSpilling.
To fix this and make this simpler in the future, this patch changes
the bulk of the back-end to always assume mfocrf is available and
simply generate it when needed.
On machines that actually do not support mfocrf, the instruction
is replaced by mfcr at the very end, in EmitInstruction.
This has the additional benefit that we no longer need the
MFCRpseud hack, since before EmitInstruction we always have
a MFOCRF instruction pattern, which already models data flow
as required.
The patch also adds the MFOCRF8 version of the instruction,
which was missing so far.
Except for the PPCRegisterInfo::lowerCRSpilling case, no change
in generated code intended.
llvm-svn: 185556
As part of the global-dynamic and local-dynamic TLS sequences, we need
to use a special form of the call instruction:
bl __tls_get_addr(sym@tlsld)
bl __tls_get_addr(sym@tlsgd)
which generates two fixups. The current implementation of this causes
problems with recognizing this form in the asm parser. To fix this,
this patch reworks operand processing for this special form by using
a single operand to hold both __tls_get_addr and sym@tlsld and defining
a print method to output the above form, and an encoding method to
generate the two fixups.
As a side simplification, the patch replaces the two instruction
patterns BL8_NOP_TLSGD and BL8_NOP_TLSLD by a single BL8_NOP_TLS,
since the patterns already operate in an identical fashion (whether
we have a local-dynamic or global-dynamic symbol is already encoded
in the symbol modifier).
No change in code generation intended.
llvm-svn: 185477
The PowerPC-specific modifiers VK_PPC_TLSGD and VK_PPC_TLSLD
correspond exactly to the generic modifiers VK_TLSGD and VK_TLSLD.
This causes some confusion with the asm parser, since VK_PPC_TLSGD
is output as @tlsgd, which is then read back in as VK_TLSGD.
To avoid this confusion, this patch removes the PowerPC-specific
modifiers and uses the generic modifiers throughout. (The only
drawback is that the generic modifiers are printed in upper case
while the usual convention on PowerPC is to use lower-case modifiers.
But this is just a cosmetic issue.)
llvm-svn: 185476
This is dead code since PIC16 was removed in 2010. The result was an odd mix,
where some parts would carefully pass it along and others would assert it was
zero (most of the object streamer for example).
llvm-svn: 185436
This renames more VK_PPC_ enums, to make them more closely reflect
the @modifier string they represent. This also prepares for adding
a bunch of new VK_PPC_ enums in upcoming patches.
For consistency, some MO_ flags related to VK_PPC_ enums are
likewise renamed.
No change in behaviour.
llvm-svn: 184547
This is another minor cleanup; to bring enum names in line
with the corresponding @modifier names, this renames:
VK_PPC_TOC -> VK_PPC_TOCBASE
VK_PPC_TOC_ENTRY -> VK_PPC_TOC16
No code change intended.
llvm-svn: 184491
Frame index handling is now target-agnostic, so delete the target hooks
for creation & asm printing of target-specific addressing in DBG_VALUEs
and any related functions.
llvm-svn: 184067
When targeting the Darwin assembler, we need to generate markers ha16() and
lo16() to designate the high and low parts of a (symbolic) immediate. This
is necessary not just for plain symbols, but also for certain symbolic
expression, typically along the lines of ha16(A - B). The latter doesn't
work when simply using VariantKind flags on the symbol reference.
This is why the current back-end uses hacks (explicitly called out as such
via multiple FIXMEs) in the symbolLo/symbolHi print methods.
This patch uses target-defined MCExpr codes to represent the Darwin
ha16/lo16 constructs, following along the lines of the equivalent solution
used by the ARM back end to handle their :upper16: / :lower16: markers.
This allows us to get rid of special handling both in the symbolLo/symbolHi
print method and in the common code MCExpr::print routine. Instead, the
ha16 / lo16 markers are printed simply in a custom print routine for the
target MCExpr types. (As a result, the symbolLo/symbolHi print methods
can now replaced by a single printS16ImmOperand routine that also handles
symbolic operands.)
The patch also provides a EvaluateAsRelocatableImpl routine to handle
ha16/lo16 constructs. This is not actually used at the moment by any
in-tree code, but is provided as it makes merging into David Fang's
out-of-tree Mach-O object writer simpler.
Since there is no longer any need to treat VK_PPC_GAS_HA16 and
VK_PPC_DARWIN_HA16 differently, they are merged into a single
VK_PPC_ADDR16_HA (and likewise for the _LO16 types).
llvm-svn: 182616
BCL is normally a conditional branch-and-link instruction, but has
an unconditional form (which is used in the SjLj code, for example).
To make clear that this BCL instruction definition is specifically
the special unconditional form (which does not meaningfully take
a condition-register input), rename it to BCLalways.
No functionality change intended.
llvm-svn: 178803
The LDrs pattern is a duplicate of LD, except that it accepts memory
addresses where the displacement is a symbolLo64. An operand type
"memrs" is defined for just that purpose.
However, this wouldn't be necessary if the default "memrix" operand
type were to simply accept 64-bit symbolic addresses directly.
The only problem with that is that it uses "symbolLo", which is
hardcoded to 32-bit.
To fix this, this commit changes "memri" and "memrix" to use new
operand types for the memory displacement, which allow iPTR
instead of i32. This will also make address parsing easier to
implment in the asm parser.
No change in generated code.
llvm-svn: 178005
The ADDI/ADDI8 patterns are currently duplicated into ADDIL/ADDI8L,
which describe the same instruction, except that they accept a
symbolLo[64] operand instead of a s16imm[64] operand.
This duplication confuses the asm parser, and it actually not really
needed, since symbolLo[64] already accepts immediate operands anyway.
So this commit removes the duplicate patterns.
No change in generated code.
llvm-svn: 178004
I recently added a BCL instruction definition as part of implementing SjLj
support. This can also be used to MCize bcl emission in the asm printer.
No functionality change intended.
llvm-svn: 177830
We currently have a duplicated set of call instruction patterns depending
on the ABI to be followed (Darwin vs. Linux). This is a bit odd; while the
different ABIs will result in different instruction sequences, the actual
instructions themselves ought to be independent of the ABI. And in fact it
turns out that the only nontrivial difference between the two sets of
patterns is that in the PPC64 Linux ABI, the instruction used for indirect
calls is marked to take X11 as extra input register (which is indeed used
only with that ABI to hold an incoming environment pointer for nested
functions). However, this does not need to be hard-coded at the .td
pattern level; instead, the C++ code expanding calls can simply add that
use, just like it adds uses for argument registers anyway.
No change in generated code expected.
llvm-svn: 177735
Large code model is identical to medium code model except that the
addis/addi sequence for "local" accesses is never used. All accesses
use the addis/ld sequence.
The coding changes are straightforward; most of the patch is taken up
with creating variants of the medium model tests for large model.
llvm-svn: 175767
This patch adjust the r171506 to make all DWARF enconding pc-relative
for PPC64. It also adds the R_PPC64_REL32 relocation handling in MCJIT
(since the eh_frame will not generate PIC-relative relocation) and also
adds the emission of stubs created by the TTypeEncoding.
llvm-svn: 171979
code generation. Variables addressed through a GlobalAlias were not being
handled, and variables with available_externally linkage were treated
incorrectly. The patch contains two new tests to verify the correct code
generation for these cases.
llvm-svn: 171778
into their new header subdirectory: include/llvm/IR. This matches the
directory structure of lib, and begins to correct a long standing point
of file layout clutter in LLVM.
There are still more header files to move here, but I wanted to handle
them in separate commits to make tracking what files make sense at each
layer easier.
The only really questionable files here are the target intrinsic
tablegen files. But that's a battle I'd rather not fight today.
I've updated both CMake and Makefile build systems (I think, and my
tests think, but I may have missed something).
I've also re-sorted the includes throughout the project. I'll be
committing updates to Clang, DragonEgg, and Polly momentarily.
llvm-svn: 171366
for a wider range of GOT entries that can hold thread-relative offsets.
This matches the behavior of GCC, which was not documented in the PPC64 TLS
ABI. The ABI will be updated with the new code sequence.
Former sequence:
ld 9,x@got@tprel(2)
add 9,9,x@tls
New sequence:
addis 9,2,x@got@tprel@ha
ld 9,x@got@tprel@l(9)
add 9,9,x@tls
Note that a linker optimization exists to transform the new sequence into
the shorter sequence when appropriate, by replacing the addis with a nop
and modifying the base register and relocation type of the ld.
llvm-svn: 170209
PowerPC target. This is the last of the four models, so we now have
full TLS support.
This is mostly a straightforward extension of the general dynamic model.
I had to use an additional Chain operand to tie ADDIS_DTPREL_HA to the
register copy following ADDI_TLSLD_L; otherwise everything above the
ADDIS_DTPREL_HA appeared dead and was removed.
As before, there are new test cases to test the assembly generation, and
the relocations output during integrated assembly. The expected code
gen sequence can be read in test/CodeGen/PowerPC/tls-ld.ll.
There are a couple of things I think can be done more efficiently in the
overall TLS code, so there will likely be a clean-up patch forthcoming;
but for now I want to be sure the functionality is in place.
Bill
llvm-svn: 170003
Given a thread-local symbol x with global-dynamic access, the generated
code to obtain x's address is:
Instruction Relocation Symbol
addis ra,r2,x@got@tlsgd@ha R_PPC64_GOT_TLSGD16_HA x
addi r3,ra,x@got@tlsgd@l R_PPC64_GOT_TLSGD16_L x
bl __tls_get_addr(x@tlsgd) R_PPC64_TLSGD x
R_PPC64_REL24 __tls_get_addr
nop
<use address in r3>
The implementation borrows from the medium code model work for introducing
special forms of ADDIS and ADDI into the DAG representation. This is made
slightly more complicated by having to introduce a call to the external
function __tls_get_addr. Using the full call machinery is overkill and,
more importantly, makes it difficult to add a special relocation. So I've
introduced another opcode GET_TLS_ADDR to represent the function call, and
surrounded it with register copies to set up the parameter and return value.
Most of the code is pretty straightforward. I ran into one peculiarity
when I introduced a new PPC opcode BL8_NOP_ELF_TLSGD, which is just like
BL8_NOP_ELF except that it takes another parameter to represent the symbol
("x" above) that requires a relocation on the call. Something in the
TblGen machinery causes BL8_NOP_ELF and BL8_NOP_ELF_TLSGD to be treated
identically during the emit phase, so this second operand was never
visited to generate relocations. This is the reason for the slightly
messy workaround in PPCMCCodeEmitter.cpp:getDirectBrEncoding().
Two new tests are included to demonstrate correct external assembly and
correct generation of relocations using the integrated assembler.
Comments welcome!
Thanks,
Bill
llvm-svn: 169910
on 64-bit PowerPC ELF.
The patch includes code to handle external assembly and MC output with the
integrated assembler. It intentionally does not support the "old" JIT.
For the initial-exec TLS model, the ABI requires the following to calculate
the address of external thread-local variable x:
Code sequence Relocation Symbol
ld 9,x@got@tprel(2) R_PPC64_GOT_TPREL16_DS x
add 9,9,x@tls R_PPC64_TLS x
The register 9 is arbitrary here. The linker will replace x@got@tprel
with the offset relative to the thread pointer to the generated GOT
entry for symbol x. It will replace x@tls with the thread-pointer
register (13).
The two test cases verify correct assembly output and relocation output
as just described.
PowerPC-specific selection node variants are added for the two
instructions above: LD_GOT_TPREL and ADD_TLS. These are inserted
when an initial-exec global variable is encountered by
PPCTargetLowering::LowerGlobalTLSAddress(), and later lowered to
machine instructions LDgotTPREL and ADD8TLS. LDgotTPREL is a pseudo
that uses the same LDrs support added for medium code model's LDtocL,
with a different relocation type.
The rest of the processing is straightforward.
llvm-svn: 169281
Sooooo many of these had incorrect or strange main module includes.
I have manually inspected all of these, and fixed the main module
include to be the nearest plausible thing I could find. If you own or
care about any of these source files, I encourage you to take some time
and check that these edits were sensible. I can't have broken anything
(I strictly added headers, and reordered them, never removed), but they
may not be the headers you'd really like to identify as containing the
API being implemented.
Many forward declarations and missing includes were added to a header
files to allow them to parse cleanly when included first. The main
module rule does in fact have its merits. =]
llvm-svn: 169131
The default for 64-bit PowerPC is small code model, in which TOC entries
must be addressable using a 16-bit offset from the TOC pointer. Additionally,
only TOC entries are addressed via the TOC pointer.
With medium code model, TOC entries and data sections can all be addressed
via the TOC pointer using a 32-bit offset. Cooperation with the linker
allows 16-bit offsets to be used when these are sufficient, reducing the
number of extra instructions that need to be executed. Medium code model
also does not generate explicit TOC entries in ".section toc" for variables
that are wholly internal to the compilation unit.
Consider a load of an external 4-byte integer. With small code model, the
compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
With medium model, it instead generates:
addis 3, 2, .LC1@toc@ha
ld 3, .LC1@toc@l(3)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc ei[TC],ei
Here .LC1@toc@ha is a relocation requesting the upper 16 bits of the
32-bit offset of ei's TOC entry from the TOC base pointer. Similarly,
.LC1@toc@l is a relocation requesting the lower 16 bits. Note that if
the linker determines that ei's TOC entry is within a 16-bit offset of
the TOC base pointer, it will replace the "addis" with a "nop", and
replace the "ld" with the identical "ld" instruction from the small
code model example.
Consider next a load of a function-scope static integer. For small code
model, the compiler generates:
ld 3, .LC1@toc(2)
lwz 4, 0(3)
.section .toc,"aw",@progbits
.LC1:
.tc test_fn_static.si[TC],test_fn_static.si
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
For medium code model, the compiler generates:
addis 3, 2, test_fn_static.si@toc@ha
addi 3, 3, test_fn_static.si@toc@l
lwz 4, 0(3)
.type test_fn_static.si,@object
.local test_fn_static.si
.comm test_fn_static.si,4,4
Again, the linker may replace the "addis" with a "nop", calculating only
a 16-bit offset when this is sufficient.
Note that it would be more efficient for the compiler to generate:
addis 3, 2, test_fn_static.si@toc@ha
lwz 4, test_fn_static.si@toc@l(3)
The current patch does not perform this optimization yet. This will be
addressed as a peephole optimization in a later patch.
For the moment, the default code model for 64-bit PowerPC will remain the
small code model. We plan to eventually change the default to medium code
model, which matches current upstream GCC behavior. Note that the different
code models are ABI-compatible, so code compiled with different models will
be linked and execute correctly.
I've tested the regression suite and the application/benchmark test suite in
two ways: Once with the patch as submitted here, and once with additional
logic to force medium code model as the default. The tests all compile
cleanly, with one exception. The mandel-2 application test fails due to an
unrelated ABI compatibility with passing complex numbers. It just so happens
that small code model was incredibly lucky, in that temporary values in
floating-point registers held the expected values needed by the external
library routine that was called incorrectly. My current thought is to correct
the ABI problems with _Complex before making medium code model the default,
to avoid introducing this "regression."
Here are a few comments on how the patch works, since the selection code
can be difficult to follow:
The existing logic for small code model defines three pseudo-instructions:
LDtoc for most uses, LDtocJTI for jump table addresses, and LDtocCPT for
constant pool addresses. These are expanded by SelectCodeCommon(). The
pseudo-instruction approach doesn't work for medium code model, because
we need to generate two instructions when we match the same pattern.
Instead, new logic in PPCDAGToDAGISel::Select() intercepts the TOC_ENTRY
node for medium code model, and generates an ADDIStocHA followed by either
a LDtocL or an ADDItocL. These new node types correspond naturally to
the sequences described above.
The addis/ld sequence is generated for the following cases:
* Jump table addresses
* Function addresses
* External global variables
* Tentative definitions of global variables (common linkage)
The addis/addi sequence is generated for the following cases:
* Constant pool entries
* File-scope static global variables
* Function-scope static variables
Expanding to the two-instruction sequences at select time exposes the
instructions to subsequent optimization, particularly scheduling.
The rest of the processing occurs at assembly time, in
PPCAsmPrinter::EmitInstruction. Each of the instructions is converted to
a "real" PowerPC instruction. When a TOC entry needs to be created, this
is done here in the same manner as for the existing LDtoc, LDtocJTI, and
LDtocCPT pseudo-instructions (I factored out a new routine to handle this).
I had originally thought that if a TOC entry was needed for LDtocL or
ADDItocL, it would already have been generated for the previous ADDIStocHA.
However, at higher optimization levels, the ADDIStocHA may appear in a
different block, which may be assembled textually following the block
containing the LDtocL or ADDItocL. So it is necessary to include the
possibility of creating a new TOC entry for those two instructions.
Note that for LDtocL, we generate a new form of LD called LDrs. This
allows specifying the @toc@l relocation for the offset field of the LD
instruction (i.e., the offset is replaced by a SymbolLo relocation).
When the peephole optimization described above is added, we will need
to do similar things for all immediate-form load and store operations.
The seven "mcm-n.ll" test cases are kept separate because otherwise the
intermingling of various TOC entries and so forth makes the tests fragile
and hard to understand.
The above assumes use of an external assembler. For use of the
integrated assembler, new relocations are added and used by
PPCELFObjectWriter. Testing is done with "mcm-obj.ll", which tests for
proper generation of the various relocations for the same sequences
tested with the external assembler.
llvm-svn: 168708