The x64 ABI requires that epilogues do not contain code other than stack
adjustments and some limited control flow. However, we'd insert code to
initialize the return address after stack adjustments. Instead, insert
EAX/RAX with the current value before we create the stack adjustments in
the epilogue.
llvm-svn: 248839
HHVM calling convention, hhvmcc, is used by HHVM JIT for
functions in translated cache. We currently support LLVM back end to
generate code for X86-64 and may support other architectures in the
future.
In HHVM calling convention any GP register could be used to pass and
return values, with the exception of R12 which is reserved for
thread-local area and is callee-saved. Other than R12, we always
pass RBX and RBP as args, which are our virtual machine's stack pointer
and frame pointer respectively.
When we enter translation cache via hhvmcc function, we expect
the stack to be aligned at 16 bytes, i.e. skewed by 8 bytes as opposed
to standard ABI alignment. This affects stack object alignment and stack
adjustments for function calls.
One extra calling convention, hhvm_ccc, is used to call C++ helpers from
HHVM's translation cache. It is almost identical to standard C calling
convention with an exception of first argument which is passed in RBP
(before we use RDI, RSI, etc.)
Differential Revision: http://reviews.llvm.org/D12681
llvm-svn: 248832
Summary:
Funclets have been turned into functions by the time they hit the object
file. Make sure that they have decent names for the symbol table and
CFI directives explaining how to reason about their prologues.
Differential Revision: http://reviews.llvm.org/D13261
llvm-svn: 248824
The immediate in the load/store should be scaled by the size of the memory
operation, not the size of the register being loaded/stored. This change gets
us one step closer to forming LDPSW instructions. This change also enables
pre- and post-indexing for halfword and byte loads and stores.
llvm-svn: 248804
alignment requirements, for example in the case of vectors.
These requirements are exploited by the code generator by using
move instructions that have similar alignment requirements, e.g.,
movaps on x86.
Although the code generator properly aligns the arguments with
respect to the displacement of the stack pointer it computes,
the displacement itself may cause misalignment. For example if
we have
%3 = load <16 x float>, <16 x float>* %1, align 64
call void @bar(<16 x float> %3, i32 0)
the x86 back-end emits:
movaps 32(%ecx), %xmm2
movaps (%ecx), %xmm0
movaps 16(%ecx), %xmm1
movaps 48(%ecx), %xmm3
subl $20, %esp <-- if %esp was 16-byte aligned before this instruction, it no longer will be afterwards
movaps %xmm3, (%esp) <-- movaps requires 16-byte alignment, while %esp is not aligned as such.
movl $0, 16(%esp)
calll __bar
To solve this, we need to make sure that the computed value with which
the stack pointer is changed is a multiple af the maximal alignment seen
during its computation. With this change we get proper alignment:
subl $32, %esp
movaps %xmm3, (%esp)
Differential Revision: http://reviews.llvm.org/D12337
llvm-svn: 248786
The splitting of > 4 dword SMRD instructions
if using an offset in an SGPR instead of an immediate
was not setting the destination register,
resulting an an instruction missing an operand
which would assert later.
Test will be included in a following commit
which fixes a related issue.
llvm-svn: 248739
Summary:
The P5600 is an out-of-order, superscalar implementation of the MIPS32R5
architecture.
The scheduler has a few missing details (see the 'Tricky Instructions'
section and some quirks of the P5600 are deliberately omitted due to
implementation difficulty and low chance of significant benefit (e.g. the
predicate on P5600WriteEitherALU). However, testing on SingleSource is
showing significant performance benefits on some apps (seven in the 10-30%
range) and only one significant regression (12%) when
-pre-RA-sched=linearize is given. Without -pre-RA-sched=linearize the
results are more variable. Some do even better (up to 55% improvement) but
increased numbers of copies are slowing others down (up to 12%).
Overall, the scheduler as it currently stands is a 2.4% win with
-pre-RA-sched=linearize and a 2.7% win without -pre-RA-sched=linearize.
I'm sure we can improve on this further.
For completeness, the FPGA this was tested on shows some failures with and
without the P5600 scheduler. These appear to be scheduling related since
the two test runs have fairly different sets of failing tests even after
accounting for other factors (e.g. spurious connection failures) however
it's not P5600 specific since we also get some for the generic scheduler.
Reviewers: vkalintiris
Subscribers: mpf, llvm-commits, atrick, vkalintiris
Differential Revision: http://reviews.llvm.org/D12193
llvm-svn: 248725
supportsTailCall() has two callers. Both of them double-check isThumb1Only(),
and refuse to proceed with tail-calling in that case.
Therefore, it makes sense to move this check to
ARMSubtarget::initSubtargetFeatures, where SupportsTailCall is initialized;
and to eliminate the extra checks at the call sites.
Following a review comment, added an "assert(supportsTailCall())"
in IsEligibleForTailCall.
NFC.
llvm-svn: 248703
These require multiple mov instructions to copy,
but the default value is that 1 instruction is needed.
I'm not sure if this actually changes anything.
llvm-svn: 248651
Trying to use the version with the explicit output operand
would complain because of the missing WriteSALU. I'm not sure
why it doesn't complain about this with the implicit VCC def.
llvm-svn: 248646
It's easier to understand creating a full instruction
than the current situation where sometimes a new
instruction is created and sometimes it is awkwardly
mutated in place.
llvm-svn: 248627
This is a redo of D7208 ( r227242 - http://llvm.org/viewvc/llvm-project?view=revision&revision=227242 ).
The patch was reverted because an AArch64 target could infinite loop after the change in DAGCombiner
to merge vector stores. That happened because AArch64's allowsMisalignedMemoryAccesses() wasn't telling
the truth. It reported all unaligned memory accesses as fast, but then split some 128-bit unaligned
accesses up in performSTORECombine() because they are slow.
This patch attempts to fix the problem in AArch's allowsMisalignedMemoryAccesses() while preserving
existing (perhaps questionable) lowering behavior.
The x86 test shows that store merging is working as intended for a target with fast 32-byte unaligned
stores.
Differential Revision: http://reviews.llvm.org/D12635
llvm-svn: 248622
Don't run passes related to stack maps, garbage collection,
exceptions since these aren't useful for GPUs.
There might be a few more to turn off that I'm less sure about
(e.g. ShrinkWrapping) or I'm not sure how to disable
(SafeStack and StackProtector)
llvm-svn: 248591
This fixes a select error when the i64 source was also
bitcasted to v2i32 in the original source.
Instead of awkwardly trying to select the modified source value and
the store, replace before isel begins.
Uses a worklist to avoid possible problems from mutating the DAG,
although it seems to work OK without it.
llvm-svn: 248589
When buffer resource descriptors were built, the upper two components
of the descriptor were first composed into a 64-bit register because
legalizeOperands assumed all operands had the same register class.
Fix that problem, but keep the workaround. I'm not sure anything
actually is actually emitting such a REG_SEQUENCE now.
If multiple resource descriptors are set up with different base
pointers, this is copied with a single s_mov_b64. We probably
should fix this better by recognizing a pair of s_mov_b32 later,
but for now delete the dead code.
llvm-svn: 248585
This was only set on the final _si/_vi version, but not
on the pseudos most of codegen sees.
No test since these instructions aren't used yet.
llvm-svn: 248583
These were all using the default 32-bit VALU write class,
but the i64/f64 compares are half rate.
I'm not sure this is really correct, because they are still using
the write to VALU write class, even though they really write
to the SALU.
llvm-svn: 248582
We now emit the compiler generated divide by zero check that was needed for the
MSVC routines. We construct a psuedo-instruction for the DBZ check as the
operation requires splitting up the BB. For the 64-bit operations, we need to
custom expand the node as we need to insert the DBZ check and then emit the
libcall to the appropriate name. Because this is target specific, it seemed
better to reproduce the expansion operation from the target-agnostic type
legalization rather than sink this there to avoid the duplication. The division
library calls now match MSVC semantically.
llvm-svn: 248561
Fix for D12561 - we weren't correctly ensuring that the base element for extension was moved to start on a boundary suitable for UNPCKL/H
llvm-svn: 248536