Split FP-stack notes out of the main readme. Next up: splitting out SSE.

llvm-svn: 28399

parent 240f846495
commit 427ea6f0a7

@@ -0,0 +1,99 @@
//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend: FP stack related stuff
//===---------------------------------------------------------------------===//

//===---------------------------------------------------------------------===//

Some targets (e.g. athlons) prefer freep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

On darwin/x86, we should codegen:

        ret double 0.000000e+00

as fld0/ret, not as:

        movl $0, 4(%esp)
        movl $0, (%esp)
        fldl (%esp)
        ...
        ret

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

We have fiadd patterns now, but the following patterns have the same cost and
complexity. We need a way to specify that the latter is more profitable.

def FpADD32m  : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (extloadf64f32 addr:$src2)))]>;
                // ST(0) = ST(0) + [mem32]

def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (X86fild addr:$src2, i32)))]>;
                // ST(0) = ST(0) + [mem32int]

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global. Also, it should handle simple permutations
to reduce the number of shuffle instructions, e.g. turning:

        fld P        ->        fld Q
        fld Q                  fld P
        fxch

or:

        fxch         ->        fucomi
        fucomi                 jl X
        jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html

//===---------------------------------------------------------------------===//

Add a target specific hook to DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

//===---------------------------------------------------------------------===//

Open code rint,floor,ceil,trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

Open code the sincos[f] libcall.

//===---------------------------------------------------------------------===//

None of the FPStack instructions are handled in
X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
folding spill code into the instructions.

//===---------------------------------------------------------------------===//

Currently the x86 codegen isn't very good at mixing SSE and FPStack
code:

unsigned int foo(double x) { return x; }

foo:
        subl $20, %esp
        movsd 24(%esp), %xmm0
        movsd %xmm0, 8(%esp)
        fldl 8(%esp)
        fisttpll (%esp)
        movl (%esp), %eax
        addl $20, %esp
        ret

This will be solved when we go to a dynamic programming based isel.

//===---------------------------------------------------------------------===//
@@ -29,62 +29,6 @@ unsigned test(unsigned long long X, unsigned Y) {
This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//

Some targets (e.g. athlons) prefer freep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

On darwin/x86, we should codegen:

        ret double 0.000000e+00

as fld0/ret, not as:

        movl $0, 4(%esp)
        movl $0, (%esp)
        fldl (%esp)
        ...
        ret

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

We have fiadd patterns now, but the following patterns have the same cost and
complexity. We need a way to specify that the latter is more profitable.

def FpADD32m  : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (extloadf64f32 addr:$src2)))]>;
                // ST(0) = ST(0) + [mem32]

def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW,
                    [(set RFP:$dst, (fadd RFP:$src1,
                                     (X86fild addr:$src2, i32)))]>;
                // ST(0) = ST(0) + [mem32int]

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global. Also, it should handle simple permutations
to reduce the number of shuffle instructions, e.g. turning:

        fld P        ->        fld Q
        fld Q                  fld P
        fxch

or:

        fxch         ->        fucomi
        fucomi                 jl X
        jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
@@ -136,11 +80,6 @@ allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Add a target specific hook to DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.

        cmpl $1, %eax
@@ -181,24 +120,6 @@ flags.

//===---------------------------------------------------------------------===//

Open code rint,floor,ceil,trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
double sincos(double x, double *sin, double *cos);
float sincosf(float x, float *sin, float *cos);
long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers. See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
@@ -219,11 +140,6 @@ target specific hook.

//===---------------------------------------------------------------------===//

LSR should be turned on for the X86 backend and tuned to take advantage of its
addressing modes.

//===---------------------------------------------------------------------===//

When compiled with unsafemath enabled, "main" should enable SSE DAZ mode and
other fast SSE modes.
@@ -293,11 +209,6 @@ The pattern isel got this one right.

//===---------------------------------------------------------------------===//

We need to lower switch statements to tablejumps when appropriate instead of
always into binary branch trees.

//===---------------------------------------------------------------------===//

SSE doesn't have [mem] op= reg instructions. If we have an SSE instruction
like this:
@@ -351,12 +262,6 @@ much sense (e.g. its an infinite loop). :)

//===---------------------------------------------------------------------===//

None of the FPStack instructions are handled in
X86RegisterInfo::foldMemoryOperand, which prevents the spiller from
folding spill code into the instructions.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
@@ -827,11 +732,6 @@ _test:

//===---------------------------------------------------------------------===//

A Mac OS X IA-32 specific ABI bug wrt returning value > 8 bytes:
http://llvm.org/bugs/show_bug.cgi?id=729

//===---------------------------------------------------------------------===//

X86RegisterInfo::copyRegToReg() returns X86::MOVAPSrr for VR128. Is it possible
to choose between movaps, movapd, and movdqa based on types of source and
destination?