llvm-project/llvm/lib/Target/X86/README.txt

169 lines
5.3 KiB
Plaintext

//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//
Add a MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, & make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:
long long test(long long X, long long Y) { return X*Y; }
by using the EAX result from the mul. We should add a similar node for
DIVREM.
another case is:
long long test(int X, int Y) { return (long long)X*Y; }
... which should only be one imul instruction.
//===---------------------------------------------------------------------===//
This should be one DIV/IDIV instruction, not a libcall:
unsigned test(unsigned long long X, unsigned Y) {
return X/Y;
}
This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
//===---------------------------------------------------------------------===//
Some targets (e.g. athlons) prefer freep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html
//===---------------------------------------------------------------------===//
This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }
//===---------------------------------------------------------------------===//
The FP stackifier needs to be global. Also, it should handle simple permutates
to reduce number of shuffle instructions, e.g. turning:
fld P -> fld Q
fld Q fld P
fxch
or:
fxch -> fucomi
fucomi jl X
jg X
Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html
//===---------------------------------------------------------------------===//
Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
//===---------------------------------------------------------------------===//
Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
Another useful one would be ~0ULL >> X and ~0ULL << X.
//===---------------------------------------------------------------------===//
Should support emission of the bswap instruction, probably by adding a new
DAG node for byte swapping. Also useful on PPC which has byte-swapping loads.
//===---------------------------------------------------------------------===//
Compile this:
_Bool f(_Bool a) { return a!=1; }
into:
movzbl %dil, %eax
xorl $1, %eax
ret
//===---------------------------------------------------------------------===//
Some isel ideas:
1. Dynamic programming based approach when compile time if not an
issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
Sequencing of Instructions".
//===---------------------------------------------------------------------===//
Should we promote i16 to i32 to avoid partial register update stalls?
//===---------------------------------------------------------------------===//
Leave any_extend as pseudo instruction and hint to register
allocator. Delay codegen until post register allocation.
//===---------------------------------------------------------------------===//
Add a target specific hook to DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.
//===---------------------------------------------------------------------===//
Check if load folding would add a cycle in the dag.
//===---------------------------------------------------------------------===//
Model X86 EFLAGS as a real register to avoid redudant cmp / test. e.g.
cmpl $1, %eax
setg %al
testb %al, %al # unnecessary
jne .BB7
//===---------------------------------------------------------------------===//
Count leading zeros and count trailing zeros:
int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }
$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
bsr %eax, DWORD PTR [%esp+4]
xor %eax, 31
ret
ctz:
bsf %eax, DWORD PTR [%esp+4]
ret
however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
//===---------------------------------------------------------------------===//
Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
//===---------------------------------------------------------------------===//
Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.
//===---------------------------------------------------------------------===//
Open code rint,floor,ceil,trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html
//===---------------------------------------------------------------------===//
Combine: a = sin(x), b = cos(x) into a,b = sincos(x).