forked from OSchip/llvm-project
![]() Default legalization will create two v8i64 truncs to v8i32, concat them to v16i32, and then truncate the rest of the way to v16i8. Instead we can truncate directly from v8i64 to v8i8 in the lower half of an xmm. Then concat the two halves to use vpunpcklqdq. This is the same number of uops, but the dependency chain through the uops is better since the halves are merged at the end. I had to had SimplifyDemandedBits support for VTRUNC to prevent a regression on vector-trunc-math.ll. combineTruncatedArithmetic no longer gets a chance to shrink vXi64 mul so we were producing the v8i64 multiply sequence using multiple PMULUDQs. With the demanded bits fix we are able to prune out the extra ops leaving just two PMULUDQs, one for each v8i64 half. This is twice the width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1 uop and PMULLD is 2. We also save some truncates. It's probably worth using PMULUDQ even when PMULLQ is available since the latter is 3 uops, but that will require a different change. Differential Revision: https://reviews.llvm.org/D79231 |
||
---|---|---|
.. | ||
AArch64 | ||
AMDGPU | ||
ARC | ||
ARM | ||
AVR | ||
BPF | ||
Generic | ||
Hexagon | ||
Inputs | ||
Lanai | ||
MIR | ||
MSP430 | ||
Mips | ||
NVPTX | ||
PowerPC | ||
RISCV | ||
SPARC | ||
SystemZ | ||
Thumb | ||
Thumb2 | ||
VE | ||
WebAssembly | ||
WinCFGuard | ||
WinEH | ||
X86 | ||
XCore |