This is a suggested follow-up to D116765.
This removes a clear of the register operand, so it is better
for code size, but it does potentially create a false register
dependency on surrounding code. If that is a problem, it should
be solvable using dependency-breaking code that is used for
other instructions.
Differential Revision: https://reviews.llvm.org/D116804
To enable this on all targets there are still a number of regressions due to getSplatValue/getTargetVShiftNode, but these don't really affect pre-AVX targets.
This is very similar to the existing ROTL/ROTR support for scalar shifts in LowerRotate; I think as time goes on we should be able to share much of this code in helpers between funnel shift and rotation lowering.
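For reference, a minimal C sketch of the scalar rotate and funnel-shift patterns involved; the patch deals with their vector forms, and `rotl32`/`fshl32` are just illustrative names, not code from the patch:
```
#include <stdint.h>

// Rotate left: the pattern LowerRotate already handles for scalars.
uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}

// Funnel shift left: same shape, but the two halves come from
// different values (llvm.fshl semantics for n in [0, 31]).
uint32_t fshl32(uint32_t hi, uint32_t lo, unsigned n) {
  n &= 31;
  return n ? (hi << n) | (lo >> (32 - n)) : hi;
}
```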
1. Fix CombinerHelper::matchBitfieldExtractFromAnd to check legality
with the correct types for the G_UBFX that it builds.
2. Fix AMDGPUTargetLowering::isConstantUnsignedBitfieldExtractLegal to
match the legality rules: result and first operand can be s32 or s64
but the "shift amount" operands are always s32.
3. Add AMDGPU tests where the post-legalizer combiner would create
illegal MIR without the above fixes.
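For illustration, the shift-and-mask pattern that matchBitfieldExtractFromAnd folds into G_UBFX looks roughly like this in C (a hypothetical example; the position/width constants are arbitrary):
```
#include <stdint.h>

// Extract an 8-bit field starting at bit 4: (x >> 4) & 0xff.
// At the MIR level this is roughly G_AND (G_LSHR x, 4), 0xff, which the
// combiner rewrites to a G_UBFX when that is legal for the target.
uint32_t extract_field(uint32_t x) {
  return (x >> 4) & 0xff;
}
```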
Differential Revision: https://reviews.llvm.org/D116802
By default we return the width of an LMUL=1 register. We can enable
testing with larger LMUL values by returning a larger bit width.
This patch adds a RISCV-specific option to provide an LMUL which will be
multiplied by the LMUL=1 bit width.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D116339
getMinVectorRegisterBitWidth indicates which vector types are supported by
this target; RISC-V actually supports all fixed-length vector types with a
vector length less than `getMinRVVVectorSizeInBits`, so set it to 16,
i.e. 2 x i8, which is the minimal fixed-length vector size in theory.
That also fixes an issue where some testcases might become non-vectorizable
when `-riscv-v-vector-bits-min` is set to a larger value, because the vector
size is smaller than `-riscv-v-vector-bits-min`.
For example, the following code can be vectorized by SLP with
`-riscv-v-vector-bits-min=128` or `-riscv-v-vector-bits-min=256`, but
can't be vectorized with `-riscv-v-vector-bits-min=512` or larger:
```
void foo(double *da) {
  da[0] = 0;
  da[1] = 1;
  da[2] = 2;
  da[3] = 3;
}
```
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D116534
select (X != 0), -1, Y --> 0 - X; or (sbb), Y
select (X != 0), Y, -1 --> X - 1; or (sbb), Y
We already had these x86 carry-flag transforms, but one was over-specified to
handle a "0" select arm only. That's just a special-case of the more general
pattern (the 'or' will be deleted if Y is zero).
This is part of solving #53006, but it misses that example because some other
combine has already converted that exact pattern into math ops.
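As a concrete (hypothetical) C example of the first form, the following select is the shape these transforms target; it is not the exact #53006 pattern, which gets converted to math ops earlier:
```
// x != 0 ? -1 : y  -->  negate x to set the carry flag, materialize
// all-ones with sbb, then or in y.
int select_allones_or_y(unsigned x, int y) {
  return x != 0 ? -1 : y;
}
```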
Differential Revision: https://reviews.llvm.org/D116765
Rename argument from 'Fatal' => 'ReportErrors'. HexagonShuffler refers to
this arg as 'ReportErrors' and calling it 'Fatal' in HexagonMCShuffler is
misleading and inconsistent.
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.
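For reference, these intrinsics typically come from C code like the following (a minimal example using the overflow builtins, not taken from the patch):
```
#include <stdbool.h>

// Lowers to llvm.sadd.with.overflow.i32 / llvm.ssub.with.overflow.i32.
bool checked_add(int a, int b, int *sum) {
  return __builtin_add_overflow(a, b, sum);
}

bool checked_sub(int a, int b, int *diff) {
  return __builtin_sub_overflow(a, b, diff);
}
```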
Differential Revision: https://reviews.llvm.org/D116734
For the C code below, we can use VNNI to combine the mul and add operations.
```
int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c,
                  int n) {
  int i;
  for (i = 0; i < 32; i++) {
    c += ((int)a[i] * (int)b[i]);
  }
  return c;
}
```
We don't support the combine across basic blocks in this patch.
Differential Revision: https://reviews.llvm.org/D116039
This change adds the patterns and divergence predicates for the ctpop (bitcount) nodes
so that they are selected according to divergence.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D116284
This patch is based on https://reviews.llvm.org/D99585.
While inside the stack probing loop, temporarily change the CFA
to be based on r11/eax, which are already used to hold the loop bound.
The stack pointer cannot be used for CFI here as it changes during the loop,
so it does not have a constant offset to the CFA.
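For context, the probing loop in question is what a large stack allocation turns into under stack-clash protection; a minimal example (assuming -fstack-clash-protection, where allocations spanning several pages get an inline probe loop):
```
void consume(char *buf);

// With -fstack-clash-protection, allocating this frame emits an inline
// loop that touches the stack one page at a time; inside that loop the
// stack pointer keeps moving, which is why the CFA is temporarily
// rebased on a register instead.
void big_frame(void) {
  char buf[1 << 20]; /* 1 MiB of locals */
  consume(buf);
}
```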
Co-authored-by: YangKeao <keao.yang@yahoo.com>
Reviewed By: nagisa
Differential Revision: https://reviews.llvm.org/D116628
There are several duplicated lines for generating GPRXXX's
register list that can be eliminated by using the `sub` operator.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116729
Previously, compounding was all-or-nothing. Now the
compounding attempts iterate and yield the most
compounds that still result in a valid packet.
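A rough sketch of the iteration strategy (hypothetical names; HexagonShuffler's real interfaces differ):
```
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for "does the packet still shuffle into something valid
 * when the first n candidate compounds are applied?" */
bool packet_valid_with_compounds(size_t n);

/* Instead of accepting compounding only when every candidate pair can
 * be fused (all-or-nothing), walk downwards and keep the largest number
 * of compounds that still yields a valid packet. */
size_t best_compound_count(size_t max_compounds) {
  for (size_t n = max_compounds; n > 0; --n) {
    if (packet_valid_with_compounds(n))
      return n;
  }
  return 0; /* fall back to the uncompounded packet */
}
```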
This adds some AArch64 specific smul_with_overflow and umul_with_overflow
costs, overriding the default costs. The code generation for these mul
with overflow intrinsics is usually better than the default expansion on
AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various
types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.
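For reference, these intrinsics commonly come from the overflow builtins in C; a minimal example (not from the patch):
```
#include <stdbool.h>

// Lowers to llvm.smul.with.overflow.i32 / llvm.umul.with.overflow.i32.
bool checked_smul(int a, int b, int *prod) {
  return __builtin_mul_overflow(a, b, prod);
}

bool checked_umul(unsigned a, unsigned b, unsigned *prod) {
  return __builtin_mul_overflow(a, b, prod);
}
```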
Differential Revision: https://reviews.llvm.org/D116732